Managing backups across multiple servers can quickly become complex—especially when each machine uses its own filesystem mounts, directory structures, and database dumps. In our infrastructure, several issues appeared simultaneously:
Identifying the Problems
- Backups were not monitored centrally — each server had its own backup process, but no single node verified the freshness or size of remote backup sets.
- SSH authentication was inconsistent — password prompts blocked automated monitoring jobs.
- CIFS / storage mounts could silently drop — causing backup directories to appear empty or unreadable.
- Daily and weekly database snapshots needed verification — both freshness (time) and size (GB) had to be tested safely.
- Large media paths caused `du` to hang — making local disk traversal unsafe without hard timeouts.
The Solution: A Central Multi-Host Python Health Check
A fully custom Python 2.7–compatible monitoring script was built to:
- Check local and remote mountpoints
- Verify existence and readability of backup directories
- Locate the newest `YYYY-MM-DD-daily` and `YYYY-MM-DD-weekly` folders (see the sketch after this list)
- Validate timestamp freshness (≤ 36h for daily, ≤ 8d for weekly)
- Check database backup sizes safely (≥ 10G)
- Avoid traversing large media directories (skipped by design)
- Apply universal command timeouts to prevent hangs
- Operate fully non-interactively via key-based SSH
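The freshness and size checks are the heart of the script. The actual implementation isn't reproduced here, but a minimal sketch shows how such a `newest_named` check can work over SSH (function and variable names are illustrative, and the hard `timeout` wrapping is explained further below):

import re
import subprocess
import time

def newest_named_check(target, base, pattern, max_age_hours, min_size_gb):
    """Find the newest folder under <base> matching <pattern>, verify age and size."""
    def remote(command):
        # Every remote call is wrapped in coreutils 'timeout' so a frozen
        # mount or unreachable host fails instead of hanging.
        full = "timeout -k 5s 25s ssh {0} \"{1}\"".format(target, command)
        return subprocess.check_output(full, shell=True).decode()

    names = [n for n in remote("ls -1 {0}".format(base)).splitlines() if re.match(pattern, n)]
    if not names:
        return False, "no folder matching {0} under {1}".format(pattern, base)
    newest = "{0}/{1}".format(base, sorted(names)[-1])  # YYYY-MM-DD sorts lexicographically

    # Freshness: compare the folder's mtime (epoch seconds) with the allowed window.
    age_h = (time.time() - int(remote("stat -c %Y {0}".format(newest)))) / 3600.0
    if age_h > max_age_hours:
        return False, "{0} is {1:.1f}h old (limit {2}h)".format(newest, age_h, max_age_hours)

    # Size: 'du -sBG' reports the folder size in whole gigabytes.
    size_gb = int(remote("du -sBG {0}".format(newest)).split()[0].rstrip("G"))
    if size_gb < min_size_gb:
        return False, "{0} is only {1}G (minimum {2}G)".format(newest, size_gb, min_size_gb)
    return True, "{0}: age {1:.1f}h ; size {2}G".format(newest, age_h, size_gb)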
Hardening SSH for Non-Interactive Monitoring
To avoid password prompts and ensure reliability, a dedicated SSH key was generated specifically for the monitoring script:
ssh-keygen -t ed25519 -f ~/.ssh/osg_backup_check -C "backup-monitor@$(hostname)" -N ""
ssh-copy-id -i ~/.ssh/osg_backup_check.pub root@<remote-host>
ssh-keyscan -H <remote-host> >> ~/.ssh/known_hosts
A matching ~/.ssh/config entry ensures that every remote host uses the correct key automatically:
Host node-118
    HostName 138.201.8.118
    User root
    IdentityFile ~/.ssh/osg_backup_check
    IdentitiesOnly yes
    BatchMode yes
With this setup, SSH connections work without prompts:
ssh node-118 'echo OK'
No passwords. No blocking. Fully script-safe.
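The same non-interactive behaviour carries over when the connection is driven from Python. A minimal connectivity probe might look like this (the helper name `ssh_ok` is illustrative; `BatchMode` and `ConnectTimeout` are passed explicitly as belt-and-braces, even though the `~/.ssh/config` entry above already sets `BatchMode`):

import subprocess

def ssh_ok(host):
    # Fail fast instead of waiting on a password prompt or a dead host.
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "echo OK"]
    try:
        return subprocess.check_output(cmd).strip() == b"OK"
    except (subprocess.CalledProcessError, OSError):
        return False

print(ssh_ok("node-118"))  # True when key-based login works without interaction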
Adding Multiple Remote Hosts
Each remote server (e.g., node-118, node-159, node-144) is added to the monitoring script via a small configuration block specifying:
- SSH target
- Mountpoint to verify
- Backup directory path
- Checks for daily and weekly backup sets
{
"name": "node-118",
"target": "root@138.201.8.118",
"mounts": ["/mnt/backupbox"],
"dir_checks": ["/mnt/backupbox/database"],
"checks": [
{ "type": "newest_named",
"base": "/mnt/backupbox/database",
"regex": "^\\d{4}-\\d{2}-\\d{2}-daily$",
"max_age_hours": 36.0,
"min_size_gb": 10 },
{ "type": "newest_named",
"base": "/mnt/backupbox/database",
"regex": "^\\d{4}-\\d{2}-\\d{2}-weekly$",
"max_age_days": 8.0,
"min_size_gb": 10 }
]
}
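How the script consumes these blocks internally isn't shown here, but a minimal loader sketch conveys the idea (the file name `hosts.json`, the helper `remote`, and the exact output format are illustrative assumptions; `newest_named` checks would be dispatched to logic like the earlier sketch):

import json
import subprocess

def remote(target, command):
    # Key-based SSH, hard-capped by coreutils 'timeout' (see the next section).
    return subprocess.call("timeout -k 5s 25s ssh {0} '{1}'".format(target, command), shell=True) == 0

def run_host(host):
    print("[{0}] {1}".format(host["name"], host["target"]))
    for mnt in host.get("mounts", []):
        print("MOUNT :: {0} :: {1}".format("OK" if remote(host["target"], "mountpoint -q " + mnt) else "FAIL", mnt))
    for path in host.get("dir_checks", []):
        print("DIR :: {0} :: {1}".format("OK" if remote(host["target"], "test -r " + path) else "FAIL", path))
    # host["checks"] entries of type "newest_named" would be handled as in the earlier sketch.

with open("hosts.json") as fh:          # a JSON list of blocks like the one above
    for host in json.load(fh):
        run_host(host)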
Introducing Fail-Safe Execution with Timeouts
Certain operations, such as `du`, `find`, or reads on a degraded mount, can hang indefinitely.
To prevent this, every command is wrapped with the Linux `timeout` utility:
timeout -k 5s 25s ssh root@host 'find ...'
If a mount is frozen or a remote host is unreachable, the command fails cleanly rather than blocking the monitoring job.
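Since Python 2.7's `subprocess` module has no built-in timeout parameter, one straightforward approach is to prepend the coreutils `timeout` command to every shell invocation. A sketch of such a wrapper (the helper name `run_guarded` is illustrative, not the script's actual API):

import subprocess

def run_guarded(cmd, soft_secs=25, kill_grace=5):
    # SIGTERM after <soft_secs> seconds, SIGKILL <kill_grace> seconds later if still running.
    guarded = "timeout -k {0}s {1}s {2}".format(kill_grace, soft_secs, cmd)
    try:
        return 0, subprocess.check_output(guarded, shell=True)
    except subprocess.CalledProcessError as exc:
        # coreutils 'timeout' exits with status 124 when the time limit was hit.
        return exc.returncode, exc.output

status, _ = run_guarded("ssh root@138.201.8.118 'find /mnt/backupbox/database -maxdepth 1 -type d'")
print("hung or failed" if status else "ok")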
Typical Output
[node-118] root@138.201.8.118
MOUNT  :: OK :: /mnt/backupbox
DIR    :: OK :: /mnt/backupbox/database
BACKUP :: OK :: daily  age 18.2h ≤ 36h ; size 32G
BACKUP :: OK :: weekly age 1.9d ≤ 8d ; size 32G
Final Result
With the multi-host backup health check in place, all backup servers are now monitored from a central point. Failures are detected instantly, mounts are validated safely, and remote servers are accessed with zero interaction. The entire process runs predictably, even under adverse network or filesystem conditions.
This solution is now part of the internal monitoring toolkit — easy to extend, reliable under load, and fully automated.
