Managing backups across multiple servers can quickly become complex—especially when each machine uses its own filesystem mounts, directory structures, and database dumps. In our infrastructure, several issues appeared simultaneously:
Identifying the Problems
- Backups were not monitored centrally — each server had its own backup process, but no single node verified the freshness or size of remote backup sets.
- SSH authentication was inconsistent — password prompts blocked automated monitoring jobs.
- CIFS / storage mounts could silently drop — causing backup directories to appear empty or unreadable.
- Daily and weekly database snapshots needed verification — both freshness (time) and size (GB) had to be tested safely.
- Large media paths caused `du` to hang — making local disk traversal unsafe without hard timeouts.
The Solution: A Central Multi-Host Python Health Check
A fully custom Python 2.7–compatible monitoring script was built to:
- Check local and remote mountpoints
- Verify existence and readability of backup directories
- Locate the newest `YYYY-MM-DD-daily` and `YYYY-MM-DD-weekly` folders (see the sketch after this list)
- Validate timestamp freshness (≤ 36h for daily, ≤ 8d for weekly)
- Check database backup sizes safely (≥ 10G)
- Avoid traversing large media directories (skipped by design)
- Apply universal command timeouts to prevent hangs
- Operate fully non-interactively via key-based SSH
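The freshness and size checks are the heart of the script. The actual implementation isn't reproduced here, but a minimal sketch shows how such a `newest_named` check can work over SSH (function and variable names are illustrative, and the hard `timeout` wrapping is explained further below):

import re
import subprocess
import time

def newest_named_check(target, base, pattern, max_age_hours, min_size_gb):
    """Find the newest folder under <base> matching <pattern>, verify age and size."""
    def remote(command):
        # Every remote call is wrapped in coreutils 'timeout' so a frozen
        # mount or unreachable host fails instead of hanging.
        full = "timeout -k 5s 25s ssh {0} \"{1}\"".format(target, command)
        return subprocess.check_output(full, shell=True).decode()

    names = [n for n in remote("ls -1 {0}".format(base)).splitlines() if re.match(pattern, n)]
    if not names:
        return False, "no folder matching {0} under {1}".format(pattern, base)
    newest = "{0}/{1}".format(base, sorted(names)[-1])  # YYYY-MM-DD sorts lexicographically

    # Freshness: compare the folder's mtime (epoch seconds) with the allowed window.
    age_h = (time.time() - int(remote("stat -c %Y {0}".format(newest)))) / 3600.0
    if age_h > max_age_hours:
        return False, "{0} is {1:.1f}h old (limit {2}h)".format(newest, age_h, max_age_hours)

    # Size: 'du -sBG' reports the folder size in whole gigabytes.
    size_gb = int(remote("du -sBG {0}".format(newest)).split()[0].rstrip("G"))
    if size_gb < min_size_gb:
        return False, "{0} is only {1}G (minimum {2}G)".format(newest, size_gb, min_size_gb)
    return True, "{0}: age {1:.1f}h ; size {2}G".format(newest, age_h, size_gb)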
Hardening SSH for Non-Interactive Monitoring
To avoid password prompts and ensure reliability, a dedicated SSH key was generated specifically for the monitoring script:
ssh-keygen -t ed25519 -f ~/.ssh/osg_backup_check -C "backup-monitor@$(hostname)" -N ""
ssh-copy-id -i ~/.ssh/osg_backup_check.pub root@<remote-host>
ssh-keyscan -H <remote-host> >> ~/.ssh/known_hosts
A matching ~/.ssh/config entry ensures that every remote host uses the correct key automatically:
Host node-118
    HostName 138.201.8.118
    User root
    IdentityFile ~/.ssh/osg_backup_check
    IdentitiesOnly yes
    BatchMode yes
With this setup, SSH connections work without prompts:
ssh node-118 'echo OK'
No passwords. No blocking. Fully script-safe.
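The same non-interactive behaviour carries over when the connection is driven from Python. A minimal connectivity probe might look like this (the helper name `ssh_ok` is illustrative; `BatchMode` and `ConnectTimeout` are passed explicitly as belt-and-braces, even though the `~/.ssh/config` entry above already sets `BatchMode`):

import subprocess

def ssh_ok(host):
    # Fail fast instead of waiting on a password prompt or a dead host.
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "echo OK"]
    try:
        return subprocess.check_output(cmd).strip() == b"OK"
    except (subprocess.CalledProcessError, OSError):
        return False

print(ssh_ok("node-118"))  # True when key-based login works without interaction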
Adding Multiple Remote Hosts
Each remote server (e.g., node-118, node-159, node-144) is added to the monitoring script via a small configuration block specifying:
- SSH target
- Mountpoint to verify
- Backup directory path
- Checks for daily and weekly backup sets
{
"name": "node-118",
"target": "root@138.201.8.118",
"mounts": ["/mnt/backupbox"],
"dir_checks": ["/mnt/backupbox/database"],
"checks": [
{ "type": "newest_named",
"base": "/mnt/backupbox/database",
"regex": "^\\d{4}-\\d{2}-\\d{2}-daily$",
"max_age_hours": 36.0,
"min_size_gb": 10 },
{ "type": "newest_named",
"base": "/mnt/backupbox/database",
"regex": "^\\d{4}-\\d{2}-\\d{2}-weekly$",
"max_age_days": 8.0,
"min_size_gb": 10 }
]
}
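How the script consumes these blocks internally isn't shown here, but a minimal loader sketch conveys the idea (the file name `hosts.json`, the helper `remote`, and the exact output format are illustrative assumptions; `newest_named` checks would be dispatched to logic like the earlier sketch):

import json
import subprocess

def remote(target, command):
    # Key-based SSH, hard-capped by coreutils 'timeout' (see the next section).
    return subprocess.call("timeout -k 5s 25s ssh {0} '{1}'".format(target, command), shell=True) == 0

def run_host(host):
    print("[{0}] {1}".format(host["name"], host["target"]))
    for mnt in host.get("mounts", []):
        print("MOUNT :: {0} :: {1}".format("OK" if remote(host["target"], "mountpoint -q " + mnt) else "FAIL", mnt))
    for path in host.get("dir_checks", []):
        print("DIR :: {0} :: {1}".format("OK" if remote(host["target"], "test -r " + path) else "FAIL", path))
    # host["checks"] entries of type "newest_named" would be handled as in the earlier sketch.

with open("hosts.json") as fh:          # a JSON list of blocks like the one above
    for host in json.load(fh):
        run_host(host)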
Introducing Fail-Safe Execution with Timeouts
Certain operations, such as `du`, `find`, or reads on a degraded mount, can hang indefinitely.
To prevent this, every command is wrapped with the Linux `timeout` utility:
timeout -k 5s 25s ssh root@host 'find ...'
If a mount is frozen or a remote host is unreachable, the command fails cleanly rather than blocking the monitoring job.
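Since Python 2.7's `subprocess` module has no built-in timeout parameter, one straightforward approach is to prepend the coreutils `timeout` command to every shell invocation. A sketch of such a wrapper (the helper name `run_guarded` is illustrative, not the script's actual API):

import subprocess

def run_guarded(cmd, soft_secs=25, kill_grace=5):
    # SIGTERM after <soft_secs> seconds, SIGKILL <kill_grace> seconds later if still running.
    guarded = "timeout -k {0}s {1}s {2}".format(kill_grace, soft_secs, cmd)
    try:
        return 0, subprocess.check_output(guarded, shell=True)
    except subprocess.CalledProcessError as exc:
        # coreutils 'timeout' exits with status 124 when the time limit was hit.
        return exc.returncode, exc.output

status, _ = run_guarded("ssh root@138.201.8.118 'find /mnt/backupbox/database -maxdepth 1 -type d'")
print("hung or failed" if status else "ok")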
Typical Output
[node-118] root@138.201.8.118
MOUNT  :: OK :: /mnt/backupbox
DIR    :: OK :: /mnt/backupbox/database
BACKUP :: OK :: daily  age 18.2h ≤ 36h ; size 32G
BACKUP :: OK :: weekly age 1.9d ≤ 8d ; size 32G
Final Result
With the multi-host backup health check in place, all backup servers are now monitored from a central point. Failures are detected instantly, mounts are validated safely, and remote servers are accessed with zero interaction. The entire process runs predictably, even under adverse network or filesystem conditions.
This solution is now part of the internal monitoring toolkit — easy to extend, reliable under load, and fully automated.
