Designing a Simple Backup Health Check Script (Example Only)

This post outlines a practical approach to verify that your backups are mounted, recent, and complete. It uses generic examples only—no real servers, paths, or credentials.

Objectives

Verify required backup mounts are present (e.g., CIFS/NFS/local volumes).
Check local disk usage against thresholds.
Confirm recent presence of database and filesystem backups.
Optionally query a remote backup host over SSH for validations.
Return a clear pass/fail summary and a non-zero exit code on failure (for monitoring/cron).

What the Script Checks

Mounts: Ensure all expected mount points exist and are mounted.
Disk Space: Parse df -h and alert if thresholds are exceeded (e.g., 85% warn, 95% critical).
Local Backups: Scan designated directories for recent files (e.g., .sql.gz or .tar.gz within the last 24–30 hours).
Remote Checks (optional): Run limited SSH commands against a backup server to confirm recent artifacts.
Daily Uploads (optional): Confirm at least one file exists for the current date (e.g., YYYY/MM/DD/…).

Example Configuration (Placeholders)

# Example-only configuration (no real paths or hosts)
REQUIRED_MOUNTS = [
  "/mnt/backup",             # example mount point
]

LOCAL_BACKUPS = [
  {"path": "/backups/db",    "exts": [".sql.gz", ".dump.gz"], "max_age_h": 26, "label": "Local DB"},
  {"path": "/backups/files", "exts": [".tar.gz", ".zip"],      "max_age_h": 26, "label": "Local FS"}
]

REMOTE = {
  "host": "backup.example.com",
  "user": "backupuser",
  "ssh_key": "/home/user/.ssh/id_rsa",  # example
  "paths": [
    {"path": "/remote/db",    "exts": [".sql.gz"], "max_age_h": 30, "label": "Remote DB"},
    {"path": "/remote/files", "exts": [".tar.gz"], "max_age_h": 30, "label": "Remote FS"}
  ]
}

DISK_WARN = 85
DISK_CRIT = 95

Example Script Skeleton (Python)

#!/usr/bin/env python
# Example only — replace with your own logic and error handling.

import os, sys, time, subprocess

def check_mounts(mounts):
    failures = []
    with open("/proc/mounts") as f:
        mounted_text = f.read()
    for mp in mounts:
        if (" " + mp + " ") not in mounted_text:
            failures.append("Missing mount: %s" % mp)
    return failures

def disk_report(warn=85, crit=95):
    p = subprocess.Popen(["df", "-hPT"], stdout=subprocess.PIPE)
    out = p.communicate()[0].decode("utf-8", "ignore").splitlines()
    alerts = []
    for line in out[1:]:
        parts = line.split()
        if len(parts) < 7: 
            continue
        usep = parts[5].rstrip("%")
        mount = parts[-1]
        try:
            pct = int(usep)
        except:
            continue
        if pct >= crit:
            alerts.append("CRIT: %s at %s%%" % (mount, pct))
        elif pct >= warn:
            alerts.append("WARN: %s at %s%%" % (mount, pct))
    return alerts

def newest_file_age_hours(root, exts):
    newest = None
    newest_mtime = 0.0
    if not os.path.isdir(root):
        return None
    for r, d, files in os.walk(root):
        for fn in files:
            if exts and not any(fn.lower().endswith(e) for e in exts):
                continue
            fp = os.path.join(r, fn)
            try:
                st = os.stat(fp)
            except OSError:
                continue
            if st.st_mtime > newest_mtime:
                newest_mtime = st.st_mtime
                newest = fp
    if newest is None:
        return None
    return (time.time() - newest_mtime) / 3600.0

def check_local(backups):
    failures = []
    for b in backups:
        age = newest_file_age_hours(b["path"], b["exts"])
        if age is None:
            failures.append("%s: no files found in %s" % (b["label"], b["path"]))
        elif age > b["max_age_h"]:
            failures.append("%s: newest %.1fh > %dh" % (b["label"], age, b["max_age_h"]))
    return failures

def main():
    errors = []
    warnings = []

    # 1) mounts
    errors += check_mounts(REQUIRED_MOUNTS)

    # 2) disk space
    warnings += disk_report(DISK_WARN, DISK_CRIT)

    # 3) local backups
    errors += check_local(LOCAL_BACKUPS)

    # 4) (optional) add remote SSH checks here if needed

    # Summary and exit code
    print("Backup Health Check Summary")
    if warnings:
        print("Warnings:")
        for w in warnings: print(" - " + w)
    if errors:
        print("Errors:")
        for e in errors: print(" - " + e)
        sys.exit(2)
    print("All checks passed.")
    sys.exit(0)

if __name__ == "__main__":
    main()

Note: The example uses a full directory walk for simplicity. In production, prefer shallow scans, date-scoped paths (e.g., YYYY/MM/DD), or server-side queries to keep checks fast on network mounts.

Cron Integration (Example)

# Run at minute 12 every hour; log to a file
12 * * * * /usr/bin/python /opt/tools/backup_health_check.py >> /var/log/backup_check.log 2>&1

Sample Output (Example)

Backup Health Check Summary
Warnings:
 - /data at 86%
Errors:
 - Local DB: newest 28.4h > 26h
 - Missing mount: /mnt/backup

Hardening Tips

Use a credentials file for network mounts; never hardcode secrets in scripts.
For network filesystems, consider shallow scans with time boundaries (e.g., check only the current month/day).
Emit machine-readable output (JSON) in addition to human logs for dashboards.
Return non-zero exit codes on failure; integrate with your monitoring stack.
For remote checks, limit SSH commands to small, deterministic queries.

Search This Blog