Designing a Simple Backup Health Check Script (Example Only)

This post outlines a practical approach to verify that your backups are mounted, recent, and complete. It uses generic examples only—no real servers, paths, or credentials.

Objectives

  • Verify required backup mounts are present (e.g., CIFS/NFS/local volumes).
  • Check local disk usage against thresholds.
  • Confirm recent presence of database and filesystem backups.
  • Optionally query a remote backup host over SSH for validations.
  • Return a clear pass/fail summary and a non-zero exit code on failure (for monitoring/cron).

What the Script Checks

  1. Mounts: Ensure all expected mount points exist and are mounted.
  2. Disk Space: Parse df -h and alert if thresholds are exceeded (e.g., 85% warn, 95% critical).
  3. Local Backups: Scan designated directories for recent files (e.g., .sql.gz or .tar.gz within the last 24–30 hours).
  4. Remote Checks (optional): Run limited SSH commands against a backup server to confirm recent artifacts.
  5. Daily Uploads (optional): Confirm at least one file exists for the current date (e.g., YYYY/MM/DD/…).

Example Configuration (Placeholders)

# Example-only configuration (no real paths or hosts)
REQUIRED_MOUNTS = [
  "/mnt/backup",             # example mount point
]

LOCAL_BACKUPS = [
  {"path": "/backups/db",    "exts": [".sql.gz", ".dump.gz"], "max_age_h": 26, "label": "Local DB"},
  {"path": "/backups/files", "exts": [".tar.gz", ".zip"],      "max_age_h": 26, "label": "Local FS"}
]

REMOTE = {
  "host": "backup.example.com",
  "user": "backupuser",
  "ssh_key": "/home/user/.ssh/id_rsa",  # example
  "paths": [
    {"path": "/remote/db",    "exts": [".sql.gz"], "max_age_h": 30, "label": "Remote DB"},
    {"path": "/remote/files", "exts": [".tar.gz"], "max_age_h": 30, "label": "Remote FS"}
  ]
}

DISK_WARN = 85
DISK_CRIT = 95

Example Script Skeleton (Python)

#!/usr/bin/env python
# Example only — replace with your own logic and error handling.

import os, sys, time, subprocess

def check_mounts(mounts):
    failures = []
    with open("/proc/mounts") as f:
        mounted_text = f.read()
    for mp in mounts:
        if (" " + mp + " ") not in mounted_text:
            failures.append("Missing mount: %s" % mp)
    return failures

def disk_report(warn=85, crit=95):
    p = subprocess.Popen(["df", "-hPT"], stdout=subprocess.PIPE)
    out = p.communicate()[0].decode("utf-8", "ignore").splitlines()
    alerts = []
    for line in out[1:]:
        parts = line.split()
        if len(parts) < 7: 
            continue
        usep = parts[5].rstrip("%")
        mount = parts[-1]
        try:
            pct = int(usep)
        except:
            continue
        if pct >= crit:
            alerts.append("CRIT: %s at %s%%" % (mount, pct))
        elif pct >= warn:
            alerts.append("WARN: %s at %s%%" % (mount, pct))
    return alerts

def newest_file_age_hours(root, exts):
    newest = None
    newest_mtime = 0.0
    if not os.path.isdir(root):
        return None
    for r, d, files in os.walk(root):
        for fn in files:
            if exts and not any(fn.lower().endswith(e) for e in exts):
                continue
            fp = os.path.join(r, fn)
            try:
                st = os.stat(fp)
            except OSError:
                continue
            if st.st_mtime > newest_mtime:
                newest_mtime = st.st_mtime
                newest = fp
    if newest is None:
        return None
    return (time.time() - newest_mtime) / 3600.0

def check_local(backups):
    failures = []
    for b in backups:
        age = newest_file_age_hours(b["path"], b["exts"])
        if age is None:
            failures.append("%s: no files found in %s" % (b["label"], b["path"]))
        elif age > b["max_age_h"]:
            failures.append("%s: newest %.1fh > %dh" % (b["label"], age, b["max_age_h"]))
    return failures

def main():
    errors = []
    warnings = []

    # 1) mounts
    errors += check_mounts(REQUIRED_MOUNTS)

    # 2) disk space
    warnings += disk_report(DISK_WARN, DISK_CRIT)

    # 3) local backups
    errors += check_local(LOCAL_BACKUPS)

    # 4) (optional) add remote SSH checks here if needed

    # Summary and exit code
    print("Backup Health Check Summary")
    if warnings:
        print("Warnings:")
        for w in warnings: print(" - " + w)
    if errors:
        print("Errors:")
        for e in errors: print(" - " + e)
        sys.exit(2)
    print("All checks passed.")
    sys.exit(0)

if __name__ == "__main__":
    main()

Note: The example uses a full directory walk for simplicity. In production, prefer shallow scans, date-scoped paths (e.g., YYYY/MM/DD), or server-side queries to keep checks fast on network mounts.

Cron Integration (Example)

# Run at minute 12 every hour; log to a file
12 * * * * /usr/bin/python /opt/tools/backup_health_check.py >> /var/log/backup_check.log 2>&1

Sample Output (Example)

Backup Health Check Summary
Warnings:
 - /data at 86%
Errors:
 - Local DB: newest 28.4h > 26h
 - Missing mount: /mnt/backup

Hardening Tips

  • Use a credentials file for network mounts; never hardcode secrets in scripts.
  • For network filesystems, consider shallow scans with time boundaries (e.g., check only the current month/day).
  • Emit machine-readable output (JSON) in addition to human logs for dashboards.
  • Return non-zero exit codes on failure; integrate with your monitoring stack.
  • For remote checks, limit SSH commands to small, deterministic queries.

This post provided a minimal, example-only blueprint. Adjust paths, thresholds, and checks to your environment, and keep your validation fast and deterministic—especially across network mounts.

Comments