Stabilising Backups & PostgreSQL Forensics — A Full‑Day Incident Log and How‑To
How we freed a critically full root volume, enforced true offsite backups, hardened the CIFS mount, and forensically pinpointed the PostgreSQL tables and columns responsible for a year‑over‑year jump from ~80 GB to ~150 GB per full dump.
1) Context & Triage Goals
Multiple similar projects share a single RAID array. Daily full logical PostgreSQL dumps were being written to a directory on the same filesystem as the live databases, and offsite sync jobs had been disabled. Over roughly a year, the size of full dumps drifted from ~80 GB to ~150 GB. With four daily fulls retained locally, root was nearing the cliff.
- Primary goals for the day: free space now, reinstate true offsite backups, add a mount guard, and perform a forensic size analysis to identify what actually grew.
- Secondary goal: draft a structural redesign for the archive subsystem to prevent re‑growth.
2) Disk Signals & Risk
Root filesystem snapshot
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md2 1.8T 1.5T 191G 89% /
Interpretation: Only ~191 GB free on /. Another 150 GB full dump would leave the system critically low (<10% free), risking DB crashes and failed writes.
tmpfs & boot unaffected
tmpfs: plenty free · /boot: 28% used
The issue was not memory disks or /boot; it was specifically the root array being a single large device also housing backups.
3) Local Dump Footprint (Why Root Was Filling)
du -h /media/backupboxu309313/postgres_backups
148G ./2025-09-28-daily
149G ./2025-09-26-daily
149G ./2025-09-27-daily
148G ./2025-09-29-daily
592G .
The so‑called backup path was actually part of /dev/md2. We had four daily full dumps totalling ~592 GB consuming the same device that runs the databases.
4) Offsite Reality Check (Storage Box)
A Storage Box mount existed for offsite retention; it already contained tartarus file backups and older DB dumps (Dec 2024 → Feb 2025). Capacity was confirmed:
df -h --output=target,size,used,avail,pcent /mnt/storagebox_jobsites_backup | tail -1
/mnt/storagebox_jobsites_backup 4.9T 1.4T 3.6T 27%
Action plan:
- Stop generating local 150 GB fulls on /.
- Move/stream new dumps directly to the Storage Box mount.
- Prune old local fulls after verifying remote copies (see the checksum sketch after this list).
- Implement explicit remote retention (daily/weekly/monthly).
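The "prune after verifying" step can be made concrete with a per-file checksum comparison between a local dump directory and its copy on the Storage Box (a minimal sketch; the directory names are illustrative):
LOCAL="/media/backupboxu309313/postgres_backups/2025-09-29-daily"           # illustrative
REMOTE="/mnt/storagebox_jobsites_backup/postgres_backups/2025-09-29-daily"  # illustrative
( cd "$LOCAL"  && find . -type f -print0 | sort -z | xargs -0 sha256sum ) > /tmp/local.sha256
( cd "$REMOTE" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) > /tmp/remote.sha256
# No diff output means every file matches; only then remove the local copy.
diff -u /tmp/local.sha256 /tmp/remote.sha256 && echo "Checksums match; safe to prune $LOCAL"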
5) CIFS Mount Guard (Ensure We Never Write to /)
If a remote mount drops, writing to the mountpoint directory silently writes to local disk. We added a guard that mounts when needed, verifies fstype and source, and aborts if the mount resolves to the same device as /.
Credentials live in /root/.smbcredentials-storagebox (root-owned, chmod 600) and are referenced with credentials=<file> in the mount options.
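The credentials file uses the standard mount.cifs key=value format (values redacted):
# /root/.smbcredentials-storagebox (owned by root, chmod 600)
username=<storagebox-user>
password=<redacted>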
Mount guard (publication‑safe sample)
#!/usr/bin/env bash
set -Eeuo pipefail
PATH="/usr/sbin:/usr/bin:/sbin:/bin"; LANG=C
MNT="/mnt/storagebox_jobsites_backup"
SRC="//<storagebox-host>/backup"
OPTS="credentials=/root/.smbcredentials-storagebox,vers=3.0,sec=ntlmssp,iocharset=utf8,serverino"
ensure_storagebox_mount() {
  command -v /sbin/mount.cifs &>/dev/null || { echo "cifs-utils missing"; exit 90; }
  mkdir -p "$MNT"
  if mountpoint -q "$MNT"; then
    if findmnt -no SOURCE,FSTYPE "$MNT" | grep -q "^$SRC cifs$"; then
      echo "OK: already CIFS from $SRC"; return 0
    else
      echo "Mountpoint busy with unexpected source; remounting..."; umount -f "$MNT" || true
    fi
  fi
  /sbin/modprobe cifs 2>/dev/null || true
  /sbin/mount.cifs "$SRC" "$MNT" -o "$OPTS" || { dmesg | tail -n 20; exit 92; }
  mountpoint -q "$MNT" && findmnt -no SOURCE,FSTYPE "$MNT" | grep -q "^$SRC cifs$" || exit 92
  ROOT_DEV=$(df -P / | awk 'NR==2{print $1}')
  MNT_DEV=$(df -P "$MNT" | awk 'NR==2{print $1}')
  [ "$ROOT_DEV" != "$MNT_DEV" ] || { echo "Refusing to write to /"; exit 93; }
}
ensure_storagebox_mount
6) Streaming Dumps to Offsite + Retention
To avoid creating 150 GB files on root, we dump and compress directly onto the mounted Storage Box. We also write globals once per run and enforce retention by age and cadence suffix.
Daily streaming dump (publication‑safe sample)
#!/usr/bin/env bash
set -Eeuo pipefail
PATH="/usr/sbin:/usr/bin:/sbin:/bin"
/home/server_backup/scripts/ensure_storagebox_mount.sh
PGHOST="localhost"; PGPORT="5432"; PGUSER="postgres"; export PGPASSWORD="<redacted>"
PGDATABASES="" # empty = all non-template DBs
MNT="/mnt/storagebox_jobsites_backup"
DEST_ROOT="$MNT/server_db_jobsites_backup_database"
# cadence folder: daily/weekly/monthly
DOW=$(date +%u); DOM=$(date +%d)
if [ "$DOM" = "01" ]; then SUFFIX="monthly"
elif [ "$DOW" = "7" ]; then SUFFIX="weekly"
else SUFFIX="daily"; fi
DEST_DIR="$DEST_ROOT/$(date +%F)-$SUFFIX"
mkdir -p "$DEST_DIR"
# DB list
dbs() {
  if [ -z "$PGDATABASES" ]; then
    psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -Atc "SELECT datname FROM pg_database WHERE datallowconn AND datistemplate = false;"
  else
    echo "$PGDATABASES"
  fi
}
# globals once
pg_dumpall -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" --globals-only \
| zstd -T0 -19 > "$DEST_DIR/globals.sql.zst"
# per DB: custom format + zstd; atomic rename
for DB in $(dbs); do
  TMP="$DEST_DIR/${DB}.dump.zst.part"; OUT="$DEST_DIR/${DB}.dump.zst"
  ionice -c2 -n7 nice -n19 \
    pg_dump -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$DB" -F c \
    | zstd -T0 -19 > "$TMP"
  mv "$TMP" "$OUT"
done
# Retention: daily 14d, weekly 8w, monthly 12m
find "$DEST_ROOT" -maxdepth 1 -type d -name '*-daily' -mtime +14 -exec rm -rf {} +
find "$DEST_ROOT" -maxdepth 1 -type d -name '*-weekly' -mtime +56 -exec rm -rf {} +
find "$DEST_ROOT" -maxdepth 1 -type d -name '*-monthly' -mtime +365 -exec rm -rf {} +
7) Cron Hygiene & Safe Ordering
# Disable legacy local full dump on /
# 0 0 * * * /home/server_backup/scripts/pg_backup_rotated.sh
# New: mount‑guard + offsite streaming at 00:15
15 0 * * * /home/server_backup/scripts/pg_dump_to_mount.sh
# Keep file backups at 20:00 (tartarus), if required
0 20 * * * /home/server_backup/scripts/filesbackup.bash
# Disk space check
0 9 * * * /usr/local/venvs/borg-env/bin/python /home/server_backup/scripts/check_disk_space.py
8) PostgreSQL Forensics (DB → Tables → TOAST)
8.1 Top databases by size
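The per-database figures below were gathered with a catalog query along these lines (a sketch; the exact invocation from the day is not reproduced):
psql -h localhost -U postgres -c "
  SELECT datname,
         pg_size_pretty(pg_database_size(datname)) AS reported,
         pg_database_size(datname)                 AS bytes
  FROM pg_database
  WHERE datistemplate = false
  ORDER BY pg_database_size(datname) DESC;"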
Database | Reported | Bytes |
---|---|---|
nl_nieuwejobs | 464 GB | 498,545,406,623 |
careersinfinances_nl | 217 GB | 233,252,229,791 |
nieuwejobs | 84 GB | 90,018,370,207 |
jobinhoreca | 20 GB | 21,532,537,503 |
techtopjobs | 13 GB | 14,167,315,103 |
careersinfinances | 11 GB | 11,895,722,655 |
salestopjobs | 9.7 GB | 10,171,896,479 |
jobfor | 1.0 GB | 1,093,120,671 |
… (others) | ≤ 200 MB | … |
8.2 Largest tables (heap/index/TOAST breakdown)
The archive schema dominates the growth. Below are selected highlights (sizes as reported today):
Schema.Table | Total | Heap | Index | TOAST |
---|---|---|---|---|
public.job_jobarchive | 415 GB | 83 GB | 54 GB | 278 GB |
public.job_jobarchive_locations | 12 GB | 2.8 GB | 9.5 GB | 0 |
public.job_jobarchive_categories | 12 GB | 2.9 GB | 9.3 GB | 0 |
public.job_job | 9.7 GB | 0.7 GB | 6.6 GB | 2.5 GB |
public.mailing_message | 3.2 GB | 0.2 GB | 0.014 GB | 3.0 GB |
public.icrm_cv | 3.1 GB | 1.0 GB | 1.0 GB | 1.0 GB |
public.mailing_queue | 1.7 GB | 0.024 GB | 0.069 GB | 1.6 GB |
public.job_job_locations | 0.42 GB | 0.022 GB | 0.406 GB | 0 |
public.job_job_categories | 0.41 GB | 0.022 GB | 0.382 GB | 0 |
… | (additional mailing and CRM tables in the 40 MB → 1 GB range) |
Key finding: The explosive growth is concentrated in job_jobarchive, and specifically in its TOAST (large text/blob) columns. Indexes on the archive helper tables are also disproportionately large.
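For reference, a per-table heap/index/TOAST breakdown like the one above can be produced with the standard size functions, run inside the affected database (a sketch; <database> is a placeholder):
psql -h localhost -U postgres -d <database> -c "
  SELECT n.nspname || '.' || c.relname                                        AS relation,
         pg_size_pretty(pg_total_relation_size(c.oid))                        AS total,
         pg_size_pretty(pg_relation_size(c.oid))                              AS heap,
         pg_size_pretty(pg_indexes_size(c.oid))                               AS indexes,
         pg_size_pretty(COALESCE(pg_total_relation_size(c.reltoastrelid), 0)) AS toast
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'r' AND n.nspname = 'public'
  ORDER BY pg_total_relation_size(c.oid) DESC
  LIMIT 20;"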
9) Quick Wins Today
- Drop unused archive indexes (guided by pg_stat_user_indexes.idx_scan), one by one, off‑peak. This can recover many GB immediately; see the sketch after this list.
- Switch TOAST compression to lz4 (PG 14+), then batch re‑toast recent months by writing columns to themselves in manageable chunks.
- Archive retention: export full months to the Storage Box as compressed CSV (or Parquet) and delete in batches, followed by pg_repack on the source tables.
- Mailing tables: set a TTL for message bodies; purge processed queues regularly; REINDEX CONCURRENTLY or use pg_repack.
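A sketch of these steps, assuming PostgreSQL 14+ for SET COMPRESSION; <database>, <large-text-column>, <created-column>, the id range, and the export path are placeholders, not values from the incident:
# 1) Candidate unused archive indexes (idx_scan counts date from the last stats reset; check that before trusting zeros).
psql -h localhost -U postgres -d <database> -c "
  SELECT schemaname, relname, indexrelname,
         pg_size_pretty(pg_relation_size(indexrelid)) AS size, idx_scan
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0 AND relname LIKE 'job_jobarchive%'
  ORDER BY pg_relation_size(indexrelid) DESC;"
# 2) Compress new TOAST values with lz4 (existing rows keep their old compression until rewritten).
psql -h localhost -U postgres -d <database> -c \
  "ALTER TABLE public.job_jobarchive ALTER COLUMN <large-text-column> SET COMPRESSION lz4;"
# 3) One re-toast batch: appending an empty string forces the value to be rewritten (and so recompressed)
#    rather than reusing the old TOAST pointer; run in bounded id ranges, off-peak.
psql -h localhost -U postgres -d <database> -c \
  "UPDATE public.job_jobarchive SET <large-text-column> = <large-text-column> || '' WHERE id BETWEEN 1 AND 50000;"
# 4) Archive retention: export one month to the Storage Box as CSV, compress it, then delete the rows in batches and pg_repack.
psql -h localhost -U postgres -d <database> -c \
  "\copy (SELECT * FROM public.job_jobarchive WHERE <created-column> >= '2024-12-01' AND <created-column> < '2025-01-01') TO '/mnt/storagebox_jobsites_backup/archive_exports/job_jobarchive_2024-12.csv' WITH (FORMAT csv, HEADER)"
zstd -T0 -19 --rm /mnt/storagebox_jobsites_backup/archive_exports/job_jobarchive_2024-12.csv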
Credentials, hostnames, and sensitive paths are redacted in this publication. Replace placeholders with environment‑appropriate values.