Stabilising Backups & PostgreSQL Forensics — A Full‑Day Incident Log and How‑To (2025‑09‑29)

How we freed a critically full root volume, enforced true offsite backups, hardened the CIFS mount, and forensically pinpointed the PostgreSQL tables and columns responsible for a year‑over‑year jump from ~80 GB to ~150 GB per full dump.

Timezone: Europe/Brussels · Date: 2025‑09‑29 · Author: OSG Engineering
Contents
  1. Context & Triage Goals
  2. Disk Signals & Risk
  3. Local Dump Footprint (Why Root Was Filling)
  4. Offsite Reality Check (Storage Box)
  5. CIFS Mount Guard (Ensure We Never Write to /)
  6. Streaming Dumps to Offsite + Retention
  7. Cron Hygiene & Safe Ordering
  8. PostgreSQL Forensics (DB → Tables → TOAST)
  9. Quick Wins Today
  10. Structural Redesign: Partitioned Archive + Payload De‑dup (remove unused data)

1) Context & Triage Goals

Multiple similar projects share a single RAID array. Daily full logical PostgreSQL dumps were being written to a directory on the same filesystem as the live databases, and offsite sync jobs had been disabled. Over roughly a year, the size of full dumps drifted from ~80 GB to ~150 GB. With four daily fulls retained locally, root was nearing the cliff.

  • Primary goals for the day: free space now, reinstate true offsite backups, add a mount guard, and perform a forensic size analysis to identify what actually grew.
  • Secondary goal: draft a structural redesign for the archive subsystem to prevent re‑growth.

2) Disk Signals & Risk

Root filesystem snapshot

df -h
Filesystem   Size  Used  Avail Use% Mounted on
/dev/md2     1.8T  1.5T   191G  89%  /

Interpretation: Only ~191 GB free on /. Another 150 GB full dump would leave the system critically low (<10% free), risking DB crashes and failed writes.

tmpfs & boot unaffected

tmpfs: plenty free · /boot: 28% used

The issue was not memory disks or /boot; it was specifically the root array being a single large device also housing backups.

3) Local Dump Footprint (Why Root Was Filling)

cd /media/backupboxu309313/postgres_backups && du -h --max-depth=1 .
148G  ./2025-09-28-daily
149G  ./2025-09-26-daily
149G  ./2025-09-27-daily
148G  ./2025-09-29-daily
592G  .

The so‑called backup path was actually part of /dev/md2. We had four daily full dumps totalling ~592 GB consuming the same device that runs the databases.
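
To confirm that a "backup" path really lives on the root device rather than on a separate mount, a quick check along these lines is enough (the path is the one from the du listing above):

# Print the backing block device and mount point for the path; matching / means the copies are local.
df -P /media/backupboxu309313/postgres_backups | awk 'NR==2{print $1, $6}'
# In our case this printed the root array and its mount point: /dev/md2 /
findmnt -T /media/backupboxu309313/postgres_backups -o SOURCE,TARGET,FSTYPE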

4) Offsite Reality Check (Storage Box)

A Storage Box mount existed for offsite retention; it already contained tartarus file backups and older DB dumps (Dec 2024 → Feb 2025). Capacity was confirmed:

df -h --output=target,size,used,avail,pcent /mnt/storagebox_jobsites_backup | tail -1
/mnt/storagebox_jobsites_backup  4.9T  1.4T  3.6T  27%

Action plan:

  1. Stop generating local 150 GB fulls on /.
  2. Move/stream new dumps directly to the Storage Box mount.
  3. Prune old local fulls after verifying remote copies (see the verification sketch after this list).
  4. Implement explicit remote retention (daily/weekly/monthly).
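
A minimal verification sketch for step 3, assuming one local dump directory and its offsite counterpart (the remote path below follows the layout introduced in section 6 and should be adapted to the real destination):

# Checksum every file on both sides and compare; slow on ~150 GB, but definitive.
LOCAL="/media/backupboxu309313/postgres_backups/2025-09-28-daily"
REMOTE="/mnt/storagebox_jobsites_backup/server_db_jobsites_backup_database/2025-09-28-daily"
( cd "$LOCAL"  && find . -type f -exec sha256sum {} + | sort -k2 ) > /tmp/local.sha
( cd "$REMOTE" && find . -type f -exec sha256sum {} + | sort -k2 ) > /tmp/remote.sha
diff /tmp/local.sha /tmp/remote.sha && echo "remote copies verified; safe to prune local fulls"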

5) CIFS Mount Guard (Ensure We Never Write to /)

If a remote mount drops, writing to the mountpoint directory silently writes to local disk. We added a guard that mounts when needed, verifies fstype and source, and aborts if the mount resolves to the same device as /.

Security note: During the incident we temporarily used inline credentials to accelerate recovery. For production, always move them to a credentials file (e.g. /root/.smbcredentials-storagebox, chmod 600), and reference with credentials=<file>.
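
For reference, the credentials file cifs-utils expects is a few key=value lines; create it with strict permissions before adding real values (placeholders shown):

# Create with strict permissions, then edit in the values:
#   touch /root/.smbcredentials-storagebox && chmod 600 /root/.smbcredentials-storagebox
# File contents:
username=<storagebox-user>
password=<storagebox-password>
# domain=<optional-domain>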

Mount guard (publication‑safe sample)

#!/usr/bin/env bash
set -Eeuo pipefail
PATH="/usr/sbin:/usr/bin:/sbin:/bin"; LANG=C

MNT="/mnt/storagebox_jobsites_backup"
SRC="//<storagebox-host>/backup"
OPTS="credentials=/root/.smbcredentials-storagebox,vers=3.0,sec=ntlmssp,iocharset=utf8,serverino"

ensure_storagebox_mount() {
  # Require cifs-utils and make sure the mountpoint directory exists.
  command -v /sbin/mount.cifs &>/dev/null || { echo "cifs-utils missing"; exit 90; }
  mkdir -p "$MNT"

  # If something is already mounted there, accept it only if it is CIFS from the expected source.
  if mountpoint -q "$MNT"; then
    if findmnt -no SOURCE,FSTYPE "$MNT" | grep -q "^$SRC cifs$"; then
      echo "OK: already CIFS from $SRC"; return 0
    else
      echo "Mountpoint busy with unexpected source; remounting..."; umount -f "$MNT" || true
    fi
  fi

  # Mount, then re-verify source and filesystem type.
  /sbin/modprobe cifs 2>/dev/null || true
  /sbin/mount.cifs "$SRC" "$MNT" -o "$OPTS" || { dmesg | tail -n 20; exit 92; }

  mountpoint -q "$MNT" && findmnt -no SOURCE,FSTYPE "$MNT" | grep -q "^$SRC cifs$" || exit 92

  # Guard: refuse to proceed if the mountpoint still resolves to the same device as /.
  ROOT_DEV=$(df -P /     | awk 'NR==2{print $1}')
  MNT_DEV=$(df -P "$MNT" | awk 'NR==2{print $1}')
  [ "$ROOT_DEV" != "$MNT_DEV" ] || { echo "Refusing to write to /"; exit 93; }
}

ensure_storagebox_mount

6) Streaming Dumps to Offsite + Retention

To avoid creating 150 GB files on root, we dump and compress directly onto the mounted Storage Box. We also write globals once per run and enforce retention by age and cadence suffix.

Daily streaming dump (publication‑safe sample)

#!/usr/bin/env bash
set -Eeuo pipefail
PATH="/usr/sbin:/usr/bin:/sbin:/bin"

/home/server_backup/scripts/ensure_storagebox_mount.sh

PGHOST="localhost"; PGPORT="5432"; PGUSER="postgres"; export PGPASSWORD="<redacted>"
PGDATABASES="" # empty = all non-template DBs

MNT="/mnt/storagebox_jobsites_backup"
DEST_ROOT="$MNT/server_db_jobsites_backup_database"

# cadence folder: daily/weekly/monthly
DOW=$(date +%u); DOM=$(date +%d)
if   [ "$DOM" = "01" ]; then SUFFIX="monthly"
elif [ "$DOW" = "7"  ]; then SUFFIX="weekly"
else                            SUFFIX="daily";  fi
DEST_DIR="$DEST_ROOT/$(date +%F)-$SUFFIX"
mkdir -p "$DEST_DIR"

# DB list
dbs() {
  if [ -z "$PGDATABASES" ]; then
    psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -Atc "SELECT datname FROM pg_database WHERE datallowconn AND datistemplate = false;"
  else echo "$PGDATABASES"; fi
}

# globals once
pg_dumpall -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" --globals-only \
  | zstd -T0 -19 > "$DEST_DIR/globals.sql.zst"

# Per DB: custom format with internal compression disabled (-Z 0, zstd does the work); atomic rename.
for DB in $(dbs); do
  TMP="$DEST_DIR/${DB}.dump.zst.part"; OUT="$DEST_DIR/${DB}.dump.zst"
  ionice -c2 -n7 nice -n19 \
    pg_dump -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$DB" -F c -Z 0 \
      | zstd -T0 -19 > "$TMP"
  mv "$TMP" "$OUT"
done

# Retention: daily 14d, weekly 8w, monthly 12m
find "$DEST_ROOT" -maxdepth 1 -type d -name '*-daily'   -mtime +14  -exec rm -rf {} +
find "$DEST_ROOT" -maxdepth 1 -type d -name '*-weekly'  -mtime +56  -exec rm -rf {} +
find "$DEST_ROOT" -maxdepth 1 -type d -name '*-monthly' -mtime +365 -exec rm -rf {} +

7) Cron Hygiene & Safe Ordering

# Disable legacy local full dump on /
# 0 0 * * * /home/server_backup/scripts/pg_backup_rotated.sh

# New: mount‑guard + offsite streaming at 00:15
15 0 * * * /home/server_backup/scripts/pg_dump_to_mount.sh

# Keep file backups at 20:00 (tartarus), if required
0 20 * * * /home/server_backup/scripts/filesbackup.bash

# Disk space check
0 9 * * *  /usr/local/venvs/borg-env/bin/python /home/server_backup/scripts/check_disk_space.py
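
One further hygiene point worth adding (a suggestion, not something that was in place during the incident): wrap the dump job in flock so an unusually slow night can never overlap with the next scheduled run.

# Same schedule as above; flock -n exits immediately if the previous run still holds the lock.
15 0 * * * /usr/bin/flock -n /var/lock/pg_dump_to_mount.lock /home/server_backup/scripts/pg_dump_to_mount.sh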

8) PostgreSQL Forensics (DB → Tables → TOAST)

8.1 Top databases by size

Database               Reported   Bytes
nl_nieuwejobs          464 GB     498,545,406,623
careersinfinances_nl   217 GB     233,252,229,791
nieuwejobs             84 GB      90,018,370,207
jobinhoreca            20 GB      21,532,537,503
techtopjobs            13 GB      14,167,315,103
careersinfinances      11 GB      11,895,722,655
salestopjobs           9.7 GB     10,171,896,479
jobfor                 1.0 GB     1,093,120,671
… (others)             ≤ 200 MB
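
For reference, the per-database numbers above can be reproduced with a query along these lines (run as a superuser):

psql -U postgres -c "
  SELECT datname,
         pg_size_pretty(pg_database_size(datname)) AS reported,
         pg_database_size(datname)                 AS bytes
  FROM pg_database
  WHERE datistemplate = false
  ORDER BY pg_database_size(datname) DESC;"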

8.2 Largest tables (heap/index/TOAST breakdown)

The archive tables dominate the growth. Below are selected highlights (sizes as reported today):

Schema.Table                       Total    Heap      Index     TOAST
public.job_jobarchive              415 GB   83 GB     54 GB     278 GB
public.job_jobarchive_locations    12 GB    2.8 GB    9.5 GB    0
public.job_jobarchive_categories   12 GB    2.9 GB    9.3 GB    0
public.job_job                     9.7 GB   0.7 GB    6.6 GB    2.5 GB
public.mailing_message             3.2 GB   0.2 GB    0.014 GB  3.0 GB
public.icrm_cv                     3.1 GB   1.0 GB    1.0 GB    1.0 GB
public.mailing_queue               1.7 GB   0.024 GB  0.069 GB  1.6 GB
public.job_job_locations           0.42 GB  0.022 GB  0.406 GB  0
public.job_job_categories          0.41 GB  0.022 GB  0.382 GB  0
(additional mailing and CRM tables in the 40 MB → 1 GB range)
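
The heap/index/TOAST split comes from the standard catalog size functions; a sketch of the query shape, run per database (here against nl_nieuwejobs):

psql -U postgres -d nl_nieuwejobs -c "
  SELECT n.nspname || '.' || c.relname                                                   AS table_name,
         pg_size_pretty(pg_total_relation_size(c.oid))                                   AS total,
         pg_size_pretty(pg_relation_size(c.oid))                                         AS heap,
         pg_size_pretty(pg_indexes_size(c.oid))                                          AS indexes,
         pg_size_pretty(COALESCE(pg_total_relation_size(NULLIF(c.reltoastrelid, 0)), 0)) AS toast
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'r'
    AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_total_relation_size(c.oid) DESC
  LIMIT 15;"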

Key finding: The explosive growth is concentrated in job_jobarchive and specifically its TOAST (large text/blob) columns. Indexes on archive helper tables are also disproportionately large.
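
To attribute the TOAST growth to specific columns, sampling pg_column_size over a fraction of the archive table is usually enough. The column names below (description, raw_payload) are placeholders for whatever wide text/bytea columns exist in the real schema:

# Approximate on-disk bytes per column over a ~1% page sample of the archive table.
psql -U postgres -d nl_nieuwejobs -c "
  SELECT pg_size_pretty(sum(pg_column_size(description))::bigint) AS description_sample,
         pg_size_pretty(sum(pg_column_size(raw_payload))::bigint) AS raw_payload_sample
  FROM public.job_jobarchive TABLESAMPLE SYSTEM (1);"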

9) Quick Wins Today

  1. Drop unused archive indexes (guided by pg_stat_user_indexes.idx_scan), one by one, off‑peak. This can recover many GB immediately; a sketch for items 1 and 2 follows this list.
  2. Switch TOAST compression to lz4 (PG 14+), then batch re‑toast recent months by writing columns to themselves in manageable chunks.
  3. Archive retention: export full months to the Storage Box as compressed CSV (or Parquet) and delete in batches, followed by pg_repack on the source tables.
  4. Mailing tables: set TTL for message bodies; purge processed queues regularly; REINDEX CONCURRENTLY or use pg_repack.
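
A sketch of items 1 and 2, reusing the real nl_nieuwejobs / job_jobarchive names from the forensics above but with a hypothetical index, column, and key (idx_jobarchive_unused, description, id); verify every object against the live schema before touching it:

# 1) Candidate indexes never scanned since the last stats reset, largest first:
psql -U postgres -d nl_nieuwejobs -c "
  SELECT schemaname, relname, indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS size
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY pg_relation_size(indexrelid) DESC;"
# Then, off-peak and one at a time:
#   psql -U postgres -d nl_nieuwejobs -c "DROP INDEX CONCURRENTLY public.idx_jobarchive_unused;"

# 2) PG 14+: new TOAST entries use lz4 from now on; existing rows shrink only once rewritten.
psql -U postgres -d nl_nieuwejobs -c "
  ALTER TABLE public.job_jobarchive ALTER COLUMN description SET COMPRESSION lz4;"
# Re-toast in manageable id ranges; appending an empty string forces a detoast/recompress,
# whereas a bare col = col can simply reuse the existing TOAST pointer. VACUUM between batches.
psql -U postgres -d nl_nieuwejobs -c "
  UPDATE public.job_jobarchive SET description = description || '' WHERE id BETWEEN 1 AND 100000;"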

Credentials, hostnames, and sensitive paths are redacted in this publication. Replace placeholders with environment‑appropriate values.
