LVL 04 — SENIOR-IN-TRAINING · SESSION 115 · DAY 115

BACKUP & RECOVERY

🎫 PIXELCRAFT-101
🔒 DevOps | 🟡 Medium | Priority: 🔴 Critical

A bad migration script dropped the images collection. 50,000 user images — gone. We had no backups. It took 3 days and a heroic MongoDB support ticket to partially recover. Never again. Implement automated nightly backups, test the restore procedure, document the disaster recovery plan.
CONCEPTS.UNLOCKED
💾
Backup Strategies
Full, incremental, differential. Full: copy everything (slow, big, simple). Incremental: only changes since last backup (fast, small, complex restore). Differential: changes since last FULL backup (middle ground).
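The difference is easiest to see with GNU tar, whose `--listed-incremental` snapshot file records what the previous run saw — a toy sketch with illustrative paths, not a production recipe:

```shell
#!/bin/bash
# Full vs. incremental backup with GNU tar (illustrative paths).
set -euo pipefail
mkdir -p data backups restore
echo "v1" > data/a.txt

# FULL: copy everything; the snapshot file records current state
tar -czf backups/full.tar.gz -g backups/state.snar data

sleep 1   # make sure later changes get a visibly newer timestamp

# Later: one file changes, one is added
echo "v2" > data/a.txt
echo "new" > data/b.txt

# INCREMENTAL: only what changed since the snapshot was written
tar -czf backups/inc1.tar.gz -g backups/state.snar data

# Restore = full first, then each incremental in order
# (this is the "complex restore" part of the tradeoff)
tar -xzf backups/full.tar.gz -g /dev/null -C restore
tar -xzf backups/inc1.tar.gz -g /dev/null -C restore
```

A differential scheme would keep reusing the snapshot file taken at the last *full* backup, so each differential archive grows but the restore is always just full + latest differential.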
Point-in-Time Recovery
Restore to any second, not just backup time. MongoDB oplog records every write operation. Replay the oplog to the exact moment before the disaster. "Undo to 14:32:07 yesterday."
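For self-hosted MongoDB the same idea looks roughly like this — a sketch, not a tested recipe: it assumes a replica set (the oplog only exists there), and the timestamp value is illustrative:

```shell
# 1. Full dump that also captures oplog entries written during the dump
mongodump --uri="mongodb://localhost:27017" --oplog --out=/backups/pitr

# 2. Restore the dump, then replay the captured oplog only UP TO the
#    moment before the disaster ("seconds-since-epoch:ordinal")
mongorestore \
  --oplogReplay \
  --oplogLimit="1767197527:0" \
  /backups/pitr
```

Atlas does this for you under "Point in Time" restore; the flags above are what the managed feature is built on.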
🕐
Automated Schedules
Nightly full backup + hourly incremental. Cron jobs or managed service schedules. No human intervention. Backups that depend on someone remembering don't happen.
📋
Disaster Recovery Plan
A written playbook for the worst day. Step-by-step: who does what, in what order, using which tools. Practiced and tested BEFORE the disaster. Not the time to learn when production is down.
⏱️
RTO & RPO
RTO: how fast you recover. RPO: how much data you lose. RTO = Recovery Time Objective (target: restore in 1 hour). RPO = Recovery Point Objective (target: lose ≤ 1 hour of data). These define your backup frequency.
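"These define your backup frequency" is the practical takeaway: worst case, you lose everything written since the last successful backup, so the backup interval must not exceed the RPO. A toy check using the targets from this section:

```shell
#!/bin/bash
# Worst-case data loss ≈ time since the last successful backup,
# so the backup interval must fit inside the RPO.
RPO_TARGET_MIN=60        # business target: lose at most 1 hour
BACKUP_INTERVAL_MIN=60   # hourly incrementals

if [ "$BACKUP_INTERVAL_MIN" -le "$RPO_TARGET_MIN" ]; then
  echo "OK: ${BACKUP_INTERVAL_MIN}m interval meets the ${RPO_TARGET_MIN}m RPO"
else
  echo "FAIL: back up more often or renegotiate the RPO"
fi
```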
Testing Backups
A backup you haven't tested is not a backup. Regularly restore to a test environment. Verify data integrity. Time the restore process. If you can't restore it, it doesn't exist.
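One cheap integrity check that fits into any restore drill: write a checksum manifest at backup time, verify it after restore. A minimal sketch with illustrative file names:

```shell
#!/bin/bash
# Checksum-manifest integrity check (illustrative file names).
set -euo pipefail
mkdir -p data
echo "hello" > data/a.txt
echo "world" > data/b.txt

# At backup time: record a manifest next to the backup
sha256sum data/a.txt data/b.txt > manifest.sha256

# After restoring: every file must match the manifest,
# or the backup is corrupt and the drill has found a real bug
if sha256sum --quiet -c manifest.sha256; then
  echo "integrity verified"
fi
```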
HANDS-ON.TASKS
01
MongoDB Atlas Automated Backups
# MongoDB Atlas (managed):
# Settings → Backup → Enable
#
# Continuous backup with:
# - Snapshots every 6 hours
# - Point-in-time recovery
#   (oplog-based, any second)
# - Retained for 7 days
#
# Restore options:
# - Restore to same cluster
# - Restore to new cluster
# - Download snapshot

# Self-hosted MongoDB backup:
mongodump \
  --uri="mongodb://localhost:27017" \
  --db=pixelcraft \
  --out=/backups/$(date +%Y-%m-%d) \
  --gzip

# Output:
# /backups/2026-02-19/
#   pixelcraft/users.bson.gz
#   pixelcraft/users.metadata.json.gz
#   pixelcraft/images.bson.gz
#   pixelcraft/images.metadata.json.gz
02
Automated Nightly Backup Script
#!/bin/bash
# scripts/backup.sh
set -euo pipefail

DATE=$(date +%Y-%m-%d_%H-%M)
BACKUP_DIR="/backups/${DATE}"
S3_BUCKET="s3://pixelcraft-backups"
RETENTION_DAYS=30

echo "📦 Starting backup: ${DATE}"

# 1. Dump MongoDB
mongodump \
  --uri="${DATABASE_URL}" \
  --out="${BACKUP_DIR}" \
  --gzip

# 2. Backup uploaded images
#    (sync copies only new/changed files)
aws s3 sync \
  s3://pixelcraft-uploads \
  "${BACKUP_DIR}/uploads" \
  --only-show-errors

# 3. Archive and upload backup to S3
tar -czf "${BACKUP_DIR}.tar.gz" "${BACKUP_DIR}"
aws s3 cp "${BACKUP_DIR}.tar.gz" "${S3_BUCKET}/${DATE}.tar.gz"

# 4. Verify upload
aws s3 ls "${S3_BUCKET}/${DATE}.tar.gz"

# 5. Clean up local files + expired S3 backups
rm -rf "${BACKUP_DIR}" "${BACKUP_DIR}.tar.gz"

CUTOFF=$(date -d "-${RETENTION_DAYS} days" +%Y-%m-%d)
aws s3 ls "${S3_BUCKET}/" | while read -r line; do
  FILE_DATE=$(echo "$line" | awk '{print $1}')
  if [[ "${FILE_DATE}" < "${CUTOFF}" ]]; then
    FILE=$(echo "$line" | awk '{print $4}')
    aws s3 rm "${S3_BUCKET}/${FILE}"
  fi
done

echo "✅ Backup complete: ${DATE}"
03
Schedule with Cron / GitHub Actions
# Option A: Cron (on a server)
# Nightly at 3 AM UTC
0 3 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1

# Option B: GitHub Actions
# .github/workflows/backup.yml
name: Nightly Backup

on:
  schedule:
    - cron: '0 3 * * *'   # 3 AM UTC
  workflow_dispatch:       # manual trigger

jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run backup
        env:
          DATABASE_URL: ${{ secrets.PROD_DB_URL }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET }}
        run: bash scripts/backup.sh

      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -d '{"text":"🔴 Backup FAILED"}'
04
Test Restore Procedure
#!/bin/bash
# scripts/restore-test.sh
set -euo pipefail

# Backup name as produced by backup.sh, e.g. 2026-02-19_03-00
BACKUP_DATE=${1:?usage: restore-test.sh <backup-name>}
TEST_DB="pixelcraft-restore-test"
S3_BUCKET="s3://pixelcraft-backups"

echo "🔄 Testing restore: ${BACKUP_DATE}"
START=$(date +%s)

# 1. Download backup
aws s3 cp "${S3_BUCKET}/${BACKUP_DATE}.tar.gz" /tmp/restore.tar.gz
tar -xzf /tmp/restore.tar.gz -C /tmp/

# 2. Restore to a throwaway test database
mongorestore \
  --uri="mongodb://localhost:27017" \
  --db="${TEST_DB}" \
  --gzip \
  "/tmp/backups/${BACKUP_DATE}/pixelcraft"

# 3. Verify data integrity (--quiet: print only the count)
USERS=$(mongosh "${TEST_DB}" --quiet \
  --eval "db.users.countDocuments()")
IMAGES=$(mongosh "${TEST_DB}" --quiet \
  --eval "db.images.countDocuments()")
echo "Users restored: ${USERS}"
echo "Images restored: ${IMAGES}"

# 4. Verify minimum counts
if [ "$USERS" -lt 100 ]; then
  echo "⚠️ User count low: ${USERS}"
  exit 1
fi

# 5. Cleanup
mongosh "${TEST_DB}" --quiet --eval "db.dropDatabase()"

END=$(date +%s)
DURATION=$((END - START))
echo "✅ Restore verified in ${DURATION}s"
echo "   RTO estimate: ~${DURATION}s"
Run this monthly. Automated restore test in CI: if the backup can't be restored, the test fails and you get alerted. A backup you haven't tested is a false sense of security.
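A sketch of that scheduled check as a GitHub Actions workflow, mirroring the backup workflow's conventions — the schedule, secret names, Mongo image, and the assumption that backups are named `YYYY-MM-DD_03-00` are all illustrative:

```yaml
# .github/workflows/restore-test.yml (illustrative)
name: Monthly Restore Test

on:
  schedule:
    - cron: '0 4 1 * *'   # 4 AM UTC, 1st of each month
  workflow_dispatch:

jobs:
  restore-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start throwaway MongoDB
        run: docker run -d -p 27017:27017 mongo:7

      - name: Run restore test
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET }}
        run: bash scripts/restore-test.sh "$(date -d yesterday +%Y-%m-%d_03-00)"

      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -d '{"text":"🔴 Restore test FAILED"}'
```

The runner would also need `mongorestore`/`mongosh` installed (e.g. via the MongoDB database tools packages); that setup step is omitted here.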
05
Disaster Recovery Plan Document
# DISASTER RECOVERY PLAN
# PixelCraft — Last updated: 2026-02

## Objectives
- RPO: ≤ 1 hour (max data loss)
- RTO: ≤ 2 hours (max downtime)

## Backup Schedule
- MongoDB Atlas: continuous + 6hr snapshots
- S3 images: nightly sync
- Full backup: nightly 3 AM UTC → S3
- Retention: 30 days

## Scenarios

### Database corruption / drop
1. Identify timestamp of incident
2. MongoDB Atlas → Restore → Point-in-time → select timestamp
3. Restore to new cluster
4. Update DATABASE_URL env var
5. Verify data, switch traffic

### Server failure
1. Railway auto-restarts containers
2. If persistent: redeploy from main
3. Health check alerts in < 5 min

### S3 data loss
1. S3 versioning enabled
2. Restore previous versions
3. Or: restore from nightly backup

### Complete infrastructure failure
1. Provision new Railway project
2. Restore DB from S3 backup
3. Deploy from GitHub (main branch)
4. Update DNS records
5. Verify all services

## Contacts
- On-call: #ops-alerts Slack
- MongoDB Atlas support: [link]
- Railway status: status.railway.app
06
Close the Ticket
git switch -c devops/PIXELCRAFT-101-backups
git add scripts/ docs/disaster-recovery.md .github/workflows/
git commit -m "Add automated backups + DR plan (PIXELCRAFT-101)"
git push origin devops/PIXELCRAFT-101-backups

# PR → Review → Merge → Close ticket ✅
CS.DEEP-DIVE

RPO and RTO define your backup strategy.

Every backup decision is a tradeoff between cost, complexity, and acceptable loss. The business decides how much loss is tolerable. Engineering builds the system to match.

// RPO: Recovery Point Objective
// "How much data can we lose?"

RPO = 0     → synchronous replication
  (every write copied instantly)
  Cost: 2× infrastructure

RPO = 1hr  → hourly incrementals
  (lose at most 1 hour)
  Cost: moderate

RPO = 24hr → nightly full backups
  (lose at most 1 day)
  Cost: low

// RTO: Recovery Time Objective
// "How fast must we recover?"

RTO = 0     → hot standby
  (instant failover)
RTO = 1hr  → warm standby
  (start backup server)
RTO = 24hr → cold restore
  (download + restore manually)

// PixelCraft targets:
// RPO ≤ 1hr, RTO ≤ 2hr
// Achievable with Atlas + S3.
"Recovery Lab"
[A]Run a disaster drill: intentionally drop a test collection on staging. Follow the DR plan step by step. Time the recovery. Was it within RTO? Document what went wrong, what was confusing, and improve the plan.
[B]Add backup verification to CI: after each nightly backup, automatically restore to a test database, run integrity checks (row counts, checksums), and report results. Alert if verification fails.
[C]Research: what is the 3-2-1 backup rule? (3 copies, 2 different media, 1 offsite.) How would you implement it for PixelCraft? What about geographic redundancy — backups in different AWS regions?
REF.MATERIAL
ARTICLE
MongoDB
Official Atlas backup guide: continuous backups, snapshots, point-in-time recovery. The managed solution for MongoDB backup and restore.
MONGODB · OFFICIAL · ESSENTIAL
ARTICLE
MongoDB
Official mongodump/mongorestore reference: command-line backup and restore for self-hosted MongoDB. Options for compression, filtering, and incremental dumps.
MONGODUMP · OFFICIAL
VIDEO
IBM Technology
DR planning fundamentals: RTO, RPO, backup strategies, failover architectures. How enterprises prepare for the worst and recover quickly.
DR · OVERVIEW
ARTICLE
Google
Google's approach to data protection: defense in depth for data, backup verification, recovery testing. From the free SRE book.
SRE · DATA · ESSENTIAL
VIDEO
TechWorld with Nana
Full vs incremental vs differential backups: when to use each, storage costs, restore complexity. Practical DevOps backup guide.
BACKUP · TUTORIAL
// LEAVE EXCITED BECAUSE
Automated nightly backups to S3. Point-in-time recovery via MongoDB Atlas. Restore tested and verified. Disaster recovery plan documented. RPO ≤ 1 hour, RTO ≤ 2 hours. The worst day is now a 2-hour problem, not a 3-day nightmare.