
#75 — pg-dump-backup

PLANNED

Priority: 🔴 CRITICAL · Type: C (Celery) · Container: rgz-beat · Code: app/tasks/backup.py · Dependencies: #4 rgz-db


Description

Automatic daily PostgreSQL backup at 03:00 UTC. Uses pg_dump's compressed custom format (.dump.gz), with 30 days of local retention plus optional S3 archiving. Critical for disaster recovery.

Each backup includes:

  • Full schema (tables, indexes, constraints)
  • Data (subscribers, sessions, invoices, incidents, etc.)
  • Sequences and auto-increment values
  • Functions and triggers

In case of DB corruption or a major incident, a restore is possible in under 30 minutes. The daily backup limits data loss to at most 24 hours.

Internal Architecture

Backup Flow

Daily at 03:00 UTC:

Celery Beat triggers rgz.backup.pg_dump

Check preconditions (see the sketch after this flow):
  1. Not already in progress (lock file check)
  2. Free disk space > 10 GB
  3. Timestamp = 03:00 +/- 5 min (tolerance)

Run pg_dump:
  pg_dump -h rgz-db -U $POSTGRES_USER -Fc (custom format) \
          -d $POSTGRES_DB | gzip > /backups/$(date +%Y%m%d_%H%M%S).dump.gz

Validate:
  1. File exists
  2. Size > 100 MB (sanity check: DB not empty)
  3. CRC32 checksum OK

Record metadata:
  - Filename
  - Size in bytes
  - Checksum
  - Creation timestamp
  - Restore test status (optional)

Retention cleanup:
  find /backups -name "*.dump.gz" -mtime +30 -delete
  (keeps 30 days)

Optional archiving:
  IF configured: aws s3 cp ... s3://rgz-backups/

Notification:
  Email NOC: "Backup OK — 1.2 GB — checksum xyz"
  OR
  Email NOC: "Backup FAILED — reason + manual action required"

Prometheus metrics export:
  rgz_backup_pg_dump_duration_seconds = 45.3
  rgz_backup_pg_dump_size_bytes = 1240000000
  rgz_backup_pg_dump_status{status="success"} = 1
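
The precondition step at the top of this flow is not specified beyond the list above. A minimal sketch of the disk-space helper referenced in the task code further down (_get_disk_free_space) plus a lock-file pair; the lock path and lock function names are illustrative, not fixed by this spec:

python
import os
import shutil

LOCK_FILE = '/backups/.pg_dump.lock'   # hypothetical location

def _get_disk_free_space(path: str = '/backups') -> int:
    """Free space, in bytes, on the filesystem holding the backup directory."""
    return shutil.disk_usage(path).free

def _acquire_backup_lock() -> bool:
    """Return True if no other backup is in progress; creates the lock file atomically."""
    try:
        fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True
    except FileExistsError:
        return False

def _release_backup_lock() -> None:
    """Remove the lock file once the backup finishes (success or failure)."""
    try:
        os.remove(LOCK_FILE)
    except FileNotFoundError:
        pass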

Data Schema (Metadata)

sql
-- Table tracking backups (in rgz-db)
CREATE TABLE backup_executions (
    id                  UUID PRIMARY KEY,
    backup_type         TEXT NOT NULL CHECK (backup_type IN ('pg_dump', 'config_git', 'apdp')),
    backup_date         TIMESTAMP NOT NULL,      -- when the backup was generated
    filename            TEXT,                    -- e.g. 20260205_030000.dump.gz
    file_path           TEXT,                    -- e.g. /backups/20260205_030000.dump.gz
    file_size_bytes     BIGINT,
    checksum_crc32      TEXT,                    -- e.g. a1b2c3d4
    status              TEXT NOT NULL CHECK (status IN ('success', 'failed', 'partial')),
    duration_seconds    INT,
    error_message       TEXT,                    -- if failed
    archive_s3_path     TEXT,                    -- if archived to S3
    created_at          TIMESTAMP DEFAULT now(),
    retention_until     DATE,
    restore_tested_at   TIMESTAMP,               -- NULL = never tested
    restore_test_status TEXT CHECK (restore_test_status IN ('success', 'failed', 'not_tested'))
);

-- Indexes for fast cleanup and lookups
CREATE INDEX idx_backup_created ON backup_executions(created_at DESC);
CREATE INDEX idx_backup_type_date ON backup_executions(backup_type, backup_date DESC);
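
The task sketch further down references a BackupExecution ORM model for this table. A minimal SQLAlchemy mapping, assuming the project uses SQLAlchemy's declarative style (not stated in this spec), could look like:

python
import uuid
from sqlalchemy import Column, Text, BigInteger, Integer, TIMESTAMP, Date
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class BackupExecution(Base):
    __tablename__ = 'backup_executions'

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    backup_type = Column(Text, nullable=False)       # pg_dump | config_git | apdp
    backup_date = Column(TIMESTAMP, nullable=False)
    filename = Column(Text)
    file_path = Column(Text)
    file_size_bytes = Column(BigInteger)
    checksum_crc32 = Column(Text)
    status = Column(Text, nullable=False)            # success | failed | partial
    duration_seconds = Column(Integer)
    error_message = Column(Text)
    archive_s3_path = Column(Text)
    created_at = Column(TIMESTAMP)
    retention_until = Column(Date)
    restore_tested_at = Column(TIMESTAMP)            # NULL = never tested
    restore_test_status = Column(Text)               # success | failed | not_tested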

Backup Example

POSTGRESQL BACKUP — February 5, 2026

Start:        2026-02-05 03:00:15 UTC
Filename:     /backups/20260205_030000.dump.gz
Status:       SUCCESS ✓

Metadata:
  Duration:          47.2 seconds
  Size (compressed): 1.24 GB (1,240,000,000 bytes)
  Size (uncompressed, est.): ~5.8 GB
  Compression ratio: 78% (⚙️ optimal for PostgreSQL)
  Checksum CRC32:    f7e42c9f

Contents:
  Tables: 45 (subscribers, resellers, radius_sessions, invoices, etc.)
  Row counts (sample):
    - subscribers: 487,234 rows
    - radius_sessions: 12,454,891 rows (active sessions + history)
    - invoices: 1,234,567 rows
    - incidents: 2,061 rows
  Indexes: 156
  Triggers: 23 (Audit #48, DBA #26, etc.)
  Sequences: 48

Archive:
  Uploaded S3: s3://rgz-backups/2026/02/20260205_030000.dump.gz (SUCCESS)
  Redundancy: 1 local copy + 1 S3 copy

Retention:
  Delete after: 2026-03-07 (30 days)
  ARCEP copy: Keep 7 years (compliance)

Validation:
  pg_restore --list check: OK ✓
  Restore test: Pending (scheduled for 2026-02-12 02:00)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
End:   2026-02-05 03:01:47 UTC
Next:  2026-02-06 03:00:00 UTC
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Configuration

env
# PostgreSQL Backup
POSTGRES_BACKUP_ENABLED=true
POSTGRES_BACKUP_HOUR_UTC=3
POSTGRES_BACKUP_MINUTE_UTC=0
POSTGRES_BACKUP_TIME_TOLERANCE_MIN=5    # +/- 5min

# Backup Storage
BACKUP_LOCAL_DIR=/backups
BACKUP_LOCAL_RETENTION_DAYS=30          # Local: 30 days
BACKUP_CRITICAL_RETENTION_DAYS=2555     # ARCEP compliance copies: 7 years (2555 days)

# Disk management
BACKUP_MIN_FREE_SPACE_GB=10             # Abort if < 10GB free
BACKUP_WARNING_USED_PERCENT=80          # Alert if > 80% used

# S3 Archive (optional)
BACKUP_S3_ENABLED=false
BACKUP_S3_BUCKET=rgz-backups
BACKUP_S3_REGION=us-east-1
BACKUP_S3_ACCESS_KEY_ID=***
BACKUP_S3_SECRET_ACCESS_KEY=***

# Compression
BACKUP_COMPRESSION_FORMAT=gzip          # gzip, bzip2, or none
BACKUP_COMPRESSION_LEVEL=9              # 1-9 (9 = max compression)

# Restore Tests
BACKUP_RESTORE_TEST_ENABLED=true
BACKUP_RESTORE_TEST_SCHEDULE=weekly     # Test once per week
BACKUP_RESTORE_TEST_BACKUP_AGE_DAYS=7   # Age of the backup to test (7 days)

# Notifications
BACKUP_NOTIFY_ON_SUCCESS=true
BACKUP_NOTIFY_ON_FAILURE=true
BACKUP_NOTIFY_RECIPIENTS=noc@rgz.local,ops@rgz.local
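
The task and API sketches in this spec read these values through a settings object. A hedged mapping with pydantic-settings (the actual configuration framework is not stated here; names mirror the env block above and only a subset is shown):

python
from pydantic_settings import BaseSettings

class BackupSettings(BaseSettings):
    # Defaults mirror the sample values above; real values come from the environment
    POSTGRES_BACKUP_ENABLED: bool = True
    POSTGRES_BACKUP_HOUR_UTC: int = 3
    POSTGRES_BACKUP_MINUTE_UTC: int = 0
    BACKUP_LOCAL_DIR: str = '/backups'
    BACKUP_LOCAL_RETENTION_DAYS: int = 30
    BACKUP_MIN_FREE_SPACE_GB: int = 10
    BACKUP_S3_ENABLED: bool = False
    BACKUP_S3_BUCKET: str = 'rgz-backups'
    BACKUP_S3_REGION: str = 'us-east-1'
    BACKUP_COMPRESSION_LEVEL: int = 9
    BACKUP_NOTIFY_RECIPIENTS: str = 'noc@rgz.local,ops@rgz.local'

settings = BackupSettings()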

API Endpoints

Method   Route                                       Description                          Response
GET      /api/v1/backups/pg-dump/latest              Retrieve latest backup metadata
GET      /api/v1/backups/pg-dump/list?days=30        List backups from the last 30 days   List[backup]
POST     /api/v1/backups/pg-dump/test-restore        Trigger a restore test (admin)       202 Accepted +
GET      /api/v1/backups/pg-dump/restore-history     Restore test history                 List[restore_tests]

Authentication: Admin + NOC only
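
A sketch of the "latest" endpoint, assuming the rgz-api service exposes it via FastAPI (the framework is not stated in this spec) and that get_db / require_admin_or_noc are existing app-level dependencies:

python
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy import select

router = APIRouter(prefix='/api/v1/backups/pg-dump')

@router.get('/latest')
def get_latest_backup(db=Depends(get_db), user=Depends(require_admin_or_noc)):
    """Return metadata of the most recent pg_dump backup."""
    # get_db / require_admin_or_noc: assumed application dependencies (DB session + auth)
    backup = db.execute(
        select(BackupExecution)
        .where(BackupExecution.backup_type == 'pg_dump')
        .order_by(BackupExecution.backup_date.desc())
        .limit(1)
    ).scalar_one_or_none()
    if backup is None:
        raise HTTPException(status_code=404, detail='No backup found')
    return {
        'filename': backup.filename,
        'file_size_bytes': backup.file_size_bytes,
        'checksum_crc32': backup.checksum_crc32,
        'status': backup.status,
        'backup_date': backup.backup_date,
        'restore_test_status': backup.restore_test_status,
    }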

Celery Task

Field       Value
Task name   rgz.backup.pg_dump
Schedule    Daily 03:00 UTC (0 3 * * *)
Queue       rgz.maintenance
Timeout     600s (10 minutes)
Retry       3x with exponential backoff (30s, 60s, 120s)
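
The schedule above could be registered with Celery beat roughly as follows (a sketch; the real beat configuration file is not shown in this spec, and app is assumed to be the Celery instance from app.celery_app):

python
from celery.schedules import crontab

app.conf.beat_schedule = {
    'rgz-backup-pg-dump': {
        'task': 'rgz.backup.pg_dump',
        # Daily at 03:00 UTC (0 3 * * *)
        'schedule': crontab(hour=3, minute=0),
        'options': {'queue': 'rgz.maintenance', 'time_limit': 600},
    },
}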

Logic sketch:

python
# Sketch for app/tasks/backup.py: stdlib imports shown; app-level objects
# (app, db, settings, logger, send_email, BackupExecution) come from the application package
import os
import subprocess
from datetime import datetime, timedelta

@app.task(name='rgz.backup.pg_dump', bind=True)
def backup_postgresql(self):
    """
    Backup PostgreSQL daily 03:00 UTC
    """
    backup_date = datetime.utcnow()
    backup_filename = backup_date.strftime('%Y%m%d_%H%M%S') + '.dump.gz'
    backup_path = os.path.join(settings.BACKUP_LOCAL_DIR, backup_filename)

    try:
        # 1. Check preconditions
        free_space_gb = _get_disk_free_space() / (1024**3)
        if free_space_gb < settings.BACKUP_MIN_FREE_SPACE_GB:
            raise Exception(f"Insufficient disk space: {free_space_gb} GB < {settings.BACKUP_MIN_FREE_SPACE_GB} GB")

        # 2. Run pg_dump (custom format -Fc, compressed internally via -Z)
        logger.info(f"Starting PostgreSQL backup: {backup_path}")

        cmd = (
            f"pg_dump -h {settings.DB_HOST} -U {settings.DB_USER} "
            f"-Fc -Z{settings.BACKUP_COMPRESSION_LEVEL} "
            f"{settings.DB_NAME} > {backup_path}"
        )

        result = subprocess.run(
            cmd, shell=True, capture_output=True, timeout=600, text=True,
            env={**os.environ, 'PGPASSWORD': settings.DB_PASSWORD}
        )

        if result.returncode != 0:
            raise Exception(f"pg_dump failed: {result.stderr}")

        # 3. Validate the backup
        if not os.path.exists(backup_path):
            raise Exception("Backup file not created")

        file_size = os.path.getsize(backup_path)
        if file_size < 100 * 1024 * 1024:  # < 100 MB sanity check
            raise Exception(f"Backup too small: {file_size} bytes (expected > 100 MB)")

        # 4. Compute checksum
        checksum = _calculate_crc32(backup_path)

        # 5. Record metadata
        backup_exec = BackupExecution(
            backup_type='pg_dump',
            backup_date=backup_date,
            filename=backup_filename,
            file_path=backup_path,
            file_size_bytes=file_size,
            checksum_crc32=checksum,
            status='success',
            duration_seconds=int((datetime.utcnow() - backup_date).total_seconds()),
            retention_until=backup_date + timedelta(days=settings.BACKUP_LOCAL_RETENTION_DAYS)
        )
        db.add(backup_exec)
        db.commit()

        # 6. S3 archive (optional)
        if settings.BACKUP_S3_ENABLED:
            s3_path = f"s3://{settings.BACKUP_S3_BUCKET}/{backup_date.strftime('%Y/%m')}/{backup_filename}"
            try:
                _upload_to_s3(backup_path, s3_path)
                backup_exec.archive_s3_path = s3_path
                db.commit()
                logger.info(f"Backup archived to S3: {s3_path}")
            except Exception as e:
                logger.warning(f"S3 archive failed (non-critical): {e}")

        # 7. Retention cleanup
        _cleanup_old_backups()

        # 8. Notification
        send_email.delay(
            to=settings.BACKUP_NOTIFY_RECIPIENTS,
            subject=f"PostgreSQL Backup Success — {backup_filename}",
            template='backup_success',
            context={
                'filename': backup_filename,
                'size_gb': file_size / (1024**3),
                'checksum': checksum,
                'created_at': backup_date
            }
        )

        # 9. Export Prometheus metrics
        _export_prometheus_metrics('pg_dump', {
            'duration': backup_exec.duration_seconds,
            'size': file_size,
            'status': 'success'
        })

        logger.info(f"PostgreSQL backup completed: {backup_filename} ({file_size/1024/1024:.1f} MB)")
        return {
            'status': 'success',
            'filename': backup_filename,
            'size': file_size,
            'checksum': checksum
        }

    except Exception as e:
        logger.error(f"PostgreSQL backup failed: {e}")

        # Record the failure
        backup_exec = BackupExecution(
            backup_type='pg_dump',
            backup_date=backup_date,
            filename=backup_filename,
            file_path=backup_path,
            status='failed',
            error_message=str(e),
            duration_seconds=int((datetime.utcnow() - backup_date).total_seconds())
        )
        db.add(backup_exec)
        db.commit()

        # Failure notification
        send_email.delay(
            to=settings.BACKUP_NOTIFY_RECIPIENTS,
            subject='ALERT: PostgreSQL Backup FAILED',
            template='backup_failure',
            context={'error': str(e), 'datetime': backup_date},
            priority='high'
        )

        self.retry(exc=e, countdown=300)  # Retry in 5min
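
The helpers referenced in the task (_calculate_crc32, _cleanup_old_backups, _upload_to_s3, _export_prometheus_metrics) are not defined in this spec. Minimal sketches, assuming prometheus_client and boto3 are available in the container; settings comes from the application package as above, and the metrics textfile path is illustrative:

python
import os
import time
import zlib

from prometheus_client import CollectorRegistry, Gauge, write_to_textfile


def _calculate_crc32(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Streamed zlib CRC32 of the backup file, returned as 8 hex characters."""
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, '08x')


def _cleanup_old_backups() -> int:
    """Delete local *.dump.gz files older than the retention window; return the count removed."""
    cutoff = time.time() - settings.BACKUP_LOCAL_RETENTION_DAYS * 86400
    deleted = 0
    for name in os.listdir(settings.BACKUP_LOCAL_DIR):
        if not name.endswith('.dump.gz'):
            continue
        path = os.path.join(settings.BACKUP_LOCAL_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            deleted += 1
    return deleted


def _upload_to_s3(local_path: str, s3_uri: str) -> None:
    """Upload the backup to S3 (credentials taken from the S3 settings above)."""
    import boto3
    bucket, key = s3_uri.replace('s3://', '', 1).split('/', 1)
    boto3.client(
        's3',
        region_name=settings.BACKUP_S3_REGION,
        aws_access_key_id=settings.BACKUP_S3_ACCESS_KEY_ID,
        aws_secret_access_key=settings.BACKUP_S3_SECRET_ACCESS_KEY,
    ).upload_file(local_path, bucket, key)


def _export_prometheus_metrics(backup_type: str, values: dict) -> None:
    """Write backup metrics to a textfile for node_exporter (a Pushgateway would also work)."""
    registry = CollectorRegistry()
    Gauge('rgz_backup_pg_dump_duration_seconds', 'Backup duration', registry=registry).set(values['duration'])
    Gauge('rgz_backup_pg_dump_size_bytes', 'Backup size', registry=registry).set(values['size'])
    Gauge('rgz_backup_pg_dump_status', 'Backup status (1 = reported state)',
          ['status'], registry=registry).labels(status=values['status']).set(1)
    write_to_textfile(f'/backups/{backup_type}_metrics.prom', registry)   # illustrative path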

Useful Commands

bash
# Trigger a backup manually
docker-compose exec rgz-api celery -A app.celery_app call rgz.backup.pg_dump

# List local backups (newest last; timestamped filenames sort chronologically)
ls -lh /backups/*.dump.gz | tail -15

# Check total backup size
du -sh /backups/

# Fetch metadata for the latest backup
curl -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/backups/pg-dump/latest" | jq

# List backups from the last 30 days
curl -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/backups/pg-dump/list?days=30" | jq

# Check backup integrity (list archive contents)
pg_restore --list /backups/20260205_030000.dump.gz | head -50

# Checksum a local backup (note: the POSIX cksum CRC differs from the zlib CRC32 stored by the task)
cksum /backups/20260205_030000.dump.gz | awk '{print $1}'

# Trigger a restore test manually
curl -X POST -H "Authorization: Bearer {admin_token}" \
  -H "Content-Type: application/json" \
  "http://api-rgz.duckdns.org/api/v1/backups/pg-dump/test-restore" -d '{
    "backup_filename": "20260205_030000.dump.gz"
  }'

# Backup logs
docker-compose logs rgz-beat | grep "pg_dump"

# Check the cron schedule
docker-compose exec rgz-beat celery -A app.celery_app inspect scheduled | jq
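
The weekly restore test (see the TODO list below) is only specified by its schedule. One possible sketch restores the newest dump into a throwaway database and runs a basic sanity query; the scratch database name rgz_restore_test is illustrative, the backup user is assumed to be allowed to create databases, and settings / logger come from the application package as in the task above:

python
import os
import subprocess

def restore_test(backup_path: str, scratch_db: str = 'rgz_restore_test') -> bool:
    """Restore a custom-format dump into a scratch database and sanity-check it."""
    env = {**os.environ, 'PGPASSWORD': settings.DB_PASSWORD}
    conn = ['-h', settings.DB_HOST, '-U', settings.DB_USER]

    # Recreate the scratch database
    subprocess.run(['dropdb', *conn, '--if-exists', scratch_db], env=env, check=False)
    subprocess.run(['createdb', *conn, scratch_db], env=env, check=True)

    # Restore the dump (custom format, so pg_restore reads it directly)
    restore = subprocess.run(
        ['pg_restore', *conn, '-d', scratch_db, '--no-owner', backup_path],
        env=env, capture_output=True, text=True, timeout=1800,
    )
    if restore.returncode != 0:
        logger.error(f"Restore test failed: {restore.stderr}")
        return False

    # Basic sanity check: the subscribers table must not be empty
    count = subprocess.run(
        ['psql', *conn, '-d', scratch_db, '-tAc', 'SELECT count(*) FROM subscribers'],
        env=env, capture_output=True, text=True,
    )
    return count.returncode == 0 and int(count.stdout.strip()) > 0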

Implementation TODO

  • [ ] backup_executions DB schema with indexes
  • [ ] Celery task rgz.backup.pg_dump in app/tasks/backup.py
  • [ ] Shell script /entrypoint.sh for running pg_dump
  • [ ] _calculate_crc32() integrity-validation function
  • [ ] _cleanup_old_backups() retention function
  • [ ] _upload_to_s3() optional archiving function
  • [ ] _export_prometheus_metrics() monitoring function
  • [ ] API endpoints GET/POST /api/v1/backups/pg-dump*
  • [ ] Email templates (success, failure)
  • [ ] Restore test: Python script run once a week
  • [ ] Prometheus alerts: backup > 10 min, disk space < 10%, failure
  • [ ] Documentation: restore SOP, disaster recovery procedure, RTO/RPO

Last updated: 2026-02-21

PROJET MOSAÏQUE — 81 tools, 22 containers, 500+ WiFi Zone resellers