#75 — pg-dump-backup
PLANNED
Priority: 🔴 CRITICAL · Type: C (Celery) · Container: rgz-beat · Code: app/tasks/backup.py · Dependencies: #4 rgz-db
Description
Automated daily PostgreSQL backup at 03:00 UTC. Uses pg_dump's compressed custom format (.dump.gz), with 30 days of local retention plus optional S3 archiving. Critical for disaster recovery.
Each backup includes:
- Full schema (tables, indexes, constraints)
- Data (subscribers, sessions, invoices, incidents, etc.)
- Sequences and auto-increment values
- Functions and triggers
In case of DB corruption or a major incident, restoration is possible in under 30 minutes. The daily backup caps data loss at 24 hours.
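The sub-30-minute restore relies on pg_restore reading the custom-format dump directly. A minimal sketch of how the restore invocation could be assembled — the helper name, the default user/database, and the flag choices are assumptions for illustration, not part of the task code:

```python
# Hypothetical helper: builds the pg_restore command for a custom-format dump.
import shlex

def build_restore_command(dump_path, host="rgz-db", user="rgz", dbname="rgz"):
    """--clean --if-exists drops existing objects before restoring;
    --no-owner avoids role-mismatch errors on a fresh instance."""
    return [
        "pg_restore", "-h", host, "-U", user, "-d", dbname,
        "--clean", "--if-exists", "--no-owner", dump_path,
    ]

cmd = build_restore_command("/backups/20260205_030000.dump.gz")
print(shlex.join(cmd))
```

Run only against a recovered or empty instance: `--clean` is destructive by design.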
Internal Architecture
Backup Flow
Daily at 03:00 UTC:
↓
Celery Beat triggers rgz.backup.pg_dump
↓
Check preconditions:
1. Not already in progress (lock file check)
2. Free disk space > 10 GB
3. Timestamp = 03:00 +/- 5 min (tolerance)
↓
Run pg_dump:
pg_dump -h rgz-db -U $POSTGRES_USER -Fc -Z9 (custom format, internal compression) \
-d $POSTGRES_DB > /backups/$(date +%Y%m%d_%H%M%S).dump.gz
↓
Validate:
1. File exists
2. Size > 100 MB (sanity check: DB not empty)
3. CRC32 checksum OK
↓
Record metadata:
- Filename
- Size in bytes
- Checksum
- Creation timestamp
- Restore test status (optional)
↓
Retention cleanup:
find /backups -name "*.dump.gz" -mtime +30 -delete
(keeps 30 days)
↓
Optional archiving:
IF configured: aws s3 cp ... s3://rgz-backups/
↓
Notification:
Email NOC: "Backup OK — 1.2 GB — checksum xyz"
OR
Email NOC: "Backup FAILED — reason + manual action required"
↓
Prometheus metrics export:
rgz_backup_pg_dump_duration_seconds = 45.3
rgz_backup_pg_dump_size_bytes = 1240000000
rgz_backup_pg_dump_status{status="success"} = 1

Data Schema (Metadata)
```sql
-- Backup tracking table (in rgz-db)
CREATE TABLE backup_executions (
    id                  UUID PRIMARY KEY,
    backup_type         TEXT CHECK (backup_type IN ('pg_dump', 'config_git', 'apdp')),
    backup_date         TIMESTAMP,  -- when the backup was generated
    filename            TEXT,       -- e.g. 20260205_030000.dump.gz
    file_path           TEXT,       -- e.g. /backups/20260205_030000.dump.gz
    file_size_bytes     BIGINT,
    checksum_crc32      TEXT,       -- e.g. a1b2c3d4
    status              TEXT CHECK (status IN ('success', 'failed', 'partial')),
    duration_seconds    INT,
    error_message       TEXT,       -- if failed
    archive_s3_path     TEXT,       -- if archived to S3
    created_at          TIMESTAMP,
    retention_until     DATE,
    restore_tested_at   TIMESTAMP,  -- NULL = never tested
    restore_test_status TEXT CHECK (restore_test_status IN ('success', 'failed', 'not_tested'))
);

-- Indexes for fast cleanup queries
CREATE INDEX idx_backup_created ON backup_executions(created_at DESC);
CREATE INDEX idx_backup_type_date ON backup_executions(backup_type, backup_date DESC);
```

Example Backup
POSTGRESQL BACKUP — February 5, 2026
Start: 2026-02-05 03:00:15 UTC
Filename: /backups/20260205_030000.dump.gz
Status: SUCCESS ✓
Metadata:
Duration: 47.2 seconds
Size (compressed): 1.24 GB (1,240,000,000 bytes)
Size (uncompressed, est.): ~5.8 GB
Compression ratio: 78% (⚙️ optimal for PostgreSQL)
Checksum CRC32: f7e42c9f
Contents:
Tables: 45 (subscribers, resellers, radius_sessions, invoices, etc.)
Rows sample:
- subscribers: 487,234 rows
- radius_sessions: 12,454,891 rows (active sessions + history)
- invoices: 1,234,567 rows
- incidents: 2,061 rows
Indexes: 156
Triggers: 23 (Audit #48, DBA #26, etc.)
Sequences: 48
Archive:
Uploaded S3: s3://rgz-backups/2026/02/20260205_030000.dump.gz (SUCCESS)
Redundancy: 1 local copy + 1 S3 copy
Retention:
Delete after: 2026-03-07 (30 days)
ARCEP copy: Keep 7 years (compliance)
Validation:
pg_restore --list check: OK ✓
Restore test: Pending (scheduled for 2026-02-12 02:00)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
End: 2026-02-05 03:01:47 UTC
Next: 2026-02-06 03:00:00 UTC
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Configuration
```env
# PostgreSQL Backup
POSTGRES_BACKUP_ENABLED=true
POSTGRES_BACKUP_HOUR_UTC=3
POSTGRES_BACKUP_MINUTE_UTC=0
POSTGRES_BACKUP_TIME_TOLERANCE_MIN=5    # +/- 5 min
# Backup Storage
BACKUP_LOCAL_DIR=/backups
BACKUP_LOCAL_RETENTION_DAYS=30          # local: 30 days
BACKUP_CRITICAL_RETENTION_DAYS=2555     # ARCEP: 7 years (compliance copies)
# Disk management
BACKUP_MIN_FREE_SPACE_GB=10             # abort if < 10 GB free
BACKUP_WARNING_USED_PERCENT=80          # alert if > 80% used
# S3 Archive (optional)
BACKUP_S3_ENABLED=false
BACKUP_S3_BUCKET=rgz-backups
BACKUP_S3_REGION=us-east-1
BACKUP_S3_ACCESS_KEY_ID=***
BACKUP_S3_SECRET_ACCESS_KEY=***
# Compression
BACKUP_COMPRESSION_FORMAT=gzip          # gzip, bzip2, or none
BACKUP_COMPRESSION_LEVEL=9              # 1-9 (9 = max compression)
# Restore Tests
BACKUP_RESTORE_TEST_ENABLED=true
BACKUP_RESTORE_TEST_SCHEDULE=weekly     # test once per week
BACKUP_RESTORE_TEST_BACKUP_AGE_DAYS=7   # test a 7-day-old backup
# Notifications
BACKUP_NOTIFY_ON_SUCCESS=true
BACKUP_NOTIFY_ON_FAILURE=true
BACKUP_NOTIFY_RECIPIENTS=noc@rgz.local,ops@rgz.local
```

API Endpoints
| Method | Route | Description | Response |
|---|---|---|---|
| GET | /api/v1/backups/pg-dump/latest | Fetch latest backup metadata | |
| GET | /api/v1/backups/pg-dump/list?days=30 | List backups from the last 30 days | List[backup] |
| POST | /api/v1/backups/pg-dump/test-restore | Trigger a restore test (admin) | 202 Accepted + |
| GET | /api/v1/backups/pg-dump/restore-history | Restore test history | List[restore_tests] |
Authentication: Admin + NOC only
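For reference, a plausible response body for GET /api/v1/backups/pg-dump/latest, built from the backup_executions fields above — the exact payload shape is an assumption, not a documented contract:

```json
{
  "id": "…",
  "backup_type": "pg_dump",
  "backup_date": "2026-02-05T03:00:15Z",
  "filename": "20260205_030000.dump.gz",
  "file_size_bytes": 1240000000,
  "checksum_crc32": "f7e42c9f",
  "status": "success",
  "duration_seconds": 47,
  "archive_s3_path": "s3://rgz-backups/2026/02/20260205_030000.dump.gz",
  "restore_test_status": "not_tested"
}
```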
Celery Task
| Field | Value |
|---|---|
| Task name | rgz.backup.pg_dump |
| Schedule | Daily 03:00 UTC (0 3 * * *) |
| Queue | rgz.maintenance |
| Timeout | 600s (10 minutes) |
| Retry | 3x with exponential backoff (30 s, 60 s, 120 s) |
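The schedule above could be wired into Celery beat roughly as follows — a sketch, assuming the entry name `pg-dump-daily` and a standalone app object (the real app lives in app/celery_app.py):

```python
# Sketch of a beat entry for the daily 03:00 UTC backup task.
from celery import Celery
from celery.schedules import crontab

app = Celery('rgz')  # stand-in for the app defined in app/celery_app.py
app.conf.timezone = 'UTC'
app.conf.beat_schedule = {
    'pg-dump-daily': {
        'task': 'rgz.backup.pg_dump',
        'schedule': crontab(minute=0, hour=3),   # 0 3 * * *, UTC
        'options': {'queue': 'rgz.maintenance'},
    },
}
```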
Logic sketch:
```python
@app.task(name='rgz.backup.pg_dump', bind=True, max_retries=3)
def backup_postgresql(self):
    """Daily PostgreSQL backup at 03:00 UTC."""
    backup_date = datetime.utcnow()
    backup_filename = backup_date.strftime('%Y%m%d_%H%M%S') + '.dump.gz'
    backup_path = os.path.join(settings.BACKUP_LOCAL_DIR, backup_filename)
    try:
        # 1. Check preconditions
        free_space_gb = _get_disk_free_space() / (1024 ** 3)
        if free_space_gb < settings.BACKUP_MIN_FREE_SPACE_GB:
            raise Exception(
                f"Insufficient disk space: {free_space_gb:.1f} GB "
                f"< {settings.BACKUP_MIN_FREE_SPACE_GB} GB"
            )

        # 2. Run pg_dump (custom format, internal compression)
        logger.info(f"Starting PostgreSQL backup: {backup_path}")
        cmd = (
            f"pg_dump -h {settings.DB_HOST} -U {settings.DB_USER} "
            f"-Fc -Z{settings.BACKUP_COMPRESSION_LEVEL} "
            f"{settings.DB_NAME} > {backup_path}"
        )
        result = subprocess.run(
            cmd, shell=True, capture_output=True, timeout=600, text=True,
            env={**os.environ, 'PGPASSWORD': settings.DB_PASSWORD}
        )
        if result.returncode != 0:
            raise Exception(f"pg_dump failed: {result.stderr}")

        # 3. Validate the backup
        if not os.path.exists(backup_path):
            raise Exception("Backup file not created")
        file_size = os.path.getsize(backup_path)
        if file_size < 100 * 1024 * 1024:  # < 100 MB sanity check
            raise Exception(f"Backup too small: {file_size} bytes (expected > 100 MB)")

        # 4. Compute checksum
        checksum = _calculate_crc32(backup_path)

        # 5. Record metadata
        backup_exec = BackupExecution(
            backup_type='pg_dump',
            backup_date=backup_date,
            filename=backup_filename,
            file_path=backup_path,
            file_size_bytes=file_size,
            checksum_crc32=checksum,
            status='success',
            duration_seconds=(datetime.utcnow() - backup_date).total_seconds(),
            retention_until=backup_date + timedelta(days=settings.BACKUP_LOCAL_RETENTION_DAYS)
        )
        db.add(backup_exec)
        db.commit()

        # 6. Archive to S3 (optional)
        if settings.BACKUP_S3_ENABLED:
            s3_path = (f"s3://{settings.BACKUP_S3_BUCKET}/"
                       f"{backup_date.strftime('%Y/%m')}/{backup_filename}")
            try:
                _upload_to_s3(backup_path, s3_path)
                backup_exec.archive_s3_path = s3_path
                db.commit()
                logger.info(f"Backup archived to S3: {s3_path}")
            except Exception as e:
                logger.warning(f"S3 archive failed (non-critical): {e}")

        # 7. Retention cleanup
        _cleanup_old_backups()

        # 8. Notification
        send_email.delay(
            to=settings.BACKUP_NOTIFY_RECIPIENTS,
            subject=f"PostgreSQL Backup Success — {backup_filename}",
            template='backup_success',
            context={
                'filename': backup_filename,
                'size_gb': file_size / (1024 ** 3),
                'checksum': checksum,
                'created_at': backup_date
            }
        )

        # 9. Export Prometheus metrics
        _export_prometheus_metrics('pg_dump', {
            'duration': backup_exec.duration_seconds,
            'size': file_size,
            'status': 'success'
        })
        logger.info(f"PostgreSQL backup completed: {backup_filename} "
                    f"({file_size / 1024 / 1024:.1f} MB)")
        return {
            'status': 'success',
            'filename': backup_filename,
            'size': file_size,
            'checksum': checksum
        }
    except Exception as e:
        logger.error(f"PostgreSQL backup failed: {e}")
        # Record the failure
        backup_exec = BackupExecution(
            backup_type='pg_dump',
            backup_date=backup_date,
            filename=backup_filename,
            file_path=backup_path,
            status='failed',
            error_message=str(e),
            duration_seconds=(datetime.utcnow() - backup_date).total_seconds()
        )
        db.add(backup_exec)
        db.commit()
        # Failure notification
        send_email.delay(
            to=settings.BACKUP_NOTIFY_RECIPIENTS,
            subject='ALERT: PostgreSQL Backup FAILED',
            template='backup_failure',
            context={'error': str(e), 'datetime': backup_date},
            priority='high'
        )
        # Exponential backoff: 30 s, 60 s, 120 s (matches the retry policy above)
        raise self.retry(exc=e, countdown=30 * (2 ** self.request.retries))
```

Useful Commands
```bash
# Trigger a backup manually
docker-compose exec rgz-api celery -A app.celery_app call rgz.backup.pg_dump
# List local backups
ls -lh /backups/*.dump.gz | sort -k 6 | tail -15
# Check total backup size
du -sh /backups/
# Fetch latest backup metadata
curl -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/backups/pg-dump/latest" | jq
# List backups from the last 30 days
curl -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/backups/pg-dump/list?days=30" | jq
# Check backup integrity (list tables)
pg_restore --list /backups/20260205_030000.dump.gz | head -50
# Compute CRC32 of a local backup (zlib variant, matching the stored checksum;
# note that `cksum` uses a different CRC-32 polynomial and will not match)
python3 -c 'import sys, zlib; print("%08x" % (zlib.crc32(open(sys.argv[1], "rb").read()) & 0xffffffff))' /backups/20260205_030000.dump.gz
# Trigger a restore test manually
curl -X POST -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/backups/pg-dump/test-restore" -d '{
    "backup_filename": "20260205_030000.dump.gz"
  }'
# Backup logs
docker-compose logs rgz-beat | grep "pg_dump"
# Check the cron schedule
docker-compose exec rgz-beat celery -A app.celery_app inspect scheduled | jq
```

Implementation TODO
- [ ] DB schema: `backup_executions` table with indexes
- [ ] Celery task `rgz.backup.pg_dump` in `app/tasks/backup.py`
- [ ] Shell script `/entrypoint.sh` for pg_dump execution
- [ ] Function `_calculate_crc32()` for integrity validation
- [ ] Function `_cleanup_old_backups()` for retention
- [ ] Function `_upload_to_s3()` for optional archiving
- [ ] Function `_export_prometheus_metrics()` for monitoring
- [ ] API endpoints GET/POST /api/v1/backups/pg-dump*
- [ ] Email templates (success, failure)
- [ ] Restore test: Python script to test once per week
- [ ] Prometheus alerts: backup > 10 min, disk space < 10%, failure
- [ ] Documentation: restore SOP, disaster recovery procedure, RTO/RPO
Last updated: 2026-02-21