#60 — Automatic Rollback Decision
PLANNED
Priority: 🟠 HIGH · Type: TYPE B · Container: rgz-api · Code: app/services/rollback.py
Dependencies: #58 incident-escalation
Description
The automatic rollback engine detects faulty configurations or firmware and reverts them without manual intervention. It relies on 4 measurable triggers (session drop >50%, RADIUS error rate >20%, P95 latency >5s, 3+ simultaneous P0 alerts) and automatically rolls back to the previous stable step.
Two rollback types are supported: an nftables rollback (restores the last validated network config) and a firmware rollback (reverts the CPE to an earlier version). Every automatic rollback sends a notification to the NOC and the DG. The system records all triggers and decisions in the history for RCA and future prevention.
Internal Architecture
Rollback triggers and automatic actions:
TRIGGER 1: SESSION COUNT DROP > 50%
Condition: active_session_count() < (baseline * 0.5) for >2 min
Baseline: average session count over the last 30 days (per reseller)
Example: reseller A avg=150 sessions → drop to 70 → trigger!
Probable cause: new network config blocks clients, or firmware bug
Action:
- Rollback nftables config to last stable version
- Reload firewall rules
- Monitor recovery (expected within 10 min, otherwise escalate P0)
Notification: SMS DG "ROLLBACK TRIGGERED: {reseller} session drop {old}→{new}"
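The sustain requirement in Trigger 1 ("for >2 min") matters: a single scrape glitch must not fire a rollback. A minimal sketch of the check, assuming per-minute session samples and an illustrative function name:

```python
def session_drop_triggered(samples_per_minute, baseline, threshold=0.5, sustain_minutes=2):
    """Fire only when every sample in the last `sustain_minutes` minutes sits
    below baseline * (1 - threshold), so a one-off dip cannot trigger."""
    if baseline <= 0 or len(samples_per_minute) < sustain_minutes:
        return False  # no baseline yet, or not enough history to judge
    recent = samples_per_minute[-sustain_minutes:]
    return all(s < baseline * (1 - threshold) for s in recent)
```

With the example above (baseline 150, drop to ~70 sustained for two samples) the trigger fires; a momentary dip followed by recovery does not.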
TRIGGER 2: RADIUS AUTH FAILURE RATE > 20%
Condition: (failed_auth / total_auth) > 0.2 over a 5 min window
Metric: Prometheus metric radius_auth_failures_total
Example: 8 failures / 40 attempts = 20% → trigger!
Probable cause: NAS secret mismatch, RADIUS server unreachable
Action:
- Rollback RADIUS client config on CPE
- Flush RADIUS cache (Redis keys rgz:session:*)
- Test auth with a test user
Notification: SMS NOC "AUTH FAILURE SPIKE {reseller}: {percent}% failures"
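The ratio in Trigger 2 is simple, but a guard against tiny windows is worth stating: one failed login on a nearly idle reseller should not look like a 100% failure rate. A sketch, where the `min_attempts` floor is an assumption (not specified above):

```python
def auth_failure_rate(failed, total, min_attempts=10):
    """Failure ratio over the rolling window. Windows with fewer than
    `min_attempts` attempts (an assumed floor) return 0.0 so a single
    failed login on a quiet reseller cannot fire the trigger."""
    if total < min_attempts:
        return 0.0
    return failed / total
```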
TRIGGER 3: LATENCY P95 > 5000ms (5 seconds)
Condition: p95_latency > 5000ms over a 5 min window
Metric: Prometheus latency_p95_ms per reseller
Probable cause: QoS misconfig, broken traffic engineering, BGP issues
Action:
- Rollback nftables QoS config (#27 HTB)
- Rollback DSCP marking (#28)
- Revert to default routing (no policy-based routing)
Notification: SMS Tier2 "LATENCY DEGRADATION {reseller}: {latency}ms"
TRIGGER 4: MULTIPLE SIMULTANEOUS P0 ALERTS
Condition: ≥3 P0 incidents open at the same time
Source: incident_escalation.py (#58) monitoring
Probable cause: cascading failures, or a config change broke multiple things
Action:
- Full rollback: nftables + firmware + VLAN config
- Restore from the last known-good snapshot (max 30 min old)
- Escalate P0 → crisis-dispatcher (#64) broadcast
Notification: SMS DG + Tier3 "CRITICAL ROLLBACK REQUIRED {reseller}"
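The "max 30 min old" constraint on the snapshot in Trigger 4 deserves a concrete rule: if no stable snapshot is recent enough, restoring stale state is worse than escalating. A sketch, with an assumed record shape `{"taken_at", "is_stable"}`:

```python
from datetime import timedelta

def pick_snapshot(snapshots, now, max_age_minutes=30):
    """Return the most recent stable snapshot no older than `max_age_minutes`,
    or None — in which case the safe move is manual escalation rather than
    restoring stale state."""
    cutoff = now - timedelta(minutes=max_age_minutes)
    candidates = [s for s in snapshots if s["is_stable"] and s["taken_at"] >= cutoff]
    return max(candidates, key=lambda s: s["taken_at"]) if candidates else None
```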
ROLLBACK FLOW:
1. Trigger condition detected (Prometheus metric OR incident stream)
2. System queries rollback history: last_stable_version timestamp
3. Create rollback decision record: {trigger, reseller, action, timestamp}
4. Execute rollback actions (in parallel):
   a. nftables: nft delete table inet rgz_table; restore /etc/nftables.d/{reseller}.bak
   b. Firmware: UBNT SSH → upgrade {cpe_model} --version {stable_version}
   c. Config: Kubernetes ConfigMap → apply previous RADIUS/Kea/DNS configs
5. Wait for stabilization (5 min window on latency/session metrics)
6. Check recovery:
   - If metrics are back to baseline within 15 min: APPROVED (ticket closed)
   - If metrics are still degraded after 15 min: ESCALATE P0 (manual intervention needed)
7. Notify: post-mortem RCA email with root cause analysis
8. Prevent recurrence: add a regression test to the CI/CD pipeline
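Step 6 of the flow can be reduced to a small decision function. The ±10% tolerance band around the baseline is an assumption (the flow only says "back to baseline"); result strings follow RollbackResult below:

```python
def evaluate_recovery(baseline, metric_after_15min, tolerance=0.10):
    """Compare the post-rollback metric to the baseline. Within `tolerance`
    (an assumed ±10%) → "approved"; otherwise → "manual_escalation"
    (P0, human intervention needed)."""
    if baseline <= 0:
        return "manual_escalation"  # nothing to compare against
    deviation = abs(metric_after_15min - baseline) / baseline
    return "approved" if deviation <= tolerance else "manual_escalation"
```

For the Trigger 1 example, recovery from 70 back to 149 sessions against a baseline of 150 is approved; stalling at 70 escalates.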
ROLLBACK HISTORY (audit trail):
trigger_type: session_drop | auth_failure | latency_spike | multi_p0
trigger_value: "50% drop" | "22% failures" | "7200ms" | "4 P0 incidents"
action_taken: nftables_rollback | firmware_rollback | full_rollback
result: approved | failed_still_degraded | manual_escalation
duration_minutes: time from trigger to recovery or manual escalation
root_cause: determined post-incident (config mistake, versioning issue, etc.)
Data Models
class RollbackTrigger(str, Enum):
    SESSION_COUNT_DROP = "session_drop"
    RADIUS_AUTH_FAILURE = "auth_failure"
    LATENCY_SPIKE = "latency_spike"
    MULTI_P0_INCIDENT = "multi_p0"

class RollbackAction(str, Enum):
    NFTABLES_ROLLBACK = "nftables_rollback"
    FIRMWARE_ROLLBACK = "firmware_rollback"
    FULL_ROLLBACK = "full_rollback"
    VLAN_CONFIG_ROLLBACK = "vlan_config_rollback"

class RollbackResult(str, Enum):
    APPROVED = "approved"  # Recovery successful
    FAILED_STILL_DEGRADED = "failed_still_degraded"
    MANUAL_ESCALATION = "manual_escalation"
# Table DB
rollback_decisions:
- id UUID PK
- reseller_id UUID FK
- incident_id UUID FK NULL (link to incident #58)
- trigger_type VARCHAR(50) CHECK(session_drop|auth_failure|latency_spike|multi_p0)
- trigger_value TEXT (ex: "50% drop from 150 to 70")
- trigger_detected_at TIMESTAMP
- action_executed VARCHAR(50) CHECK(nftables_rollback|firmware_rollback|full_rollback|vlan_config_rollback)
- action_started_at TIMESTAMP
- action_completed_at TIMESTAMP
- result VARCHAR(50) CHECK(approved|failed_still_degraded|manual_escalation)
- baseline_metric_before FLOAT (ex: 150 sessions)
- metric_after_5min FLOAT (ex: 145 sessions)
- metric_after_15min FLOAT (ex: 149 sessions)
- root_cause TEXT NULL (determined in RCA)
- notes TEXT
- approved_by UUID FK users NULL (automatic if successful recovery, else manual)
- escalated_at TIMESTAMP NULL
- created_at TIMESTAMP
rollback_version_history:
- id UUID PK
- reseller_id UUID FK
- config_type VARCHAR(50) (nftables|firmware|radius|kea|dns)
- version_hash VARCHAR(64) (SHA256 of config)
- is_stable BOOLEAN (marked stable after >24h no incidents)
- created_at TIMESTAMP
- created_from_rollback_id UUID FK NULL (which rollback reverted to this)
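The is_stable flag above ("marked stable after >24h no incidents") implies a periodic check. A sketch of that rule, assuming incident timestamps are available per reseller:

```python
from datetime import timedelta

def is_stable(version_created_at, incident_times, now, quiet_hours=24):
    """A config version earns is_stable=True once `quiet_hours` have elapsed
    since it was applied with no incident recorded in that window."""
    if now - version_created_at < timedelta(hours=quiet_hours):
        return False  # not enough soak time yet
    return not any(version_created_at <= t <= now for t in incident_times)
```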
rollback_metric_baseline:
- id UUID PK
- reseller_id UUID FK
- metric_name VARCHAR(100) (session_count|auth_failure_rate|latency_p95)
- baseline_value FLOAT
- baseline_30day_avg FLOAT
- anomaly_threshold_percent INT (50% for sessions, 20% for auth, 100% for latency)
- last_updated TIMESTAMP
Configuration
# .env.example
ROLLBACK_SESSION_DROP_THRESHOLD_PERCENT=50
ROLLBACK_SESSION_DROP_WINDOW_MINUTES=2
ROLLBACK_SESSION_BASELINE_DAYS=30
ROLLBACK_AUTH_FAILURE_THRESHOLD_PERCENT=20
ROLLBACK_AUTH_FAILURE_WINDOW_MINUTES=5
ROLLBACK_LATENCY_THRESHOLD_MS=5000
ROLLBACK_LATENCY_WINDOW_MINUTES=5
ROLLBACK_LATENCY_BASELINE_DAYS=30
ROLLBACK_MULTI_P0_THRESHOLD=3
ROLLBACK_MULTI_P0_WINDOW_MINUTES=5
ROLLBACK_STABILIZATION_WINDOW_MINUTES=15
ROLLBACK_MANUAL_ESCALATION_AFTER_MINUTES=15
ROLLBACK_NFTABLES_ENABLED=true
ROLLBACK_FIRMWARE_ENABLED=true
ROLLBACK_AUTO_APPROVED=true
ROLLBACK_NOTIFICATION_SMS=dg@rgz.bj,tier2@rgz.bj
ROLLBACK_NOTIFICATION_EMAIL=noc@rgz.bj,dg@rgz.bj
API Endpoints
| Method | Route | Request | Response | Notes |
|---|---|---|---|---|
| GET | /api/v1/rollback/triggers | — | {items:[{trigger_type, threshold, enabled}], total} | 200 OK |
| GET | /api/v1/rollback/decision | ?reseller_id={id}&status=approved | {items:[{id, trigger_type, action_executed, result, created_at}], total, page, pages} | 200 OK |
| POST | /api/v1/rollback/execute | {reseller_id, type, manual, notes} | — | 202 ACCEPTED |
| GET | /api/v1/rollback/execute/{id} | — | — | 200 OK |
| GET | /api/v1/rollback/history | ?reseller_id={id}&limit=10 | {items:[{trigger_type, action, result, duration_min, created_at}], total} | 200 OK |
| PUT | /api/v1/rollback/decision/{id}/manual-escalate | {notes} | — | 200 OK |
| GET | /api/v1/rollback/baseline | ?reseller_id={id} | — | 200 OK |
Celery Task (rgz-beat)
# Scheduled every minute from rgz-beat (celery beat_schedule, crontab(minute="*/1"))
@celery_app.task(queue="rgz.monitoring", name="rgz.monitoring.check_rollback_triggers")
def check_rollback_triggers():
    """
    Monitor the 4 triggers for every active reseller.
    Auto-execute a rollback as soon as one fires.
    """
    for reseller in Reseller.query.filter(Reseller.status == "active"):
        # Trigger 1: session drop > 50% vs the 30-day baseline
        baseline = get_session_baseline_30day(reseller.id)
        current = get_current_session_count(reseller.id)
        if baseline > 0 and (baseline - current) / baseline > 0.5:
            trigger_rollback(reseller.id, RollbackTrigger.SESSION_COUNT_DROP)
        # Trigger 2: RADIUS auth failure rate > 20%
        failure_rate = get_radius_auth_failure_rate(reseller.id, window_minutes=5)
        if failure_rate > 0.2:
            trigger_rollback(reseller.id, RollbackTrigger.RADIUS_AUTH_FAILURE)
        # Trigger 3: P95 latency > 5000 ms
        p95_latency = get_latency_p95(reseller.id, window_minutes=5)
        if p95_latency > 5000:
            trigger_rollback(reseller.id, RollbackTrigger.LATENCY_SPIKE)
        # Trigger 4: ≥3 simultaneous open P0 incidents
        p0_count = Incident.query.filter(
            Incident.reseller_id == reseller.id,
            Incident.priority == "P0",
            Incident.status.in_(["open", "investigating"]),
        ).count()
        if p0_count >= 3:
            trigger_rollback(reseller.id, RollbackTrigger.MULTI_P0_INCIDENT)
Useful Commands
# Check configured triggers
curl http://localhost:8000/api/v1/rollback/triggers
# Show a reseller's rollback history
curl "http://localhost:8000/api/v1/rollback/history?reseller_id=550e8400&limit=10" \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.items'
# Fetch baseline metrics (for trigger tuning)
curl "http://localhost:8000/api/v1/rollback/baseline?reseller_id=550e8400" \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.'
# Trigger a manual rollback (NOC emergency)
curl -X POST http://localhost:8000/api/v1/rollback/execute \
  -H "Authorization: Bearer NOC_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "reseller_id": "550e8400-e29b-41d4-a716-446655440000",
    "type": "full",
    "manual": true,
    "notes": "Customer reports total outage, need immediate rollback"
  }'
# Check the status of a rollback in progress
curl http://localhost:8000/api/v1/rollback/execute/5ba0e8c0-99dd-41d4-a716-446655dd0001 \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.result'
# Manual escalation if the rollback fails
curl -X PUT http://localhost:8000/api/v1/rollback/decision/5ba0/manual-escalate \
  -H "Authorization: Bearer NOC_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"notes": "Metrics still degraded 15min after rollback. Creating P0 incident."}'
Integration With Other Tools
- #58 incident-escalation: rollback triggered if a P0 is not resolved after 10 min
- #38 prometheus-alert: pushes metrics to the system (session_count, auth_failure_rate, latency_p95)
- #64 crisis-dispatcher: on the multi_p0 trigger, broadcasts the crisis SMS
- #52 dashboard-noc: real-time rollback queue, manual override buttons
- #72 post-incident-rca: RCA uses rollback_decisions.root_cause
- #63 email-notification: post-incident emails with the learnings
Implementation TODO
- [ ] Celery task detect 4 triggers every minute
- [ ] Auto-execute rollback (nftables + firmware)
- [ ] Wait 15min stabilization check
- [ ] SMS/email notifications DG + NOC
- [ ] Manual escalation if metrics not recovered
- [ ] Audit trail (rollback_decisions)
- [ ] Baseline metric tracking (30-day rolling)
- [ ] Tests: trigger session drop → verify rollback executed → check recovery
- [ ] Dashboard NOC rollback history + analytics
- [ ] RCA automation (root_cause field in decision)
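The test item above ("trigger session drop → verify rollback executed → check recovery") can be sketched as a pure-logic test that exercises the Trigger 1 condition against fake numbers, without Celery, the DB, or real metrics; all names here are illustrative:

```python
def should_rollback(baseline, current, threshold=0.5):
    """The Trigger 1 condition: current sessions below half the baseline."""
    return baseline > 0 and (baseline - current) / baseline > threshold

def test_session_drop_triggers_rollback():
    fired = []  # stand-in for the rollback_decisions audit trail
    baseline, current = 150, 70  # the reseller A example from Trigger 1
    if should_rollback(baseline, current):
        fired.append(("session_drop", current))
    assert fired == [("session_drop", 70)]

test_session_drop_triggers_rollback()
```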
Last updated: 2026-02-21