
#60 — Automatic Rollback Decision

PLANNED

Priority: 🟠 HIGH · Type: TYPE B · Container: rgz-api · Code: app/services/rollback.py · Dependencies: #58 incident-escalation


Description

The automatic rollback engine detects faulty configurations or firmware and reverts them without manual intervention. Based on 4 measurable triggers (session drop >50%, RADIUS error rate >20%, P95 latency >5s, 3+ simultaneous P0 alerts), it automatically rolls back to the previous stable step.

Two rollback types: nftables rollback (restores the last validated network config) and firmware rollback (returns the CPE to an earlier version). Every automatic rollback sends a notification to the NOC and the DG. The system records all triggers and decisions in the history for RCA and future prevention.


Internal Architecture

Rollback triggers and automatic actions:

TRIGGER 1: SESSION COUNT DROP > 50%
  Condition: active_session_count() < (baseline * 0.5) for >2 min
  Baseline: average sessions over the last 30 days (per reseller)
  Example: reseller A avg=150 sessions → drop to 70 → trigger!
  Probable cause: new network config blocks clients, or firmware bug
  Action:
    - Rollback nftables config to last stable version
    - Reload firewall rules
    - Monitor recovery (expected within 10 min, else escalate to P0)
  Notification: SMS DG "ROLLBACK TRIGGERED: {reseller} session drop {old}→{new}"
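
The trigger-1 arithmetic can be sketched as pure predicates. The helper names `session_baseline` and `session_drop_triggered` are illustrative assumptions, not the actual rollback.py API, and the >2 min persistence window is omitted:

```python
def session_baseline(daily_counts: list[float]) -> float:
    """30-day rolling average of active sessions (illustrative)."""
    window = daily_counts[-30:]
    return sum(window) / len(window) if window else 0.0

def session_drop_triggered(baseline: float, current: float,
                           threshold: float = 0.5) -> bool:
    """True when the current count fell below threshold * baseline.
    A zero baseline never triggers (new reseller, no history yet)."""
    if baseline <= 0:
        return False
    return current < baseline * threshold

# Spec example: reseller A, avg=150 sessions, drop to 70
# 70 < 150 * 0.5 → trigger fires; at 80 sessions it would not.
```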

TRIGGER 2: RADIUS AUTH FAILURE RATE > 20%
  Condition: (failed_auth / total_auth) > 0.2 sur 5 min window
  Métrique: Prometheus metric radius_auth_failures_total
  Example: 9 failures / 40 attempts = 22.5% → trigger! (exactly 20% does not)
  Probable cause: NAS secret mismatch, RADIUS server unreachable
  Action:
    - Rollback RADIUS client config on CPE
    - Flush RADIUS cache (Redis key rgz:session:*)
    - Test auth with test user
  Notification: SMS NOC "AUTH FAILURE SPIKE {reseller}: {percent}% failures"
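
The trigger-2 condition is a strict inequality, which the following sketch makes explicit (helper names are illustrative, not the production code):

```python
def auth_failure_rate(failed: int, total: int) -> float:
    """RADIUS auth failure ratio over the 5-min window; 0.0 when idle."""
    return failed / total if total else 0.0

def auth_failure_triggered(failed: int, total: int,
                           threshold: float = 0.2) -> bool:
    """Fires only when the rate strictly exceeds the threshold."""
    return auth_failure_rate(failed, total) > threshold

# 9 failures / 40 attempts = 22.5% > 20% → trigger
# 8 / 40 = exactly 20% does NOT trigger (condition is strictly >)
```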

TRIGGER 3: LATENCY P95 > 5000ms (5 seconds)
  Condition: p95_latency > 5000ms on 5 min window
  Métrique: Prometheus latency_p95_ms per revendeur
  Probable cause: QoS misconfig, broken traffic engineering, BGP issues
  Action:
    - Rollback nftables QoS config (#27 HTB)
    - Rollback DSCP marking (#28)
    - Revert to default routing (no policy-based)
  Notification: SMS Tier2 "LATENCY DEGRADATION {reseller}: {latency}ms"
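
P95 here comes from the Prometheus metric latency_p95_ms; for illustration only, a nearest-rank sketch of the same statistic (not the production metric pipeline, which computes quantiles server-side):

```python
import math

def latency_p95(samples_ms: list[float]) -> float:
    """P95 latency via the nearest-rank method over the window."""
    if not samples_ms:
        return 0.0
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def latency_triggered(samples_ms: list[float],
                      threshold_ms: float = 5000.0) -> bool:
    """Trigger-3 condition: P95 above 5000 ms on the 5-min window."""
    return latency_p95(samples_ms) > threshold_ms
```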

TRIGGER 4: MULTIPLE P0 ALERTS SIMULTANEOUS
  Condition: ≥3 P0 incidents open at same time
  Source: incident_escalation.py (#58) monitoring
  Probable cause: cascading failures, config change broke multiple things
  Action:
    - Full rollback: nftables + firmware + VLAN config
    - Restore from last known-good snapshot (max 30 min old)
    - Escalate P0 → crisis-dispatcher (#64) broadcast
  Notification: SMS DG + Tier3 "CRITICAL ROLLBACK REQUIRED {reseller}"

ROLLBACK FLOW:
  1. Trigger condition detected (Prometheus metric OR incident stream)
  2. System queries rollback history: last_stable_version timestamp
  3. Create rollback decision record: {trigger, reseller, action, timestamp}
  4. Execute rollback action (parallel):
     a. nftables: nft delete table inet rgz_table; restore /etc/nftables.d/{reseller}.bak
     b. Firmware: UBNT SSH → upgrade {cpe_model} --version {stable_version}
     c. Config: Kubernetes ConfigMap → apply previous RADIUS/Kea/DNS configs
  5. Wait stabilization (5 min window latency/session metrics)
  6. Check recovery:
     - If metrics back to baseline within 15 min: APPROVED (ticket closed)
     - If metrics still bad after 15 min: ESCALATE P0 (manual intervention needed)
  7. Notification: Email RCA post-mortem, root cause analysis
  8. Prevent future: Add regression test to CI/CD pipeline
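
Step 6 of the flow (recovery check) can be sketched as a small decision function; the ±10% tolerance and the function name are assumptions for illustration:

```python
def recovery_result(baseline: float, metric_after_15min: float,
                    tolerance: float = 0.1) -> str:
    """Compare the post-rollback metric to baseline after the 15-min
    stabilization window. Returns 'approved' when within ±tolerance
    of baseline, else 'manual_escalation' (real check is per-metric)."""
    if baseline <= 0:
        return "manual_escalation"
    if abs(metric_after_15min - baseline) / baseline <= tolerance:
        return "approved"
    return "manual_escalation"

# Spec example: baseline 150 sessions, back to 149 after 15 min → approved;
# still at 70 → manual escalation (P0, manual intervention needed).
```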

ROLLBACK HISTORY (audit trail):
  trigger_type: session_drop | auth_failure | latency_spike | multi_p0
  trigger_value: "50% drop" | "22% failures" | "7200ms" | "4 P0 incidents"
  action_taken: nftables_rollback | firmware_rollback | full_rollback
  result: approved | failed_still_degraded | manual_intervention_needed
  duration_minutes: Time from trigger to recovery or manual escalation
  root_cause: Determined post-incident (config mistake, versioning issue, etc)

Data Models

```python
from enum import Enum

class RollbackTrigger(str, Enum):
    SESSION_COUNT_DROP = "session_drop"
    RADIUS_AUTH_FAILURE = "auth_failure"
    LATENCY_SPIKE = "latency_spike"
    MULTI_P0_INCIDENT = "multi_p0"

class RollbackAction(str, Enum):
    NFTABLES_ROLLBACK = "nftables_rollback"
    FIRMWARE_ROLLBACK = "firmware_rollback"
    FULL_ROLLBACK = "full_rollback"
    VLAN_CONFIG_ROLLBACK = "vlan_config_rollback"

class RollbackResult(str, Enum):
    APPROVED = "approved"        # Recovery successful
    FAILED_STILL_DEGRADED = "failed_still_degraded"
    MANUAL_ESCALATION = "manual_escalation"

# DB tables (pseudo-schema)
rollback_decisions:
  - id UUID PK
  - reseller_id UUID FK
  - incident_id UUID FK NULL (link to incident #58)
  - trigger_type VARCHAR(50) CHECK(session_drop|auth_failure|latency_spike|multi_p0)
  - trigger_value TEXT (e.g. "50% drop from 150 to 70")
  - trigger_detected_at TIMESTAMP
  - action_executed VARCHAR(50) CHECK(nftables_rollback|firmware_rollback|full_rollback|vlan_config_rollback)
  - action_started_at TIMESTAMP
  - action_completed_at TIMESTAMP
  - result VARCHAR(50) CHECK(approved|failed_still_degraded|manual_escalation)
  - baseline_metric_before FLOAT (e.g. 150 sessions)
  - metric_after_5min FLOAT (e.g. 145 sessions)
  - metric_after_15min FLOAT (e.g. 149 sessions)
  - root_cause TEXT NULL (determined in RCA)
  - notes TEXT
  - approved_by UUID FK users NULL (automatic if successful recovery, else manual)
  - escalated_at TIMESTAMP NULL
  - created_at TIMESTAMP

rollback_version_history:
  - id UUID PK
  - reseller_id UUID FK
  - config_type VARCHAR(50) (nftables|firmware|radius|kea|dns)
  - version_hash VARCHAR(64) (SHA256 of config)
  - is_stable BOOLEAN (marked stable after >24h no incidents)
  - created_at TIMESTAMP
  - created_from_rollback_id UUID FK NULL (which rollback reverted to this)

rollback_metric_baseline:
  - id UUID PK
  - reseller_id UUID FK
  - metric_name VARCHAR(100) (session_count|auth_failure_rate|latency_p95)
  - baseline_value FLOAT
  - baseline_30day_avg FLOAT
  - anomaly_threshold_percent INT (50% for sessions, 20% for auth, 100% for latency)
  - last_updated TIMESTAMP
```
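
A minimal sketch materializing the `rollback_decisions` table in SQLite, with UUIDs stored as TEXT and the enums as CHECK constraints. The production schema is presumably PostgreSQL, so the types here are approximations:

```python
import sqlite3

DDL = """
CREATE TABLE rollback_decisions (
    id TEXT PRIMARY KEY,
    reseller_id TEXT NOT NULL,
    incident_id TEXT,
    trigger_type TEXT NOT NULL CHECK (trigger_type IN
        ('session_drop', 'auth_failure', 'latency_spike', 'multi_p0')),
    trigger_value TEXT,
    trigger_detected_at TIMESTAMP,
    action_executed TEXT CHECK (action_executed IN
        ('nftables_rollback', 'firmware_rollback',
         'full_rollback', 'vlan_config_rollback')),
    action_started_at TIMESTAMP,
    action_completed_at TIMESTAMP,
    result TEXT CHECK (result IN
        ('approved', 'failed_still_degraded', 'manual_escalation')),
    baseline_metric_before REAL,
    metric_after_5min REAL,
    metric_after_15min REAL,
    root_cause TEXT,
    notes TEXT,
    approved_by TEXT,
    escalated_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
# Record one illustrative automatic-rollback decision
conn.execute(
    "INSERT INTO rollback_decisions (id, reseller_id, trigger_type, result)"
    " VALUES (?, ?, ?, ?)",
    ("5ba0e8c0", "550e8400", "session_drop", "approved"),
)
```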

Configuration

```env
# .env.example
ROLLBACK_SESSION_DROP_THRESHOLD_PERCENT=50
ROLLBACK_SESSION_DROP_WINDOW_MINUTES=2
ROLLBACK_SESSION_BASELINE_DAYS=30

ROLLBACK_AUTH_FAILURE_THRESHOLD_PERCENT=20
ROLLBACK_AUTH_FAILURE_WINDOW_MINUTES=5

ROLLBACK_LATENCY_THRESHOLD_MS=5000
ROLLBACK_LATENCY_WINDOW_MINUTES=5
ROLLBACK_LATENCY_BASELINE_DAYS=30

ROLLBACK_MULTI_P0_THRESHOLD=3
ROLLBACK_MULTI_P0_WINDOW_MINUTES=5

ROLLBACK_STABILIZATION_WINDOW_MINUTES=15
ROLLBACK_MANUAL_ESCALATION_AFTER_MINUTES=15

ROLLBACK_NFTABLES_ENABLED=true
ROLLBACK_FIRMWARE_ENABLED=true
ROLLBACK_AUTO_APPROVED=true

ROLLBACK_NOTIFICATION_SMS=dg@rgz.bj,tier2@rgz.bj
ROLLBACK_NOTIFICATION_EMAIL=noc@rgz.bj,dg@rgz.bj
```
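
The thresholds above might be loaded with a small helper; this is a sketch assuming plain environment access with the documented defaults (the service may well use a settings library instead, and the key names come straight from .env.example):

```python
import os

def load_rollback_config(env=None) -> dict:
    """Read rollback thresholds from the environment, falling back
    to the defaults documented in .env.example (illustrative)."""
    env = os.environ if env is None else env
    return {
        "session_drop_threshold": int(env.get(
            "ROLLBACK_SESSION_DROP_THRESHOLD_PERCENT", "50")),
        "auth_failure_threshold": int(env.get(
            "ROLLBACK_AUTH_FAILURE_THRESHOLD_PERCENT", "20")),
        "latency_threshold_ms": int(env.get(
            "ROLLBACK_LATENCY_THRESHOLD_MS", "5000")),
        "multi_p0_threshold": int(env.get(
            "ROLLBACK_MULTI_P0_THRESHOLD", "3")),
        "stabilization_minutes": int(env.get(
            "ROLLBACK_STABILIZATION_WINDOW_MINUTES", "15")),
    }
```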

API Endpoints

| Method | Route | Request | Response | Notes |
|---|---|---|---|---|
| GET | /api/v1/rollback/triggers | — | {items:[{trigger_type, threshold, enabled}], total} | 200 OK |
| GET | /api/v1/rollback/decision?reseller_id={id}&status=approved | — | {items:[{id, trigger_type, action_executed, result, created_at}], total, page, pages} | 200 OK |
| POST | /api/v1/rollback/execute | — | — | 202 ACCEPTED |
| GET | /api/v1/rollback/execute/ | — | — | 200 OK |
| GET | /api/v1/rollback/history?reseller_id={id}&limit=10 | — | {items:[{trigger_type, action, result, duration_min, created_at}], total} | 200 OK |
| PUT | /api/v1/rollback/decision/{id}/manual-escalate | — | — | 200 OK |
| GET | /api/v1/rollback/baseline?reseller_id= | — | — | 200 OK |

Celery Task (rgz-beat)

```python
@celery_app.task(queue="rgz.monitoring", name="rgz.monitoring.check_rollback_triggers")
def check_rollback_triggers():
    """
    Monitor the 4 triggers for every active reseller and auto-execute
    a rollback when one fires. Scheduled every minute by rgz-beat via
    a beat_schedule entry with crontab(minute="*/1") — the old
    @periodic_task decorator was removed in modern Celery.
    """
    for reseller in Reseller.query.filter(Reseller.status == "active"):
        # Trigger 1: session drop >50% below the 30-day baseline
        baseline = get_session_baseline_30day(reseller.id)
        current = get_current_session_count(reseller.id)
        if baseline > 0 and (baseline - current) / baseline > 0.5:
            trigger_rollback(reseller.id, RollbackTrigger.SESSION_COUNT_DROP)

        # Trigger 2: RADIUS auth failure rate >20% over 5 min
        failure_rate = get_radius_auth_failure_rate(reseller.id, window_minutes=5)
        if failure_rate > 0.2:
            trigger_rollback(reseller.id, RollbackTrigger.RADIUS_AUTH_FAILURE)

        # Trigger 3: latency spike, P95 > 5000 ms over 5 min
        p95_latency = get_latency_p95(reseller.id, window_minutes=5)
        if p95_latency > 5000:
            trigger_rollback(reseller.id, RollbackTrigger.LATENCY_SPIKE)

        # Trigger 4: >=3 simultaneous open P0 incidents
        p0_count = Incident.query.filter(
            Incident.reseller_id == reseller.id,
            Incident.priority == "P0",
            Incident.status.in_(["open", "investigating"])
        ).count()
        if p0_count >= 3:
            trigger_rollback(reseller.id, RollbackTrigger.MULTI_P0_INCIDENT)
```
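
Since the beat task runs every minute, a metric that stays degraded would re-fire on every tick. A cooldown guard inside `trigger_rollback` keeps one rollback per stabilization window; the sketch below is in-memory and purely illustrative (production would persist the state in Redis or in rollback_decisions):

```python
from datetime import datetime, timedelta, timezone

_last_rollback: dict[str, datetime] = {}  # reseller_id -> last trigger time

def should_execute(reseller_id: str, now: datetime,
                   cooldown_min: int = 15) -> bool:
    """Suppress a new rollback while a previous one for the same
    reseller is still inside its stabilization window."""
    last = _last_rollback.get(reseller_id)
    if last is not None and now - last < timedelta(minutes=cooldown_min):
        return False
    _last_rollback[reseller_id] = now
    return True
```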

Useful Commands

```bash
# Check configured triggers
curl http://localhost:8000/api/v1/rollback/triggers

# Show a reseller's rollback history (quote URLs containing & for the shell)
curl "http://localhost:8000/api/v1/rollback/history?reseller_id=550e8400&limit=10" \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.items'

# Fetch baseline metrics (for trigger tuning)
curl "http://localhost:8000/api/v1/rollback/baseline?reseller_id=550e8400" \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.'

# Trigger a manual rollback (NOC emergency)
curl -X POST http://localhost:8000/api/v1/rollback/execute \
  -H "Authorization: Bearer NOC_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "reseller_id": "550e8400-e29b-41d4-a716-446655440000",
    "type": "full",
    "manual": true,
    "notes": "Customer reports total outage, need immediate rollback"
  }'

# Check the status of an in-progress rollback
curl http://localhost:8000/api/v1/rollback/execute/5ba0e8c0-99dd-41d4-a716-446655dd0001 \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.result'

# Manual escalation if the rollback fails
curl -X PUT http://localhost:8000/api/v1/rollback/decision/5ba0/manual-escalate \
  -H "Authorization: Bearer NOC_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"notes": "Metrics still degraded 15min after rollback. Creating P0 incident."}'
```

Integration With Other Tools

  • #58 incident-escalation: Rollback triggered if a P0 is not resolved after 10 min
  • #38 prometheus-alert: Pushes metrics to the system (session_count, auth_failure_rate, latency_p95)
  • #64 crisis-dispatcher: On a multi_p0 trigger, broadcasts the crisis SMS
  • #52 dashboard-noc: Real-time rollback queue, manual override buttons
  • #72 post-incident-rca: RCA uses rollback_decisions.root_cause
  • #63 email-notification: Post-incident emails with learnings

Implementation TODO

  • [ ] Celery task detecting the 4 triggers every minute
  • [ ] Auto-execute rollback (nftables + firmware)
  • [ ] 15-min stabilization check after rollback
  • [ ] SMS/email notifications to DG + NOC
  • [ ] Manual escalation if metrics have not recovered
  • [ ] Audit trail (rollback_decisions)
  • [ ] Baseline metric tracking (30-day rolling)
  • [ ] Tests: trigger session drop → verify rollback executed → check recovery
  • [ ] Dashboard NOC rollback history + analytics
  • [ ] RCA automation (root_cause field in decision)

Last updated: 2026-02-21

PROJET MOSAÏQUE — 81 tools, 22 containers, 500+ WiFi Zone resellers