#58 — Incident Escalation (P0/P1/P2)
PLANNED
Priority: 🟠 HIGH · Type: TYPE B · Container: rgz-api · Code: app/services/incident.py
Dependencies: #1 rgz-api, #61 sms-template-engine
Description
The incident escalation system automatically routes critical alerts (P0, P1, P2) to the appropriate support tiers under strict SLAs. Incidents are detected via #38 Prometheus (CPU/RAM/RSSI/latency thresholds) and created automatically, or manually via the API.
Each priority level follows a strict escalation schedule: P0 (total outage) escalates at 5→15→30 min up to the DG, P1 (degradation) at 15 min→1 h, P2 (isolated) at 4 h. The system sends #61 notification SMS and resets the timer when an incident is acknowledged. Incidents not resolved within their SLA automatically trigger a further escalation (Tier 2 → Tier 3 external engineer).
Internal Architecture
SLA-based automatic escalation:
LEVEL P0 - CRITICAL (TOTAL OUTAGE)
Trigger: >30 sites with no RADIUS session for >5 min
Examples: central BGP flap, fiber cut, DDoS
Escalation:
T+5min: SMS Tier 1 (support engineer)
T+15min: SMS Tier 2 (technical lead)
T+30min: SMS DG Technique + President
Notification: SMS + WhatsApp (#62) + critical email
Broadcast: all resellers via #64 crisis-dispatcher
Resolution: automatic rollback if detected stable for 5 min
Baseline: incident auto-CLOSED if uptime >99.9% over a 5-min window
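The auto-close baseline can be expressed as a simple window check. A minimal sketch: the 99.9% threshold and 5-minute window come from this spec, but the `(up_seconds, total_seconds)` sample shape and the function names are illustrative assumptions, not the actual implementation.

```python
# Sketch: auto-close eligibility once uptime over the last 5-minute
# window exceeds 99.9%. `samples` is a list of (up_seconds, total_seconds)
# probes collected during the window -- a hypothetical input shape.

def window_uptime_percent(samples):
    """Aggregate uptime percentage over a monitoring window."""
    up = sum(s[0] for s in samples)
    total = sum(s[1] for s in samples)
    return 100.0 * up / total if total else 0.0

def should_auto_close(samples, threshold=99.9):
    """True when the stability baseline for auto-close is met."""
    return window_uptime_percent(samples) > threshold

# Five 60-second probes, one with 1 s of downtime: 299/300 ≈ 99.67% -> stays open
print(should_auto_close([(60, 60)] * 4 + [(59, 60)]))  # False
# Fully-up window -> eligible for auto-close
print(should_auto_close([(60, 60)] * 5))               # True
```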
LEVEL P1 - HIGH (SERVICE DEGRADATION)
Trigger: service loss on 5-30 sites OR P95 latency > 2s OR packet loss >5%
Escalation:
T+15min: SMS Tier 1
T+1h: SMS Tier 2
Notification: email + SMS to affected resellers
Display: public status page (#55) "Degraded service"
Resolution: auto-close if the metric returns to normal for 15 min
LEVEL P2 - MEDIUM (ISOLATED INCIDENT)
Trigger: a single site affected OR a user report via ticket
Escalation:
T+4h: SMS Tier 1 only
Notification: internal support email
Display: notes in the NOC dashboard (#52)
Resolution: no auto-close, requires NOC acknowledgment
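The three trigger definitions amount to a classification rule. A minimal sketch, using the threshold values from the configuration section below; the function and parameter names are illustrative, not the actual detector.

```python
# Sketch: classify a detected anomaly into P0/P1/P2 from the trigger
# thresholds in this spec (>30 sites, 5-30 sites / P95 > 2000 ms /
# loss > 5%, single site or user report).

def classify_priority(sites_down, latency_p95_ms=0, loss_percent=0.0,
                      user_report=False):
    """Return "P0", "P1", "P2", or None when no trigger matches."""
    if sites_down > 30:                                              # total outage
        return "P0"
    if sites_down >= 5 or latency_p95_ms > 2000 or loss_percent > 5:
        return "P1"                                                  # degradation
    if sites_down == 1 or user_report:                               # isolated
        return "P2"
    return None

print(classify_priority(45))                      # P0
print(classify_priority(0, latency_p95_ms=2500))  # P1
print(classify_priority(1))                       # P2
```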
ACKNOWLEDGMENT FLOW:
The tier receives an SMS: "INC-2026-000847: Outage on 2 sites. Team mobilized."
They reply "ACK" or include the ticket number in the message
The system resets the timer (escalation suspended while the acknowledgment is active)
The tier can change the status: OPEN → ACKNOWLEDGED → INVESTIGATING → RESOLVED
RESOLVED ≠ CLOSED (5 min of stability is required before auto-close)
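The status flow implies a fixed transition graph that the service should enforce. A sketch: the statuses come from this spec, but the exact set of allowed edges (e.g. which states may jump to ESCALATED or ROLLBACK_APPLIED) is an assumption.

```python
# Sketch: guard incident status transitions. The edges follow the flow
# described above plus the escalated/rollback_applied statuses; the real
# service rules may differ.

ALLOWED_TRANSITIONS = {
    "open": {"acknowledged", "escalated", "rollback_applied"},
    "acknowledged": {"investigating", "escalated"},
    "investigating": {"resolved", "escalated", "rollback_applied"},
    "escalated": {"acknowledged", "investigating", "rollback_applied"},
    "rollback_applied": {"resolved", "escalated"},
    "resolved": {"closed"},  # only after the 5-min stability window
    "closed": set(),         # terminal
}

def can_transition(current, new):
    """True if moving from `current` to `new` is permitted."""
    return new in ALLOWED_TRANSITIONS.get(current, set())

print(can_transition("open", "acknowledged"))  # True
print(can_transition("resolved", "open"))      # False
```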
AUTO-ROLLBACK TRIGGERS (via #60):
If a P0 incident has been open > 10 min AND a rollback is available:
- Roll back VLAN config (nftables)
- Roll back CPE firmware (return to the stable version)
- Kill suspicious sessions (RADIUS reset)
- Notify the DG + team
- Incident status: ROLLBACK_APPLIED
- If stable > 5 min after rollback: CLOSED
- If still P0 after rollback: critical escalation to DG + external
Data Models
from enum import Enum

class IncidentPriority(str, Enum):
    P0 = "P0"  # Total outage
    P1 = "P1"  # Degradation
    P2 = "P2"  # Isolated

class IncidentStatus(str, Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    ESCALATED = "escalated"
    ROLLBACK_APPLIED = "rollback_applied"
    RESOLVED = "resolved"
    CLOSED = "closed"
# DB tables
incidents:
- id UUID PK
- incident_number VARCHAR(16) UNIQUE (auto: INC-2026-000001)
- title TEXT
- priority CHECK(P0|P1|P2)
- status CHECK(open|acknowledged|investigating|escalated|rollback_applied|resolved|closed)
- description TEXT
- trigger_source VARCHAR(50) (prometheus|manual|webhook|user_report)
- affected_resellers UUID[] (array of reseller IDs)
- affected_site_count INT
- detected_at TIMESTAMP
- acknowledged_at TIMESTAMP NULL
- acknowledged_by UUID FK users NULL
- escalation_level INT (0=Tier1, 1=Tier2, 2=Tier3_external)
- last_escalation_at TIMESTAMP NULL
- next_escalation_at TIMESTAMP NULL
- rollback_triggered BOOLEAN DEFAULT false
- resolved_at TIMESTAMP NULL
- closed_at TIMESTAMP NULL
- root_cause_analysis TEXT NULL
- created_at TIMESTAMP
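The auto-generated incident_number format (INC-2026-000001) could be produced as follows. A sketch: in the real service the sequence value would come from a database sequence, and the helper name is illustrative.

```python
# Sketch: build the INC-<year>-<6-digit sequence> incident number used in
# this spec. `seq` would come from a DB sequence in practice; the result
# fits the VARCHAR(16) column.

def make_incident_number(year, seq):
    return f"INC-{year}-{seq:06d}"

print(make_incident_number(2026, 1))    # INC-2026-000001
print(make_incident_number(2026, 847))  # INC-2026-000847
```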
incident_escalation_history:
- id UUID PK
- incident_id UUID FK
- escalation_level INT
- recipient TEXT (email/phone)
- message_sent TEXT
- sent_at TIMESTAMP
- ack_received BOOLEAN DEFAULT false
- ack_at TIMESTAMP NULL
- created_by VARCHAR(50) (system|manual)
incident_metrics:
- id UUID PK
- incident_id UUID FK
- metric_name VARCHAR(100) (session_count, latency_p95, packet_loss, etc)
- trigger_value FLOAT
- current_value FLOAT
- baseline_value FLOAT
- updated_at TIMESTAMP
Configuration
# .env.example
INCIDENT_P0_ESCALATE_1_MINUTES=5
INCIDENT_P0_ESCALATE_2_MINUTES=15
INCIDENT_P0_ESCALATE_3_MINUTES=30
INCIDENT_P0_ESCALATE_AUTO_ROLLBACK=true
INCIDENT_P0_ROLLBACK_TIMEOUT_MINUTES=10
INCIDENT_P1_ESCALATE_1_MINUTES=15
INCIDENT_P1_ESCALATE_2_MINUTES=60
INCIDENT_P2_ESCALATE_1_MINUTES=240 # 4h
INCIDENT_P0_THRESHOLD_SITES=30
INCIDENT_P0_THRESHOLD_DURATION_MINUTES=5
INCIDENT_P1_THRESHOLD_SITES=5
INCIDENT_P1_THRESHOLD_LATENCY_MS=2000
INCIDENT_P1_THRESHOLD_LOSS_PERCENT=5
INCIDENT_AUTO_CLOSE_STABILITY_MINUTES=5
INCIDENT_SMS_RECIPIENTS_TIER1=tier1@rgz.bj
INCIDENT_SMS_RECIPIENTS_TIER2=tier2@rgz.bj
INCIDENT_SMS_RECIPIENTS_DG=dg@rgz.bj
# Celery task
INCIDENT_ESCALATION_TASK=rgz.monitoring.escalate_incidents
INCIDENT_ESCALATION_INTERVAL_SECONDS=60
API Endpoints
| Method | Route | Request | Response | Notes |
|---|---|---|---|---|
| POST | /api/v1/incidents | {title, priority, description, affected_resellers, trigger_source} | created incident | 201 CREATED |
| GET | /api/v1/incidents | ?priority=P0&status=open | {items:[{id, incident_number, priority, status, created_at}], total, page, pages} | 200 OK |
| GET | /api/v1/incidents/ | — | incident detail | 200 OK |
| PUT | /api/v1/incidents/{id}/acknowledge | {notes} | — | 200 OK |
| PUT | /api/v1/incidents/{id}/investigate | — | — | 200 OK |
| PUT | /api/v1/incidents/{id}/resolve | {root_cause, resolution, prevention_actions} | — | 200 OK |
| PUT | /api/v1/incidents/{id}/close | — | — | 200 OK |
| POST | /api/v1/incidents/{id}/escalate-manual | {escalation_level} | — | 201 CREATED |
| GET | /api/v1/incidents/active | — | {items:[{id, incident_number, priority, next_escalation_at}], total} | 200 OK |
Celery Task (rgz-beat)
# Scheduled every minute via the celery beat schedule
# (INCIDENT_ESCALATION_INTERVAL_SECONDS=60).
@celery_app.task(queue="rgz.monitoring", name="rgz.monitoring.escalate_incidents")
def escalate_incidents():
    """
    Checks unacknowledged incidents against their SLA.
    Triggers an SMS escalation when the deadline has passed.
    """
    now = datetime.utcnow()
    active = Incident.query.filter(
        Incident.status.in_([IncidentStatus.OPEN, IncidentStatus.INVESTIGATING])
    )
    for incident in active:
        if incident.acknowledged_at:
            continue  # escalation timer suspended while acknowledged
        if incident.next_escalation_at and now >= incident.next_escalation_at:
            escalate_to_tier_x(incident)  # sends the SMS for the next tier (#61)
            incident.escalation_level += 1
            # helper (not shown): next deadline from the per-priority
            # INCIDENT_Px_ESCALATE_n_MINUTES intervals in the config above
            incident.next_escalation_at = compute_next_escalation_at(incident)
    db.session.commit()
SMS Templates (via #61)
Template: INC_P0_INITIAL
"RGZ P0 ALERT: {title}. {affected_sites} sites down. Team mobilized. Update in {delay} min. Ticket number: {incident_number}"
Template: INC_P0_ESCALATE_TIER2
"ESCALATION REQUIRED INC-{number}: {title}. Tier 1 not responding. URGENT Tier 2 intervention. Call: +229-....."
Template: INC_P1_ALERT
"RGZ P1 alert: service degradation {title}. P95 latency: {latency}ms. Loss: {loss}%. Affected resellers: {count}. Status: https://status.rgz.bj"
Template: INC_P2_REPORT
"Isolated incident INC-{number}: {title}. 1 site affected. Tier 1 support engaged."
Template: INCIDENT_RESOLVED
"Incident {incident_number} RESOLVED. Total duration: {duration} min. RCA in progress. Thank you for your patience."
Useful Commands
# Create a P0 incident (Prometheus webhook, automatic)
curl -X POST http://localhost:8000/api/v1/incidents \
-H "Content-Type: application/json" \
-H "X-Prometheus-Secret: SHARED_SECRET" \
-d '{
"title": "Fiber cut core link Cotonou",
"priority": "P0",
"description": "BGP flap detected, 45 sites offline",
"affected_resellers": ["550e8400-...", "550e8400-..."],
"trigger_source": "prometheus"
}'
# Create a manual incident (reseller report)
curl -X POST http://localhost:8000/api/v1/incidents \
-H "Authorization: Bearer REVENDEUR_TOKEN" \
-d '{
"title": "WiFi down on 3 CPE",
"priority": "P1",
"description": "Clients complaining no connectivity",
"affected_resellers": ["550e8400-..."],
"trigger_source": "user_report"
}'
# Check active incidents
curl http://localhost:8000/api/v1/incidents/active | jq '.items[] | select(.status=="open")'
# Acknowledge an incident (NOC)
curl -X PUT http://localhost:8000/api/v1/incidents/INC-2026-000847/acknowledge \
-H "Authorization: Bearer NOC_TOKEN" \
-d '{"notes": "Investigating now, will update in 5min"}'
# Manual escalation (Tier 1 → Tier 2)
curl -X POST http://localhost:8000/api/v1/incidents/INC-2026-000847/escalate-manual \
-H "Authorization: Bearer NOC_TOKEN" \
-d '{"escalation_level": 2}'
# Resolve an incident
curl -X PUT http://localhost:8000/api/v1/incidents/INC-2026-000847/resolve \
-H "Authorization: Bearer NOC_TOKEN" \
-d '{
"root_cause": "BGP route hijack from ISP misconfiguration",
"resolution": "Contacted ISP, route restored",
"prevention_actions": ["Add BGP monitoring alert", "Setup secondary path"]
}'
Integration With Other Tools
- #38 prometheus-alert: creates incidents automatically via webhook POST /api/v1/incidents
- #60 rollback-decision: called by the Celery task if a P0 is unresolved after 10 min
- #61 sms-template-engine: INC_P0_*, INC_P1_*, INC_P2_* templates
- #64 crisis-dispatcher: if a P0 is confirmed, broadcasts SMS to resellers via the crisis template
- #55 page-statut-public: displays current P0 incidents + history (last 24h)
- #52 dashboard-noc: real-time incident feed, escalation timers, ack buttons
- #63 email-notification: emails post-mortem RCA reports to recipients
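The #38 integration implies a mapping from the Prometheus webhook payload to the POST /api/v1/incidents body. A minimal sketch: the Alertmanager-style payload shape and its label/annotation names are assumptions here, not the actual #38 contract.

```python
# Sketch: map an Alertmanager-style webhook payload to the incident
# creation body used in this spec. The payload shape is an assumption;
# the real #38 contract may differ.

def webhook_to_incident(payload):
    alert = payload["alerts"][0]
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return {
        "title": annotations.get("summary", labels.get("alertname", "unknown")),
        "priority": labels.get("severity", "P2").upper(),
        "description": annotations.get("description", ""),
        "trigger_source": "prometheus",
    }

body = webhook_to_incident({
    "alerts": [{
        "labels": {"alertname": "SitesDown", "severity": "p0"},
        "annotations": {"summary": "Fiber cut core link Cotonou",
                        "description": "BGP flap detected, 45 sites offline"},
    }]
})
print(body["priority"])  # P0
```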
Implementation TODO
- [ ] Incident CRUD (P0/P1/P2)
- [ ] Auto-escalation Celery task every minute
- [ ] Escalation SMS via #61 templates
- [ ] Prometheus webhook → POST /incidents
- [ ] Status page #55 integration
- [ ] Rollback trigger #60
- [ ] Audit trail (incident_escalation_history)
- [ ] Tests: create P0 → escalate T1 → escalate T2 → resolve
- [ ] NOC dashboard incidents queue
- [ ] RCA post-mortem email
Last updated: 2026-02-21