#64 — Crisis Dispatcher
PLANNED
Priority: 🟠 HIGH · Type: TYPE B · Container: rgz-api · Code: app/services/crisis.py
Dependencies: #58 incident-escalation, #61 sms-template-engine
Description
The Crisis Dispatcher orchestrates multi-channel communications during a system crisis: total outage, major degradation, scheduled maintenance, and service restoration. It exposes 4 crisis templates with instant broadcast over SMS (#61), WhatsApp (#62), and email (#63). On-call escalation is handled automatically: if a crisis is not acknowledged within a set delay, an SMS is sent to the manager (technical DG).
Crisis messages are exempt from rate limiting: they go straight into the Celery queues with priority.
Crisis Templates (4 types)
TYPE 1: COUPURE_TOTALE (Total Outage)
Trigger: P0 incident, >30 sites, no workaround
Broadcast: SMS + WhatsApp + Email
Recipients: All active resellers + NOC + DG + Tier 2
On-call auto-escalation delay: 5 min without NOC acknowledgement
SMS message:
"🔴 TOTAL RGZ OUTAGE. {{ affected_sites }} sites without service.
Emergency team mobilized. Update in {{ delay }} min.
Status: https://status.rgz.bj"
Email message:
Subject: "CRITICAL ALERT: Total RGZ Outage"
Content: Full description + support links + restoration ETA
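The `{{ name }}` placeholders in the SMS template above look like Jinja-style variables. As a rough illustration of how they might be filled in, here is a minimal stdlib-only sketch. `render_template` is a hypothetical helper (the real rendering presumably lives in #61 sms-template-engine), the template text is an abridged English rendering, and the sample values are invented.

```python
import re

# Matches Jinja-style placeholders such as {{ affected_sites }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Abridged English rendering of the TYPE 1 SMS; illustration only.
COUPURE_TOTALE_SMS = (
    "TOTAL RGZ OUTAGE. {{ affected_sites }} sites without service. "
    "Update in {{ delay }} min. Status: https://status.rgz.bj"
)


def render_template(template: str, context: dict) -> str:
    """Substitute every {{ name }} placeholder; fail loudly if one is missing."""
    missing = sorted({n for n in PLACEHOLDER.findall(template) if n not in context})
    if missing:
        raise KeyError(f"missing template variables: {missing}")
    return PLACEHOLDER.sub(lambda m: str(context[m.group(1)]), template)


# Hypothetical values for a 45-site outage with a 15-minute update delay.
sms = render_template(COUPURE_TOTALE_SMS, {"affected_sites": 45, "delay": 15})
print(sms)
```

Failing on a missing variable, rather than sending a message with a raw `{{ delay }}` in it, seems the safer default for a broadcast that reaches every reseller.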
TYPE 2: DEGRADATION_MAJEURE (Service Degradation)
Trigger: P1 incident, >20 sites, high latency/packet loss
Broadcast: SMS + Email
Recipients: Affected resellers + internal support
SMS message:
"⚠️ RGZ service degraded. {{ affected_sites }} sites slowed down.
Cause: {{ issue_short }}.
Support in progress. Updates via {{ status_link }}"
Email message:
Subject: "RGZ Alert: Service Degraded"
Content: Context + estimated impact + NOC actions
TYPE 3: MAINTENANCE_PLANIFIÉE (Scheduled Maintenance)
Trigger: Scheduled maintenance (deployment, update, infrastructure)
Broadcast: Email (48h before) + SMS (24h before)
Recipients: All resellers
Maintenance window: {{ maintenance_window }} (e.g. 02:00-04:00 UTC)
Expected impact: Total outage OR degradation + recovery time
Email message (48h):
Subject: "Scheduled RGZ Maintenance: {{ maintenance_date }}"
Content: Exact date/time, estimated duration, impact, support contact
SMS message (24h):
"Reminder: RGZ maintenance {{ maintenance_date }} {{ time_window }}.
Estimated duration: {{ duration }} min.
Support: support@rgz.bj"
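The 48h/24h lead times above translate into two send times derived from the start of the maintenance window. A minimal sketch, assuming the window start is a timezone-aware UTC datetime; `maintenance_notifications` is a hypothetical helper and the date is invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Lead times from the TYPE 3 "Broadcast" line: email 48h before, SMS 24h before.
EMAIL_LEAD = timedelta(hours=48)
SMS_LEAD = timedelta(hours=24)


def maintenance_notifications(window_start: datetime) -> dict:
    """Return the send time of each channel for a maintenance window."""
    return {
        "email": window_start - EMAIL_LEAD,
        "sms": window_start - SMS_LEAD,
    }


# Hypothetical window starting 2026-03-01 at 02:00 UTC.
start = datetime(2026, 3, 1, 2, 0, tzinfo=timezone.utc)
schedule = maintenance_notifications(start)
print(schedule["email"].isoformat(), schedule["sms"].isoformat())
```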
TYPE 4: RETABLISSEMENT (Service Restored)
Trigger: P0/P1 incident status changed to RESOLVED + 10 min of validated stability
Broadcast: SMS + WhatsApp + Email
Recipients: All affected resellers
SMS message:
"✓ RGZ service restored. All sites operational.
Incident #{{ incident_number }} closed.
Total duration: {{ duration_minutes }} min.
RCA: https://status.rgz.bj/incident/{{ incident_number }}"
Email message:
Subject: "RGZ Service Restored: Incident Closed"
Content: Timeline, root cause summary, prevention actions, apologies
Internal Architecture
CRISIS FLOW:
1. Trigger detected (#58 incident-escalation):
   - P0 incident created, >30 sites, >5 min
   → POST /api/v1/crisis/dispatch {type: COUPURE_TOTALE, ...}
2. Crisis Dispatcher:
   - Checks the type against thresholds (affected_sites, duration)
   - Builds the message template for each channel
   - Enqueues SMS/WhatsApp/Email tasks (priority HIGH)
   - Logs the crisis event (crisis_events table)
   - Notifies the NOC/DG dashboard (WebSocket push)
3. Multi-channel broadcast (simultaneous):
   - SMS: Via #61 (rate limit = disabled)
   - WhatsApp: Via #62 (rate limit = disabled)
   - Email: Via #63 (rate limit = disabled)
4. On-call escalation:
   - If type = COUPURE_TOTALE and no ACK from NOC within 5 min
   → SMS to the on-call manager (Tier 3 escalation)
   - Repeats every 5 min until acknowledged
5. Restoration:
   - Incident status → RESOLVED
   - Wait 10 min to validate stability
   - Auto-close incident
   - Triggers type = RETABLISSEMENT
   - Broadcasts the all-clear messages
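Step 2 of the flow above (checking the type against thresholds, then choosing channels) can be sketched as a pair of pure functions. This is a minimal sketch under assumptions: the threshold values mirror the CRISIS_* variables in .env.example, the channel lists mirror the "Broadcast" line of each template, and `CrisisRequest`, `meets_threshold`, and `plan_broadcast` are hypothetical names rather than the actual app/services/crisis.py API.

```python
from dataclasses import dataclass

# Assumed to mirror CRISIS_COUPURE_TOTALE_THRESHOLD_* and
# CRISIS_DEGRADATION_THRESHOLD_SITES from .env.example.
THRESHOLDS = {
    "coupure_totale": {"affected_sites": 30, "duration_minutes": 5},
    "degradation_majeure": {"affected_sites": 20},
}

# Channels per crisis type, copied from the "Broadcast" line of each template.
CHANNELS = {
    "coupure_totale": ["sms", "whatsapp", "email"],
    "degradation_majeure": ["sms", "email"],
    "maintenance_planifiee": ["email", "sms"],
    "retablissement": ["sms", "whatsapp", "email"],
}


@dataclass
class CrisisRequest:
    type: str
    affected_sites: int = 0
    duration_minutes: int = 0


def meets_threshold(req: CrisisRequest) -> bool:
    """True when every configured limit for this type is strictly exceeded."""
    limits = THRESHOLDS.get(req.type, {})  # no numeric limits assumed otherwise
    return all(getattr(req, name) > limit for name, limit in limits.items())


def plan_broadcast(req: CrisisRequest) -> list:
    """Describe the tasks to enqueue: one HIGH-priority task per channel."""
    if not meets_threshold(req):
        return []
    return [
        {"channel": channel, "priority": "HIGH", "rate_limited": False}
        for channel in CHANNELS[req.type]
    ]


plan = plan_broadcast(CrisisRequest("coupure_totale", affected_sites=45, duration_minutes=6))
print([task["channel"] for task in plan])
```

Keeping the decision separate from the Celery enqueue call makes the threshold logic testable without a broker.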
ACTORS & RESPONSIBILITIES:
- Trigger system: #58 (incident creation)
- Dispatcher: #64 (this tool)
- Channels: #61 SMS, #62 WhatsApp, #63 Email
- Notification: NOC/DG via dashboard + SMS
- Root cause: #72 post-incident RCA
Data Models
python
from enum import Enum


class CrisisType(str, Enum):
    COUPURE_TOTALE = "coupure_totale"
    DEGRADATION_MAJEURE = "degradation_majeure"
    MAINTENANCE_PLANIFIÉE = "maintenance_planifiee"
    RETABLISSEMENT = "retablissement"


class CrisisStatus(str, Enum):
    ACTIVE = "active"
    ESCALATED = "escalated"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
# DB tables
crisis_events:
- id UUID PK
- incident_id UUID FK NULL (link to incident #58)
- type VARCHAR(50) CHECK(coupure_totale|degradation_majeure|maintenance_planifiee|retablissement)
- status VARCHAR(50) CHECK(active|escalated|acknowledged|resolved)
- title TEXT
- description TEXT
- affected_resellers UUID[] (list)
- affected_site_count INT
- estimated_duration_minutes INT NULL
- broadcast_channels TEXT[] CHECK([sms|whatsapp|email])
- broadcast_started_at TIMESTAMP
- messages_sent_total INT DEFAULT 0
- messages_delivered_total INT DEFAULT 0
- acknowledged_at TIMESTAMP NULL
- acknowledged_by UUID FK users NULL
- escalation_triggered BOOLEAN DEFAULT false
- escalation_triggered_at TIMESTAMP NULL
- escalation_count INT DEFAULT 0
- resolved_at TIMESTAMP NULL
- root_cause TEXT NULL
- duration_minutes INT NULL
- created_at TIMESTAMP
crisis_escalation_history:
- id UUID PK
- crisis_id UUID FK
- escalation_level INT (1=NOC, 2=Tier2_DG, 3=Astreinte_external)
- recipient_phone VARCHAR(20)
- recipient_name VARCHAR(100)
- message_sent TEXT
- sent_at TIMESTAMP
- acknowledged_at TIMESTAMP NULL
- acknowledged_by VARCHAR(100) NULL
- created_at TIMESTAMP
Configuration
env
# .env.example
CRISIS_DISPATCH_ENABLED=true
CRISIS_COUPURE_TOTALE_THRESHOLD_SITES=30
CRISIS_COUPURE_TOTALE_THRESHOLD_DURATION_MINUTES=5
CRISIS_DEGRADATION_THRESHOLD_SITES=20
CRISIS_DEGRADATION_THRESHOLD_LATENCY_MS=2000
CRISIS_ESCALADE_ASTREINTE_TIMEOUT_MINUTES=5
CRISIS_ESCALADE_REPEAT_INTERVAL_MINUTES=5
CRISIS_ESCALADE_MAX_ATTEMPTS=3
CRISIS_CHANNELS_DEFAULT=sms,whatsapp,email
CRISIS_SMS_ENABLED=true
CRISIS_WHATSAPP_ENABLED=true
CRISIS_EMAIL_ENABLED=true
CRISIS_RECIPIENTS_NOC=noc@rgz.bj
CRISIS_RECIPIENTS_DG=dg@rgz.bj
CRISIS_RECIPIENTS_ASTREINTE=astreinte@rgz.bj,+229-97979964
# Broadcast all resellers on COUPURE_TOTALE
CRISIS_BROADCAST_ALL_RESELLERS=true
API Endpoints
| Method | Route | Request | Response | Notes |
|---|---|---|---|---|
| POST | /api/v1/crisis/dispatch | | | 202 ACCEPTED |
| GET | /api/v1/crisis/active | — | {items:[{id, type, title, started_at, escalation_count}], total} | 200 OK |
| GET | /api/v1/crisis/ | — | | 200 OK |
| PUT | /api/v1/crisis/{id}/acknowledge | | | 200 OK |
| POST | /api/v1/crisis/{id}/escalate-astreinte | | | 201 CREATED |
| PUT | /api/v1/crisis/{id}/resolve | | | 200 OK |
| GET | /api/v1/crisis/history | ?limit=50&order=created_at | {items:[{id, type, duration_min, created_at}], total} | 200 OK |
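The table does not spell out the request body for POST /api/v1/crisis/dispatch. Going by the curl examples later in this document, a payload check might look like the sketch below. `DispatchPayload` and `parse_dispatch_payload` are hypothetical names, the choice of `type` and `title` as required fields is an assumption, and the real service would more likely rely on its web framework's validation layer.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

VALID_TYPES = {
    "coupure_totale", "degradation_majeure",
    "maintenance_planifiee", "retablissement",
}
VALID_CHANNELS = {"sms", "whatsapp", "email"}


@dataclass
class DispatchPayload:
    # Field names taken from the curl examples; required/optional is assumed.
    type: str
    title: str
    description: str = ""
    affected_resellers: list = field(default_factory=list)
    estimated_duration_min: Optional[int] = None
    # Default assumed to follow CRISIS_CHANNELS_DEFAULT.
    channels: list = field(default_factory=lambda: ["sms", "whatsapp", "email"])


def parse_dispatch_payload(data: dict) -> DispatchPayload:
    """Validate a decoded JSON body and return a typed payload."""
    known = {f.name for f in fields(DispatchPayload)}
    unknown = sorted(set(data) - known)
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    for required in ("type", "title"):
        if required not in data:
            raise ValueError(f"missing field: {required}")
    payload = DispatchPayload(**data)
    if payload.type not in VALID_TYPES:
        raise ValueError(f"invalid crisis type: {payload.type}")
    bad_channels = sorted(set(payload.channels) - VALID_CHANNELS)
    if bad_channels:
        raise ValueError(f"invalid channels: {bad_channels}")
    return payload


payload = parse_dispatch_payload({
    "type": "maintenance_planifiee",
    "title": "Maintenance DB PostgreSQL",
    "estimated_duration_min": 45,
    "channels": ["email"],
})
print(payload.type, payload.channels)
```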
Celery Task (rgz-beat)
python
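from datetime import datetime, timedelta

from celery.schedules import crontab

# Assumed to come from the rgz-api code base: celery_app, db, the Crisis
# model, and send_sms_to_astreinte. Adjust the imports to the real layout.

# Celery 5 removed the @periodic_task decorator, so the 5-minute schedule
# is registered with beat instead (merge into any existing beat_schedule).
celery_app.conf.beat_schedule = {
    "crisis-escalate-unacknowledged": {
        "task": "rgz.crisis.escalate_unacknowledged",
        "schedule": crontab(minute="*/5"),  # every 5 minutes
        "options": {"queue": "rgz.crisis"},
    },
}


@celery_app.task(queue="rgz.crisis", name="rgz.crisis.escalate_unacknowledged")
def escalate_unacknowledged_crises():
    """
    Find COUPURE_TOTALE crises left unacknowledged for more than 5 minutes
    and escalate each one by SMS to the on-call manager (Tier 3).
    """
    now = datetime.utcnow()
    cutoff_time = now - timedelta(minutes=5)
    crises = Crisis.query.filter(
        Crisis.type == "coupure_totale",
        Crisis.status == "active",
        Crisis.broadcast_started_at <= cutoff_time,
        Crisis.acknowledged_at.is_(None),
    )
    for crisis in crises:
        # Escalate to the Tier 3 on-call contact
        send_sms_to_astreinte(
            crisis.id,
            f"ON-CALL ESCALATION: {crisis.title}. NOC is not responding. URGENT intervention required.",
        )
        crisis.escalation_count += 1
        crisis.escalation_triggered = True
        crisis.escalation_triggered_at = now
    # Persist all escalation updates in one transaction
    db.session.commit()
Useful Commands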
bash
# Dispatch a total outage
curl -X POST http://localhost:8000/api/v1/crisis/dispatch \
  -H "Authorization: Bearer PROMETHEUS_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "coupure_totale",
    "title": "Fiber cut central link",
    "description": "BGP flap detected. Central router unreachable. 45 sites offline.",
    "affected_resellers": ["550e8400-e29b-...", "550e8400-e29b-..."],
    "estimated_duration_min": 30,
    "channels": ["sms", "whatsapp", "email"]
  }'
# Dispatch a scheduled maintenance notice (48h before)
curl -X POST http://localhost:8000/api/v1/crisis/dispatch \
  -H "Authorization: Bearer ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "maintenance_planifiee",
    "title": "Maintenance DB PostgreSQL",
    "description": "Upgrade to version 16.2. Expected downtime: 45 minutes.",
    "estimated_duration_min": 45,
    "channels": ["email"]
  }'
# Check active crises
curl http://localhost:8000/api/v1/crisis/active \
-H "Authorization: Bearer NOC_TOKEN" | jq '.items'
# Acknowledge a crisis (NOC)
curl -X PUT http://localhost:8000/api/v1/crisis/550e8400/acknowledge \
  -H "Authorization: Bearer NOC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"notes": "Acknowledged, team mobilized. Will update in 5 min."}'
# Manual on-call escalation (Tier 2)
curl -X POST http://localhost:8000/api/v1/crisis/550e8400/escalate-astreinte \
  -H "Authorization: Bearer TIER2_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"note": "NOC unable to resolve, escalating to external team."}'
# Resolve a crisis (after service is restored)
curl -X PUT http://localhost:8000/api/v1/crisis/550e8400/resolve \
  -H "Authorization: Bearer NOC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "root_cause": "ISP misconfigured BGP route announcement",
    "resolution": "ISP reverted route, connectivity restored",
    "prevention_actions": ["Add BGP monitoring", "Setup secondary path"]
  }'
# Show crisis history (admin)
curl "http://localhost:8000/api/v1/crisis/history?limit=10" \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.items'
Integration With Other Tools
- #58 incident-escalation: Creates P0 incidents → automatically triggers POST /api/v1/crisis/dispatch
- #61 sms-template-engine: Sends crisis SMS (no template; free text, for speed)
- #62 whatsapp-business: Sends WhatsApp messages in parallel with SMS
- #63 email-notification: Sends formal crisis + maintenance emails
- #72 post-incident-rca: Consumes crisis_events.root_cause for the post-mortem RCA
- Dashboard NOC (#52): Shows active crises + escalation timers + acknowledge buttons
Implementation TODO
- [ ] CRUD crisis events (4 types)
- [ ] Multi-channel broadcast (simultaneous SMS + WhatsApp + Email)
- [ ] On-call escalation Celery task (every 5 min)
- [ ] ACK timer tracking (5 min escalation trigger)
- [ ] Root cause tracking
- [ ] Crisis audit log (crisis_events + crisis_escalation_history)
- [ ] Dashboard NOC crisis widget (real-time active crises)
- [ ] Auto-resolve 10 min after incident RESOLVED
- [ ] WebSocket notifications (push to connected clients)
- [ ] Tests: dispatch COUPURE → broadcast → wait 5 min → on-call escalation
- [ ] Integration #58 incident webhook → POST /crisis/dispatch
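The escalation test in the list above can be prototyped without Celery by isolating the decision in a pure function. A minimal sketch, assuming the timeout, repeat interval, and attempt cap map to CRISIS_ESCALADE_ASTREINTE_TIMEOUT_MINUTES, CRISIS_ESCALADE_REPEAT_INTERVAL_MINUTES, and CRISIS_ESCALADE_MAX_ATTEMPTS. `CrisisState` and `should_escalate` are hypothetical names, and whether the attempt cap really stops the repeats is an assumption the spec does not confirm.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed to mirror the CRISIS_ESCALADE_* values in .env.example.
TIMEOUT = timedelta(minutes=5)
REPEAT_INTERVAL = timedelta(minutes=5)
MAX_ATTEMPTS = 3


@dataclass
class CrisisState:
    type: str
    broadcast_started_at: datetime
    acknowledged_at: Optional[datetime] = None
    escalation_count: int = 0
    escalation_triggered_at: Optional[datetime] = None


def should_escalate(crisis: CrisisState, now: datetime) -> bool:
    """Decide whether the on-call SMS should be sent at time `now`."""
    if crisis.type != "coupure_totale" or crisis.acknowledged_at is not None:
        return False
    if crisis.escalation_count >= MAX_ATTEMPTS:
        return False
    if crisis.escalation_triggered_at is None:
        # First escalation: timer starts at the initial broadcast.
        return now - crisis.broadcast_started_at >= TIMEOUT
    # Repeat escalation: timer restarts at the previous escalation.
    return now - crisis.escalation_triggered_at >= REPEAT_INTERVAL


# Scenario: dispatch, wait 4 min (no escalation), wait 5 min (escalation).
start = datetime(2026, 2, 21, 2, 0)
crisis = CrisisState("coupure_totale", broadcast_started_at=start)
print(should_escalate(crisis, start + timedelta(minutes=4)))
print(should_escalate(crisis, start + timedelta(minutes=5)))
```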
Last updated: 2026-02-21