#64 — Crisis Dispatcher

PLANNED

Priority: 🟠 HIGH · Type: TYPE B · Container: rgz-api · Code: app/services/crisis.py
Dependencies: #58 incident-escalation, #61 sms-template-engine


Description

The Crisis Dispatcher orchestrates multi-channel communications during a system crisis: total outage, major degradation, scheduled maintenance, or service restoration. It exposes 4 crisis templates with instant broadcast via SMS (#61), WhatsApp (#62), and email (#63). On-call escalations are handled automatically (an SMS goes to the manager / technical DG when a crisis stays unacknowledged past a set delay).

Crisis messages are exempt from rate limiting: they are enqueued directly, with priority, in the Celery queues.


Crisis Templates (4 types)

TYPE 1: COUPURE_TOTALE (Total Outage)
  Trigger: P0 incident, >50 sites, no workaround
  Broadcast: SMS + WhatsApp + Email
  Recipients: All active resellers + NOC + DG + Tier 2
  On-call auto-escalation delay: 5 min without NOC acknowledgement

  SMS message:
    "🔴 TOTAL RGZ OUTAGE. {{ affected_sites }} sites without service.
     Emergency team mobilized. Update in {{ delay }}min.
     Status: https://status.rgz.bj"

  Email message:
    Subject: "CRITICAL ALERT: Total RGZ Outage"
    Content: Full description + support links + restoration ETA

TYPE 2: DEGRADATION_MAJEURE (Service Degradation)
  Trigger: P1 incident, >20 sites, high latency/loss
  Broadcast: SMS + Email
  Recipients: Affected resellers + internal support

  SMS message:
    "⚠️ RGZ service degraded. {{ affected_sites }} sites slowed down.
     Cause: {{ issue_short }}.
     Support in progress. Updates via {{ status_link }}"

  Email message:
    Subject: "RGZ Alert: Service Degraded"
    Content: Context + estimated impact + NOC actions

TYPE 3: MAINTENANCE_PLANIFIÉE (Scheduled Maintenance)
  Trigger: Planned maintenance (deployment, update, infrastructure)
  Broadcast: Email (48h before) + SMS (24h before)
  Recipients: All resellers
  Maintenance window: {{ maintenance_window }} (e.g. 02:00-04:00 UTC)
  Expected impact: Total outage OR degradation + recovery time

  Email message (48h):
    Subject: "Scheduled RGZ Maintenance: {{ maintenance_date }}"
    Content: Exact date/time, estimated duration, impact, support contact

  SMS message (24h):
    "Reminder: RGZ maintenance {{ maintenance_date }} {{ time_window }}.
     Estimated duration: {{ duration }}min.
     Support: support@rgz.bj"

TYPE 4: RETABLISSEMENT (Service Restored)
  Trigger: P0/P1 incident → status changed to RESOLVED + 10 min of validated stability
  Broadcast: SMS + WhatsApp + Email
  Recipients: All affected resellers

  SMS message:
    "✓ RGZ service restored. All sites operational.
     Incident #{{ incident_number }} closed.
     Total duration: {{ duration_minutes }}min.
     RCA: https://status.rgz.bj/incident/{{ incident_number }}"

  Email message:
    Subject: "RGZ Service Restored: Incident Closed"
    Content: Timeline, root cause summary, prevention actions, apologies
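
The `{{ name }}` placeholders in the templates above use a Jinja-style syntax. As a minimal sketch of how one line of a template could be filled in, the snippet below uses a small regex-based renderer built on the standard library only. The `render` helper and the sample incident number are assumptions made for illustration; the actual engine behind #61 may work differently.

```python
import re

# Last line of the RETABLISSEMENT SMS template, used here as a sample.
RCA_LINE = "RCA: https://status.rgz.bj/incident/{{ incident_number }}"

# Matches {{ name }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{ name }} placeholder with str(context[name]).

    A missing variable raises KeyError, so a half-filled message
    is never handed to a broadcast channel.
    """
    def substitute(match: re.Match) -> str:
        return str(context[match.group(1)])

    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical incident number, for illustration only.
print(render(RCA_LINE, {"incident_number": 1042}))
```

Failing loudly on a missing variable is a deliberate choice in this sketch: during a crisis, a message containing a raw `{{ delay }}` is arguably worse than a delayed message.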

Internal Architecture

CRISIS FLOW:

1. Trigger detected (#58 incident-escalation):
   - P0 incident created, >30 sites, >5 min
   → POST /api/v1/crisis/dispatch {type: COUPURE_TOTALE, ...}

2. Crisis Dispatcher:
   - Checks the type against thresholds (affected_sites, duration)
   - Builds the message template for each channel
   - Enqueues SMS/WhatsApp/Email tasks (priority HIGH)
   - Logs the crisis event (crisis_events table)
   - Notifies the NOC/DG dashboard (WebSocket push)

3. Broadcast multi-canal (simultaneous):
   - SMS: Via #61 (rate limit = disabled)
   - WhatsApp: Via #62 (rate limit = disabled)
   - Email: Via #63 (rate limit = disabled)

4. On-call escalation:
   - If type = COUPURE_TOTALE and no ACK from NOC within 5 min
   → SMS to the on-call manager (Tier 3 escalation)
   - Repeats every 5 min until acknowledged

5. Restoration:
   - Incident status → RESOLVED
   - Wait 10 min to validate stability
   - Auto-close incident
   - Triggers type = RETABLISSEMENT
   - Broadcasts the all-clear messages

ACTORS & RESPONSIBILITIES:
  - Trigger system: #58 (incident creation)
  - Dispatcher: #64 (this tool)
  - Channels: #61 SMS, #62 WhatsApp, #63 Email
  - Notification: NOC/DG via dashboard + SMS
  - Root cause: #72 post-incident RCA
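
Step 2 of the flow (threshold check, then one high-priority task per channel) can be sketched as a pure function. This is only an illustration under stated assumptions: `DispatchRequest`, `plan_dispatch`, and the shape of the returned task dictionaries are hypothetical names, the channel lists are copied from the templates section, and the site thresholds mirror the `CRISIS_*_THRESHOLD_SITES` settings.

```python
from dataclasses import dataclass

# Channels per crisis type, as listed in the templates section.
# For maintenance, email goes out 48h before and SMS 24h before.
CHANNELS = {
    "coupure_totale": ["sms", "whatsapp", "email"],
    "degradation_majeure": ["sms", "email"],
    "maintenance_planifiee": ["email", "sms"],
    "retablissement": ["sms", "whatsapp", "email"],
}

# Assumed to mirror the CRISIS_*_THRESHOLD_SITES settings.
SITE_THRESHOLDS = {"coupure_totale": 30, "degradation_majeure": 20}


@dataclass
class DispatchRequest:
    type: str
    title: str
    affected_site_count: int = 0


def plan_dispatch(request: DispatchRequest) -> list:
    """Return one task description per channel, or raise if below threshold."""
    threshold = SITE_THRESHOLDS.get(request.type)
    if threshold is not None and request.affected_site_count <= threshold:
        raise ValueError(
            f"{request.type} needs more than {threshold} affected sites, "
            f"got {request.affected_site_count}"
        )
    return [
        {
            "channel": channel,
            "priority": "HIGH",
            "rate_limited": False,  # crisis traffic is exempt from rate limiting
            "title": request.title,
        }
        for channel in CHANNELS[request.type]
    ]


tasks = plan_dispatch(DispatchRequest("coupure_totale", "Fiber cut central link", 45))
print([task["channel"] for task in tasks])
```

Keeping the planning step free of I/O makes the threshold rules easy to unit-test before any Celery task is enqueued.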

Data Models

python
from enum import Enum


class CrisisType(str, Enum):
    COUPURE_TOTALE = "coupure_totale"
    DEGRADATION_MAJEURE = "degradation_majeure"
    MAINTENANCE_PLANIFIÉE = "maintenance_planifiee"
    RETABLISSEMENT = "retablissement"

class CrisisStatus(str, Enum):
    ACTIVE = "active"
    ESCALATED = "escalated"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"

# DB tables
crisis_events:
  - id UUID PK
  - incident_id UUID FK NULL (link to incident #58)
  - type VARCHAR(50) CHECK(coupure_totale|degradation_majeure|maintenance_planifiee|retablissement)
  - status VARCHAR(50) CHECK(active|escalated|acknowledged|resolved)
  - title TEXT
  - description TEXT
  - affected_resellers UUID[] (list)
  - affected_site_count INT
  - estimated_duration_minutes INT NULL
  - broadcast_channels TEXT[] CHECK([sms|whatsapp|email])
  - broadcast_started_at TIMESTAMP
  - messages_sent_total INT DEFAULT 0
  - messages_delivered_total INT DEFAULT 0
  - acknowledged_at TIMESTAMP NULL
  - acknowledged_by UUID FK users NULL
  - escalation_triggered BOOLEAN DEFAULT false
  - escalation_triggered_at TIMESTAMP NULL
  - escalation_count INT DEFAULT 0
  - resolved_at TIMESTAMP NULL
  - root_cause TEXT NULL
  - duration_minutes INT NULL
  - created_at TIMESTAMP

crisis_escalation_history:
  - id UUID PK
  - crisis_id UUID FK
  - escalation_level INT (1=NOC, 2=Tier2_DG, 3=Astreinte_external)
  - recipient_phone VARCHAR(20)
  - recipient_name VARCHAR(100)
  - message_sent TEXT
  - sent_at TIMESTAMP
  - acknowledged_at TIMESTAMP NULL
  - acknowledged_by VARCHAR(100) NULL
  - created_at TIMESTAMP
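
The `status` column lists four states but not which moves between them are legal. The sketch below shows one way to guard status changes in application code. The transition map is an assumption (the spec does not define it), and `transition` is a hypothetical helper name.

```python
from enum import Enum


class CrisisStatus(str, Enum):
    ACTIVE = "active"
    ESCALATED = "escalated"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Assumed lifecycle: a crisis only moves forward and RESOLVED is terminal.
ALLOWED_TRANSITIONS = {
    CrisisStatus.ACTIVE: {
        CrisisStatus.ESCALATED,
        CrisisStatus.ACKNOWLEDGED,
        CrisisStatus.RESOLVED,
    },
    CrisisStatus.ESCALATED: {CrisisStatus.ACKNOWLEDGED, CrisisStatus.RESOLVED},
    CrisisStatus.ACKNOWLEDGED: {CrisisStatus.RESOLVED},
    CrisisStatus.RESOLVED: set(),
}


def transition(current: CrisisStatus, target: CrisisStatus) -> CrisisStatus:
    """Return the new status, or raise ValueError for an illegal move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


status = transition(CrisisStatus.ACTIVE, CrisisStatus.ACKNOWLEDGED)
print(status.value)
```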

Configuration

env
# .env.example
CRISIS_DISPATCH_ENABLED=true
CRISIS_COUPURE_TOTALE_THRESHOLD_SITES=30
CRISIS_COUPURE_TOTALE_THRESHOLD_DURATION_MINUTES=5
CRISIS_DEGRADATION_THRESHOLD_SITES=20
CRISIS_DEGRADATION_THRESHOLD_LATENCY_MS=2000

CRISIS_ESCALADE_ASTREINTE_TIMEOUT_MINUTES=5
CRISIS_ESCALADE_REPEAT_INTERVAL_MINUTES=5
CRISIS_ESCALADE_MAX_ATTEMPTS=3

CRISIS_CHANNELS_DEFAULT=sms,whatsapp,email
CRISIS_SMS_ENABLED=true
CRISIS_WHATSAPP_ENABLED=true
CRISIS_EMAIL_ENABLED=true

CRISIS_RECIPIENTS_NOC=noc@rgz.bj
CRISIS_RECIPIENTS_DG=dg@rgz.bj
CRISIS_RECIPIENTS_ASTREINTE=astreinte@rgz.bj,+229-97979964

# Broadcast all resellers on COUPURE_TOTALE
CRISIS_BROADCAST_ALL_RESELLERS=true
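
As a rough sketch of how these variables might be read at startup, the snippet below loads a few of them into a frozen dataclass. `CrisisSettings` and `load_settings` are hypothetical names, only a subset of the variables is covered, and the fallback defaults simply repeat the values from `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class CrisisSettings:
    enabled: bool
    outage_threshold_sites: int
    escalation_timeout_minutes: int
    escalation_max_attempts: int
    channels: Tuple[str, ...]


def _as_bool(raw: str) -> bool:
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> CrisisSettings:
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    default_channels = env.get("CRISIS_CHANNELS_DEFAULT", "sms,whatsapp,email")
    # Keep a channel only if its CRISIS_<NAME>_ENABLED flag is not switched off.
    channels = tuple(
        name.strip()
        for name in default_channels.split(",")
        if name.strip()
        and _as_bool(env.get(f"CRISIS_{name.strip().upper()}_ENABLED", "true"))
    )
    return CrisisSettings(
        enabled=_as_bool(env.get("CRISIS_DISPATCH_ENABLED", "true")),
        outage_threshold_sites=int(env.get("CRISIS_COUPURE_TOTALE_THRESHOLD_SITES", "30")),
        escalation_timeout_minutes=int(env.get("CRISIS_ESCALADE_ASTREINTE_TIMEOUT_MINUTES", "5")),
        escalation_max_attempts=int(env.get("CRISIS_ESCALADE_MAX_ATTEMPTS", "3")),
        channels=channels,
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"CRISIS_WHATSAPP_ENABLED": "false"})
print(settings.channels)  # ('sms', 'email')
```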

API Endpoints

| Method | Route | Request | Response | Notes |
|--------|-------|---------|----------|-------|
| POST | /api/v1/crisis/dispatch | | | 202 ACCEPTED |
| GET | /api/v1/crisis/active | | {items:[{id, type, title, started_at, escalation_count}], total} | 200 OK |
| GET | /api/v1/crisis/ | | | 200 OK |
| PUT | /api/v1/crisis/{id}/acknowledge | | | 200 OK |
| POST | /api/v1/crisis/{id}/escalate-astreinte | | | 201 CREATED |
| PUT | /api/v1/crisis/{id}/resolve | | | 200 OK |
| GET | /api/v1/crisis/history?limit=50&order=created_at | | {items:[{id, type, duration_min, created_at}], total} | 200 OK |

Celery Task (rgz-beat)

python
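from datetime import datetime, timedelta

from celery.schedules import crontab

# celery_app, db, Crisis and send_sms_to_astreinte come from the project's own
# modules (their import paths are not specified here).


@celery_app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # @periodic_task was removed in Celery 5; register the schedule with beat instead.
    sender.add_periodic_task(
        crontab(minute="*/5"),  # every 5 minutes
        escalate_unacknowledged_crises.s(),
        name="escalate unacknowledged crises",
    )


@celery_app.task(queue="rgz.crisis", name="rgz.crisis.escalate_unacknowledged")
def escalate_unacknowledged_crises():
    """
    Find COUPURE_TOTALE crises that have gone unacknowledged for more than
    5 minutes and send an escalation SMS to the on-call contact (Tier 3).
    """
    now = datetime.utcnow()
    cutoff_time = now - timedelta(minutes=5)

    for crisis in Crisis.query.filter(
        Crisis.type == "coupure_totale",
        Crisis.status == "active",
        Crisis.broadcast_started_at <= cutoff_time,
        Crisis.acknowledged_at.is_(None),
    ):
        # Escalate to the Tier 3 on-call contact
        send_sms_to_astreinte(
            crisis.id,
            f"ON-CALL ESCALATION: {crisis.title}. NOC is not responding. URGENT intervention required.",
        )
        crisis.escalation_count += 1
        crisis.escalation_triggered = True
        crisis.escalation_triggered_at = now
        db.session.commit()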
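
The escalation rule used by the task above can also be written as a pure function, which makes it testable without a database or a Celery worker. This is a sketch under assumptions: `should_escalate` is a hypothetical helper, and treating `CRISIS_ESCALADE_MAX_ATTEMPTS` as a hard cap is one possible reading, since the flow description says escalation repeats until acknowledgement.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_escalate(
    crisis_type: str,
    status: str,
    broadcast_started_at: datetime,
    acknowledged_at: Optional[datetime],
    escalation_count: int,
    now: datetime,
    timeout_minutes: int = 5,  # CRISIS_ESCALADE_ASTREINTE_TIMEOUT_MINUTES
    max_attempts: int = 3,     # CRISIS_ESCALADE_MAX_ATTEMPTS (assumed hard cap)
) -> bool:
    """Decide whether an on-call escalation SMS is due for this crisis."""
    if crisis_type != "coupure_totale" or status != "active":
        return False
    if acknowledged_at is not None:
        return False
    if escalation_count >= max_attempts:
        return False
    return now - broadcast_started_at >= timedelta(minutes=timeout_minutes)


started = datetime(2026, 2, 21, 10, 0)
print(should_escalate("coupure_totale", "active", started, None, 0,
                      now=started + timedelta(minutes=6)))
```

Passing `now` in explicitly, rather than reading the clock inside the function, is what keeps the rule deterministic in tests.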

Useful Commands

bash
# Dispatch a total outage
curl -X POST http://localhost:8000/api/v1/crisis/dispatch \
  -H "Authorization: Bearer PROMETHEUS_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "coupure_totale",
    "title": "Fiber cut central link",
    "description": "BGP flap detected. Central router unreachable. 45 sites offline.",
    "affected_resellers": ["550e8400-e29b-...", "550e8400-e29b-..."],
    "estimated_duration_min": 30,
    "channels": ["sms", "whatsapp", "email"]
  }'

# Dispatch scheduled maintenance (48h ahead)
curl -X POST http://localhost:8000/api/v1/crisis/dispatch \
  -H "Authorization: Bearer ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "maintenance_planifiee",
    "title": "PostgreSQL DB maintenance",
    "description": "Upgrade to version 16.2. Expected downtime: 45 minutes.",
    "estimated_duration_min": 45,
    "channels": ["email"]
  }'

# Check active crises
curl http://localhost:8000/api/v1/crisis/active \
  -H "Authorization: Bearer NOC_TOKEN" | jq '.items'

# Acknowledge a crisis (NOC)
curl -X PUT http://localhost:8000/api/v1/crisis/550e8400/acknowledge \
  -H "Authorization: Bearer NOC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"notes": "Acknowledged, team mobilized. Will update in 5 min."}'

# Manual on-call escalation (Tier 2)
curl -X POST http://localhost:8000/api/v1/crisis/550e8400/escalate-astreinte \
  -H "Authorization: Bearer TIER2_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"note": "NOC unable to resolve, escalating to external team."}'

# Resolve a crisis (after service restoration)
curl -X PUT http://localhost:8000/api/v1/crisis/550e8400/resolve \
  -H "Authorization: Bearer NOC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "root_cause": "ISP misconfigured BGP route announcement",
    "resolution": "ISP reverted route, connectivity restored",
    "prevention_actions": ["Add BGP monitoring", "Setup secondary path"]
  }'

# Show crisis history (admin)
curl "http://localhost:8000/api/v1/crisis/history?limit=10" \
  -H "Authorization: Bearer ADMIN_TOKEN" | jq '.items'

Integration With Other Tools

  • #58 incident-escalation: Creates P0 incidents → automatically triggers POST /api/v1/crisis/dispatch
  • #61 sms-template-engine: Sends crisis SMS (free text instead of a stored template, for speed)
  • #62 whatsapp-business: Sends WhatsApp messages in parallel with SMS
  • #63 email-notification: Sends formal crisis and maintenance emails
  • #72 post-incident-rca: Consumes crisis_events.root_cause for the post-mortem RCA
  • NOC Dashboard (#52): Displays active crises + escalation timers + acknowledge buttons

Implementation TODO

  • [ ] CRUD crisis events (4 types)
  • [ ] Multi-channel broadcast (SMS + WhatsApp + Email, simultaneous)
  • [ ] On-call escalation Celery task (every 5min)
  • [ ] ACK timer tracking (5min escalation trigger)
  • [ ] Root cause tracking
  • [ ] Crisis audit log (crisis_events + crisis_escalation_history)
  • [ ] Dashboard NOC crisis widget (real-time active crises)
  • [ ] Auto-resolve 10min after incident RESOLVED
  • [ ] WebSocket notifications (push to connected clients)
  • [ ] Tests: dispatch COUPURE → broadcast → wait 5min → on-call escalation
  • [ ] Integration #58 incident webhook → POST /crisis/dispatch

Last updated: 2026-02-21

MOSAÏQUE PROJECT: 81 tools, 22 containers, 500+ WiFi Zone resellers