#58: Incident Escalation (P0/P1/P2)

PLANNED

Priority: 🟠 HIGH · Type: TYPE B · Container: rgz-api · Code: app/services/incident.py · Dependencies: #1 rgz-api, #61 sms-template-engine


Description

The incident escalation system automatically routes critical alerts (P0, P1, P2) to the appropriate support tiers under strict SLAs. Incidents are detected via #38 Prometheus (CPU/RAM/RSSI/latency thresholds) and are created either automatically or manually through the API.

Each priority level follows a strict escalation schedule: P0 (total outage) escalates at 5, 15, and 30 minutes, ending at the DG; P1 (degradation) escalates at 15 minutes and 1 hour; P2 (isolated) escalates at 4 hours. The system sends notification SMS through #61 and resets the timer when the incident is acknowledged. Incidents not resolved within the SLA automatically trigger a further escalation (Tier 2 → Tier 3 external engineer).


Internal Architecture

Automatic SLA-based escalation:

LEVEL P0 - CRITICAL (TOTAL OUTAGE)
  Trigger: >30 sites without a RADIUS session for >5 min
  Example: central BGP flap, fiber cut, DDoS
  Escalation:
    T+5min:  SMS Tier 1 (support engineer)
    T+15min: SMS Tier 2 (technical lead)
    T+30min: SMS DG Technique + Président
  Notification: SMS + WhatsApp (#62) + critical email
  Broadcast: All resellers via #64 crisis-dispatcher
  Resolution: Automatic rollback, confirmed once stability is detected for 5min
  Baseline: Incident is CLOSED automatically if uptime >99.9% over a 5min window

LEVEL P1 - HIGH (SERVICE DEGRADATION)
  Trigger: Service loss on 5-30 sites OR P95 latency > 2s OR packet loss >5%
  Escalation:
    T+15min: SMS Tier 1
    T+1h:    SMS Tier 2
  Notification: Email + SMS to affected resellers
  Display: Public status page (#55) "Service degraded"
  Resolution: Auto-close once the metric has been back to normal for 15min

LEVEL P2 - MEDIUM (ISOLATED INCIDENT)
  Trigger: A single site affected OR a user report via ticket
  Escalation:
    T+4h: SMS Tier 1 only
  Notification: Internal support email
  Display: Notes in the NOC dashboard (#52)
  Resolution: No auto-close, requires NOC acknowledgement
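
The trigger rules for the three levels can be read as a small classification function. The sketch below is illustrative only: the function and argument names are hypothetical, and two cases the rules leave open are resolved by assumption (2 to 4 affected sites fall to P2, and more than 30 sites down for less than 5 minutes is treated as P1 until the duration threshold is crossed).

```python
from typing import Optional

# Thresholds copied from the level definitions; names are hypothetical.
P0_SITES = 30
P0_DURATION_MIN = 5
P1_SITES = 5
P1_LATENCY_MS = 2000
P1_LOSS_PERCENT = 5


def classify_priority(
    sites_down: int,
    outage_minutes: float = 0.0,
    latency_p95_ms: float = 0.0,
    loss_percent: float = 0.0,
) -> Optional[str]:
    """Return "P0", "P1", "P2", or None when nothing is affected."""
    # P0: more than 30 sites without service for more than 5 minutes.
    if sites_down > P0_SITES and outage_minutes > P0_DURATION_MIN:
        return "P0"
    # P1: 5 or more sites, or degraded latency, or degraded packet loss.
    if (
        sites_down >= P1_SITES
        or latency_p95_ms > P1_LATENCY_MS
        or loss_percent > P1_LOSS_PERCENT
    ):
        return "P1"
    # P2: isolated incident (assumption: 1 to 4 sites).
    if sites_down >= 1:
        return "P2"
    return None
```

For instance, `classify_priority(45, outage_minutes=6)` yields `"P0"`, while a single site down yields `"P2"`.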

ACKNOWLEDGEMENT FLOW:
  Tier receives SMS "INC-2026-000847: Outage on 2 sites. Team mobilized."
  Replies: "ACK" or the ticket number in the message
  System resets the timer (escalation suspended while the acknowledgement is active)
  Tier can change status: OPEN → ACKNOWLEDGED → INVESTIGATING → RESOLVED
  Status RESOLVED ≠ CLOSED (5min of stability required before auto-close)
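
A minimal sketch of the two checks this flow implies: recognising an acknowledgement reply, and validating a manual status change. The helper names are hypothetical, and the strictly linear transition chain is an assumption; the real service may allow skipping a state.

```python
import re

# Manual transitions listed in the flow (assumed strictly linear).
TRANSITIONS = {
    "open": {"acknowledged"},
    "acknowledged": {"investigating"},
    "investigating": {"resolved"},
}

# Incident numbers look like INC-2026-000847.
INCIDENT_NUMBER = re.compile(r"INC-\d{4}-\d{6}")


def is_ack(reply: str, incident_number: str) -> bool:
    """True if the SMS reply is "ACK" or quotes this incident's number."""
    text = reply.strip().upper()
    if text == "ACK":
        return True
    return incident_number in INCIDENT_NUMBER.findall(text)


def can_transition(current: str, target: str) -> bool:
    """True if the flow allows moving from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())
```

A reply quoting a different incident number is not treated as an acknowledgement, which avoids suspending the wrong timer when several incidents are open.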

AUTO-ROLLBACK TRIGGERS (via #60):
  If a P0 incident has been open > 10min AND a rollback is available:
    - Roll back VLAN config (nftables)
    - Roll back CPE firmware (return to stable version)
    - Kill suspicious sessions (RADIUS reset)
    - Notify DG + team
    - Incident status: ROLLBACK_APPLIED
    - If stability > 5min after rollback: CLOSED
    - If still P0 after rollback: critical escalation to DG + external
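
The rollback rule reduces to two decisions: whether to trigger the rollback, and what to do once it has been applied. The following sketch assumes hypothetical function names and return labels; the actual rollback actions (nftables, firmware, RADIUS) belong to #60 and are not modelled here.

```python
from datetime import datetime, timedelta

ROLLBACK_TIMEOUT = timedelta(minutes=10)  # P0 open longer than this
STABILITY_WINDOW = timedelta(minutes=5)   # stability needed before closing


def should_rollback(
    priority: str,
    detected_at: datetime,
    now: datetime,
    rollback_available: bool,
    already_triggered: bool = False,
) -> bool:
    """True for a P0 open for more than 10 minutes with a rollback available."""
    return (
        priority == "P0"
        and rollback_available
        and not already_triggered
        and now - detected_at > ROLLBACK_TIMEOUT
    )


def after_rollback(still_p0: bool, stable_for: timedelta) -> str:
    """Next step once the rollback has been applied (labels are hypothetical)."""
    if still_p0:
        return "escalate_dg_external"
    if stable_for > STABILITY_WINDOW:
        return "closed"
    # Not yet stable long enough: stay in ROLLBACK_APPLIED and keep watching.
    return "rollback_applied"
```

The `already_triggered` guard mirrors the `rollback_triggered` column, so that a task running every minute does not apply the same rollback twice.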

Data Models

python
from enum import Enum

class IncidentPriority(str, Enum):
    P0 = "P0"  # Total outage
    P1 = "P1"  # Degradation
    P2 = "P2"  # Isolated

class IncidentStatus(str, Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    ESCALATED = "escalated"
    ROLLBACK_APPLIED = "rollback_applied"
    RESOLVED = "resolved"
    CLOSED = "closed"

# DB tables
incidents:
  - id UUID PK
  - incident_number VARCHAR(16) UNIQUE (auto: INC-2026-000001)
  - title TEXT
  - priority CHECK(P0|P1|P2)
  - status CHECK(open|acknowledged|investigating|escalated|rollback_applied|resolved|closed)
  - description TEXT
  - trigger_source VARCHAR(50) (prometheus|manual|webhook|user_report)
  - affected_resellers UUID[] (array of reseller IDs)
  - affected_site_count INT
  - detected_at TIMESTAMP
  - acknowledged_at TIMESTAMP NULL
  - acknowledged_by UUID FK users NULL
  - escalation_level INT (0=Tier1, 1=Tier2, 2=Tier3_external)
  - last_escalation_at TIMESTAMP NULL
  - next_escalation_at TIMESTAMP NULL
  - rollback_triggered BOOLEAN DEFAULT false
  - resolved_at TIMESTAMP NULL
  - closed_at TIMESTAMP NULL
  - root_cause_analysis TEXT NULL
  - created_at TIMESTAMP

incident_escalation_history:
  - id UUID PK
  - incident_id UUID FK
  - escalation_level INT
  - recipient TEXT (email/phone)
  - message_sent TEXT
  - sent_at TIMESTAMP
  - ack_received BOOLEAN DEFAULT false
  - ack_at TIMESTAMP NULL
  - created_by VARCHAR(50) (system|manual)

incident_metrics:
  - id UUID PK
  - incident_id UUID FK
  - metric_name VARCHAR(100) (session_count, latency_p95, packet_loss, etc)
  - trigger_value FLOAT
  - current_value FLOAT
  - baseline_value FLOAT
  - updated_at TIMESTAMP
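
The `incident_number` column follows the pattern `INC-2026-000001`. A minimal sketch of a formatter for that pattern is shown below; the function name is hypothetical, and where the sequence value comes from (for example a per-year database sequence) is an assumption left outside the example.

```python
def format_incident_number(year: int, sequence: int) -> str:
    """Build an incident number such as INC-2026-000001."""
    if not 1 <= sequence <= 999_999:
        raise ValueError("sequence must fit in six digits")
    # 4 + 4 + 1 + 6 = 15 characters, which fits the VARCHAR(16) column.
    return f"INC-{year:04d}-{sequence:06d}"
```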

Configuration

env
# .env.example
INCIDENT_P0_ESCALATE_1_MINUTES=5
INCIDENT_P0_ESCALATE_2_MINUTES=15
INCIDENT_P0_ESCALATE_3_MINUTES=30
INCIDENT_P0_ESCALATE_AUTO_ROLLBACK=true
INCIDENT_P0_ROLLBACK_TIMEOUT_MINUTES=10

INCIDENT_P1_ESCALATE_1_MINUTES=15
INCIDENT_P1_ESCALATE_2_MINUTES=60

INCIDENT_P2_ESCALATE_1_MINUTES=240  # 4h

INCIDENT_P0_THRESHOLD_SITES=30
INCIDENT_P0_THRESHOLD_DURATION_MINUTES=5
INCIDENT_P1_THRESHOLD_SITES=5
INCIDENT_P1_THRESHOLD_LATENCY_MS=2000
INCIDENT_P1_THRESHOLD_LOSS_PERCENT=5

INCIDENT_AUTO_CLOSE_STABILITY_MINUTES=5
INCIDENT_SMS_RECIPIENTS_TIER1=tier1@rgz.bj
INCIDENT_SMS_RECIPIENTS_TIER2=tier2@rgz.bj
INCIDENT_SMS_RECIPIENTS_DG=dg@rgz.bj

# Celery task
INCIDENT_ESCALATION_TASK=rgz.monitoring.escalate_incidents
INCIDENT_ESCALATION_INTERVAL_SECONDS=60
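
One possible way to turn the `INCIDENT_*_ESCALATE_*_MINUTES` variables into a per-priority schedule is sketched below. The loader name is hypothetical. It strips a trailing `# ...` comment defensively, since not every dotenv loader removes inline comments such as the `# 4h` above.

```python
import os
from typing import Dict, List, Mapping

# Number of escalation steps per priority, as listed in the file above.
STEPS = {"P0": 3, "P1": 2, "P2": 1}


def load_escalation_schedule(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, List[int]]:
    """Read INCIDENT_<P>_ESCALATE_<n>_MINUTES into {"P0": [5, 15, 30], ...}."""
    schedule: Dict[str, List[int]] = {}
    for priority, steps in STEPS.items():
        minutes = []
        for n in range(1, steps + 1):
            raw = env[f"INCIDENT_{priority}_ESCALATE_{n}_MINUTES"]
            # Drop any inline comment before converting to an integer.
            minutes.append(int(raw.split("#")[0].strip()))
        if minutes != sorted(minutes):
            raise ValueError(f"{priority} escalation delays must be increasing")
        schedule[priority] = minutes
    return schedule
```

Failing on a missing key (`KeyError`) or on non-increasing delays at startup is a design choice: a misconfigured SLA is easier to catch at boot than during an outage.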

API Endpoints

| Method | Route | Request | Response | Notes |
|--------|-------|---------|----------|-------|
| POST | /api/v1/incidents | | | 201 CREATED |
| GET | /api/v1/incidents?priority=P0&status=open | | {items:[{id, incident_number, priority, status, created_at}], total, page, pages} | 200 OK |
| GET | /api/v1/incidents/ | | | 200 OK |
| PUT | /api/v1/incidents/{id}/acknowledge | | | 200 OK |
| PUT | /api/v1/incidents/{id}/investigate | | | 200 OK |
| PUT | /api/v1/incidents/{id}/resolve | | | 200 OK |
| PUT | /api/v1/incidents/{id}/close | | | 200 OK |
| POST | /api/v1/incidents/{id}/escalate-manual | | | 201 CREATED |
| GET | /api/v1/incidents/active | | {items:[{id, incident_number, priority, next_escalation_at}], total} | 200 OK |
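
The list endpoint returns an `{items, total, page, pages}` envelope. A minimal sketch of how such an envelope could be built is given below; the helper name and the `page_size` parameter are assumptions, since the page size is not specified here.

```python
import math
from typing import Any, Dict, Sequence


def paginate(
    rows: Sequence[Dict[str, Any]], page: int = 1, page_size: int = 20
) -> Dict[str, Any]:
    """Wrap a result set in the {items, total, page, pages} envelope."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    total = len(rows)
    start = (page - 1) * page_size
    return {
        "items": list(rows[start:start + page_size]),
        "total": total,
        "page": page,
        "pages": math.ceil(total / page_size),
    }
```

In a real service the slicing would normally happen in the database query (LIMIT/OFFSET) rather than in memory; the in-memory version only illustrates the shape of the response.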

Celery Task (rgz-beat)

python
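from datetime import datetime

from celery.schedules import crontab

# celery_app, db, Incident, IncidentStatus and escalate_to_tier_x come from the
# application; their import paths are not given in this document.

# Run every minute through beat; this replaces the legacy @periodic_task decorator.
celery_app.conf.beat_schedule = {
    "escalate-incidents": {
        "task": "rgz.monitoring.escalate_incidents",
        "schedule": crontab(minute="*/1"),
    },
}

@celery_app.task(queue="rgz.monitoring", name="rgz.monitoring.escalate_incidents")
def escalate_incidents():
    """
    Check unacknowledged incidents against their SLA.
    Trigger the SMS escalation when the deadline has passed.
    """
    active = [IncidentStatus.OPEN, IncidentStatus.INVESTIGATING]
    for incident in Incident.query.filter(Incident.status.in_(active)):
        if incident.priority == "P0" and not incident.acknowledged_at:
            # Check SLA timers
            now = datetime.utcnow()
            if incident.next_escalation_at and now >= incident.next_escalation_at:
                # Send SMS to the next level
                escalate_to_tier_x(incident)
                incident.escalation_level += 1
                incident.last_escalation_at = now
                incident.next_escalation_at = ...  # next deadline, left open here
                db.session.commit()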
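
The task above leaves the computation of `next_escalation_at` open. One way to derive it from the SLA schedule is sketched here. The function name and the `steps_sent` argument are hypothetical, and the example assumes deadlines are measured from `detected_at`; how `steps_sent` maps onto the stored `escalation_level` is also an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

# Minutes after detection at which each escalation step fires.
ESCALATION_MINUTES = {"P0": [5, 15, 30], "P1": [15, 60], "P2": [240]}


def next_escalation_at(
    priority: str, steps_sent: int, detected_at: datetime
) -> Optional[datetime]:
    """Deadline of the next escalation step, or None once all steps are used."""
    delays = ESCALATION_MINUTES[priority]
    if steps_sent >= len(delays):
        return None
    return detected_at + timedelta(minutes=delays[steps_sent])
```

Returning `None` when the schedule is exhausted lets the periodic task skip the incident instead of re-sending the last escalation every minute.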

SMS Templates (via #61)

Template: INC_P0_INITIAL
"RGZ P0 ALERT: {title}. {affected_sites} sites down. Team mobilized. Update in {delay} min. Ticket number: {incident_number}"

Template: INC_P0_ESCALATE_TIER2
"ESCALATION REQUIRED INC-{number}: {title}. Tier1 not responding. URGENT Tier2 intervention. Call: +229-....."

Template: INC_P1_ALERT
"RGZ P1 alert: Service degradation {title}. P95 latency: {latency}ms. Loss: {loss}%. Affected resellers: {count}. Status: https://status.rgz.bj"

Template: INC_P2_REPORT
"Isolated incident INC-{number}: {title}. 1 site affected. Tier1 support notified."

Template: INCIDENT_RESOLVED
"Incident {incident_number} RESOLVED. Total duration: {duration} min. RCA in progress. Thank you for your patience."
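
The placeholders use the same `{name}` syntax as Python's `str.format`, so rendering can be sketched with the standard library alone. This is not the #61 engine's actual API: `render_sms` and the `TEMPLATES` dictionary are hypothetical, and the template wording is shown in English.

```python
import string

TEMPLATES = {
    "INCIDENT_RESOLVED": (
        "Incident {incident_number} RESOLVED. Total duration: {duration} min. "
        "RCA in progress. Thank you for your patience."
    ),
}


def render_sms(name: str, **fields) -> str:
    """Fill a template, failing loudly if a placeholder is missing."""
    template = TEMPLATES[name]
    # Collect every placeholder name that appears in the template.
    needed = {
        field for _, field, _, _ in string.Formatter().parse(template) if field
    }
    missing = needed - set(fields)
    if missing:
        raise KeyError(f"missing fields for {name}: {sorted(missing)}")
    return template.format(**fields)
```

Checking placeholders before formatting gives an error message that names the template, which is more useful in logs than the bare `KeyError` raised by `str.format`.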

Useful Commands

bash
# Create a P0 incident (automatic Prometheus webhook)
curl -X POST http://localhost:8000/api/v1/incidents \
  -H "Content-Type: application/json" \
  -H "X-Prometheus-Secret: SHARED_SECRET" \
  -d '{
    "title": "Fiber cut core link Cotonou",
    "priority": "P0",
    "description": "BGP flap detected, 45 sites offline",
    "affected_resellers": ["550e8400-...", "550e8400-..."],
    "trigger_source": "prometheus"
  }'

# Create a manual incident (reseller report)
curl -X POST http://localhost:8000/api/v1/incidents \
  -H "Authorization: Bearer REVENDEUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "WiFi down on 3 CPE",
    "priority": "P1",
    "description": "Clients complaining no connectivity",
    "affected_resellers": ["550e8400-..."],
    "trigger_source": "user_report"
  }'

# Check active incidents (items in this response carry no status field, so no status filter is applied)
curl -s http://localhost:8000/api/v1/incidents/active | jq '.items[]'

# Acknowledge the incident (NOC)
curl -X PUT http://localhost:8000/api/v1/incidents/INC-2026-000847/acknowledge \
  -H "Authorization: Bearer NOC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"notes": "Investigating now, will update in 5min"}'

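# Manual escalation (Tier 1 → Tier 2)
# escalation_level follows the incidents table: 0=Tier1, 1=Tier2, 2=Tier3_external
curl -X POST http://localhost:8000/api/v1/incidents/INC-2026-000847/escalate-manual \
  -H "Authorization: Bearer NOC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"escalation_level": 1}'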

# Resolve the incident
curl -X PUT http://localhost:8000/api/v1/incidents/INC-2026-000847/resolve \
  -H "Authorization: Bearer NOC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "root_cause": "BGP route hijack from ISP misconfiguration",
    "resolution": "Contacted ISP, route restored",
    "prevention_actions": ["Add BGP monitoring alert", "Setup secondary path"]
  }'

Integration With Other Tools

  • #38 prometheus-alert: Creates incidents automatically via webhook POST /api/v1/incidents
  • #60 rollback-decision: Called by the Celery task if a P0 is not resolved after 10min
  • #61 sms-template-engine: Templates INC_P0_*, INC_P1_*, INC_P2_*
  • #64 crisis-dispatcher: If a P0 is confirmed, broadcasts SMS to resellers via the crisis template
  • #55 page-statut-public: Displays current P0 incidents + history (last 24h)
  • #52 dashboard-noc: Real-time incident feed, escalation timers, ack buttons
  • #63 email-notification: Emails post-mortem RCA reports to recipients

Implementation TODO

  • [ ] Incident CRUD (P0/P1/P2)
  • [ ] Auto-escalation Celery task every minute
  • [ ] SMS escalation via #61 templates
  • [ ] Prometheus webhook → POST /incidents
  • [ ] Status page #55 integration
  • [ ] Rollback trigger #60
  • [ ] Audit trail (incident_escalation_history)
  • [ ] Tests: create P0 → escalate T1 → escalate T2 → resolve
  • [ ] NOC dashboard incident queue
  • [ ] Post-mortem RCA email

Last updated: 2026-02-21

PROJET MOSAÏQUE: 81 tools, 22 containers, 500+ WiFi Zone resellers