#72 — post-incident-rca

PLANNED

Priority: 🟠 HIGH · Type: C (Celery) · Container: rgz-beat · Code: app/tasks/rca.py · Dependencies: #58 incident-escalation

Description

Automatic root cause analysis (RCA) triggered 24 hours after a P0 or P1 incident is resolved. The report consolidates the complete technical timeline, identifies the root cause using the "5 Whys" method, quantifies the business impact (subscribers × minutes of downtime), and proposes corrective actions to prevent recurrence.

The RCA serves three audiences:

Internal NOC: understand what happened and prevent recurrence
Management: report impact, SLA costs, and lessons learned
Reseller customers: transparency builds trust (optional, excerpt only)

Business objective: turn every incident into a lesson learned, with zero recurrence as the target.

Internal Architecture

Generation Flow

Incident marked "resolved" by NOC (#58)
  ↓
resolved_at timestamp recorded in DB
  ↓
Celery Beat (daily) checks: resolved_at < NOW() - 24h ?
  ↓
YES → Trigger rgz.rca.generate for this incident
  ↓
Fetch incident + complete timeline:
  1. Detection time (alert #38 or probe #43 down)
  2. All ELK logs (#40) during the incident
  3. All config changes (#77) during the incident
  4. RADIUS traces (#6): auth failures
  5. Prometheus metrics (#38): CPU, RAM, disk
  6. NOC notes from ticket #57 RMA
  7. Total downtime duration + impacted subscribers
  ↓
Apply "5 Whys" (questionnaire):
  - Symptom: "NAS Cotonou unreachable"
  - Why 1: "BGP route down"
  - Why 2: "NAS config error after deploy"
  - Why 3: "Deploy script missing validation"
  - Why 4: "No pre-deploy smoke test"
  - Why 5: "Process gap: validation not in SOP"
  → Root cause: "Missing process"
  ↓
Generate RCA PDF (5-10 pages)
  ↓
Save to DB + searchable index
  ↓
Email NOC + stakeholders
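
The timeline step in the flow above amounts to merging per-source event lists into one chronologically ordered array of {timestamp, event, source} records, the shape stored in timeline_json. A minimal sketch follows, not the project's implementation: the function name merge_timeline and the source labels are hypothetical, and timestamps are assumed to be ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def merge_timeline(sources: Dict[str, Iterable[dict]]) -> List[dict]:
    """Merge events from several sources into one list sorted by time.

    `sources` maps a source label (hypothetical, e.g. "prometheus") to
    events shaped like {"timestamp": "<ISO 8601>", "event": "<text>"}.
    """
    merged = []
    for source, events in sources.items():
        for ev in events:
            merged.append({
                "timestamp": ev["timestamp"],
                "event": ev["event"],
                "source": source,
            })
    # Parse before sorting so the order does not depend on string layout.
    merged.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
    return merged


timeline = merge_timeline({
    "prometheus": [
        {"timestamp": "2026-01-15T14:30:00+00:00", "event": "ICMP probe timeout 3x"},
    ],
    "config-audit": [
        {"timestamp": "2026-01-15T14:25:00+00:00", "event": "BGP neighbor config updated"},
    ],
    "noc-ticket": [
        {"timestamp": "2026-01-15T14:42:00+00:00", "event": "AS number corrected, bgpd reloaded"},
    ],
})
for entry in timeline:
    print(entry["timestamp"], entry["source"], entry["event"])
```

Keeping the source label on every record lets the PDF renderer print a "Source:" line under each event without another lookup.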

Data Schema

sql

-- Source table (created by #58)
TABLE incidents:
  id UUID PK
  priority CHECK(P0|P1|P2)
  title TEXT
  description TEXT
  created_at TIMESTAMP
  resolved_at TIMESTAMP
  resolution_notes TEXT

-- RCA table (created by #72)
TABLE rca_analysis:
  id UUID PK
  incident_id UUID FK UNIQUE
  rca_type CHECK(infrastructure|configuration|process|vendor|network)
  timeline_json JSON (array of {timestamp, event, source})
  symptom TEXT
  five_whys JSON ({why1, why2, why3, why4, why5, root_cause})
  root_cause_category CHECK(...)
  impact_subscribers INT
  impact_duration_minutes INT
  impact_revenue_fcfa DECIMAL(12,2) (calculated: ARPU * impact_subscribers * impact_duration_minutes / 43200)
  action_items JSON (array of {action, owner, deadline, status})
  pdf_path TEXT
  generated_at TIMESTAMP
  pdf_reviewed_by UUID FK (manager)
  pdf_reviewed_at TIMESTAMP
  follow_up_incident_id UUID FK (if issue recurs)
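
The impact_revenue_fcfa column is derived rather than entered by hand. A minimal sketch of that calculation follows, assuming ARPU is a monthly per-subscriber figure, the loss scales with the number of impacted subscribers, and a month is 30 days (43,200 minutes). The function name and the rounding rule are illustrative assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200: a 30-day month (assumption)


def estimate_revenue_impact(arpu_fcfa, subscribers: int, minutes: int) -> Decimal:
    """Pro-rata revenue lost while `subscribers` were down for `minutes`."""
    if subscribers < 0 or minutes < 0:
        raise ValueError("subscribers and minutes must be non-negative")
    loss = Decimal(str(arpu_fcfa)) * subscribers * minutes / MINUTES_PER_MONTH
    # Two decimal places, matching the DECIMAL(12,2) column.
    return loss.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 234 subscribers at 45,000 FCFA/month, down for 25 minutes.
print(estimate_revenue_impact(45000, 234, 25))  # 6093.75
```

Using Decimal rather than float keeps the stored amount exact, which matters once these figures feed SLA credit calculations.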

RCA PDF Contents (Example)

═══════════════════════════════════════════════════════════════
RCA — ROOT CAUSE ANALYSIS
Incident RGZ-P0-002061: Connectivity Loss, Cotonou
Analyzed: 2026-01-16 (24h after resolution)
═══════════════════════════════════════════════════════════════

1. EXECUTIVE SUMMARY
─────────────────────────────────────────────────────────────

Incident:           NAS Cotonou (access_tech_connect_s1) unreachable
Detection:          2026-01-15 14:30 UTC (ICMP probe timeout)
Resolution:         2026-01-15 14:55 UTC (25 minutes)
Root cause:         Configuration error + missing validation process

Subscribers hit:    234
Total duration:     25 minutes
Revenue loss:       6,094 FCFA (234 * 45,000 FCFA/month / 43,200 min/month * 25 min)
SLA impact:         Downtime counted toward SLA credit

2. FULL TIMELINE
─────────────────────────────────────────────────────────────

14:25 UTC  - Deploy script executed: /scripts/gateway/deploy_bgp_config.sh
           Source: [Deploy Ticket #2061]
           Action: Update BGP neighbor config NAS Cotonou

14:30 UTC  ⚠️  ICMP probe (Prometheus) → timeout 3x
           Source: [Prometheus alert]
           Severity: P0 (uptime directly impacted)

14:30:15   - Alert #38 triggered → SMS NOC → PagerDuty

14:30:45   - NOC analyst starts investigation
           Action: SSH access NAS Cotonou

14:35     ⚠️  SSH fails: NAS booting but no network
           Log: "dev eth0 down"

14:40     - NOC retrieves config change log:
           "BGP AS 64100 changed to 64001 (error!)"

14:42     - NOC fixes manually: AS 64001 → 64100 + reload bgpd

14:55     ✓ NAS Cotonou responds to ICMP
          ✓ Probe recovered → alert cleared
          ✓ Subscribers reconnected automatically

3. "5 WHYS" ANALYSIS
─────────────────────────────────────────────────────────────

Initial problem: NAS BGP config error → downtime

❓ Why 1: "Why was the BGP AS number changed from 64100 to 64001?"
   → Answer: The deploy script contained a wrong hardcoded value
             (copied from a test client, Q3 2025)

❓ Why 2: "Why was the deploy script not tested before execution?"
   → Answer: No pre-deploy smoke test procedure
             Config validation was manual and was skipped

❓ Why 3: "Why is config validation not automated?"
   → Answer: Current process = human review only
             No validation tool (nft verify, bgp check, etc.)

❓ Why 4: "Why is there no validation tool?"
   → Answer: Limited resources, dev priorities elsewhere
             Decision: develop an auto-validation tool

❓ Why 5: "Why was a risky decision made without cost analysis?"
   → Answer: Gap in the technical evaluation process
             No documented SOP for deploys

╔══════════════════════════════════════════════════════════════╗
║ ROOT CAUSE: Process/Governance Gap                          ║
║  - No auto-validation tool for configs                      ║
║  - No documented SOP for production deploys                 ║
║  - Config validation = manual review (humans fail)          ║
║  → FIX: Develop validate_bgp_config.py tool                 ║
║       + Create deploy SOP + obtain approval                 ║
╚══════════════════════════════════════════════════════════════╝

4. CONTRIBUTING FACTORS
─────────────────────────────────────────────────────────────

☑ Configuration Management:
   - No rollback capability (should have kept backup config)
   - No version control for BGP AS config
   - Hardcoded values instead of env variables

☑ Process/Documentation:
   - No pre-deploy checklist
   - No SOP documented
   - Manual review insufficient

☑ Monitoring:
   ✓ GOOD: Probe detected issue in 5min
   ✓ GOOD: Alert SMS sent immediately
   ✗ BAD: Slow resolution (human + manual SSH)

5. CORRECTIVE ACTIONS (OWNER / DEADLINE)
─────────────────────────────────────────────────────────────

IMMEDIATE (Within 3 days):
  [x] Version BGP config + backup to git.rgz.local
      Owner: DevOps Engineer | Deadline: 2026-01-17
  [x] Create config backup before every deploy
      Owner: SOP PROD | Deadline: 2026-01-17

SHORT TERM (Within 2 weeks):
  [ ] Develop tool: validate_bgp_config.py
      Check: AS number, neighbor IP, timers vs DB
      Owner: R4 DevOps + R5 Network | Deadline: 2026-01-29

  [ ] Create Production Deploy SOP
      1. Pre-deploy smoke test
      2. Config validation tool
      3. Rollback procedure
      4. Approval chain
      Owner: CTO | Deadline: 2026-01-29

MEDIUM TERM (Within 1 month):
  [ ] Integrate auto-validation tool into CI/CD pipeline
      Trigger: Before deploy to prod
      Owner: R4 DevOps | Deadline: 2026-02-15

6. QUANTIFIED IMPACT
─────────────────────────────────────────────────────────────

Downtime duration:    25 minutes
Subscribers hit:      234 (48% of reseller Tech Connect)
Session interruptions:234 x 1 = 234 reconnections (benign)
Data loss:            0 (TCP reconnect handles)

Revenue impact:       6,094 FCFA
  = 234 subscribers * 45,000 FCFA/month / 43,200 min/month * 25 min

SLA impact:           Credit 5% × (99.5 - 98.2)% = Credit 6,570 FCFA
  = (MIR estimated 45K) × 5% × 1.3%

MTTR (Mean Time To Repair): 25 minutes
  Target: < 30min (MET)

MTTD (Mean Time To Detect): 5 minutes
  Target: < 10min (GOOD)

7. LESSONS LEARNED
─────────────────────────────────────────────────────────────

✓ Monitoring + alerting works well (detection 5min)
✓ Escalation process effective (SMS + PagerDuty)
✗ Manual process + human error = incidents
✗ Config versioning missing = slow recovery

→ Investment needed: Auto-validation tools

8. RECOMMENDED FOLLOW-UP
─────────────────────────────────────────────────────────────

Monitoring incident recurrence:
  IF new incident P0 same NAS within 30 days
  → Immediate investigation for a causal link

Audit action items:
  Review status on 2026-02-15 with management
  Escalate if actions are not completed

─────────────────────────────────────────────────────────────
RCA generated automatically 24h after resolution
Reviewed by: CTO [Signature] — 2026-01-16
Next review: 2026-02-15 (action follow-up)
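
The detection and repair figures in section 6 of the sample report can be derived from three timestamps. The sketch below assumes time-to-detect runs from the triggering change to first detection and time-to-repair from detection to recovery, which reproduces the sample's 5 and 25 minutes. The function and key names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def incident_durations(started_at: datetime,
                       detected_at: datetime,
                       resolved_at: datetime) -> dict:
    """Return time-to-detect and time-to-repair in whole minutes."""
    if not (started_at <= detected_at <= resolved_at):
        raise ValueError("timestamps must be in chronological order")

    def minutes(delta: timedelta) -> int:
        return int(delta.total_seconds() // 60)

    return {
        "mttd_minutes": minutes(detected_at - started_at),
        "mttr_minutes": minutes(resolved_at - detected_at),
    }


utc = timezone.utc
durations = incident_durations(
    started_at=datetime(2026, 1, 15, 14, 25, tzinfo=utc),   # deploy script ran
    detected_at=datetime(2026, 1, 15, 14, 30, tzinfo=utc),  # ICMP probe timeout
    resolved_at=datetime(2026, 1, 15, 14, 55, tzinfo=utc),  # NAS answers ICMP
)
print(durations)  # {'mttd_minutes': 5, 'mttr_minutes': 25}
```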

Configuration

env

# RCA Generation
RCA_TRIGGER_HOURS_AFTER_RESOLUTION=24    # 24h delay
RCA_ENABLED_FOR_PRIORITY=P0,P1            # Only P0/P1
RCA_FIVE_WHYS_DEPTH=5

# Timelines
RCA_LOG_RETENTION_DAYS=30
RCA_PDF_RETENTION_YEARS=7

# Output
RCA_OUTPUT_DIR=/var/reports/rca
RCA_TEMPLATE_PATH=/app/templates/reports/rca.html

# Notifications
RCA_NOTIFY_NOC=true
RCA_NOTIFY_SLACK=true
RCA_SLACK_CHANNEL=#incidents-rca
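
One way the task code might read these variables is through a small typed settings object, so that parsing and defaults live in one place. This is a sketch only: the RcaSettings class and load_rca_settings function are hypothetical, it covers only some of the variables, and the fallback values simply mirror the ones listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class RcaSettings:
    trigger_hours: int
    priorities: Tuple[str, ...]
    five_whys_depth: int
    output_dir: str
    notify_noc: bool
    notify_slack: bool
    slack_channel: str


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_rca_settings(env: Mapping[str, str] = os.environ) -> RcaSettings:
    """Read RCA_* variables, falling back to the documented values."""
    raw_priorities = env.get("RCA_ENABLED_FOR_PRIORITY", "P0,P1")
    return RcaSettings(
        trigger_hours=int(env.get("RCA_TRIGGER_HOURS_AFTER_RESOLUTION", "24")),
        priorities=tuple(p.strip() for p in raw_priorities.split(",") if p.strip()),
        five_whys_depth=int(env.get("RCA_FIVE_WHYS_DEPTH", "5")),
        output_dir=env.get("RCA_OUTPUT_DIR", "/var/reports/rca"),
        notify_noc=_as_bool(env.get("RCA_NOTIFY_NOC", "true")),
        notify_slack=_as_bool(env.get("RCA_NOTIFY_SLACK", "true")),
        slack_channel=env.get("RCA_SLACK_CHANNEL", "#incidents-rca"),
    )


settings = load_rca_settings({"RCA_NOTIFY_SLACK": "false"})
print(settings.priorities, settings.trigger_hours, settings.notify_slack)
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the real process environment.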

API Endpoints

Method	Route	Description	Response
GET	`/api/v1/incidents/{incident_id}/rca`	Retrieve RCA	RCA JSON
GET	`/api/v1/incidents/{incident_id}/rca/pdf`	Download PDF	PDF binary
POST	`/api/v1/incidents/{incident_id}/rca/generate`	Generate manually (admin)	202 Accepted
GET	`/api/v1/rca/pending`	Pending RCAs (> 24h post-resolution)	List[{incident_id, ready_at}]
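
The payload of `/api/v1/rca/pending` could be assembled as sketched below. The sketch assumes that ready_at means resolved_at plus the 24-hour delay, that only P0/P1 incidents qualify, and that incidents which already have an RCA are excluded. The function name, the dict-shaped incident records and the has_rca flag are hypothetical stand-ins for the real ORM query.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

RCA_DELAY = timedelta(hours=24)
ELIGIBLE_PRIORITIES = {"P0", "P1"}


def pending_rcas(incidents: Iterable[dict], now: datetime) -> List[dict]:
    """List incidents whose RCA is due but has not been generated yet."""
    due = []
    for inc in incidents:
        if inc["priority"] not in ELIGIBLE_PRIORITIES:
            continue
        if inc.get("resolved_at") is None or inc.get("has_rca"):
            continue
        ready_at = inc["resolved_at"] + RCA_DELAY
        if ready_at <= now:
            due.append((ready_at, inc["id"]))
    due.sort()  # oldest first
    return [
        {"incident_id": incident_id, "ready_at": ready_at.isoformat()}
        for ready_at, incident_id in due
    ]


utc = timezone.utc
sample = [
    {"id": "a", "priority": "P0", "resolved_at": datetime(2026, 1, 15, 14, 55, tzinfo=utc)},
    {"id": "b", "priority": "P1", "resolved_at": datetime(2026, 1, 16, 20, 0, tzinfo=utc)},
    {"id": "c", "priority": "P2", "resolved_at": datetime(2026, 1, 10, tzinfo=utc)},
]
print(pending_rcas(sample, now=datetime(2026, 1, 17, 2, 0, tzinfo=utc)))
```

The same selection could back the daily 02:00 UTC scan, which would then enqueue `rgz.rca.generate` once per returned incident.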

Celery Task

Field	Value
Task name	`rgz.rca.generate`
Schedule	Daily 02:00 UTC (scan incidents resolved 24h+)
Queue	`rgz.reports`
Timeout	300s
Retry	3x

Logic sketch:

python

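from datetime import datetime, timedelta, timezone

# Assumed to be defined elsewhere in the project: app (Celery instance),
# db (SQLAlchemy session), the Incident and RcaAnalysis models, send_email,
# and the _build_* / _fetch_* / _analyze_* / _render_* helpers.
RCA_DELAY = timedelta(hours=24)


@app.task(name='rgz.rca.generate', bind=True)
def generate_incident_rca(self, incident_id: str):
    incident = db.query(Incident).filter(
        Incident.id == incident_id
    ).first()

    if incident is None:
        return {'status': 'skipped', 'reason': 'incident not found'}
    if incident.priority not in ('P0', 'P1'):
        return {'status': 'skipped', 'reason': 'not P0/P1'}
    if incident.resolved_at is None:
        return {'status': 'skipped', 'reason': 'not resolved'}

    # rca_analysis.incident_id is UNIQUE: never generate a second RCA
    existing = db.query(RcaAnalysis).filter(
        RcaAnalysis.incident_id == incident_id
    ).first()
    if existing is not None:
        return {'status': 'skipped', 'reason': 'RCA already exists'}

    # Naive DB timestamps are treated as UTC
    resolved_at = incident.resolved_at
    if resolved_at.tzinfo is None:
        resolved_at = resolved_at.replace(tzinfo=timezone.utc)
    elapsed = datetime.now(timezone.utc) - resolved_at

    if elapsed < RCA_DELAY:
        hours_remaining = (RCA_DELAY - elapsed).total_seconds() / 3600
        return {'status': 'not_ready', 'hours_remaining': hours_remaining}

    # Fetch timeline + logs + metrics
    timeline = _build_incident_timeline(incident_id)
    logs = _fetch_elk_logs_during_incident(incident)
    metrics = _fetch_prometheus_during_incident(incident)

    # Apply the 5 Whys
    root_cause = _analyze_five_whys(incident, timeline, logs)

    # Generate the PDF
    pdf_path = _render_rca_pdf(incident, timeline, root_cause)

    # Save the RCA
    rca = RcaAnalysis(
        incident_id=incident_id,
        timeline_json=timeline,
        root_cause_category=root_cause['category'],
        five_whys=root_cause['five_whys'],
        pdf_path=pdf_path
    )
    db.add(rca)
    db.commit()

    # Notify the NOC
    send_email.delay(
        to='noc@rgz.local',
        subject=f"RCA available: {incident.title}",
        template='rca_notification',
        context={'incident': incident, 'rca': rca}
    )

    return {'status': 'success', 'rca_id': str(rca.id)}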
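
The `_analyze_five_whys` helper is not shown in the sketch above. Since the TODO list describes it as a structured questionnaire, one possible building block is a function that takes the analyst's ordered answers and produces the five_whys JSON from the schema ({why1..why5, root_cause}) together with a category. Everything here is an assumption for illustration: the name build_five_whys, defaulting the root cause to the last answer, and validating the category against the rca_type values.

```python
from typing import Dict, List, Optional

# Values taken from the rca_type CHECK constraint in the schema.
CATEGORIES = {"infrastructure", "configuration", "process", "vendor", "network"}


def build_five_whys(symptom: str,
                    answers: List[str],
                    category: str,
                    root_cause: Optional[str] = None,
                    depth: int = 5) -> Dict:
    """Validate questionnaire answers and shape them for storage."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    cleaned = [a.strip() for a in answers]
    if len(cleaned) != depth:
        raise ValueError(f"expected {depth} answers, got {len(cleaned)}")
    if not all(cleaned):
        raise ValueError("answers must not be empty")

    five_whys = {f"why{i}": a for i, a in enumerate(cleaned, start=1)}
    # Without an explicit statement, the deepest answer is the root cause.
    five_whys["root_cause"] = root_cause or cleaned[-1]
    return {"symptom": symptom, "category": category, "five_whys": five_whys}


result = build_five_whys(
    symptom="NAS Cotonou unreachable",
    answers=[
        "BGP route down",
        "NAS config error after deploy",
        "Deploy script missing validation",
        "No pre-deploy smoke test",
        "Process gap: validation not in SOP",
    ],
    category="process",
    root_cause="Missing process",
)
print(result["five_whys"]["root_cause"])  # Missing process
```

The returned dictionary exposes the 'category' and 'five_whys' keys that the task sketch reads from its root_cause value.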

Useful Commands

bash

# Generate an RCA manually (admin)
curl -X POST -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/incidents/{incident_id}/rca/generate"

# Retrieve the RCA as JSON
curl -H "Authorization: Bearer {token}" \
  "http://api-rgz.duckdns.org/api/v1/incidents/{incident_id}/rca"

# Download the RCA PDF
curl -H "Authorization: Bearer {token}" \
  "http://api-rgz.duckdns.org/api/v1/incidents/{incident_id}/rca/pdf" \
  -o rca_incident_{incident_id}.pdf

# List pending RCAs
curl -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/rca/pending" | jq .

# RCA generation logs
docker-compose logs rgz-beat | grep "rgz.rca.generate"

# Delete RCA PDFs older than 7 years (permanent: this removes files, it does not archive them)
find /var/reports/rca -name "*.pdf" -mtime +2555 -delete
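
The find command above deletes immediately. A dry-run-first variant in Python is sketched below. It judges age by file modification time, roughly as find -mtime does (find counts whole days, so the boundary differs slightly). The function names are hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86400


def expired_rca_pdfs(root: str,
                     retention_days: int = 2555,
                     now: Optional[float] = None) -> List[Path]:
    """Return PDFs under `root` last modified more than `retention_days` ago."""
    reference = time.time() if now is None else now
    cutoff = reference - retention_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(root).rglob("*.pdf")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def purge(paths: List[Path], dry_run: bool = True) -> int:
    """Delete the given files; with dry_run=True only report them."""
    removed = 0
    for path in paths:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    candidates = expired_rca_pdfs("/var/reports/rca")
    purge(candidates, dry_run=True)  # switch to False once the list looks right
```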

Implementation TODO

[ ] DB schema rca_analysis (incident_id, root_cause, timeline, actions)
[ ] Celery task rgz.rca.generate in app/tasks/rca.py
[ ] Function _build_incident_timeline() (ELK logs + Prometheus)
[ ] Function _analyze_five_whys() (structured questionnaire)
[ ] Jinja2 template for the RCA PDF
[ ] API endpoints GET/POST /api/v1/incidents/{incident_id}/rca
[ ] Integration with #58 incident-escalation (resolved webhook)
[ ] Tests: fictitious incident scenarios (BGP, CPU, disk)
[ ] Slack/Email notification for NOC
[ ] Dashboard: RCA status, follow-up tracking
[ ] Documentation: 5 Whys guide, root cause categorization

Last updated: 2026-02-21
