#43 — SLA Probe Engine (ICMP+TCP probes, every 5 min)
PLANNED
Priority: 🔴 CRITICAL · Type: TYPE C · Container: rgz-beat · Code: app/tasks/sla_probe.py
Dependencies: #10 rgz-beat, #4 rgz-db
Description
SLA (Service Level Agreement) probe engine that runs network availability checks every 5 minutes. Probes come in two types: ICMP (ping) against every Access Point, and TCP against the critical ports (80, 443, 8000 API, 5432 DB, 6379 Redis).
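A minimal sketch of the two probe primitives described above, assuming asyncio for TCP connects and the system ping binary for ICMP (the TODO list mentions icmplib, which would avoid shelling out); the function and field names are illustrative, the real code belongs in app/tasks/sla_probe.py:

```python
import asyncio
import re
import subprocess
import time

async def tcp_probe(host: str, port: int, timeout: float = 5.0) -> dict:
    """TCP probe: measure time-to-connect, mirroring the sla_results columns."""
    start = time.monotonic()
    try:
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout)
        latency_ms = (time.monotonic() - start) * 1000.0
        writer.close()
        await writer.wait_closed()
        return {"target": host, "target_port": port, "latency_ms": latency_ms,
                "success": True, "error_message": None}
    except (OSError, asyncio.TimeoutError) as exc:
        return {"target": host, "target_port": port, "latency_ms": None,
                "success": False,
                "error_message": str(exc) or type(exc).__name__}

def icmp_probe(ip: str, count: int = 3, timeout: int = 5) -> dict:
    """ICMP probe via the system ping binary (iputils syntax: -W in seconds)."""
    proc = subprocess.run(["ping", "-c", str(count), "-W", str(timeout), ip],
                          capture_output=True, text=True)
    # Parse packet loss from ping's summary line, e.g. "0% packet loss".
    loss = re.search(r"([\d.]+)% packet loss", proc.stdout)
    return {"target": ip, "success": proc.returncode == 0,
            "loss_percent": float(loss.group(1)) if loss else None}
```

Running both probe types concurrently for all targets is then a single `asyncio.gather` over the target list.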
Results are stored in the TimescaleDB sla_results table with latency, packet loss, and success/failure status. Every hour, an aggregate computes uptime% over the sliding window. If uptime drops below 99% over 1h, an alert fires via SMS #61 and email #63.
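The uptime rule above is simple arithmetic: at one probe every 5 minutes there are 12 samples per hourly window, so a single failed probe already drops uptime to about 91.7% and trips the 99% threshold. A minimal sketch (helper names are illustrative):

```python
def uptime_percent(successes: list[bool]) -> float:
    """Uptime% over a window: share of successful probes, as in the hourly aggregate."""
    if not successes:
        return 0.0  # no data: treat as down rather than divide by zero
    return 100.0 * sum(successes) / len(successes)

def should_alert(uptime: float, threshold: float = 99.0) -> bool:
    """Mirror of SLA_ALERT_UPTIME_THRESHOLD: alert strictly below the threshold."""
    return uptime < threshold
```

For example, `uptime_percent([True] * 11 + [False])` is roughly 91.67, well below 99.0, so the window would alert.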
The NOC dashboard #52 shows the live status of every probe. The monthly reports #70 consume this data to compute the contractual SLA uptime per reseller (target: 99.5% minimum, otherwise automatic credits #25).
Internal Architecture
SLA Probe Dataflow:
1. Celery Beat schedule:
└─ rgz.sla.probe task every 5 minutes
└─ Function: scan_sla_probes()
2. Targets to scan (config):
└─ ICMP: every AP (reseller_sites.ap_ip + master gateway)
└─ TCP:
• rgz-api:8000 (health endpoint)
• rgz-db:5432 (connectivity check)
• rgz-redis:6379 (cache ping)
• rgz-radius:1812/udp (auth port)
• each site NAS-ID (LAN gateway)
└─ DNS: Google Public DNS (8.8.8.8) resolution check
3. Probe execution (async, concurrent):
└─ ICMP: ping -c 3 -W 5 <ip> → extract latency, loss%
└─ TCP: socat/telnet, 5 s timeout, <ip>:<port> → time to connect
└─ DNS: dig @8.8.8.8 access-rgz.duckdns.org → response time
4. TimescaleDB sla_results hypertable:
├─ Columns:
│ id UUID PK, time TIMESTAMPTZ, target TEXT, target_type CHECK(icmp|tcp|dns),
│ target_port INT, latency_ms FLOAT, loss_percent FLOAT,
│ success BOOLEAN, error_message TEXT, probe_host TEXT
├─ Compression: after 24h
└─ Retention: 12 months
5. Aggregation every hour (01:00, 02:00...):
└─ INSERT INTO sla_results_hourly (time_bucket, target, uptime_percent, avg_latency, loss_percent)
SELECT time_bucket('1 hour', time) as tb, target,
(COUNT(CASE WHEN success THEN 1 END)::float / COUNT(*)) * 100 as uptime,
AVG(latency_ms), AVG(loss_percent)
FROM sla_results
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY target, tb
ORDER BY tb DESC;
6. Alerting (every 10 min):
└─ SELECT target, uptime_percent FROM sla_results_hourly
WHERE time > NOW() - INTERVAL '1 hour' AND uptime_percent < 99.0
→ triggers AlertManager → SMS/email #61 #63
→ Celery task send_sla_alert()
7. Dashboards/Reports:
└─ NOC dashboard #52 (30 s refresh): live status
└─ Monthly SLA report #70 (D+5): uptime% per reseller
└─ Auto SLA credits #25: if <99.5% → proportional credit calculation
Configuration
# .env Celery Beat SLA
CELERY_BEAT_SLA_PROBE_ENABLED=true
CELERY_BEAT_SLA_PROBE_INTERVAL=300 # 5 minutes, in seconds
# SLA Probe targets
SLA_PROBE_ICMP_TIMEOUT=5 # seconds
SLA_PROBE_TCP_TIMEOUT=5
SLA_PROBE_ICMP_COUNT=3 # number of pings
SLA_PROBE_TCP_PORTS=80,443,8000,5432,6379,1812
SLA_PROBE_DNS_SERVERS=8.8.8.8,1.1.1.1
# SLA Alerting thresholds
SLA_ALERT_UPTIME_THRESHOLD=99.0 # %
SLA_ALERT_LATENCY_THRESHOLD_MS=200 # ms
SLA_ALERT_LOSS_THRESHOLD=5.0 # %
# TimescaleDB compression + retention
TIMESCALEDB_SLA_RETENTION=12 # months
TIMESCALEDB_SLA_CHUNK_INTERVAL=1 day
# Prometheus alert integration
PROMETHEUS_SLA_UPTIME_THRESHOLD=99.0
PROMETHEUS_SLA_ALERT_DURATION=1h
# Logging
SLA_PROBE_LOG_LEVEL=INFO
SLA_PROBE_LOG_FILE=/var/log/rgz/sla_probe.log
API Endpoints
| Method | Route | Response |
|---|---|---|
| GET | /api/v1/sla/current | {items: [{target, target_type, latency_ms, success, timestamp, last_update}], total} |
| GET | /api/v1/sla/{target}/history?from=&to=&interval=5m | {items: [{time, latency_ms, loss%, success}], total, pages} |
| GET | /api/v1/sla/uptime?target=&period=30d | {target, uptime_percent, avg_latency_ms, loss_percent, total_probes} |
| GET | /api/v1/sla/alerts?severity=high&days=7 | {items: [{timestamp, target, uptime%, message}], total} |
| POST | /api/v1/sla/probe-manual?targets=&timeout=5 | Triggers an immediate probe: {status: running, probe_id: uuid} |
| GET | /api/v1/sla/probe/{probe_id}/status | {probe_id, status, results: [{target, latency_ms, success}]} |
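A hedged client sketch for the manual-probe endpoints above. The base URL is an assumption, the real API presumably requires authentication, and the `opener` parameter exists only so the sketch can be exercised without a live server:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # assumed deployment URL

def trigger_manual_probe(targets: str, timeout: int = 5,
                         opener=urllib.request.urlopen) -> str:
    """POST /sla/probe-manual → returns the probe_id to poll."""
    req = urllib.request.Request(
        f"{BASE_URL}/sla/probe-manual?targets={targets}&timeout={timeout}",
        method="POST")
    with opener(req) as resp:
        return json.load(resp)["probe_id"]

def wait_for_probe(probe_id: str, poll_s: float = 1.0, max_polls: int = 30,
                   opener=urllib.request.urlopen) -> dict:
    """Poll GET /sla/probe/{probe_id}/status until it leaves 'running'."""
    for _ in range(max_polls):
        with opener(f"{BASE_URL}/sla/probe/{probe_id}/status") as resp:
            body = json.load(resp)
        if body["status"] != "running":
            return body
        time.sleep(poll_s)
    raise TimeoutError(f"probe {probe_id} still running after {max_polls} polls")
```

Typical use: `wait_for_probe(trigger_manual_probe("8.8.8.8"))`, then read `results` from the returned body.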
Useful Commands
# Manually trigger an SLA probe
docker exec rgz-beat python -c "
from app.tasks.sla_probe import scan_sla_probes
import asyncio
result = asyncio.run(scan_sla_probes())
print(result)
"
# Check SLA results in TimescaleDB
docker exec rgz-db psql -U rgz -d rgz -c "
SELECT target, target_type, latency_ms, loss_percent, success, time
FROM sla_results
ORDER BY time DESC
LIMIT 50;
"
# Compute uptime% over the last 24 hours
docker exec rgz-db psql -U rgz -d rgz -c "
SELECT target,
(COUNT(CASE WHEN success THEN 1 END)::float / COUNT(*)) * 100 as uptime_percent,
AVG(latency_ms) as avg_latency,
MAX(latency_ms) as max_latency
FROM sla_results
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY target
ORDER BY uptime_percent ASC;
"
# List SLA alerts from the last hour
docker exec rgz-db psql -U rgz -d rgz -c "
SELECT time_bucket('5 min', time) as bucket, target,
(COUNT(CASE WHEN success THEN 1 END)::float / COUNT(*)) * 100 as uptime
FROM sla_results
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY bucket, target
HAVING (COUNT(CASE WHEN success THEN 1 END)::float / COUNT(*)) * 100 < 99.0
ORDER BY bucket DESC;
"
# Monitor probes in real time
docker exec rgz-beat tail -f /var/log/rgz/sla_probe.log | grep -i alert
# Query SLA metrics from Prometheus (HTTP API endpoint is /api/v1/query)
curl 'http://localhost:9090/api/v1/query?query=sla_uptime_percent'
# Fetch the NOC dashboard with live probes
curl -H "Authorization: Bearer ${GRAFANA_API_TOKEN}" \
https://grafana-rgz.duckdns.org/api/dashboards/uid/sla_probes
Implementation TODO
- [ ] Create TimescaleDB hypertable sla_results(time, target TEXT, target_type, latency_ms FLOAT, loss_percent FLOAT, success BOOLEAN)
- [ ] Implement Celery task app/tasks/sla_probe.py::scan_sla_probes() (asyncio + icmplib + TCP socket)
- [ ] Add Beat schedule: rgz.sla.probe every 5 minutes, queue=rgz.monitoring
- [ ] Create hourly aggregation function: hourly_uptime_calculation(), run every hour
- [ ] Implement Prometheus exporter: sla_uptime_percent, sla_latency_ms, sla_loss_percent
- [ ] Create Prometheus alerting rule: sla_uptime_percent < 99.0 for 1h → AlertManager
- [ ] Implement webhook alerting: SLA violation → SMS #61 + email #63
- [ ] Add API endpoints: GET /sla/current, GET /sla/{target}/history, POST /sla/probe-manual
- [ ] Create Grafana dashboard: uptime% gauges, latency heatmap, loss% time series
- [ ] Tests: mock ICMP/TCP probes, verify TimescaleDB aggregation, alerting logic
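For the Prometheus exporter item, a dependency-free sketch that renders the three gauges in the text exposition format; a real implementation would more likely use prometheus_client Gauges, and the field names are assumed to follow the sla_results_hourly columns:

```python
def render_sla_metrics(samples: list[dict]) -> str:
    """Render sla_uptime_percent / sla_latency_ms / sla_loss_percent
    as Prometheus gauges, one time series per probe target."""
    lines = []
    for metric, key in [("sla_uptime_percent", "uptime_percent"),
                        ("sla_latency_ms", "latency_ms"),
                        ("sla_loss_percent", "loss_percent")]:
        lines.append(f"# TYPE {metric} gauge")
        for s in samples:
            # Label each sample with its target, e.g. rgz-api:8000.
            lines.append(f'{metric}{{target="{s["target"]}"}} {s[key]}')
    return "\n".join(lines) + "\n"
```

Serving this string on a /metrics endpoint is enough for Prometheus to scrape it and for the sla_uptime_percent < 99.0 alerting rule to evaluate.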
Last updated: 2026-02-21