Grafana Dashboards (15+ Panels)

Grafana suite for monitoring MOSAÏQUE in production: 15+ real-time dashboards, Prometheus alerts, and TimescaleDB metrics.

Production URL: https://grafana-rgz.duckdns.org
Credentials: admin / tTryk8aGbk2WDi6Pxgwg (generated at deployment)
Data sources: Prometheus, PostgreSQL (TimescaleDB), Elasticsearch
Alerting: Prometheus + AlertManager (email + SMS)

Main Dashboards

1. System Overview

Infrastructure health: 22 containers plus the host server.

┌─ SYSTEM OVERVIEW ───────────────────────────┐
│ Last 24h | Last 7d | Last 30d [Refresh 30s]│
├─────────────────────────────────────────────┤
│ HOST METRICS                                │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ CPU      │ │ Memory   │ │ Disk Space   │ │
│ │ 42% avg  │ │ 64% used │ │ 78% used     │ │
│ │ ↗ 45max  │ │ ↗ 70max  │ │ ↗ 79max      │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
│                                             │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Network  │ │ Load Avg │ │ Uptime       │ │
│ │ 156 Mbps │ │ 1.2/1.5  │ │ 34 days      │ │
│ │ ↘ 145min │ │ stable   │ │ Good ✓       │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
│                                             │
│ CONTAINER STATUS (22)                       │
│ ┌─ Running (20) ─────────────────────────┐ │
│ │ ✓ rgz-api, rgz-web, rgz-portal, ...    │ │
│ └─────────────────────────────────────────┘ │
│ ┌─ Unhealthy (1) ────────────────────────┐ │
│ │ ⚠️ rgz-gateway (expected: nft rules dev)  │ │
│ └─────────────────────────────────────────┘ │
│ ┌─ No healthcheck (1) ──────────────────┐ │
│ │ ℹ️ rgz-traefik (proxy, no check needed) │ │
│ └─────────────────────────────────────────┘ │
│                                             │
│ GRAPH : CPU % Over 24h                      │
│     ↑ 100%                                  │
│     │     ╱╲╱╲      ╱╲╱╲                   │
│  50%│────╱  ╲╱────╱  ╲────                  │
│     │                                       │
│     0└───────────────────────→ 24h           │
│                                             │
│ GRAPH : Memory usage Over 24h               │
│     ↑ 31GB (full)                           │
│  25│  ╱════════╱╲   ╱════════╱╲             │
│     │ ╱        ╲ ╱ ╱        ╲  ╲            │
│     └──────────────────────────→ 24h        │
│                                             │
└─────────────────────────────────────────────┘

System Metrics

Metric	Source	Alert threshold	Frequency
CPU %	node_exporter	>80% = WARNING, >95% = CRITICAL	30s
Memory %	node_exporter	>85% = WARNING, >95% = CRITICAL	30s
Disk %	node_exporter	>85% = WARNING, >95% = CRITICAL	5 min
Network Mbps	node_exporter	>1000 = INFO (normal peak)	30s
Load average	node_exporter	> CPU cores × 1.5 = WARNING	1 min
Uptime	node_exporter	<7 = WARNING (recent reboot)	5 min
Container count	Docker API	<20 = CRITICAL (missing container)	1 min
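The percentage thresholds in this table can be read as a simple two-level classification. The sketch below is only an illustration of that logic: the `THRESHOLDS` dictionary and the function names are assumptions, not part of the monitoring stack, and values are expressed as fractions (0.80 for 80%).

```python
from typing import Optional

# Thresholds copied from the table, as (warning, critical) fractions.
# The dictionary layout and function names are illustrative assumptions.
THRESHOLDS = {
    "cpu": (0.80, 0.95),
    "memory": (0.85, 0.95),
    "disk": (0.85, 0.95),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'CRITICAL', 'WARNING', or None for a usage fraction."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return None


def load_is_high(load_avg: float, cpu_cores: int) -> bool:
    """Load average warning: above 1.5 times the number of CPU cores."""
    return load_avg > cpu_cores * 1.5


if __name__ == "__main__":
    # Values shown in the System Overview mock-up: no alert expected.
    print(classify("cpu", 0.42), classify("memory", 0.64), classify("disk", 0.78))
```

With the mock-up values (42% CPU, 64% memory, 78% disk) no alert level is returned; a disk at 96% would be classified as CRITICAL.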

PromQL:

# CPU % (busy fraction, averaged over all cores)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory % used
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# Disk % used
1 - node_filesystem_avail_bytes / node_filesystem_size_bytes

2. RADIUS Authentication

Real-time WiFi authentications, accept/reject rates, and latency.

┌─ RADIUS AUTH ───────────────────────────────┐
│ Last 1h | Last 6h | Last 24h [Refresh 10s] │
├─────────────────────────────────────────────┤
│ TOP KPIs                                    │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Auth req │ │ Accept % │ │ Latency P95  │ │
│ │ 1,234/min│ │ 98.2%    │ │ 125 ms       │ │
│ │ ↗ 1500max│ │ ↘ 99.5%  │ │ ↗ 280max     │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
│                                             │
│ GRAPH : Auth Rate (req/min) Over 1h         │
│  ↑ 2000                                     │
│  │      ╱╲      ╱╲      ╱╲                 │
│  │     ╱  ╲    ╱  ╲    ╱  ╲                │
│1000├────╱    ╲──╱    ╲──╱                  │
│  │                                         │
│  0└──────────────────────→ 1h              │
│                                             │
│ GRAPH : Accept % vs Reject % (pie)         │
│     ╭─────────────────╮                    │
│     │  Accept 98.2%   │  (green ✓)         │
│     │  Reject 1.8%    │  (red ✗)           │
│     │  - Inv.OTP 1.0% │  (yellow ⚠️)       │
│     │  - Rate limit 0.5%                   │
│     │  - Invalid pwd 0.3%                  │
│     ╰─────────────────╯                    │
│                                             │
│ LATENCY PERCENTILES (table)                 │
│ ┌──────────┬─────────┬──────────┐          │
│ │ Percentile│ Latency │ Status   │          │
│ ├──────────┼─────────┼──────────┤          │
│ │ P50 (med) │ 85 ms   │ ✓        │          │
│ │ P95       │ 125 ms  │ ✓        │          │
│ │ P99       │ 280 ms  │ ⚠️ slow  │          │
│ └──────────┴─────────┴──────────┘          │
│                                             │
└─────────────────────────────────────────────┘

RADIUS Metrics

Metric	PromQL	Alert
Auth rate	rate(freeradius_auth_requests_total[1m]) * 60	>5000/min = INFO
% Accept	sum(freeradius_auth_accepts) / sum(freeradius_auth_requests)	<95% = WARNING
% Reject	sum(freeradius_auth_rejects) / sum(freeradius_auth_requests)	>5% = WARNING
Latency P95	histogram_quantile(0.95, sum by (le) (rate(freeradius_auth_duration_ms_bucket[5m])))	>300ms = WARNING
OTP errors	sum(rate(freeradius_auth_rejects{reason="invalid_otp"}[1m])) * 60	>10/min = WARNING
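Latency percentiles such as P95 are estimated from cumulative histogram buckets rather than read directly from a single series. The following is a simplified sketch of the linear interpolation behind that estimate, not Prometheus's actual implementation; the bucket bounds and counts are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts per latency bucket (upper bounds in ms).
EXAMPLE_BUCKETS = [(50, 200), (100, 700), (250, 960), (500, 995), (math.inf, 1000)]

if __name__ == "__main__":
    print(histogram_quantile(0.50, EXAMPLE_BUCKETS))
    print(histogram_quantile(0.95, EXAMPLE_BUCKETS))
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the bucket bounds are chosen around the alert threshold (300 ms here).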

3. Revenue & Billing

Revenue per reseller, 50/50 split, and monthly invoices.

┌─ REVENUE & BILLING ─────────────────────────┐
│ Month: Feb 2026 | Period: [Feb] [Q1] [YTD]  │
│ [Refresh 5 min]                             │
├─────────────────────────────────────────────┤
│ GLOBAL KPIs                                 │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Revenue  │ │ #Invoice │ │ ARPU (target)│ │
│ │ 47.3M F  │ │ 287      │ │ 45K F   ✓    │ │
│ │ ↗ 52M max│ │ ↗ 300max │ │ 44.8K actual │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
│                                             │
│ REVENUE BY RESELLER (horizontal bar chart)  │
│ Kossou Tech    ████████ 450K F (50%)       │
│ TechConnect    ██████ 380K F (42%)         │
│ WiFi Center    █████ 320K F (35%)          │
│ Smart Telecom  ████ 280K F (31%)           │
│ (13+ resellers)                             │
│                                             │
│ SPLIT 50/50 BREAKDOWN                       │
│ Total revenue       47.3M F                 │
│ RGZ Commission      -0.7M F (1.5%)          │
│ NET                 46.6M F                 │
│ ├─ RGZ (50%)        23.3M F ✓               │
│ └─ Resellers (50%)  23.3M F 💰              │
│                                             │
│ INVOICE STATUS (table)                      │
│ ┌──────────┬──────────┬──────────────────┐ │
│ │ Reseller │ Amount   │ Status (Feb)     │ │
│ ├──────────┼──────────┼──────────────────┤ │
│ │ Kossou   │ 450K F   │ Sent 20/02 ✓     │ │
│ │ TechConn │ 380K F   │ Paid 25/02 ✓✓    │ │
│ │ WiFi Ctr │ 320K F   │ Draft (send D+2) │ │
│ └──────────┴──────────┴──────────────────┘ │
│                                             │
│ MONTHLY TREND (line series)                 │
│ 2.5M│     ╱╲     ╱╲                        │
│     │    ╱  ╲   ╱  ╲                       │
│ 2.0M├───╱    ╲─╱    ╲─────                 │
│     │                                      │
│ 1.5M└──────────────────→ (Feb, daily)      │
│                                             │
│ Payment status pie                          │
│ Draft 35% | Sent 45% | Paid 20%            │
│                                             │
└─────────────────────────────────────────────┘

Billing Metrics

Metric	Source	Calculation	Alert
Total revenue	DB invoices	SUM(montant)	Predictive (ARPU)
# Invoices	DB count	COUNT(*)	Baseline
ARPU	Revenue / # active subs	Month total / unique subs	<40K = WARNING
RGZ share	(Revenue × 0.985) × 0.5	After 1.5% commission, 50% split	Audit OK
Reseller share	(Revenue × 0.985) × 0.5	Same as above	Accuracy
Invoice status	DB status	GROUP BY status	Draft >3 days = WARNING
Payment delay	DB paid_at - created_at	days	>30 days = WARNING
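The split arithmetic (1.5% commission deducted first, then the net amount shared 50/50) can be checked with a few lines. This is a minimal sketch using the 47.3M F total from the dashboard mock-up; the function and constant names are hypothetical.

```python
from decimal import Decimal

COMMISSION_RATE = Decimal("0.015")  # 1.5% RGZ commission
SPLIT = Decimal("0.5")              # 50/50 split of the net amount


def split_revenue(total: Decimal) -> dict:
    """Break a revenue total into commission, net, and the two 50% shares."""
    commission = total * COMMISSION_RATE
    net = total - commission
    share = net * SPLIT
    return {"commission": commission, "net": net, "rgz": share, "resellers": share}


# Total revenue from the dashboard mock-up: 47.3M F.
result = split_revenue(Decimal("47300000"))

if __name__ == "__main__":
    for key, value in result.items():
        print(f"{key:>10}: {value:,.0f} F")
```

The exact values (709,500 F commission, 46,590,500 F net, 23,295,250 F per side) round to the 0.7M, 46.6M, and 23.3M figures shown in the panel. `Decimal` is used here to avoid floating-point drift on currency amounts.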

4. RF Monitoring (Signal Strength Heatmap)

WiFi signal strength across 22 cities in Benin, with RSSI degradation detection.

┌─ RF MONITORING (RSSI Heatmap) ──────────────┐
│ City heatmap | Coverage map | Trend [1h]   │
├─────────────────────────────────────────────┤
│                                             │
│  22 CITIES IN BENIN (RSSI colors)           │
│  ┌──────────────────────────────────────┐  │
│  │  RSSI legend:                        │  │
│  │  🟢 >-60dBm (excellent)              │  │
│  │  🟡 -60 to -80dBm (good)             │  │
│  │  🟠 -80 to -90dBm (fair)             │  │
│  │  🔴 <-90dBm (poor)                   │  │
│  └──────────────────────────────────────┘  │
│                                             │
│  MAP:                                       │
│  ┌─────────────────────────────────────┐   │
│  │ Cotonou        🟡 -65dBm ↘ -70      │   │
│  │ Parakou        🟡 -78dBm stable     │   │
│  │ Natitingou     🟠 -85dBm ↗ -80      │   │
│  │ Bohicon        🟡 -75dBm ↘ -78      │   │
│  │ Ouidah         🟡 -62dBm stable     │   │
│  │ (18+ cities)                        │   │
│  └─────────────────────────────────────┘   │
│                                             │
│ WORST SIGNAL SITES (attention)              │
│ ┌──────────────┬────────┬──────────────┐   │
│ │ City         │ RSSI   │ Trend        │   │
│ ├──────────────┼────────┼──────────────┤   │
│ │ Natitingou S2│ -92dBm │ ↗ improving  │   │
│ │ Kandi        │ -88dBm │ stable       │   │
│ │ Nikki        │ -85dBm │ ↗ improving  │   │
│ └──────────────┴────────┴──────────────┘   │
│                                             │
│ RSSI TREND LAST 7 DAYS (line)              │
│     ↑ -60                                   │
│     │       ╭╮                             │
│ -75 ├───╭──╯ ╰───╭───                      │
│     │  ╱          ╲                        │
│ -90 └─╯            ╰─→ 7d                  │
│                                             │
└─────────────────────────────────────────────┘

RF Metrics

Metric	Source	Alert threshold	Frequency
Median RSSI	SNMP poller (CPE)	>-60dBm ✓	5 min
Min RSSI	SNMP per city	<-90dBm = WARNING	5 min
CCQ (link quality)	SNMP CPE	<70% = WARNING	5 min
Noise floor	SNMP spectrum	<-95dBm = quiet	5 min
Interference count	Suricata rules	>5/city = WARNING	1 min
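The heatmap legend maps an RSSI reading to one of four bands. A minimal sketch of that mapping follows; the function name is hypothetical, and the handling of values exactly on a boundary (-60, -80, -90 dBm) is an assumption, since the legend does not say which side they belong to.

```python
def rssi_band(rssi_dbm: float) -> str:
    """Map an RSSI reading (dBm) to the band used in the heatmap legend.

    Boundary handling is an assumption: -60 and -80 count as 'good',
    -90 counts as 'fair'.
    """
    if rssi_dbm > -60:
        return "excellent"
    if rssi_dbm >= -80:
        return "good"
    if rssi_dbm >= -90:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # Readings taken from the dashboard mock-up.
    for site, rssi in [("Parakou", -78), ("Natitingou", -85), ("Natitingou S2", -92)]:
        print(f"{site}: {rssi} dBm -> {rssi_band(rssi)}")
```

With the mock-up readings, Parakou (-78 dBm) falls in "good", Natitingou (-85 dBm) in "fair", and Natitingou S2 (-92 dBm) in "poor".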

5. SLA & Incidents

Uptime %, P95 latency per reseller, and active P0/P1/P2 incidents.

┌─ SLA & INCIDENTS ───────────────────────────┐
│ Period: Last 30d | [7d] [90d] [YTD]        │
│ [Refresh 1 min]                             │
├─────────────────────────────────────────────┤
│ UPTIME GLOBAL                               │
│ ┌──────────────────────────────────────────┐│
│ │  99.85% uptime (target: 99.9%) ⚠️        │ │
│ │  Downtime: 1h 5min total (30d)          │ │
│ │  Last incident: P1 RADIUS auth (15min)  │ │
│ └──────────────────────────────────────────┘│
│                                             │
│ UPTIME BY RESELLER (table)                  │
│ ┌──────────────┬─────────┬───────────────┐ │
│ │ Reseller     │ Uptime  │ Downtime 30d  │ │
│ ├──────────────┼─────────┼───────────────┤ │
│ │ Kossou Tech  │ 99.92%  │ 35 min        │ │
│ │ TechConnect  │ 99.88%  │ 52 min        │ │
│ │ WiFi Center  │ 99.71%  │ 2h 5min       │ │
│ │ Smart Telecom│ 99.95%  │ 22 min        │ │
│ │ (13+ more)   │ ...     │ ...           │ │
│ └──────────────┴─────────┴───────────────┘ │
│                                             │
│ LATENCY PERCENTILES                         │
│ ┌──────────────┬──────────┬──────────────┐ │
│ │ Percentile   │ Latency  │ Status       │ │
│ ├──────────────┼──────────┼──────────────┤ │
│ │ P50 (median) │ 42 ms    │ ✓ Good       │ │
│ │ P95          │ 135 ms   │ ✓ Good       │ │
│ │ P99          │ 450 ms   │ ⚠️ Fair      │ │
│ │ Max          │ 2.3 s    │ ⚠️ Check     │ │
│ └──────────────┴──────────┴──────────────┘ │
│                                             │
│ ACTIVE INCIDENTS (P0/P1/P2)                 │
│ ├─ [P0] Core API offline        Duration: 0 min (RESOLVED)
│ ├─ [P1] RADIUS slow (5% latency) Duration: 23 min  ⏱️ ongoing
│ └─ [P2] Grafana disk 89%        Duration: 5 days (scheduled fix)
│                                             │
│ UPTIME TIMELINE (24h stacked area)          │
│  100%├─ ✓✓✓✓✓✓✓✓✓✓✓─⚠️─✓✓✓✓✓✓✓✓✓ │
│   98%├ ╭────────────╮                      │
│      │╱              ╰─────────────→ 24h   │
│                                             │
└─────────────────────────────────────────────┘

SLA Metrics

Metric	Source	Target	Alert
Uptime %	probe_success / probe_total	99.9%	<99% = WARNING
P95 latency	histogram_quantile(0.95, sum by (le) (rate(probe_duration_ms_bucket[5m])))	<150ms	>300ms = WARNING
P99 latency	histogram_quantile(0.99, sum by (le) (rate(probe_duration_ms_bucket[5m])))	<500ms	>1000ms = CRITICAL
Incident duration	incident_duration_seconds	<1h (MTTR)	>4h = escalate
Downtime 30d	sum(downtime_seconds)	<43 min (99.9%)	Predictive
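An uptime percentage and a downtime duration are two views of the same quantity over a given window. The sketch below converts between them; the function names are hypothetical and the 30-day window matches the dashboard's default period.

```python
from datetime import timedelta


def downtime_budget(uptime_target_pct: float, window: timedelta) -> timedelta:
    """Maximum downtime allowed in `window` for a given uptime target (%)."""
    return window * (1 - uptime_target_pct / 100)


def uptime_pct(downtime: timedelta, window: timedelta) -> float:
    """Uptime percentage achieved given the total downtime in `window`."""
    return 100 * (1 - downtime / window)


if __name__ == "__main__":
    window = timedelta(days=30)
    print(downtime_budget(99.9, window))               # budget for the 99.9% target
    print(uptime_pct(timedelta(minutes=15), window))   # effect of one 15-minute incident
```

Over 30 days, a 99.9% target leaves a budget of about 43 minutes, so a single 15-minute P1 incident consumes roughly a third of it.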

6. Active Alerts

Prometheus + AlertManager alert aggregator, with CPU/RAM/RSSI/latency thresholds.

┌─ ACTIVE ALERTS ─────────────────────────────┐
│ [CRITICAL] [WARNING] [INFO] | Sort: age      │
├─────────────────────────────────────────────┤
│                                             │
│ CRITICAL (🔴 Top priority)                  │
│ ├─ rgz-api CPU >95% (22 min ago)            │
│ │  Threshold: 95% | Current: 96.2%          │
│ │  [Investigate] [Silence 1h] [Dismiss]     │
│ │                                          │
│ └─ DB disk space >95% (8 min ago)           │
│    Threshold: 95% | Current: 96.1% (232GB) │
│    [Investigate] [Silence 4h] [Dismiss]     │
│                                             │
│ WARNING (🟡 Attention)                      │
│ ├─ RADIUS latency P95 >300ms (45 min ago)  │
│ │  Threshold: 300ms | Current: 325ms        │
│ │  [Investigate] [Silence 1h] [Dismiss]     │
│ │                                          │
│ ├─ WiFi Center RSSI <-85dBm (2 hours ago)  │
│ │  Threshold: -85dBm | Current: -88dBm      │
│ │  [Investigate] [Silence 1h] [Dismiss]     │
│ │                                          │
│ └─ 10 memory alerts ...                     │
│                                             │
│ INFO (ℹ️ Informational)                     │
│ └─ High network traffic rate (4 hours ago) │
│    Rate: 450 Mbps (normal peak ok)          │
│    [Acknowledge] [Dismiss]                  │
│                                             │
│ SILENCE RULES (active)                      │
│ ├─ node_cpu >85% (until 2026-02-26 15:00) │
│ └─ DB backup (weekly Mon 03:00)             │
│                                             │
└─────────────────────────────────────────────┘

Prometheus Alert Configuration

yaml

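groups:
  - name: rgz_critical
    rules:
      # Host CPU busy above 95% for 5 minutes
      - alert: NodeCPUHigh
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.95
        for: 5m
        labels:
          severity: critical

      # RADIUS auth P95 latency above 300 ms for 10 minutes
      - alert: RadiusAuthLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(freeradius_auth_duration_ms_bucket[5m]))) > 300
        for: 10m
        labels:
          severity: warning

      # Reseller uptime below 99% for 30 minutes
      - alert: ResellerUptimeLow
        expr: sla_uptime_percent < 99.0
        for: 30m
        labels:
          severity: warning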
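The `for:` clause means an alert only fires once its expression has stayed true for the whole duration; until then it is pending. The sketch below is a simplified model of that behaviour, not Prometheus's rule evaluator; the sample values and function name are made up for illustration.

```python
from typing import Iterable, List, Tuple


def alert_states(
    samples: Iterable[Tuple[int, float]], threshold: float, for_seconds: int
) -> List[Tuple[int, str]]:
    """Return (timestamp, state) per sample: inactive, pending, or firing."""
    states = []
    active_since = None  # timestamp when the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            # Condition no longer holds: the pending timer resets.
            active_since = None
            state = "inactive"
        states.append((ts, state))
    return states


# Made-up CPU busy fractions sampled every 60 seconds.
CPU_SAMPLES = list(zip(range(0, 480, 60), [0.90, 0.96, 0.97, 0.98, 0.97, 0.96, 0.99, 0.90]))

if __name__ == "__main__":
    for ts, state in alert_states(CPU_SAMPLES, threshold=0.95, for_seconds=300):
        print(ts, state)
```

In this run the condition becomes true at t=60 s, stays pending through t=300 s, fires at t=360 s (five minutes later), and returns to inactive as soon as a sample drops below the threshold.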

7. Additional Dashboards (4)

Dashboard	Panels	Audience	Refresh
Prometheus Status	Targets health, rule evaluation, cardinality	DevOps/NOC	30s
Elasticsearch/Kibana	Log volume, errors, searches, index size	Logs team	1 min
WireGuard VPN	Tunnel status, handshakes, data/tunnel	Network eng	5 min
Celery Tasks	Queue lengths, task durations, failures	Backend team	10s

Useful PromQL Queries

prometheus

# Uptime %
round((count(probe_success == 1) / count(probe_success)) * 100, 0.01)

# Top N resellers by revenue
topk(5, sum by(reseller_id) (invoice_amount_total))

# Latency P95 RADIUS
histogram_quantile(0.95, sum by (le) (rate(freeradius_auth_duration_ms_bucket[5m])))

# CPU per container (fraction of its CPU quota)
sum by (name) (rate(container_cpu_usage_seconds_total[5m])) / sum by (name) (container_spec_cpu_quota / container_spec_cpu_period)

# Memory pressure
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
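Queries like these can also be run outside Grafana through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below only builds the request URL; the base URL `http://localhost:9090` is an assumption (the default Prometheus port), not the address of this deployment.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL of a Prometheus instant query for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Memory pressure query from the list above; the base URL is hypothetical.
MEMORY_PRESSURE = (
    "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)"
    " / node_memory_MemTotal_bytes"
)

if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", MEMORY_PRESSURE))
```

The resulting URL can be fetched with any HTTP client; URL-encoding the expression matters because PromQL uses characters such as `{`, `"`, and `=` that are not safe in a query string.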

TIP

Live Debugging: click a graph, then "Inspect" to display its PromQL query. The query can be edited in real time to test alert variations.

Alert Notifications

Channels:
  1. Email: alerting@rgz.local (SMTP)
  2. SMS: +229 [DG + NOC numbers] via Letexto
  3. Dashboard: Badge "N active alerts" (live)
  4. Webhook: POST https://api-rgz.duckdns.org/webhooks/alerts (custom)

P0 escalation: Email + SMS immediately → escalate to the Director after 15 min
P1 escalation: Email only → NOC after 30 min
P2 escalation: Dashboard info (no email)
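The escalation policy can be summarised as a function of priority and time since the alert opened. This is a minimal sketch of that policy; the function name and the returned labels are hypothetical, and treating "after N min" as "N minutes or more" is an assumption.

```python
from typing import List


def notification_plan(priority: str, minutes_open: int) -> List[str]:
    """Return the notification actions due for an alert of a given age."""
    if priority == "P0":
        plan = ["email", "sms"]  # sent immediately
        if minutes_open >= 15:
            plan.append("escalate:director")
        return plan
    if priority == "P1":
        plan = ["email"]
        if minutes_open >= 30:
            plan.append("escalate:noc")
        return plan
    if priority == "P2":
        return ["dashboard"]  # informational only, no email
    raise ValueError(f"unknown priority: {priority}")


if __name__ == "__main__":
    print(notification_plan("P0", 0))
    print(notification_plan("P0", 20))
    print(notification_plan("P1", 45))
    print(notification_plan("P2", 600))
```

Keeping the policy in one pure function makes it straightforward to unit-test before wiring it to a real notifier such as the webhook channel.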

Last updated: 2026-02-24
Version: Grafana 11.x
Data sources: Prometheus 2.50.x + PostgreSQL 16 (TimescaleDB)
Alerting: AlertManager (prometheus/alertmanager/)
Hosting: docker-compose.monitoring.yml (rgz-grafana service)
