#76 — firmware-updater
PLANNED
Priority: 🟠 HIGH · Type: C (Celery) · Container: rgz-beat · Code: app/tasks/firmware.py
Dependencies: #39 snmp-poller
Description
Batch firmware update for LiteBeam CPEs (relay antennas). Weekly maintenance window: Thursday 02:00-05:00 UTC (configurable). Automated workflow:
- Read the current firmware version over SNMP
- Compare it with the version available in the firmware repository
- IF a newer version exists: back up the CPE config, flash, verify boot
- IF that fails: automatic rollback + NOC alert
Objective: zero downtime for subscribers. A CPE may reboot during the window (minimal traffic); subscribers reconnect automatically.
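The "compare" step of the workflow above needs a numeric comparison, since plain string comparison would rank 6.10.0 below 6.9.0. A minimal sketch, assuming versions are dotted integers such as 6.3.8; the helper names `parse_version` and `needs_update` are illustrative and not taken from the codebase:

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract a dotted version such as '6.3.8' and return it as (6, 3, 8)."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no dotted version found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def needs_update(current: str, available: str) -> bool:
    """True only when the repository version is strictly newer than the CPE's."""
    return parse_version(available) > parse_version(current)
```

Comparing tuples of integers keeps a CPE that already runs the repository version (or a newer one) out of the candidate list.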
Internal Architecture
Update Flow
Every Thursday at 02:00 UTC:
↓
Celery Beat triggers rgz.firmware.update
↓
SNMP poll of all CPEs:
1. Read sysDescr (model: LiteBeam AC, LiteBeam Gen2)
2. Read the current firmware version
3. Compare against the firmware repo (e.g. 6.4.1 available, CPE runs 6.3.8)
↓
For each CPE with an update available:
1. SSH connect (CPE_SSH_USER / CPE_SSH_KEY_PATH)
2. Run: /opt/ubnt/bin/ubntctl get-fw-version → parse
3. Download FW: curl -o /tmp/fw.tar.gz https://fw-repo/v6.4.1-LiteBeam.tar.gz
4. Back up config: cp /cfg /cfg.backup
5. Flash FW: ubntctl flash /tmp/fw.tar.gz
6. 5 min timeout: ping loop against the CPE with progressive timeout
   - Minute 1: fast ping (CPE still up)
   - Minutes 2-4: poll SNMP sysUpTime < 60s (just rebooted)
   - Minute 5: timeout → rollback
7. Verify the post-boot version (SNMP)
   IF version OK: success
   IF version unchanged: rollback + alert
↓
NOC notification:
Email: "Firmware updated: 23/150 CPE successful, 0 failed, 2 timeout"
↓
Post-update check:
Verify that no CPE is down (SNMP poll)
Compare sessions before/after (network health)
Data Schema
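Before the tables themselves: steps 6 and 7 of the update flow decide which outcome is recorded for each CPE. A minimal sketch of that wait-and-verify logic, assuming a hypothetical `probe` callable (for instance an SNMP read) that returns `(sysUpTime in seconds, firmware version)`, or `None` while the CPE is unreachable. The `Outcome` names and the 10-second polling cadence are illustrative, not taken from the implementation:

```python
import time
from enum import Enum
from typing import Callable, Optional, Tuple

Reading = Optional[Tuple[int, str]]  # (sysUpTime seconds, firmware version) or None


class Outcome(Enum):
    SUCCESS = "success"                      # rebooted and reports the new version
    VERSION_UNCHANGED = "version_unchanged"  # rebooted but still on the old version
    TIMEOUT = "timeout"                      # no fresh boot seen before the deadline


def wait_and_verify(
    probe: Callable[[], Reading],
    expected_version: str,
    timeout_sec: float = 300.0,       # FIRMWARE_UPDATE_TIMEOUT_SEC
    poll_interval_sec: float = 10.0,  # assumed polling cadence
    fresh_boot_max_sec: int = 60,     # "sysUpTime < 60s" in the flow
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Outcome:
    deadline = clock() + timeout_sec
    while clock() < deadline:
        reading = probe()
        if reading is not None:
            uptime_sec, version = reading
            # A low uptime means the CPE rebooted after the flash started;
            # a high uptime means it has not restarted yet, so keep waiting.
            if uptime_sec < fresh_boot_max_sec:
                if version == expected_version:
                    return Outcome.SUCCESS
                return Outcome.VERSION_UNCHANGED
        sleep(poll_interval_sec)
    return Outcome.TIMEOUT
```

Checking sysUpTime before the version avoids reading the old version from a CPE that has not rebooted yet. The caller would treat `VERSION_UNCHANGED` and `TIMEOUT` as the rollback-and-alert cases; `sleep` and `clock` are injectable so the loop can be exercised in a dry run without waiting five minutes.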
sql
-- Tracks every per-CPE firmware update attempt (PostgreSQL)
CREATE TABLE cpe_firmware_updates (
    id                   UUID PRIMARY KEY,
    cpe_id               UUID,       -- FK to the CPE asset record
    cpe_nas_id           TEXT,       -- e.g. access_kossou
    cpe_model            TEXT,       -- LiteBeam AC, LiteBeam Gen2
    cpe_hostname         TEXT,
    cpe_ip_address       INET,
    version_old          TEXT,       -- e.g. 6.3.8
    version_new          TEXT,       -- e.g. 6.4.1
    firmware_filename    TEXT,       -- e.g. 6.4.1-LiteBeam.tar.gz
    update_status        TEXT CHECK (update_status IN
                         ('pending', 'in_progress', 'success', 'failed', 'rolled_back')),
    started_at           TIMESTAMP,
    completed_at         TIMESTAMP,
    duration_seconds     INT,
    error_message        TEXT,       -- set when the update failed
    config_backed_up_at  TIMESTAMP,
    rollback_executed_at TIMESTAMP,
    sysuptime_before     INT,        -- SNMP sysUpTime, in seconds
    sysuptime_after      INT         -- SNMP sysUpTime after reboot
);
-- Firmware repository: one row per CPE model
CREATE TABLE firmware_repository (
    id               UUID PRIMARY KEY,
    cpe_model        TEXT UNIQUE,    -- LiteBeam AC, LiteBeam Gen2, etc.
    version_current  TEXT,           -- e.g. 6.4.1
    firmware_url     TEXT,           -- e.g. https://fw-repo/v6.4.1-LiteBeam.tar.gz
    checksum_sha256  TEXT,
    release_notes    TEXT,
    release_date     DATE,
    deprecated_at    DATE            -- NULL = current stable
);
Firmware Update Example
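The run shown in this section processes 150 CPEs in five batches of 30 (`FIRMWARE_UPDATE_BATCH_SIZE`). A minimal sketch of how candidates might be filtered against the skip list and split into batches; the function names are illustrative, and the `CPE-DEMO-*` hostnames in the usage lines are made up:

```python
from itertools import islice
from typing import Iterable, Iterator, List, Set


def select_candidates(hostnames: Iterable[str], skip: Set[str]) -> List[str]:
    """Drop CPEs named in the skip list (FIRMWARE_SKIP_CPE_LIST), keeping order."""
    return [name for name in hostnames if name not in skip]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage: 150 production CPEs plus one lab unit that is on the skip list.
cpes = [f"CPE-DEMO-{i:03d}" for i in range(1, 151)] + ["CPE-LAB-001"]
batches = list(batched(select_candidates(cpes, {"CPE-LAB-001", "CPE-TEST-999"}), 30))
print(len(batches), len(batches[0]))  # 5 30
```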
FIRMWARE UPDATE RUN - Thursday Feb 5, 2026
Start: 2026-02-05 02:00:00 UTC
Maintenance window: 02:00-05:00 UTC (3 hours)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
INVENTORY SCAN:
Total CPE found: 150 (SNMP poll)
Models:
- LiteBeam AC: 127 CPE (version 6.3.8)
- LiteBeam Gen2: 23 CPE (version 6.2.5)
NEW FIRMWARE AVAILABLE:
LiteBeam AC: 6.4.1 (released 2026-01-15)
LiteBeam Gen2: 6.3.2 (released 2026-02-01)
CANDIDATES FOR UPDATE:
LiteBeam AC: 127 CPE (all)
LiteBeam Gen2: 23 CPE (all)
Total: 150 CPE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
UPDATE EXECUTION:
Batch 1/5 (30 CPE × LiteBeam AC):
CPE-COTO-001: 6.3.8 → 6.4.1 [30s] SUCCESS ✓
CPE-COTO-002: 6.3.8 → 6.4.1 [28s] SUCCESS ✓
...
CPE-COTO-030: 6.3.8 → 6.4.1 [32s] SUCCESS ✓
Batch 2/5 (30 CPE × LiteBeam AC):
CPE-KOSY-001: 6.3.8 → 6.4.1 [35s] SUCCESS ✓
...
CPE-KOSY-023: 6.3.8 → 6.4.1 [Timeout 300s] TIMEOUT ⚠️
Config backed up. Manual rollback needed.
[Continue batches...]
SUMMARY:
Total processed: 150 CPE
Successful: 148 CPE (98.7%)
Failed: 0 CPE
Timeout: 2 CPE (CPE-KOSY-023, CPE-IKOM-045)
Rolled back: 0 CPE
Session impact: 0 (all reconnected automatically)
SLA impact: 0 (within maintenance window)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
POST-UPDATE VALIDATION:
SNMP poll: All CPE responding ✓
Routes: BGP stable ✓
Throughput: Normal ✓
ALERTS:
P2 Alert: "2 CPE timeout during firmware update - manual check required"
Recipients: noc@rgz.local
Action: SSH CPE-KOSY-023, run 'ubntctl get-fw-version', rollback if needed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
End: 2026-02-05 03:47:23 UTC (duration: 1h 47min)
Next update: 2026-02-12 02:00:00 UTC (weekly)
Configuration
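A minimal sketch of how the task might load a few of the variables in this section into a typed object. The `FirmwareSettings` class and the fallback defaults are assumptions for illustration: the defaults mirror the values listed here, except that updates stay disabled when the flag is unset. Inline `#` comments are assumed to be stripped by whatever loads the env file.

```python
import os
from dataclasses import dataclass
from typing import FrozenSet, Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class FirmwareSettings:
    enabled: bool
    day_of_week: int  # 0=Mon ... 6=Sun, as in the env file
    hour_utc: int
    batch_size: int
    timeout_sec: int
    skip_cpe: FrozenSet[str]

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "FirmwareSettings":
        skip_raw = env.get("FIRMWARE_SKIP_CPE_LIST", "")
        return cls(
            enabled=_as_bool(env.get("FIRMWARE_UPDATE_ENABLED", "false")),
            day_of_week=int(env.get("FIRMWARE_UPDATE_DAY_OF_WEEK", "3")),
            hour_utc=int(env.get("FIRMWARE_UPDATE_HOUR_UTC", "2")),
            batch_size=int(env.get("FIRMWARE_UPDATE_BATCH_SIZE", "30")),
            timeout_sec=int(env.get("FIRMWARE_UPDATE_TIMEOUT_SEC", "300")),
            # Comma-separated hostnames; blanks and stray spaces are ignored.
            skip_cpe=frozenset(
                name.strip() for name in skip_raw.split(",") if name.strip()
            ),
        )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to test in isolation.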
env
# Firmware Update Schedule
FIRMWARE_UPDATE_ENABLED=true
FIRMWARE_UPDATE_DAY_OF_WEEK=3 # 0=Mon, 3=Thu, 6=Sun
FIRMWARE_UPDATE_HOUR_UTC=2
FIRMWARE_UPDATE_DURATION_MAX_MIN=120 # 02:00-04:00 (2h window)
# Firmware Repository
FIRMWARE_REPO_URL=https://fw-repo.rgz.bj/firmware
FIRMWARE_CACHE_DIR=/tmp/firmware
FIRMWARE_CHECK_INTERVAL_DAYS=7 # Check new FW weekly
# CPE SSH Access
CPE_SSH_USER=ubnt
CPE_SSH_KEY_PATH=/home/rgz-app/.ssh/cpe_rsa
CPE_SSH_TIMEOUT_SEC=30
CPE_SSH_KNOWN_HOSTS_FILE=/home/rgz-app/.ssh/known_hosts
# Update Behavior
FIRMWARE_UPDATE_BATCH_SIZE=30 # Parallel CPE (firewall limit)
FIRMWARE_UPDATE_TIMEOUT_SEC=300 # 5min per CPE
FIRMWARE_ROLLBACK_ON_FAILURE=true
FIRMWARE_VERIFY_CHECKSUM=true # SHA256 check after download
# Monitoring
FIRMWARE_ALERT_ON_TIMEOUT=true
FIRMWARE_ALERT_ON_FAILURE=true
FIRMWARE_PROMETHEUS_EXPORT=true
# Skip list (CPEs that must not be updated)
FIRMWARE_SKIP_CPE_LIST=CPE-LAB-001,CPE-TEST-999 # Dev/test CPE
API Endpoints
| Method | Route | Description | Response |
|---|---|---|---|
| GET | /api/v1/firmware/updates/available | List available firmware | List[{cpe_model, current_version, new_version}] |
| POST | /api/v1/firmware/updates/schedule | Schedule an update (admin) | 202 Accepted + |
| GET | /api/v1/firmware/updates/status?task_id={id} | Live update tracking | |
| GET | /api/v1/firmware/updates/history?days=30 | Update history | List[update_run] |
| GET | /api/v1/cpe/{cpe_id}/firmware | Current CPE version | |
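As an illustration of the first row, here is a sketch of how the `/updates/available` payload could be assembled from an SNMP inventory and the repository table. Whether the endpoint returns one entry per model/version pair (as assumed here) or one per CPE is not specified; `build_available_updates` and its inputs are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def _version_key(version: str) -> Tuple[int, ...]:
    """'6.4.1' -> (6, 4, 1), so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def build_available_updates(
    inventory: Iterable[Tuple[str, str]],  # (cpe_model, current_version) per CPE
    repo: Dict[str, str],                  # cpe_model -> latest version in the repo
) -> List[Dict[str, str]]:
    updates: List[Dict[str, str]] = []
    seen = set()
    for model, current in inventory:
        latest = repo.get(model)
        # Skip unknown models and model/version pairs already reported.
        if latest is None or (model, current) in seen:
            continue
        seen.add((model, current))
        if _version_key(latest) > _version_key(current):
            updates.append(
                {"cpe_model": model, "current_version": current, "new_version": latest}
            )
    return updates
```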
Celery Task
| Champ | Valeur |
|---|---|
| Task name | rgz.firmware.update |
| Schedule | Weekly Thursday 02:00 UTC (0 2 * * 4) |
| Queue | rgz.maintenance |
| Timeout | 7200s (2 hours) |
| Retry | 1x (critical task, manual intervention if fails) |
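The Schedule row uses cron numbering (0 = Sunday, so 4 = Thursday), whereas `FIRMWARE_UPDATE_DAY_OF_WEEK` counts from Monday (0=Mon, 3=Thu). A small sketch of the conversion, assuming the schedule is derived from the env settings; `cron_expression` is an illustrative helper, not an existing function:

```python
def cron_expression(day_of_week: int, hour_utc: int) -> str:
    """Build a 5-field cron expression from a Monday-based weekday and a UTC hour."""
    if not 0 <= day_of_week <= 6:
        raise ValueError("day_of_week must be in 0..6 (0=Mon, 6=Sun)")
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be in 0..23")
    # Config: 0=Mon ... 6=Sun; cron: 0=Sun ... 6=Sat. Shift by one and wrap.
    cron_dow = (day_of_week + 1) % 7
    return f"0 {hour_utc} * * {cron_dow}"


print(cron_expression(3, 2))  # 0 2 * * 4
```

Celery's `crontab` schedule numbers days the same way as cron (Sunday = 0), so passing the raw config value 3 would fire on Wednesday.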
Useful Commands
bash
# Trigger a firmware update manually
docker-compose exec rgz-api celery -A app.celery_app call rgz.firmware.update
# List CPE firmware versions (SNMP poll); -T disables the TTY so the output can be piped
docker-compose exec -T rgz-api python3 -c "
from app.services.snmp import SNMPPoller
poller = SNMPPoller()
for cpe in poller.get_all_cpe():
    version = poller.get_firmware_version(cpe.ip)
    print(f'{cpe.hostname}: {version}')
" | sort
# Check the firmware repo
curl -s https://fw-repo.rgz.bj/firmware/ | grep -E "\.tar\.gz|\.txt" | head
# Fetch the update history
curl -H "Authorization: Bearer {admin_token}" \
"http://api-rgz.duckdns.org/api/v1/firmware/updates/history?days=30" | jq
# Check an update in progress (live status)
curl -H "Authorization: Bearer {admin_token}" \
"http://api-rgz.duckdns.org/api/v1/firmware/updates/status?task_id={task_id}" | jq
# SSH into a CPE directly (troubleshoot a timeout)
ssh -i /home/rgz-app/.ssh/cpe_rsa ubnt@10.142.0.1 "ubntctl get-fw-version"
# Firmware update logs
docker-compose logs rgz-beat | grep "firmware.update"
# Check CPE connectivity (debug)
for ip in 10.142.0.{1..10}; do
    echo -n "$ip: "
    timeout 2 ping -c 1 "$ip" > /dev/null && echo "OK" || echo "FAIL"
done
Implementation TODO
- [ ] DB schema: cpe_firmware_updates + firmware_repository
- [ ] Celery task rgz.firmware.update in app/tasks/firmware.py
- [ ] Function _get_cpe_inventory_snmp() (poll all CPEs)
- [ ] Function _get_available_firmware() (fetch repo)
- [ ] Function _update_cpe_batch() (SSH flash, multi-CPE)
- [ ] Function _verify_update_success() (post-boot version check)
- [ ] Function _rollback_cpe() (restore the backed-up config)
- [ ] API endpoints GET/POST /api/v1/firmware/*
- [ ] SSH key management (secure storage, rotation)
- [ ] Firmware checksum verification (SHA256)
- [ ] Email notification (success, failure, timeout summary)
- [ ] Tests: dry-run firmware update (simulate timeout, rollback)
- [ ] Documentation: SOP for manual CPE updates, emergency procedures
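For the checksum item above, a minimal sketch using only the standard library. It assumes the expected digest comes from `firmware_repository.checksum_sha256` as a hex string; `sha256_of` and `verify_firmware` are illustrative names:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so large images are never fully loaded in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_firmware(path: Path, expected_sha256: str) -> bool:
    """Compare the downloaded file against the repository checksum (case-insensitive)."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

Running this check after the download and before the flash step means a corrupted or truncated image is rejected while the CPE is still on its working firmware.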
Last updated: 2026-02-21