#76 — firmware-updater

PLANNED

Priority: 🟠 HIGH · Type: C (Celery) · Container: rgz-beat · Code: app/tasks/firmware.py · Dependencies: #39 snmp-poller


Description

Batch firmware updates for LiteBeam CPEs (relay antennas). The weekly maintenance window is Thursday 02:00-05:00 UTC (configurable). Automated workflow:

  1. Read the current firmware version via SNMP
  2. Compare it with the version available in the repository
  3. IF a newer version exists: back up the CPE config, flash, verify boot
  4. IF the update fails: automatic rollback + NOC alert
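
Step 2 of the workflow depends on comparing version strings such as 6.3.8 and 6.4.1 numerically rather than as text. A minimal sketch, assuming versions are dot-separated integers with an optional leading "v"; the helper names `parse_version` and `needs_update` are hypothetical and not part of the planned module:

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '6.3.8' (or 'v6.3.8') into (6, 3, 8) so versions compare numerically."""
    return tuple(int(part) for part in version.strip().lstrip("v").split("."))


def needs_update(current: str, available: str) -> bool:
    """True when the repository version is strictly newer than the CPE's version."""
    return parse_version(available) > parse_version(current)


print(needs_update("6.3.8", "6.4.1"))   # True
print(needs_update("6.3.8", "6.3.10"))  # True (a plain string comparison would say False)
print(needs_update("6.4.1", "6.4.1"))   # False
```

Comparing tuples of integers avoids the lexicographic trap where "6.3.10" sorts before "6.3.8".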

Goal: zero downtime for subscribers. A CPE may reboot during the window (minimal traffic); subscribers reconnect automatically.

Internal Architecture

Update Flow

Every Thursday at 02:00 UTC:

Celery Beat triggers rgz.firmware.update

SNMP poll of all CPEs:
  1. Read sysDescr (model: LiteBeam AC, LiteBeam Gen2)
  2. Read the current firmware version
  3. Compare against the firmware repo (e.g. 6.4.1 available, CPE runs 6.3.8)

For each CPE with an update available:
  1. Connect over SSH (CPE_SSH_USER / CPE_SSH_KEY_PATH)
  2. Run: /opt/ubnt/bin/ubntctl get-fw-version → parse
  3. Download firmware: curl -o /tmp/fw.tar.gz https://fw-repo/v6.4.1-LiteBeam.tar.gz
  4. Back up config: cp /cfg /cfg.backup
  5. Flash firmware: ubntctl flash /tmp/fw.tar.gz
  6. 5-minute timeout: poll the CPE in stages
     - Minute 1: fast ping (CPE still up)
     - Minutes 2-4: poll SNMP sysUpTime until it reads < 60s (just rebooted)
     - Minute 5: timeout → rollback
  7. Verify the post-boot version (SNMP)
     IF the version matches: success
     IF the version is unchanged: rollback + alert

NOC notification:
  Email: "Firmware updated: 23/150 CPE successful, 0 failed, 2 timeout"

Post-update check:
  Verify that no CPE is down (SNMP poll)
  Compare session counts before/after (network health)
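
Steps 6 and 7 of the per-CPE flow amount to a bounded polling loop followed by a version check. The sketch below is one possible shape for that logic, not the planned implementation: `wait_for_reboot`, `post_flash_action`, the 5-second polling interval and the returned action names are all assumptions, and the SNMP read is injected as a callable so the loop can be exercised without hardware.

```python
import time
from typing import Callable, Optional


def wait_for_reboot(
    read_uptime_s: Callable[[], Optional[float]],
    timeout_s: float = 300.0,      # 5 minutes per CPE
    fresh_boot_s: float = 60.0,    # sysUpTime below this means "just rebooted"
    interval_s: float = 5.0,       # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the CPE reports a fresh uptime, False when the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        uptime = read_uptime_s()  # None while the CPE is unreachable
        if uptime is not None and uptime < fresh_boot_s:
            return True
        sleep(interval_s)
    return False


def post_flash_action(rebooted: bool, version_after: Optional[str], target: str) -> str:
    """Map the outcome of steps 6 and 7 to an action (action names are hypothetical)."""
    if not rebooted:
        return "rollback"            # step 6: no fresh boot seen within the timeout
    if version_after == target:
        return "success"             # step 7: version matches the target
    return "rollback_and_alert"      # step 7: version did not change
```

A CPE that has not gone down yet still reports a large uptime, so the loop keeps waiting until it sees a value below the fresh-boot threshold or runs out of time.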

Data Schema

sql
-- Tracks each firmware update attempt (one row per CPE per run)
CREATE TABLE cpe_firmware_updates (
  id                    UUID PRIMARY KEY,
  cpe_id                UUID,        -- FK: references the CPE asset record
  cpe_nas_id            TEXT,        -- e.g. access_kossou
  cpe_model             TEXT,        -- LiteBeam AC, LiteBeam Gen2
  cpe_hostname          TEXT,
  cpe_ip_address        INET,
  version_old           TEXT,        -- e.g. 6.3.8
  version_new           TEXT,        -- e.g. 6.4.1
  firmware_filename     TEXT,        -- e.g. 6.4.1-LiteBeam.tar.gz
  update_status         TEXT CHECK (update_status IN
                          ('pending', 'in_progress', 'success', 'failed', 'rolled_back')),
  started_at            TIMESTAMP,
  completed_at          TIMESTAMP,
  duration_seconds      INT,
  error_message         TEXT,        -- set when the update failed
  config_backed_up_at   TIMESTAMP,
  rollback_executed_at  TIMESTAMP,
  sysuptime_before      INT,         -- SNMP sysUpTime, in seconds
  sysuptime_after       INT          -- SNMP sysUpTime after reboot
);

-- Firmware repository: latest known firmware per CPE model
CREATE TABLE firmware_repository (
  id               UUID PRIMARY KEY,
  cpe_model        TEXT UNIQUE,  -- LiteBeam AC, LiteBeam Gen2, etc.
  version_current  TEXT,         -- e.g. 6.4.1
  firmware_url     TEXT,         -- e.g. https://fw-repo/v6.4.1-LiteBeam.tar.gz
  checksum_sha256  TEXT,
  release_notes    TEXT,
  release_date     DATE,
  deprecated_at    DATE          -- NULL = current stable
);
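
The checksum_sha256 column exists so a downloaded image can be verified before it is flashed. A minimal sketch of that check using only the Python standard library; the function names `sha256_of_file` and `checksum_matches` are hypothetical, and the stored checksum is assumed to be a hex string:

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so large firmware images are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the stored one (case-insensitive hex)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

A mismatch would normally abort the update for that CPE before the flash step, since flashing a corrupted image is the failure the checksum is meant to prevent.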

Firmware Update Example

FIRMWARE UPDATE RUN: Thursday Feb 5, 2026

Start:                2026-02-05 02:00:00 UTC
Maintenance window:   02:00-05:00 UTC (3 hours)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INVENTORY SCAN:
  Total CPE found:    150 (SNMP poll)
  Models:
    - LiteBeam AC:    127 CPE (version 6.3.8)
    - LiteBeam Gen2:  23 CPE (version 6.2.5)

NEW FIRMWARE AVAILABLE:
  LiteBeam AC:    6.4.1 (released 2026-01-15)
  LiteBeam Gen2:  6.3.2 (released 2026-02-01)

CANDIDATES FOR UPDATE:
  LiteBeam AC:    127 CPE (all)
  LiteBeam Gen2:  23 CPE (all)
  Total:          150 CPE

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

UPDATE EXECUTION:

Batch 1/5 (30 CPE × LiteBeam AC):
  CPE-COTO-001: 6.3.8 → 6.4.1 [30s] SUCCESS ✓
  CPE-COTO-002: 6.3.8 → 6.4.1 [28s] SUCCESS ✓
  ...
  CPE-COTO-030: 6.3.8 → 6.4.1 [32s] SUCCESS ✓

Batch 2/5 (30 CPE × LiteBeam AC):
  CPE-KOSY-001: 6.3.8 → 6.4.1 [35s] SUCCESS ✓
  ...
  CPE-KOSY-023: 6.3.8 → 6.4.1 [Timeout 300s] TIMEOUT ⚠️
                Config backed up. Manual rollback needed.

[Continue batches...]

SUMMARY:
  Total processed:    150 CPE
  Successful:         148 CPE (98.7%)
  Failed:             0 CPE
  Timeout:            2 CPE (CPE-KOSY-023, CPE-IKOM-045)
  Rolled back:        0 CPE

  Session impact:     0 (all reconnected auto)
  SLA impact:         0 (within maintenance window)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

VALIDATION POST-UPDATE:
  SNMP poll:      All CPE responding ✓
  Routes:         BGP stable ✓
  Throughput:     Normal ✓

ALERTS:
  P2 Alert: "2 CPE timeout during firmware update: manual check required"
      Recipients: noc@rgz.local
      Action: SSH CPE-KOSY-023, run 'ubntctl get-fw-version', rollback if needed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

End:            2026-02-05 03:47:23 UTC (duration: 1h 47min)
Next update:    2026-02-12 02:00:00 UTC (weekly)
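
The SUMMARY figures and the NOC email line can be derived from the per-CPE outcomes. The following is a rough sketch only: `summarize` and `noc_subject` are hypothetical names, and "timeout" is treated as its own outcome here because the report counts it separately, even though the update_status constraint in the schema does not list it.

```python
from collections import Counter
from typing import Iterable


def summarize(statuses: Iterable[str]) -> dict:
    """Count per-CPE outcomes and compute the success rate shown in the report."""
    counts = Counter(statuses)
    total = sum(counts.values())
    success = counts["success"]
    return {
        "total": total,
        "success": success,
        "failed": counts["failed"],
        "timeout": counts["timeout"],          # assumed outcome label
        "rolled_back": counts["rolled_back"],
        "success_pct": round(100 * success / total, 1) if total else 0.0,
    }


def noc_subject(summary: dict) -> str:
    """Build the one-line NOC notification in the format used by the update flow."""
    return (
        f"Firmware updated: {summary['success']}/{summary['total']} CPE successful, "
        f"{summary['failed']} failed, {summary['timeout']} timeout"
    )


run = summarize(["success"] * 148 + ["timeout"] * 2)
print(run["success_pct"])  # 98.7
print(noc_subject(run))    # Firmware updated: 148/150 CPE successful, 0 failed, 2 timeout
```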

Configuration

env
# Firmware Update Schedule
FIRMWARE_UPDATE_ENABLED=true
FIRMWARE_UPDATE_DAY_OF_WEEK=3             # 0=Mon, 3=Thu, 6=Sun
FIRMWARE_UPDATE_HOUR_UTC=2
FIRMWARE_UPDATE_DURATION_MAX_MIN=120      # Max run time: 2h, inside the 02:00-05:00 window

# Firmware Repository
FIRMWARE_REPO_URL=https://fw-repo.rgz.bj/firmware
FIRMWARE_CACHE_DIR=/tmp/firmware
FIRMWARE_CHECK_INTERVAL_DAYS=7            # Check new FW weekly

# CPE SSH Access
CPE_SSH_USER=ubnt
CPE_SSH_KEY_PATH=/home/rgz-app/.ssh/cpe_rsa
CPE_SSH_TIMEOUT_SEC=30
CPE_SSH_KNOWN_HOSTS_FILE=/home/rgz-app/.ssh/known_hosts

# Update Behavior
FIRMWARE_UPDATE_BATCH_SIZE=30             # Parallel CPE (firewall limit)
FIRMWARE_UPDATE_TIMEOUT_SEC=300           # 5min per CPE
FIRMWARE_ROLLBACK_ON_FAILURE=true
FIRMWARE_VERIFY_CHECKSUM=true             # SHA256 check after download

# Monitoring
FIRMWARE_ALERT_ON_TIMEOUT=true
FIRMWARE_ALERT_ON_FAILURE=true
FIRMWARE_PROMETHEUS_EXPORT=true

# Skip List (CPEs to exclude from updates)
FIRMWARE_SKIP_CPE_LIST=CPE-LAB-001,CPE-TEST-999  # Dev/test CPE
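
To show how the batch size and the skip list interact, here is a small sketch that reads a subset of these variables and splits an inventory into batches. `FirmwareConfig`, `load_config` and `make_batches` are hypothetical names, and the sketch assumes the environment loader has already stripped inline comments from the values.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class FirmwareConfig:
    enabled: bool
    batch_size: int
    timeout_sec: int
    skip_list: frozenset


def load_config(env: Mapping[str, str]) -> FirmwareConfig:
    """Parse a subset of the variables above from a mapping such as os.environ."""
    skip = env.get("FIRMWARE_SKIP_CPE_LIST", "")
    return FirmwareConfig(
        enabled=env.get("FIRMWARE_UPDATE_ENABLED", "false").lower() == "true",
        batch_size=int(env.get("FIRMWARE_UPDATE_BATCH_SIZE", "30")),
        timeout_sec=int(env.get("FIRMWARE_UPDATE_TIMEOUT_SEC", "300")),
        skip_list=frozenset(name.strip() for name in skip.split(",") if name.strip()),
    )


def make_batches(hostnames: Sequence[str], config: FirmwareConfig) -> list[list[str]]:
    """Drop skipped CPEs, then split the rest into batches of batch_size."""
    eligible = [host for host in hostnames if host not in config.skip_list]
    size = config.batch_size
    return [eligible[i:i + size] for i in range(0, len(eligible), size)]
```

With 150 eligible CPEs and a batch size of 30, this yields the five batches shown in the example run.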

API Endpoints

Method | Route | Description | Response
GET | /api/v1/firmware/updates/available | List available firmware | List[{cpe_model, current_version, new_version}]
POST | /api/v1/firmware/updates/schedule | Schedule an update (admin) | 202 Accepted +
GET | /api/v1/firmware/updates/status?task_id={id} | Live update tracking |
GET | /api/v1/firmware/updates/history?days=30 | Update history | List[update_run]
GET | /api/v1/cpe/{cpe_id}/firmware | Current CPE firmware version |
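
As an illustration of calling the history endpoint from a script, the sketch below builds the authenticated request with the standard library. The helper name `history_request` is hypothetical, and the Bearer token scheme is an assumption taken from the curl commands in this document; the response format is not specified here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def history_request(base_url: str, token: str, days: int = 30) -> Request:
    """Build a GET request for /api/v1/firmware/updates/history with a Bearer token."""
    query = urlencode({"days": days})
    url = f"{base_url.rstrip('/')}/api/v1/firmware/updates/history?{query}"
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")
```

The returned object can be passed to `urllib.request.urlopen` to perform the call; building the request separately keeps the URL and header logic easy to check without network access.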

Celery Task

Field | Value
Task name | rgz.firmware.update
Schedule | Weekly Thursday 02:00 UTC (0 2 * * 4)
Queue | rgz.maintenance
Timeout | 7200s (2 hours)
Retry | 1x (critical task, manual intervention if fails)
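
The cron expression above numbers Thursday as 4 (cron counts from 0=Sunday), while FIRMWARE_UPDATE_DAY_OF_WEEK uses 3 for Thursday (0=Monday). A small sketch of the conversion between the two conventions, assuming the schedule is generated from the environment values; `env_dow_to_cron` and `cron_expression` are hypothetical helper names:

```python
def env_dow_to_cron(day: int) -> int:
    """Convert 0=Mon..6=Sun (env convention) to cron's 0=Sun..6=Sat."""
    if not 0 <= day <= 6:
        raise ValueError(f"day of week out of range: {day}")
    return (day + 1) % 7


def cron_expression(day_of_week: int, hour_utc: int) -> str:
    """Build a weekly cron expression firing at minute 0 of the given UTC hour."""
    return f"0 {hour_utc} * * {env_dow_to_cron(day_of_week)}"


print(cron_expression(3, 2))  # 0 2 * * 4
```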

Useful Commands

bash
# Trigger a firmware update manually
docker-compose exec rgz-api celery -A app.celery_app call rgz.firmware.update

# List CPE firmware versions (SNMP poll)
docker-compose exec rgz-api python3 -c "
from app.services.snmp import SNMPPoller
poller = SNMPPoller()
for cpe in poller.get_all_cpe():
    version = poller.get_firmware_version(cpe.ip)
    print(f'{cpe.hostname}: {version}')
" | sort

# Check the firmware repo
curl https://fw-repo.rgz.bj/firmware/ | grep -E "\.tar\.gz|\.txt" | head

# Fetch the update history
curl -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/firmware/updates/history?days=30" | jq

# Check an update in progress (live status)
curl -H "Authorization: Bearer {admin_token}" \
  "http://api-rgz.duckdns.org/api/v1/firmware/updates/status?task_id={task_id}" | jq

# SSH into a CPE directly (troubleshoot a timeout)
ssh -i /home/rgz-app/.ssh/cpe_rsa ubnt@10.142.0.1 "ubntctl get-fw-version"

# Firmware update logs
docker-compose logs rgz-beat | grep "firmware.update"

# Check CPE connectivity (debug)
for ip in 10.142.0.{1..10}; do
  echo -n "$ip: "
  timeout 2 ping -c 1 $ip > /dev/null && echo "OK" || echo "FAIL"
done

Implementation TODO

  • [ ] DB schema: cpe_firmware_updates + firmware_repository
  • [ ] Celery task rgz.firmware.update in app/tasks/firmware.py
  • [ ] Function _get_cpe_inventory_snmp() (poll all CPEs)
  • [ ] Function _get_available_firmware() (fetch repo)
  • [ ] Function _update_cpe_batch() (SSH flash of multiple CPEs)
  • [ ] Function _verify_update_success() (post-boot version check)
  • [ ] Function _rollback_cpe() (restore backup config)
  • [ ] API endpoints GET/POST /api/v1/firmware/*
  • [ ] SSH key management (secure storage, rotation)
  • [ ] Firmware checksum verification (SHA256)
  • [ ] Email notification (success, failure, timeout summary)
  • [ ] Tests: dry-run firmware update (simulate timeout, rollback)
  • [ ] Documentation: SOP for manual CPE updates, emergency procedures

Last updated: 2026-02-21

MOSAÏQUE PROJECT: 81 tools, 22 containers, 500+ WiFi Zone resellers