
#11: OCR Registration

PLANNED

Priority: 🔴 CRITICAL · Type: TYPE B · Container: rgz-api · Code: app/api/v1/endpoints/portal.py

Dependencies: #3 rgz-portal


Description

Automatic reading of an identity document at registration (CNI/CIP/PASSPORT) using Tesseract OCR. The service extracts the full name and validates the document. Only the SHA-256 hash of the image is stored (never the image itself), and it is used for deduplication.

Stack: Tesseract 4.x (preinstalled in docker/api/Dockerfile)


Internal Architecture

Workflow

User → Portal captures the ID document (camera/upload)

POST /api/v1/portal/ocr (multipart)

Validation: magic bytes (JPEG/PNG only, max 5 MB)

Tesseract OCR text extraction

SHA-256 hash of the image

Deduplication: if the hash already exists → known subscriber (update); otherwise → new subscriber

DB storage: subscriber.id_document_hash + full_name

JSON response: {full_name, id_number, extracted_text, image_hash, confidence}
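The hash and deduplication steps of the workflow could look roughly like the sketch below. It is only an illustration: `InMemorySubscriberStore` and its `upsert` method are hypothetical stand-ins for the real subscriber table, and the names are not taken from the project code.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InMemorySubscriberStore:
    """Hypothetical stand-in for the subscriber table, keyed by id_document_hash."""
    rows: dict[str, dict] = field(default_factory=dict)

    def upsert(self, id_document_hash: str, full_name: str) -> str:
        """Return "updated" for a known hash and "created" for a new one."""
        known = id_document_hash in self.rows
        self.rows[id_document_hash] = {"full_name": full_name}
        return "updated" if known else "created"


def hash_image(image_bytes: bytes) -> str:
    """SHA-256 hex digest of the raw bytes; the image itself is never stored."""
    return hashlib.sha256(image_bytes).hexdigest()


store = InMemorySubscriberStore()
digest = hash_image(b"fake image bytes")
first = store.upsert(digest, "Jean Kossou")   # new hash: subscriber created
second = store.upsert(digest, "Jean Kossou")  # same hash: subscriber updated
```

Because the digest is computed over the exact bytes, the same file uploaded twice maps to the same row, which is what the deduplication step relies on.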

Service OCR (app/services/ocr.py)

python
# IdDocumentType is the project's enum of accepted documents (CNI | CIP | PASSPORT).
class OcrService:
    async def extract_identity(
        self,
        image_bytes: bytes,
        document_type: IdDocumentType
    ) -> dict:
        """
        Extract identity fields with Tesseract OCR.
        Returns: {full_name, id_number, extracted_text, confidence}
        """

    async def validate_cni(self, extracted_text: str) -> bool:
        """Validate the Benin CNI format."""

    async def validate_cip(self, extracted_text: str) -> bool:
        """Validate the CIP format."""

    async def validate_passport(self, extracted_text: str) -> bool:
        """Validate the passport format."""

    async def hash_image(self, image_bytes: bytes) -> str:
        """SHA-256 hash used for deduplication."""

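One possible way to produce the `confidence` value returned by `extract_identity` is to average the per-word scores that the Tesseract CLI emits in its TSV output mode. The sketch below assumes that interpretation; the spec does not say how confidence is computed, and `mean_word_confidence` is a hypothetical helper name.

```python
import csv
import io


def mean_word_confidence(tsv_output: str) -> float:
    """Average the `conf` column over recognised words in Tesseract TSV output.

    Layout rows (page, block, line) carry conf == -1 and are skipped.
    Returns 0.0 when no word was recognised.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_output), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    scores = []
    for row in reader:
        raw_conf = row.get("conf")
        text = (row.get("text") or "").strip()
        if raw_conf is None or not text:
            continue
        conf = float(raw_conf)
        if conf >= 0:
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


# Hand-written sample in the TSV column layout, not real OCR output.
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t400\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t80\t20\t90\tJean\n"
    "5\t1\t1\t1\t1\t2\t95\t10\t90\t20\t94\tKossou\n"
)
confidence = mean_word_confidence(sample)
```

The result can then be compared against `OCR_CONFIDENCE_THRESHOLD` to decide whether the extraction is accepted.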
Configuration

Environment variables (.env)

bash
# OCR
TESSERACT_PATH=/usr/bin/tesseract
# Minimum confidence (%)
OCR_CONFIDENCE_THRESHOLD=70
# Seconds
OCR_TIMEOUT=30
UPLOAD_MAX_SIZE_MB=5
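As an illustration, these variables could be read into a typed settings object as sketched here. `OcrSettings` and `load_ocr_settings` are hypothetical names, the defaults simply repeat the values listed above, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSettings:
    tesseract_path: str
    confidence_threshold: int  # percent
    timeout_s: int             # seconds
    upload_max_bytes: int


def load_ocr_settings(env=None) -> OcrSettings:
    """Read the OCR variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return OcrSettings(
        tesseract_path=env.get("TESSERACT_PATH", "/usr/bin/tesseract"),
        confidence_threshold=int(env.get("OCR_CONFIDENCE_THRESHOLD", "70")),
        timeout_s=int(env.get("OCR_TIMEOUT", "30")),
        # Assumption: 1 MB is counted as 1024 * 1024 bytes.
        upload_max_bytes=int(env.get("UPLOAD_MAX_SIZE_MB", "5")) * 1024 * 1024,
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_ocr_settings({"OCR_TIMEOUT": "10"})
```

Converting to `int` at load time means a malformed value fails at startup rather than in the middle of a request.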

Magic bytes (file type validation)

python
MAGIC_BYTES = {
    b'\xFF\xD8\xFF': 'image/jpeg',  # JPEG
    b'\x89PNG\r\n\x1a\n': 'image/png',  # PNG
}
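A minimal sketch of how this table might be used: compare the first bytes of the upload with each signature and reject anything that matches none. `detect_mime` is a hypothetical helper name.

```python
from typing import Optional

MAGIC_BYTES = {
    b"\xFF\xD8\xFF": "image/jpeg",       # JPEG
    b"\x89PNG\r\n\x1a\n": "image/png",   # PNG
}


def detect_mime(data: bytes) -> Optional[str]:
    """Return the MIME type whose signature prefixes `data`, or None."""
    for signature, mime in MAGIC_BYTES.items():
        if data.startswith(signature):
            return mime
    return None


jpeg = detect_mime(b"\xFF\xD8\xFF\xE0" + b"\x00" * 16)
png = detect_mime(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
pdf = detect_mime(b"%PDF-1.7")  # not in the table, so it is rejected
```

The check looks only at the content, so a renamed file or a forged Content-Type header does not get past it.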

API Endpoints

| Method | Route | Body | Response | Auth | Notes |
|---|---|---|---|---|---|
| POST | /api/v1/portal/ocr | multipart: image, document_type | 200 {full_name, id_number, extracted_text, image_hash, confidence} | JWT | Tesseract extraction |
| POST | /api/v1/portal/register | {msisdn, full_name, id_document_type, image_b64, phone_manufacturer, consent} | 201 {subscriber_ref, subscriber_id} | None | Subscriber creation |
| GET | /api/v1/portal/subscribers/{hash} | - | 200 {subscriber_ref, full_name, id_count} | Admin JWT | Lookup by hash |

POST /api/v1/portal/ocr

Request:

bash
curl -X POST http://api/api/v1/portal/ocr \
  -H "Authorization: Bearer $JWT" \
  -F "image=@pièce_id.jpg" \
  -F "document_type=CNI"

Response 200:

json
{
  "full_name": "Jean Kossou",
  "id_number": "BE0123456789",
  "extracted_text": "Carte Nationale d'Identité...",
  "image_hash": "a1b2c3d4e5f6...",
  "confidence": 92
}

Response 400 (validation):

json
{
  "error": {
    "code": "ERR_OCR_INVALID_FORMAT",
    "message": "Image format not supported (JPEG/PNG only)",
    "details": {"max_size_mb": 5}
  }
}
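The 400 body above could be produced by a small validation helper along these lines. This is a sketch only: `check_upload` and `error_envelope` are hypothetical names, and `ERR_OCR_FILE_TOO_LARGE` is an invented code for the size case, which the spec does not name.

```python
from typing import Optional

MAX_SIZE_MB = 5  # mirrors UPLOAD_MAX_SIZE_MB
ALLOWED_MIME = {"image/jpeg", "image/png"}


def error_envelope(code: str, message: str, details: dict) -> dict:
    """Build the error body in the shape used by the API responses."""
    return {"error": {"code": code, "message": message, "details": details}}


def check_upload(size_bytes: int, detected_mime: Optional[str]) -> Optional[dict]:
    """Return a 400 body when the upload is rejected, None when it is accepted.

    `detected_mime` is expected to come from a magic-bytes check,
    not from the client-supplied Content-Type.
    """
    if detected_mime not in ALLOWED_MIME:
        return error_envelope(
            "ERR_OCR_INVALID_FORMAT",
            "Image format not supported (JPEG/PNG only)",
            {"max_size_mb": MAX_SIZE_MB},
        )
    if size_bytes > MAX_SIZE_MB * 1024 * 1024:
        # Hypothetical error code: the spec only shows the format error.
        return error_envelope(
            "ERR_OCR_FILE_TOO_LARGE",
            f"Image exceeds {MAX_SIZE_MB} MB",
            {"max_size_mb": MAX_SIZE_MB},
        )
    return None


accepted = check_upload(1024, "image/png")
bad_format = check_upload(1024, None)
too_large = check_upload(6 * 1024 * 1024, "image/jpeg")
```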

POST /api/v1/portal/register

Request:

json
{
  "msisdn": "+22901979799",
  "full_name": "Jean Kossou",
  "id_document_type": "CNI",
  "image_b64": "data:image/jpeg;base64,/9j/4AAQ...",
  "phone_manufacturer": "Apple",
  "consent": true
}
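Since `image_b64` arrives as a data URI, the server presumably has to strip the prefix and decode the payload before hashing. A possible sketch follows; `decode_image_b64` is a hypothetical helper, and strict base64 validation is an assumption rather than a stated requirement.

```python
import base64
import binascii
import re

# Accept only the two image types allowed by the spec.
_DATA_URI = re.compile(r"^data:(image/(?:jpeg|png));base64,(.*)$", re.DOTALL)


def decode_image_b64(image_b64: str) -> tuple[str, bytes]:
    """Split a data URI into (declared MIME type, raw bytes).

    Raises ValueError when the prefix or the base64 payload is malformed.
    """
    match = _DATA_URI.match(image_b64)
    if match is None:
        raise ValueError("expected a data:image/jpeg or data:image/png base64 URI")
    declared_mime, payload = match.groups()
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    return declared_mime, raw


# Demo payload built locally; it is not a real image.
uri = "data:image/jpeg;base64," + base64.b64encode(b"\xFF\xD8\xFF\xE0demo").decode("ascii")
mime, raw = decode_image_b64(uri)
```

The declared MIME type is client-controlled, so the decoded bytes would still need the magic-bytes and size checks before any further processing.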

Response 201:

json
{
  "subscriber_ref": "RGZ-0197979799",
  "subscriber_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "next_step": "otp_verification"
}

Response 409 (hash exists):

json
{
  "error": {
    "code": "ERR_SUBSCRIBER_EXISTS",
    "message": "Subscriber with this identity already exists",
    "details": {
      "subscriber_ref": "RGZ-0197979799",
      "subscriber_id": "550e8400-e29b-41d4-a716-446655440000"
    }
  }
}
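One way to make the existence check and the insert atomic is to let a database constraint decide and map the violation to the 409 response. The sketch below uses an in-memory SQLite database purely as a stand-in for the project's database; the `UNIQUE` constraint on `id_document_hash` and the `register` function are assumptions, and `subscriber_ref` is left out because the spec does not say how it is derived.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE subscriber (
    subscriber_id    TEXT PRIMARY KEY,
    full_name        TEXT NOT NULL,
    id_document_type TEXT NOT NULL
        CHECK (id_document_type IN ('CNI', 'CIP', 'PASSPORT')),
    id_document_hash TEXT NOT NULL UNIQUE
)
"""


def register(conn, full_name, id_document_type, id_document_hash):
    """Return (http_status, body); the UNIQUE constraint arbitrates duplicates."""
    subscriber_id = str(uuid.uuid4())  # UUID, never an int
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO subscriber VALUES (?, ?, ?, ?)",
                (subscriber_id, full_name, id_document_type, id_document_hash),
            )
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT subscriber_id FROM subscriber WHERE id_document_hash = ?",
            (id_document_hash,),
        ).fetchone()
        if row is None:
            raise  # another constraint failed, e.g. the CHECK on the type
        return 409, {"error": {
            "code": "ERR_SUBSCRIBER_EXISTS",
            "message": "Subscriber with this identity already exists",
            "details": {"subscriber_id": row[0]},
        }}
    return 201, {"subscriber_id": subscriber_id, "status": "pending",
                 "next_step": "otp_verification"}


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first_status, first_body = register(conn, "Jean Kossou", "CNI", "a1b2c3")
second_status, second_body = register(conn, "Jean Kossou", "CNI", "a1b2c3")
```

Relying on the constraint rather than a prior `SELECT` avoids the race where two concurrent requests both see "no such hash" and both insert.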

Security

| Rule | Implementation |
|---|---|
| SEC-12 | Magic bytes check (JPEG/PNG only), BEFORE any processing |
| SEC-12 | Content-Type validation (multipart/form-data) |
| SEC-12 | Max file size 5 MB; reject if exceeded |
| SHA-256 | Image hash → deduplication in the DB |
| Timeout | Tesseract timeout of 30 s max (avoid DoS) |
| Confidence | OCR confidence ≥ 70% (configurable) |
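The timeout rule could be enforced by running Tesseract as a child process with a hard time limit, as in the sketch below. `run_with_timeout` and `OcrTimeout` are hypothetical names, and the Tesseract command shown in the comment is an assumed invocation, not one taken from the project. The demo uses a Python child process so that it runs without Tesseract installed.

```python
import subprocess
import sys


class OcrTimeout(Exception):
    """Raised when the OCR process exceeds its time budget (hypothetical name)."""


def run_with_timeout(cmd: list[str], stdin_bytes: bytes, timeout_s: float) -> bytes:
    """Run `cmd`, feed it `stdin_bytes`, and return its stdout.

    subprocess.run kills the child when the timeout expires, so a slow or
    malicious image cannot hold a worker indefinitely.
    """
    try:
        result = subprocess.run(
            cmd,
            input=stdin_bytes,
            capture_output=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired as exc:
        raise OcrTimeout(f"OCR exceeded {timeout_s}s") from exc
    return result.stdout


# Assumed production command (image on stdin, text on stdout, French model):
#   ["/usr/bin/tesseract", "stdin", "stdout", "-l", "fra"]
echo = run_with_timeout(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    b"carte",
    timeout_s=30,
)
```

Inside an async endpoint, a blocking call like this would normally be moved to a worker thread or replaced by an asyncio subprocess so that the event loop stays responsive.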

Implementation TODO

  • [ ] Service app/services/ocr.py: Tesseract wrapper
  • [ ] POST /api/v1/portal/ocr: extraction + format validation
  • [ ] POST /api/v1/portal/register: atomic subscriber creation
  • [ ] GET /api/v1/portal/subscribers/{hash}: deduplication lookup
  • [ ] IdDocumentType enum validation (CNI|CIP|PASSPORT)
  • [ ] SHA-256 image hash + DB storage
  • [ ] Tesseract configuration for Benin (French language)
  • [ ] Unit tests for OCR extraction
  • [ ] E2E tests for portal registration
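For the `IdDocumentType` item, a string-valued enum is one straightforward option. The members come from the list above; the `parse_document_type` helper and its whitespace and case normalisation are assumptions.

```python
from enum import Enum


class IdDocumentType(str, Enum):
    CNI = "CNI"
    CIP = "CIP"
    PASSPORT = "PASSPORT"


def parse_document_type(raw: str) -> IdDocumentType:
    """Normalise user input; raises ValueError for anything outside the enum."""
    return IdDocumentType(raw.strip().upper())


doc_type = parse_document_type(" cni ")
```

Subclassing `str` keeps the members directly serialisable in JSON responses and comparable with the values stored in the `id_document_type` column.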

Lessons Learned

  • LL#12 (SEC-12): magic bytes check b'\xFF\xD8\xFF' (JPEG), b'\x89PNG' (PNG); reject anything else
  • LL#26: hash DB first → then Redis cache
  • LL#5: exact DB columns from the contract (id_document_hash NOT NULL, id_document_type CHECK(...))
  • LL#8: subscriber_id is a UUID (never an int)
  • LL#29: skeleton imports in multi-file prompts

Last updated: 2026-02-21

MOSAÏQUE PROJECT: 81 tools, 22 containers, 500+ WiFi Zone resellers