#11: OCR Registration
PLANNED
Priority: 🔴 CRITICAL · Type: TYPE B · Container: rgz-api · Code: app/api/v1/endpoints/portal.py
Dependencies: #3 rgz-portal
Description
Automatic reading of the identity document at registration (CNI/CIP/PASSPORT) using Tesseract OCR. The full name is extracted and the document is validated. Only the SHA-256 hash of the image is stored (never the image itself), for deduplication.
Stack: Tesseract 4.x (pre-installed in docker/api/Dockerfile)
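The hash-only storage rule described above can be sketched as follows. This is a minimal illustration, not the planned implementation: `find_or_register` and the `known` dictionary are hypothetical stand-ins for the subscriber table lookup.

```python
import hashlib


def hash_image(image_bytes: bytes) -> str:
    """Return the hex SHA-256 digest of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_or_register(image_bytes: bytes, full_name: str, known: dict) -> tuple:
    """Return (image_hash, is_new); only the hash and the name are kept.

    `known` is a hypothetical in-memory stand-in for the subscriber table,
    mapping id_document_hash -> full_name.
    """
    image_hash = hash_image(image_bytes)
    is_new = image_hash not in known
    known[image_hash] = full_name  # insert, or update the known subscriber
    return image_hash, is_new
```

A byte-exact hash only matches identical files: a second photo of the same document produces a different digest, so this deduplicates re-uploads of the same image rather than identities.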
Internal Architecture
Workflow
User → Portal captures the ID document (camera/upload)
↓
POST /api/v1/portal/ocr (multipart)
↓
Validation: magic bytes (JPEG/PNG only, max 5 MB)
↓
Tesseract OCR text extraction
↓
SHA-256 hash of the image
↓
Deduplication: if the hash exists → known subscriber (update), otherwise → new subscriber
↓
DB storage: subscriber.id_document_hash + full_name
↓
JSON response: {full_name, id_number, extracted_text}
OCR Service (app/services/ocr.py)
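```python
from __future__ import annotations  # IdDocumentType is not evaluated at import time

import hashlib


class OcrService:
    async def extract_identity(
        self,
        image_bytes: bytes,
        document_type: IdDocumentType,  # planned enum: CNI | CIP | PASSPORT
    ) -> dict:
        """
        Tesseract OCR extraction.
        Returns: {full_name, id_number, extracted_text, confidence}
        """
        raise NotImplementedError  # planned

    async def validate_cni(self, extracted_text: str) -> bool:
        """Validate the Benin CNI format."""
        raise NotImplementedError  # planned

    async def validate_cip(self, extracted_text: str) -> bool:
        """Validate the CIP format."""
        raise NotImplementedError  # planned

    async def validate_passport(self, extracted_text: str) -> bool:
        """Validate the passport format."""
        raise NotImplementedError  # planned

    async def hash_image(self, image_bytes: bytes) -> str:
        """SHA-256 hash used for deduplication."""
        return hashlib.sha256(image_bytes).hexdigest()
```
Configuration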
Environment variables (.env)
```bash
# OCR
TESSERACT_PATH=/usr/bin/tesseract
OCR_CONFIDENCE_THRESHOLD=70  # minimum confidence, in %
OCR_TIMEOUT=30               # seconds
UPLOAD_MAX_SIZE_MB=5
```
Magic bytes (file type validation)
```python
MAGIC_BYTES = {
    b'\xFF\xD8\xFF': 'image/jpeg',        # JPEG
    b'\x89PNG\r\n\x1a\n': 'image/png',    # PNG
}
```
API Endpoints
| Method | Route | Body | Response | Auth | Notes |
|---|---|---|---|---|---|
| POST | /api/v1/portal/ocr | multipart: image, document_type | 200 {full_name, id_number, extracted_text, image_hash, confidence} | JWT | Tesseract extraction |
| POST | /api/v1/portal/register | {msisdn, full_name, id_document_type, image_b64, phone_manufacturer, consent} | 201 {subscriber_ref, subscriber_id} | None | Subscriber creation |
| GET | /api/v1/portal/subscribers/{hash} | - | 200 {subscriber_ref, full_name, id_count} | Admin JWT | Lookup by hash |
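The register endpoint receives the image as `image_b64`, shown in the request example as a base64 data URI. Before hashing, the server has to recover the raw bytes. The sketch below is one possible way to do that; `decode_image_b64` is a hypothetical helper, and accepting a bare base64 string without the `data:` prefix is an assumption, not something the spec states.

```python
import base64
import binascii


def decode_image_b64(image_b64: str) -> bytes:
    """Decode a base64 image, with or without a 'data:...;base64,' prefix."""
    if image_b64.startswith("data:"):
        header, sep, payload = image_b64.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("expected a base64 data URI")
    else:
        payload = image_b64  # assumption: bare base64 is also accepted
    try:
        # validate=True rejects characters outside the base64 alphabet
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("invalid base64 payload") from exc
```

The decoded bytes, not the data URI string, are what should go through the magic bytes check and the SHA-256 hash, so that the same file produces the same hash on both endpoints.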
POST /api/v1/portal/ocr
Request:
```bash
curl -X POST http://api/api/v1/portal/ocr \
  -H "Authorization: Bearer $JWT" \
  -F "image=@pièce_id.jpg" \
  -F "document_type=CNI"
```
Response 200:
```json
{
  "full_name": "Jean Kossou",
  "id_number": "BE0123456789",
  "extracted_text": "Carte Nationale d'Identité...",
  "image_hash": "a1b2c3d4e5f6...",
  "confidence": 92
}
```
Response 400 (validation):
```json
{
  "error": {
    "code": "ERR_OCR_INVALID_FORMAT",
    "message": "Image format not supported (JPEG/PNG only)",
    "details": {"max_size_mb": 5}
  }
}
```
POST /api/v1/portal/register
Request:
```json
{
  "msisdn": "+22901979799",
  "full_name": "Jean Kossou",
  "id_document_type": "CNI",
  "image_b64": "data:image/jpeg;base64,/9j/4AAQ...",
  "phone_manufacturer": "Apple",
  "consent": true
}
```
Response 201:
```json
{
  "subscriber_ref": "RGZ-0197979799",
  "subscriber_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "next_step": "otp_verification"
}
```
Response 409 (hash exists):
```json
{
  "error": {
    "code": "ERR_SUBSCRIBER_EXISTS",
    "message": "Subscriber with this identity already exists",
    "details": {
      "subscriber_ref": "RGZ-0197979799",
      "subscriber_id": "550e8400-e29b-41d4-a716-446655440000"
    }
  }
}
```
Security
| Rule | Implementation |
|---|---|
| SEC-12 | Magic bytes check (JPEG/PNG only), BEFORE any processing |
| SEC-12 | Content-Type validation (multipart/form-data) |
| SEC-12 | Max 5 MB file size; reject if exceeded |
| SHA-256 | Image hash → deduplication in DB |
| Timeout | Tesseract timeout of 30 s max (prevents DoS) |
| Confidence | OCR confidence ≥ 70% (configurable) |
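The size, magic bytes, and confidence rules in the table can be sketched as plain functions that run before and after OCR. This is a sketch under assumptions: `detect_image_type` and `is_confident` are hypothetical names, the constants mirror the `.env` values, and 1 MB is taken as 1024 × 1024 bytes, which the spec does not define.

```python
MAGIC_BYTES = {
    b"\xFF\xD8\xFF": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}
UPLOAD_MAX_SIZE_MB = 5          # mirrors UPLOAD_MAX_SIZE_MB in .env
OCR_CONFIDENCE_THRESHOLD = 70   # mirrors OCR_CONFIDENCE_THRESHOLD in .env


def detect_image_type(image_bytes: bytes, max_size_mb: int = UPLOAD_MAX_SIZE_MB) -> str:
    """Return the MIME type, or raise ValueError before any OCR work is done."""
    # Assumption: 1 MB = 1024 * 1024 bytes.
    if len(image_bytes) > max_size_mb * 1024 * 1024:
        raise ValueError(f"file exceeds {max_size_mb} MB")
    for magic, mime in MAGIC_BYTES.items():
        if image_bytes.startswith(magic):
            return mime
    raise ValueError("unsupported image format (JPEG/PNG only)")


def is_confident(confidence: float, threshold: float = OCR_CONFIDENCE_THRESHOLD) -> bool:
    """True when the OCR confidence (in %) meets the configured minimum."""
    return confidence >= threshold
```

Checking the leading bytes rather than the client-supplied Content-Type or file extension means a renamed file is still rejected, which is the point of running this check first.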
Implementation TODO
- [ ] Service app/services/ocr.py: Tesseract wrapper
- [ ] POST /api/v1/portal/ocr: extraction + format validation
- [ ] POST /api/v1/portal/register: atomic subscriber creation
- [ ] GET /api/v1/portal/subscribers/{hash}: deduplication lookup
- [ ] IdDocumentType enum validation (CNI|CIP|PASSPORT)
- [ ] SHA-256 image hash + DB storage
- [ ] Tesseract config for Benin (French language)
- [ ] Unit tests for OCR extraction
- [ ] E2E tests for portal registration
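For the IdDocumentType item in the list above, a starting point could look like the sketch below. The three members come from the spec; `parse_document_type` is a hypothetical helper, and trimming whitespace and ignoring case are assumptions about how lenient the form field should be.

```python
from enum import Enum


class IdDocumentType(str, Enum):
    CNI = "CNI"
    CIP = "CIP"
    PASSPORT = "PASSPORT"


def parse_document_type(raw: str) -> IdDocumentType:
    """Map a form value to the enum; the ValueError lists the accepted values."""
    try:
        # Assumption: input is matched case-insensitively after trimming.
        return IdDocumentType(raw.strip().upper())
    except ValueError:
        allowed = "|".join(t.value for t in IdDocumentType)
        raise ValueError(f"document_type must be one of {allowed}") from None
```

Because the enum subclasses `str`, its members compare equal to the plain strings used in the JSON payloads, for example `IdDocumentType.CNI == "CNI"`.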
Lessons Learned
- LL#12 (SEC-12): magic bytes check b'\xFF\xD8\xFF' (JPEG), b'\x89PNG' (PNG); reject anything else
- LL#26: hash DB first → then Redis cache
- LL#5: exact DB columns from the contract (id_document_hash NOT NULL, id_document_type CHECK(...))
- LL#8: subscriber_id is a UUID (never an int)
- LL#29: skeleton imports in multi-file prompts
Last updated: 2026-02-21