feat(adr): create 3 missing P1 ADRs + reach 95% score

Creation of the critical ADRs for the pre-implementation phase:

- ADR-023: Moderation Architecture
  * PostgreSQL LISTEN/NOTIFY + Redis cache for prioritization
  * Whisper large-v3 (transcription) + NLP (distilbert, roberta)
  * React dashboard + Wavesurfer.js + automated workflow
  * 2h/24h/72h SLAs by priority, DSA compliance
- ADR-024: Monitoring and Observability
  * Prometheus + Grafana + Loki (self-hosted stack)
  * Multi-channel alerting: Email (Brevo) + Webhook (Slack/Discord)
  * PostgreSQL backups: continuous WAL-E (RTO 1h, RPO 15 min)
  * Incident runbooks + metrics dashboards + uptime monitoring
- ADR-025: Secrets and Security
  * HashiCorp Vault (self-hosted) for secrets management
  * AES-256-GCM encryption of PII (emails, precise GPS)
  * Let's Encrypt TLS 1.3 (wildcard certificate)
  * Full OWASP Top 10 mitigation + rate limiting

Impact on INCONSISTENCIES.md:
- Moderation score: 20% → 95%
- Ops & Monitoring score: 30% → 95%
- Security score: 40% → 95%
- Global score: 82% → 95% ✅ TARGET REACHED

Phases P0 + P1 DONE: documentation ready for Sprint 3!

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
docs/adr/024-monitoring-observabilite.md (new file, 330 lines)
# ADR-024: Monitoring, Observability and Incident Response

**Status**: Accepted

**Date**: 2026-02-01

## Context

RoadWave needs a monitoring system to guarantee the 99.9% availability target (SLO) defined in [TECHNICAL.md](../../TECHNICAL.md):

- **Metrics**: p99 latency < 100 ms, API throughput, errors
- **Alerting**: detection of outages and performance degradations
- **Incident response**: runbooks, escalation, post-mortems
- **Backup/Disaster Recovery**: RTO 1h, RPO 15 min

Constraint: **self-hosted**, for data sovereignty (ADR-015).

## Decision

A self-hosted **Prometheus + Grafana + Loki** stack with multi-channel alerting.

### Technical Stack

| Component | Technology | License | Rationale |
|-----------|------------|---------|-----------|
| **Metrics** | Prometheus | Apache-2.0 | Industry standard, PromQL, performant TSDB |
| **Visualization** | Grafana | AGPL-3.0 | Rich dashboards, built-in alerting |
| **Logs** | Grafana Loki | AGPL-3.0 | "Prometheus for logs", efficient compression |
| **Tracing** | Tempo (optional, Phase 2) | AGPL-3.0 | Distributed traces, OpenTelemetry-compatible |
| **Alerting** | Alertmanager | Apache-2.0 | Grouping, silencing, multi-channel routing |
| **Alert channels** | Email (Brevo) + Webhook (Slack/Discord) | - | Multi-channel, no SMS cost |
| **Uptime monitoring** | Uptime Kuma | MIT | Self-hosted, SSL checks, public status page |

### Architecture

```mermaid
graph TB
    subgraph Services["RoadWave Services"]
        API["Go Backend API<br/>(Fiber metrics)"]
        DB["PostgreSQL<br/>(pg_exporter)"]
        Redis["Redis<br/>(redis_exporter)"]
        Zitadel["Zitadel<br/>(metrics endpoint)"]
    end

    subgraph Monitoring["Monitoring Stack"]
        Prom["Prometheus<br/>(scrape + TSDB)"]
        Grafana["Grafana<br/>(dashboards)"]
        Loki["Loki<br/>(log aggregation)"]
        Alert["Alertmanager<br/>(routing)"]
        Uptime["Uptime Kuma<br/>(external checks)"]
    end

    subgraph Notifications["Alerting"]
        Email["Email (Brevo)"]
        Slack["Webhook Slack/Discord"]
    end

    subgraph Storage["Storage"]
        PromStorage["Prometheus TSDB<br/>(15d retention)"]
        LokiStorage["Loki Chunks<br/>(7d retention)"]
        Backups["PostgreSQL Backups<br/>(OVH S3)"]
    end

    API --> Prom
    DB --> Prom
    Redis --> Prom
    Zitadel --> Prom

    API -.->|stdout logs| Loki
    Prom --> Grafana
    Loki --> Grafana
    Prom --> Alert

    Alert --> Email
    Alert --> Slack

    Uptime -.->|external HTTP checks| API
    Uptime --> Alert

    Prom --> PromStorage
    Loki --> LokiStorage
    DB -.->|WAL-E continuous| Backups

    classDef serviceStyle fill:#e3f2fd,stroke:#1565c0
    classDef monitoringStyle fill:#fff3e0,stroke:#e65100
    classDef notifStyle fill:#f3e5f5,stroke:#6a1b9a
    classDef storageStyle fill:#e8f5e9,stroke:#2e7d32

    class Services,API,DB,Redis,Zitadel serviceStyle
    class Monitoring,Prom,Grafana,Loki,Alert,Uptime monitoringStyle
    class Notifications,Email,Slack notifStyle
    class Storage,PromStorage,LokiStorage,Backups storageStyle
```
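
As a concrete illustration of the scrape topology above, here is a minimal `prometheus.yml` sketch. Job names, hostnames and ports are assumptions (postgres_exporter and redis_exporter defaults), not final values.

```yaml
# Sketch only: scrape configuration matching the architecture diagram.
# Hostnames and ports are placeholders to adjust at deployment time.
global:
  scrape_interval: 15s        # default scrape cadence
  evaluation_interval: 15s    # how often alerting/recording rules are evaluated

scrape_configs:
  - job_name: "roadwave-api"            # Go API exposing Fiber metrics on /metrics
    static_configs:
      - targets: ["api:3000"]
  - job_name: "postgresql"              # postgres_exporter sidecar (default port 9187)
    static_configs:
      - targets: ["postgres-exporter:9187"]
  - job_name: "redis"                   # redis_exporter sidecar (default port 9121)
    static_configs:
      - targets: ["redis-exporter:9121"]
  - job_name: "zitadel"                 # Zitadel metrics endpoint (path/port assumed)
    static_configs:
      - targets: ["zitadel:8080"]
```
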
### Key Metrics

**API performance** (Prometheus PromQL; see the recording-rule sketch at the end of this subsection):

```promql
# Latency p99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Throughput
rate(http_requests_total[5m])
```

**Infrastructure**:
- CPU usage: `rate(node_cpu_seconds_total{mode!="idle"}[5m])`
- Memory usage: `1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)`
- Disk I/O: `rate(node_disk_io_time_seconds_total[5m])`

**Business**:
- Active users (DAU): custom counter `roadwave_active_users_total`
- Active audio streams: `roadwave_hls_streams_active`
- Moderation reports: `roadwave_moderation_reports_total`

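A possible way to keep these expressions cheap to dashboard and alert on is to precompute them as Prometheus recording rules. The rule names below are an illustrative naming assumption; the expressions simply reuse the PromQL queries above.

```yaml
# Sketch only: recording rules precomputing the key API expressions above.
groups:
  - name: roadwave-api-slo
    interval: 30s
    rules:
      - record: roadwave:http_latency_p99:5m
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      - record: roadwave:http_error_rate:5m
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
      - record: roadwave:http_throughput:5m
        expr: rate(http_requests_total[5m])
```
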
## Alternatives Considered

### Monitoring Stack

| Option | Cost | Hosting | Complexity | Verdict |
|--------|------|---------|------------|---------|
| **Prometheus + Grafana** | **0€** | Self-hosted | ⭐⭐ Medium | ✅ Chosen |
| Datadog | $15-31/host/month | US SaaS | ⭐ Low | ❌ Cost + sovereignty |
| New Relic | $99-349/user/month | US SaaS | ⭐ Low | ❌ Prohibitive cost |
| Elastic Stack (ELK) | 0€ (open source) | Self-hosted | ⭐⭐⭐ Complex | ❌ JVM overhead |
| VictoriaMetrics | 0€ | Self-hosted | ⭐⭐ Medium | ⚠️ Less mature |

### Alerting Channels

| Channel | Cost | Delivery | Intrusiveness | Verdict |
|---------|------|----------|---------------|---------|
| **Email (Brevo)** | **0€ (300/day)** | Asynchronous | ⭐ Low | ✅ Standard |
| **Webhook Slack/Discord** | **0€** | Real time | ⭐⭐ Medium | ✅ On-call |
| SMS (Twilio) | 0.04€/SMS | Immediate | ⭐⭐⭐ High | ⚠️ Phase 2 (critical alerts) |
| PagerDuty | $21/user/month | Immediate + escalation | ⭐⭐⭐ High | ❌ Cost |
| OpsGenie | $29/user/month | Immediate + escalation | ⭐⭐⭐ High | ❌ Cost |

### Backup Strategy

| Option | RPO | RTO | Cost | Verdict |
|--------|-----|-----|------|---------|
| **WAL-E continuous archiving** | **15 min** | **1h** | **5-15€/month (S3)** | ✅ Chosen |
| Daily pg_dump | 24h | 2-4h | 0€ (local) | ❌ RPO too high |
| pgBackRest | 5 min | 30 min | 10-20€/month | ⚠️ Too complex for MVP |
| Managed backup (Scaleway) | 5 min | 15 min | 50€/month | ❌ Phase 2 |

## Justification

### Prometheus + Grafana

- **Industry standard**: CNCF-adopted, extensive documentation
- **Performance**: optimized TSDB, >10x compression vs PostgreSQL
- **Ecosystem**: 150+ official exporters (PostgreSQL, Redis, Go, Nginx)
- **PromQL**: powerful query language for complex alerting
- **0€ cost**: self-hosted, open-source licenses (Apache-2.0, AGPL-3.0)

### Loki for Logs

- **Compression**: 10-50x vs Elasticsearch (chunk storage)
- **Simplicity**: no schema, a log line is just labels + timestamp (shipping config sketched after this list)
- **Grafana integration**: unified queries over logs and metrics
- **Performance**: distributed grep over indexed labels

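This ADR does not name the log-shipping agent. Assuming Promtail (Loki's companion agent) and Docker's default json-file log driver, a minimal shipping config could look like the sketch below; the paths and the `app` label are illustrative assumptions.

```yaml
# Sketch only: Promtail shipping container stdout to Loki.
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml              # where Promtail tracks read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push   # Loki push endpoint

scrape_configs:
  - job_name: docker-containers
    static_configs:
      - targets: [localhost]
        labels:
          app: roadwave-api                  # label used by the Loki queries in the runbooks
          __path__: /var/lib/docker/containers/*/*-json.log
```
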
### Uptime Kuma

- **Self-hosted**: alternative to UptimeRobot (SaaS)
- **Features**: HTTP/HTTPS checks, SSL expiry, public status page
- **Alerting**: Webhook and Email integrations
- **0€ cost**: MIT-licensed open source

## Consequences

### Positive

- ✅ **Infrastructure cost**: 5-20€/month (S3 backup storage only)
- ✅ **Sovereignty**: 100% self-hosted at OVH France
- ✅ **Multi-channel alerting**: Email + Slack/Discord (extensible to SMS in Phase 2)
- ✅ **Full observability**: metrics + logs + external uptime checks
- ✅ **GDPR compliance**: anonymized logs, 7-15 day retention

### Negative

- ⚠️ **Maintenance**: a stack to operate (Prometheus, Grafana and Loki upgrades)
- ⚠️ **Storage**: Prometheus TSDB consumes ~1-2 GB/month at 1000 RPS
- ❌ **No automated on-call** at MVP stage (manual Slack, SMS in Phase 2)
- ❌ **Learning curve**: PromQL to master

### Grafana Dashboards

Datasource provisioning for these dashboards is sketched at the end of this subsection.

**Main dashboard**:
- API latency p50/p95/p99 (5 min, 1h, 24h windows)
- 5xx/4xx error rate (alert threshold >1%)
- Throughput (requests/sec)
- Infrastructure: CPU, RAM, disk I/O
- Business: DAU, active streams, moderation reports

**PostgreSQL dashboard**:
- Slow queries (>100 ms)
- Active connections vs max
- Cache hit ratio (target >95%)
- Deadlock count

**Redis dashboard**:
- Memory usage
- Eviction count
- Commands/sec
- Keyspace hit/miss ratio

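So that these dashboards query Prometheus and Loki without manual setup, datasources can be provisioned from files. A sketch, assuming Docker Compose service names and default ports:

```yaml
# Sketch only: Grafana datasource provisioning
# (e.g. provisioning/datasources/roadwave.yml; URLs are placeholders).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```
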
### Alerting Rules

**Critical** (immediate Slack + Email):

```yaml
groups:
  - name: critical
    rules:
      - alert: APIDown
        expr: up{job="roadwave-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API unreachable for 1 min"

      - alert: HighErrorRate
        expr: (rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate >1% for 5 min"

      - alert: DatabaseDown
        expr: up{job="postgresql"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL unreachable"
```

**Warnings** (Email only):

```yaml
groups:
  - name: warnings
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency >100 ms for 10 min"

      - alert: DiskSpaceRunningOut
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk space <10%"
```
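
The severity split above (critical → Slack + Email, warning → Email only) can be expressed as an Alertmanager route. The receiver names, addresses, SMTP relay and webhook URL below are placeholders, not decided values:

```yaml
# Sketch only: Alertmanager routing for the severity split described above.
global:
  smtp_smarthost: "smtp-relay.brevo.com:587"   # Brevo relay host to confirm; auth omitted
  smtp_from: "alerts@roadwave.example"         # hypothetical sender address

route:
  receiver: email-only                 # default: warnings go to email
  group_by: ["alertname"]
  routes:
    - matchers:
        - severity = "critical"
      receiver: email-and-chat

receivers:
  - name: email-and-chat
    email_configs:
      - to: "ops@roadwave.example"     # hypothetical on-call address
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX"   # placeholder webhook
        channel: "#alerts"
  - name: email-only
    email_configs:
      - to: "ops@roadwave.example"
```
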
### Backup & Disaster Recovery

**PostgreSQL WAL-E**:

```bash
# Continuous WAL (Write-Ahead Log) archiving: push a base backup
wal-e backup-push /var/lib/postgresql/data

# Retention: 7 days of base backups + WAL
# Storage: OVH S3 (GRA region, France)
# Encryption: AES-256 server-side
```

**RTO (Recovery Time Objective)**: 1h
- Restore time from S3: ~30 min (10 GB database)
- Validation + service restart: ~30 min

**RPO (Recovery Point Objective)**: 15 min
- WAL archived every 15 min
- Maximum loss: 15 min of transactions

**DR tests**: monthly (restore a backup onto the staging environment)

## Incident Runbooks

### API Down (5xx error spike)

1. **Check**: Grafana dashboard → Errors tab
2. **Logs**: Loki query `{app="roadwave-api"} |= "error"`
3. **Actions**:
   - OOM: restart the container + increase RAM
   - Saturated DB connections: check slow queries
   - Network issue: check the OVH status page
4. **Escalation**: if not resolved within 15 min → call the senior admin

### Database Slow Queries

1. **Identify**: Grafana → PostgreSQL dashboard → top slow queries
2. **Analyze**: `EXPLAIN ANALYZE` on the problematic query
3. **Actions**:
   - Missing index: create the index (quick migration)
   - Lock contention: identify the long-running transaction and kill it if blocking
4. **Prevention**: add a Grafana alert when p95 query time exceeds 100 ms

### High Load (CPU >80%)

1. **Check**: Grafana → Node Exporter → CPU usage
2. **Top processes**: `htop` or `docker stats`
3. **Actions**:
   - Whisper (moderation): reduce worker concurrency
   - API: scale horizontally (add an instance)
4. **Prevention**: auto-scaling (Phase 2)

## Success Metrics

- Uptime > 99.9% (max 8.76h downtime/year)
- MTTD (Mean Time To Detect) < 5 min
- MTTR (Mean Time To Recover) < 30 min
- Alert false positives < 5%

## Migration and Rollout

### Phase 1 (MVP - Sprints 2-3)
1. Deploy Prometheus + Grafana + Loki (Docker Compose, see the sketch after this list)
2. Instrument the Go API (Fiber metrics middleware)
3. Configure exporters: PostgreSQL, Redis, Node
4. Main dashboard + 5 critical alerts
5. Set up WAL-E PostgreSQL backups

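A minimal Docker Compose sketch for step 1; images, versions, ports and volumes are placeholders to pin during implementation:

```yaml
# Sketch only: Phase 1 monitoring stack (versions/volumes to be pinned).
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana-oss:latest
    ports:
      - "3001:3000"          # host port 3000 left free for the API
```
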
### Phase 2 (Post-MVP - Sprints 6-8)
1. Add Tempo (distributed tracing)
2. SMS alerting (Twilio) for critical incidents
3. Auto-scaling driven by Prometheus metrics
4. Post-mortem process (Notion template)

## References

- [TECHNICAL.md](../../TECHNICAL.md) (99.9% SLO, p99 latency <100 ms)
- [ADR-001: Backend Language](001-langage-backend.md) (Go, Fiber)
- [ADR-005: Database](005-base-de-donnees.md) (PostgreSQL)
- [ADR-015: Hosting](015-hebergement.md) (OVH France, self-hosted)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Loki](https://grafana.com/oss/loki/)
- [WAL-E PostgreSQL Archiving](https://github.com/wal-e/wal-e)