refactor(adr-023/024/025): remove code examples and scripts

Remove all code examples, keeping only the technical descriptions:

ADR-023 (Moderation Architecture):
- Mermaid diagram → textual flow description
- SQL/Redis examples → workflow description
- Go interface → description of the abstraction
- Dependencies → concise list

ADR-024 (Monitoring and Observability):
- Mermaid diagram → textual architecture
- PromQL examples → metric descriptions
- YAML alert config → list of alerts with thresholds
- WAL-E bash commands → backup description
- Runbooks → steps without commands

ADR-025 (Security and Secrets):
- Mermaid diagram → textual secrets flow
- Vault bash commands → process description
- Go encryption code → encryption architecture
- SQL schema → textual constraints
- Nginx config → TLS configuration
- Go rate-limiting code → middleware parameters

The ADRs remain 100% technical and complete without concrete code.
Consistent with ADR-022 (same approach).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-01 17:12:07 +01:00
parent 60dce59905
commit 81ccbf79e6
3 changed files with 120 additions and 392 deletions
--- a/docs/adr/024-monitoring-observabilite.md
+++ b/docs/adr/024-monitoring-observabilite.md
@@ -31,86 +31,42 @@ Self-hosted **Prometheus + Grafana + Loki** stack with multi-channel alerting.

 ### Architecture

-```mermaid
-graph TB
-    subgraph Services["Services RoadWave"]
-        API["Backend Go API<br/>(Fiber metrics)"]
-        DB["PostgreSQL<br/>(pg_exporter)"]
-        Redis["Redis<br/>(redis_exporter)"]
-        Zitadel["Zitadel<br/>(metrics endpoint)"]
-    end
+**Monitored services**:
+- Backend Go API (Fiber metrics)
+- PostgreSQL (pg_exporter)
+- Redis (redis_exporter)
+- Zitadel (metrics endpoint)

-    subgraph Monitoring["Stack Monitoring"]
-        Prom["Prometheus<br/>(scrape + TSDB)"]
-        Grafana["Grafana<br/>(dashboards)"]
-        Loki["Loki<br/>(logs aggregation)"]
-        Alert["Alertmanager<br/>(routing)"]
-        Uptime["Uptime Kuma<br/>(external checks)"]
-    end
+**Monitoring stack**:
+- **Prometheus**: metrics collection (scrape), TSDB storage with 15-day retention
+- **Grafana**: dashboard visualization
+- **Loki**: log aggregation (compressed chunks, 7-day retention)
+- **Alertmanager**: multi-channel alert routing
+- **Uptime Kuma**: external HTTP checks, SSL monitoring

-    subgraph Notifications["Alerting"]
-        Email["Email (Brevo)"]
-        Slack["Webhook Slack/Discord"]
-    end
+**Alerting**:
+- Email (Brevo): asynchronous, low intrusiveness
+- Webhook (Slack/Discord): real time, on-call

-    subgraph Storage["Stockage"]
-        PromStorage["Prometheus TSDB<br/>(15j retention)"]
-        LokiStorage["Loki Chunks<br/>(7j retention)"]
-        Backups["Backups PostgreSQL<br/>(S3 OVH)"]
-    end
-
-    API --> Prom
-    DB --> Prom
-    Redis --> Prom
-    Zitadel --> Prom
-
-    API -.->|logs stdout| Loki
-    Prom --> Grafana
-    Loki --> Grafana
-    Prom --> Alert
-
-    Alert --> Email
-    Alert --> Slack
-
-    Uptime -.->|external HTTP checks| API
-    Uptime --> Alert
-
-    Prom --> PromStorage
-    Loki --> LokiStorage
-    DB -.->|WAL-E continuous| Backups
-
-    classDef serviceStyle fill:#e3f2fd,stroke:#1565c0
-    classDef monitoringStyle fill:#fff3e0,stroke:#e65100
-    classDef notifStyle fill:#f3e5f5,stroke:#6a1b9a
-    classDef storageStyle fill:#e8f5e9,stroke:#2e7d32
-
-    class Services,API,DB,Redis,Zitadel serviceStyle
-    class Monitoring,Prom,Grafana,Loki,Alert,Uptime monitoringStyle
-    class Notifications,Email,Slack notifStyle
-    class Storage,PromStorage,LokiStorage,Backups storageStyle
-```
+**Storage**:
+- Prometheus TSDB: metrics, 15 days
+- Loki chunks: logs, 7 days
+- PostgreSQL backups: continuous WAL-E archiving to S3 OVH

 ### Key Metrics

-**API Performance** (Prometheus PromQL):
-```promql
-# Latency p99
-histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
-
-# Error rate
-rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
-
-# Throughput
-rate(http_requests_total[5m])
-```
+**API Performance** (PromQL queries):
+- Latency p99: 99th-percentile histogram quantile of HTTP request duration (5 min window)
+- Error rate: ratio of 5xx requests to total requests (5 min window)
+- Throughput: requests per second (5 min window)

 **Infrastructure**:
-- CPU usage: `rate(node_cpu_seconds_total{mode!="idle"}[5m])`
-- Memory usage: `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`
-- Disk I/O: `rate(node_disk_io_time_seconds_total[5m])`
+- CPU usage: CPU utilization rate (non-idle modes, 5 min window)
+- Memory usage: ratio of available to total memory
+- Disk I/O: disk I/O time (5 min window)

-**Business**:
-- Active users (DAU): custom counter `roadwave_active_users_total`
+**Business** (custom counters):
+- Active users (DAU): `roadwave_active_users_total`
 - Active audio streams: `roadwave_hls_streams_active`
 - Moderation reports: `roadwave_moderation_reports_total`

@@ -209,60 +165,29 @@ rate(http_requests_total[5m])

 ### Alerting Rules

-**Critical** (Slack + immediate email):
-```yaml
-- alert: APIDown
-  expr: up{job="roadwave-api"} == 0
-  for: 1m
-  severity: critical
-  message: "API indisponible depuis 1 min"
+**Critical alerts** (Slack + immediate email):
+- **API Down**: API job unavailable for >1 min → immediate notification
+- **High Error Rate**: 5xx error rate >1% for >5 min → immediate notification
+- **Database Down**: PostgreSQL unavailable for >1 min → immediate notification

-- alert: HighErrorRate
-  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.01
-  for: 5m
-  severity: critical
-  message: "Error rate >1% depuis 5 min"
-
-- alert: DatabaseDown
-  expr: up{job="postgresql"} == 0
-  for: 1m
-  severity: critical
-  message: "PostgreSQL indisponible"
-```
-
-**Warnings** (email only):
-```yaml
-- alert: HighLatency
-  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.1
-  for: 10m
-  severity: warning
-  message: "Latency p99 >100ms depuis 10 min"
-
-- alert: DiskSpaceRunningOut
-  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
-  for: 30m
-  severity: warning
-  message: "Espace disque <10%"
-```
+**Warning alerts** (email only):
+- **High Latency**: p99 latency >100 ms for >10 min → investigation required
+- **Disk Space Running Out**: free disk space <10% for >30 min → cleanup required

 ### Backup & Disaster Recovery

 **PostgreSQL WAL-E**:
-```bash
-# Continuous WAL (Write-Ahead Log) backup
-wal-e backup-push /var/lib/postgresql/data
-
-# Retention: 7 days full + WAL
-# Storage: S3 OVH (GRA region, France)
-# Encryption: AES-256 server-side
-```
+- Method: continuous Write-Ahead Log (WAL) backup
+- Retention: 7 days full + incremental WAL
+- Storage: S3 OVH, GRA region (France)
+- Encryption: AES-256 server-side

 **RTO (Recovery Time Objective)**: 1h
-- Restore time from S3: ~30 min (10 GB DB)
-- Validation + service restart time: ~30 min
+- Restore from S3: ~30 min (10 GB DB)
+- Validation + service restart: ~30 min

 **RPO (Recovery Point Objective)**: 15 min
-- WAL archiving every 15 min
+- WAL archiving frequency: every 15 min
 - Maximum loss: 15 min of transactions

 **DR tests**: monthly (restore a backup to the staging environment)
@@ -272,7 +197,7 @@ wal-e backup-push /var/lib/postgresql/data
 ### API Down (5xx errors spike)

 1. **Check**: Grafana dashboard → Errors tab
-2. **Logs**: Loki query `{app="roadwave-api"} |= "error"`
+2. **Logs**: Loki query filtered on app roadwave-api + error level
 3. **Actions**:
   - If OOM: restart the container + increase RAM
   - If DB connections are saturated: check slow queries
@@ -282,7 +207,7 @@ wal-e backup-push /var/lib/postgresql/data
 ### Database Slow Queries

 1. **Identify**: Grafana → PostgreSQL dashboard → Top slow queries
-2. **Analyze**: `EXPLAIN ANALYZE` on the problematic query
+2. **Analyze**: use EXPLAIN ANALYZE on the problematic query
 3. **Actions**:
   - Missing index: create the index (quick migration)
   - Lock contention: identify the long-running transaction and kill it if it is blocking
@@ -291,7 +216,7 @@ wal-e backup-push /var/lib/postgresql/data
 ### High Load (CPU >80%)

 1. **Check**: Grafana → Node Exporter → CPU usage
-2. **Top processes**: `htop` or `docker stats`
+2. **Top processes**: check htop or docker stats
 3. **Actions**:
   - If Whisper (moderation): reduce worker concurrency
   - If API: scale horizontally (add an instance)