refactor(adr-023/024/025): remove code examples and scripts

Remove all code examples and keep only the technical descriptions:

ADR-023 (Moderation Architecture):
- Mermaid diagram → textual flow description
- SQL/Redis examples → workflow description
- Go interface → description of the abstraction
- Dependencies → concise list

ADR-024 (Monitoring and Observability):
- Mermaid diagram → textual architecture
- PromQL examples → metric descriptions
- YAML alert config → list of alerts with thresholds
- WAL-E bash commands → backup description
- Runbooks → steps without commands

ADR-025 (Security and Secrets):
- Mermaid diagram → textual secrets flow
- Vault bash commands → process description
- Go encryption code → encryption architecture
- SQL schema → textual constraints
- Nginx config → TLS configuration
- Go rate-limiting code → middleware parameters

The ADRs remain 100% technical and complete without concrete code.
Consistent with ADR-022 (same approach).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -31,86 +31,42 @@ Self-hosted **Prometheus + Grafana + Loki** stack with multi-channel alerting.

 ### Architecture

-```mermaid
-graph TB
-    subgraph Services["RoadWave Services"]
-        API["Backend Go API<br/>(Fiber metrics)"]
-        DB["PostgreSQL<br/>(pg_exporter)"]
-        Redis["Redis<br/>(redis_exporter)"]
-        Zitadel["Zitadel<br/>(metrics endpoint)"]
-    end
+**Monitored services**:
+- Go backend API (Fiber metrics)
+- PostgreSQL (pg_exporter)
+- Redis (redis_exporter)
+- Zitadel (metrics endpoint)

-    subgraph Monitoring["Monitoring Stack"]
-        Prom["Prometheus<br/>(scrape + TSDB)"]
-        Grafana["Grafana<br/>(dashboards)"]
-        Loki["Loki<br/>(logs aggregation)"]
-        Alert["Alertmanager<br/>(routing)"]
-        Uptime["Uptime Kuma<br/>(external checks)"]
-    end
+**Monitoring stack**:
+- **Prometheus**: metrics collection (scrape), TSDB storage with 15-day retention
+- **Grafana**: dashboard visualization
+- **Loki**: log aggregation (compressed chunks, 7-day retention)
+- **Alertmanager**: multi-channel alert routing
+- **Uptime Kuma**: external HTTP checks, SSL monitoring

-    subgraph Notifications["Alerting"]
-        Email["Email (Brevo)"]
-        Slack["Webhook Slack/Discord"]
-    end
+**Alerting**:
+- Email (Brevo): asynchronous, low intrusiveness
+- Webhook (Slack/Discord): real time, on-call

-    subgraph Storage["Storage"]
-        PromStorage["Prometheus TSDB<br/>(15j retention)"]
-        LokiStorage["Loki Chunks<br/>(7j retention)"]
-        Backups["Backups PostgreSQL<br/>(S3 OVH)"]
-    end
-
-    API --> Prom
-    DB --> Prom
-    Redis --> Prom
-    Zitadel --> Prom
-
-    API -.->|logs stdout| Loki
-    Prom --> Grafana
-    Loki --> Grafana
-    Prom --> Alert
-
-    Alert --> Email
-    Alert --> Slack
-
-    Uptime -.->|external HTTP checks| API
-    Uptime --> Alert
-
-    Prom --> PromStorage
-    Loki --> LokiStorage
-    DB -.->|WAL-E continuous| Backups
-
-    classDef serviceStyle fill:#e3f2fd,stroke:#1565c0
-    classDef monitoringStyle fill:#fff3e0,stroke:#e65100
-    classDef notifStyle fill:#f3e5f5,stroke:#6a1b9a
-    classDef storageStyle fill:#e8f5e9,stroke:#2e7d32
-
-    class Services,API,DB,Redis,Zitadel serviceStyle
-    class Monitoring,Prom,Grafana,Loki,Alert,Uptime monitoringStyle
-    class Notifications,Email,Slack notifStyle
-    class Storage,PromStorage,LokiStorage,Backups storageStyle
-```
+**Storage**:
+- Prometheus TSDB: metrics, 15 days
+- Loki chunks: logs, 7 days
+- PostgreSQL backups: continuous WAL-E to OVH S3

 ### Key Metrics

-**API Performance** (Prometheus PromQL):
-```promql
-# Latency p99
-histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
-
-# Error rate
-rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
-
-# Throughput
-rate(http_requests_total[5m])
-```
+**API Performance** (PromQL queries):
+- Latency p99: 99th-percentile histogram quantile of HTTP request duration (5 min window)
+- Error rate: ratio of 5xx requests to total requests (5 min window)
+- Throughput: requests per second (5 min window)

 **Infrastructure**:
-- CPU usage: `rate(node_cpu_seconds_total{mode!="idle"}[5m])`
-- Memory usage: `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`
-- Disk I/O: `rate(node_disk_io_time_seconds_total[5m])`
+- CPU usage: CPU utilization rate (non-idle modes, 5 min window)
+- Memory usage: ratio of available to total memory
+- Disk I/O: disk I/O time (5 min window)

-**Business**:
-- Active users (DAU): custom counter `roadwave_active_users_total`
+**Business** (custom counters):
+- Active users (DAU): `roadwave_active_users_total`
 - Active audio streams: `roadwave_hls_streams_active`
 - Moderation reports: `roadwave_moderation_reports_total`

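To make the three API performance descriptions in the hunk above concrete, here is a minimal Python sketch of what each one computes over a 5-minute window. It is not part of the commit: the function names and the sample bucket counts are illustrative assumptions, and the quantile routine is a simplified take on the linear interpolation that PromQL's `histogram_quantile()` applies (it assumes at least one finite bucket plus a `+Inf` bucket).

```python
import math

WINDOW_SECONDS = 300  # the 5 min window used by the ADR's metrics


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # Quantile lands in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


def error_rate(errors_5xx, total_requests):
    """Ratio of 5xx responses to all responses in the window."""
    return errors_5xx / total_requests if total_requests else 0.0


def throughput(total_requests, window_seconds=WINDOW_SECONDS):
    """Average requests per second over the window."""
    return total_requests / window_seconds


if __name__ == "__main__":
    # Hypothetical per-window bucket increases: (upper bound in seconds, cumulative count).
    sample = [(0.05, 900), (0.1, 980), (0.25, 995), (0.5, 1000), (math.inf, 1000)]
    print(histogram_quantile(0.99, sample), error_rate(12, 1000), throughput(1000))
```

The interpolation step explains why a reported p99 can only be as precise as the bucket boundaries chosen for `http_request_duration_seconds`.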
@@ -209,60 +165,29 @@ rate(http_requests_total[5m])

 ### Alerting Rules

-**Critical** (Slack + immediate email):
-```yaml
-- alert: APIDown
-  expr: up{job="roadwave-api"} == 0
-  for: 1m
-  severity: critical
-  message: "API unavailable for 1 min"
+**Critical alerts** (Slack + immediate email):
+- **API Down**: API job unavailable for >1 min → immediate notification
+- **High Error Rate**: 5xx error rate >1% for >5 min → immediate notification
+- **Database Down**: PostgreSQL unavailable for >1 min → immediate notification

-- alert: HighErrorRate
-  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.01
-  for: 5m
-  severity: critical
-  message: "Error rate >1% for 5 min"
-
-- alert: DatabaseDown
-  expr: up{job="postgresql"} == 0
-  for: 1m
-  severity: critical
-  message: "PostgreSQL unavailable"
-```
-
-**Warnings** (email only):
-```yaml
-- alert: HighLatency
-  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.1
-  for: 10m
-  severity: warning
-  message: "Latency p99 >100ms for 10 min"
-
-- alert: DiskSpaceRunningOut
-  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
-  for: 30m
-  severity: warning
-  message: "Disk space <10%"
-```
+**Warning alerts** (email only):
+- **High Latency**: p99 latency >100 ms for >10 min → investigation required
+- **Disk Space Running Out**: free disk space <10% for >30 min → cleanup required

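Each alert above combines a threshold, a duration the condition must hold, and a severity that selects the notification channels. The following Python sketch illustrates that logic under stated assumptions: `AlertRule`, `is_firing` and `notify_channels` are hypothetical names, samples are `(timestamp_seconds, value)` pairs, and the hold check only approximates how Prometheus evaluates a `for` duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float
    hold_seconds: int         # how long the condition must hold before firing
    severity: str
    fire_below: bool = False  # True when the alert triggers under the threshold

    def breached(self, value):
        return value < self.threshold if self.fire_below else value > self.threshold


# Severity-to-channel routing as described in the ADR.
CHANNELS = {"critical": ("slack", "email"), "warning": ("email",)}

# Thresholds and durations taken from the alert list; values are ratios or seconds.
RULES = [
    AlertRule("HighErrorRate", 0.01, 5 * 60, "critical"),
    AlertRule("HighLatency", 0.1, 10 * 60, "warning"),
    AlertRule("DiskSpaceRunningOut", 0.1, 30 * 60, "warning", fire_below=True),
]


def is_firing(rule, samples):
    """True if the condition has held continuously for hold_seconds at the last sample."""
    pending_since = None
    firing = False
    for timestamp, value in samples:
        if rule.breached(value):
            if pending_since is None:
                pending_since = timestamp
            firing = timestamp - pending_since >= rule.hold_seconds
        else:
            # Any healthy sample resets the pending timer.
            pending_since = None
            firing = False
    return firing


def notify_channels(rule, samples):
    """Channels to notify for this rule, or an empty tuple if it is not firing."""
    return CHANNELS[rule.severity] if is_firing(rule, samples) else ()
```

A single healthy sample resets the timer, which is why a briefly recovering error rate does not page anyone in this model.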
 ### Backup & Disaster Recovery

 **PostgreSQL WAL-E**:
-```bash
-# Continuous WAL (Write-Ahead Log) backup
-wal-e backup-push /var/lib/postgresql/data
-
-# Retention: 7 days of full backups + WAL
-# Storage: OVH S3 (GRA region, France)
-# Encryption: AES-256 server-side
-```
+- Method: continuous Write-Ahead Log (WAL) backup
+- Retention: 7 days of full backups + incremental WAL
+- Storage: OVH S3, GRA region (France)
+- Encryption: AES-256 server-side

 **RTO (Recovery Time Objective)**: 1h
-- Restore time from S3: ~30 min (10 GB DB)
-- Validation + service restart time: ~30 min
+- Restore from S3: ~30 min (10 GB DB)
+- Validation + service restart: ~30 min

 **RPO (Recovery Point Objective)**: 15 min
-- WAL archiving every 15 min
+- WAL archiving frequency: every 15 min
 - Maximum loss: 15 min of transactions

 **DR tests**: monthly (backup restore on the staging environment)
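The RTO and RPO figures above are simple budgets: recovery time is restore plus validation, and data loss is bounded by the age of the newest archived WAL segment. As an illustration only, a monthly DR test could be scored against them as in this Python sketch; `DrillResult` and `evaluate_drill` are hypothetical names, and all durations are in minutes.

```python
from dataclasses import dataclass

RTO_MIN = 60  # 1h objective: ~30 min restore + ~30 min validation and restart
RPO_MIN = 15  # WAL archived every 15 min


@dataclass(frozen=True)
class DrillResult:
    restore_min: float         # measured restore time from S3
    validation_min: float      # measured validation + service restart time
    newest_wal_age_min: float  # age of the newest archived WAL segment at failure time


def evaluate_drill(result, rto_min=RTO_MIN, rpo_min=RPO_MIN):
    """Compare one DR test run against the recovery objectives."""
    recovery_min = result.restore_min + result.validation_min
    return {
        "recovery_min": recovery_min,
        "rto_met": recovery_min <= rto_min,
        "rpo_met": result.newest_wal_age_min <= rpo_min,
    }
```

With the ADR's own estimates (30 min restore, 30 min validation), the RTO is met with no margin, so a database much larger than 10 GB would likely need the objective revisited.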
@@ -272,7 +197,7 @@ wal-e backup-push /var/lib/postgresql/data
 ### API Down (5xx errors spike)

 1. **Check**: Grafana dashboard → Errors tab
-2. **Logs**: Loki query `{app="roadwave-api"} |= "error"`
+2. **Logs**: Loki query filtered on app roadwave-api + error level
 3. **Actions**:
    - If OOM: restart the container + increase RAM
    - If DB connections are saturated: check slow queries
@@ -282,7 +207,7 @@ wal-e backup-push /var/lib/postgresql/data
 ### Database Slow Queries

 1. **Identify**: Grafana → PostgreSQL dashboard → Top slow queries
-2. **Analyze**: `EXPLAIN ANALYZE` on the problematic query
+2. **Analyze**: run EXPLAIN ANALYZE on the problematic query
 3. **Actions**:
    - Missing index: create the index (quick migration)
    - Lock contention: identify the long-running transaction and kill it if it is blocking
@@ -291,7 +216,7 @@ wal-e backup-push /var/lib/postgresql/data
 ### High Load (CPU >80%)

 1. **Check**: Grafana → Node Exporter → CPU usage
-2. **Top processes**: `htop` or `docker stats`
+2. **Top processes**: check htop or docker stats
 3. **Actions**:
    - If Whisper (moderation): reduce worker concurrency
    - If API: scale horizontally (add an instance)