Observability: Platform Observability
SRE dashboard built on Prometheus + Loki + Tempo · SLO/SLI/Error Budget · Alert routing.
SLO · Service Level Objectives
All within budget
Availability
99.97%
Target ≥ 99.95% · Met
Error Budget (30d): 72%
Remaining downtime budget: 7m 24s
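For readers new to error budgets, the arithmetic behind a tile like this can be sketched as follows. This is a minimal, time-based textbook calculation. The panel does not say how its own figures are derived (request-based or time-based accounting, window alignment), so the sketch is not expected to reproduce them. The function names and the 99.98% observed value are hypothetical; only the 99.95% target and the 30-day window come from the tile.

```python
from datetime import timedelta


def downtime_budget(target: float, window: timedelta) -> timedelta:
    """Total downtime an availability target allows over the window."""
    return window * (1.0 - target)


def budget_remaining(target: float, observed: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed = 1.0 - target   # unavailability the SLO tolerates
    spent = 1.0 - observed   # unavailability actually observed
    return 1.0 - spent / allowed


if __name__ == "__main__":
    # 99.95% over 30 days, as on the tile; 99.98% observed is made up.
    total = downtime_budget(0.9995, timedelta(days=30))
    print(f"total downtime budget: {total}")
    print(f"budget remaining: {budget_remaining(0.9995, 0.9998):.0%}")
```

Both functions assume the observed availability is measured over the same window as the target.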
P95 Latency
148ms
Target < 200 ms · Met
Burn rate: 0.23x
Share of requests over target: 0.03%
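Burn rate is conventionally the observed fraction of bad events divided by the fraction the SLO allows, so 1x spends the budget exactly over the SLO window. The sketch below illustrates that definition and the resulting time to exhaustion. The tile does not state the SLO's allowed fraction or window, so the function names and all input values here are hypothetical and do not reproduce the 0.23x figure.

```python
from datetime import timedelta


def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Observed bad-event fraction divided by the fraction the SLO allows."""
    return bad_fraction / (1.0 - slo_target)


def time_to_exhaustion(rate: float, window: timedelta,
                       remaining: float = 1.0) -> timedelta:
    """How long the remaining budget lasts if the current burn rate holds."""
    if rate <= 0:
        return timedelta.max  # nothing is being spent
    return window * (remaining / rate)


if __name__ == "__main__":
    # Hypothetical: 0.25% bad events against a 99.95% target, 30-day window.
    rate = burn_rate(0.0025, 0.9995)
    print(f"burn rate: {rate:.2f}x")
    print(f"budget lasts: {time_to_exhaustion(rate, timedelta(days=30))}")
```

The same relation explains why a burn-rate alert threshold only translates into a "hours until exhausted" figure once the SLO window is known.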
RPO (Data Loss)
18s
Target < 1 min · Met
Last measured: 12:05:42
Kafka lag + replication
RTO (Recovery)
8min
Target < 15 min · Met
Last drill: 2026-03-28
Monthly DR drill · next 04-28
Metrics Dashboard
Prometheus scrape 15s
Live
Request Rate
P95 Latency / Subsystem
Error Rate
Kafka Lag
Zenith Anchor Queue
CPU / Memory across pods
Active Alerts
2 critical
3 warning
1 info
| Severity | Service | Metric | Threshold | Current Value | Fired At |
|---|---|---|---|---|---|
Sample Prometheus Alert Rule
/etc/prometheus/rules/approval_latency.yml
groups:
  - name: creata.approval.slo
    interval: 30s
    rules:
      - alert: ApprovalL1P95TooHigh
        expr: histogram_quantile(0.95, sum(rate(approval_duration_seconds_bucket[5m])) by (le, tenant)) > 0.2
        for: 5m
        labels:
          severity: warning
          # Quoted so YAML does not parse the template braces as a flow mapping
          track: "{{ $labels.tenant }}"
        annotations:
          summary: "Approval L1 P95 latency > 200ms on {{ $labels.tenant }}"
          runbook: "https://runbook.creata/cp/approval-latency.md"
      # Dual threshold: a 5x burn rate is escalated to critical
      - alert: ApprovalBudgetBurn5x
        expr: error_budget_burn_rate{slo="approval_latency"} > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Approval SLO burn rate 5x: budget exhausted in 2h"
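The first rule relies on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates that idea in plain Python. It is a simplification rather than Prometheus's implementation: it requires `0 < q <= 1`, assumes positive bucket bounds, and takes the lower edge of the first bucket to be 0. The bucket counts are hypothetical.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label on a classic Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # The last bucket holds the total, so this loop always breaks.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical cumulative counts for bounds in seconds.
    sample = [(0.05, 50), (0.1, 80), (0.2, 94), (0.5, 99), (math.inf, 100)]
    p95 = histogram_quantile(0.95, sample)
    print(f"estimated P95: {p95:.3f}s, above 0.2s threshold: {p95 > 0.2}")
```

Because the estimate is interpolated, its precision is bounded by the bucket layout; a bucket boundary at or near the 0.2 s threshold keeps the alert condition meaningful.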
Log Volume 24h
Loki
Total 184.2 GB · average 2.1 MB per second
Warn+Error 3.8% · ERROR only 0.12%
SRE Tool Shortcuts
Grafana
grafana.creata.internal
42 dashboards · 128 metric panels · per-tenant RLS applied
Loki
loki.creata.internal
30d online · 90d S3 archive · LogQL + built-in PII masking
Tempo
tempo.creata.internal
OTLP ingestion · 14d retention · Metric ↔ Trace linking via Exemplars