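SLI/SLO/SLA System Design and Practice
Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) form the core framework of modern SRE practice. They define, measure, and manage service quality in quantitative terms, which puts reliability engineering on a measurable footing.
🎯 SLI/SLO/SLA Core Concepts
Definitions and relationships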
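sli_definition:
  concept: "Service Level Indicator"
  description: "A quantitative measure of service performance"
  characteristics:
    - "Objectively measurable"
    - "Tied to user experience"
    - "Oriented toward business value"
    - "Supported by the technical implementation"
  measurement_types:
    ratio_based:
      description: "Ratio-based indicators"
      formula: "good_events / total_events"
      examples:
        - "Successful requests / total requests"
        - "Fast responses / total responses"
        - "Available service time / total service time"
    threshold_based:
      description: "Threshold-based indicators"
      examples:
        - "Response time < 200ms"
        - "Error rate < 0.1%"
        - "Throughput > 1000 RPS"
    distribution_based:
      description: "Distribution-based indicators"
      examples:
        - "95% of requests have latency < 500ms"
        - "99.9% availability"
        - "P99 latency distribution"
  common_sli_patterns:
    availability:
      definition: "Proportion of time the service is reachable"
      calculation: |
        availability = uptime / (uptime + downtime)
      measurement_methods:
        - "HTTP status code checks"
        - "Health check endpoints"
        - "Synthetic transaction monitoring"
        - "Real user traffic"
    latency:
      definition: "How quickly requests are processed"
      metrics:
        - "Mean response time"
        - "Percentile latency (P50, P95, P99)"
        - "End-to-end processing time"
      measurement_points:
        - "Load balancer"
        - "Application"
        - "Database queries"
        - "External API calls"
    throughput:
      definition: "System processing capacity"
      metrics:
        - "Requests per second (RPS)"
        - "Transactions per second (TPS)"
        - "Data volume processed"
      considerations:
        - "Peak capacity"
        - "Sustained load capacity"
        - "Effectiveness of elastic scaling"
    error_rate:
      definition: "How often errors occur"
      calculation: |
        error_rate = error_requests / total_requests
      error_categories:
        - "4xx client errors"
        - "5xx server errors"
        - "Timeout errors"
        - "Business logic errors"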
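The ratio-based formulas above (`good_events / total_events`, time-based availability, error rate) can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the function names `ratio_sli` and `availability`, the treatment of an empty window, and the sample numbers are all assumptions made for the example.

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Return good_events / total_events as a fraction in [0, 1].

    Assumption for this sketch: a window with no traffic counts as fully
    good, since nothing failed. Other conventions are possible.
    """
    if good_events < 0 or good_events > total_events:
        raise ValueError("need 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


def availability(uptime_s: float, downtime_s: float) -> float:
    """Time-based availability: uptime / (uptime + downtime)."""
    return uptime_s / (uptime_s + downtime_s)


# Hypothetical sample window: 100,000 requests, 120 of which failed.
total = 100_000
errors = 120
success_rate = ratio_sli(total - errors, total)
error_rate = 1 - success_rate

print(f"success rate: {success_rate:.4%}")
print(f"error rate:   {error_rate:.4%}")
# One day with 300 seconds of downtime.
print(f"availability: {availability(86_100, 300):.4%}")
```

The request-based and time-based results generally differ for the same outage, so an SLO should state which of the two it uses.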
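slo_definition:
  concept: "Service Level Objective"
  description: "A target value or range for an SLI"
  characteristics:
    - "Defined in terms of an SLI"
    - "Achievable"
    - "Driven by business needs"
    - "Continuously monitored and evaluated"
  slo_structure:
    components:
      indicator: "The chosen SLI"
      target: "The target value to be met"
      measurement_window: "The evaluation time window"
      measurement_method: "How the value is calculated"
    example: |
      SLO: Over a 30-day rolling window, API availability should stay at or above 99.9%
      - SLI: API request success rate
      - Target: ≥ 99.9%
      - Window: 30-day rolling window
      - Method: successful requests / total requests
  slo_types:
    availability_slo:
      examples:
        - "99.9% availability (about 43 minutes of downtime allowed per 30 days)"
        - "99.95% availability (about 22 minutes of downtime allowed per 30 days)"
        - "99.99% availability (about 4 minutes of downtime allowed per 30 days)"
      measurement_approaches:
        uptime_based: "Based on service uptime"
        request_based: "Based on request success rate"
        user_experience_based: "Based on user experience"
    latency_slo:
      examples:
        - "95% of requests respond in < 200ms"
        - "99% of requests respond in < 500ms"
        - "99.9% of requests respond in < 1s"
      considerations:
        - "User experience thresholds"
        - "Business scenario requirements"
        - "Technical architecture limits"
    throughput_slo:
      examples:
        - "Sustain a continuous load of 1000 RPS"
        - "Peak capacity of 3000 RPS"
        - "99% of requests processed at normal throughput"
  error_budget:
    concept: "Error budget"
    definition: "The maximum amount of errors the SLO allows"
    calculation: |
      error_budget = (1 - SLO_target) × total_requests
      Example: 99.9% availability SLO, 1 million requests per month
      error_budget = (1 - 0.999) × 1,000,000 = 1,000 failed requests
    budget_policies:
      preventive_actions:
        - "70% of budget consumed: slow down feature releases"
        - "90% of budget consumed: stop non-critical deployments"
        - "100% of budget consumed: focus on reliability improvements"
      budget_reset:
        - "Reset per time window"
        - "Rolling-window calculation"
        - "Event-driven reset"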
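The error budget arithmetic and the 70% / 90% / 100% preventive actions above can be checked with a short Python sketch. The function names, the `"release normally"` fallback, and the figure of 750 failed requests are assumptions introduced for illustration; the formula and the thresholds come from the definitions above.

```python
def error_budget_requests(slo_target: float, total_requests: int) -> float:
    """Failed requests the SLO tolerates: (1 - SLO_target) * total_requests."""
    return (1 - slo_target) * total_requests


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime a time-based availability SLO tolerates within the window."""
    return (1 - slo_target) * window_days * 24 * 60


def policy_action(budget_consumed: float) -> str:
    """Map the consumed budget fraction to the preventive actions listed above."""
    if budget_consumed >= 1.0:
        return "focus on reliability improvements"
    if budget_consumed >= 0.9:
        return "stop non-critical deployments"
    if budget_consumed >= 0.7:
        return "slow down feature releases"
    return "release normally"  # assumed default, not stated in the policy


slo = 0.999
budget = error_budget_requests(slo, 1_000_000)
failed_so_far = 750  # hypothetical count for the current window
consumed = failed_so_far / budget

print(f"budget: {budget:.0f} failed requests")
print(f"consumed: {consumed:.0%} -> {policy_action(consumed)}")
for target in (0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%}: {minutes:.1f} min of downtime per 30 days")
```

The loop reproduces the downtime allowances quoted for the availability SLOs: roughly 43.2, 21.6, and 4.3 minutes per 30 days.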
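sla_definition:
  concept: "Service Level Agreement"
  description: "A formal agreement on service commitments"
  characteristics:
    - "Legally binding"
    - "Between customer and provider"
    - "Includes compensation clauses"
    - "Built on SLOs"
  sla_components:
    service_description:
      - "Definition of service scope"
      - "Description of features"
      - "Usage limits and conditions"
      - "Support hours"
    performance_commitments:
      - "Availability commitment"
      - "Performance guarantees"
      - "Response time commitment"
      - "Recovery time after failures"
    measurement_methods:
      - "How metrics are calculated"
      - "Measurement tools and processes"
      - "Data collection standards"
      - "Reporting frequency"
    remedies_and_penalties:
      - "Compensation for service outages"
      - "Handling of missed performance targets"
      - "Service credit refunds"
      - "Contract termination conditions"
  sla_tiers:
    basic_tier:
      availability: "99.0%"
      support: "Email support during business hours"
      remedies: "10% service credit"
    standard_tier:
      availability: "99.5%"
      support: "24x7 phone support"
      remedies: "25% service credit"
    premium_tier:
      availability: "99.9%"
      support: "Dedicated technical account manager"
      remedies: "100% service credit"
    enterprise_tier:
      availability: "99.95%"
      support: "On-site technical support"
      remedies: "Custom compensation agreement"
Design principles and best practices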
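sli_design_principles:
  user_centric:
    focus: "Center on user experience"
    guidelines:
      - "Define indicators from the user's point of view"
      - "Focus on performance that users can perceive"
      - "Avoid overly technical indicators"
      - "Account for different user groups"
    examples:
      good_sli:
        - "Page load time < 2 seconds"
        - "Search results returned in < 500ms"
        - "Video playback success rate > 99%"
      poor_sli:
        - "CPU utilization < 80%"
        - "Memory utilization < 70%"
        - "Disk I/O < 1000 IOPS"
  measurable_and_actionable:
    characteristics:
      - "Objectively measurable"
      - "Data is obtainable"
      - "Results are actionable"
      - "Improvements are verifiable"
    implementation:
      - "Use existing monitoring data"
      - "Set up automated collection"
      - "Ensure data accuracy"
      - "Establish an improvement process"
  proportional_and_achievable:
    balance_considerations:
      - "Business needs vs. technical capability"
      - "Cost vs. quality gains"
      - "User expectations vs. practical limits"
      - "Short-term goals vs. long-term plans"
    target_setting:
      baseline_establishment: "Based on historical data"
      incremental_improvement: "Raise targets gradually"
      industry_benchmarking: "Use industry standards as a reference"
      business_alignment: "Align with business value"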
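One possible reading of "baseline from historical data" in the target-setting list above is to take a low quantile of past daily availability as the starting SLO target, then tighten it step by step. The Python sketch below follows that reading. The function `baseline_target`, the nearest-rank quantile rule, and the ten-day history are assumptions for illustration, not a method prescribed by this document.

```python
import math
from typing import Sequence


def baseline_target(daily_availability: Sequence[float], quantile: float = 0.1) -> float:
    """Pick a low quantile of historical availability as a candidate target.

    Uses a simple nearest-rank rule that rounds down, so the result is
    always a value that was actually observed.
    """
    if not daily_availability:
        raise ValueError("history must not be empty")
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    ordered = sorted(daily_availability)
    index = math.floor(quantile * (len(ordered) - 1))
    return ordered[index]


# Hypothetical ten days of measured availability.
history = [0.9995, 0.9991, 0.9999, 0.9987, 0.9996,
           0.9998, 0.9993, 0.9999, 0.9990, 0.9997]

conservative = baseline_target(history, quantile=0.1)
stricter = baseline_target(history, quantile=0.25)
print(f"conservative baseline: {conservative:.2%}")
print(f"stricter baseline:     {stricter:.2%}")
```

A lower quantile yields a target the service has nearly always met, which suits a pilot phase; raising the quantile later corresponds to the incremental improvement step.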
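slo_implementation:
  gradual_rollout:
    phases:
      phase_1_pilot:
        scope: "Pilot on critical services"
        duration: "2-4 weeks"
        focus: "Metric collection and baseline establishment"
        success_criteria: "Completeness of data collection"
      phase_2_expansion:
        scope: "Expand to major services"
        duration: "1-2 months"
        focus: "SLO setting and monitoring"
        success_criteria: "SLO attainment rate"
      phase_3_optimization:
        scope: "Full coverage"
        duration: "Ongoing"
        focus: "Continuous optimization and improvement"
        success_criteria: "Increase in business value"
  stakeholder_alignment:
    product_team:
      responsibilities:
        - "Define business SLO requirements"
        - "Set user experience standards"
        - "Prioritize product features"
      engagement:
        - "Negotiate SLO targets"
        - "Define the error budget policy"
        - "Take part in release decisions"
    engineering_team:
      responsibilities:
        - "Implement technical SLIs"
        - "Build the monitoring system"
        - "Improve reliability"
      capabilities:
        - "SLI data collection"
        - "Automated monitoring"
        - "Incident response"
    operations_team:
      responsibilities:
        - "SLO monitoring and alerting"
        - "Incident response and handling"
        - "Reliability operations"
      processes:
        - "SLO violation handling process"
        - "Error budget management"
        - "Continuous improvement process"
📊 SLI Metric System Design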
User-experience-oriented SLIs
web_application_sli:
  availability_sli:
    http_success_rate:
      definition: "Proportion of HTTP 2xx and 3xx responses"
      formula: |
        availability = (
          count(http_status_code in [200-399]) /
          count(total_http_requests)
        ) * 100
      measurement_query: |
        # Prometheus query
        sum(rate(http_requests_total{status=~"[23].."}[5m])) /
        sum(rate(http_requests_total[5m])) * 100
    synthetic_monitoring:
      definition: "Synthetic transaction success rate"
      approach: "Active external probing"
      tools: ["Pingdom", "Datadog Synthetics", "In-house probes"]
      test_scenarios:
        - "Home page load test"
        - "User login flow"
        - "Critical business operations"
        - "API endpoint checks"
  latency_sli:
    page_load_time:
      metrics:
        - "DOMContentLoaded time"
        - "First Contentful Paint (FCP)"
        - "Largest Contentful Paint (LCP)"
        - "Time to Interactive (TTI)"
      measurement: |
        # Real User Monitoring (RUM)
        # performance.timing is deprecated; PerformanceNavigationTiming is its replacement
        performance.timing.loadEventEnd -
        performance.timing.navigationStart
    api_response_time:
      percentile_based: "P95 response time < 200ms"
      measurement_query: |
        # Prometheus query
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        )
  quality_sli:
    functional_correctness:
      definition: "Proportion of requests whose business logic executes correctly"
      examples:
        - "Search result relevance"
        - "Recommendation accuracy"
        - "Data consistency checks"
    user_satisfaction:
      apdex_score:
        definition: "Application Performance Index"
        formula: |
          Apdex = (Satisfied + Tolerated/2) / Total
          Satisfied: response_time <= T
          Tolerated: T < response_time <= 4T
          Frustrated: response_time > 4T
      user_feedback:
        - "User ratings"
        - "Customer support complaint rate"
        - "User churn rate"
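The Apdex formula above is easy to get wrong at the bucket boundaries, so a small worked example may help. The sketch below is a direct transcription of the formula into Python; the function name `apdex`, the threshold `T = 0.5` seconds, and the sample response times are assumptions for illustration.

```python
from typing import Iterable


def apdex(response_times_s: Iterable[float], t: float) -> float:
    """Apdex = (Satisfied + Tolerated / 2) / Total.

    Satisfied:  response_time <= T
    Tolerated:  T < response_time <= 4T
    Frustrated: response_time > 4T (counts only toward the total)
    """
    satisfied = tolerated = total = 0
    for rt in response_times_s:
        total += 1
        if rt <= t:
            satisfied += 1
        elif rt <= 4 * t:
            tolerated += 1
    if total == 0:
        raise ValueError("at least one sample is required")
    return (satisfied + tolerated / 2) / total


# Hypothetical response times in seconds, scored with T = 0.5s:
# 4 satisfied, 3 tolerated, 1 frustrated.
samples = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 1.5, 2.5]
score = apdex(samples, t=0.5)
print(f"Apdex(T=0.5s) = {score:.4f}")  # (4 + 3/2) / 8 = 0.6875
```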
api_service_sli:
  rest_api_sli:
    availability:
      measurement: "HTTP status code success rate"
      exclusions:
        - "4xx client errors (except 429 rate limiting)"
        - "Planned maintenance windows"
        - "During DDoS attacks"
      query_example: |
        # Good events: 2xx and 3xx responses.
        # Valid events: everything except excluded 4xx responses (429 stays in).
        sum(rate(http_requests_total{status=~"[23].."}[5m])) /
        sum(rate(http_requests_total{status=~"[235]..|429"}[5m]))
    latency:
      measurement: "Request processing time"
      targets:
        - "P50 < 100ms"
        - "P95 < 300ms"
        - "P99 < 1000ms"
      breakdown_analysis:
        - "Authentication and authorization time"
        - "Business logic processing"
        - "Database queries"
        - "External service calls"
    throughput:
      measurement: "Processing capacity"
      metrics:
        - "Requests per second (RPS)"
        - "Concurrent connections"
        - "Queue processing rate"
      capacity_planning:
        - "Peak load testing"
        - "Elastic scaling validation"
        - "Resource utilization monitoring"
  grpc_service_sli:
    success_rate:
      measurement: "gRPC status code success rate"
      successful_codes: ["OK"]
      error_codes: ["INTERNAL", "UNAVAILABLE", "DEADLINE_EXCEEDED"]
      query_example: |
        sum(rate(grpc_server_handled_total{grpc_code="OK"}[5m])) /
        sum(rate(grpc_server_handled_total[5m]))
    latency:
      measurement: "gRPC call latency"
      query_example: |
        histogram_quantile(0.95,
          sum(rate(grpc_server_handling_seconds_bucket[5m])) by (le)
        )
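The REST availability exclusions above (drop 4xx client errors from the SLI, but keep 429) affect both the numerator and the denominator. The Python sketch below shows one way to apply them to per-status request counts. The function name `api_availability`, the choice to treat an empty window as fully available, and the sample counts are assumptions made for this example.

```python
from typing import Mapping


def api_availability(status_counts: Mapping[int, int]) -> float:
    """Success rate over valid requests only.

    4xx responses other than 429 are treated as the client's fault and
    removed from the SLI entirely; 429 stays in as a failed request.
    """
    good = 0
    valid = 0
    for status, count in status_counts.items():
        excluded_client_error = 400 <= status < 500 and status != 429
        if excluded_client_error:
            continue
        valid += count
        if 200 <= status < 400:
            good += count
    if valid == 0:
        return 1.0  # assumption: no valid traffic means nothing failed
    return good / valid


# Hypothetical five-minute window of responses by status code.
window = {200: 9_700, 301: 100, 404: 150, 429: 30, 500: 15, 503: 5}
sli = api_availability(window)
print(f"availability: {sli:.4%} over {sum(window.values()) - 150} valid requests")
```

Counting 429 as a failure reflects the exclusion list above, where rate limiting is the one 4xx response that still counts against the service.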
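data_processing_sli:
  batch_processing:
    completeness:
      definition: "Proportion of records successfully processed"
      formula: |
        completeness = processed_records / total_records * 100
      measurement:
        - "ETL job success rate"
        - "Data quality check pass rate"
        - "End-to-end processing completion rate"
    timeliness:
      definition: "Timeliness of data processing"
      sli_examples:
        - "95% of batch jobs finish within the SLA time"
        - "Data delay < 4 hours"
        - "Real-time stream processing delay < 1 minute"
    accuracy:
      definition: "Accuracy of data processing"
      validation_checks:
        - "Data format validation"
        - "Business rule validation"
        - "Referential integrity checks"
        - "Data consistency validation"
  stream_processing:
    latency:
      end_to_end_latency: "Total delay from message production to consumption"
      processing_latency: "Processing time for a single message"
      measurement_points:
        - "Message queue delay"
        - "Processor delay"
        - "Output delay"
    throughput:
      messages_per_second: "Messages processed per second"
      backlog_size: "Backlog of unprocessed messages"
      scalability_metrics:
        - "Horizontal scaling effectiveness"
        - "Resource utilization efficiency"
        - "Load balancing effectiveness"
Error budget management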
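Implementing an error budget policy
error_budget_management:
  budget_calculation:
    time_window_based:
      rolling_window:
        description: "Rolling time window"
        advantages:
          - "Reflects recent service quality"
          - "Responds quickly to quality changes"
          - "Smooths out the effect of older data"
        implementation: |
          # 30-day rolling-window availability (pseudocode).
          # successful_requests and total_requests are per-interval counts;
          # with Prometheus counters, use increase(...[30d]) instead of sum_over_time.
          availability_30d = (
            sum_over_time(successful_requests[30d]) /
            sum_over_time(total_requests[30d])
          )
          error_budget_remaining = 1 - (
            (1 - availability_30d) / (1 - slo_target)
          )
      calendar_period:
        description: "Calendar period (month/quarter/year)"
        advantages:
          - "Aligns with business cycles"
          - "Simplifies budget planning"
          - "Has a clear reset time"
        use_cases:
          - "Financial reporting cycles"
          - "Quarterly business reviews"
          - "Annual planning"
    event_based_budget:
      incident_impact:
        calculation: |
          incident_budget_burn = (
            incident_duration * incident_impact_percentage
          ) / total_time_budget
        impact_assessment:
          - "Number of users affected"
          - "Degree of feature unavailability"
          - "Business process disruption"
          - "Risk of data loss"
      deployment_budget:
        allocation: "Reserve error budget for releases and deployments"
        strategy:
          - "Major releases: reserve 20% of the budget"
          - "Feature updates: reserve 10% of the budget"
          - "Security patches: reserve 5% of the budget"
  budget_policies:
    burn_rate_thresholds:
      alert_levels:
        - level: "WARNING"
          threshold: "50% of budget consumed"
          action: "Increase monitoring frequency"
        - level: "CRITICAL"
          threshold: "75% of budget consumed"
          action: "Restrict non-critical releases"
        - level: "EMERGENCY"
          threshold: "90% of budget consumed"
          action: "Focus on reliability improvements"
    response_procedures:
      budget_exhausted:
        immediate_actions:
          - "Stop all non-urgent deployments"
          - "Switch to reliability-improvement mode"
          - "Increase monitoring and alerting"
          - "Convene the reliability committee"
        recovery_plan:
          - "Root cause analysis and fixes"
          - "Process improvements"
          - "Pay down technical debt"
          - "Implement preventive measures"
      budget_surplus:
        optimization_opportunities:
          - "Accelerate feature releases"
          - "Technical experiments and innovation"
          - "Performance optimization projects"
          - "Infrastructure upgrades"
  error_budget_tooling:
    monitoring_dashboard:
      key_metrics:
        - "Current error budget balance"
        - "Budget burn rate"
        - "Historical budget trend"
        - "SLO attainment status"
      visualization:
        budget_gauge: "Remaining budget as a percentage"
        burn_rate_chart: "Burn rate trend"
        incident_impact: "Impact of incidents on the budget"
        forecast_projection: "Projected budget exhaustion"
    automated_alerting:
      alert_conditions:
        fast_burn: "5% of budget consumed within 1 hour"
        slow_burn: "20% of budget consumed within 24 hours"
        budget_depletion: "Remaining budget < 10%"
      notification_channels:
        - "Instant messaging (Slack/Teams)"
        - "Email notifications"
        - "Phone alerts (critical cases)"
        - "Ticketing system integration"
    integration_with_cicd:
      deployment_gates:
        pre_deployment_check:
          - "Error budget balance check"
          - "Historical release success rate"
          - "Current system health"
        post_deployment_monitoring:
          - "Release impact assessment"
          - "Monitoring of SLI changes"
          - "Automatic rollback trigger"
      feature_flag_integration:
        budget_aware_rollout:
          - "Progressive feature rollout"
          - "Adjust rollout speed based on remaining budget"
          - "Automatically disable features on anomalies"
📈 SLO Monitoring and Alerting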
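Multi-tier alerting strategy
slo_alerting_framework:
  multi_window_alerting:
    fast_burn_alerts:
      description: "Alert on fast error budget burn"
      time_windows: ["1h", "5m"]
      threshold: "Budget is burning too fast"
      prometheus_rule: |
        # Burn rate 14.4 sustained for 1 hour consumes 2% of a 30-day budget
        (
          1 - (
            sum(rate(http_requests_total{status=~"[23].."}[1h])) /
            sum(rate(http_requests_total[1h]))
          )
        ) > 14.4 * (1 - 0.999)  # 14.4 * 1h / 720h = 2%
      response: "Investigate and respond immediately"
    slow_burn_alerts:
      description: "Alert on slow error budget burn"
      time_windows: ["6h", "30m"]
      threshold: "Sustained quality degradation"
      prometheus_rule: |
        # Burn rate 2.4 sustained for 6 hours consumes 2% of a 30-day budget
        (
          1 - (
            sum(rate(http_requests_total{status=~"[23].."}[6h])) /
            sum(rate(http_requests_total[6h]))
          )
        ) > 2.4 * (1 - 0.999)  # 2.4 * 6h / 720h = 2%
      response: "Schedule an investigation and improvements"
  contextual_alerting:
    business_context:
      high_impact_periods:
        - "Double the weight during business peak hours"
        - "Lower thresholds during important launches"
        - "Special handling during holidays"
      user_segmentation:
        - "Stricter SLOs for VIP users"
        - "Region-specific thresholds"
        - "Independent monitoring per feature module"
    technical_context:
      deployment_correlation:
        - "Relax thresholds during release windows"
        - "Adjust during infrastructure maintenance"
        - "Adjust accordingly when dependencies fail"
      seasonal_patterns:
        - "Learn from historical patterns"
        - "Adapt to expected load changes"
        - "Adjust thresholds periodically"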
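The burn-rate multipliers in the rules above (14.4 and 2.4) follow from one relation: the budget fraction consumed equals burn rate × alert window / SLO period. The Python sketch below works through that relation and the two-window check (long window and short window must both exceed the threshold). All function names and the sample error ratios are assumptions for illustration; only the arithmetic is fixed.

```python
HOURS_PER_DAY = 24


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def budget_consumed(rate: float, window_hours: float, slo_period_days: int = 30) -> float:
    """Fraction of the period's budget used if `rate` holds for the window."""
    return rate * window_hours / (slo_period_days * HOURS_PER_DAY)


def threshold_for(budget_fraction: float, window_hours: float, slo_period_days: int = 30) -> float:
    """Burn rate that uses `budget_fraction` of the budget within the window."""
    return budget_fraction * slo_period_days * HOURS_PER_DAY / window_hours


def should_alert(long_error_ratio: float, short_error_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_error_ratio, slo_target) > threshold
            and burn_rate(short_error_ratio, slo_target) > threshold)


print(f"14.4 for 1h uses {budget_consumed(14.4, 1):.1%} of a 30-day budget")
print(f"2.4 for 6h uses  {budget_consumed(2.4, 6):.1%} of a 30-day budget")
print(f"5% in 1h needs a burn rate of {threshold_for(0.05, 1):.1f}")

# Hypothetical case: 2% errors over the last hour, but only 0.05% over the
# last 5 minutes, so the problem has already subsided and no page is sent.
print(should_alert(0.02, 0.0005, slo_target=0.999, threshold=14.4))
```

The short window is what lets the alert stop firing soon after recovery; without it, the 1-hour average would keep the alert active long after the error rate returned to normal.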
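prometheus_slo_rules:
  recording_rules:
    sli_calculations: |
      groups:
        - name: sli_calculations
          interval: 30s
          rules:
            # API availability SLI
            - record: sli:http_success_rate:5m
              expr: |
                sum(rate(http_requests_total{status=~"[23].."}[5m])) by (service) /
                sum(rate(http_requests_total[5m])) by (service)
            # API latency SLI (P95)
            - record: sli:http_latency_p95:5m
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
                )
            # Error budget burn rate over 1h, for a 99.9% SLO
            - record: sli:error_budget_burn_rate:1h
              expr: |
                (1 - (
                  sum(rate(http_requests_total{status=~"[23].."}[1h])) by (service) /
                  sum(rate(http_requests_total[1h])) by (service)
                )) / (1 - 0.999)
            # The alerts below also reference the 5m, 30m, and 6h burn rates;
            # record them the same way with the matching range.
            # Remaining 30-day error budget (1 = untouched, 0 = exhausted)
            - record: sli:error_budget_remaining:30d
              expr: |
                1 - (
                  (1 - (
                    sum(rate(http_requests_total{status=~"[23].."}[30d])) by (service) /
                    sum(rate(http_requests_total[30d])) by (service)
                  )) / (1 - 0.999)
                )
  alerting_rules:
    slo_breach_alerts: |
      groups:
        - name: slo_alerts
          rules:
            # Fast error budget burn
            - alert: SLOErrorBudgetFastBurn
              expr: |
                (
                  sli:error_budget_burn_rate:1h > 14.4
                  and
                  sli:error_budget_burn_rate:5m > 14.4
                )
              for: 2m
              labels:
                severity: critical
                slo_type: availability
              annotations:
                summary: "Fast error budget burn"
                description: |
                  Service {{ $labels.service }} is burning its error budget quickly
                  Current burn rate: {{ $value | humanize }}x
            # Slow error budget burn
            - alert: SLOErrorBudgetSlowBurn
              expr: |
                (
                  sli:error_budget_burn_rate:6h > 2.4
                  and
                  sli:error_budget_burn_rate:30m > 2.4
                )
              for: 15m
              labels:
                severity: warning
                slo_type: availability
              annotations:
                summary: "Sustained error budget burn"
                description: |
                  Service {{ $labels.service }} keeps burning its error budget
                  6-hour burn rate: {{ $value | humanize }}x
            # SLO close to violation
            - alert: SLONearViolation
              expr: |
                sli:error_budget_remaining:30d < 0.1
              for: 5m
              labels:
                severity: warning
                slo_type: availability
              annotations:
                summary: "SLO close to violation"
                description: |
                  Service {{ $labels.service }} has less than 10% of its error budget left
                  Remaining budget: {{ $value | humanizePercentage }}
Alert optimization and noise reduction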
Intelligent alerting strategy
intelligent_alerting:
  alert_correlation:
    multi_signal_analysis:
      correlation_matrix:
        - "SLO violation + Infrastructure metrics"
        - "Error budget burn + Deployment events"
        - "Latency increase + Dependency failures"
        - "Availability drop + Traffic spikes"
      correlation_algorithms:
        temporal_correlation:
          method: "Correlate events within a time window"
          window_size: "5-15 minutes"
          correlation_threshold: 0.8
        causal_inference:
          method: "Causal inference"
          techniques:
            - "Granger causality test"
            - "Pearl causal framework"
            - "Bayesian networks"
    root_cause_hints:
      automatic_annotation:
        deployment_events: "Recent deployment detected"
        infrastructure_issues: "High CPU/Memory usage"
        dependency_failures: "Upstream service degradation"
        traffic_anomalies: "Unusual traffic patterns"
      investigation_suggestions:
        - "Check recent deployments"
        - "Verify dependency health"
        - "Analyze traffic patterns"
        - "Review error logs"
  adaptive_thresholds:
    machine_learning_based:
      anomaly_detection:
        algorithms:
          - "Isolation Forest"
          - "LSTM Neural Networks"
          - "Seasonal Decomposition"
        training_data:
          - "Historical SLI values"
          - "Seasonal patterns"
          - "Business calendar events"
          - "Traffic characteristics"
      dynamic_adjustment:
        factors:
          - "Historical performance"
          - "Recent trend analysis"
          - "Business context"
          - "External factors"
    feedback_loop:
      alert_quality_metrics:
        precision: "True alerts / Total alerts"
        recall: "Detected incidents / Total incidents"
        time_to_detection: "Incident start to first alert"
        false_positive_rate: "False alerts / Total alerts"
      continuous_improvement:
        - "Weekly alert quality review"
        - "Threshold adjustment based on feedback"
        - "Algorithm performance monitoring"
        - "Stakeholder satisfaction surveys"
  alert_fatigue_prevention:
    alert_prioritization:
      severity_matrix:
        critical:
          criteria:
            - "Customer impact: High"
            - "Business impact: Revenue affecting"
            - "Response time: < 5 minutes"
          slo_examples:
            - "Payment service unavailable"
            - "User authentication failure"
            - "Data loss incidents"
        warning:
          criteria:
            - "Customer impact: Medium"
            - "Business impact: Feature degradation"
            - "Response time: < 30 minutes"
          slo_examples:
            - "Slow response times"
            - "Elevated error rates"
            - "Capacity warnings"
        info:
          criteria:
            - "Customer impact: Low"
            - "Business impact: Monitoring"
            - "Response time: Next business day"
          slo_examples:
            - "SLO trend notifications"
            - "Capacity planning alerts"
            - "Maintenance reminders"
    notification_optimization:
      smart_routing:
        escalation_policies:
          - "Primary on-call: 5 minutes"
          - "Secondary on-call: 15 minutes"
          - "Management: 30 minutes"
          - "Executive: 1 hour"
        context_aware_routing:
          business_hours: "Different teams for different hours"
          expertise_based: "Route to domain experts"
          load_balancing: "Distribute across available personnel"
      consolidation_strategies:
        temporal_grouping:
          - "Group alerts within 5-minute windows"
          - "Suppress duplicate alerts"
          - "Merge related incidents"
        logical_grouping:
          - "Service-based grouping"
          - "Infrastructure component grouping"
          - "Business function grouping"
📋 SLI/SLO/SLA Interview Highlights
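Fundamental concepts
What are the differences and relationships among SLI, SLO, and SLA?
- SLI: a measurable indicator of service quality
- SLO: the target value or acceptable range for an SLI
- SLA: a formal agreement on service commitments
- Each one builds on the previous one
What is an error budget, and how is it calculated and used?
- Definition: the maximum amount of errors the SLO allows
- Calculation: (1 - SLO target) × total requests
- Use: balancing reliability against feature releases
- Policy: response strategies tied to budget consumption
How do you choose appropriate SLIs?
- Oriented toward user experience
- Objectively measurable
- Relevant to business value
- Technically feasible
Design and practice
How do you design SLOs for different types of services?
- Web applications: availability, latency, throughput
- API services: success rate, response time, rate limiting
- Data processing: completeness, timeliness, accuracy
- Infrastructure: resource availability, performance metrics
How do you set SLO targets?
- Analyze historical data
- Consider user expectations
- Balance against technical capability
- Weigh business value
How do you design a multi-tier alerting strategy?
- Fast burn vs. slow burn
- Multi-window monitoring
- Context-aware alerting
- Alert noise reduction
Operations and management
How do you handle SLO violations?
- Root cause analysis process
- Impact assessment methods
- Defining improvement measures
- Establishing preventive strategies
What are the best practices for error budget management?
- Budget allocation strategy
- Burn monitoring
- Response policy definition
- Integration with the release process
How do you roll out an SLO program across the organization?
- Cross-team collaboration model
- Division of roles and responsibilities
- Building culture and processes
- Continuous improvement process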
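🔗 Related content
- Observability overview - complete observability architecture
- Distributed tracing - tracing and SLI integration
- Prometheus monitoring - SLI collection and calculation
- Grafana visualization - SLO dashboards and visualization
The SLI/SLO/SLA framework sits at the core of modern SRE practice: by defining, measuring, and managing service quality methodically, it supports continuous reliability improvement and maximizes business value.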
