# Prometheus Alerting Rule Design and Practice
Alerting is a core part of any monitoring system. Well-designed alerting rules notify the right people promptly when something breaks while avoiding alert fatigue. This article takes a close look at the design principles, configuration, and best practices for Prometheus alerting rules.
## 🚨 Alerting System Architecture

### Prometheus Alerting Flow

```yaml
alerting_architecture:
  prometheus_server:
    responsibility: "Alert rule evaluation"
    functions:
      - "Evaluate alerting rules periodically"
      - "Maintain alert state"
      - "Send alerts to Alertmanager"
      - "Expose alerts through the query API"
    alert_states:
      - "Inactive: not triggered"
      - "Pending: condition met, waiting out the 'for' duration"
      - "Firing: actively alerting"
  alertmanager:
    responsibility: "Alert management and notification"
    functions:
      - "Receive and validate alerts"
      - "Group and deduplicate alerts"
      - "Route and silence alerts"
      - "Send notifications and retry on failure"
    features:
      - "Multiple notification channels"
      - "Alert inhibition and dependencies"
      - "Silence management"
      - "High-availability deployment"
```
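For context, the rule files and the Alertmanager endpoints that this flow relies on are wired up in `prometheus.yml`. A minimal sketch (hostnames, ports, and file paths are placeholders, not values from this article):

```yaml
# prometheus.yml (excerpt) -- minimal wiring sketch; targets and paths are placeholders
global:
  evaluation_interval: 30s          # how often alerting and recording rules are evaluated

rule_files:
  - "rules/*.rules.yml"             # rule files loaded at startup and on configuration reload

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager-1:9093", "alertmanager-2:9093"]
```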
```yaml
alert_lifecycle:
  rule_evaluation:
    frequency: "Every evaluation_interval cycle"
    process:
      - "Execute the PromQL expression"
      - "Evaluate the alert condition"
      - "Update alert state"
      - "Track the 'for' duration"
  alert_states:
    inactive_to_pending:
      trigger: "Alert condition met for the first time"
      duration: "0 seconds"
      notification: "No notification sent"
    pending_to_firing:
      trigger: "Condition has held for the configured 'for' duration"
      duration: "The rule's 'for' duration"
      notification: "Alert notification sent"
    firing_to_inactive:
      trigger: "Alert condition no longer met"
      duration: "Immediate"
      notification: "Resolved notification sent"
  notification_flow:
    - "Prometheus sends alerts to Alertmanager"
    - "Alertmanager receives and processes the alerts"
    - "Alerts are grouped according to the routing tree"
    - "Silences and inhibition rules are applied"
    - "Notifications are sent to the configured receivers"
```
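The lifecycle above maps directly onto the `for` clause of an alerting rule. A minimal sketch (the recording-rule metric name and threshold are illustrative only):

```yaml
groups:
  - name: lifecycle-example
    rules:
      - alert: HighRequestLatency
        # Inactive until the expression returns a result
        expr: job:request_latency_seconds:p95 > 0.5   # assumed recording rule name
        # Pending while the condition has held for less than 10 minutes,
        # Firing once it has held continuously for 10 minutes
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms for 10 minutes on {{ $labels.job }}"
```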
## 🎯 Alert Rule Design Principles

### SMART Alerting Principles

```yaml
smart_alerting:
  specific:
    description: "Alerts should be specific"
    examples:
      good: "API /users endpoint 5xx error rate > 5%"
      bad: "Application error"
    implementation:
      - "Include the concrete service name"
      - "Name the exact error type"
      - "State the exact threshold"
      - "Carry the relevant labels"
  measurable:
    description: "Alerts should be quantifiable"
    examples:
      good: "CPU usage > 80% for 5 minutes"
      bad: "High CPU usage"
    implementation:
      - "Explicit numeric thresholds"
      - "A duration condition"
      - "Percentages or absolute values"
      - "Thresholds grounded in historical data"
  actionable:
    description: "Alerts should be actionable"
    examples:
      good: "Disk space < 10% on /var partition"
      bad: "Something is wrong"
    implementation:
      - "A clear problem statement"
      - "Concrete remediation steps"
      - "Links to relevant documentation"
      - "An owner and contact information"
  relevant:
    description: "Alerts should matter"
    examples:
      good: "Database connection pool exhausted"
      bad: "Debug log level changed"
    implementation:
      - "Problems that affect the user experience"
      - "Anomalies on business-critical paths"
      - "Threats to system stability"
      - "No purely informational noise"
  time_bound:
    description: "Alerts should be time-bound"
    examples:
      good: "Service down for > 2 minutes"
      bad: "Service down"
    implementation:
      - "A sensible 'for' duration"
      - "Durations derived from the service SLA"
      - "Allowance for self-recovery"
      - "Protection against transient flapping"
```
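To make the contrast concrete, here is a sketch of a vague rule rewritten along SMART lines (the metric, service name, and runbook URL are illustrative and not taken from a real system):

```yaml
# Before: not specific, not measurable, no duration, nothing to act on
- alert: ApplicationError
  expr: app_errors_total > 0

# After: specific service, measurable threshold, time-bound, actionable annotations
- alert: CheckoutAPIHighErrorRate
  expr: |
    sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
    team: payments
  annotations:
    summary: "Checkout API 5xx error rate above 5% for 5 minutes"
    runbook_url: "https://runbooks.example.com/checkout-errors"
```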
```yaml
severity_levels:
  critical:
    definition: "The system is completely unavailable or there is a risk of data loss"
    response_time: "Respond within 5 minutes"
    examples:
      - "Service fully down"
      - "Database unreachable"
      - "Disk space exhausted"
      - "Security vulnerability being exploited"
    escalation:
      - "Page the on-call engineer immediately via SMS"
      - "Email the team lead at the same time"
      - "Automatically open a highest-priority ticket"
      - "Notify management if necessary"
  warning:
    definition: "May affect service performance or stability"
    response_time: "Respond within 30 minutes"
    examples:
      - "Slightly elevated error rate"
      - "Increased response times"
      - "High resource utilisation"
      - "Insufficient replica count"
    escalation:
      - "Email the owning team"
      - "Post to the team Slack channel"
      - "Handle during working hours"
      - "Record on the monitoring dashboard"
  info:
    definition: "Worth attention but not urgent"
    response_time: "Handle during working hours"
    examples:
      - "Deployment completed"
      - "Configuration change"
      - "Scheduled maintenance task"
      - "Capacity-planning reminder"
    escalation:
      - "Email only"
      - "Record in the logging system"
      - "Include in periodic summary reports"
      - "No immediate response required"
```
### Alert Rule Templates

```yaml
# prometheus.rules.yml
groups:
- name: infrastructure.rules
rules:
      # Instance down
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
category: infrastructure
team: sre
annotations:
summary: "Instance {{ $labels.instance }} down"
description: |
Instance {{ $labels.instance }} of job {{ $labels.job }}
has been down for more than 1 minute.
Current value: {{ $value }}
runbook_url: "https://runbooks.company.com/instancedown"
dashboard_url: "https://grafana.company.com/d/node-overview"
      # High CPU usage
- alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
category: infrastructure
team: sre
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: |
CPU usage has been above 80% on {{ $labels.instance }} for more than 5 minutes.
            Current usage: {{ $value | humanize }}%
impact: "May affect application performance"
          action_required: "Check running processes and consider scaling"
```
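Rules like `InstanceDown` can be unit-tested with `promtool test rules` before they ship. A sketch against a trimmed-down copy of the rule (file names and the synthetic `up` series are assumptions):

```yaml
# rules/instance_down.yml -- trimmed-down rule used only for this test sketch
groups:
  - name: test.rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"

# tests/instance_down_test.yml -- run with: promtool test rules tests/instance_down_test.yml
rule_files:
  - ../rules/instance_down.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    # up is 1 at t=0, then the target is down for the next three samples
    input_series:
      - series: 'up{job="node", instance="web-1:9100"}'
        values: '1 0 0 0'
    alert_rule_test:
      # At 1m the condition has only just become true: the alert is Pending, so nothing fires yet
      - eval_time: 1m
        alertname: InstanceDown
        exp_alerts: []
      # At 3m the condition has held longer than 'for: 1m': the alert is Firing
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: "web-1:9100"
              job: node
            exp_annotations:
              summary: "Instance web-1:9100 down"
```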
```yaml
- name: application.rules
rules:
    # HTTP error rate
- alert: HighHTTPErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, instance) /
sum(rate(http_requests_total[5m])) by (service, instance)
) * 100 > 5
for: 2m
labels:
severity: critical
category: application
service: "{{ $labels.service }}"
annotations:
summary: "High HTTP 5xx error rate on {{ $labels.service }}"
description: |
          HTTP 5xx error rate is {{ $value | humanize }}% on service {{ $labels.service }}.
This is above the 5% threshold for more than 2 minutes.
impact: "Users experiencing service errors"
runbook_url: "https://runbooks.company.com/http-errors"
    # API latency
- alert: HighAPILatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, endpoint)
) > 0.5
for: 5m
labels:
severity: warning
category: application
service: "{{ $labels.service }}"
endpoint: "{{ $labels.endpoint }}"
annotations:
summary: "High API latency on {{ $labels.service }}{{ $labels.endpoint }}"
description: |
95th percentile latency is {{ $value }}s on {{ $labels.service }}{{ $labels.endpoint }}.
This is above the 500ms SLA threshold.
sla_breach: "true"
dashboard_url: "https://grafana.company.com/d/api-performance"
    # Database connection pool
- alert: DatabaseConnectionPoolHigh
expr: |
(
mysql_global_status_threads_connected /
mysql_global_variables_max_connections
) * 100 > 80
for: 3m
labels:
severity: warning
category: database
database: "{{ $labels.instance }}"
annotations:
summary: "Database connection pool usage high"
description: |
          Connection pool usage is {{ $value | humanize }}% on {{ $labels.instance }}.
Consider investigating slow queries or increasing pool size.
        current_connections: |
          {{ with printf "mysql_global_status_threads_connected{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
        max_connections: |
          {{ with printf "mysql_global_variables_max_connections{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
```
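When the same ratio feeds several alerts and dashboards, it is common to precompute it with a recording rule and alert on the recorded series instead. A sketch (the `service:http_requests:error_ratio_5m` name follows the usual level:metric:operation convention and is not defined elsewhere in this article):

```yaml
- name: http_error_ratio.rules
  rules:
    # Precompute the per-service 5xx ratio once per evaluation cycle;
    # alerts and dashboards then query this cheaper, pre-aggregated series.
    - record: service:http_requests:error_ratio_5m
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
        sum(rate(http_requests_total[5m])) by (service)
```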
## 📊 Advanced Alerting Strategies

### SLI/SLO-Based Alerting

```yaml
- name: slo.rules
rules:
    # Availability SLO alert (99.9% SLA)
- alert: SLOAvailabilityBreach
expr: |
(
1 - (
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service)
)
) < 0.999
for: 1m
labels:
severity: critical
category: slo
slo_type: availability
annotations:
summary: "SLO availability breach on {{ $labels.service }}"
description: |
Service availability is {{ $value | humanizePercentage }}, below 99.9% SLO.
Current error budget burn rate is high.
        error_budget_remaining: |
          {{ with printf "slo_error_budget_remaining{service='%s'}" $labels.service | query }}
          {{ . | first | value | humanizePercentage }}
          {{ end }}
    # Latency SLO alert (95% of requests < 200ms)
- alert: SLOLatencyBreach
expr: |
(
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service) /
sum(rate(http_request_duration_seconds_count[5m])) by (service)
) < 0.95
for: 2m
labels:
severity: warning
category: slo
slo_type: latency
annotations:
summary: "SLO latency breach on {{ $labels.service }}"
description: |
{{ $value | humanizePercentage }} of requests meet the 200ms latency SLO.
This is below the 95% target.
        p95_latency: |
          {{ with printf "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service='%s'}[5m])))" $labels.service | query }}
          {{ . | first | value }}s
          {{ end }}
```
```yaml
# Error budget recording rules
- name: error_budget.rules
interval: 1m
rules:
    # Fraction of the 30-day error budget remaining
- record: slo_error_budget_remaining
expr: |
1 - (
(
sum(increase(http_requests_total{status=~"5.."}[30d])) by (service) /
sum(increase(http_requests_total[30d])) by (service)
) / (1 - 0.999) # SLO target: 99.9%
)
    # Error budget burn rate: fraction of the 30-day budget consumed per hour,
    # based on the error ratio observed over the last hour
    - record: slo_error_budget_burn_rate_1h
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[1h])) by (service) /
          sum(rate(http_requests_total[1h])) by (service)
        ) / (1 - 0.999) / (24 * 30)  # normalise to the 720-hour (30-day) budget window
- name: error_budget_alerts.rules
rules:
    # Fast burn: more than 5% of the monthly budget consumed per hour
- alert: ErrorBudgetFastBurn
expr: slo_error_budget_burn_rate_1h > 0.05
for: 5m
labels:
severity: critical
category: error_budget
annotations:
summary: "Fast error budget burn on {{ $labels.service }}"
        description: |
          Service {{ $labels.service }} is consuming {{ $value | humanizePercentage }} of its monthly error budget per hour.
          At this rate the 30-day budget will be exhausted in well under a day.
    # Early warning: less than 20% of the budget remaining
- alert: ErrorBudgetLow
expr: slo_error_budget_remaining < 0.2
for: 5m
labels:
severity: warning
category: error_budget
annotations:
summary: "Error budget low on {{ $labels.service }}"
description: |
Service {{ $labels.service }} has only {{ $value | humanizePercentage }} error budget remaining.
          Consider reducing deployment frequency or focusing on reliability.
```
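A common refinement of the fast-burn alert is the multi-window, multi-burn-rate pattern popularised by the Google SRE Workbook: a longer window confirms the budget really is being consumed, while a short window confirms it is still happening right now. A sketch, assuming a 5-minute variant of the burn-rate recording rule (`slo_error_budget_burn_rate_5m`) is also defined:

```yaml
- name: slo_multiwindow_burn.rules
  rules:
    - alert: ErrorBudgetBurnRateCritical
      # 0.05/hour sustained over 1h means the monthly budget lasts less than a day;
      # requiring the same burn over the last 5 minutes filters out incidents that already recovered.
      expr: |
        slo_error_budget_burn_rate_1h > 0.05
        and
        slo_error_budget_burn_rate_5m > 0.05
      for: 2m
      labels:
        severity: critical
        category: error_budget
      annotations:
        summary: "Sustained fast error budget burn on {{ $labels.service }}"
```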
### Business Metric Alerting

#### Business KPI Alerts

```yaml
business_alerts:
revenue_impact:
- alert: PaymentSuccessRateDropped
expr: |
(
sum(rate(payments_total{status="success"}[5m])) /
sum(rate(payments_total[5m]))
) < 0.95
for: 2m
labels:
severity: critical
category: business
impact: revenue
annotations:
summary: "Payment success rate below 95%"
description: |
Payment success rate is {{ $value | humanizePercentage }}.
This directly impacts revenue and user experience.
        estimated_loss: |
          {{ with query "5 * avg_over_time(payment_revenue_per_minute[10m])" }}
          Estimated revenue impact: ${{ . | first | value | humanize }} per 5 minutes
          {{ end }}
- alert: CheckoutAbandonmentHigh
expr: |
(
1 - (
sum(rate(checkout_completed_total[10m])) /
sum(rate(checkout_started_total[10m]))
)
) > 0.7
for: 5m
labels:
severity: warning
category: business
impact: conversion
annotations:
summary: "High checkout abandonment rate"
description: |
Checkout abandonment rate is {{ $value | humanizePercentage }}.
This may indicate payment issues or poor user experience.
user_experience:
- alert: UserRegistrationDropped
expr: |
        sum(rate(user_registrations_total[10m])) <
        0.7 * sum(rate(user_registrations_total[1d] offset 1d))
for: 10m
labels:
severity: warning
category: business
impact: growth
annotations:
summary: "User registration rate significantly dropped"
description: |
Current registration rate is 30% below yesterday's average.
Investigate registration flow and marketing campaigns.
- alert: ActiveUserSessionsLow
expr: |
active_user_sessions < 0.8 * active_user_sessions offset 1w
for: 15m
labels:
severity: info
category: business
impact: engagement
annotations:
summary: "Active user sessions below weekly average"
description: |
Current active sessions: {{ $value }}.
Weekly average: {{ with query "active_user_sessions offset 1w" }}{{ . | first | value }}{{ end }}.
Consider investigating user experience issues.
operational_efficiency:
- alert: BatchJobBacklogGrowing
expr: |
        sum(batch_job_queue_size) > 1000 and
        sum(deriv(batch_job_queue_size[30m])) > 0
for: 10m
labels:
severity: warning
category: operational
impact: efficiency
annotations:
summary: "Batch job backlog growing"
description: |
Job queue size: {{ $value }} jobs.
          Queue is growing at {{ with query "sum(60 * deriv(batch_job_queue_size[30m]))" }}{{ . | first | value | humanize }}{{ end }} jobs/minute.
Consider scaling processing capacity.
advanced_alert_patterns:
anomaly_detection:
- alert: TrafficAnomalyDetected
expr: |
abs(
rate(http_requests_total[5m]) -
avg_over_time(rate(http_requests_total[5m])[1h:5m] offset 1d)
) > 2 * stddev_over_time(rate(http_requests_total[5m])[1h:5m] offset 1d)
for: 10m
labels:
severity: info
category: anomaly
annotations:
summary: "Traffic pattern anomaly detected"
        description: |
          Current traffic: {{ with query "rate(http_requests_total[5m])" }}{{ . | first | value | humanize }}/s{{ end }}.
          Baseline (same window yesterday): {{ with query "avg_over_time(rate(http_requests_total[5m])[1h:5m] offset 1d)" }}{{ . | first | value | humanize }}/s{{ end }}.
          The alert fires when the current rate deviates from the baseline by more than two standard deviations.
predictive_alerts:
- alert: DiskWillFillIn4Hours
expr: |
predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
for: 5m
labels:
severity: warning
category: predictive
resource: disk
annotations:
summary: "Disk {{ $labels.mountpoint }} will be full in 4 hours"
        description: |
          Based on the current usage trend, disk {{ $labels.mountpoint }} on {{ $labels.instance }}
          will be full in approximately 4 hours.
          Current free space: {{ with printf "node_filesystem_free_bytes{instance='%s', mountpoint='%s'}" $labels.instance $labels.mountpoint | query }}{{ . | first | value | humanize1024 }}B{{ end }}
```
## 🔧 Alertmanager Configuration

### Routing and Grouping Strategy

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'mail.company.com:587'
  smtp_from: 'alerts@company.com'
  smtp_auth_username: 'alerts@company.com'
  smtp_auth_password: 'password'
  # Slack configuration
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

# Routing configuration
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # wait 30s after the first alert to collect more alerts for the group
  group_interval: 5m     # minimum interval before notifying about new alerts added to a group
  repeat_interval: 4h    # interval before re-sending a still-firing alert
  receiver: 'default'    # default receiver
  routes:
    # Critical alerts are dispatched immediately
    - match:
        severity: critical
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 30m
      receiver: 'critical-alerts'

    # Infrastructure alerts go to the SRE team
    - match:
        category: infrastructure
      receiver: 'sre-team'
      routes:
        - match:
            alertname: 'InstanceDown'
          receiver: 'sre-oncall'

    # Application alerts go to the development team
    - match:
        category: application
      receiver: 'dev-team'
      group_by: ['service', 'alertname']

    # Business alerts go to the product team
    - match:
        category: business
      receiver: 'product-team'
      group_by: ['impact', 'alertname']
      group_interval: 15m
```
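Beyond routing, newer Alertmanager releases (`mute_time_intervals` since v0.22, the `time_intervals` block since v0.24) can mute routes on a schedule, which complements the maintenance-window silences shown later. A sketch (the interval name and times are placeholders):

```yaml
# alertmanager.yml (excerpt) -- schedule-based muting sketch
time_intervals:
  - name: weekend-maintenance
    time_intervals:
      - weekdays: ['saturday', 'sunday']
        times:
          - start_time: "02:00"
            end_time: "04:00"

route:
  routes:
    - match:
        category: infrastructure
      receiver: 'sre-team'
      mute_time_intervals:
        - weekend-maintenance   # notifications for this route are suppressed during the window
```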
```yaml
receivers:
- name: 'default'
email_configs:
- to: 'alerts@company.com'
subject: 'Prometheus Alert'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Instance: {{ .Labels.instance }}
Severity: {{ .Labels.severity }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'oncall@company.com'
subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
🚨 CRITICAL ALERT 🚨
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
{{ end }}
{{ if .Annotations.runbook_url }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
slack_configs:
- channel: '#critical-alerts'
title: 'Critical Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
{{ .Annotations.summary }}
{{ .Annotations.description }}
{{ end }}
send_resolved: true
- name: 'sre-team'
slack_configs:
- channel: '#sre-alerts'
title: '{{ .Status | toUpper }}: {{ .GroupLabels.alertname }}'
text: |
{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }}
{{ range .Alerts }}
{{ .Annotations.summary }}
Severity: {{ .Labels.severity }}
{{ end }}
send_resolved: true
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonLabels.instance }}'
```

### Silences and Inhibition Rules
```yaml
# Inhibition rules
inhibit_rules:
  # When an instance is down, suppress all other alerts for that instance
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      instance: '.*'
    equal: ['instance']

  # Cluster-level alerts suppress node-level alerts
  - source_match:
      alertname: 'KubernetesClusterDown'
    target_match_re:
      alertname: 'KubernetesNode.*'
    equal: ['cluster']

  # During a primary failover, suppress replica alerts
  - source_match:
      alertname: 'MySQLMasterDown'
    target_match:
      alertname: 'MySQLSlaveDown'
    equal: ['cluster']

  # Suppress all alerts during a maintenance window
  - source_match:
      alertname: 'MaintenanceMode'
    target_match_re:
      alertname: '.*'
    equal: ['datacenter']
```

```yaml
# Creating silences via the API
silence_examples:
  # Maintenance-window silence
  maintenance_silence:
    matchers:
      - name: "instance"
        value: "web-server-1"
        isRegex: false
    startsAt: "2024-01-15T02:00:00Z"
    endsAt: "2024-01-15T04:00:00Z"
    comment: "Scheduled maintenance window"
    createdBy: "sre-team@company.com"

  # Known-issue silence
  known_issue_silence:
    matchers:
      - name: "alertname"
        value: "HighMemoryUsage"
        isRegex: false
      - name: "service"
        value: "analytics-service"
        isRegex: false
    startsAt: "2024-01-15T10:00:00Z"
    endsAt: "2024-01-16T10:00:00Z"
    comment: "Memory leak investigation in progress - JIRA-12345"
    createdBy: "dev-team@company.com"

# Using the amtool command-line tool
amtool_commands: |
  # Show current alerts
  amtool alert query

  # Create a silence
  amtool silence add \
    alertname="HighCPUUsage" \
    instance="web-server-1" \
    --comment="Investigating performance issue" \
    --duration="2h"

  # List silences
  amtool silence query

  # Expire (delete) a silence
  amtool silence expire <silence-id>
```
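The same silences can also be created programmatically through the Alertmanager v2 HTTP API; a sketch in the same style as the amtool block above (host, port, and matcher values are placeholders):

```yaml
api_silence_example: |
  # POST a silence to the Alertmanager v2 API (default port 9093)
  curl -X POST http://alertmanager:9093/api/v2/silences \
    -H 'Content-Type: application/json' \
    -d '{
      "matchers": [
        {"name": "alertname", "value": "HighCPUUsage", "isRegex": false, "isEqual": true}
      ],
      "startsAt": "2024-01-15T02:00:00Z",
      "endsAt": "2024-01-15T04:00:00Z",
      "createdBy": "sre-team@company.com",
      "comment": "Scheduled maintenance window"
    }'
```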
## 📋 Alert Design Interview Topics

### Fundamentals

**What is the lifecycle of a Prometheus alert?**
- The Inactive → Pending → Firing state transitions
- The role of the `for` duration
- How alerts resolve and recovery notifications are sent

**What are the core functions of Alertmanager?**
- Receiving and validating alerts
- Grouping, deduplication, and silencing
- Routing and notification delivery
- High-availability deployment

**What is alert fatigue, and how do you avoid it?**
- The cost of too many low-value alerts
- The SMART alerting principles
- Sensible threshold design
- Alert severity levels and routing
### Rule Design

**How do you choose effective alert thresholds?**
- Base them on historical data analysis
- Weigh the business impact
- Dynamic versus fixed thresholds
- SLO-driven threshold design

**How does error-budget alerting work?**
- How the error budget is calculated
- Monitoring the burn rate
- Setting up predictive alerts
- Assessing the business impact

**How do you handle dependencies between alerts?**
- Designing inhibition rules
- Building an alert hierarchy
- Root-cause analysis
- Avoiding cascading alerts
### Practical Application

**How do you keep alerting performant in large environments?**
- Tuning rule evaluation frequency
- Grouping alerting rules sensibly
- Running Alertmanager as a cluster
- Load-balancing notification channels

**How do you make the alerting pipeline itself observable?**
- Metrics on rule evaluation
- Notification delivery success rate
- Alert handling time statistics
- Alert quality reviews

**How do you test and validate alerting rules?**
- Unit tests with promtool (see the sketch earlier in this article)
- Rule syntax validation
- Chaos-engineering exercises
- Alert replay and post-incident analysis
## 🔗 Related Content
- Prometheus Architecture – how alert evaluation works
- PromQL Queries – optimising alert rule expressions
- Metrics Collection – where alert data comes from
- Performance Optimisation – tuning the alerting pipeline
Effective alerting rule design is key to a successful monitoring setup. By following these best practices and choosing thresholds and routing strategies deliberately, you can build an alerting system that surfaces real problems promptly without causing alert fatigue.
