
Prometheus Alerting Rules: Design and Practice

Alerting is a core part of any monitoring system. Well-designed alerting rules notify the right people promptly when something breaks while avoiding alert fatigue. This article walks through the design principles, configuration, and best practices for Prometheus alerting rules.

🚨 Alerting System Architecture

Prometheus Alerting Flow

```yaml
alerting_architecture:
  prometheus_server:
    responsibility: "Alert rule evaluation"
    functions:
      - "Evaluate alerting rules on a schedule"
      - "Maintain alert state"
      - "Send alerts to Alertmanager"
      - "Expose alerts through the query API"

    alert_states:
      - "Inactive: condition not met"
      - "Pending: condition met, waiting out the for duration"
      - "Firing: alert active"

  alertmanager:
    responsibility: "Alert management and notification"
    functions:
      - "Receive and validate alerts"
      - "Group and deduplicate alerts"
      - "Route and silence alerts"
      - "Send notifications with retries"

    features:
      - "Multiple notification channels"
      - "Alert inhibition and dependencies"
      - "Silence management"
      - "High-availability deployment"
```
```yaml
alert_lifecycle:
  rule_evaluation:
    frequency: "Every evaluation_interval"
    process:
      - "Run the PromQL expression"
      - "Evaluate the alert condition"
      - "Update alert state"
      - "Track the elapsed for duration"

  alert_states:
    inactive_to_pending:
      trigger: "Alert condition met for the first time"
      duration: "0 seconds"
      notification: "No notification sent"

    pending_to_firing:
      trigger: "Condition continuously met for the for duration"
      duration: "The for duration defined in the rule"
      notification: "Alert notification sent"

    firing_to_inactive:
      trigger: "Alert condition no longer met"
      duration: "Immediate"
      notification: "Resolved notification sent"

  notification_flow:
    - "Prometheus sends alerts to Alertmanager"
    - "Alertmanager receives and processes them"
    - "Alerts are grouped according to routing rules"
    - "Silences and inhibition rules are applied"
    - "Notifications are sent to the configured receivers"
```

🎯 Alerting Rule Design Principles

SMART Alerting Principles

```yaml
smart_alerting:
  specific:
    description: "Alerts must be specific"
    examples:
      good: "API /users endpoint 5xx error rate > 5%"
      bad: "Application error"

    implementation:
      - "Include the exact service name"
      - "Name the error type"
      - "State the concrete threshold"
      - "Attach the relevant labels"

  measurable:
    description: "Alerts must be quantifiable"
    examples:
      good: "CPU usage > 80% for 5 minutes"
      bad: "High CPU usage"

    implementation:
      - "An explicit numeric threshold"
      - "A duration condition"
      - "Percentages or absolute values"
      - "Thresholds grounded in historical data"

  actionable:
    description: "Alerts must be actionable"
    examples:
      good: "Disk space < 10% on /var partition"
      bad: "Something is wrong"

    implementation:
      - "A clear problem statement"
      - "Concrete remediation steps"
      - "Links to relevant documentation"
      - "An owner and contact information"

  relevant:
    description: "Alerts must matter"
    examples:
      good: "Database connection pool exhausted"
      bad: "Debug log level changed"

    implementation:
      - "Problems that affect the user experience"
      - "Anomalies on business-critical paths"
      - "Threats to system stability"
      - "No purely informational notifications"

  time_bound:
    description: "Alerts must be bounded in time"
    examples:
      good: "Service down for > 2 minutes"
      bad: "Service down"

    implementation:
      - "A sensible for duration"
      - "Timing derived from the service SLA"
      - "Allowance for self-recovery"
      - "No firing on transient blips"
```
```yaml
severity_levels:
  critical:
    definition: "The system is completely unavailable or there is a risk of data loss"
    response_time: "Respond within 5 minutes"
    examples:
      - "Service completely down"
      - "Database unreachable"
      - "Disk space exhausted"
      - "Security vulnerability being exploited"

    escalation:
      - "Immediately page the on-call engineer by SMS"
      - "Email the team lead at the same time"
      - "Automatically create a highest-priority ticket"
      - "Notify management if necessary"

  warning:
    definition: "May affect service performance or stability"
    response_time: "Respond within 30 minutes"
    examples:
      - "Slightly elevated error rate"
      - "Increased response time"
      - "High resource utilization"
      - "Insufficient replica count"

    escalation:
      - "Email the responsible team"
      - "Slack channel message"
      - "Handled during working hours"
      - "Recorded on the monitoring dashboard"

  info:
    definition: "Worth noting but not urgent"
    response_time: "Handled during working hours"
    examples:
      - "Deployment completed"
      - "Configuration change"
      - "Scheduled maintenance task"
      - "Capacity planning reminder"

    escalation:
      - "Email only"
      - "Recorded in the logging system"
      - "Summarized in periodic reports"
      - "No immediate response required"
```

Alerting Rule Templates

```yaml
# prometheus.rules.yml
groups:
- name: infrastructure.rules
  rules:
  # Instance down alert
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
      category: infrastructure
      team: sre
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: |
        Instance {{ $labels.instance }} of job {{ $labels.job }} 
        has been down for more than 1 minute.
        Current value: {{ $value }}
      runbook_url: "https://runbooks.company.com/instancedown"
      dashboard_url: "https://grafana.company.com/d/node-overview"

  # CPU usage alert
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
      category: infrastructure
      team: sre
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: |
        CPU usage has been above 80% on {{ $labels.instance }} for more than 5 minutes.
        Current usage: {{ $value | humanize }}%
      impact: "May affect application performance"
      action_required: "Check running processes and consider scaling"
```

```yaml
- name: application.rules
  rules:
  # HTTP error rate alert
  - alert: HighHTTPErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, instance) /
        sum(rate(http_requests_total[5m])) by (service, instance)
      ) * 100 > 5
    for: 2m
    labels:
      severity: critical
      category: application
      service: "{{ $labels.service }}"
    annotations:
      summary: "High HTTP 5xx error rate on {{ $labels.service }}"
      description: |
        HTTP 5xx error rate is {{ $value | humanize }}% on service {{ $labels.service }}.
        This is above the 5% threshold for more than 2 minutes.
      impact: "Users experiencing service errors"
      runbook_url: "https://runbooks.company.com/http-errors"

  # API latency alert
  - alert: HighAPILatency
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, endpoint)
      ) > 0.5
    for: 5m
    labels:
      severity: warning
      category: application
      service: "{{ $labels.service }}"
      endpoint: "{{ $labels.endpoint }}"
    annotations:
      summary: "High API latency on {{ $labels.service }}{{ $labels.endpoint }}"
      description: |
        95th percentile latency is {{ $value }}s on {{ $labels.service }}{{ $labels.endpoint }}.
        This is above the 500ms SLA threshold.
      sla_breach: "true"
      dashboard_url: "https://grafana.company.com/d/api-performance"

  # Database connection pool alert
  - alert: DatabaseConnectionPoolHigh
    expr: |
      (
        mysql_global_status_threads_connected / 
        mysql_global_variables_max_connections
      ) * 100 > 80
    for: 3m
    labels:
      severity: warning
      category: database
      database: "{{ $labels.instance }}"
    annotations:
      summary: "Database connection pool usage high"
      description: |
        Connection pool usage is {{ $value | humanize }}% on {{ $labels.instance }}.
        Consider investigating slow queries or increasing pool size.
      current_connections: |
        {{ with printf "mysql_global_status_threads_connected{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
      max_connections: |
        {{ with printf "mysql_global_variables_max_connections{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
```
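Rule files like the ones above should be validated before Prometheus loads them. `promtool`, which ships with Prometheus, can both lint rule files and unit-test them against synthetic series; the file names below are placeholders:

```yaml
promtool_commands: |
  # Syntax and semantic check of a rule file
  promtool check rules prometheus.rules.yml

  # Unit tests: replay synthetic samples and assert which alerts fire
  promtool test rules alerts_test.yml
```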

📊 Advanced Alerting Strategies

SLI/SLO-Based Alerting

```yaml
- name: slo.rules
  rules:
  # Availability SLO alert (99.9% target)
  - alert: SLOAvailabilityBreach
    expr: |
      (
        1 - (
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service)
        )
      ) < 0.999
    for: 1m
    labels:
      severity: critical
      category: slo
      slo_type: availability
    annotations:
      summary: "SLO availability breach on {{ $labels.service }}"
      description: |
        Service availability is {{ $value | humanizePercentage }}, below 99.9% SLO.
        Current error budget burn rate is high.
      error_budget_remaining: |
        {{ with printf "slo_error_budget_remaining{service='%s'}" $labels.service | query }}{{ . | first | value | humanizePercentage }}{{ end }}

  # Latency SLO alert (95% of requests < 200ms)
  - alert: SLOLatencyBreach
    expr: |
      (
        sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service) /
        sum(rate(http_request_duration_seconds_count[5m])) by (service)
      ) < 0.95
    for: 2m
    labels:
      severity: warning
      category: slo
      slo_type: latency
    annotations:
      summary: "SLO latency breach on {{ $labels.service }}"
      description: |
        {{ $value | humanizePercentage }} of requests meet the 200ms latency SLO.
        This is below the 95% target.
      p95_latency: |
        {{ with printf "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service='%s'}[5m])) by (le))" $labels.service | query }}{{ . | first | value }}s{{ end }}
```

```yaml
# Error budget recording rules
- name: error_budget.rules
  interval: 1m
  rules:
  # Remaining error budget over the 30-day window
  - record: slo_error_budget_remaining
    expr: |
      1 - (
        (
          sum(increase(http_requests_total{status=~"5.."}[30d])) by (service) /
          sum(increase(http_requests_total[30d])) by (service)
        ) / (1 - 0.999)  # SLO target: 99.9%
      )
  
  # Error budget burn rate: share of the 30-day budget consumed per hour
  - record: slo_error_budget_burn_rate_1h
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[1h])) by (service) /
        sum(rate(http_requests_total[1h])) by (service)
      ) / (1 - 0.999) / (24 * 30)  # scale the burn rate to the hourly share of a 30-day budget (assumes roughly uniform traffic)

- name: error_budget_alerts.rules
  rules:
  # Fast burn: more than 5% of the budget consumed within one hour
  - alert: ErrorBudgetFastBurn
    expr: slo_error_budget_burn_rate_1h > 0.05
    for: 5m
    labels:
      severity: critical
      category: error_budget
    annotations:
      summary: "Fast error budget burn on {{ $labels.service }}"
      description: |
        Service {{ $labels.service }} is consuming {{ $value | humanizePercentage }} of its 30-day error budget per hour.
        At this rate the monthly budget will be exhausted in well under a day.

  # Early warning: less than 20% of the budget remaining
  - alert: ErrorBudgetLow
    expr: slo_error_budget_remaining < 0.2
    for: 5m
    labels:
      severity: warning
      category: error_budget
    annotations:
      summary: "Error budget low on {{ $labels.service }}"
      description: |
        Service {{ $labels.service }} has only {{ $value | humanizePercentage }} error budget remaining.
        Consider reducing deployment frequency or focusing on reliability.
```
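To make the burn-rate numbers concrete, here is the arithmetic behind the recording rule above for a 99.9% SLO, assuming roughly uniform traffic:

```yaml
# Worked example for slo_error_budget_burn_rate_1h
#   error ratio over the last hour:       0.5%  (0.005)
#   burn rate:                            0.005 / (1 - 0.999) = 5
#   share of the 30-day budget per hour:  5 / (24 * 30) ≈ 0.69%
#   time to exhaust the full budget:      (24 * 30) / 5 = 144 hours ≈ 6 days
burn_rate_example:
  error_ratio_1h: 0.005
  burn_rate: 5
  budget_consumed_per_hour: "0.69%"
```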

Business Metric Alerts

Business KPI monitoring alerts:
```yaml
business_alerts:
  revenue_impact:
    - alert: PaymentSuccessRateDropped
      expr: |
        (
          sum(rate(payments_total{status="success"}[5m])) /
          sum(rate(payments_total[5m]))
        ) < 0.95
      for: 2m
      labels:
        severity: critical
        category: business
        impact: revenue
      annotations:
        summary: "Payment success rate below 95%"
        description: |
          Payment success rate is {{ $value | humanizePercentage }}.
          This directly impacts revenue and user experience.
        estimated_loss: |
          {{ with query "avg_over_time(payment_revenue_per_minute[10m])" }}Recent payment revenue rate: ${{ . | first | value | humanize }} per minute{{ end }}
    
    - alert: CheckoutAbandonmentHigh
      expr: |
        (
          1 - (
            sum(rate(checkout_completed_total[10m])) /
            sum(rate(checkout_started_total[10m]))
          )
        ) > 0.7
      for: 5m
      labels:
        severity: warning
        category: business
        impact: conversion
      annotations:
        summary: "High checkout abandonment rate"
        description: |
          Checkout abandonment rate is {{ $value | humanizePercentage }}.
          This may indicate payment issues or poor user experience.

  user_experience:
    - alert: UserRegistrationDropped
      expr: |
        sum(rate(user_registrations_total[10m])) <
        0.7 * sum(rate(user_registrations_total[1d] offset 1d))
      for: 10m
      labels:
        severity: warning
        category: business
        impact: growth
      annotations:
        summary: "User registration rate significantly dropped"
        description: |
          Current registration rate is 30% below yesterday's average.
          Investigate registration flow and marketing campaigns.
    
    - alert: ActiveUserSessionsLow
      expr: |
        active_user_sessions < 0.8 * active_user_sessions offset 1w
      for: 15m
      labels:
        severity: info
        category: business
        impact: engagement
      annotations:
        summary: "Active user sessions below weekly average"
        description: |
          Current active sessions: {{ $value }}.
          Weekly average: {{ with query "active_user_sessions offset 1w" }}{{ . | first | value }}{{ end }}.
          Consider investigating user experience issues.

  operational_efficiency:
    - alert: BatchJobBacklogGrowing
      expr: |
        sum(batch_job_queue_size) > 1000 and
        sum(deriv(batch_job_queue_size[30m])) > 0
      for: 10m
      labels:
        severity: warning
        category: operational
        impact: efficiency
      annotations:
        summary: "Batch job backlog growing"
        description: |
          Job queue size: {{ $value }} jobs.
          Queue is growing at {{ with query "sum(deriv(batch_job_queue_size[30m]))" }}{{ . | first | value | humanize }}{{ end }} jobs/second.
          Consider scaling processing capacity.

advanced_alert_patterns:
  anomaly_detection:
    - alert: TrafficAnomalyDetected
      expr: |
        abs(
          rate(http_requests_total[5m]) -
          avg_over_time(rate(http_requests_total[5m])[1h:5m] offset 1d)
        ) > 2 * stddev_over_time(rate(http_requests_total[5m])[1h:5m] offset 1d)
      for: 10m
      labels:
        severity: info
        category: anomaly
      annotations:
        summary: "Traffic pattern anomaly detected"
        description: |
          Current traffic: {{ with query "sum(rate(http_requests_total[5m]))" }}{{ . | first | value | humanize }}/s{{ end }}.
          Baseline (same window yesterday): {{ with query "sum(avg_over_time(rate(http_requests_total[5m])[1h:5m] offset 1d))" }}{{ . | first | value | humanize }}/s{{ end }}.
          The alert fires when traffic deviates from the baseline by more than two standard deviations.
  
  predictive_alerts:
    - alert: DiskWillFillIn4Hours
      expr: |
        predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
      for: 5m
      labels:
        severity: warning
        category: predictive
        resource: disk
      annotations:
        summary: "Disk {{ $labels.mountpoint }} will be full in 4 hours"
        description: |
          Based on current usage trend, disk {{ $labels.mountpoint }} on {{ $labels.instance }}
          will be full in approximately 4 hours.
          Current free space: {{ with printf "node_filesystem_free_bytes{instance='%s',mountpoint='%s'}" $labels.instance $labels.mountpoint | query }}{{ . | first | value | humanize1024 }}B{{ end }}
```
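In practice a raw `predict_linear` rule tends to be noisy on small or pseudo filesystems. A common refinement, sketched below, is to filter out such filesystems and only predict once usage is already elevated; the label filters and windows are examples:

```yaml
- alert: DiskWillFillIn4Hours
  expr: |
    (
      node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes < 0.2
    )
    and
    predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4 * 3600) < 0
  for: 15m
  labels:
    severity: warning
```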

🔧 Alertmanager Configuration

Routing and Grouping Strategy

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'mail.company.com:587'
  smtp_from: 'alerts@company.com'
  smtp_auth_username: 'alerts@company.com'
  smtp_auth_password: 'password'
  
  # Slack configuration
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

# Routing configuration
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # wait 30s after the first alert to collect more for the group
  group_interval: 5m     # minimum interval before notifying about new alerts in an existing group
  repeat_interval: 4h    # interval before re-sending a notification for a still-firing group
  receiver: 'default'    # default receiver
  
  routes:
  # Critical alerts are sent immediately
  - match:
      severity: critical
    group_wait: 0s
    group_interval: 1m
    repeat_interval: 30m
    receiver: 'critical-alerts'
  
  # Infrastructure alerts go to the SRE team
  - match:
      category: infrastructure
    receiver: 'sre-team'
    routes:
    - match:
        alertname: 'InstanceDown'
      receiver: 'sre-oncall'
  
  # Application alerts go to the development team
  - match:
      category: application
    receiver: 'dev-team'
    group_by: ['service', 'alertname']
  
  # Business alerts go to the product team
  - match:
      category: business
    receiver: 'product-team'
    group_by: ['impact', 'alertname']
    group_interval: 15m
```

```yaml
receivers:
- name: 'default'
  email_configs:
  - to: 'alerts@company.com'
    subject: 'Prometheus Alert'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Instance: {{ .Labels.instance }}
      Severity: {{ .Labels.severity }}
      {{ end }}

- name: 'critical-alerts'
  email_configs:
  - to: 'oncall@company.com'
    subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      🚨 CRITICAL ALERT 🚨
      
      Summary: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      
      Labels:
      {{ range .Labels.SortedPairs }}  - {{ .Name }}: {{ .Value }}
      {{ end }}
      
      {{ if .Annotations.runbook_url }}
      Runbook: {{ .Annotations.runbook_url }}
      {{ end }}
      {{ end }}
  
  slack_configs:
  - channel: '#critical-alerts'
    title: 'Critical Alert: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      {{ .Annotations.summary }}
      {{ .Annotations.description }}
      {{ end }}
    send_resolved: true

- name: 'sre-team'
  slack_configs:
  - channel: '#sre-alerts'
    title: '{{ .Status | toUpper }}: {{ .GroupLabels.alertname }}'
    text: |
      {{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} 
      {{ range .Alerts }}
      {{ .Annotations.summary }}
      Severity: {{ .Labels.severity }}
      {{ end }}
    send_resolved: true
  
  pagerduty_configs:
  - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
    description: '{{ .GroupLabels.alertname }}: {{ .CommonLabels.instance }}'
```

Silences and Inhibition Rules

```yaml
# Inhibition rule configuration
inhibit_rules:
# When an instance is down, inhibit all other alerts for that instance
- source_match:
    alertname: 'InstanceDown'
  target_match_re:
    instance: '.*'
  equal: ['instance']

# Cluster-level alerts inhibit node-level alerts
- source_match:
    alertname: 'KubernetesClusterDown'
  target_match_re:
    alertname: 'KubernetesNode.*'
  equal: ['cluster']

# When the MySQL primary is down, inhibit replica alerts
- source_match:
    alertname: 'MySQLMasterDown'
  target_match:
    alertname: 'MySQLSlaveDown'
  equal: ['cluster']

# Inhibit all alerts during a maintenance window
- source_match:
    alertname: 'MaintenanceMode'
  target_match_re:
    alertname: '.*'
  equal: ['datacenter']
```
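Newer Alertmanager releases (v0.22+) also accept the unified `matchers` syntax for inhibition rules, which reads closer to PromQL. The first rule above could be written as the following sketch of the same behaviour:

```yaml
inhibit_rules:
  - source_matchers:
      - 'alertname = "InstanceDown"'
    target_matchers:
      - 'alertname != "InstanceDown"'
    equal: ['instance']
```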
```yaml
# Creating silences via the API
silence_examples:
  # Maintenance-window silence
  maintenance_silence:
    matchers:
    - name: "instance"
      value: "web-server-1"
      isRegex: false
    startsAt: "2024-01-15T02:00:00Z"
    endsAt: "2024-01-15T04:00:00Z"
    comment: "Scheduled maintenance window"
    createdBy: "sre-team@company.com"
  
  # Known-issue silence
  known_issue_silence:
    matchers:
    - name: "alertname"
      value: "HighMemoryUsage"
      isRegex: false
    - name: "service"
      value: "analytics-service"
      isRegex: false
    startsAt: "2024-01-15T10:00:00Z"
    endsAt: "2024-01-16T10:00:00Z"
    comment: "Memory leak investigation in progress - JIRA-12345"
    createdBy: "dev-team@company.com"

# Using the amtool command-line tool
amtool_commands: |
  # List current alerts
  amtool alert query

  # Create a silence
  amtool silence add \
    alertname="HighCPUUsage" \
    instance="web-server-1" \
    --comment="Investigating performance issue" \
    --duration="2h"

  # List silences
  amtool silence query

  # Expire a silence
  amtool silence expire <silence-id>
```
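Routing trees are easy to get wrong, so it also helps to dry-run them: amtool can resolve which receiver a given label set would be routed to without firing anything. A sketch, assuming the config file path used above:

```yaml
amtool_routing_check: |
  # Print the routing tree
  amtool config routes show --config.file=alertmanager.yml

  # Ask which receiver a hypothetical alert would be routed to
  amtool config routes test --config.file=alertmanager.yml \
    severity=critical category=infrastructure alertname=InstanceDown
```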

📋 Key Interview Topics for Alert Design

Fundamentals

  1. What is the lifecycle of a Prometheus alert?

    • Inactive → Pending → Firing state transitions
    • The role of the for duration
    • How alerts resolve and recovery notifications are sent
  2. What are the core functions of Alertmanager?

    • Receiving and validating alerts
    • Grouping, deduplication, and silencing
    • Routing and notification delivery
    • High-availability deployment
  3. What is alert fatigue, and how do you avoid it?

    • The cost of too many low-value alerts
    • The SMART alerting principles
    • Sensible threshold design
    • Alert severity levels and routing

Rule Design

  1. How do you design effective alert thresholds?

    • Analyze historical data
    • Consider the business impact
    • Dynamic versus fixed thresholds (see the sketch after this list)
    • SLO-driven threshold design
  2. How does error-budget alerting work?

    • How the error budget is calculated
    • Monitoring the burn rate
    • Predictive alerting
    • Assessing business impact
  3. How do you handle dependencies between alerts?

    • Designing inhibition rules
    • Alert hierarchies
    • Root-cause analysis
    • Avoiding cascading alerts
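For the dynamic-threshold point above, one common pattern is to learn a baseline with a recording rule and alert on deviations from it rather than on a hard-coded number. A minimal sketch, assuming a pre-existing recorded series `service:latency_p95:5m`; the names and the 1.5x factor are illustrative:

```yaml
groups:
  - name: dynamic_threshold.rules
    rules:
      # Baseline: the 99th percentile of last week's recorded p95 latency
      - record: service:latency_p95:baseline_1w
        expr: quantile_over_time(0.99, service:latency_p95:5m[1w])

      # Alert when current latency exceeds 1.5x the learned baseline
      - alert: LatencyAboveDynamicBaseline
        expr: service:latency_p95:5m > 1.5 * service:latency_p95:baseline_1w
        for: 10m
        labels:
          severity: warning
```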

Practical Application

  1. How do you keep alerting performant in large environments?

    • Tuning rule evaluation frequency
    • Grouping rules into sensible rule groups
    • Running Alertmanager as a cluster
    • Load-balancing notification channels
  2. How do you make the alerting pipeline itself observable?

    • Metrics on rule evaluation
    • Notification delivery success rate
    • Alert handling time statistics
    • Alert quality review
  3. How do you test and validate alerting rules?

    • Unit-testing framework (see the sketch below)
    • Rule file validation
    • Chaos-engineering experiments
    • Alert replay and analysis
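For the unit-testing point above, promtool accepts a test file that feeds synthetic samples into the rules and asserts which alerts must be active at a given time. A minimal sketch, assuming a stripped-down rule file `test.rules.yml` that contains only an InstanceDown alert with a single summary annotation; all names and values are illustrative:

```yaml
# alerts_test.yml -- run with: promtool test rules alerts_test.yml
rule_files:
  - test.rules.yml   # assumed: alert InstanceDown, expr: up == 0, for: 1m,
                     # labels: {severity: critical}, annotations: {summary: "{{ $labels.instance }} down"}

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="web-1:9100"}'
        values: '1 0 0 0'          # the instance goes down after the first minute
    alert_rule_test:
      - eval_time: 3m              # well past the 1m `for` duration
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: web-1:9100
              job: node
            exp_annotations:
              summary: "web-1:9100 down"
```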

Effective alerting rule design is key to a successful monitoring system. By following the practices above, choosing sensible thresholds, and routing alerts deliberately, you can build an alerting system that surfaces real problems promptly without causing alert fatigue.
