
Fluentd Log Collection in Practice

An in-depth look at Fluentd log collection in real production environments: configuration, pattern matching, data processing, and troubleshooting. A complete practical guide, from simple file tailing to complex multi-source aggregation.

🎯 Log Collection Strategy

Multi-Source Collection Architecture

```yaml
collection_patterns:
  centralized_collection:
    description: "Centralized collection pattern"
    architecture: |
      Apps/Services → Fluentd Agent → Central Aggregator → Storage
    
    advantages:
      - "Unified configuration management"
      - "Centralized data processing"
      - "Lower operational complexity"
      - "Better resource utilization"
    
    deployment_example:
      agent_nodes: "A Fluentd agent on every server"
      aggregator_cluster: "Central Fluentd cluster"
      load_balancer: "Load balancer for distribution"
      storage_backends: "Multiple storage backends"
  
  distributed_collection:
    description: "Distributed collection pattern"
    architecture: |
      Apps → Local Fluentd → Regional Aggregator → Global Storage
    
    benefits:
      - "Processing close to where data originates"
      - "Lower network latency"
      - "Regional fault isolation"
      - "Data-locality compliance"
    
    use_cases:
      - "Multi-datacenter deployments"
      - "Cross-region applications"
      - "Data-sovereignty requirements"
      - "Limited network bandwidth"
  
  hybrid_collection:
    description: "Hybrid collection pattern"
    scenarios:
      real_time_logs: "Real-time logs sent directly"
      batch_logs: "Batch logs aggregated locally"
      critical_logs: "Critical logs sent over multiple paths"
      archive_logs: "Archive logs processed with a delay"
    
    routing_strategy: |
      # Content-aware routing by log type
      <match critical.**>
        @type copy
        <store>
          @type forward  # real-time forwarding
          <server>
            name primary
            host aggregator-1.logging.svc.local
          </server>
        </store>
        <store>
          @type file    # local backup
          path /backup/critical/
        </store>
      </match>
```
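
In the centralized pattern, the agent tier can stay deliberately thin: tail local files and forward everything to the aggregators. A minimal agent-side sketch follows; the hostnames, paths, and tags are illustrative assumptions, not fixed conventions:

```ruby
# Minimal agent config for the centralized pattern (illustrative).
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.raw
  <parse>
    @type none            # ship raw lines; parse on the aggregator
  </parse>
</source>

<match app.**>
  @type forward
  <server>
    name primary
    host aggregator-1.logging.svc.local
    port 24224
  </server>
  <server>
    name standby
    host aggregator-2.logging.svc.local
    port 24224
    standby true          # only used when primary is unavailable
  </server>
  <buffer>
    @type file            # file buffer survives agent restarts
    path /var/log/fluentd/buffer/forward
    flush_interval 5s
  </buffer>
</match>
```
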
```yaml
data_source_classification:
  application_logs:
    web_servers:
      sources: ["nginx", "apache", "envoy"]
      characteristics:
        - "High-frequency access logs"
        - "Highly structured"
        - "Real-time requirements"
        - "Large data volume"
      
      collection_strategy:
        format: "Standard access-log format"
        frequency: "Real-time collection"
        parsing: "Predefined parsers"
        routing: "Route by status code (see the sketch after this block)"
    
    application_services:
      sources: ["java_apps", "python_apps", "go_services"]
      characteristics:
        - "Business-logic logs"
        - "Multiple log levels"
        - "Exception stack traces"
        - "Contextual correlation"
      
      collection_strategy:
        format: "Structured JSON or custom"
        frequency: "Near-real-time collection"
        parsing: "Multi-format parsing"
        enrichment: "Attach application metadata"
  
  infrastructure_logs:
    system_logs:
      sources: ["syslog", "systemd", "kernel"]
      characteristics:
        - "System-level events"
        - "Security-relevant"
        - "Diagnostic information"
        - "Standard formats"
      
      collection_config: |
        <source>
          @type syslog
          port 5140
          bind 0.0.0.0
          tag system.syslog
          <parse>
            @type syslog
            message_format rfc3164
            with_priority true
          </parse>
        </source>
    
    container_logs:
      sources: ["docker", "containerd", "cri-o"]
      challenges:
        - "Short container lifecycles"
        - "Diverse log formats"
        - "Metadata correlation"
        - "Multi-line log handling"
      
      kubernetes_integration: |
        <source>
          @type tail
          @id kubernetes_containers
          path /var/log/containers/*.log
          pos_file /var/log/fluentd-containers.log.pos
          tag kubernetes.*
          read_from_head true
          
          <parse>
            @type cri
            merge_cri_fields false
          </parse>
        </source>
  
  security_logs:
    audit_logs:
      sources: ["kubernetes_audit", "system_audit", "application_audit"]
      requirements:
        - "Integrity guarantees"
        - "Tamper detection"
        - "Long-term retention"
        - "Fast retrieval"
      
      special_handling: |
        <filter security.audit.**>
          @type record_transformer
          enable_ruby true
          <record>
            # Add an integrity checksum
            checksum ${Digest::SHA256.hexdigest(record.to_json)}
            # Add an audit timestamp
            audit_timestamp ${Time.now.strftime('%Y-%m-%dT%H:%M:%S.%L%z')}
          </record>
        </filter>
    
    access_logs:
      sources: ["api_gateway", "load_balancer", "firewall"]
      analysis_focus:
        - "Access-pattern analysis"
        - "Anomaly detection"
        - "Performance monitoring"
        - "Security-threat identification"
```

Advanced Parsing Configuration

```ruby
# Multi-format adaptive parsing
<source>
  @type tail
  path /var/log/app/mixed-format.log
  pos_file /var/log/fluentd/mixed-format.log.pos
  tag app.mixed
  
  <parse>
    @type multi_format
    # Patterns are tried in order; the first match wins
    
    # JSON logs
    <pattern>
      format json
      time_key timestamp
      time_format %Y-%m-%dT%H:%M:%S.%L%z
      keep_time_key true
    </pattern>
    
    # Structured text format
    <pattern>
      format /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(?<level>\w+)\] \[(?<thread>[\w-]+)\] (?<logger>[\w.]+) - (?<message>.*)$/
      time_key timestamp
      time_format %Y-%m-%d %H:%M:%S.%L
    </pattern>
    
    # Key-value format (LTSV with a custom label delimiter)
    <pattern>
      format ltsv
      time_key time
      label_delimiter |
      delimiter_pattern /\t/
    </pattern>
    
    # Apache access-log format
    <pattern>
      format apache2
      time_key time
    </pattern>
    
    # Fallback: keep the raw message
    <pattern>
      format none
      message_key message
    </pattern>
  </parse>
</source>

# Multi-line log aggregation
<source>
  @type tail
  path /var/log/app/exceptions.log
  pos_file /var/log/fluentd/exceptions.log.pos
  tag app.exceptions
  
  <parse>
    @type multiline
    format_firstline /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/
    format1 /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(?<level>\w+)\] (?<message>.*)$/
    time_key timestamp
    time_format %Y-%m-%d %H:%M:%S.%L
  </parse>
</source>

# Dedicated parsing for Java exception stack traces
<source>
  @type tail
  path /var/log/java-app/*.log
  pos_file /var/log/fluentd/java-app.log.pos
  tag java.app
  # Flush a buffered multi-line event (e.g. a stack trace) after 5s of silence;
  # this is an in_tail parameter, so it lives outside <parse>
  multiline_flush_interval 5s
  
  <parse>
    @type multiline
    format_firstline /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/
    format1 /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) +(?<level>\w+) +--- \[(?<thread>.*?)\] (?<logger>\S+)\s*: (?<message>.*)/
    time_key timestamp
    time_format %Y-%m-%d %H:%M:%S.%L
    keep_time_key true
  </parse>
</source>

# Custom Grok patterns
<source>
  @type tail
  path /var/log/custom-app/*.log
  pos_file /var/log/fluentd/custom-app.log.pos
  tag custom.app
  
  <parse>
    @type grok
    grok_pattern %{CUSTOM_APP_LOG}
    custom_pattern_path /etc/fluentd/patterns/custom
    time_key logtime
    time_format %Y-%m-%d %H:%M:%S
  </parse>
</source>
```
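
The grok source references %{CUSTOM_APP_LOG} from a custom pattern file. A hypothetical /etc/fluentd/patterns/custom might look like this; the line layout shown is an assumption to illustrate the mechanism:

```
# /etc/fluentd/patterns/custom (hypothetical contents)
# Matches lines like: 2024-01-15 10:30:00 [INFO] order-service: payload accepted
CUSTOM_APP_LOG %{TIMESTAMP_ISO8601:logtime} \[%{LOGLEVEL:level}\] %{DATA:service}: %{GREEDYDATA:message}
```
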
```ruby
# GeoIP enrichment (fluent-plugin-geoip with a GeoLite2 database)
<filter web.access>
  @type geoip
  geoip_lookup_keys client_ip
  backend_library geoip2_c
  geoip2_database /etc/fluentd/GeoLite2-City.mmdb
  
  <record>
    geoip_country ${country.names.en["client_ip"]}
    geoip_region ${subdivisions.0.names.en["client_ip"]}
    geoip_city ${city.names.en["client_ip"]}
    geoip_latitude ${location.latitude["client_ip"]}
    geoip_longitude ${location.longitude["client_ip"]}
  </record>
  
  # A filter cannot rewrite tags; if downstream matches expect a geo.web
  # prefix, retag with rewrite_tag_filter (see the routing section below)
</filter>

# User-agent parsing
<filter web.access>
  @type ua_parser
  key_name user_agent
  delete_key false
  
  out_key user_agent_parsed
  flatten true
</filter>

# Business-field extraction and computation
<filter app.**>
  @type record_transformer
  enable_ruby true
  
  <record>
    # Extract the request ID
    request_id ${record["message"] =~ /request_id=([a-zA-Z0-9-]+)/ ? $1 : "unknown"}
    
    # Bucket the response time (placeholder expressions must stay on one line)
    response_time_category ${rt = record["response_time"].to_f; rt < 100 ? "fast" : (rt < 500 ? "normal" : (rt < 2000 ? "slow" : "very_slow"))}
    
    # Extract HTTP method and path
    http_method ${record["request"] =~ /^(\w+) / ? $1 : "unknown"}
    http_path ${record["request"] =~ /^\w+ ([^\s?]+)/ ? $1 : "unknown"}
    
    # Normalize the status class (200, 300, 400, 500)
    status_class ${(record["status"].to_i / 100) * 100}
    
    # Add time dimensions (the event time is exposed as `time`)
    hour_of_day ${Time.at(time.to_i).hour}
    day_of_week ${Time.at(time.to_i).wday}
    
    # Environment information ("#{...}" is expanded once at config load)
    environment "#{ENV['ENVIRONMENT'] || 'unknown'}"
    datacenter "#{ENV['DATACENTER'] || 'unknown'}"
    cluster "#{ENV['CLUSTER_NAME'] || 'default'}"
  </record>
</filter>

# Sensitive-data masking
<filter app.**>
  @type record_transformer
  enable_ruby true
  
  <record>
    # Mask credit cards, emails, phone numbers, and national ID numbers
    message ${record["message"].to_s.gsub(/\b(?:\d{4}[-\s]?){3}\d{4}\b/, "****-****-****-****").gsub(/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/, "[EMAIL_REDACTED]").gsub(/\b\d{3}-\d{3}-\d{4}\b/, "[PHONE_REDACTED]").gsub(/\b\d{17}[\dXx]\b/, "[ID_REDACTED]")}
    
    # Redact sensitive query parameters
    query_string ${record["query_string"] ? record["query_string"].gsub(/([?&])(password|token|secret|key)=[^&]*/, '\1\2=[REDACTED]') : nil}
  </record>
</filter>

# Data validation and cleanup: drop records without a recognized level
<filter app.**>
  @type grep
  <regexp>
    key level
    pattern ^(ERROR|WARN|INFO|DEBUG|TRACE)$
  </regexp>
</filter>

<filter app.**>
  @type record_transformer
  enable_ruby true
  
  <record>
    # Type conversion and validation
    response_time ${rt = record["response_time"]; rt.is_a?(Numeric) ? rt : (rt.to_s =~ /\A\d+(\.\d+)?\z/ ? rt.to_f : 0.0)}
    
    # Status-code normalization
    status_code ${s = record["status"] || record["status_code"]; s.to_s =~ /\A\d{3}\z/ && s.to_i.between?(100, 599) ? s.to_i : 0}
    
    # Clean up empty fields
    level ${record["level"]&.upcase || "UNKNOWN"}
    service ${record["service"]&.strip || "unknown"}
  </record>
</filter>
```
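
Before deploying masking rules, it is worth exercising the regexes in plain Ruby with fabricated samples; a quick check like the one below catches escaping mistakes early (all sample values are made up):

```ruby
# mask_check.rb: standalone check of the masking regexes above
samples = [
  "card 4111-1111-1111-1111 paid",
  "contact user@example.com or 555-123-4567",
  "id 11010519491231002X on file"
]

samples.each do |msg|
  msg = msg.gsub(/\b(?:\d{4}[-\s]?){3}\d{4}\b/, "****-****-****-****")                      # credit cards
           .gsub(/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/, "[EMAIL_REDACTED]")  # emails
           .gsub(/\b\d{3}-\d{3}-\d{4}\b/, "[PHONE_REDACTED]")                               # phone numbers
           .gsub(/\b\d{17}[\dXx]\b/, "[ID_REDACTED]")                                       # national IDs
  puts msg
end
# Expected output:
# card ****-****-****-**** paid
# contact [EMAIL_REDACTED] or [PHONE_REDACTED]
# id [ID_REDACTED] on file
```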

Advanced Routing and Distribution

```ruby
# Content-based dynamic routing
<match app.**>
  @type rewrite_tag_filter
  
  # Route error logs
  <rule>
    key level
    pattern /^ERROR$/
    tag error.${tag_parts[1]}
  </rule>
  
  # Route security events
  <rule>
    key message
    pattern /(login|logout|authentication|authorization|failed|denied)/i
    tag security.${tag_parts[1]}
  </rule>
  
  # Route performance events
  <rule>
    key response_time
    pattern /^[5-9]\d{2,}|[1-9]\d{3,}/  # >= 500ms
    tag performance.slow.${tag_parts[1]}
  </rule>
  
  # Route business events
  <rule>
    key message
    pattern /(payment|order|transaction|billing)/i
    tag business.${tag_parts[1]}
  </rule>
  
  # Default route (records matching none of the rules are dropped)
  <rule>
    key level
    pattern /.*/
    tag normal.${tag_parts[1]}
  </rule>
</match>

# Multi-target handling of routed events
<match error.**>
  @type copy
  
  # Send to dedicated error storage
  <store>
    @type elasticsearch
    host elasticsearch-errors.logging.svc.local
    port 9200
    index_name errors-%Y.%m.%d
    
    <buffer time>
      timekey 3600
      timekey_wait 60
    </buffer>
  </store>
  
  # Trigger alert handling
  <store>
    @type exec
    command /usr/local/bin/alert-handler.sh
    <format>
      @type json
    </format>
    
    <buffer>
      flush_mode immediate
    </buffer>
  </store>
  
  # Send to Slack
  <store>
    @type slack
    webhook_url "#{ENV['SLACK_WEBHOOK_URL']}"
    channel "#alerts"
    username "Fluentd"
    icon_emoji ":rotating_light:"
    
    title "Error Alert"
    message "Service: %s, Error: %s"
    message_keys service,message
    
    <buffer>
      flush_mode immediate
    </buffer>
  </store>
</match>

# Geographic distribution: a filter cannot change tags, so split the
# GeoIP-enriched stream by country with rewrite_tag_filter
<match geo.web.access>
  @type rewrite_tag_filter
  
  <rule>
    key geoip_country
    pattern /^United States$/
    tag geo.web.access.us
  </rule>
  
  <rule>
    key geoip_country
    pattern /^(Germany|France|United Kingdom|Italy)$/
    tag geo.web.access.eu
  </rule>
  
  <rule>
    key geoip_country
    pattern /^(China|Japan|South Korea|Singapore)$/
    tag geo.web.access.asia
  </rule>
  
  <rule>
    key geoip_country
    pattern /.+/
    tag geo.web.access.other
  </rule>
</match>

# Load-balanced forwarding: out_forward distributes events across
# its <server> entries in weighted round-robin
<match normal.**>
  @type forward
  
  <server>
    name primary
    host aggregator-1.logging.svc.local
    port 24224
    weight 60
  </server>
  
  <server>
    name secondary
    host aggregator-2.logging.svc.local
    port 24224
    weight 30
  </server>
  
  <server>
    name backup
    host aggregator-3.logging.svc.local
    port 24224
    weight 10
  </server>
  
  <buffer>
    @type file
    path /var/log/fluentd/buffer/forward-lb
    flush_mode interval
    flush_interval 5s
    chunk_limit_size 16m
  </buffer>
</match>
```
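
If every aggregator stays unreachable long enough to exhaust retries, chunks would normally be discarded. A `<secondary>` section spills them to local disk for later replay instead. A minimal sketch; the directory path is an assumption:

```ruby
<match normal.**>
  @type forward
  <server>
    host aggregator-1.logging.svc.local
    port 24224
  </server>
  <buffer>
    @type file
    path /var/log/fluentd/buffer/forward
    retry_max_times 10          # give up after 10 retries...
  </buffer>
  <secondary>
    @type secondary_file        # ...then dump chunks locally for replay
    directory /var/log/fluentd/failed_chunks
  </secondary>
</match>
```
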
```ruby
# Time-based distribution
<filter **>
  @type record_transformer
  enable_ruby true
  
  <record>
    # Bucket events by hour of day (the event time is exposed as `time`)
    time_bucket ${h = Time.at(time.to_i).hour; h < 6 ? "night" : (h < 12 ? "morning" : (h < 18 ? "afternoon" : "evening"))}
  </record>
</filter>

# Split the stream by time bucket via retagging. Routing on record
# contents is done with rewrite_tag_filter; the match pattern must not
# catch the rewritten tags, or events would loop.
<match app.**>
  @type rewrite_tag_filter
  <rule>
    key time_bucket
    pattern /^night$/
    tag batch.night
  </rule>
  <rule>
    key time_bucket
    pattern /^(morning|afternoon|evening)$/
    tag realtime.day
  </rule>
</match>

# Low-frequency night traffic: batch to daily files
<match batch.night>
  @type file
  path /var/log/batch/night-%Y%m%d.log
  
  <buffer time>
    timekey 86400          # daily chunks, matching the %Y%m%d placeholder
    chunk_limit_size 100m
  </buffer>
</match>

# Daytime traffic: near-real-time forwarding
<match realtime.day>
  @type forward
  <server>
    host realtime-processor.logging.svc.local
    port 24224
  </server>
  
  <buffer>
    flush_mode interval
    flush_interval 1s
    chunk_limit_size 1m
  </buffer>
</match>

# Distribution by service importance
<filter app.**>
  @type record_transformer
  enable_ruby true
  <record>
    service_tier ${record["service"] =~ /^(auth|payment|billing)$/ ? "critical" : (record["service"] =~ /^(api|web|mobile)$/ ? "important" : "normal")}
  </record>
</filter>

<match app.**>
  @type rewrite_tag_filter
  <rule>
    key service_tier
    pattern /^critical$/
    tag tier.critical
  </rule>
  <rule>
    key service_tier
    pattern /^(important|normal)$/
    tag tier.standard
  </rule>
</match>

# High-reliability handling for critical services
<match tier.critical>
  @type copy
  <store>
    @type forward
    <server>
      host critical-aggregator-1.logging.svc.local
      port 24224
    </server>
    <server>
      host critical-aggregator-2.logging.svc.local
      port 24224
    </server>
    
    <buffer>
      @type file
      path /var/log/fluentd/buffer/critical
      flush_mode immediate
      retry_forever true
      retry_max_interval 30
    </buffer>
  </store>
  
  # Local backup
  <store>
    @type file
    path /backup/critical/%Y/%m/%d/critical.log
    append true
    
    <buffer time>
      timekey 86400
    </buffer>
  </store>
</match>

# Standard handling for everything else
<match tier.standard>
  @type forward
  <server>
    host standard-aggregator.logging.svc.local
    port 24224
  </server>
  
  <buffer>
    flush_mode interval
    flush_interval 5s
    chunk_limit_size 8m
  </buffer>
</match>
```

🔍 Troubleshooting and Performance Tuning

Diagnosing Common Problems

```yaml
performance_troubleshooting:
  high_memory_usage:
    symptoms:
      - "Steadily growing memory usage"
      - "Frequent GC"
      - "Increasing processing latency"
      - "OOM errors"
    
    diagnosis_steps:
      - "Check buffer configuration"
      - "Analyze GC logs"
      - "Monitor per-plugin memory usage"
      - "Check regular-expression complexity"
    
    solutions:
      buffer_optimization: |
        <buffer>
          # Smaller chunks, shorter queue
          chunk_limit_size 8m
          queue_limit_length 64
          
          # Use file buffering instead of memory
          @type file
          path /var/log/fluentd/buffer/
          
          # Flush more frequently
          flush_mode interval
          flush_interval 3s
        </buffer>
      
      gc_tuning: |
        # Ruby GC tuning (environment variables)
        RUBY_GC_HEAP_INIT_SLOTS: "100000"
        RUBY_GC_HEAP_FREE_SLOTS: "50000"
        RUBY_GC_HEAP_GROWTH_FACTOR: "1.1"
        RUBY_GC_MALLOC_LIMIT: "50000000"
  
  high_cpu_usage:
    common_causes:
      - "Complex regular expressions"
      - "Per-event Ruby evaluation"
      - "Frequent data transformation"
      - "Too many worker processes"
    
    optimization_strategies:
      regex_optimization: |
        # Avoid overly generic regexes
        # Bad: three unanchored .* groups force heavy backtracking
        <parse>
          format /^(?<timestamp>.*) \[(?<level>.*)\] (?<message>.*)$/
        </parse>
        
        # Good: specific character classes, minimal backtracking
        <parse>
          format /^(?<timestamp>\S+ \S+) \[(?<level>\w+)\] (?<message>.+)$/
        </parse>
      
      ruby_script_optimization: |
        # Prefer static values over per-event Ruby evaluation
        <filter **>
          @type record_transformer
          enable_ruby false  # "#{...}" is still expanded once at config load
          <record>
            environment "#{ENV['ENVIRONMENT']}"
            datacenter "#{ENV['DATACENTER']}"
          </record>
        </filter>
      
      worker_tuning: |
        <system>
          workers 2  # fewer workers, less contention
        </system>
  
  data_loss_issues:
    causes:
      - "Buffer overflow"
      - "Network interruption"
      - "Destination-system failure"
      - "Misconfiguration"
    
    prevention_measures:
      reliable_delivery: |
        <match **>
          @type forward
          <server>
            host aggregator.logging.svc.local
            port 24224
          </server>
          
          # At-least-once delivery
          require_ack_response true
          ack_response_timeout 30s
          
          <buffer>
            @type file
            path /var/log/fluentd/buffer/
            
            # Retry configuration
            retry_forever true
            retry_max_interval 300s
            
            # Persist on shutdown
            flush_at_shutdown true
          </buffer>
        </match>
      
      backup_strategy: |
        <match **>
          @type copy
          <store>
            # Primary output
            @type elasticsearch
            host primary-es.logging.svc.local
            port 9200
          </store>
          
          <store>
            # File backup
            @type file
            path /backup/logs/%Y/%m/%d/backup.log
            append true
            
            <buffer time>
              timekey 86400
            </buffer>
          </store>
        </match>
```
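
Buffering and copies cover delivery failures, but events that raise errors inside the emit path can also be captured with Fluentd's built-in @ERROR label instead of being dropped. A minimal sketch:

```ruby
# Events that fail during emit are re-routed to the @ERROR label
<label @ERROR>
  <match **>
    @type file
    path /var/log/fluentd/error_events   # inspect and replay later
    <buffer time>
      timekey 3600
    </buffer>
  </match>
</label>
```
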
```yaml
monitoring_setup:
  metrics_collection:
    prometheus_metrics: |
      # Expose a /metrics endpoint
      <source>
        @type prometheus
        bind 0.0.0.0
        port 24231
        metrics_path /metrics
      </source>
      
      <source>
        @type prometheus_monitor
        <labels>
          hostname ${hostname}
          service fluentd
        </labels>
      </source>
      
      # Output-plugin monitoring
      <source>
        @type prometheus_output_monitor
        <labels>
          hostname ${hostname}
        </labels>
      </source>
    
    custom_metrics: |
      # Custom business metrics
      <filter app.**>
        @type prometheus
        <metric>
          name fluentd_input_status_code_total
          type counter
          desc Total number of records by status code
          # no `key`: the counter increments by 1 per record
          <labels>
            tag ${tag}
            hostname ${hostname}
            status_code ${status_code}
          </labels>
        </metric>
        
        <metric>
          name fluentd_input_response_time
          type histogram
          desc Response-time histogram
          key response_time
          buckets 0.1,0.5,1,5,10
          <labels>
            tag ${tag}
            hostname ${hostname}
          </labels>
        </metric>
      </filter>
  
  health_monitoring:
    liveness_check: |
      # Health-check endpoint
      <source>
        @type http
        port 9880
        bind 0.0.0.0
        
        <transport tls>
          ca_path /etc/ssl/certs/ca.crt
          cert_path /etc/ssl/certs/server.crt
          private_key_path /etc/ssl/private/server.key
        </transport>
      </source>
    
    kubernetes_probes: |
      # Kubernetes probe configuration
      livenessProbe:
        httpGet:
          path: /fluentd.healthcheck?json=%7B%22ping%22%3A+%22pong%22%7D
          port: 9880
        initialDelaySeconds: 30
        periodSeconds: 15
        timeoutSeconds: 5
        failureThreshold: 3
      
      readinessProbe:
        httpGet:
          path: /fluentd.healthcheck?json=%7B%22ping%22%3A+%22pong%22%7D
          port: 9880
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
  
  alerting_rules:
    critical_alerts:
      buffer_queue_full: |
        alert: FluentdBufferQueueFull
        expr: fluentd_status_buffer_queue_length > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fluentd buffer queue is backing up"
          description: "Buffer queue length is {{ $value }} on {{ $labels.hostname }}"
      
      high_retry_rate: |
        alert: FluentdHighRetryRate
        expr: rate(fluentd_status_retry_count[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate detected"
          description: "Retry rate is {{ $value }} per second on {{ $labels.hostname }}"
    
    performance_alerts:
      high_memory_usage: |
        alert: FluentdHighMemoryUsage
        expr: process_resident_memory_bytes{job="fluentd"} / 1024 / 1024 > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Fluentd memory usage is high"
          description: "Memory usage is {{ $value }}MB on {{ $labels.hostname }}"
```

Debugging and Log Analysis

Debugging Tools and Techniques
```yaml
# Debugging configuration examples
debugging_configuration:
  verbose_logging:
    system_config: |
      <system>
        log_level trace
        suppress_repeated_stacktrace false
        emit_error_log_interval 5s
      </system>
    
    plugin_debugging: |
      # Dump every event to Fluentd's own log for inspection
      # (very noisy; only for debugging sessions)
      <filter **>
        @type stdout
      </filter>
  
  event_tracing:
    tag_debugging: |
      # Trace event routing
      <match debug.**>
        @type stdout
        <format>
          @type json
        </format>
      </match>
      
      # Attach debug metadata
      <filter **>
        @type record_transformer
        <record>
          debug_tag ${tag}
          debug_time ${time}
          debug_hostname ${hostname}
        </record>
      </filter>
    
    performance_profiling: |
      # Measure downstream processing latency
      # (third-party fluent-plugin-elapsed-time; wraps the inner <store>)
      <match app.**>
        @type elapsed_time
        tag elapsed.app
        interval 60  # emit statistics every 60 seconds
        <store>
          @type file
          path /var/log/fluentd/performance.log
          <format>
            @type json
          </format>
        </store>
      </match>
  
  buffer_debugging:
    buffer_monitoring: |
      # Attach a dRuby debug session (built-in in_debug_agent)
      <source>
        @type debug_agent
        bind 127.0.0.1
        port 24230
      </source>
    
    chunk_analysis: |
      # Observe chunk behavior
      <buffer>
        @type file
        path /var/log/fluentd/buffer/debug
        
        # Flush a chunk once it is 80% full
        chunk_full_threshold 0.8
        compress gzip
        
        # A single flush thread keeps ordering deterministic while debugging
        flush_thread_count 1
      </buffer>
  
  network_debugging:
    connection_monitoring: |
      # Network connection monitoring
      <match **>
        @type forward
        <server>
          host aggregator.logging.svc.local
          port 24224
        </server>
        
        # Heartbeats and failure detection
        heartbeat_type tcp
        heartbeat_interval 1s
        phi_failure_detector true
        hard_timeout 60s
        
        # Connection keepalive
        keepalive true
        keepalive_timeout 20s
      </match>
    
    ssl_debugging: |
      # TLS connection debugging
      <match secure.**>
        @type forward
        transport tls
        tls_verify_hostname false                        # relax only while debugging
        tls_cert_path /etc/ssl/certs/ca.crt              # CA certificate
        tls_client_cert_path /etc/ssl/certs/client.crt
        tls_client_private_key_path /etc/ssl/private/client.key
        
        <server>
          host secure-aggregator.logging.svc.local
          port 24224
        </server>
        
        <security>
          self_hostname fluentd-client
          shared_key "#{ENV['FLUENTD_SHARED_KEY']}"
        </security>
      </match>

troubleshooting_tools:
  log_analysis:
    grep_patterns: |
      # Common error patterns in Fluentd's own log
      grep -i "error" /var/log/fluentd/fluentd.log
      grep "retry" /var/log/fluentd/fluentd.log
      grep "buffer" /var/log/fluentd/fluentd.log
      grep "connection" /var/log/fluentd/fluentd.log
    
    performance_analysis: |
      # Performance analysis commands
      # GC statistics in the Fluentd log
      grep "GC" /var/log/fluentd/fluentd.log | tail -100
      
      # Memory usage
      ps aux | grep fluentd
      
      # Open file descriptors
      lsof -p "$(pgrep -n fluentd)" | wc -l
      
      # Network connections
      netstat -an | grep :24224
  
  configuration_validation:
    dry_run: |
      # Validate the configuration without starting a full pipeline
      fluentd --dry-run -c /etc/fluentd/fluent.conf
    
    plugin_testing: |
      # Send a test event to a locally running in_forward (port 24224)
      echo '{"message":"test"}' | fluent-cat debug.test
      
      # Replay events from a file
      fluent-cat test.sample < test.json
  
  monitoring_commands:
    runtime_inspection: |
      # Runtime inspection via the monitor_agent HTTP API (port 24220)
      # Plugin status
      curl http://localhost:24220/api/plugins.json
      
      # Loaded configuration
      curl http://localhost:24220/api/config.json
    
    metrics_collection: |
      # Prometheus metrics
      curl http://localhost:24231/metrics
      
      # Health check
      curl http://localhost:9880/fluentd.healthcheck
```

📋 Fluentd Collection Practice: Interview Focus Areas

Configuration Practice

  1. How do you configure adaptive parsing for logs in multiple formats?

    • Using the multi_format parser
    • Format-matching priority (first match wins)
    • Performance considerations
    • Fallback handling strategy
  2. What are the best practices for multi-line log aggregation?

    • multiline format configuration
    • Handling Java exception stack traces
    • Timestamp detection patterns
    • Assessing the performance impact
  3. How do you implement complex log-routing strategies?

    • Using rewrite_tag_filter
    • Conditional matching rules
    • Tag-rewriting techniques
    • Routing performance optimization

Performance Optimization

  1. How do you tune performance for high-volume scenarios?

    • Buffer configuration optimization
    • Worker-process tuning
    • Memory-management strategy
    • GC parameter tuning
  2. How do you diagnose and fix data loss?

    • Reliable-delivery configuration
    • Handling buffer overflow
    • Retry-mechanism design
    • Implementing backup strategies
  3. How do you build fault tolerance for network failures?

    • Heartbeat configuration
    • Connection-pool management
    • Failover strategy
    • Local caching mechanisms

Operations and Management

  1. What monitoring and alerting strategy fits production?

    • Key performance indicators
    • Prometheus integration
    • Alert-rule design
    • Early-warning mechanisms
  2. How do you troubleshoot Fluentd?

    • Log-analysis techniques
    • Locating performance bottlenecks
    • Configuration-validation methods
    • Using debugging tools
  3. What data-security and compliance considerations apply?

    • Sensitive-data masking
    • Transport-encryption configuration
    • Access-control implementation
    • Audit logging

Fluentd log collection practice spans everything from basic configuration to advanced optimization; with sound configuration and tuning, it supports an efficient and reliable log collection infrastructure.
