
Kubernetes Troubleshooting Interview Questions

Kubernetes troubleshooting is a core skill for operations engineers, covering cluster monitoring, problem diagnosis, and performance optimization.

🔥 Common Troubleshooting Interview Questions

1. Diagnosing Pod Startup Failures

Question: A Pod is stuck in Pending, CrashLoopBackOff, or ImagePullBackOff. How do you troubleshoot it systematically?

Reference Answer

bash
# 1. Check basic Pod status
kubectl get pods -o wide
kubectl describe pod <pod-name>

# 2. Check Pod events
kubectl get events --field-selector involvedObject.name=<pod-name>

# 3. Check container logs
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> -c <container-name> --previous  # logs from before the last restart

# 4. Debug inside the container
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

# 5. Check resource usage
kubectl top nodes
kubectl top pods

# 6. Network diagnostics
kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash
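
On clusters running a recent Kubernetes version, kubectl debug offers a complementary approach: it attaches an ephemeral debug container to a running Pod (or starts a debug Pod on a node) without modifying the original image. A minimal sketch; the pod, container, and node names are placeholders:

bash
# Attach an ephemeral debug container to a running Pod
kubectl debug -it <pod-name> --image=busybox:1.36 --target=<container-name>

# Start a debug Pod with the node's filesystem mounted under /host
kubectl debug node/<node-name> -it --image=busybox:1.36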

Common Problem Diagnosis Flow

yaml
# Debug Pod template
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]
  hostNetwork: true  # use the host network for node-level network diagnostics

---
# Troubleshooting checklist (see the command sketch after this block)
# Pending checks:
#   1. Does any node have enough free resources?
#   2. Are node selectors / affinity rules correct?
#   3. Do taints and tolerations match?
#   4. Is the PVC available and bound?

# CrashLoopBackOff checks:
#   1. Is the application configuration correct?
#   2. Are the services it depends on reachable?
#   3. Are resource limits reasonable?
#   4. Are the liveness/readiness probes configured correctly?

# ImagePullBackOff checks:
#   1. Are the image name and tag correct?
#   2. Is the image registry reachable?
#   3. Are pull credentials (imagePullSecrets) configured?
#   4. Is there network connectivity to the registry?
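
The Pending and ImagePullBackOff items in the checklist above map directly onto a few commands. A hedged sketch, assuming a Pod named my-pod and a pull secret named regcred (both placeholders):

bash
# Why is the Pod Pending? Scheduler reasons appear in the events
kubectl describe pod my-pod | grep -A10 Events

# Compare node taints and allocated resources against what the Pod requests
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
kubectl describe nodes | grep -A8 "Allocated resources"

# ImagePullBackOff: confirm the pull secret exists and is referenced by the Pod
kubectl get secret regcred
kubectl get pod my-pod -o jsonpath='{.spec.imagePullSecrets}'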

2. Troubleshooting Network Connectivity Issues

Question: Pods cannot reach each other and a Service is unreachable. How do you troubleshoot the network?

Reference Answer

bash
#!/bin/bash
# Network connectivity test script

# 1. DNS resolution test
kubectl run dns-test --image=busybox --rm -it -- nslookup kubernetes.default.svc.cluster.local

# 2. Service connectivity test
kubectl run connectivity-test --image=busybox --rm -it -- \
  wget -qO- --timeout=2 http://my-service.my-namespace.svc.cluster.local

# 3. Cross-namespace connectivity test
kubectl run cross-ns-test --image=busybox --rm -it -- \
  nc -zv my-service.other-namespace.svc.cluster.local 80

# 4. External connectivity test
kubectl run external-test --image=busybox --rm -it -- \
  wget -qO- --timeout=2 http://httpbin.org/ip

# 5. Port connectivity test
kubectl run port-test --image=busybox --rm -it -- \
  nc -zv <pod-ip> <port>
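
If these probes fail, the next step is usually to confirm that the Service actually has endpoints and that cluster DNS and kube-proxy are healthy. A hedged sketch, assuming a Service named my-service in my-namespace and a cluster that runs CoreDNS and kube-proxy in kube-system with their usual labels:

bash
# Does the Service select any Pods?
kubectl get endpoints my-service -n my-namespace
kubectl get pods -n my-namespace --show-labels

# Is cluster DNS healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Is kube-proxy running on every node?
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide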

Network Troubleshooting YAML

yaml
# Network diagnostics tool Pod
apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  containers:
  - name: network-tools
    image: nicolaka/netshoot
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN", "NET_RAW"]
  hostNetwork: false

---
# Simple Service for testing
apiVersion: v1
kind: Service
metadata:
  name: test-service
spec:
  selector:
    app: test-app
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:latest
        args:
        - -listen=:8080  # http-echo listens on :5678 by default; match the 8080 containerPort
        - -text=Hello from $(POD_NAME)
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        ports:
        - containerPort: 8080

---
# NetworkPolicy test (verification commands follow after this block)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-netpol
spec:
  podSelector:
    matchLabels:
      app: test-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          access: allowed
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to: []
    ports:
    - protocol: UDP
      port: 53  # DNS
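
A quick way to confirm the policy behaves as intended is to call the Service from a Pod that carries the access: allowed label and from one that does not. A minimal sketch, assuming the test-service/test-app manifests above are deployed in the current namespace:

bash
# Should succeed: the client carries the allowed label
kubectl run allowed-client --image=busybox --labels=access=allowed --rm -it -- \
  wget -qO- --timeout=2 http://test-service

# Should time out: the client has no matching label
kubectl run denied-client --image=busybox --rm -it -- \
  wget -qO- --timeout=2 http://test-service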

3. Diagnosing Storage Issues

Question: A PVC will not bind and volume mounts fail. How do you troubleshoot storage problems?

Reference Answer

bash
# Storage troubleshooting commands
# 1. Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name>

# 2. Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# 3. Check the StorageClass
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

# 4. Check the CSI driver
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system <csi-driver-pod>

# 5. Check node storage
kubectl get nodes -o wide
kubectl describe node <node-name>

Storage Troubleshooting Examples

yaml
# Storage diagnostic Pod
apiVersion: v1
kind: Pod
metadata:
  name: storage-debug
spec:
  containers:
  - name: debug
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: test-volume
      mountPath: /data
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: debug-pvc

---
# Test PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: debug-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard

---
# Storage performance benchmark
apiVersion: v1
kind: Pod
metadata:
  name: storage-benchmark
spec:
  containers:
  - name: fio
    image: ljishen/fio
    command: ["fio"]
    args:
    - "--name=random-write"
    - "--ioengine=libaio"
    - "--rw=randwrite"
    - "--bs=4k"
    - "--direct=1"
    - "--size=1G"
    - "--numjobs=1"
    - "--runtime=60"
    - "--group_reporting"
    - "--filename=/data/testfile"
    volumeMounts:
    - name: test-volume
      mountPath: /data
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 1
        memory: 1Gi
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: debug-pvc
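
Once these manifests are applied, binding and benchmark results can be checked with a few commands. A minimal sketch, assuming the objects above were created as-is:

bash
# Did the PVC bind, and which PV/StorageClass backs it?
kubectl get pvc debug-pvc
kubectl describe pvc debug-pvc

# Read the fio results after the 60-second run completes
kubectl logs storage-benchmark -f

# Clean up the test resources afterwards
kubectl delete pod storage-debug storage-benchmark
kubectl delete pvc debug-pvc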

4. Node Failures and Resource Issues

Question: A node is NotReady or short on resources. How do you diagnose and handle node-level problems?

Reference Answer

bash
#!/bin/bash
# Node diagnostics script

# 1. Check node status
kubectl get nodes -o wide
kubectl describe node <node-name>

# 2. Check node resource usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory

# 3. Check kubelet logs
# Run on the node itself
journalctl -u kubelet -f
journalctl -u docker -f    # or: journalctl -u containerd -f, depending on the runtime

# 4. Check system resources
# Run on the node itself
df -h
free -h
iostat -x 1 5
top

# 5. Check network state
ip route show
iptables -L -n -v
ss -tuln

# 6. Clean up unused resources
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A
docker system prune -f

Node Maintenance and Recovery

yaml
# Node maintenance mode (the equivalent kubectl workflow is sketched after this block)
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1
spec:
  # Mark the node unschedulable
  unschedulable: true
  taints:
  - key: node.kubernetes.io/maintenance
    value: "true"
    effect: NoSchedule

---
# Node resource cleanup DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-cleanup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: node-cleanup
  template:
    metadata:
      labels:
        name: node-cleanup
    spec:
      hostPID: true
      hostNetwork: true
      hostIPC: true
      tolerations:
      - operator: Exists
        effect: NoSchedule
      containers:
      - name: cleanup
        image: alpine:latest
        command: ["sh", "-c"]
        args:
        - |
          while true; do
            # Clean up temp files older than 7 days
            find /host/tmp -type f -atime +7 -delete 2>/dev/null || true
            # Remove oversized log files
            find /host/var/log -name "*.log" -size +100M -delete 2>/dev/null || true
            # Prune unused container images
            chroot /host docker system prune -f 2>/dev/null || true
            sleep 3600
          done
        securityContext:
          privileged: true
        volumeMounts:
        - name: host-root
          mountPath: /host
          mountPropagation: HostToContainer
        resources:
          requests:
            cpu: 10m
            memory: 50Mi
          limits:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: host-root
        hostPath:
          path: /
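
In practice, the maintenance state shown in the Node manifest above is usually applied with kubectl rather than by editing the Node object directly. A minimal sketch of the drain-and-restore workflow; worker-node-1 is a placeholder:

bash
# Stop scheduling new Pods onto the node
kubectl cordon worker-node-1

# Evict existing Pods (DaemonSet Pods are skipped, emptyDir data is discarded)
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data --timeout=300s

# ... perform maintenance, then return the node to service
kubectl uncordon worker-node-1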

💡 Advanced Troubleshooting Techniques

5. Cluster-Level Monitoring and Alerting

Question: How do you build a complete monitoring stack for a Kubernetes cluster, and which metrics matter most?

Reference Answer

yaml
# Prometheus monitoring configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
    - "/etc/prometheus/rules/*.yml"

    scrape_configs:
    # Kubernetes API Server
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Kubelet metrics
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Pod metrics
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

---
# Key alerting rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  cluster.rules.yml: |
    groups:
    - name: kubernetes-cluster
      rules:
      # Node status alerts
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
          description: "Node {{ $labels.node }} has been not ready for more than 10 minutes"

      # Node resource alerts
      - alert: NodeHighCPUUsage
        expr: (1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on node {{ $labels.instance }}"
          description: "CPU usage on node {{ $labels.instance }} is {{ $value }}%"

      - alert: NodeHighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on node {{ $labels.instance }}"
          description: "Memory usage on node {{ $labels.instance }} is {{ $value }}%"

      # Pod status alerts
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes"

      # API server alerts
      - alert: KubeAPIDown
        expr: up{job="kubernetes-apiservers"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes API server is down"
          description: "Kubernetes API server has been down for more than 5 minutes"

      # etcd alerts
      - alert: EtcdInsufficientMembers
        expr: count(etcd_server_id) < 3
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Etcd has insufficient members"
          description: "Etcd cluster has {{ $value }} members available, less than 3"
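
Before loading rules like these into Prometheus, it is worth validating them offline. A hedged sketch using promtool (shipped with Prometheus); it assumes the two ConfigMaps above exist in the current namespace:

bash
# Pull the rule file out of the ConfigMap and check its syntax
kubectl get configmap prometheus-rules -o jsonpath='{.data.cluster\.rules\.yml}' > cluster.rules.yml
promtool check rules cluster.rules.yml

# Validate the scrape configuration the same way
kubectl get configmap prometheus-config -o jsonpath='{.data.prometheus\.yml}' > prometheus.yml
promtool check config prometheus.yml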

6. Performance Tuning and Capacity Planning

Question: How do you tune the performance of a Kubernetes cluster, and what are the key metrics for capacity planning?

Reference Answer

bash
#!/bin/bash
# Performance benchmark scripts

# 1. Cluster baseline: run node-exporter as a one-off Pod to expose node metrics
kubectl run cluster-benchmark --image=quay.io/prometheus/node-exporter \
  --restart=Never --rm -it

# 2. Network performance test (iperf3 server/client Pods)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: iperf-server
spec:
  containers:
  - name: iperf-server
    image: networkstatic/iperf3
    command: ["iperf3", "-s"]
    ports:
    - containerPort: 5201
---
apiVersion: v1
kind: Pod
metadata:
  name: iperf-client
spec:
  containers:
  - name: iperf-client
    image: networkstatic/iperf3
    command: ["sleep", "3600"]
EOF

# Run the test once both Pods are Running (use the server Pod's IP; the Pod name is not DNS-resolvable)
kubectl exec iperf-client -- iperf3 -c $(kubectl get pod iperf-server -o jsonpath='{.status.podIP}') -t 30

# 3. Storage I/O performance test
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: fio-test
spec:
  containers:
  - name: fio
    image: ljishen/fio
    command: ["fio", "--name=test", "--size=1G", "--rw=randrw", "--bs=4k", "--ioengine=libaio", "--direct=1", "--runtime=60", "--group_reporting"]
    volumeMounts:
    - name: test-volume
      mountPath: /data
  volumes:
  - name: test-volume
    emptyDir: {}
EOF

# 4. API server latency test (--command plus single quotes so the token is read inside the Pod)
kubectl run api-benchmark --image=appropriate/curl --restart=Never --rm -it --command -- \
  sh -c 'curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    https://kubernetes.default.svc.cluster.local/api/v1/pods'
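
After the benchmark Pods finish, results are read from their logs and the test objects should be removed. A minimal sketch, assuming the manifests above were applied unchanged:

bash
# iperf3 throughput is printed by the exec command above; fio results come from the Pod logs
kubectl logs fio-test

# Clean up all benchmark Pods
kubectl delete pod iperf-server iperf-client fio-test --ignore-not-found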

Capacity Planning Configuration

yaml
# Resource monitoring and planning
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-monitoring
data:
  queries.yml: |
    # Node resource utilization queries
    node_cpu_utilization: |-
      (1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))) * 100
    
    node_memory_utilization: |-
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
    node_disk_utilization: |-
      (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100
    
    # Pod resource usage
    pod_cpu_usage: |-
      sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod)
    
    pod_memory_usage: |-
      sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod)
    
    # Cluster-wide resource allocation (requests vs. allocatable)
    cluster_cpu_allocation: |-
      sum(kube_pod_container_resource_requests_cpu_cores) / sum(kube_node_status_allocatable_cpu_cores) * 100
    
    cluster_memory_allocation: |-
      sum(kube_pod_container_resource_requests_memory_bytes) / sum(kube_node_status_allocatable_memory_bytes) * 100

---
# Autoscaling policy
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: capacity-aware-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Min
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
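
Once the HPA is applied, its decisions can be verified with kubectl. The target Deployment web-app is assumed to exist and to declare CPU and memory requests, which Utilization targets require. A minimal sketch:

bash
# Current metrics, targets, and replica counts
kubectl get hpa capacity-aware-hpa
kubectl describe hpa capacity-aware-hpa

# Scaling decisions are also recorded as events
kubectl get events --field-selector involvedObject.name=capacity-aware-hpa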

These troubleshooting questions cover the most common problem scenarios in day-to-day Kubernetes operations and demonstrate a systematic approach to diagnosing and resolving issues.
