# Kubernetes Troubleshooting Interview Questions

Kubernetes troubleshooting is a core skill for operations engineers, covering cluster monitoring, problem diagnosis, and performance tuning.

## 🔥 Common Troubleshooting Interview Questions

### 1. Diagnosing Pod Startup Failures

Question: A Pod is stuck in Pending, CrashLoopBackOff, or ImagePullBackOff. How do you troubleshoot it systematically?

Reference answer:
```bash
# 1. Check basic Pod status
kubectl get pods -o wide
kubectl describe pod <pod-name>

# 2. Check Pod events
kubectl get events --field-selector involvedObject.name=<pod-name>

# 3. Check container logs
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> -c <container-name> --previous  # logs from before the last restart

# 4. Debug inside the container
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

# 5. Check resource usage
kubectl top nodes
kubectl top pods

# 6. Network diagnosis
kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash
```

Common diagnostic workflow:
```yaml
# Debug Pod template
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  hostNetwork: true  # use the host network stack for node-level network diagnosis
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]
---
# Troubleshooting checklist
# Pending:
#   1. Does any node have enough free resources?
#   2. Does the nodeSelector / affinity match an existing node?
#   3. Do taints and tolerations match?
#   4. Is the referenced PVC available?
# CrashLoopBackOff:
#   1. Is the application configuration correct?
#   2. Are dependent services reachable?
#   3. Are the resource limits reasonable?
#   4. Are the health probes configured correctly?
# ImagePullBackOff:
#   1. Are the image name and tag correct?
#   2. Is the image registry reachable?
#   3. Are pull credentials configured?
#   4. Is there network connectivity to the registry?
```
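Much of this checklist can be automated. A minimal triage sketch, assuming `jq` is installed on the workstation (`<pod-name>` is a placeholder), that prints the waiting reason the kubelet reports for each container:

```bash
# Print each container's waiting reason and message for a stuck Pod
kubectl get pod <pod-name> -o json | jq -r '
  .status.containerStatuses[]?
  | select(.state.waiting != null)
  | "\(.name): \(.state.waiting.reason) - \(.state.waiting.message // "")"'
```

Reasons such as `ImagePullBackOff` or `CrashLoopBackOff` appear together with the kubelet's message, which usually points directly at the failing checklist item.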
### 2. Troubleshooting Network Connectivity Issues

Question: Pods cannot communicate with each other and a Service is unreachable. How do you troubleshoot the network?

Reference answer:
```bash
#!/bin/bash
# Network connectivity test script

# 1. DNS resolution test
kubectl run dns-test --image=busybox --rm -it -- nslookup kubernetes.default.svc.cluster.local

# 2. Service connectivity test
kubectl run connectivity-test --image=busybox --rm -it -- \
  wget -qO- --timeout=2 http://my-service.my-namespace.svc.cluster.local

# 3. Cross-namespace connectivity test
kubectl run cross-ns-test --image=busybox --rm -it -- \
  nc -zv my-service.other-namespace.svc.cluster.local 80

# 4. External connectivity test
kubectl run external-test --image=busybox --rm -it -- \
  wget -qO- --timeout=2 http://httpbin.org/ip

# 5. Port connectivity test
kubectl run port-test --image=busybox --rm -it -- \
  nc -zv <pod-ip> <port>
```

Network troubleshooting YAML:
```yaml
# Network diagnostics Pod
apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  hostNetwork: false
  containers:
  - name: network-tools
    image: nicolaka/netshoot
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN", "NET_RAW"]
---
# Simple Service for testing
apiVersion: v1
kind: Service
metadata:
  name: test-service
spec:
  selector:
    app: test-app
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:latest
        args:
        - -listen=:8080  # http-echo listens on :5678 by default; match the Service targetPort
        - -text=Hello from $(POD_NAME)
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        ports:
        - containerPort: 8080
---
# NetworkPolicy test
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-netpol
spec:
  podSelector:
    matchLabels:
      app: test-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          access: allowed
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to: []
    ports:
    - protocol: UDP
      port: 53  # DNS
```
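A NetworkPolicy is easiest to validate empirically: run one client Pod that matches the `access: allowed` selector and one that does not. A sketch against the `test-service` defined above (the second command should time out if the policy is enforced by the CNI):

```bash
# Client matching the ingress rule: expected to succeed
kubectl run allowed-client --rm -it --image=busybox --labels="access=allowed" -- \
  wget -qO- --timeout=2 http://test-service
# Client without the label: expected to time out
kubectl run denied-client --rm -it --image=busybox -- \
  wget -qO- --timeout=2 http://test-service
```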
### 3. Diagnosing Storage Issues

Question: A PVC will not bind, or a volume fails to mount. How do you troubleshoot storage problems?

Reference answer:
```bash
# Storage troubleshooting commands
# 1. Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name>

# 2. Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# 3. Check the StorageClass
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

# 4. Check the CSI driver
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system <csi-driver-pod>

# 5. Node-level storage checks
kubectl get nodes -o wide
kubectl describe node <node-name>
```

Storage troubleshooting example:
```yaml
# Storage diagnostics Pod
apiVersion: v1
kind: Pod
metadata:
  name: storage-debug
spec:
  containers:
  - name: debug
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: test-volume
      mountPath: /data
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: debug-pvc
---
# Test PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: debug-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard
---
# Storage performance test
apiVersion: v1
kind: Pod
metadata:
  name: storage-benchmark
spec:
  containers:
  - name: fio
    image: ljishen/fio
    command: ["fio"]
    args:
    - "--name=random-write"
    - "--ioengine=libaio"
    - "--rw=randwrite"
    - "--bs=4k"
    - "--direct=1"
    - "--size=1G"
    - "--numjobs=1"
    - "--runtime=60"
    - "--group_reporting"
    - "--filename=/data/testfile"
    volumeMounts:
    - name: test-volume
      mountPath: /data
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 1
        memory: 1Gi
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: debug-pvc
```
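When the Pod above sticks in `ContainerCreating` or the PVC stays `Pending`, the provisioning and attach steps can be inspected directly. A sketch of the usual follow-up checks:

```bash
# Provisioner errors show up as events on the PVC
kubectl get events --field-selector involvedObject.name=debug-pvc
# For CSI volumes, attach/detach failures are visible on VolumeAttachment objects
kubectl get volumeattachments
# On the affected node, kubelet logs usually name the exact mount error
journalctl -u kubelet | grep -i mount
```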
### 4. Node Failures and Resource Issues

Question: A node is NotReady or is running out of resources. How do you diagnose and handle node-level problems?

Reference answer:
```bash
#!/bin/bash
# Node diagnostics script

# 1. Check node status
kubectl get nodes -o wide
kubectl describe node <node-name>

# 2. Check node resource usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory

# 3. Check kubelet logs (run on the node)
journalctl -u kubelet -f
journalctl -u docker -f

# 4. Check system resources (run on the node)
df -h
free -h
iostat -x 1 5
top

# 5. Check network state (run on the node)
ip route show
iptables -L -n -v
ss -tuln

# 6. Clean up finished Pods and unused resources
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A
docker system prune -f
```

Node maintenance and recovery:
```yaml
# Put a node into maintenance mode
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1
spec:
  # mark the node unschedulable
  unschedulable: true
  taints:
  - key: node.kubernetes.io/maintenance
    value: "true"
    effect: NoSchedule
---
# Node cleanup DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-cleanup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: node-cleanup
  template:
    metadata:
      labels:
        name: node-cleanup
    spec:
      hostPID: true
      hostNetwork: true
      hostIPC: true
      tolerations:
      - operator: Exists
        effect: NoSchedule
      containers:
      - name: cleanup
        image: alpine:latest
        command: ["sh", "-c"]
        args:
        - |
          while true; do
            # remove temp files not accessed for 7 days
            find /host/tmp -type f -atime +7 -delete 2>/dev/null || true
            # remove oversized log files
            find /host/var/log -name "*.log" -size +100M -delete 2>/dev/null || true
            # prune unused container images
            chroot /host docker system prune -f 2>/dev/null || true
            sleep 3600
          done
        securityContext:
          privileged: true
        volumeMounts:
        - name: host-root
          mountPath: /host
          mountPropagation: HostToContainer
        resources:
          requests:
            cpu: 10m
            memory: 50Mi
          limits:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: host-root
        hostPath:
          path: /
```
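In day-to-day operations the Node object is rarely edited by hand; kubectl has first-class helpers for the same maintenance flow. A sketch of the equivalent cordon/drain cycle:

```bash
# Stop new Pods from being scheduled (sets spec.unschedulable)
kubectl cordon worker-node-1
# Evict running Pods so the node can be serviced safely
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
# ... perform maintenance, reboot, etc. ...
# Return the node to service
kubectl uncordon worker-node-1
```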
## 💡 Advanced Troubleshooting Techniques

### 5. Cluster-Level Monitoring and Alerting

Question: How do you build a complete monitoring system for a Kubernetes cluster, and which metrics matter most?

Reference answer:
```yaml
# Prometheus monitoring configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - "/etc/prometheus/rules/*.yml"
    scrape_configs:
      # Kubernetes API server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Kubelet metrics
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      # Pod metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
---
# Key alerting rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  cluster.rules.yml: |
    groups:
    - name: kubernetes-cluster
      rules:
      # Node status alerts
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
          description: "Node {{ $labels.node }} has been not ready for more than 10 minutes"
      # Node resource alerts
      - alert: NodeHighCPUUsage
        expr: (1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on node {{ $labels.instance }}"
          description: "CPU usage on node {{ $labels.instance }} is {{ $value }}%"
      - alert: NodeHighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on node {{ $labels.instance }}"
          description: "Memory usage on node {{ $labels.instance }} is {{ $value }}%"
      # Pod status alerts
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes"
      # API server alerts
      - alert: KubeAPIDown
        expr: up{job="kubernetes-apiservers"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes API server is down"
          description: "Kubernetes API server has been down for more than 5 minutes"
      # etcd alerts
      - alert: EtcdInsufficientMembers
        expr: count(etcd_server_id) < 3
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Etcd has insufficient members"
          description: "Etcd cluster has {{ $value }} members available, fewer than 3"
```
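Alert expressions are worth spot-checking before they page anyone. A sketch that evaluates one of the rules above against a running Prometheus over its HTTP API (the `monitoring` namespace and `prometheus` Service name are assumptions; adjust to your deployment):

```bash
# Forward the Prometheus UI/API to localhost
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
# Evaluate the NodeNotReady expression; an empty result means no node is failing
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kube_node_status_condition{condition="Ready",status="true"} == 0'
```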
### 6. Performance Tuning and Capacity Planning

Question: How do you tune the performance of a Kubernetes cluster, and what are the key metrics for capacity planning?

Reference answer:
```bash
#!/bin/bash
# Performance benchmark script

# 1. Cluster baseline: run node-exporter once to sample node metrics
kubectl run cluster-benchmark --image=quay.io/prometheus/node-exporter \
  --restart=Never --rm -it -- node_exporter

# 2. Network performance test
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: iperf-server
spec:
  containers:
  - name: iperf-server
    image: networkstatic/iperf3
    command: ["iperf3", "-s"]
    ports:
    - containerPort: 5201
---
apiVersion: v1
kind: Pod
metadata:
  name: iperf-client
spec:
  containers:
  - name: iperf-client
    image: networkstatic/iperf3
    command: ["sleep", "3600"]
EOF

# Run the test once both Pods are Running; plain Pod names are not
# DNS-resolvable, so connect to the server Pod's IP
SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')
kubectl exec iperf-client -- iperf3 -c "$SERVER_IP" -t 30

# 3. Storage I/O performance test
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: fio-test
spec:
  containers:
  - name: fio
    image: ljishen/fio
    command: ["fio", "--name=test", "--size=1G", "--rw=randrw", "--bs=4k", "--ioengine=libaio", "--direct=1", "--runtime=60", "--group_reporting", "--filename=/data/testfile"]
    volumeMounts:
    - name: test-volume
      mountPath: /data
  volumes:
  - name: test-volume
    emptyDir: {}
EOF

# 4. API server latency check; the token must be read inside the Pod,
# hence the single-quoted sh -c
kubectl run api-benchmark --image=appropriate/curl --restart=Never --rm -it -- \
  sh -c 'curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    https://kubernetes.default.svc.cluster.local/api/v1/pods'
```
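Rough capacity numbers are also available from kubectl alone, which is often enough to spot an over-committed node before building dashboards. A sketch:

```bash
# Requested vs. allocatable resources on a node (placeholder node name)
kubectl describe node <node-name> | sed -n '/Allocated resources/,/Events/p'
# Largest resource requesters across the cluster
kubectl get pods -A -o custom-columns=\
'NS:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```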
Capacity planning configuration:

```yaml
# Resource monitoring and planning queries
apiVersion: v1
kind: ConfigMap
metadata:
  name: capacity-monitoring
data:
  queries.yml: |
    # Node utilization queries
    node_cpu_utilization: |-
      (1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))) * 100
    node_memory_utilization: |-
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    node_disk_utilization: |-
      (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100
    # Pod resource usage
    pod_cpu_usage: |-
      sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod)
    pod_memory_usage: |-
      sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod)
    # Cluster-wide resource allocation
    cluster_cpu_allocation: |-
      sum(kube_pod_container_resource_requests_cpu_cores) / sum(kube_node_status_allocatable_cpu_cores) * 100
    cluster_memory_allocation: |-
      sum(kube_pod_container_resource_requests_memory_bytes) / sum(kube_node_status_allocatable_memory_bytes) * 100
---
# Autoscaling policy
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: capacity-aware-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Min
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```
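Once applied, the HPA's decisions can be observed directly; `kubectl describe` records the events behind every scale-up and scale-down:

```bash
# Live view of current/desired replicas and utilization
kubectl get hpa capacity-aware-hpa --watch
# Scaling events and the metrics that triggered them
kubectl describe hpa capacity-aware-hpa
```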
These troubleshooting questions cover the most common failure scenarios in Kubernetes operations and demonstrate a systematic approach to diagnosing and resolving problems.