Checklist: Debugging Kubernetes in Production

Debugging Kubernetes in production is hard: the system is distributed, there are many layers of abstraction, and incidents come with time pressure. This checklist gives you a systematic sequence to follow so you can find the root cause and resolve the issue faster.

Before You Debug

Required tools

kubectl pointed at the correct context
kubectx / kubens
stern or kubectl logs -f
k9s
helm (if you use Helm)
Access to monitoring (Prometheus/Grafana)
Cloud console access (if needed)

Quick Health Check

bash

kubectl get nodes
kubectl get pods --all-namespaces
kubectl top nodes
kubectl top pods --all-namespaces
kubectl get events --sort-by='.lastTimestamp' --all-namespaces

1. Pod Status Issues

1.1 Pod Pending / ContainerCreating

bash

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

Common causes:

Image pull failures
Not enough CPU/memory on the nodes
PVC not yet bound
Node selector/affinity does not match any node
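
The causes listed above can usually be told apart from the event message alone. The following is a minimal sketch, not a definitive tool: `classify_pending_cause` is a hypothetical helper name, and the message patterns are assumptions based on commonly seen scheduler and kubelet messages, which may differ between Kubernetes versions.

```shell
# Hypothetical helper: map one event message to one of the causes listed above.
# The patterns are assumptions; adjust them to the messages your cluster emits.
classify_pending_cause() {
  case "$1" in
    *ErrImagePull*|*ImagePullBackOff*|*"Failed to pull image"*)
      echo "image-pull" ;;
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "insufficient-resources" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "pvc-unbound" ;;
    *"didn't match"*"affinity"*|*"didn't match"*"selector"*)
      echo "selector-affinity-mismatch" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example usage (replace the placeholders before running):
# kubectl get events -n <namespace> \
#   --field-selector involvedObject.name=<pod-name> \
#   -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#   while IFS= read -r msg; do classify_pending_cause "$msg"; done
```

An `unknown` result only means that none of the assumed patterns matched; read the full `kubectl describe pod` output in that case.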

1.2 Pod CrashLoopBackOff

bash

kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace> --all-containers=true

Also check:

bash

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
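
The exit code returned by the first command above is easier to act on once it is decoded. The sketch below assumes the common convention that a code above 128 means "128 plus the signal number"; `describe_exit_code` is a hypothetical helper name, and the wording of each hint is only a starting point for investigation.

```shell
# Hypothetical helper: turn a container exit code into a short hint.
# Assumes the usual shell/runtime convention: code > 128 means 128 + signal number.
describe_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited normally" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (128+9), often OOMKilled or a forced kill" ;;
    139) echo "SIGSEGV (128+11), segmentation fault" ;;
    143) echo "SIGTERM (128+15), graceful termination requested" ;;
    *)
      # Non-numeric input falls through to the generic branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application error (exit code $code)"
      fi ;;
  esac
}

# Example usage (replace the placeholders before running):
# describe_exit_code "$(kubectl get pod <pod-name> -n <namespace> \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

For 137, compare the result with the `reason` field from the second command: `OOMKilled` there points at the memory limit rather than an external kill.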

1.3 Pod Running but Not Ready

bash

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Readiness"
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Liveness"
kubectl exec -it <pod-name> -n <namespace> -- curl http://localhost:<port>/health

WARNING

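Make sure the readiness probe reflects whether the pod can accept traffic, and the liveness probe reflects whether the application is still alive. A failing readiness probe only removes the pod from the Service endpoints; a failing liveness probe restarts the container.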

2. Service and Networking Issues

2.1 Service Unreachable

bash

kubectl get endpoints <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
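
Comparing the selector and the pod labels by eye is error-prone. The sketch below turns the selector into a label query, so the pods the Service actually selects can be listed directly. It assumes that `jsonpath='{.spec.selector}'` prints the selector as a JSON map (recent kubectl versions do; older ones may print a different form), and `selector_to_label_query` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a JSON selector map such as
# {"app":"web","tier":"frontend"} into the label query app=web,tier=frontend.
# Label keys and values cannot contain ':' or '"', so plain character
# substitution is enough for this input.
selector_to_label_query() {
  printf '%s' "$1" | tr -d '{}" ' | tr ':' '='
}

# Example usage (replace the placeholders before running):
# selector=$(kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}')
# kubectl get pods -n <namespace> -l "$(selector_to_label_query "$selector")"
```

If the last command returns no pods, the selector does not match any pod labels, which also explains an empty endpoints list.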

2.2 DNS Resolution Failures

bash

kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

2.3 Ingress Not Working

bash

kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>
kubectl get pods -n <ingress-namespace> -l app=<ingress-controller>
kubectl logs -n <ingress-namespace> -l app=<ingress-controller>

3. Resource and Performance Issues

3.1 High CPU/Memory Usage

bash

kubectl top pods -n <namespace>
kubectl top nodes
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Limits\|Requests"

Remediation:

Adjust requests/limits to match actual usage
Scale horizontally
Optimize the application code
Check for memory leaks

3.2 Node Problems

bash

kubectl get nodes
kubectl describe node <node-name>
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

Conditions to watch for:

MemoryPressure
DiskPressure
PIDPressure
NetworkUnavailable
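
On a large cluster, reading `kubectl describe node` one node at a time is slow. The sketch below filters a compact per-node condition list down to nodes that report one of the four conditions above. `flag_unhealthy_nodes` is a hypothetical helper name, and the jsonpath expression in the usage comment is an assumption about how you produce the input.

```shell
# Hypothetical helper: reads "<node>\t<conditions whose status is True>" lines
# on stdin and prints only nodes reporting one of the four conditions above.
flag_unhealthy_nodes() {
  awk -F '\t' '
    $2 ~ /(MemoryPressure|DiskPressure|PIDPressure|NetworkUnavailable)/ {
      sub(/ +$/, "", $2)   # drop the trailing space left by the jsonpath template
      print $1 ": " $2
    }'
}

# Example usage:
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.status=="True")]}{.type}{" "}{end}{"\n"}{end}' |
#   flag_unhealthy_nodes
```

A healthy node normally reports only `Ready` as True, so it is filtered out.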

4. ConfigMap / Secret / Env Issues

bash

kubectl get configmap -n <namespace>
kubectl get configmap <configmap-name> -n <namespace> -o yaml
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Mounts"
kubectl exec <pod-name> -n <namespace> -- env

Check the mount path, key names, namespace, and access permissions.

5. Deployment and Rollout Issues

bash

kubectl get deployment <deployment-name> -n <namespace>
kubectl rollout status deployment/<deployment-name> -n <namespace>
kubectl rollout history deployment/<deployment-name> -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>

Roll back when needed:

bash

kubectl rollout undo deployment/<deployment-name> -n <namespace>
kubectl rollout undo deployment/<deployment-name> -n <namespace> --to-revision=2

6. Storage and Volume Issues

bash

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get storageclass
kubectl get pv

Common errors:

StorageClass does not exist
Quota exhausted
PVC selector does not match any PV
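
All of the errors above leave the claim in a state other than `Bound`, so listing unbound claims is a quick first filter. This is a minimal sketch: `list_unbound_pvcs` is a hypothetical helper name, and it assumes the default column order of `kubectl get pvc --all-namespaces --no-headers` (NAMESPACE, NAME, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pvc --all-namespaces --no-headers`
# output on stdin and prints every claim whose STATUS column is not "Bound".
list_unbound_pvcs() {
  awk 'NF >= 3 && $3 != "Bound" { print $1 "/" $2 " is " $3 }'
}

# Example usage:
# kubectl get pvc --all-namespaces --no-headers | list_unbound_pvcs
```

Follow up on each reported claim with `kubectl describe pvc` to see which of the errors applies.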

7. Security and RBAC

bash

kubectl get sa -n <namespace>
kubectl describe sa <service-account-name> -n <namespace>
kubectl get roles -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl get clusterroles
kubectl get clusterrolebindings
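
Listing roles and bindings shows what exists, but not whether a given ServiceAccount is actually allowed to perform an action. One way to check is `kubectl auth can-i` with impersonation. The sketch below only builds the user name that the API server assigns to a ServiceAccount; `sa_user` is a hypothetical helper name, and the usage assumes your own account is allowed to impersonate.

```shell
# Hypothetical helper: build the API user name of a ServiceAccount,
# which has the form system:serviceaccount:<namespace>:<name>.
sa_user() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Example usage (replace the placeholders before running):
# kubectl auth can-i get pods -n <namespace> \
#   --as="$(sa_user <namespace> <service-account-name>)"
# kubectl auth can-i --list -n <namespace> \
#   --as="$(sa_user <namespace> <service-account-name>)"
```

The first command prints `yes` or `no` for a single verb and resource; the second lists what the ServiceAccount is allowed to do in that namespace.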

When you suspect a network policy is blocking traffic:

bash

kubectl get networkpolicies -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>

8. Logging and Monitoring

bash

kubectl logs -l app=<app-label> -n <namespace> --all-containers=true
kubectl logs <pod-name> -n <namespace> --previous --all-containers=true
kubectl logs <pod-name> -n <namespace> --timestamps
kubectl logs <pod-name> -n <namespace> --since=1h

Debug with an ephemeral container:

bash

kubectl debug <pod-name> -n <namespace> -it --image=busybox --target=<container-name>

9. Production Debug Workflow (Suggested)

bash

# 1. Scope the incident
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# 2. Inspect the pod in detail
kubectl describe pod <pod-name> -n <namespace>

# 3. Read the current logs, then the logs of the previous container instance
kubectl logs <pod-name> -n <namespace> --all-containers=true
kubectl logs <pod-name> -n <namespace> --all-containers=true --previous

# 4. Check resource usage
kubectl top pod <pod-name> -n <namespace>

# 5. Check the service and its endpoints
kubectl get svc -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

# 6. Check configuration
kubectl get configmap,secret -n <namespace>

# 7. Test connectivity/health from inside the cluster
kubectl exec <pod-name> -n <namespace> -- curl http://localhost:<port>/health

Useful One-liners

bash

kubectl get pods -o wide --all-namespaces
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
kubectl top pods --all-namespaces --sort-by=memory
kubectl top pods --all-namespaces --sort-by=cpu | head -20

Best Practices

TIP

Always check events first; they often point to the root cause.
Use --previous to get the logs of a crashed container.
describe first, exec second.
Check requests/limits when an issue is hard to reproduce.
Update the runbook after every incident.

WARNING

Avoid destructive operations in production (such as --force) until you have assessed the risk to data and SLAs.

Conclusion

Effective Kubernetes debugging needs a consistent process: start from the overview, drill down layer by layer (pod/container/log/network/config), and only then intervene. A clear checklist helps reduce MTTR and keeps you from skipping important steps during a production incident.
