Checklist: Debug Kubernetes in Production Environment
Debugging Kubernetes in production can be challenging, especially when dealing with distributed systems, multiple layers of abstraction, and time-sensitive issues. This comprehensive checklist provides a systematic approach to diagnose and resolve common Kubernetes production problems.
Pre-Debugging Preparation
Before diving into debugging, ensure you have the necessary tools and access:
Essential Tools
- kubectl (configured with proper context)
- kubectx/kubens (for context switching)
- stern or kubectl logs -f (for log streaming)
- k9s (terminal UI for Kubernetes)
- helm (if using Helm charts)
- Access to cluster monitoring tools (Prometheus, Grafana)
- Access to cloud provider console (if applicable)
Quick Health Check
Start with a high-level overview of your cluster:
# Check cluster nodes
kubectl get nodes
# Check all pods across all namespaces
kubectl get pods --all-namespaces
# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Check cluster events
kubectl get events --sort-by='.lastTimestamp' --all-namespaces
Debugging Checklist
1. Pod Status Issues
1.1 Pod Not Starting (Pending/ContainerCreating)
# Check pod status
kubectl get pods -n <namespace>
# Describe pod for detailed information
kubectl describe pod <pod-name> -n <namespace>
# Check pod events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
Common Causes:
- Image pull errors: Check image name, tag, and registry access
- Resource constraints: Insufficient CPU/memory on nodes
- Volume mount issues: PVC not bound, wrong storage class
- Node selector/affinity: No matching nodes available
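Several of these causes are visible directly in the pod spec. The sketch below highlights the fields most often behind a Pending pod; all names and values are illustrative, not taken from a real workload:

```yaml
# Illustrative pod spec fields that commonly cause Pending pods
# (names, image, and values are examples only).
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  nodeSelector:
    disktype: ssd          # Pending if no node carries this label
  containers:
    - name: app
      image: registry.example.com/app:1.2.3   # pull errors if tag/registry is wrong
      resources:
        requests:
          cpu: "2"         # Pending if no node has 2 free CPUs
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi
```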
Quick Fixes:
# Check image pull secrets
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'
# Check resource requests/limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
# Check persistent volume claims
kubectl get pvc -n <namespace>
1.2 Pod CrashLoopBackOff
# Get pod logs (current instance)
kubectl logs <pod-name> -n <namespace>
# Get logs from previous crashed container
kubectl logs <pod-name> -n <namespace> --previous
# Get all container logs in pod
kubectl logs <pod-name> -n <namespace> --all-containers=true
# Stream logs in real-time
kubectl logs -f <pod-name> -n <namespace>
Common Causes:
- Application errors (check logs)
- Missing environment variables
- Incorrect configuration files
- Health check failures
- Out of memory (OOMKilled)
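The exit code is often the fastest clue: values above 128 mean the container was killed by a signal (code - 128), so 137 is SIGKILL (typical of OOMKilled) and 143 is SIGTERM. A small helper, as a sketch, to decode the code you retrieve:

```shell
# Decode a container exit code: values above 128 mean the process
# was killed by signal (code - 128); 137 = SIGKILL (typical of OOMKilled).
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

decode_exit 137   # killed by signal 9
decode_exit 1     # exited with status 1
```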
Debugging Steps:
# Check exit code
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Check if OOMKilled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Limits"
1.3 Pod Running but Not Ready
# Check readiness probe status
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Readiness"
# Check liveness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Liveness"
# Test endpoint manually
kubectl exec -it <pod-name> -n <namespace> -- curl http://localhost:<port>/health
Probe Configuration
Ensure your readiness and liveness probes are correctly configured:
- Readiness probe should check if app can serve traffic
- Liveness probe should check if app is alive
- Initial delay should account for startup time
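A reasonable probe block might look like the following sketch; the endpoint paths, port, and timings are assumptions and must match your application:

```yaml
# Illustrative probe configuration; paths, port, and timings
# are assumptions and must match your application.
containers:
  - name: app
    image: registry.example.com/app:1.2.3
    ports:
      - containerPort: 8080
    readinessProbe:            # can the app serve traffic right now?
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:             # is the process alive at all?
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # allow for slow startup
      periodSeconds: 10
```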
2. Service and Networking Issues
2.1 Service Not Accessible
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# Describe service
kubectl describe svc <service-name> -n <namespace>
# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
Common Issues:
- Service selector doesn't match pod labels
- No pods ready (readiness probe failing)
- Wrong port configuration
- Network policies blocking traffic
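The selector/label relationship checked above looks like this in the manifests: the Service's spec.selector must match the labels on the pod template. All names and ports below are illustrative:

```yaml
# The Service selector must match the pod labels;
# names, image, and ports here are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: example-app
spec:
  selector:
    app: example-app          # must match the pod labels below
  ports:
    - port: 80                # port clients connect to
      targetPort: 8080        # containerPort the app listens on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app      # what the Service selector matches
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.2.3
          ports:
            - containerPort: 8080
```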
2.2 DNS Resolution Problems
# Test DNS from within a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
2.3 Ingress Not Working
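As a reference point, a minimal Ingress routing a host to a backend Service looks roughly like this; the host, class name, and service name are illustrative:

```yaml
# Minimal Ingress sketch; host, class, and service names are illustrative.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
spec:
  ingressClassName: nginx     # must match a deployed ingress controller
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app   # Service must exist and have endpoints
                port:
                  number: 80
```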
# Check ingress status
kubectl get ingress -n <namespace>
# Describe ingress
kubectl describe ingress <ingress-name> -n <namespace>
# Check ingress controller pods
kubectl get pods -n <ingress-namespace> -l app=<ingress-controller>
# Check ingress controller logs
kubectl logs -n <ingress-namespace> -l app=<ingress-controller>
3. Resource and Performance Issues
3.1 High CPU/Memory Usage
# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
# Get detailed resource metrics
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Limits\|Requests"
# Check if pods are being throttled
kubectl describe pod <pod-name> -n <namespace> | grep -i throttle
Solutions:
- Adjust resource requests and limits
- Scale horizontally (add more replicas)
- Optimize application code
- Check for memory leaks
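Horizontal scaling can be automated rather than done by hand. A minimal HorizontalPodAutoscaler sketch, with an assumed deployment name and threshold:

```yaml
# Minimal HPA sketch; deployment name and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of requested CPU
```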
3.2 Node Issues
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check node conditions
kubectl get nodes -o wide
# Check node resources
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
# Check for node pressure
kubectl get nodes -o json | jq '.items[].status.conditions'
Common Node Problems:
- MemoryPressure: Node running out of memory
- DiskPressure: Node running out of disk space
- PIDPressure: Too many processes
- NetworkUnavailable: Network issues
4. Configuration and Secrets Issues
4.1 ConfigMap/Secret Not Mounted
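If config files are missing inside the container, compare the pod spec against a mount like the following; names and paths are illustrative:

```yaml
# Illustrative ConfigMap volume mount; names and paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.2.3
      volumeMounts:
        - name: app-config
          mountPath: /etc/app     # files appear here, one per key
          readOnly: true
  volumes:
    - name: app-config
      configMap:
        name: example-config      # must exist in the same namespace
```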
# Check if ConfigMap exists
kubectl get configmap -n <namespace>
# Check ConfigMap contents
kubectl get configmap <configmap-name> -n <namespace> -o yaml
# Check if mounted in pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Mounts"
# Verify mounted files
kubectl exec <pod-name> -n <namespace> -- ls -la /path/to/mount
kubectl exec <pod-name> -n <namespace> -- cat /path/to/config
4.2 Environment Variables Issues
# Check environment variables in pod
kubectl exec <pod-name> -n <namespace> -- env
# Check env from ConfigMap/Secret
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Environment"
5. Deployment and Rollout Issues
5.1 Deployment Not Updating
# Check deployment status
kubectl get deployment <deployment-name> -n <namespace>
# Check rollout status
kubectl rollout status deployment/<deployment-name> -n <namespace>
# Check deployment history
kubectl rollout history deployment/<deployment-name> -n <namespace>
# Describe deployment
kubectl describe deployment <deployment-name> -n <namespace>
5.2 Rollout Stuck
# Check replica sets
kubectl get rs -n <namespace>
# Check rollout status
kubectl rollout status deployment/<deployment-name> -n <namespace>
# Check for deployment conditions
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.status.conditions}'
Common Causes:
- Insufficient resources for new pods
- Image pull errors
- Health check failures
- Quota limits reached
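A Deployment only reports a stuck rollout as failed once its progress deadline expires; setting it explicitly makes kubectl rollout status fail fast instead of hanging. The values below are illustrative:

```yaml
# Fragment: report a stuck rollout as failed after 5 minutes
# instead of the 600-second default; values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  progressDeadlineSeconds: 300
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most one extra pod during the rollout
      maxUnavailable: 0      # never drop below the desired replica count
  # (replicas, selector, and pod template omitted for brevity)
```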
Rollback if needed:
# Rollback to previous revision
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# Rollback to specific revision
kubectl rollout undo deployment/<deployment-name> -n <namespace> --to-revision=2
6. Storage and Volume Issues
6.1 PVC Not Bound
# Check PVC status
kubectl get pvc -n <namespace>
# Describe PVC
kubectl describe pvc <pvc-name> -n <namespace>
# Check storage classes
kubectl get storageclass
# Check PV
kubectl get pv
kubectl describe pv <pv-name>
Common Issues:
- No available storage in storage class
- Storage class doesn't exist
- Insufficient quota
- Node selector mismatch
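A PVC's storageClassName must name an existing StorageClass (compare against kubectl get storageclass). A minimal claim, with an assumed class name and size:

```yaml
# Minimal PVC sketch; class name and size are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # must match an existing StorageClass
  resources:
    requests:
      storage: 10Gi
```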
6.2 Volume Mount Errors
# Check pod volume mounts
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Volumes"
# Check if volume is mounted
kubectl exec <pod-name> -n <namespace> -- df -h
kubectl exec <pod-name> -n <namespace> -- mount | grep <volume-path>
7. Security and RBAC Issues
7.1 Permission Denied Errors
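Permission errors usually mean the pod's service account lacks a Role and RoleBinding like the sketch below; all names, namespaces, and verbs are illustrative:

```yaml
# Illustrative Role + RoleBinding granting a service account
# read access to pods; all names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: example-ns
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: example-ns
subjects:
  - kind: ServiceAccount
    name: example-sa
    namespace: example-ns
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

You can also test a specific permission directly with kubectl auth can-i list pods --as=system:serviceaccount:<namespace>:<sa-name>.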
# Check service account
kubectl get sa -n <namespace>
kubectl describe sa <service-account-name> -n <namespace>
# Check roles and role bindings
kubectl get roles -n <namespace>
kubectl get rolebindings -n <namespace>
# Check cluster roles
kubectl get clusterroles
kubectl get clusterrolebindings
7.2 Network Policy Blocking Traffic
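Once any NetworkPolicy selects a pod, all traffic not explicitly allowed is dropped. A policy admitting ingress only from labelled pods looks roughly like this; the labels and port are illustrative:

```yaml
# Illustrative policy: only pods labelled app=frontend may reach
# app=backend pods on port 8080; everything else is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```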
# Check network policies
kubectl get networkpolicies -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>
# Test connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- wget -O- <service-url>
8. Logging and Monitoring
8.1 Collecting Comprehensive Logs
# All pods in namespace
kubectl logs -l app=<app-label> -n <namespace> --all-containers=true
# Previous crashed containers
kubectl logs <pod-name> -n <namespace> --previous --all-containers=true
# Logs with timestamps
kubectl logs <pod-name> -n <namespace> --timestamps
# Logs from specific time
kubectl logs <pod-name> -n <namespace> --since=1h
kubectl logs <pod-name> -n <namespace> --since-time="2024-01-01T00:00:00Z"
# Using stern for multi-pod log streaming
stern <app-label> -n <namespace>
8.2 Debugging with Ephemeral Containers
# Add debug container to running pod (Kubernetes 1.23+)
kubectl debug <pod-name> -n <namespace> -it --image=busybox --target=<container-name>
# Create debug pod with same node
kubectl debug node/<node-name> -it --image=busybox
9. Advanced Debugging Techniques
9.1 Exec into Containers
# Exec into pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Exec into specific container
kubectl exec -it <pod-name> -n <namespace> -c <container-name> -- /bin/sh
# Run commands
kubectl exec <pod-name> -n <namespace> -- ps aux
kubectl exec <pod-name> -n <namespace> -- netstat -tulpn
kubectl exec <pod-name> -n <namespace> -- env
9.2 Port Forwarding for Local Testing
# Forward pod port
kubectl port-forward <pod-name> -n <namespace> 8080:80
# Forward service port
kubectl port-forward svc/<service-name> -n <namespace> 8080:80
# Forward deployment
kubectl port-forward deployment/<deployment-name> -n <namespace> 8080:80
9.3 API Server Debugging
# Check API server logs (if you have access)
kubectl logs -n kube-system -l component=kube-apiserver
# Check API server events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
10. Production Debugging Workflow
Follow this systematic workflow when debugging production issues:
# Step 1: Identify the problem
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Step 2: Get detailed pod information
kubectl describe pod <pod-name> -n <namespace>
# Step 3: Check logs
kubectl logs <pod-name> -n <namespace> --all-containers=true --previous
# Step 4: Check resources
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Limits"
# Step 5: Check networking
kubectl get svc -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
# Step 6: Check configuration
kubectl get configmap,secret -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Environment"
# Step 7: Test connectivity
kubectl exec <pod-name> -n <namespace> -- curl http://localhost:<port>/health
Quick Reference Commands
Essential kubectl Commands
# Get resources
kubectl get <resource> -n <namespace> -o wide
kubectl get <resource> -n <namespace> -o yaml
kubectl get <resource> -n <namespace> -o json
# Describe resources
kubectl describe <resource> <name> -n <namespace>
# Watch resources
kubectl get pods -n <namespace> -w
# Delete resources
kubectl delete pod <pod-name> -n <namespace>
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
# Apply changes
kubectl apply -f <manifest.yaml>
kubectl patch deployment <name> -n <namespace> -p '{"spec":{"replicas":3}}'
# Scale resources
kubectl scale deployment <name> -n <namespace> --replicas=3
Useful One-Liners
# Get all pods with their node assignments
kubectl get pods -o wide --all-namespaces
# Get pods sorted by restart count
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount' -n <namespace>
# Get all events sorted by time
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
# Get resource usage for all pods
kubectl top pods --all-namespaces --sort-by=memory
# Get all images used in cluster
kubectl get pods --all-namespaces -o jsonpath="{..image}" | tr -s '[[:space:]]' '\n' | sort | uniq
# Find pods using most CPU
kubectl top pods --all-namespaces --sort-by=cpu | head -20
Common Production Scenarios
Scenario 1: Application Not Responding
# 1. Check pod status
kubectl get pods -n <namespace>
# 2. Check if service has endpoints
kubectl get endpoints <service-name> -n <namespace>
# 3. Check pod logs
kubectl logs <pod-name> -n <namespace> --tail=100
# 4. Test from inside cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://<service-name>.<namespace>.svc.cluster.local
# 5. Check ingress/load balancer
kubectl get ingress -n <namespace>
Scenario 2: High Memory Usage
# 1. Identify high memory pods
kubectl top pods --all-namespaces --sort-by=memory
# 2. Check memory limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits"
# 3. Check if OOMKilled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# 4. Check node memory
kubectl top nodes
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
Scenario 3: Slow Application Performance
# 1. Check resource throttling
kubectl describe pod <pod-name> -n <namespace> | grep -i throttle
# 2. Check CPU usage
kubectl top pod <pod-name> -n <namespace>
# 3. Check node resources
kubectl top nodes
# 4. Check for pending pods
kubectl get pods -n <namespace> | grep Pending
# 5. Check HPA status
kubectl get hpa -n <namespace>
Best Practices
Production Debugging Tips
- Always check events first - They often contain the root cause
- Use the --previous flag - Get logs from crashed containers
- Describe before exec - Understand the pod state first
- Check resource limits - Many issues stem from resource constraints
- Verify labels and selectors - A common source of networking issues
- Test connectivity - Use debug pods to test network
- Monitor continuously - Set up proper monitoring and alerting
- Document findings - Keep a runbook for common issues
Production Considerations
- Be careful with kubectl delete --force - it can cause data loss
- Avoid debugging directly in production - use staging when possible
- Set up proper logging and monitoring before issues occur
- Use read-only debugging tools when possible
- Document your debugging process for future reference
Tools and Extensions
Recommended Tools
# k9s - Terminal UI
# Install: https://k9scli.io/
# stern - Multi-pod log tailing
# Install: https://github.com/stern/stern
# kubectx/kubens - Context switching
# Install: https://github.com/ahmetb/kubectx
# dive - Image analysis
# Install: https://github.com/wagoodman/dive
# kubectl-debug - Debug plugin
# Install: https://github.com/aylei/kubectl-debug
Useful kubectl Plugins
# Install krew (kubectl plugin manager)
# https://krew.sigs.k8s.io/
# Useful plugins:
kubectl krew install access-matrix
kubectl krew install resource-capacity
kubectl krew install node-shell
kubectl krew install pod-logs
Conclusion
Debugging Kubernetes in production requires a systematic approach and deep understanding of the cluster components. This checklist provides a comprehensive guide to diagnose and resolve common issues.
Remember:
- Start with high-level overview (nodes, pods, events)
- Drill down systematically (pod → container → logs)
- Use the right tools for the job
- Document your findings
- Learn from each incident
