# RKE2 Troubleshooting Guide
This comprehensive troubleshooting guide covers common issues encountered in production RKE2 deployments, with practical solutions and diagnostic techniques.
## General Diagnostics

### Cluster Health Overview

Start with these basic health checks:

```bash
# Check node status
kubectl get nodes -o wide
# Check system pods
kubectl get pods -A --field-selector=status.phase!=Running
# Check cluster info
kubectl cluster-info
# Check component status (deprecated since Kubernetes v1.19; may return no useful data)
kubectl get componentstatuses
```

### Log Collection
Gather logs from key components:
```bash
# RKE2 server logs
sudo journalctl -u rke2-server.service -f
# RKE2 agent logs
sudo journalctl -u rke2-agent.service -f
# Kubelet logs (RKE2 runs the kubelet as a child process, not a systemd unit)
sudo tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log
# Container runtime logs (RKE2 ships its own containerd)
sudo tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log
```
## Node-Level Issues

### Node Not Ready
When nodes show "NotReady" status:
```bash
# Check node conditions
kubectl describe node <node-name>
# Check the RKE2 service that supervises the kubelet
sudo systemctl status rke2-server   # or rke2-agent on worker nodes
# Check disk space
df -h
# Check memory usage
free -h
# Check for failed systemd services
systemctl --failed
```

### Node Join Failures
Common issues when agents can't join the cluster:
```bash
# Verify token
sudo cat /var/lib/rancher/rke2/server/node-token
# Test connectivity to supervisor API
curl -k https://lb.edge.example.com:9345/ping
# Check DNS resolution
nslookup lb.edge.example.com
# Verify firewall rules
sudo ufw status
sudo iptables -L -n
```
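If the token and connectivity both check out, verify the agent's join configuration itself. A minimal `/etc/rancher/rke2/config.yaml` sketch for an agent (the server URL matches the load balancer used above; the token is a placeholder):

```bash
# Write a minimal agent join configuration (placeholder values)
sudo mkdir -p /etc/rancher/rke2
cat <<'EOF' | sudo tee /etc/rancher/rke2/config.yaml
server: https://lb.edge.example.com:9345
token: <node-token from the server>
EOF
sudo systemctl restart rke2-agent.service
```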
### Certificate Issues

Certificate-related problems:

```bash
# Check certificate expiration (run as root; the tls directory is not world-readable)
sudo bash -c 'for cert in /var/lib/rancher/rke2/server/tls/*.crt; do
  echo "=== $cert ==="
  openssl x509 -in "$cert" -noout -dates
done'
# Verify certificate SANs
sudo openssl x509 -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"
# Rotate certificates (if needed) with the built-in subcommand; avoid deleting
# the tls directory wholesale, which would also destroy the cluster CA
sudo systemctl stop rke2-server.service
sudo rke2 certificate rotate
sudo systemctl start rke2-server.service
```
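To catch expirations before they cause outages, a small check that flags certificates expiring within 30 days can run from cron. A sketch; the threshold is an arbitrary choice:

```bash
#!/usr/bin/env bash
# Flag RKE2 server certificates that expire within 30 days (run as root)
THRESHOLD=$((30 * 24 * 3600))  # openssl -checkend takes seconds
for cert in /var/lib/rancher/rke2/server/tls/*.crt; do
  if ! openssl x509 -in "$cert" -noout -checkend "$THRESHOLD" >/dev/null; then
    echo "WARNING: $cert expires within 30 days"
    openssl x509 -in "$cert" -noout -enddate
  fi
done
```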
## Networking Issues

### Cilium Troubleshooting

#### Check Cilium Status

```bash
# Verify Cilium pods are running
kubectl get pods -n kube-system -l k8s-app=cilium
# Check Cilium status
kubectl -n kube-system exec -it ds/cilium -- cilium status
# Detailed Cilium status
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose
```

#### eBPF Verification
```bash
# Check if eBPF is properly enabled
kubectl -n kube-system exec -it ds/cilium -- cilium status | grep -E "KubeProxyReplacement|BPF|eBPF"
# Alternative command if cilium-dbg is available
kubectl -n kube-system exec -it ds/cilium -- cilium-dbg status
# Check eBPF-related kernel settings on the node
sysctl net.core.bpf_jit_enable net.core.bpf_jit_harden kernel.unprivileged_bpf_disabled
# Set missing eBPF configurations if necessary
sudo sysctl -w kernel.unprivileged_bpf_disabled=1
sudo sysctl -w net.core.bpf_jit_enable=1
sudo sysctl -w net.core.bpf_jit_harden=0
# Verify BPF filesystem mount
mount | grep bpf
ls -l /sys/fs/bpf
# Mount BPF filesystem manually if missing
sudo mount -t bpf bpf /sys/fs/bpf
# Add to /etc/fstab for persistence
echo "bpf /sys/fs/bpf bpf defaults 0 0" | sudo tee -a /etc/fstab
# eBPF is compiled into modern kernels rather than shipped as loadable modules,
# so modprobe is not applicable; verify the kernel build options instead
grep -E "CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_BPF_JIT=" /boot/config-$(uname -r)
# Check BPF features
bpftool feature
# Check Cilium BPF maps
kubectl -n kube-system exec -it ds/cilium -- cilium-dbg bpf tunnel list
# Check available Cilium commands
kubectl -n kube-system exec -it ds/cilium -- cilium --help
kubectl -n kube-system exec -it ds/cilium -- cilium-dbg --help
```

#### CNI Configuration Problems

```bash
# Check the CNI configuration directory (RKE2 uses its own path, not /etc/cni/net.d)
ls -l /var/lib/rancher/rke2/agent/etc/cni/net.d/
sudo cat /var/lib/rancher/rke2/agent/etc/cni/net.d/05-cilium.conflist
# If CNI config is missing, restart Cilium pods
kubectl delete pod -n kube-system -l k8s-app=cilium
# Check Cilium HelmChartConfig
kubectl get helmchartconfig -n kube-system rke2-cilium -o yaml
# Apply custom Cilium configuration (RKE2 auto-applies manifests in this
# directory, so editing the file is normally sufficient)
kubectl apply -f /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml
# Check Cilium operator logs
kubectl logs -n kube-system -l app.kubernetes.io/name=cilium-operator
# Manually reinstall Cilium (last resort)
kubectl delete -n kube-system daemonset cilium
kubectl delete -n kube-system deployment cilium-operator
kubectl delete -f /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml
kubectl apply -f /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml
```
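For reference, custom Cilium settings reach the bundled chart through a `HelmChartConfig` manifest. A minimal sketch (the values shown are illustrative, not recommendations):

```bash
# Minimal HelmChartConfig overriding values of the bundled rke2-cilium chart
cat <<'EOF' | sudo tee /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    hubble:
      enabled: true
EOF
```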
#### Cilium Connectivity Testing

```bash
# Get detailed pod information
kubectl get pods -n kube-system -l k8s-app=cilium --show-labels
# Check Cilium logs with label selector
kubectl logs -l k8s-app=cilium -n kube-system
# Check specific containers in Cilium pods
kubectl logs -n kube-system <cilium-pod-name> -c install-cni-binaries
kubectl logs -n kube-system <cilium-pod-name> -c mount-bpf-fs
kubectl logs -n kube-system <cilium-pod-name> -c apply-sysctl-overwrites
# Check if previous container logs are available
kubectl logs -n kube-system <cilium-pod-name> -c cilium --previous
```

### Network Policy Issues

```bash
# List all network policies
kubectl get cnp,netpol -A
# Describe specific policy
kubectl describe cnp <policy-name> -n <namespace>
# Check policy enforcement
kubectl -n kube-system exec -it ds/cilium -- cilium policy get
# Monitor policy violations
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type policy-verdict
```
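When a workload is unexpectedly blocked, compare the policy verdicts above against the policy definition. For orientation, a minimal `CiliumNetworkPolicy` looks like this (names, namespace, and labels are hypothetical):

```bash
# Minimal CiliumNetworkPolicy: allow ingress to app=backend only from app=frontend
cat <<'EOF' | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend
  namespace: demo
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
EOF
```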
### DNS Resolution Problems

```bash
# Test DNS from a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test external DNS
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup google.com
```

### Load Balancer Issues

```bash
# Test load balancer connectivity
curl -k https://lb.edge.example.com:6443/healthz
# Check NGINX logs (if using NGINX)
sudo tail -f /var/log/nginx/error.log
# Test backend server health
for server in 192.168.10.10 192.168.10.11 192.168.10.12; do
  echo "Testing $server..."
  curl -k https://$server:6443/healthz
done
# Restart the load balancer
sudo systemctl restart nginx
```
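For context, the load balancer must pass through both the apiserver port (6443) and the supervisor port (9345). A minimal NGINX `stream` sketch using the backend IPs from the health checks above (requires the NGINX stream module; the file must be included at the top level of `nginx.conf`, not inside the `http` block):

```bash
# Minimal TCP passthrough for RKE2; reference it from nginx.conf with:
#   include /etc/nginx/rke2-stream.conf;
cat <<'EOF' | sudo tee /etc/nginx/rke2-stream.conf >/dev/null
stream {
    upstream rke2_apiserver {
        server 192.168.10.10:6443;
        server 192.168.10.11:6443;
        server 192.168.10.12:6443;
    }
    upstream rke2_supervisor {
        server 192.168.10.10:9345;
        server 192.168.10.11:9345;
        server 192.168.10.12:9345;
    }
    server { listen 6443; proxy_pass rke2_apiserver; }
    server { listen 9345; proxy_pass rke2_supervisor; }
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```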
## Storage Issues

### Persistent Volume Problems

```bash
# Check PV status
kubectl get pv
# Check PVC status
kubectl get pvc -A
# Describe problematic PVC
kubectl describe pvc <pvc-name> -n <namespace>
# Check storage class
kubectl get storageclass
# Local path provisioner logs (if used)
kubectl logs -n kube-system -l app=local-path-provisioner
```

### Disk Space Issues

```bash
# Check disk usage on nodes
df -h
# Check container image usage
sudo crictl images
sudo crictl rmi <image-id> # Remove unused images
# Clean up unused containers
sudo crictl ps -a
sudo crictl rm <container-id>
# List images larger than ~1 GB on each node (likely kubelet garbage-collection candidates)
kubectl get nodes -o jsonpath='{.items[*].status.images[?(@.sizeBytes>1000000000)].names[0]}'
```

## etcd Issues
### etcd Health Checks

```bash
# RKE2 does not ship etcdctl; install a matching etcdctl binary on the server
# node (or exec into the etcd static pod) before running these commands
# Check etcd member list
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list
# Check etcd cluster health
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health
# Check etcd status
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --write-out=table
```
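Repeating the TLS flags is error-prone. A small wrapper function keeps the commands readable (a sketch; it assumes etcdctl sits at the path used above):

```bash
# Inject the RKE2 etcd TLS flags into every etcdctl invocation
rke2-etcdctl() {
  sudo /var/lib/rancher/rke2/bin/etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    "$@"
}

# Usage:
rke2-etcdctl member list
rke2-etcdctl endpoint status --write-out=table
```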
### etcd Performance Issues

```bash
# Check etcd performance
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  check perf
# Monitor etcd metrics (served over plain HTTP on the local metrics port)
curl -s http://127.0.0.1:2381/metrics | grep etcd_
```

## Security and Compliance Issues
### CIS Compliance Problems

```bash
# Run CIS benchmark scan
kube-bench --benchmark cis-1.9
# Check if etcd user exists
id etcd
# Verify SELinux status (RHEL/CentOS)
sestatus
getenforce
# Check AppArmor status (Ubuntu)
sudo aa-status
# Verify pod security standards
kubectl get ns -o jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.pod-security\.kubernetes\.io/enforce}{"\n"}{end}'
```

### RBAC Issues

```bash
# Check if user can perform action
kubectl auth can-i <verb> <resource> --as=<user>
# Check what a service account can do
kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<serviceaccount>
# Check role bindings
kubectl get rolebindings,clusterrolebindings -A -o wide
```

## Performance Issues

### Resource Constraints

```bash
# Check node resource usage
kubectl top nodes
# Check pod resource usage
kubectl top pods -A
# Identify resource-hungry pods
kubectl get pods -A --sort-by='.spec.containers[0].resources.requests.memory'
# Check for OOMKilled pods
kubectl get pods -A --field-selector=status.phase=Failed
kubectl describe pod <failed-pod> -n <namespace>
```
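Note that `status.phase=Failed` misses containers that were OOMKilled and then restarted in place. A sketch that surfaces those directly (assumes `jq` is installed):

```bash
# Print namespace/pod/container for every container whose last termination was an OOM kill
kubectl get pods -A -o json | jq -r '
  .items[] as $p
  | $p.status.containerStatuses[]?
  | select(.lastState.terminated.reason == "OOMKilled")
  | "\($p.metadata.namespace)/\($p.metadata.name)/\(.name)"'
```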
### API Server Performance

```bash
# Check API server metrics
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
# Check for slow requests
kubectl logs -n kube-system <apiserver-pod> | grep "slow request"
# Monitor API server load
kubectl top pods -n kube-system -l component=kube-apiserver
```

## Container Runtime Issues

### containerd Problems

```bash
# containerd runs as a child of the RKE2 service rather than its own unit
sudo systemctl status rke2-server   # or rke2-agent on worker nodes
# List containers
sudo crictl ps -a
# Check container logs
sudo crictl logs <container-id>
# Inspect container
sudo crictl inspect <container-id>
# Check containerd logs
sudo tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log
```

### Image Pull Issues

```bash
# Check image pull status
kubectl describe pod <pod-name> -n <namespace>
# Test image pull manually
sudo crictl pull <image-name>
# Check registry connectivity
curl -I https://registry-1.docker.io/v2/
# Check image repository secrets
kubectl get secrets -A | grep docker
```
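If pulls fail against a private or mirrored registry, also check RKE2's registry configuration. A minimal `/etc/rancher/rke2/registries.yaml` sketch (hostnames and credentials are placeholders):

```bash
# Point docker.io pulls at an internal mirror and supply credentials (placeholders)
cat <<'EOF' | sudo tee /etc/rancher/rke2/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://registry.internal.example.com"
configs:
  "registry.internal.example.com":
    auth:
      username: <username>
      password: <password>
EOF
# Restart RKE2 so containerd picks up the new registry configuration
sudo systemctl restart rke2-server   # or rke2-agent
```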
## Backup and Recovery

### etcd Backup Issues

```bash
# Create etcd backup
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  snapshot save /tmp/etcd-snapshot.db
# Verify backup
sudo /var/lib/rancher/rke2/bin/etcdctl snapshot status /tmp/etcd-snapshot.db --write-out=table
```
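RKE2 also manages etcd snapshots natively, which avoids the external etcdctl dependency. A sketch using the built-in subcommand and the scheduled-snapshot settings from the server configuration:

```bash
# Take and list snapshots with the built-in tooling
sudo rke2 etcd-snapshot save --name manual-snapshot
sudo rke2 etcd-snapshot ls

# Schedule automatic snapshots (appends to the server config; adjust to taste)
cat <<'EOF' | sudo tee -a /etc/rancher/rke2/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 10
EOF
sudo systemctl restart rke2-server
```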
## Monitoring and Alerting

### Metrics Collection Issues

```bash
# Check if metrics server is running
kubectl get pods -n kube-system -l k8s-app=metrics-server
# Query the Resource Metrics API served by metrics-server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
# Check node exporter (if using Prometheus)
curl http://<node-ip>:9100/metrics
```
## Recovery Procedures

### Cluster Recovery Checklist

1. Assess the situation:

   ```bash
   kubectl get nodes
   kubectl get pods -A
   ```

2. Check system services:

   ```bash
   sudo systemctl status rke2-server
   sudo systemctl status rke2-agent
   ```

3. Restart services if needed:

   ```bash
   sudo systemctl restart rke2-server
   sudo systemctl restart rke2-agent
   ```

4. Restore from backup if necessary:

   ```bash
   # Stop RKE2
   sudo systemctl stop rke2-server
   # Restore an etcd snapshot using the built-in cluster reset
   sudo rke2 server --cluster-reset --cluster-reset-restore-path=/path/to/backup.db
   # Start RKE2
   sudo systemctl start rke2-server
   ```
## Debug Utilities

### Useful Debugging Commands

```bash
# Get all events sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp -A
# Check resource quotas
kubectl get resourcequota -A
# Check limit ranges
kubectl get limitrange -A
# Get pod logs for all containers
kubectl logs <pod-name> -n <namespace> --all-containers=true
# Port forward for debugging
kubectl port-forward <pod-name> <local-port>:<pod-port> -n <namespace>
# Execute commands in running pods
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
```
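For images that ship without a shell, `kubectl debug` attaches an ephemeral container instead (requires a Kubernetes version with ephemeral containers enabled; the busybox image is just an example):

```bash
# Attach an ephemeral debug container to a running pod
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>
# Spin up a debugging pod on a specific node (host filesystem mounted at /host)
kubectl debug node/<node-name> -it --image=busybox
```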
### Network Debugging

```bash
# Test connectivity between pods
kubectl run -it --rm netshoot --image=nicolaka/netshoot --restart=Never -- /bin/bash
# Monitor network traffic
sudo tcpdump -i any port 6443
# Check routing table
ip route show
# Test DNS resolution
dig kubernetes.default.svc.cluster.local @10.43.0.10
```

This troubleshooting guide should help you diagnose and resolve most common issues in RKE2 deployments. Always start with the basic health checks and work your way through the specific subsystems based on the symptoms you observe.
