RKE2 Troubleshooting Guide

This comprehensive troubleshooting guide covers common issues encountered in production RKE2 deployments, with practical solutions and diagnostic techniques.

General Diagnostics

Cluster Health Overview

Start with these basic health checks:

bash
# Check node status
kubectl get nodes -o wide

# Check system pods
kubectl get pods -A --field-selector=status.phase!=Running

# Check cluster info
kubectl cluster-info

# Check component status (deprecated since Kubernetes 1.19; output may be incomplete)
kubectl get componentstatuses

Log Collection

Gather logs from key components:

bash
# RKE2 server logs
sudo journalctl -u rke2-server.service -f

# RKE2 agent logs
sudo journalctl -u rke2-agent.service -f

# Kubelet logs (RKE2 runs the kubelet as a child process, not a separate systemd unit)
sudo tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log

# Container runtime logs (RKE2 bundles its own containerd)
sudo tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log

Node-Level Issues

Node Not Ready

When nodes show "NotReady" status:

bash
# Check node conditions
kubectl describe node <node-name>

# Check the RKE2 service that manages the kubelet on that node
sudo systemctl status rke2-server.service   # control-plane nodes
sudo systemctl status rke2-agent.service    # worker nodes

# Inspect recent kubelet log entries
sudo tail -n 100 /var/lib/rancher/rke2/agent/logs/kubelet.log

# Check disk space
df -h

# Check memory usage
free -h

# Check for failed systemd services
systemctl --failed

Node Join Failures

Common issues when agents can't join the cluster:

bash
# Verify token
sudo cat /var/lib/rancher/rke2/server/node-token

# Test connectivity to supervisor API
curl -k https://lb.edge.example.com:9345/ping

# Check DNS resolution
nslookup lb.edge.example.com

# Verify firewall rules
sudo ufw status
sudo iptables -L -n
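
If the token and connectivity check out, verify the agent's own configuration. A minimal /etc/rancher/rke2/config.yaml for an agent is sketched below; the server URL matches the examples in this guide, and the token placeholder must be replaced with the real node token.

bash
# Minimal agent configuration (illustrative values; adjust to your environment)
sudo mkdir -p /etc/rancher/rke2
cat <<'EOF' | sudo tee /etc/rancher/rke2/config.yaml
server: https://lb.edge.example.com:9345
token: <node-token-from-a-server-node>
EOF

# Restart the agent to pick up the change
sudo systemctl restart rke2-agent.service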

Certificate Issues

Certificate-related problems:

bash
# Check certificate expiration
for cert in /var/lib/rancher/rke2/server/tls/*.crt; do
    echo "=== $cert ==="
    openssl x509 -in "$cert" -noout -dates
done
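
# Optional helper: flag certificates expiring within the next 30 days (threshold is arbitrary)
for cert in /var/lib/rancher/rke2/server/tls/*.crt; do
    openssl x509 -in "$cert" -noout -checkend $((30*24*3600)) >/dev/null \
        || echo "WARNING: $cert expires within 30 days"
done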

# Verify certificate SANs
openssl x509 -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"

# Rotate certificates if needed (preferred on recent RKE2 releases; back up the tls directory first)
sudo systemctl stop rke2-server.service
sudo rke2 certificate rotate
sudo systemctl start rke2-server.service

Networking Issues

Cilium Troubleshooting

Check Cilium Status

bash
# Verify Cilium pods are running
kubectl get pods -n kube-system -l k8s-app=cilium

# Check Cilium status
kubectl -n kube-system exec -it ds/cilium -- cilium status

# Detailed Cilium status
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose

eBPF Verification

bash
# Check if eBPF is properly enabled
kubectl -n kube-system exec -it ds/cilium -- cilium status | grep -E "KubeProxyReplacement|BPF|eBPF"

# Detailed Cilium status with verbose output
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose

# Alternative command if cilium-dbg is available
kubectl -n kube-system exec -it ds/cilium -- cilium-dbg status

# Check eBPF readiness on nodes
sysctl -a | grep bpf
sysctl -a | grep net.core.bpf
sysctl -a | grep kernel.unprivileged_bpf_disabled

# Apply commonly recommended BPF sysctls if needed (JIT enabled; unprivileged BPF disabled for hardening)
sudo sysctl -w kernel.unprivileged_bpf_disabled=1
sudo sysctl -w net.core.bpf_jit_enable=1
sudo sysctl -w net.core.bpf_jit_harden=0

# Verify BPF filesystem mount
mount | grep bpf
ls -l /sys/fs/bpf

# Mount BPF filesystem manually if missing
sudo mount -t bpf bpf /sys/fs/bpf

# Add to /etc/fstab for persistence
echo "bpf  /sys/fs/bpf  bpf  defaults  0  0" | sudo tee -a /etc/fstab

# Check related kernel modules (the BPF subsystem itself is built into the kernel, not a loadable module)
lsmod | grep -E "xdp|cgroup|ip6table"

# Confirm the required BPF options are compiled into the running kernel
grep -E "CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_BPF_JIT=" /boot/config-$(uname -r)

# Check BPF features
bpftool feature

# Check Cilium BPF maps
kubectl -n kube-system exec -it ds/cilium -- cilium-dbg bpf tunnel list

# Check available Cilium commands
kubectl -n kube-system exec -it ds/cilium -- cilium --help
kubectl -n kube-system exec -it ds/cilium -- cilium-dbg --help

CNI Configuration Problems

bash
# Check CNI configuration directory
ls -l /etc/cni/net.d/
cat /etc/cni/net.d/05-cilium.conflist

# If CNI config is missing, restart Cilium pods
kubectl delete pod -n kube-system -l k8s-app=cilium

# Check Cilium HelmChartConfig
kubectl get helmchartconfig -n kube-system rke2-cilium -o yaml

# Apply custom Cilium configuration
kubectl apply -f /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml

# Check Cilium operator logs
kubectl logs -n kube-system -l app.kubernetes.io/name=cilium-operator

# Manually reinstall Cilium (last resort)
kubectl delete -n kube-system daemonset cilium
kubectl delete -n kube-system deployment cilium-operator
kubectl delete -f /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml
kubectl apply -f /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml
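
For reference, RKE2 consumes Cilium overrides through a HelmChartConfig placed in the server manifests directory. The sketch below is illustrative only; the valuesContent keys shown are examples, not a complete or recommended configuration.

bash
# Illustrative HelmChartConfig for Cilium overrides (values are examples only)
cat <<'EOF' | sudo tee /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: true
    hubble:
      enabled: true
EOF
# rke2-server also applies files in this directory automatically at startup.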

Cilium Connectivity Testing

bash
# Check Cilium pod status
kubectl get pods -n kube-system -l k8s-app=cilium

# Get detailed pod information
kubectl get pods -n kube-system -l k8s-app=cilium --show-labels

# Check Cilium logs with label selector
kubectl logs -l k8s-app=cilium -n kube-system

# Check specific containers in Cilium pods
kubectl logs -n kube-system <cilium-pod-name> -c install-cni-binaries
kubectl logs -n kube-system <cilium-pod-name> -c mount-bpf-fs
kubectl logs -n kube-system <cilium-pod-name> -c apply-sysctl-overwrites

# Check if previous container logs are available
kubectl logs -n kube-system <cilium-pod-name> -c cilium --previous

Network Policy Issues

bash
# List all network policies
kubectl get cnp,netpol -A

# Describe specific policy
kubectl describe cnp <policy-name> -n <namespace>

# Check policy enforcement
kubectl -n kube-system exec -it ds/cilium -- cilium policy get

# Monitor policy violations
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type policy-verdict
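
When a policy's intent is unclear, it can help to compare it against a minimal known-good policy. The sketch below uses placeholder labels and namespace: it allows ingress to pods labeled app=backend only from pods labeled app=frontend.

bash
# Minimal CiliumNetworkPolicy for comparison (labels and namespace are placeholders)
cat <<'EOF' | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
EOF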

DNS Resolution Problems

bash
# Test DNS from a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test external DNS
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup google.com
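
If only cluster-internal names fail, inspect the CoreDNS configuration and the cluster DNS Service. The ConfigMap name is not hard-coded below because it varies with the chart version, so locate it first.

bash
# Locate and inspect the CoreDNS Corefile (ConfigMap name varies by chart version)
kubectl get configmap -n kube-system | grep -i coredns
kubectl get configmap <coredns-configmap-name> -n kube-system -o yaml

# Confirm the cluster DNS Service address handed out to pods (10.43.0.10 by default in RKE2)
kubectl get svc -n kube-system -l k8s-app=kube-dns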

Load Balancer Issues

bash
# Test load balancer connectivity
curl -k https://lb.edge.example.com:6443/healthz

# Check NGINX logs (if using NGINX)
sudo tail -f /var/log/nginx/error.log

# Test backend server health
for server in 192.168.10.10 192.168.10.11 192.168.10.12; do
    echo "Testing $server..."
    curl -k https://$server:6443/healthz
done

# Restart load balancer
sudo systemctl restart nginx
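
If the backends are healthy but the load balancer still misbehaves, verify its TCP passthrough configuration. The sketch below assumes NGINX with the server IPs used elsewhere in this guide; the file path is an example and must be included from a stream {} block in nginx.conf.

bash
# Illustrative NGINX stream configuration for TCP passthrough (example path; include from a stream {} block)
cat <<'EOF' | sudo tee /etc/nginx/rke2-stream.conf
upstream rke2_apiserver {
    server 192.168.10.10:6443;
    server 192.168.10.11:6443;
    server 192.168.10.12:6443;
}
upstream rke2_supervisor {
    server 192.168.10.10:9345;
    server 192.168.10.11:9345;
    server 192.168.10.12:9345;
}
server {
    listen 6443;
    proxy_pass rke2_apiserver;
}
server {
    listen 9345;
    proxy_pass rke2_supervisor;
}
EOF
sudo nginx -t && sudo systemctl reload nginx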

Storage Issues

Persistent Volume Problems

bash
# Check PV status
kubectl get pv

# Check PVC status
kubectl get pvc -A

# Describe problematic PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage class
kubectl get storageclass

# Local path provisioner logs (if used)
kubectl logs -n kube-system -l app=local-path-provisioner

Disk Space Issues

bash
# Check disk usage on nodes
df -h

# Check container image usage (RKE2 ships crictl in /var/lib/rancher/rke2/bin;
# point it at RKE2's runtime with: export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml)
sudo crictl images
sudo crictl rmi <image-id>  # Remove unused images

# Clean up unused containers
sudo crictl ps -a
sudo crictl rm <container-id>

# List images larger than 1 GB reported in node status (candidates for cleanup)
kubectl get nodes -o jsonpath='{.items[*].status.images[?(@.sizeBytes>1000000000)].names[0]}'
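
If disk pressure keeps recurring, the kubelet's image garbage-collection thresholds can be tightened through RKE2's config file. The values below are illustrative, not recommendations.

bash
# Example kubelet image GC thresholds via RKE2 config (values are illustrative)
cat <<'EOF' | sudo tee -a /etc/rancher/rke2/config.yaml
kubelet-arg:
  - "image-gc-high-threshold=75"
  - "image-gc-low-threshold=60"
EOF
sudo systemctl restart rke2-agent.service   # or rke2-server.service on server nodes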

etcd Issues

etcd Health Checks

bash
# Check etcd member list
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list

# Check etcd cluster health
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health

# Check etcd status
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --write-out=table
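
To avoid repeating the TLS flags in every command, they can be wrapped in a small shell function. This is a convenience sketch that reuses the same RKE2 default paths shown above.

bash
# Convenience wrapper around RKE2's etcdctl invocation (same paths as above)
etcdctl_rke2() {
    sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
      --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
      --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
      "$@"
}

# Usage examples
etcdctl_rke2 member list
etcdctl_rke2 endpoint status --write-out=table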

etcd Performance Issues

bash
# Check etcd performance
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  check perf

# Monitor etcd metrics (served over plain HTTP on the local metrics port)
curl -s http://127.0.0.1:2381/metrics | grep etcd_

Security and Compliance Issues

CIS Compliance Problems

bash
# Run CIS benchmark scan
kube-bench --benchmark cis-1.9

# Check if etcd user exists
id etcd

# Verify SELinux status (RHEL/CentOS)
sestatus
getenforce

# Check AppArmor status (Ubuntu)
sudo aa-status

# Verify pod security standards
kubectl get ns -o jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.pod-security\.kubernetes\.io/enforce}{"\n"}{end}'
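
Where a namespace is missing an enforcement label, it can be set directly; the namespace name and level below are placeholders.

bash
# Apply a Pod Security Standards enforcement label (namespace and level are examples)
kubectl label namespace <namespace> pod-security.kubernetes.io/enforce=restricted --overwrite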

RBAC Issues

bash
# Check if user can perform action
kubectl auth can-i <verb> <resource> --as=<user>

# Check what a service account can do
kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<serviceaccount>

# Check role bindings
kubectl get rolebindings,clusterrolebindings -A -o wide
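
As a concrete usage example of the checks above (namespace and service account are placeholders):

bash
# Can the default service account in the app namespace list pods?
kubectl auth can-i list pods --as=system:serviceaccount:app:default -n app

# Find the bindings that reference a given subject
kubectl get clusterrolebindings,rolebindings -A -o wide | grep <serviceaccount-or-user>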

Performance Issues

Resource Constraints

bash
# Check node resource usage
kubectl top nodes

# Check pod resource usage
kubectl top pods -A

# Identify resource-hungry pods by actual usage
kubectl top pods -A --sort-by=memory

# Check for OOMKilled pods
kubectl get pods -A --field-selector=status.phase=Failed
kubectl describe pod <failed-pod> -n <namespace>

API Server Performance

bash
# Check API server metrics
kubectl get --raw /metrics | grep apiserver_request_duration_seconds

# Check for slow requests
kubectl logs -n kube-system <apiserver-pod> | grep "slow request"

# Monitor API server load
kubectl top pods -n kube-system -l component=kube-apiserver

Container Runtime Issues

containerd Problems

bash
# Check that RKE2's bundled containerd is running (it is not a separate systemd unit)
ps -ef | grep '[c]ontainerd'

# List containers
sudo crictl ps -a

# Check container logs
sudo crictl logs <container-id>

# Inspect container
sudo crictl inspect <container-id>

# Check containerd logs (written to RKE2's agent directory)
sudo tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log

Image Pull Issues

bash
# Check image pull status
kubectl describe pod <pod-name> -n <namespace>

# Test image pull manually
sudo crictl pull <image-name>

# Check registry connectivity
curl -I https://registry-1.docker.io/v2/

# Check image repository secrets
kubectl get secrets -A | grep docker
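
If pulls fail only from a private or mirrored registry, check RKE2's registry configuration. The registry hostname and credentials below are placeholders.

bash
# Illustrative /etc/rancher/rke2/registries.yaml (hostname and credentials are placeholders)
cat <<'EOF' | sudo tee /etc/rancher/rke2/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://registry.example.internal:5000"
configs:
  "registry.example.internal:5000":
    auth:
      username: <username>
      password: <password>
EOF

# Restart the node's RKE2 service so containerd picks up the change
sudo systemctl restart rke2-agent.service   # or rke2-server.service on server nodes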

Backup and Recovery

etcd Backup Issues

bash
# Create etcd backup
sudo /var/lib/rancher/rke2/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  snapshot save /tmp/etcd-snapshot.db

# Verify backup
sudo /var/lib/rancher/rke2/bin/etcdctl snapshot status /tmp/etcd-snapshot.db --write-out=table
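
RKE2 also ships built-in etcd snapshot handling, which is usually simpler than calling etcdctl directly. The schedule and retention values below are examples only.

bash
# On-demand snapshot using RKE2's built-in tooling
sudo rke2 etcd-snapshot save --name manual-snapshot

# List existing snapshots
sudo rke2 etcd-snapshot list

# Scheduled snapshots via /etc/rancher/rke2/config.yaml (values are illustrative)
cat <<'EOF' | sudo tee -a /etc/rancher/rke2/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 10
EOF
sudo systemctl restart rke2-server.service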

Monitoring and Alerting

Metrics Collection Issues

bash
# Check if metrics server is running
kubectl get pods -n kube-system -l k8s-app=metrics-server

# Query the resource metrics API served by metrics-server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes

# Check node exporter (if using Prometheus)
curl http://<node-ip>:9100/metrics

Recovery Procedures

Cluster Recovery Checklist

  1. Assess the situation

    bash
    kubectl get nodes
    kubectl get pods -A
  2. Check system services

    bash
    sudo systemctl status rke2-server
    sudo systemctl status rke2-agent
  3. Restart services if needed

    bash
    sudo systemctl restart rke2-server
    sudo systemctl restart rke2-agent
  4. Restore from backup if necessary

    bash
    # Stop RKE2 on all server nodes
    sudo systemctl stop rke2-server
    
    # Restore the etcd snapshot using RKE2's cluster-reset procedure (run on one server)
    sudo rke2 server --cluster-reset --cluster-reset-restore-path=/path/to/backup.db
    
    # Start RKE2 on that server, then rejoin the remaining servers
    sudo systemctl start rke2-server

Debug Utilities

Useful Debugging Commands

bash
# Get all events sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp -A

# Check resource quotas
kubectl get resourcequota -A

# Check limit ranges
kubectl get limitrange -A

# Get pod logs for all containers
kubectl logs <pod-name> -n <namespace> --all-containers=true

# Port forward for debugging
kubectl port-forward <pod-name> <local-port>:<pod-port> -n <namespace>

# Execute commands in running pods
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

Network Debugging

bash
# Test connectivity between pods
kubectl run -it --rm netshoot --image=nicolaka/netshoot --restart=Never -- /bin/bash

# Monitor network traffic
sudo tcpdump -i any port 6443

# Check routing table
ip route show

# Test DNS resolution
dig kubernetes.default.svc.cluster.local @10.43.0.10

This troubleshooting guide should help you diagnose and resolve most common issues in RKE2 deployments. Always start with the basic health checks and work your way through the specific subsystems based on the symptoms you observe.