This post is oriented towards network engineers who want to start diving into the systems and container world.
As a network person, I'm still not very familiar with containers, so I'm writing this article as a reminder of the main issues I ran into during the Cilium installation. I struggled a lot to deploy Cilium on a Kind cluster, and I also hit problems when I tried to set up the cluster mesh with Kind.
As is often the case, when you struggle a lot, you learn a lot. In a couple of months these issues will probably not be relevant anymore, as everything is moving and improving so fast.
With Cilium, the main command to check the health of the installation is cilium status --wait, but what if, like me, you struggle even before Cilium is installed? Whenever something is not working, the first question is which logs to check and how. So, first, here are a couple of useful commands to collect logs.
Troubleshooting command list
Cilium
The basic command to display the Cilium status
cilium status --wait
Kubernetes
To get very detailed logs of what happens during the creation of the pods:
kubectl describe pod -n kube-system
You can also see the errors that happened per container in a specific pod:
kubectl describe pod -n kube-system <pod_name>
kubectl describe pod -n kube-system cilium-pncjs
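The describe output is quite long; a trick I use to jump straight to the interesting part is to filter on the Events section at the end (reusing the pod name from the example above):
kubectl describe pod -n kube-system cilium-pncjs | grep -i -A 20 events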
To see the logs of a pod with timestamps:
kubectl -n kube-system logs --timestamps <pod_name>
kubectl -n kube-system logs --timestamps cilium-kfwdf
#-p gives you access to the logs of the previous container instance
kubectl -n kube-system logs --timestamps -p <pod_name>
To see the logs of a container
kubectl logs <pod_name> -c <container_name>
kubectl logs mc2 -c 1st
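For a Cilium pod, the main container is typically named cilium-agent (I'm reusing the hypothetical pod name from above), so the command would look like:
kubectl -n kube-system logs cilium-kfwdf -c cilium-agent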
To list the recent events in the cluster:
kubectl get events
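The events are easier to read when they are sorted by creation time and collected from all namespaces; this is just the way I would filter them, not the only one:
kubectl get events -A --sort-by=.metadata.creationTimestamp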
The cluster-info dump command is very verbose and is, so far, the most complete output I have seen:
kubectl cluster-info dump | grep -i error
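If grep is not enough, the dump can also be written to a directory so you can browse the per-pod logs afterwards (the directory name here is just an example):
kubectl cluster-info dump --namespaces kube-system --output-directory=./cluster-dump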
Docker
These are useful commands for when you have no space left on your drive :D
#List then delete stale volumes
docker volume ls
docker volume prune
#Display then delete a container
docker ps -a
docker container rm <container_id>
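Docker can also report and reclaim disk usage globally; these two commands are a broader variant of the ones above (note that prune removes all unused containers, networks and dangling images, so use it with care):
#Show what is using the disk space
docker system df
#Remove unused containers, networks and dangling images
docker system prune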
Other useful commands
These are the two commands I wish I had known before installing the Cilium cluster mesh, because you need to switch contexts to see the status of both clusters.
kubectl config get-contexts
kubectl config use-context kind-cluster2
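For example, with two Kind clusters named cluster1 and cluster2 (assuming the default kind- context prefix), checking Cilium on both looks like this:
kubectl config use-context kind-cluster1
cilium status --wait
kubectl config use-context kind-cluster2
cilium status --wait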
Issues I’ve got during the install
Error : could not find a log line that matches
The well documented “too many open files”
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.31.0) 🖼
✗ Preparing nodes 📦 📦 📦 📦
Deleted nodes: ["kind-control-plane" "kind-worker" "kind-worker2" "kind-worker3"]
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
How I resolved the error
By googling the error message you can easily find resources. The commands and explanation to solve the issue can be found here: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
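These sysctl values are not persistent across reboots; the kind known-issues page suggests adding them to /etc/sysctl.conf so the fix survives a reboot:
#Check the current values
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
#Persist the new values
echo "fs.inotify.max_user_watches = 524288" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_instances = 512" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p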
Error due to snap on my Ubuntu
This error happened after I tried to load the Cilium image onto the Kubernetes nodes.
ERROR: command "docker save -o /tmp/images-tar1960078274/images.tar quay.io/cilium/cilium:v1.17.2" failed with error: exit status 1
Command Output: failed to save image: invalid output path: directory "/tmp/images-tar1960078274" does not exist
Resources:
https://github.com/kubernetes-sigs/kind/issues/2535#issuecomment-968836159
https://kind.sigs.k8s.io/docs/user/known-issues/#docker-installed-with-snap
How I resolved the error
#create a new dir
mkdir dockerimage
#make the directory accessible (755 = rwx for the owner, r-x for everyone else)
chmod 755 dockerimage
#save the new image
docker save -o /home/nboulene/dockerimage/images.tar quay.io/cilium/cilium:v1.17.2
#load the saved image to the cluster
kind load image-archive dockerimage/images.tar --name=cluster1
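To double-check that the image really reached the nodes, you can list the images from inside one of the node containers (assuming the default kind node naming, here cluster1-control-plane):
docker exec -it cluster1-control-plane crictl images | grep cilium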
Error due to container privileges
The containers were in the Init:CrashLoopBackOff state:
kubectl get pods -A
kube-system cilium-mtcdq 0/1 Init:CrashLoopBackOff 4 (56s ago) 3m49s
kube-system cilium-n6gcz 0/1 Init:CrashLoopBackOff 4 (64s ago) 3m49s
kube-system cilium-nv54v 0/1 Init:CrashLoopBackOff 4 (59s ago) 3m49s
Using the command to display the details of a specific pod, my error was:
kubectl describe pod -n kube-system cilium-nv54v
<snip>
Back-off restarting failed container mount-cgroup in pod cilium-nv54v_kube-system
Then when I’ve checked the pod logs detail, I saw it was in “Waiting” state with reason CrashLoopBackoff
kubectl describe pod -n kube-system cilium-nv54v | egrep -i -A 15 mount-cgroup:$
mount-cgroup:
Container ID: containerd://3c769d6500631b3f4369e0c092cc117b2d6970e1a83983cd0ec558c506295cfd
Image: quay.io/cilium/cilium:v1.17.1@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
Image ID: quay.io/cilium/cilium@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
Port: <none>
Host Port: <none>
Command:
sh
-ec
cp /usr/bin/cilium-mount /hostbin/cilium-mount;
nsenter --cgroup=/hostproc/1/ns/cgroup --mount=/hostproc/1/ns/mnt "${BIN_PATH}/cilium-mount" $CGROUP_ROOT;
rm /hostbin/cilium-mount
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
On the init container mount-cgroup, the error was:
cp: cannot create regular file '/hostbin/cilium-mount': Permission denied
How I resolved the error
To resolve the issue, you need to add the option --set securityContext.privileged=true when installing Cilium with Helm:
helm install cilium cilium/cilium --version 1.17.2 \
--namespace kube-system \
--set image.pullPolicy=IfNotPresent \
--set ipam.mode=kubernetes --set securityContext.privileged=true
Later, for the cluster mesh, each cluster also needs a unique name and ID, which can be set with helm upgrade:
helm upgrade cilium cilium/cilium --version 1.17.2 --namespace kube-system --set cluster.name=cluster1 --set cluster.id=1
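After changing the Helm values, I found it useful to make sure the Cilium DaemonSet had finished rolling out before going further (the DaemonSet is simply named cilium):
kubectl -n kube-system rollout status daemonset/cilium
cilium status --wait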
Error with the service type for cluster mesh
When I tried to enable the cluster mesh on the first cluster, I got the following error:
cilium clustermesh enable --context $CLUSTER1
Error: Unable to enable ClusterMesh: cannot auto-detect service type, please specify using '--service-type' option
The three options are:
- LoadBalancer: requires an external load balancer, which I didn't have.
- ClusterIP: returns Error: Unable to enable ClusterMesh: service type "ClusterIP" is not valid
- NodePort: the only remaining viable option :)
How I resolved the error
Simply by specifying the service-type
cilium clustermesh enable --service-type NodePort --context $CLUSTER1
⚠️ Using service type NodePort may fail when nodes are removed from the cluster!
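For completeness, and as far as I understand the documented workflow, the same command then has to be run against the second cluster before connecting the two; this is a sketch of the follow-up steps, assuming $CLUSTER2 points to the second cluster's context:
cilium clustermesh enable --service-type NodePort --context $CLUSTER2
cilium clustermesh connect --context $CLUSTER1 --destination-context $CLUSTER2
cilium clustermesh status --context $CLUSTER1 --wait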