Cilium installation on kind and troubleshooting

This post is mostly aimed at network engineers who want to start diving into the systems and container world.

As a network person, I’m still not very familiar with containers, so I’m writing this article as a reminder of the main issues I ran into during the Cilium installation. I struggled a lot with various problems while deploying Cilium on a kind cluster, and I also hit issues when I tried to set up the cluster mesh with kind.
Like often, when you struggle a lot, you learn a lot. In a couple of months these issues will probably not be relevant anymore, as everything is moving and improving so fast.

With Cilium, the main command to check the health of the installation is cilium status --wait, but what if, like me, you struggle even before Cilium is installed? As always when something is not working, the first thing that comes to mind is: what kind of logs should I check, and how? So, first, here are a couple of useful commands to collect logs.

Troubleshooting commands list

Cilium

The basic command to display the Cilium status

cilium status --wait 
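
When the status output is not enough, recent versions of the same cilium CLI can also collect everything (agent logs, Kubernetes objects, configuration) into a single archive you can dig through later; this is the sysdump subcommand:

#collect a full troubleshooting archive in the current directory
cilium sysdump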

Kubernetes

To get very detailed logs of what happens during the creation of the pods.

kubectl describe pod -n kube-system

You can also see the errors that happen per container in a specific pod

kubectl describe pod -n kube-system <pod_name>  
kubectl describe pod -n kube-system cilium-pncjs

To see the logs of the pod with the timestamps

kubectl -n kube-system logs --timestamps <pod_name>
kubectl -n kube-system logs --timestamps cilium-kfwdf
#-p allows you to access the previous logs
kubectl -n kube-system logs --timestamps -p <pod_name>
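
If you don’t want to hunt for the right pod name, you can also grab the logs of every Cilium agent at once with a label selector (k8s-app=cilium is the default label set by the Cilium DaemonSet):

#with a selector kubectl only shows the last few lines per pod by default, hence --tail
kubectl -n kube-system logs --timestamps --prefix --tail=100 -l k8s-app=cilium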

To see the logs of a container

kubectl logs <pod_name> -c <container_name>
kubectl logs mc2 -c 1st
kubectl get events
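
The events are not sorted by time by default, which is annoying when you are trying to find the first failure; sorting on the creation timestamp helps:

kubectl get events -A --sort-by=.metadata.creationTimestamp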

The command cluster-info dump is very verbose and, so far, is the most complete output I have seen.

kubectl cluster-info dump | grep -i error
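
Since the dump is huge, writing it to files first makes grepping less painful (--output-directory is a standard flag of kubectl cluster-info dump):

kubectl cluster-info dump --all-namespaces --output-directory=/tmp/cluster-dump
grep -ri error /tmp/cluster-dump | less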

Docker

These are useful commands when you have no space left on your drive :D

#List then delete stale volumes
docker volume ls
docker volume prune

#Display then delete a container
docker ps -a
docker container rm <container_id>
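
If that is still not enough, Docker can also show you what is actually eating the disk and clean up unused images and build cache (standard Docker commands, nothing kind specific):

#show disk usage per images, containers, volumes and build cache
docker system df

#remove everything not used by a running container (careful: this also removes unused kind node images)
docker system prune -a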

Other useful commands

The two commands I wish I had known before installing Cilium cluster mesh, because you need to switch contexts to see the status of both clusters.

kubectl config get-contexts
kubectl config use-context kind-cluster2
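
Alternatively, you can avoid switching at all by passing the context explicitly on each command; both kubectl and the cilium CLI accept a --context flag:

kubectl --context kind-cluster1 get pods -A
kubectl --context kind-cluster2 get pods -A
cilium status --context kind-cluster2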

Issues I ran into during the install

Error : could not find a log line that matches

The well-documented “too many open files” issue

Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦 📦 📦 📦  
Deleted nodes: ["kind-control-plane" "kind-worker" "kind-worker2" "kind-worker3"]
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

How I resolved the error

By googling the error message you can easily find resources. The commands and explanation to solve the issue can be found here: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files

sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
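
Note that these sysctl changes do not survive a reboot. A small sketch to make them permanent, assuming your distribution loads /etc/sysctl.d at boot:

cat <<EOF | sudo tee /etc/sysctl.d/99-kind-inotify.conf
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system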

Error due to snap on my Ubuntu

This error happened after I tried to load the Cilium image onto the Kubernetes nodes.

ERROR: command "docker save -o /tmp/images-tar1960078274/images.tar quay.io/cilium/cilium:v1.17.2" failed with error: exit status 1
Command Output: failed to save image: invalid output path: directory "/tmp/images-tar1960078274" does not exist

Resources :
https://github.com/kubernetes-sigs/kind/issues/2535#issuecomment-968836159
https://kind.sigs.k8s.io/docs/user/known-issues/#docker-installed-with-snap

How I resolved the error

#create a new dir
mkdir dockerimage

#make the directory accessible
chmod 755 dockerimage

#save the new image
docker save -o /home/nboulene/dockerimage/images.tar quay.io/cilium/cilium:v1.17.2

#load the saved image to the cluster
kind load image-archive dockerimage/images.tar --name=cluster1
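
To double check that the image really ended up on the nodes, you can list the images known to containerd inside a kind node (crictl ships in the kind node image; cluster1-control-plane is the name kind gives to the control plane node of my cluster1):

docker exec -it cluster1-control-plane crictl images | grep cilium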

Error due to containers privileges

The containers were in Init:CrashLoopBackOff state

kubectl get pods -A

kube-system          cilium-mtcdq                                  0/1     Init:CrashLoopBackOff   4 (56s ago)   3m49s
kube-system          cilium-n6gcz                                  0/1     Init:CrashLoopBackOff   4 (64s ago)   3m49s
kube-system          cilium-nv54v                                  0/1     Init:CrashLoopBackOff   4 (59s ago)   3m49s

With the command that displays the details of a specific pod, my error was:

kubectl describe pod -n kube-system cilium-nv54v 
<snip>
Back-off restarting failed container mount-cgroup in pod cilium-nv54v_kube-system

Then, when I checked the pod details, I saw the container was in the “Waiting” state with reason CrashLoopBackOff.

kubectl describe pod -n kube-system cilium-nv54v | egrep -i -A 15 mount-cgroup:$

  mount-cgroup:
    Container ID:  containerd://3c769d6500631b3f4369e0c092cc117b2d6970e1a83983cd0ec558c506295cfd
    Image:         quay.io/cilium/cilium:v1.17.1@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
    Image ID:      quay.io/cilium/cilium@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -ec
      cp /usr/bin/cilium-mount /hostbin/cilium-mount;
      nsenter --cgroup=/hostproc/1/ns/cgroup --mount=/hostproc/1/ns/mnt "${BIN_PATH}/cilium-mount" $CGROUP_ROOT;
      rm /hostbin/cilium-mount
      
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated

And on the init container mount-cgroup, the error was:

cp: cannot create regular file '/hostbin/cilium-mount': Permission denied

How I resolved the error

To resolve the issue you need to pass the option --set securityContext.privileged=true when installing Cilium with Helm

helm install cilium cilium/cilium --version 1.17.2 \
   --namespace kube-system \
   --set image.pullPolicy=IfNotPresent \
   --set ipam.mode=kubernetes \
   --set securityContext.privileged=true

helm upgrade cilium cilium/cilium --version 1.17.2    --namespace kube-system --set cluster.name=cluster1 --set cluster.id=1
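
After the install (or the upgrade) you can check that the agents actually start this time; the daemonset name and the k8s-app=cilium label are the Cilium chart defaults:

kubectl -n kube-system rollout status daemonset/cilium
kubectl -n kube-system get pods -l k8s-app=cilium
cilium status --wait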

Error with the service Type for cluster mesh

When I tried to enable clustermesh on the first cluster, I got the following error:

cilium clustermesh enable --context $CLUSTER1
Error: Unable to enable ClusterMesh: cannot auto-detect service type, please specify using '--service-type' option

The 3 options are:

  • LoadBalancer : I would have needed an external load balancer and I didn’t have one.
  • ClusterIP : Error: Unable to enable ClusterMesh: service type “ClusterIP” is not valid
  • NodePort : the only remaining viable option :)

How I resolved the error

Simply by specifying the service type:

cilium clustermesh enable --service-type NodePort --context $CLUSTER1 
⚠️  Using service type NodePort may fail when nodes are removed from the cluster!
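
Once clustermesh is enabled on both clusters, the same CLI can tell you when the control plane is ready and then connect the two clusters (standard cilium CLI subcommands; $CLUSTER1 and $CLUSTER2 hold my two kind context names):

cilium clustermesh status --context $CLUSTER1 --wait
cilium clustermesh connect --context $CLUSTER1 --destination-context $CLUSTER2
cilium clustermesh status --context $CLUSTER1 --wait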
