Balkrishna Pandey

OCP cluster down for several days: how to recover?

This post also applies when you are adding new nodes to a cluster. OpenShift makes adding nodes easy, but you need to approve the certificate signing requests (CSRs) for the process to complete. This guide shows how to get a list of all pending CSRs and how to approve them.

So what happened, and why did I write this post?

I had to restart my OCP cluster after it had been down for a month. Once the machines had been up for a while, all the major components came back, but the nodes still showed a NotReady status:

127.0.0.1 $ oc get nodes
NAME      STATUS     ROLES           AGE   VERSION
master1   NotReady   master,worker   73d   v1.23.5+3afdacb
master2   NotReady   master,worker   73d   v1.23.5+3afdacb
master3   NotReady   master,worker   73d   v1.23.5+3afdacb

I then SSHed into one of the nodes and followed the journal:

journalctl -x -f

It kept printing errors like the following:

Sep 02 15:04:27 master3 hyperkube[2671]: E0902 15:04:27.098146  2671 kubelet.go:2484] "Error getting node" err="node \"master3\" not found"

Sep 02 15:04:27 master3 hyperkube[2671]: I0902 15:04:27.179714  2671 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "master3" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope

Sep 02 15:04:27 master3 hyperkube[2671]: I0902 15:04:27.195732  2671 kubelet_node_status.go:376] "Setting node annotation to enable volume controller attach/detach"
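The `system:anonymous` user in those lines is the useful clue: the API server no longer recognizes the kubelet's credentials. Here is a minimal sketch for spotting that symptom in the journal. The function name `diagnose_kubelet_auth` is my own, and the certificate path in the usage comment is the usual kubelet location, so treat it as an assumption for your installation.

```shell
# Reads kubelet journal lines on stdin and reports whether the API server
# rejected any request as the unauthenticated "system:anonymous" user.
diagnose_kubelet_auth() {
  if grep -q 'User "system:anonymous"'; then
    echo "anonymous requests found: kubelet client certificate is probably expired or missing"
  else
    echo "no anonymous-auth errors found"
  fi
}

# Usage on a node (hypothetical invocation, adjust the time window as needed):
#   journalctl -u kubelet --since "10min ago" --no-pager | diagnose_kubelet_auth
#
# To look at the expiry date directly (path assumed, not taken from this post):
#   sudo openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem
```

Keeping the check as a stdin filter means it also works on a saved log file.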

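List all pending CSRs

Because the cluster had been down for about a month, OCP could not rotate its certificates automatically, and the kubelet client certificates most likely expired in the meantime. That would explain why the kubelet shows up as system:anonymous in the log above. The nodes then submit new CSRs, which I had to approve manually. You can list all pending CSRs as follows: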

oc get csr | grep -i pending

You should see output similar to this:

csr-26qjb   10h   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-2dc5z   8h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-2jx9v   11h   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
...
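When the list is long, a per-signer summary tells you at a glance what kind of requests are waiting. The sketch below assumes the six-column layout shown above (signer name in the third column, condition in the last one); older releases print fewer columns. The function name `count_pending_by_signer` is hypothetical.

```shell
# Reads `oc get csr` table rows on stdin and prints "<count> <signer>"
# for every signer that has requests in the Pending condition.
count_pending_by_signer() {
  awk '$NF == "Pending" { n[$3]++ } END { for (s in n) print n[s], s }' | sort -k2
}

# Usage (requires a logged-in oc session):
#   oc get csr --no-headers | count_pending_by_signer
```

Matching on the last column instead of grepping for "pending" avoids counting rows where the word only appears in a name.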

Now you can approve each CSR individually as follows:

oc adm certificate approve <certname>

There can be many CSRs. In my case there were around 25, so I used the following command to approve them all at once:

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve

Output:

certificatesigningrequest.certificates.k8s.io/csr-26qjb approved
certificatesigningrequest.certificates.k8s.io/csr-2dc5z approved
certificatesigningrequest.certificates.k8s.io/csr-2jx9v approved
...
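One pass is sometimes not enough: once the client CSRs are approved, the nodes usually file a second batch of serving CSRs (signer kubernetes.io/kubelet-serving) that also wait for approval. The sketch below simply repeats the one-liner above until nothing is pending. The function name `approve_pending_csrs`, the round limit, and the `CSR_WAIT_SECONDS` variable are my own illustrative choices, not part of oc.

```shell
# Approves all pending CSRs, waits, and checks again, for up to $1 rounds (default 5).
# Returns 0 once no pending CSRs are left, 1 if the round limit is reached.
approve_pending_csrs() {
  csr_max_rounds="${1:-5}"
  csr_round=1
  while [ "$csr_round" -le "$csr_max_rounds" ]; do
    # Same template as the one-liner: names of CSRs with an empty status.
    csr_pending="$(oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}')"
    if [ -z "$csr_pending" ]; then
      echo "no pending CSRs after $((csr_round - 1)) round(s)"
      return 0
    fi
    # Unquoted on purpose: each name becomes a separate argument.
    oc adm certificate approve $csr_pending
    csr_round=$((csr_round + 1))
    # Give the nodes time to submit follow-up requests.
    sleep "${CSR_WAIT_SECONDS:-30}"
  done
  echo "CSRs still pending after $csr_max_rounds rounds" >&2
  return 1
}

# Usage (requires a logged-in oc session with cluster-admin rights):
#   approve_pending_csrs 5
```

The round limit keeps the loop from running forever if a node keeps filing requests that never get issued.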

Now confirm that all machines are part of the cluster and in a Ready state:

oc get nodes

Output:

NAME      STATUS   ROLES           AGE   VERSION
master1   Ready    master,worker   73d   v1.23.5+3afdacb
master2   Ready    master,worker   73d   v1.23.5+3afdacb
master3   Ready    master,worker   73d   v1.23.5+3afdacb
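On a three-node cluster you can read that table by eye, but on a bigger one a small filter helps. This is a rough sketch that prints only the nodes whose STATUS column does not start with Ready. It assumes the default `oc get nodes` column order, and the function name `not_ready_nodes` is hypothetical.

```shell
# Reads `oc get nodes` output (with header) on stdin and prints the names
# of nodes whose status does not start with "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get nodes | not_ready_nodes
```

Empty output means every node reports Ready. A status such as Ready,SchedulingDisabled also counts as ready here.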

All nodes are healthy. So that's how you list all pending CSRs and approve them. In my case, the cluster had been down for a month and couldn't rotate its certificates automatically, so I had to approve all pending CSRs manually.

I hope you find this blog post helpful. Thanks for reading!

Happy learning 😃
