We recently noticed strange issues with one of the OpenShift storage components: several pods in the openshift-storage namespace were having problems.
oc get pods
Output:
...
csi-rbdplugin-provisioner-5466484bd5-47mpr 5/6 CrashLoopBackOff 6 8m31s
csi-rbdplugin-provisioner-5466484bd5-vvlds 5/6 CrashLoopBackOff 10 30m
noobaa-db-pg-0 0/1 Init:0/2 0 21m
ocs-operator-7b8dd5c85f-krf8r 0/1 Running 0 15m
...
So we began checking the logs of each pod, one by one. The ocs-operator health check was failing, but there wasn't much information in its log to go on.
oc logs ocs-operator-7b8dd5c85f-krf8r -f
Output:
...
{"level":"info","ts":1663854793.1919212,"logger":"controller-runtime.healthz","msg":"healthz check failed","statuses":[{}]}
{"level":"info","ts":1663854803.1925268,"logger":"controller-runtime.healthz","msg":"healthz check failed","statuses":[{}]}
So we looked at another pod in the CrashLoopBackOff state. This pod has several containers; the following error logs were found in one of them, named csi-provisioner.
oc logs csi-rbdplugin-provisioner-5466484bd5-47mpr
Output:
error: a container name must be specified for pod csi-rbdplugin-provisioner-5466484bd5-47mpr, choose one of: [csi-provisioner csi-resizer csi-attacher csi-snapshotter csi-rbdplugin liveness-prometheus]
oc logs csi-rbdplugin-provisioner-5466484bd5-47mpr csi-provisioner
Output:
I0922 13:57:20.393426 1 feature_gate.go:243] feature gates: &{map[]}
I0922 13:57:20.393600 1 csi-provisioner.go:138] Version: v4.8.0-202206281335.p0.g3ea7e68.assembly.stream-0-ga5337a9-dirty
I0922 13:57:20.393622 1 csi-provisioner.go:161] Building kube configs for running in cluster...
I0922 13:57:20.403527 1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
W0922 13:57:30.404071 1 connection.go:172] Still connecting to unix:///csi/csi-provisioner.sock
W0922 13:57:40.404390 1 connection.go:172] Still connecting to unix:///csi/csi-provisioner.sock
W0922 13:57:50.404020 1 connection.go:172] Still connecting to unix:///csi/csi-provisioner.sock
W0922 13:58:00.404168 1 connection.go:172] Still connecting to unix:///csi/csi-provisioner.sock
Then we examined the log of another container, csi-rbdplugin, which is the container that actually serves the CSI socket the csi-provisioner sidecar was waiting on. The application could not create a file and reported the error: failed to write ceph configuration file (open /etc/ceph/ceph.conf: permission denied).
oc logs csi-rbdplugin-provisioner-5466484bd5-47mpr csi-rbdplugin
Output:
I0922 14:00:31.206741 1 cephcsi.go:131] Driver version: release-4.8 and Git version: ad563f5bebb2efd5f64dee472e441bbe918fa101
I0922 14:00:31.206911 1 cephcsi.go:149] Initial PID limit is set to 1024
E0922 14:00:31.206946 1 cephcsi.go:153] Failed to set new PID limit to -1: open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod1c017f07_348a_493c_bd88_184670c3c35c.slice/crio-cd12a9e76956fc3e3ae7054d781f09b51dc5872047ba4f7b761cedcc174d91a3.scope/pids.max: permission denied
I0922 14:00:31.206963 1 cephcsi.go:176] Starting driver type: rbd with name: openshift-storage.rbd.csi.ceph.com
F0922 14:00:31.206993 1 driver.go:107] failed to write ceph configuration file (open /etc/ceph/ceph.conf: permission denied)
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc0001c0001, 0xc00035c680, 0x83, 0xc7)
/remote-source/app/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x2b3c140, 0xc000000003, 0x0, 0x0, 0xc000415730, 0x2170e40, 0x9, 0x6b, 0x41a900)
/remote-source/app/vendor/k8s.io/klog/v2/klog.go:975 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x2b3c140, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000531350, 0x1, 0x1)
/remote-source/app/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
k8s.io/klog/v2.FatalDepth(...)
/remote-source/app/vendor/k8s.io/klog/v2/klog.go:1488
github.com/ceph/ceph-csi/internal/util.FatalLogMsg(0x1b5df58, 0x2c, 0xc0005b1d10, 0x1, 0x1)
/remote-source/app/internal/util/log.go:58 +0x118
github.com/ceph/ceph-csi/internal/rbd.(*Driver).Run(0xc0005b1f18, 0x2b3c040)
/remote-source/app/internal/rbd/driver.go:107 +0xa5
main.main()
/remote-source/app/cmd/cephcsi.go:182 +0x345
goroutine 19 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x2b3c140)
/remote-source/app/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
created by k8s.io/klog/v2.init.0
/remote-source/app/vendor/k8s.io/klog/v2/klog.go:417 +0xdf
goroutine 131 [chan receive]:
k8s.io/klog.(*loggingT).flushDaemon(0x2b3bf60)
/remote-source/app/vendor/k8s.io/klog/klog.go:1010 +0x8b
created by k8s.io/klog.init.0
/remote-source/app/vendor/k8s.io/klog/klog.go:411 +0xd8
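Both of these failures are permission errors. To see what security context the csi-rbdplugin container actually ended up running with, a query like the following (just a sketch, using the pod name from above) can help:
oc get pod csi-rbdplugin-provisioner-5466484bd5-47mpr -n openshift-storage \
  -o jsonpath='{.spec.containers[?(@.name=="csi-rbdplugin")].securityContext}'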
We also found a similar issue reported here in the Red Hat forum, where someone suggested setting privileged: true on the csi-rbdplugin container as a workaround, which fixed the issue immediately.
oc edit deployment csi-rbdplugin-provisioner -n openshift-storage
name: csi-rbdplugin
resources: {}
securityContext:
  privileged: true
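If you would rather not go through the interactive editor, the same workaround can be applied with a strategic merge patch (a sketch only; the operator may later reconcile the deployment and revert this manual change):
oc patch deployment csi-rbdplugin-provisioner -n openshift-storage --type=strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"csi-rbdplugin","securityContext":{"privileged":true}}]}}}}'
Keep in mind that privileged: true is a broad grant and only papers over the underlying permission problem.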
However, it appears the root cause is related to the SCC (security context constraints) profile: the pods' SCC, which is supposed to be rook-ceph-csi, had somehow been changed to a different SCC, ncom-common. Because multiple people are currently accessing this cluster, I could not determine who made these changes.
openshift.io/scc: ncom-common
creationTimestamp: "2022-09-22T14:09:26Z"
generateName: csi-rbdplugin-provisioner-6b4f9497b8-
labels:
  app: csi-rbdplugin-provisioner
  contains: csi-rbdplugin-metrics
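To confirm which SCC each of these pods was actually admitted with, the openshift.io/scc annotation can be read directly; the selector below uses the app label shown above (a sketch, adjust as needed):
oc get pods -n openshift-storage -l app=csi-rbdplugin-provisioner \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.openshift\.io/scc}{"\n"}{end}'
oc get scc
The first command prints each provisioner pod together with the SCC it was admitted under, and oc get scc lists the SCCs available on the cluster, where both rook-ceph-csi and ncom-common should appear.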