Goglides Dev 🌱

Balkrishna Pandey


OpenShift Data Foundation (ODF) storage (Ceph) "FULL, paused modify": how to fix it?

Problem statement

Recently, I encountered the log message "FULL, paused modify" in the RGW (RADOS Gateway) component of a Rook-Ceph cluster running in an OpenShift cluster. This indicated that the storage cluster was in a "FULL" state, so the RGW could not perform any further write (modify) operations because all available storage capacity had been used. This was likely due to either insufficient storage capacity in the cluster or a problem with the storage devices.

oc logs -l app=rook-ceph-rgw -n openshift-storage -c rgw

Output:

debug 2023-01-23 20:13:27.164 7ff7dcf03280  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2023-01-23 20:13:27.164 7ff7dcf03280  0 ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable), process radosgw, pid 87
debug 2023-01-23 20:13:27.211 7ff7dcf03280  0 client.11821623.objecter  FULL, paused modify 0x55a8702db800 tid 0

Debugging steps

To debug this, let's enable the debug pod that contains the Ceph toolbox as follows:

oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

Once you have enabled the Ceph tools pod, you can access it by running the following commands:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
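When you are finished debugging, you can disable the toolbox again by applying the same patch with the value set back to false:

oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": false }]'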

Inside the container, you will be able to run various Ceph commands such as ceph status, ceph health, ceph df, and ceph osd tree, which give you more visibility into the cluster and can help you identify the cause of the "FULL" state.
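As a rough starting point, a first triage pass could look like this (these are standard Ceph CLI commands; the exact output depends on your cluster):

# Overall cluster state, including health flags such as OSD_FULL or POOL_FULL
ceph status
# Raw and per-pool capacity usage
ceph df
# OSD layout across hosts and their up/down status
ceph osd tree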

ceph health detail

Output:

HEALTH_ERR 17 backfillfull osd(s); 1 full osd(s); 10 pool(s) full

The output indicates that the cluster is in a HEALTH_ERR state, with 17 OSDs that are backfillfull, 1 OSD that is full, and 10 pools that are full.

  • A backfillfull OSD is an OSD that has crossed the backfillfull threshold; it will no longer accept backfill (recovery and rebalancing) data, although it can still serve regular client writes until it reaches the full threshold.
  • A full OSD is an OSD that has reached its full capacity and can't accept any more data.
  • A full pool is a pool that has reached its full capacity and can't accept any more data.

This means that the cluster is running out of storage space, and all available storage is being used. This could be caused by a lack of capacity, a problem with the storage devices, or a problem with the way data is being distributed across the cluster.
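To see which OSDs are over the thresholds, ceph osd df (or ceph osd df tree) prints per-OSD utilization, which helps you confirm whether the problem is cluster-wide or limited to a few unbalanced OSDs:

# Per-OSD size, used space, and %USE, grouped by the CRUSH tree
ceph osd df tree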

To resolve this issue, you can try the following:

  1. Check the storage capacity of the cluster using the ceph df command and ensure that there is enough available storage space.
  2. Check the health of the storage devices using the ceph health command and ensure that all devices are in a healthy state.
  3. Remove or relocate unnecessary data from the cluster to free up space.
  4. Increase the size of the storage devices in the cluster.
  5. Monitor the cluster, identify the pools, buckets, or objects that are taking up the most space, and delete them or move them to another cluster (see the example after this list).
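For steps 1 and 5, the following commands can help pinpoint where the space is going. rados df reports per-pool usage; radosgw-admin bucket stats is RGW-specific and is typically run from the toolbox pod, which has the required admin keyring:

# Per-pool object counts and space usage as seen by RADOS
rados df
# Size and object counts for every RGW bucket
radosgw-admin bucket stats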

It's important to note that modifying the storage configuration of an active Ceph cluster can be risky and should be done with caution. It's recommended to take a backup of the data before proceeding and to consult the Ceph documentation and community for guidance.

Temporary workaround

As a temporary workaround, you can also increase the full ratio. You can check the current full ratio value in a Ceph cluster by using the ceph osd dump command. This command provides a detailed report of the current state of the cluster, including the current full ratio value.

You can also check the full ratio value specifically by using the ceph osd dump --format json-pretty command, which produces JSON output, and then searching for the "full_ratio" key.

For example, you can use the following command:

ceph osd dump --format json-pretty | grep full_ratio

You will see an output like this:

    "full_ratio": 0.90,

This shows that the current full ratio value is set to 0.90, which means that when an OSD reaches 90% of its capacity, it is considered "full" and will not accept any more data.

To increase this value, you can run the following command:

ceph osd set-full-ratio 0.95
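Since this cluster also reports backfillfull OSDs, you may want to raise the backfillfull threshold at the same time so that recovery and rebalancing can resume. The 0.92 value below is only an example; keep it below the full ratio you set above:

ceph osd set-backfillfull-ratio 0.92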

It's important to keep in mind that changing the full ratio will not fix the "FULL" state; it only changes the threshold at which the cluster is considered "FULL." It's also important to note that if all OSDs in the cluster are full, Ceph will not be able to accept new data, and client operations will fail. So do not forget to increase the size of the storage devices in the cluster, add more storage devices, or remove or relocate unnecessary data.
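In ODF, the usual long-term fix is to add capacity by scaling the StorageCluster resource (the "Add Capacity" action in the OpenShift console does the same thing). As a rough sketch, assuming the default StorageCluster name ocs-storagecluster and a single storage device set whose count you want to raise from 1 to 2:

# Each additional count adds another set of OSDs of the configured size.
# Verify the StorageCluster name, device-set index, and current count before patching.
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 2 }]'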
