Problem statement
Recently, I encountered the log message "FULL, paused modify" in the RGW (RADOS Gateway) component of a Rook-Ceph cluster running on an OpenShift cluster. This indicated that the RGW was in a "FULL" state and could not perform any further modifications to the storage cluster, because all available storage capacity had been used. This was likely due to either insufficient storage capacity in the cluster or a problem with the storage devices.
oc logs -l app=rook-ceph-rgw -n openshift-storage -c rgw
Output:
debug 2023-01-23 20:13:27.164 7ff7dcf03280 0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2023-01-23 20:13:27.164 7ff7dcf03280 0 ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable), process radosgw, pid 87
debug 2023-01-23 20:13:27.211 7ff7dcf03280 0 client.11821623.objecter FULL, paused modify 0x55a8702db800 tid 0
Debugging steps
To debug this, let's enable the debug pod that contains the Ceph toolbox, as follows:
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
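Before connecting, you can confirm that the toolbox pod is up; the app=rook-ceph-tools label is the same one used in the commands below:
oc get pods -n openshift-storage -l app=rook-ceph-tools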
Once the Ceph tools pod is running, you can access it with:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Inside the container you will be able to run various Ceph commands such as ceph status, ceph health, ceph df, ceph osd tree, etc., which will give you more visibility into the cluster and can help you identify the cause of the "FULL" state.
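If you only need a quick check, you can also pass the command straight to oc rsh instead of opening an interactive shell, for example:
oc rsh -n openshift-storage $TOOLS_POD ceph status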
ceph health detail
Output:
HEALTH_ERR 17 backfillfull osd(s); 1 full osd(s); 10 pool(s) full
The output indicates that the cluster is in a HEALTH_ERR state, with 17 OSDs that are backfillfull, 1 OSD that is full, and 10 pools that are full.
- A backfillfull OSD is an OSD that has crossed the backfillfull threshold, so Ceph will no longer backfill (rebalance) data onto it. This can happen when an OSD is trying to catch up with the rest of the cluster.
- A full OSD is an OSD that has reached its full capacity and can't accept any more data.
- A full pool is a pool that has reached its full capacity and can't accept any more data.
This means that the cluster is running out of storage space, and all available storage is being used. This could be caused by a lack of capacity, a problem with the storage devices, or a problem with the way data is being distributed across the cluster.
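To see which OSDs are actually full and how evenly data is spread across them, you can check per-OSD utilization from the toolbox pod; ceph osd df tree shows usage for every OSD laid out along the CRUSH hierarchy:
ceph osd df tree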
To resolve this issue, you can try the following:
- Check the storage capacity of the cluster using the ceph df command and ensure that there is enough available storage space.
- Check the health of the storage devices using the ceph health command and ensure that all devices are in a healthy state.
- Remove or relocate unnecessary data from the cluster to free up space.
- Increase the size of the storage devices in the cluster.
- Monitor the cluster, identify the objects that are taking up the most space, and delete them or move them to another cluster (see the commands after this list).
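To find out where the space is going, per-pool usage can be inspected from the toolbox pod with read-only commands; ceph df detail and rados df both report per-pool usage and object counts:
ceph df detail
rados df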
It's important to note that modifying an active Ceph cluster's storage configuration can be risky and should be done with caution. It's recommended to take a backup of the data before proceeding, and to consult the Ceph documentation and community for guidance.
Temporary workaround
As a temporary workaround, you can also increase the full ratio. You can check the current full ratio value in a Ceph cluster with the ceph osd dump command, which provides a detailed report of the current state of the cluster, including the current full ratio value.
You can also check the full ratio value specifically by using the ceph osd dump --format json-pretty command, which produces JSON output; search it for the "full_ratio" key to find the current value.
For example, you can use following command:
ceph osd dump --format json-pretty | grep full_ratio
You will see an output like this:
"full_ratio": 0.90,
This shows that the current full ratio value is set to 0.90, which means that when an OSD reaches 90% of its capacity, it is considered "full" and will not accept any more data.
To increase this value, you can run the following command:
ceph osd set-full-ratio 0.95
It's important to keep in mind that changing the full ratio will not fix the "FULL" state; it will just change the threshold at which the cluster is considered "FULL." It's also important to note that if all OSDs in the cluster are full, Ceph will not be able to accept new data, and client operations will fail. So do not forget to increase the size of the storage devices in the cluster, add more storage devices, remove or relocate unnecessary data, etc.
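Since this cluster also reports backfillfull OSDs, the backfillfull threshold can be raised in the same way and the change verified with ceph osd dump; the value below is only an example and should stay below the full ratio:
ceph osd set-backfillfull-ratio 0.92
ceph osd dump | grep ratio
Once capacity has been added or data has been removed, remember to restore these ratios to their original values.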