Authors: Patrick Ohly (Intel)
The v1.24 release of Kubernetes brings storage capacitytracking as a generally available feature.
Problems we have solved
As explained in more detail in the previous blog post about this feature, storage capacity tracking allows a CSI driver to publish information about remaining capacity. The kube-scheduler then uses that information to pick suitable nodes for a Pod when that Pod has volumes that still need to be provisioned.
Without this information, a Pod may get stuck without ever being scheduled onto a suitable node because kube-scheduler has to choose blindly and always ends up picking a node for which the volume cannot be provisioned because the underlying storage system managed by the CSI driver does not have sufficient capacity left.
Because CSI drivers publish storage capacity information that gets used at a later time when it might not be up-to-date anymore, it can still happen that a node is picked that doesn't work out after all. Volume provisioning recovers from that by informing the scheduler that it needs to try again with a different node.
Load teststhat were done again for promotion to GA confirmed that all storage in a cluster can be consumed by Pods with storage capacity tracking whereas Pods got stuck without it.
Problems we have not solved
Recovery from a failed volume provisioning attempt has one known limitation: if a Pod uses two volumes and only one of them could be provisioned, then all future scheduling decisions are limited by the already provisioned volume. If that volume is local to a node and the other volume cannot be provisioned there, the Pod is stuck. This problem pre-dates storage capacity tracking and while the additional information makes it less likely to occur, it cannot be avoided in all cases, except of course by only using one volume per Pod.
An idea for solving this was proposed in a KEP draft: volumes that were provisioned and haven't been used yet cannot have any valuable data and therefore could be freed and provisioned again elsewhere. SIG Storage is looking for interested developers who want to continue working on this.
Also not solved is support in Cluster Autoscaler for Pods with volumes. For CSI drivers with storage capacity tracking, a prototype was developed and discussed in a PR. It was meant to work with arbitrary CSI drivers, but that flexibility made it hard to configure and slowed down scale up operations: because autoscaler was unable to simulate volume provisioning, it only scaled the cluster by one node at a time, which was seen as insufficient.
Therefore that PR was not merged and a different approach with tighter coupling between autoscaler and CSI driver will be needed. For this a better understanding is needed about which local storage CSI drivers are used in combination with cluster autoscaling. Should this lead to a new KEP, then users will have to try out an implementation in practice before it can move to beta or GA. So please reach out to SIG Storage if you have an interest in this topic.
Thanks a lot to the members of the community who have contributed to this feature or given feedback including members of SIG Scheduling,SIG Autoscaling, and of course SIG Storage!