Author: Elana Hashman (Red Hat)
The 1.22 release introduced alpha support for configuring swap memory usage for Kubernetes workloads on a per-node basis.
In prior releases, Kubernetes did not support the use of swap memory on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes' earlier design, swap support was considered out of scope, and a kubelet would by default fail to start if swap was detected on a node.
However, there are a number of use cases that would benefit from Kubernetes nodes supporting swap, including improved node stability, better support for applications with high memory overhead but smaller working sets, the use of memory-constrained devices, and memory flexibility.
Hence, over the past two releases, SIG Node has been working to gather appropriate use cases and feedback, and propose a design for adding swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap. The alpha graduation of swap memory support for nodes is our first milestone towards this goal!
There are a number of possible ways that one could envision swap use on a node. To keep the scope manageable for this initial implementation, when swap is already provisioned and available on a node, we have proposed that the kubelet should be configurable such that:
- It can start with swap on.
- It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.
- You can configure the kubelet to specify swap utilization for the entire node.
Swap configuration on a node is exposed to a cluster admin via the memorySwap field in the KubeletConfiguration. As a cluster administrator, you can specify the node's behaviour in the presence of swap memory by setting memorySwap.swapBehavior.
This is possible through the addition of a memory_swap_limit_in_bytes field to the container runtime interface (CRI). The kubelet's config will control how much swap memory the kubelet instructs the container runtime to allocate to each container via the CRI. The container runtime will then write the swap settings to the container-level cgroup.
On a node where swap memory is already provisioned, Kubernetes use of swap can be enabled by enabling the NodeSwap feature gate on the kubelet, and disabling the failSwapOn configuration setting or the --fail-swap-on command line flag.
You can also optionally configure memorySwap.swapBehavior in order to specify how a node will use swap memory. For example,

memorySwap:
  swapBehavior: LimitedSwap
The available configuration options for swapBehavior are:
- LimitedSwap (default): Kubernetes workloads are limited in how much swap they can use. Workloads on the node not managed by Kubernetes can still swap.
- UnlimitedSwap: Kubernetes workloads can use as much swap memory as they request, up to the system limit.
If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the LimitedSwap setting.
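The three settings described above (the NodeSwap feature gate, failSwapOn, and memorySwap.swapBehavior) have to agree before workloads can use swap. The following is a minimal sketch of a preflight check, assuming the kubelet configuration has already been loaded into a Python dict; the function name swap_config_problems is hypothetical and is not part of any Kubernetes tooling.

```python
VALID_SWAP_BEHAVIORS = {"LimitedSwap", "UnlimitedSwap"}


def swap_config_problems(config):
    """Return a list of reasons why this kubelet config would not enable swap use."""
    problems = []
    # The NodeSwap feature gate must be switched on explicitly.
    if not config.get("featureGates", {}).get("NodeSwap", False):
        problems.append("NodeSwap feature gate is not enabled")
    # By default the kubelet fails to start when swap is detected,
    # so failSwapOn has to be set to false.
    if config.get("failSwapOn", True):
        problems.append("failSwapOn must be set to false")
    # An unspecified swapBehavior is treated like LimitedSwap.
    behavior = config.get("memorySwap", {}).get("swapBehavior", "LimitedSwap")
    if behavior not in VALID_SWAP_BEHAVIORS:
        problems.append(f"unknown swapBehavior: {behavior!r}")
    return problems


# Dict form of a KubeletConfiguration file with swap enabled (illustrative values).
kubelet_config = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
    "featureGates": {"NodeSwap": True},
    "failSwapOn": False,
    "memorySwap": {"swapBehavior": "LimitedSwap"},
}

print(swap_config_problems(kubelet_config))  # prints []
```

The dict mirrors what the same settings would look like in a KubeletConfiguration file, so it can also serve as a reference for which fields sit at the top level and which are nested.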
The behaviour of the LimitedSwap setting depends on whether the node is running with v1 or v2 of control groups (also known as "cgroups"):
- cgroups v1: Kubernetes workloads can use any combination of memory and swap, up to the pod's memory limit, if set.
- cgroups v2: Kubernetes workloads cannot use swap memory.
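The rules above can be summarised as a small decision function. This is only a model of the behaviour as stated in this post, not the kubelet's actual code; the function name max_swap_bytes and the use of None to mean "no bound from this setting" are assumptions made for illustration.

```python
from typing import Optional


def max_swap_bytes(swap_behavior: str, cgroup_version: int,
                   memory_limit_bytes: Optional[int]) -> Optional[int]:
    """Upper bound on swap for a Kubernetes workload.

    Returns None when this setting imposes no bound (the system limit
    still applies).
    """
    if swap_behavior == "UnlimitedSwap":
        # Workloads can use as much swap as they request, up to the system limit.
        return None
    if swap_behavior != "LimitedSwap":
        raise ValueError(f"unknown swapBehavior: {swap_behavior!r}")
    if cgroup_version == 2:
        # cgroups v2: Kubernetes workloads cannot use swap memory.
        return 0
    if cgroup_version == 1:
        # cgroups v1: memory plus swap together may reach the memory limit,
        # so swap alone can never exceed it. No limit set means no bound.
        return memory_limit_bytes
    raise ValueError(f"unsupported cgroup version: {cgroup_version}")


limit = 512 * 1024 ** 2  # a 512 MiB pod memory limit
print(max_swap_bytes("LimitedSwap", 1, limit))    # prints 536870912
print(max_swap_bytes("LimitedSwap", 2, limit))    # prints 0
print(max_swap_bytes("UnlimitedSwap", 2, limit))  # prints None
```

Note that under cgroups v1 the bound applies to memory and swap combined, so the value returned is the most swap a workload could use if it held nothing in regular memory.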
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe.
Hence, we do not recommend the use of swap for certain performance-constrained workloads or environments. Cluster administrators and developers should benchmark their nodes and applications before using swap in production scenarios, and we need your help with that!
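As a starting point for that kind of benchmarking, it helps to record how much swap a node is actually using while a workload runs. The sketch below parses the SwapTotal and SwapFree fields that Linux reports in /proc/meminfo; the helper name swap_usage_kib and the sample values are hypothetical, and a real benchmark would also need latency and throughput measurements.

```python
def swap_usage_kib(meminfo_text):
    """Return (total, used) swap in KiB from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            fields[name.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


# Sample input with made-up values, in the format used by /proc/meminfo.
SAMPLE = """MemTotal:       16309508 kB
SwapTotal:       2097148 kB
SwapFree:        1572860 kB
"""

total, used = swap_usage_kib(SAMPLE)
print(total, used)  # prints 2097148 524288
```

On a Linux node, the same function can be fed the contents of /proc/meminfo, sampled at intervals during a test run.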
The Kubernetes 1.22 release introduces alpha support for swap memory on nodes, and we will continue to work towards beta graduation in the 1.23 release. This will include:
- Adding support for controlling swap consumption at the Pod level via cgroups.
- This will include the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
- Determining a set of metrics for node QoS in order to evaluate the performance and stability of nodes with and without swap enabled.
- Collecting feedback from test user cases.
- We will consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
You can review the current documentation on the Kubernetes website.
Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. Feel free to reach out to me, Elana Hashman (@ehashman on Slack and GitHub), if you'd like to help.