Author: Itamar Holder (Red Hat)
The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. Now, in release 1.28, support for swap on Linux nodes has graduated to Beta, along with many new improvements.
Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node.
In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This represented a significant advancement, providing Linux users with the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and had several issues, including inadequate support for cgroup v2, insufficient metrics and summary API statistics, inadequate testing, and more.
Swap in Kubernetes has numerous use cases for a wide range of users. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes for beta. Compared to the alpha, the kubelet's support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. This graduation to beta represents a crucial step towards achieving the goal of fully supporting swap in Kubernetes.
To use swap memory on a node where it has already been provisioned, enable the NodeSwap feature gate on the kubelet. Additionally, you must set the failSwapOn configuration setting to false, or disable the deprecated --fail-swap-on command line flag.
You can configure the memorySwap.swapBehavior option to define how a node uses swap memory. For instance:
# this fragment goes into the kubelet's configuration file
memorySwap:
  swapBehavior: UnlimitedSwap
The available configuration options for swapBehavior are:
UnlimitedSwap (default): Kubernetes workloads can use as much swap memory as they request, up to the system limit.
LimitedSwap: The utilization of swap memory by Kubernetes workloads is subject to limitations. Only Pods of Burstable QoS are permitted to employ swap.
If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the UnlimitedSwap setting.
NodeSwap is supported for cgroup v2 only. For Kubernetes v1.28, using swap along with cgroup v1 is no longer supported.
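Since swap is supported for cgroup v2 only, it can help to confirm which cgroup version a node uses before enabling the feature gate. The following is a minimal sketch, not part of the kubelet. It assumes the conventional mount point /sys/fs/cgroup and relies on the fact that a unified (v2) hierarchy exposes a cgroup.controllers file at its root. The helper name is_cgroup_v2 is hypothetical.

```python
import os


def is_cgroup_v2(root="/sys/fs/cgroup"):
    """Return True if `root` looks like a cgroup v2 (unified) hierarchy.

    A cgroup v2 mount has a `cgroup.controllers` file at its root;
    a cgroup v1 mount point does not.
    """
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


if __name__ == "__main__":
    if is_cgroup_v2():
        print("cgroup v2 detected")
    else:
        print("cgroup v2 not detected; swap is not supported here")
```

If your distribution mounts cgroups elsewhere, pass that path as the root argument.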
This demo requires the kubeadm tool to be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, cluster creation may proceed. If swap is not enabled, please refer to the provided instructions for enabling swap.
I'll demonstrate creating 4GiB of unencrypted swap.
dd if=/dev/zero of=/swapfile bs=128M count=32  # 32 x 128MiB = 4GiB
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile  # enable the swap file only until this node is rebooted
swapon -s         # show a summary of active swap
To start the swap file at boot time, add a line like /swapfile swap swap defaults 0 0 to the /etc/fstab file.
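To confirm that the swap file is active, you can inspect /proc/swaps, which lists each swap area with its size and usage in KiB. Below is a small illustrative parser. The function name parse_proc_swaps and the sample text are assumptions made for this sketch; on a real node you would read the file itself.

```python
# Illustrative sample of /proc/swaps contents; real values will differ.
SAMPLE = """Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority
/swapfile                               file\t\t4194300\t\t0\t\t-2
"""


def parse_proc_swaps(text):
    """Parse /proc/swaps text into a list of dicts (sizes in KiB)."""
    entries = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or malformed lines
        entries.append({
            "filename": fields[0],
            "type": fields[1],
            "size_kib": int(fields[2]),
            "used_kib": int(fields[3]),
            "priority": int(fields[4]),
        })
    return entries


if __name__ == "__main__":
    for entry in parse_proc_swaps(SAMPLE):
        print(entry["filename"], entry["size_kib"], "KiB")
```

An empty result means no swap area is currently enabled.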
To make things clearer, here is an example kubeadm configuration file, kubeadm-config.yaml, for the swap-enabled cluster.
---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap
Then create a single-node cluster using kubeadm init --config kubeadm-config.yaml. During init, a warning is printed that swap is enabled on the node, which is relevant in case the kubelet's failSwapOn is set to true. We plan to remove this warning in a future release.
The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented Swap in Beta with automatic configuration of limitations.
With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort and Guaranteed QoS Pods) are prohibited from utilizing swap memory.
BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack information regarding their memory usage, making it difficult to determine a safe allocation of swap memory. Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the precise allocation of resources specified by the workload, with memory being immediately available. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect.
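The eligibility rule can be sketched in a few lines of Python. This is a simplified illustration, not the kubelet's code. It considers only CPU and memory, compares quantities as literal strings (so "1Gi" and "1024Mi" count as different), and treats a missing request as defaulting to the limit. The names qos_class and may_use_swap are hypothetical.

```python
def qos_class(containers):
    """Simplified QoS classification over CPU and memory only.

    Each container is a dict with optional "requests" and "limits" dicts,
    e.g. {"requests": {"memory": "1Gi"}, "limits": {"memory": "2Gi"}}.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit, if one is given.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


def may_use_swap(containers, swap_behavior="LimitedSwap"):
    """Under LimitedSwap only Burstable Pods are eligible for swap."""
    if swap_behavior == "UnlimitedSwap":
        return True
    return qos_class(containers) == "Burstable"


if __name__ == "__main__":
    burstable = [{"requests": {"memory": "1Gi"}, "limits": {"memory": "2Gi"}}]
    print(qos_class(burstable), may_use_swap(burstable))
```

Note that eligibility at the Pod level is only the first step; the per-container swap limit is computed separately.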
Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
nodeTotalMemory: The total amount of physical memory available on the node.
totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
containerMemoryRequest: The container's memory request.
Swap limitation is configured as:
(containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable
In other words, a container's swap limit is proportional to its memory request as a fraction of the node's total physical memory, scaled by the total amount of swap memory on the node that is available for use by Pods.
It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.
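As a worked example of the formula and the opt-out rule, here is a short Python sketch. It is an illustration under stated assumptions rather than the kubelet's implementation: all values are in bytes, the result is rounded down, and the function name container_swap_limit is hypothetical.

```python
GIB = 1024 ** 3


def container_swap_limit(container_memory_request, node_total_memory,
                         total_pods_swap_available,
                         container_memory_limit=None):
    """Swap limit in bytes for a container in a Burstable Pod (LimitedSwap).

    Implements (containerMemoryRequest / nodeTotalMemory) x
    totalPodsSwapAvailable, rounded down (rounding is an assumption).
    """
    if node_total_memory <= 0:
        raise ValueError("node_total_memory must be positive")
    # Opt-out: a memory request equal to the memory limit means no swap.
    if (container_memory_limit is not None
            and container_memory_request == container_memory_limit):
        return 0
    return (container_memory_request * total_pods_swap_available
            // node_total_memory)


if __name__ == "__main__":
    # 1 GiB request on a 16 GiB node with 4 GiB of swap for Pods.
    limit = container_swap_limit(1 * GIB, 16 * GIB, 4 * GIB)
    print(limit)  # 268435456 bytes, i.e. 256 MiB
```

In this example the container requests 1/16 of the node's memory, so it may use 1/16 of the 4 GiB of swap, which is 256 MiB.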
There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, SIG Node has proposed that the kubelet should be able to be configured so that:
- It can start with swap on.
- It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.
Swap configuration on a node is exposed to a cluster admin via the memorySwap field in the KubeletConfiguration. As a cluster administrator, you can specify the node's behaviour in the presence of swap memory by setting memorySwap.swapBehavior.
The kubelet employs the CRI (Container Runtime Interface) API to direct the container runtime to configure specific cgroup v2 parameters (such as memory.swap.max) in a manner that will enable the desired swap configuration for a container. The container runtime is then responsible for writing these settings to the container-level cgroup.
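To check what was actually written for a container, you can read its memory.swap.max file. In cgroup v2 this file holds either the literal string max (no limit) or a byte count, where 0 means the cgroup cannot use swap. The sketch below interprets that value. The helper names are hypothetical, and the location of a container's cgroup directory depends on the runtime and cgroup driver, so it is left to the reader.

```python
def parse_swap_max(text):
    """Interpret the contents of a cgroup v2 `memory.swap.max` file.

    Returns None when the value is "max" (no swap limit), otherwise
    the limit in bytes as an int.
    """
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def describe_swap_max(text):
    """Return a short human-readable description of the limit."""
    limit = parse_swap_max(text)
    if limit is None:
        return "unlimited swap"
    if limit == 0:
        return "swap disabled"
    return f"swap limited to {limit} bytes"


if __name__ == "__main__":
    for raw in ("max\n", "0\n", "268435456\n"):
        print(describe_swap_max(raw))
```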
A notable deficiency in the Alpha version was the inability to monitor and introspect swap usage. This issue has been addressed in the Beta version introduced in Kubernetes 1.28, which now provides the capability to monitor swap usage through several different methods.
The beta version of the kubelet now collects node-level metric statistics, which can be accessed at the /stats/summary kubelet HTTP endpoint. This allows clients who can directly interrogate the kubelet to monitor swap usage and remaining swap memory when using LimitedSwap. Additionally, a machine_swap_bytes metric has been added to cadvisor to show the total physical swap capacity of the machine.
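As an illustration of consuming these statistics, the sketch below extracts node-level swap figures from a /stats/summary response body. The JSON field names used here (node.swap.swapUsageBytes and node.swap.swapAvailableBytes) are assumptions about the response shape and should be checked against your kubelet version. The function name and sample payload are hypothetical. One common way to fetch the body is kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary".

```python
import json

# Hypothetical, trimmed-down response body used only for illustration.
SAMPLE_SUMMARY = """
{
  "node": {
    "nodeName": "node-1",
    "swap": {
      "time": "2023-08-24T10:00:00Z",
      "swapUsageBytes": 1048576,
      "swapAvailableBytes": 4293918720
    }
  }
}
"""


def node_swap_stats(summary_json):
    """Return node swap usage from a summary body, or None if absent.

    Field names are assumptions; adjust them to match the actual response.
    """
    summary = json.loads(summary_json)
    swap = summary.get("node", {}).get("swap")
    if swap is None:
        return None
    return {
        "usage_bytes": swap.get("swapUsageBytes"),
        "available_bytes": swap.get("swapAvailableBytes"),
    }


if __name__ == "__main__":
    print(node_swap_stats(SAMPLE_SUMMARY))
```

A None result would suggest the node does not report swap statistics, for example because swap is not enabled.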
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe.
As such, we do not advocate the utilization of swap memory for workloads or environments that are subject to performance constraints. Furthermore, it is recommended to employ LimitedSwap, as this significantly mitigates the risks posed to the node.
Cluster administrators and developers should benchmark their nodes and applications before using swap in production scenarios, and we need your help with that!
Enabling swap on a system without encryption poses a security risk, as critical information, such as volumes that represent Kubernetes Secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain this confidential data. To mitigate this risk, the Kubernetes project strongly recommends that you encrypt your swap space. However, handling encrypted swap is not within the scope of the kubelet; rather, it is a general OS configuration concern and should be addressed at that level. It is the administrator's responsibility to provision encrypted swap.
Furthermore, as previously mentioned, with LimitedSwap the user has the option to completely disable swap usage for a container by specifying memory requests that are equal to memory limits. This will prevent the corresponding containers from accessing swap memory.
The Kubernetes 1.28 release introduced Beta support for swap memory on Linux nodes, and we will continue to work towards general availability for this feature. I hope that this will include:
- Adding the ability to set a system-reserved quantity of swap from what the kubelet detects on the host.
- Adding support for controlling swap consumption at the Pod level via cgroups.
  - This point is still under discussion.
- Collecting feedback from test user cases.
  - We will consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
You can review the current documentation for using swap with Kubernetes.
Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. A Slack channel dedicated to swap is also available at #sig-node-swap.
Feel free to reach out to me, Itamar Holder (@iholder101 on Slack and GitHub), if you'd like to help or ask further questions.