CNCF Projects

Kubernetes v1.34: Decoupled Taint Manager Is Now Stable

Kubernetes Blog - Mon, 09/15/2025 - 14:30

This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve the taint eviction controller or to build custom implementations of taint-based eviction.

What's new?

The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.

How can I learn more?

For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.

How to get involved?

We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:

Ed Bartosh (@bart0sh)
Yuan Chen (@yuanchen8911)
Aldo Culquicondor (@alculquicondor)
Baofa Fan (@carlory)
Sergey Kanzhelev (@SergeyKanzhelev)
Tim Bannister (@lmktfy)
Maciej Skoczeń (@macsko)
Maciej Szulik (@soltysh)
Wojciech Tyczynski (@wojtek-t)

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA

Kubernetes Blog - Fri, 09/12/2025 - 14:30

Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.

Automated cgroup driver detection

In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. You can read more here. After many releases of waiting for each CRI implementation to have major versions released and packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.

In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:

containerd: Support was added in v2.0.0
CRI-O: Support was added in v1.28.0

Announcement: Kubernetes is deprecating containerd v1.y support

While CRI-O releases versions that match Kubernetes versions, and thus CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature is only in v2.0 and later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases of containerd.

The Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.y. The last Kubernetes release to offer this support will be the last released version of v1.35, and support will be dropped in v1.36.0. To assist administrators in managing this future transition, a new detection mechanism is available. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated. The presence of this metric with a version label of 1.36.0 will indicate that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or a later version before, or at the same time as, upgrading the kubelet to v1.36.0.
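
As a rough illustration of the check described above, the following Python sketch scans kubelet metrics in Prometheus text format for kubelet_cri_losing_support samples whose version label is 1.36.0. The sample payload and the helper names are invented for illustration, and fetching the metrics text from each kubelet is left out because it depends on your setup.

```python
import re

# One Prometheus text-format sample: name, optional {labels}, then a value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def losing_support_versions(metrics_text, metric="kubelet_cri_losing_support"):
    """Return the set of 'version' label values reported for the metric."""
    versions = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if "version" in labels:
            versions.add(labels["version"])
    return versions


def needs_containerd_upgrade(metrics_text, target="1.36.0"):
    """True if the node reports it will lose CRI support in the target release."""
    return target in losing_support_versions(metrics_text)


# Hypothetical payload; a real kubelet exposes many more metrics.
SAMPLE = """\
# TYPE kubelet_cri_losing_support gauge
kubelet_cri_losing_support{version="1.36.0"} 1
kubelet_running_pods 12
"""

print(needs_containerd_upgrade(SAMPLE))  # prints True
```

Running a check like this against every node before an upgrade gives a list of nodes whose containerd needs attention first.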

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta

Kubernetes Blog - Thu, 09/11/2025 - 14:30

The functionality for CSI drivers to update information about attachable volume count on the nodes, first introduced as Alpha in Kubernetes v1.33, has graduated to Beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.

Background

Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:

Manual or external operations attaching/detaching volumes outside of Kubernetes control.
Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.

Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.

Dynamically adapting CSI volume limits

With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.

How it works

Kubernetes supports two mechanisms for updating the reported node volume limits:

Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (ResourceExhausted error).

Enabling the feature

To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components:

kube-apiserver
kubelet

Example CSI driver configuration

Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60

This configuration directs the kubelet to call the CSI driver's NodeGetInfo method every 60 seconds, updating the node’s allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.

Immediate updates on attachment failures

When a volume attachment operation fails due to a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the ContainerCreating state.
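
The two update triggers can be modelled in a few lines. This Python sketch is not kubelet code: the function names and the clock handling are hypothetical, and rejecting (rather than clamping) a too-small period is an assumption. Only the 10-second minimum and the ResourceExhausted code (8) come from the text above.

```python
GRPC_RESOURCE_EXHAUSTED = 8      # gRPC status code named in the text
MIN_UPDATE_PERIOD_SECONDS = 10   # minimum interval stated in the text


def validate_period(period_seconds):
    """Reject periods below the documented minimum (hypothetical helper)."""
    if period_seconds < MIN_UPDATE_PERIOD_SECONDS:
        raise ValueError(
            f"nodeAllocatableUpdatePeriodSeconds must be >= {MIN_UPDATE_PERIOD_SECONDS}"
        )
    return period_seconds


def should_refresh(now, last_refresh, period_seconds, attach_error_code=None):
    """Decide whether to re-query the driver for its allocatable count."""
    if attach_error_code == GRPC_RESOURCE_EXHAUSTED:
        return True  # reactive update: do not wait for the next tick
    return now - last_refresh >= period_seconds  # periodic update


period = validate_period(60)
print(should_refresh(now=100, last_refresh=50, period_seconds=period))  # prints False
print(should_refresh(now=55, last_refresh=50, period_seconds=period,
                     attach_error_code=GRPC_RESOURCE_EXHAUSTED))        # prints True
```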

Getting started

To enable this feature in your Kubernetes v1.34 cluster:

Enable the feature gate MutableCSINodeAllocatableCount on the kube-apiserver and kubelet components.
Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
Monitor and observe improvements in scheduling accuracy and pod placement reliability.

Next steps

This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.

Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: Use An Init Container To Define App Environment Variables

Kubernetes Blog - Wed, 09/10/2025 - 14:30

Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity. For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both the configurations and the workload Pods.

Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don’t want to hard-code them or mount volumes just to get the job done.

If that's the situation you are in, you now have a new (alpha) way to achieve that. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). This feature allows you to load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It’s a simple yet elegant solution to some surprisingly common problems.

What’s this all about?

At its core, this feature allows you to point your container to a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the pod does). Your main container doesn’t need to mount the volume. The kubelet will read the file and inject these variables when the container starts.

How It Works

Here's a simple example:

apiVersion: v1
kind: Pod
metadata:
  name: envfile-demo   # hypothetical name; a Pod needs metadata.name to be created
spec:
  initContainers:
  - name: generate-config
    image: busybox
    # Write a KEY=VALUE file into the shared emptyDir volume
    command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: app-container
    image: gcr.io/distroless/static
    env:
    - name: CONFIG_VAR
      valueFrom:
        fileKeyRef:
          path: config.env
          volumeName: config-volume
          key: CONFIG_VAR
  volumes:
  - name: config-volume
    emptyDir: {}

Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file uses a syntax that resembles the common .env format (think KEY=VALUE), and (for this alpha stage at least) you must ensure that it is written into an emptyDir volume. Other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn’t need to; it just gets the variables handed to it at startup.
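
To make the KEY=VALUE idea concrete, here is a small Python sketch of parsing such a file and looking up one key, which is conceptually what fileKeyRef asks for. The parsing rules used here (skip blank lines and # comments, split on the first =) are simplifying assumptions, not the kubelet's actual parser, and the function names are invented.

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines into a dict (simplified, assumed rules)."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        env[key.strip()] = value.strip()
    return env


def resolve_file_key_ref(file_text, key):
    """Return the value for one key, as a fileKeyRef lookup would need."""
    return parse_env_file(file_text)[key]


# Content equivalent to what the init container in the example writes.
content = "CONFIG_VAR=HELLO\n"
print(resolve_file_key_ref(content, "CONFIG_VAR"))  # prints HELLO
```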

A word on security

While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into the Pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.

If storing sensitive data like keys or tokens using this feature, ensure your cluster security policies effectively protect nodes against unauthorized access to prevent exposure of confidential information.

Summary

This feature will eliminate a number of complex workarounds used today, simplify app authoring, and open the door to more use cases. Kubernetes stays flexible and open for feedback. Tell us how you use this feature or what is missing.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: Snapshottable API server cache

Kubernetes Blog - Tue, 09/09/2025 - 14:30

For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.

The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.

Evolving the cache for performance and stability

The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.

Consistent reads from cache (Beta in v1.31)

While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time, a huge win as it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing its load for common workloads.

Taming large responses with streaming (Beta in v1.33)

Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.

The missing piece

Despite these huge improvements, a critical gap remained. Any request for a historical LIST—most commonly used for paginating through large result sets—still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.

Kubernetes 1.34: snapshots complete the picture

The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.

Here’s how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.

When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.
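
The "lazy copy" idea can be illustrated with a toy store in Python. This is a conceptual sketch only: the class and method names are invented, and the real watch cache uses different data structures. In particular, the shallow dict copy below still copies the index on every update, which a production implementation would avoid with a copy-on-write structure. What it does show is that snapshots share object references instead of duplicating the objects.

```python
class SnapshottingCache:
    """Toy cache that keeps a point-in-time view per resourceVersion."""

    def __init__(self, max_snapshots=100):
        self._current = {}     # key -> object (latest state)
        self._snapshots = {}   # resourceVersion -> view of the index at that version
        self._max = max_snapshots
        self.resource_version = 0

    def update(self, key, obj):
        """Apply an update and record a snapshot for the new version."""
        self.resource_version += 1
        self._current[key] = obj
        # Shallow copy: the snapshot holds references, not copies of objects.
        self._snapshots[self.resource_version] = dict(self._current)
        while len(self._snapshots) > self._max:
            self._snapshots.pop(min(self._snapshots))  # drop the oldest version
        return self.resource_version

    def list_at(self, resource_version):
        """Serve a historical LIST from memory, if the version is retained."""
        snapshot = self._snapshots.get(resource_version)
        if snapshot is None:
            raise KeyError(f"resourceVersion {resource_version} is not retained")
        return snapshot


cache = SnapshottingCache()
rv1 = cache.update("pod-a", {"name": "pod-a"})
rv2 = cache.update("pod-b", {"name": "pod-b"})
print(sorted(cache.list_at(rv1)))  # prints ['pod-a']
print(sorted(cache.list_at(rv2)))  # prints ['pod-a', 'pod-b']
```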

A new era of API Server performance

With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:

Get Data from Cache: Consistent reads and snapshottable cache work together to ensure nearly all read requests—whether for the latest data or a historical snapshot—are served from the API server's memory.
Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.

The result is a system where the resource cost of read operations is almost fully predictable and much more resilient to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.

How to get started

With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.

Acknowledgements

Special thanks for designing, implementing, and reviewing these critical features go to:

Ahmad Zolfaghari (@ah8ad3)
Ben Luddy (@benluddy) – Red Hat
Chen Chen (@z1cheng) – Microsoft
Davanum Srinivas (@dims) – Nvidia
David Eads (@deads2k) – Red Hat
Han Kang (@logicalhan) – CoreWeave
haosdent (@haosdent) – Shopee
Joe Betz (@jpbetz) – Google
Jordan Liggitt (@liggitt) – Google
Łukasz Szaszkiewicz (@p0lyn0mial) – Red Hat
Maciej Borsz (@mborsz) – Google
Madhav Jivrajani (@MadhavJivrajani) – UIUC
Marek Siarkowicz (@serathius) – Google
NKeert (@NKeert)
Tim Bannister (@lmktfy)
Wei Fu (@fuweid) – Microsoft
Wojtek Tyczyński (@wojtek-t) – Google

...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.

Categories: CNCF Projects, Kubernetes

Path To Releasing Helm v4

Helm Blog - Mon, 09/08/2025 - 20:00

The first Alpha for Helm v4 has been released. Now that Helm v4 development is in the home stretch, we wanted to share the details on what's happening and how the broader community can get involved.

Alpha Period

With the start of September, there is a freeze on new major features for Helm v4. This begins the Alpha phase, where API breaking changes will still happen, but the focus turns to stability and making sure the existing changes work as expected.

If you're a Helm user, during this period you can test out the current capabilities and provide feedback where things aren't working as expected. Just remember, this is alpha quality software and changes are still occurring.

For Helm SDK users, now is a good time to look at the API to see if there are any concerns with the design changes along with any impacts to your efforts.

The Alpha period runs through the month of September.

Beta Period

The beta period starts in October. At this point the focus is on stability in preparation for release. API breaking changes should be complete and the focus transitions to fixing any bugs to ensure there is a stable release.

Testers should file bugs as they encounter any issues.

Per release schedule policy, once the first beta version is available, the final 4.0.0 release date will be chosen and announced.

Release Candidates

At the end of October, the first release candidate will be created. This represents what we think will be released as Helm v4. If there are any major issues, they will be fixed and a new release candidate will be made.

Release

The release is planned for KubeCon + CloudNativeCon North America 2025 in mid-November, which is 6 years after the release of Helm v3 and 10 years after the creation of Helm. More details on the release will come.

Categories: CNCF Projects, Kubernetes

Beyond linkerd-viz: Linkerd Metrics with OpenTelemetry

Linkerd Blog - Mon, 09/08/2025 - 20:00

TL;DR

Linkerd, the enterprise-grade service mesh that minimizes overhead, now integrates with OpenTelemetry, often also simply called OTel. That’s pretty cool because it allows you to collect and export Linkerd’s metrics to your favorite observability tools. This integration improves your ability to monitor and troubleshoot applications effectively. Sounds interesting? Read on.

Before we dive into this topic, I want to be sure you have a basic understanding of Kubernetes. If you’re new to it, that’s ok! But I’d recommend exploring the official Kubernetes tutorials and/or experimenting with “Kind” (Kubernetes in Docker) with this simple guide.

Categories: CNCF Projects

Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA

Kubernetes Blog - Mon, 09/08/2025 - 14:30

The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.

What is VolumeAttributesClass?

At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.

Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.

This means you can now:

Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.
Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.
Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.

What is new from Beta to GA

There are two major enhancements from beta.

Cancel support from infeasible errors

To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification becomes infeasible. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.

Quota support based on scope

While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.

This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name. Please see more details here.
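
The selection logic behind such a scoped quota can be sketched in Python. This is an illustration of the idea only, not control-plane code: the helper names are invented, the PVC objects are minimal hypothetical dicts, and the quota here counts PVCs rather than any specific resource.

```python
def pvcs_in_scope(pvcs, class_names):
    """Select PVCs whose spec.volumeAttributesClassName is one of class_names."""
    wanted = set(class_names)
    return [
        pvc for pvc in pvcs
        if pvc.get("spec", {}).get("volumeAttributesClassName") in wanted
    ]


def within_quota(pvcs, class_names, max_pvcs):
    """True if the number of in-scope PVCs does not exceed the limit."""
    return len(pvcs_in_scope(pvcs, class_names)) <= max_pvcs


# Hypothetical PVCs; only the field the scope looks at is filled in.
claims = [
    {"metadata": {"name": "db"}, "spec": {"volumeAttributesClassName": "gold"}},
    {"metadata": {"name": "cache"}, "spec": {"volumeAttributesClassName": "silver"}},
    {"metadata": {"name": "logs"}, "spec": {}},
]
print([c["metadata"]["name"] for c in pvcs_in_scope(claims, ["gold"])])  # prints ['db']
```

PVCs that do not reference the named class, or reference no class at all, fall outside the scope and are not counted against that quota.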

Drivers that support VolumeAttributesClass

Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.
Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.

Contact

For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.

Categories: CNCF Projects, Kubernetes

CoreDNS-1.12.4 Release

CoreDNS Blog - Sun, 09/07/2025 - 20:00

This release improves stability and security, fixing context propagation in DoH, label offset handling in the file plugin, and connection leaks in gRPC and transfer. It also adds support for the prefer option in loadbalance, introduces timeouts to the metrics server, and fixes several security vulnerabilities (see details in the related security advisories). Brought to you by: Archy, Ilya Kulakov, Olli Janatuinen, Qasim Sarfraz, Syed Azeez, Ville Vesilehto, wencyu, Yong Tang.

Categories: CNCF Projects

Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA

Kubernetes Blog - Fri, 09/05/2025 - 14:30

In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.

About Pod Replacement Policy

By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).

As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.

This behavior works fine for many workloads, but it can cause problems in certain cases.

For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:

/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4

Additionally, starting replacement Pods before the old ones fully terminate can lead to:

Scheduling delays by kube-scheduler as the nodes remain occupied.
Unnecessary cluster scale-ups to accommodate the replacement Pods.
Temporary bypassing of quota checks by workload orchestrators like Kueue.

With Pod replacement policy, Kubernetes gives you control over when the control plane replaces terminating Pods, helping you avoid these issues.

How Pod Replacement Policy works

This enhancement means that Jobs in Kubernetes have an optional field .spec.podReplacementPolicy.
You can choose one of two policies:

TerminatingOrFailed (default): Replaces Pods as soon as they start terminating.
Failed: Replaces Pods only after they fully terminate and transition to the Failed phase.

Setting the policy to Failed ensures that a new Pod is only created after the previous one has completely terminated.

For Jobs with a Pod Failure Policy, the default podReplacementPolicy is Failed, and no other value is allowed. See Pod Failure Policy to learn more about Pod Failure Policies for Jobs.

You can check how many Pods are currently terminating by inspecting the Job’s .status.terminating field:

kubectl get job myjob -o=jsonpath='{.status.terminating}'

Example

Here’s a Job example that executes a task two times (spec.completions: 2) in parallel (spec.parallelism: 2) and replaces Pods only after they fully terminate (spec.podReplacementPolicy: Failed):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image

If a Pod receives a SIGTERM signal (deletion, eviction, preemption...), it begins terminating. When the container handles termination gracefully, cleanup may take some time.

When the Job starts, we will see two Pods running:

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-qr8kf   1/1     Running   0          2s
example-job-stvb4   1/1     Running   0          2s

Let's delete one of the Pods (example-job-qr8kf).

With the TerminatingOrFailed policy, as soon as one Pod (example-job-qr8kf) starts terminating, the Job controller immediately creates a new Pod (example-job-b59zk) to replace it.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-b59zk   1/1     Running       0          1s
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

With the Failed policy, the new Pod (example-job-b59zk) is not created while the old Pod (example-job-qr8kf) is terminating.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

When the terminating Pod has fully transitioned to the Failed phase, a new Pod is created:

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-b59zk   1/1     Running   0          1s
example-job-stvb4   1/1     Running   0          25s
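
The difference between the two policies comes down to whether terminating Pods still count against parallelism. The following Python sketch models only that decision. It is a simplification with an invented function name, and it ignores completions, backoff, and everything else the real Job controller tracks.

```python
def pods_to_create(parallelism, running, terminating, policy="TerminatingOrFailed"):
    """Number of replacement Pods a simplified Job controller would create now."""
    if policy == "TerminatingOrFailed":
        occupied = running                # terminating Pods are replaced right away
    elif policy == "Failed":
        occupied = running + terminating  # wait until they have fully terminated
    else:
        raise ValueError(f"unknown podReplacementPolicy: {policy!r}")
    return max(0, parallelism - occupied)


# One of two Pods has just been deleted and is still terminating.
print(pods_to_create(2, running=1, terminating=1))                   # prints 1
print(pods_to_create(2, running=1, terminating=1, policy="Failed"))  # prints 0
# Later, the old Pod has fully terminated.
print(pods_to_create(2, running=1, terminating=0, policy="Failed"))  # prints 1
```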

How can you learn more?

Read the user-facing documentation for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.
Read the KEPs for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

As this feature moves to stable after 2 years, we would like to thank the following people:

Kevin Hannon - for writing the KEP and the initial implementation.
Michał Woźniak - for guidance, mentorship, and reviews.
Aldo Culquicondor - for guidance, mentorship, and reviews.
Maciej Szulik - for guidance, mentorship, and reviews.
Dejan Zele Pejchev - for taking over the feature and promoting it from Alpha through Beta to GA.

Get involved

This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in this space, we recommend subscribing to our Slack channel and attending the regular community meetings.

Categories: CNCF Projects, Kubernetes

Linkerd Edge Release Roundup: September 2025

Linkerd Blog - Thu, 09/04/2025 - 20:00

Welcome to the September 2025 Edge Release Roundup post, where we dive into the most recent edge releases to help keep everyone up to date on the latest and greatest! This post covers edge releases from August 2025.

How to give feedback

Edge releases are a snapshot of our current development work on main; by definition, they always have the most recent features but they may have incomplete features, features that end up getting rolled back later, or (like all software) even bugs. That said, edge releases are intended for production use, and go through a rigorous set of automated and manual tests before being released. Once released, we also document whether the release is recommended for broad use – and when needed, we go back and update the recommendations.

Categories: CNCF Projects

Kubernetes v1.34: PSI Metrics for Kubernetes Graduates to Beta

Kubernetes Blog - Thu, 09/04/2025 - 14:30

As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, Pressure Stall Information (PSI) Metrics has graduated to Beta.

What is Pressure Stall Information (PSI)?

Pressure Stall Information (PSI) is a feature of the Linux kernel (version 4.20 and later) that provides a canonical way to quantify pressure on infrastructure resources, in terms of whether demand for a resource exceeds current supply. It moves beyond simple resource utilization metrics and instead measures the amount of time that tasks are stalled due to resource contention. This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.

PSI exposes metrics for CPU, memory, and I/O, categorized as either some or full pressure:

some: The percentage of time that at least one task is stalled on a resource. This indicates some level of resource contention.
full: The percentage of time that all non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.

Figure: PSI 'some' vs. 'full' pressure

These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.
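
On Linux, the kernel exposes these values as text lines of the form "some avg10=... avg60=... avg300=... total=...", where total is the cumulative stall time in microseconds. The Python sketch below parses that format. The sample values are invented, and the function name is hypothetical; on a real node the text would come from a file such as /proc/pressure/memory.

```python
def parse_psi(text):
    """Parse kernel PSI text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]   # kind is 'some' or 'full'
        values = {}
        for field in fields:
            name, _, raw = field.partition("=")
            # avg* are percentages; total is a microsecond counter
            values[name] = int(raw) if name == "total" else float(raw)
        result[kind] = values
    return result


# Invented sample in the kernel's format.
sample = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.25 avg60=0.10 avg300=0.00 total=4567\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])  # prints 1.5
print(psi["full"]["total"])  # prints 4567
```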

PSI metrics in Kubernetes

With the KubeletPSI feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the Summary API and the /metrics/cadvisor Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.

The following new metrics are available in Prometheus exposition format via /metrics/cadvisor:

container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total

These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to:

Identify memory leaks: A steadily increasing some pressure for memory can indicate a memory leak in an application.
Optimize resource requests and limits: By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.
Autoscale workloads: You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.
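
Because the *_seconds_total metrics are cumulative counters, alerting or autoscaling on them usually means looking at how fast they grow. This Python sketch computes the fraction of wall-clock time spent under pressure between two scrapes, which is the same idea as a rate over a counter in a Prometheus-compatible system. The function name and sample numbers are illustrative assumptions.

```python
def pressure_fraction(sample_old, sample_new):
    """Fraction of time under pressure between two (timestamp, counter) samples.

    Timestamps and counter values are both in seconds.
    """
    (t0, v0), (t1, v1) = sample_old, sample_new
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        raise ValueError("counter went backwards (likely a reset)")
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrapes 60 seconds apart: the counter grew by 3 seconds.
fraction = pressure_fraction((0.0, 10.0), (60.0, 13.0))
print(f"{fraction:.0%}")  # prints 5%
```

A sustained fraction close to zero suggests little contention, while a value that stays high over several windows is a reasonable signal to investigate or scale.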

How to enable PSI metrics

To enable PSI metrics in your Kubernetes cluster, you need to:

Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.
Enable the KubeletPSI feature gate on the kubelet.

Once enabled, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes the kubelet does not expose PSI metrics.

What's next?

We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. Because this is a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.

To learn more about PSI metrics, check out the official Kubernetes documentation. You can also get involved in the conversation on the #sig-node Slack channel.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta

Kubernetes Blog - Wed, 09/03/2025 - 14:30

The Kubernetes community continues to advance security best practices by reducing reliance on long-lived credentials. Following the successful alpha release in Kubernetes v1.33, Service Account Token Integration for Kubelet Credential Providers has now graduated to beta in Kubernetes v1.34, bringing us closer to eliminating long-lived image pull secrets from Kubernetes clusters.

This enhancement allows credential providers to use workload-specific service account tokens to obtain registry credentials, providing a secure, ephemeral alternative to traditional image pull secrets.

What's new in beta?

The beta graduation brings several important changes that make the feature more robust and production-ready:

Required `cacheType` field

Breaking change from alpha: The cacheType field is required in the credential provider configuration when using service account tokens. This field is new in beta and must be specified to ensure proper caching behavior.

# CAUTION: this is not a complete configuration example, just a reference for the 'tokenAttributes.cacheType' field.
tokenAttributes:
  serviceAccountTokenAudience: "my-registry-audience"
  cacheType: "ServiceAccount" # Required field in beta
  requireServiceAccount: true

Choose between two caching strategies:

Token: Cache credentials per service account token (use when credential lifetime is tied to the token). This is useful when the credential provider transforms the service account token into registry credentials with the same lifetime as the token, or when registries support Kubernetes service account tokens directly. Note: The kubelet cannot send service account tokens directly to registries; credential provider plugins are needed to transform tokens into the username/password format expected by registries.
ServiceAccount: Cache credentials per service account identity (use when credentials are valid for all pods using the same service account)
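To make the difference between the two strategies concrete, here is a small Python model of what each one might use as a cache key. This is an illustration of the idea only, not the kubelet's implementation: the function, its parameters, and the shape of the key (including the use of the image and a token hash) are assumptions.

```python
import hashlib


def credential_cache_key(cache_type, image, token, namespace, name, uid):
    """Build an illustrative cache key for the two cacheType values.
    This models the idea only; it is not the kubelet's implementation."""
    if cache_type == "Token":
        # Keyed on the token itself (hashed so the raw token is not stored),
        # so a new token means a new cache entry.
        token_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
        return ("Token", image, token_hash)
    if cache_type == "ServiceAccount":
        # Keyed on the identity, so every pod using this ServiceAccount
        # shares the same entry regardless of which token was presented.
        return ("ServiceAccount", image, namespace, name, uid)
    raise ValueError(f"unsupported cacheType: {cache_type!r}")
```

With `ServiceAccount`, two pods that present different tokens for the same ServiceAccount map to the same key; with `Token`, they do not.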

Isolated image pull credentials

The beta release provides stronger security isolation for container images when using service account tokens for image pulls. It ensures that pods can only access images that were pulled using ServiceAccounts they're authorized to use. This prevents unauthorized access to sensitive container images and enables granular access control where different workloads can have different registry permissions based on their ServiceAccount.

When credential providers use service account tokens, the system tracks ServiceAccount identity (namespace, name, and UID) for each pulled image. When a pod attempts to use a cached image, the system verifies that the pod's ServiceAccount matches exactly with the ServiceAccount that was used to originally pull the image.

Administrators can revoke access to previously pulled images by deleting and recreating the ServiceAccount, which changes the UID and invalidates cached image access.

For more details about this capability, see the image pull credential verification documentation.

How it works

Configuration

Credential providers opt into using ServiceAccount tokens by configuring the tokenAttributes field:

#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: my-credential-provider
  matchImages:
  - "*.myregistry.io/*"
  defaultCacheDuration: "10m"
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  tokenAttributes:
    serviceAccountTokenAudience: "my-registry-audience"
    cacheType: "ServiceAccount" # New in beta
    requireServiceAccount: true
    requiredServiceAccountAnnotationKeys:
    - "myregistry.io/identity-id"
    optionalServiceAccountAnnotationKeys:
    - "myregistry.io/optional-annotation"

Image pull flow

At a high level, kubelet coordinates with your credential provider and the container runtime as follows:

When the image is not present locally:
- kubelet checks its credential cache using the configured cacheType (Token or ServiceAccount)
- If needed, kubelet requests a ServiceAccount token for the pod's ServiceAccount and passes it, plus any required annotations, to the credential provider
- The provider exchanges that token for registry credentials and returns them to kubelet
- kubelet caches credentials per the cacheType strategy and pulls the image with those credentials
- kubelet records the ServiceAccount coordinates (namespace, name, UID) associated with the pulled image for later authorization checks
When the image is already present locally:
- kubelet verifies the pod's ServiceAccount coordinates match the coordinates recorded for the cached image
- If they match exactly, the cached image can be used without pulling from the registry
- If they differ, kubelet performs a fresh pull using credentials for the new ServiceAccount
With image pull credential verification enabled:
- Authorization is enforced using the recorded ServiceAccount coordinates, ensuring pods only use images pulled by a ServiceAccount they are authorized to use
- Administrators can revoke access by deleting and recreating a ServiceAccount; the UID changes and previously recorded authorization no longer matches
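The check for an image that is already present locally can be sketched in Python as an exact match on the recorded coordinates. The names below are hypothetical and the sketch models only the comparison described above, not how the kubelet stores its records.

```python
from collections import namedtuple

# ServiceAccount coordinates as described above: namespace, name, and UID.
Coordinates = namedtuple("Coordinates", ["namespace", "name", "uid"])


def can_use_cached_image(recorded, pod_sa):
    """True only when the pod's ServiceAccount matches a recorded one exactly."""
    return pod_sa in recorded


# Coordinates recorded when the image was first pulled (illustrative values).
recorded = {Coordinates("default", "registry-access-sa", "uid-1")}

same = Coordinates("default", "registry-access-sa", "uid-1")
# Deleting and recreating the ServiceAccount keeps the name but changes the UID.
recreated = Coordinates("default", "registry-access-sa", "uid-2")

print(can_use_cached_image(recorded, same))
print(can_use_cached_image(recorded, recreated))
```

Because the UID is part of the comparison, a recreated ServiceAccount no longer matches, which is what makes the revocation step above work.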

Audience restriction

The beta release builds on service account node audience restriction (beta since v1.33) to ensure kubelet can only request tokens for authorized audiences. Administrators configure allowed audiences using RBAC to enable kubelet to request service account tokens for image pulls:

#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-credential-provider-audiences
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"] # Optional: specific SA
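To show how the rule above is meant to be read, here is a simplified Python model of rule matching: the audience appears as the resource and the ServiceAccount name as the optional resource name. The function is hypothetical and deliberately ignores wildcards and the rest of RBAC evaluation.

```python
def rule_allows(rule, verb, api_group, resource, resource_name):
    """Simplified RBAC rule match (no wildcard handling)."""
    if verb not in rule.get("verbs", []):
        return False
    if api_group not in rule.get("apiGroups", []):
        return False
    if resource not in rule.get("resources", []):
        return False
    names = rule.get("resourceNames", [])
    # An empty resourceNames list places no restriction on the name.
    return not names or resource_name in names


rule = {
    "verbs": ["request-serviceaccounts-token-audience"],
    "apiGroups": [""],
    "resources": ["my-registry-audience"],
    "resourceNames": ["registry-access-sa"],
}

VERB = "request-serviceaccounts-token-audience"
print(rule_allows(rule, VERB, "", "my-registry-audience", "registry-access-sa"))
print(rule_allows(rule, VERB, "", "other-audience", "registry-access-sa"))
```

Dropping `resourceNames` from the rule would allow the audience for any ServiceAccount, which is why the example marks it as optional.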

Getting started with beta

Prerequisites

Kubernetes v1.34 or later
Feature gate enabled: KubeletServiceAccountTokenForCredentialProviders=true (beta, enabled by default)
Credential provider support: Update your credential provider to handle ServiceAccount tokens

Migration from alpha

If you're already using the alpha version, the migration to beta requires minimal changes:

Add cacheType field: Update your credential provider configuration to include the required cacheType field
Review caching strategy: Choose between Token and ServiceAccount cache types based on your provider's behavior
Test audience restrictions: Ensure your RBAC configuration, or other cluster authorization rules, will properly restrict token audiences

Example setup

Here's a complete example for setting up a credential provider with service account tokens (this example assumes your cluster uses RBAC authorization):

#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#

# Service Account with registry annotations
apiVersion: v1
kind: ServiceAccount
metadata:
  name: registry-access-sa
  namespace: default
  annotations:
    myregistry.io/identity-id: "user123"
---
# RBAC for audience restriction
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: registry-audience-access
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"] # Optional: specific ServiceAccount
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-registry-audience
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: registry-audience-access
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
---
# Pod using the ServiceAccount
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: registry-access-sa
  containers:
  - name: my-app
    image: myregistry.example/my-app:latest

What's next?

For Kubernetes v1.35, we - Kubernetes SIG Auth - expect the feature to stay in beta, and we will continue to solicit feedback.

You can learn more about this feature on the service account token for image pulls page in the Kubernetes documentation.

You can also follow KEP-4412 to track progress across the coming Kubernetes releases.

Call to action

In this blog post, I have covered the beta graduation of ServiceAccount token integration for Kubelet Credential Providers in Kubernetes v1.34. I discussed the key improvements, including the required cacheType field and enhanced integration with Ensure Secret Pull Images.

We have been receiving positive feedback from the community during the alpha phase and would love to hear more as we stabilize this feature for GA. In particular, we would like feedback from credential provider implementors as they integrate with the new beta API and caching mechanisms. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack.

How to get involved

If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment

Kubernetes Blog - Tue, 09/02/2025 - 14:30

A new CPU Manager Static Policy Option called prefer-align-cpus-by-uncorecache was introduced in Kubernetes v1.32 as an alpha feature, and has graduated to beta in Kubernetes v1.34. This CPU Manager Policy Option is designed to optimize performance for specific workloads running on processors with a split uncore cache architecture. In this article, I'll explain what that means and why it's useful.

Understanding the feature

What is uncore cache?

Until relatively recently, nearly all mainstream computer processors had a monolithic last-level cache that was shared across every core in a multi-core CPU package. This monolithic cache is also referred to as uncore cache (because it is not linked to a specific core), or as Level 3 cache. In addition to the Level 3 cache, there are other caches, commonly called Level 1 and Level 2 cache, that are associated with a specific CPU core.

In order to reduce access latency between the CPU cores and their cache, recent AMD64 and ARM architecture based processors have introduced a split uncore cache architecture, where the last-level-cache is divided into multiple physical caches, that are aligned to specific CPU groupings within the physical package. The shorter distances within the CPU package help to reduce latency.

Kubernetes is able to place workloads in a way that accounts for the cache topology within the CPU package(s).

Cache-aware workload placement

The matrix below shows the CPU-to-CPU latency measured in nanoseconds (lower is better) when passing a packet between CPUs, via the cache coherence protocol, on a processor that uses split uncore cache. In this example, the processor package consists of 2 uncore caches. Each uncore cache serves 8 CPU cores. Blue entries in the matrix represent latency between CPUs sharing the same uncore cache, while grey entries indicate latency between CPUs corresponding to different uncore caches. Latency between CPUs that correspond to different caches is higher than the latency between CPUs that belong to the same cache.

With prefer-align-cpus-by-uncorecache enabled, the static CPU Manager attempts to allocate CPU resources for a container such that all CPUs assigned to the container share the same uncore cache. This policy operates on a best-effort basis, aiming to minimize the distribution of a container's CPU resources across uncore caches, based on the container's requirements, and accounting for allocatable resources on the node.

By running a workload, where it can, on a set of CPUs that use the smallest feasible number of uncore caches, applications benefit from reduced cache latency (as seen in the matrix above), and from reduced contention against other workloads, which can result in overall higher throughput. The benefit only shows up if your nodes use a split uncore cache topology for their processors.

The diagram below illustrates uncore cache alignment when the feature is enabled.

By default, Kubernetes does not account for uncore cache topology; containers are assigned CPU resources using a packed methodology. As a result, Container 1 and Container 2 can experience a noisy neighbor impact due to cache access contention on Uncore Cache 0. Additionally, Container 2 will have CPUs distributed across both caches, which can introduce cross-cache latency.

With prefer-align-cpus-by-uncorecache enabled, each container is isolated on an individual cache. This resolves the cache contention between the containers and minimizes the cache latency for the CPUs being utilized.
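The best-effort behavior described above can be illustrated with a small Python sketch that picks CPUs while touching as few uncore caches as possible. This is a simplified model written for this article's example topology, not the kubelet's actual allocation algorithm; the function name and the input format are assumptions.

```python
def pick_cpus(free_by_cache, wanted):
    """Choose `wanted` CPUs while touching as few uncore caches as possible.

    free_by_cache maps an uncore cache ID to the list of free CPU IDs it serves.
    Returns a sorted list of CPU IDs, or None if there are not enough free CPUs.
    """
    if sum(len(cpus) for cpus in free_by_cache.values()) < wanted:
        return None
    # Prefer the smallest single cache that can hold the whole request.
    fitting = [c for c, cpus in free_by_cache.items() if len(cpus) >= wanted]
    if fitting:
        best = min(fitting, key=lambda c: (len(free_by_cache[c]), c))
        return sorted(free_by_cache[best])[:wanted]
    # Otherwise spill across caches, fullest first, to keep the count low.
    chosen = []
    for cache in sorted(free_by_cache, key=lambda c: (-len(free_by_cache[c]), c)):
        chosen.extend(sorted(free_by_cache[cache])[: wanted - len(chosen)])
        if len(chosen) == wanted:
            break
    return sorted(chosen)


# Two uncore caches with 8 CPUs each; CPUs 0 and 1 are already taken.
free = {0: [2, 3, 4, 5, 6, 7], 1: [8, 9, 10, 11, 12, 13, 14, 15]}
print(pick_cpus(free, 8))   # fits entirely on uncore cache 1
print(pick_cpus(free, 10))  # must span both caches
```

When a request cannot fit on one cache, the sketch still returns an allocation, which mirrors the "prefer" (rather than "require") nature of the policy option.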

Use cases

Common use cases can include telco applications like vRAN, Mobile Packet Core, and Firewalls. It's important to note that the optimization provided by prefer-align-cpus-by-uncorecache can be dependent on the workload. For example, applications that are memory bandwidth bound may not benefit from uncore cache alignment, as utilizing more uncore caches can increase memory bandwidth access.

Enabling the feature

To enable this feature, set the CPU Manager Policy to static and enable the CPU Manager Policy Options with prefer-align-cpus-by-uncorecache.

For Kubernetes 1.34, the feature is in the beta stage and requires the CPUManagerPolicyBetaOptions feature gate to also be enabled.

Append the following to the kubelet configuration file:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  CPUManagerPolicyBetaOptions: true
cpuManagerPolicy: "static"
cpuManagerPolicyOptions:
  prefer-align-cpus-by-uncorecache: "true"
reservedSystemCPUs: "0"
...

If you're making this change to an existing node, remove the cpu_manager_state file and then restart kubelet.

prefer-align-cpus-by-uncorecache can be enabled on nodes with a monolithic uncore cache processor. The feature will mimic a best-effort socket alignment effect and will pack CPU resources on the socket similar to the default static CPU Manager policy.

Getting involved

This feature is driven by SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: DRA has graduated to GA

Kubernetes Blog - Mon, 09/01/2025 - 14:30

Kubernetes 1.34 is here, and it has brought a huge wave of enhancements for Dynamic Resource Allocation (DRA)! This release marks a major milestone with many APIs in the resource.k8s.io group graduating to General Availability (GA), unlocking the full potential of how you manage devices on Kubernetes. On top of that, several key features have moved to beta, and a fresh batch of new alpha features promise even more expressiveness and flexibility.

Let's dive into what's new for DRA in Kubernetes 1.34!

The core of DRA is now GA

The headline feature of the v1.34 release is that the core of DRA has graduated to General Availability.

Kubernetes Dynamic Resource Allocation (DRA) provides a flexible framework for managing specialized hardware and infrastructure resources, such as GPUs or FPGAs. DRA provides APIs that enable each workload to specify the properties of the devices it needs, while leaving it to the scheduler to allocate actual devices, allowing increased reliability and improved utilization of expensive hardware.

With the graduation to GA, DRA is stable and will be part of Kubernetes for the long run. The community can still expect a steady stream of new features being added to DRA over the next several Kubernetes releases, but they will not make any breaking changes to DRA. So users and developers of DRA drivers can start adopting DRA with confidence.

Starting with Kubernetes 1.34, DRA is enabled by default; the DRA features that have reached beta are also enabled by default. That's because the default API version for DRA is now the stable v1 version, and not the earlier versions (for example, v1beta1 or v1beta2) that needed explicit opt-in.

Features promoted to beta

Several powerful features have been promoted to beta, adding more control, flexibility, and observability to resource management with DRA.

Admin access labelling has been updated. In v1.34, you can restrict device support to people (or software) authorized to use it. This is meant as a way to avoid privilege escalation if a DRA driver grants additional privileges when admin access is requested and to avoid accessing devices which are in use by normal applications, potentially in another namespace. The restriction works by ensuring that only users with access to a namespace with the resource.k8s.io/admin-access: "true" label are authorized to create ResourceClaim or ResourceClaimTemplate objects with the adminAccess field set to true. This ensures that non-admin users cannot misuse the feature.

Prioritized list lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available on the node.
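The "try the alternatives in order" behavior can be sketched in a few lines of Python. This is a toy model under stated assumptions: the function, the device type names, and the counts are hypothetical, and the real scheduler works with ResourceClaims and device selectors rather than simple counters.

```python
def first_satisfiable(alternatives, available):
    """Return the first alternative the node can satisfy, or None.

    alternatives: ordered list of (device_type, count), most preferred first.
    available: dict mapping device_type to the number of free devices.
    """
    for device_type, count in alternatives:
        if available.get(device_type, 0) >= count:
            return (device_type, count)
    return None


# Prefer one high-performance GPU, but accept two mid-level GPUs instead.
alternatives = [("high-perf-gpu", 1), ("mid-level-gpu", 2)]

print(first_satisfiable(alternatives, {"high-perf-gpu": 1, "mid-level-gpu": 4}))
print(first_satisfiable(alternatives, {"mid-level-gpu": 2}))
```

The order of the list is what encodes the preference: a later entry is only considered when every earlier one cannot be satisfied.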

The kubelet's API has been updated to report on Pod resources allocated through DRA. This allows node monitoring agents to know the allocated DRA resources for Pods on a node and makes it possible to use the DRA information in the PodResources API to develop new features and integrations.

New alpha features

Kubernetes 1.34 also introduces several new alpha features that give us a glimpse into the future of resource management with DRA.

Extended resource mapping support in DRA allows cluster administrators to advertise DRA-managed resources as extended resources, allowing developers to consume them using the familiar, simpler request syntax while still benefiting from dynamic allocation. This makes it possible for existing workloads to start using DRA without modifications, simplifying the transition to DRA for both application developers and cluster administrators.

Consumable capacity introduces a flexible device sharing model where multiple, independent resource claims from unrelated pods can each be allocated a share of the same underlying physical device. This new capability is managed through optional, administrator-defined sharing policies that govern how a device's total capacity is divided and enforced by the platform for each request. This allows for sharing of devices in scenarios where pre-defined partitions are not viable. A blog about this feature is coming soon.

Binding conditions improve scheduling reliability for certain classes of devices by allowing the Kubernetes scheduler to delay binding a pod to a node until its required external resources, such as attachable devices or FPGAs, are confirmed to be fully prepared. This prevents premature pod assignments that could lead to failures and ensures more robust, predictable scheduling by explicitly modeling resource readiness before the pod is committed to a node.

Resource health status for DRA improves observability by exposing the health status of devices allocated to a Pod via Pod Status. This works whether the device is allocated through DRA or Device Plugin. This makes it easier to understand the cause of an unhealthy device and respond properly. A blog about this feature is coming soon.

What’s next?

While DRA got promoted to GA this cycle, the hard work on DRA doesn't stop. There are several features in alpha and beta that we plan to bring to GA in the next couple of releases and we are looking to continue to improve performance, scalability and reliability of DRA. So expect an equally ambitious set of features in DRA for the 1.35 release.

Getting involved

A good starting point is joining the WG Device Management Slack channel and meetings, which happen at US/EU and EU/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.

Acknowledgments

A huge thanks to the new contributors to DRA this cycle:

Alay Patel (alaypatel07)
Gaurav Kumar Ghildiyal (gauravkghildiyal)
JP (Jpsassine)
Kobayashi Daisuke (KobayashiD27)
Laura Lorenz (lauralorenz)
Sunyanan Choochotkaew (sunya-ch)
Swati Gupta (guptaNswati)
Yu Liao (yliaog)

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: Finer-Grained Control Over Container Restarts

Kubernetes Blog - Fri, 08/29/2025 - 14:30

Kubernetes 1.34 introduces a new alpha feature that gives you more granular control over container restarts within a Pod. This feature, named Container Restart Policy and Rules, allows you to specify a restart policy for each container individually, overriding the Pod's global restart policy. It also allows you to conditionally restart individual containers based on their exit codes. This feature is available behind the alpha feature gate ContainerRestartRules.

This has been a long-requested feature. Let's dive into how it works and how you can use it.

The problem with a single restart policy

Before this feature, the restartPolicy was set at the Pod level. This meant that all containers in a Pod shared the same restart policy (Always, OnFailure, or Never). While this works for many use cases, it can be limiting in others.

For example, consider a Pod with a main application container and an init container that performs some initial setup. You might want the main container to always restart on failure, but the init container should only run once and never restart. With a single Pod-level restart policy, this wasn't possible.

Introducing per-container restart policies

With the new ContainerRestartRules feature gate, you can now specify a restartPolicy for each container in your Pod's spec. You can also define restartPolicyRules to control restarts based on exit codes. This gives you the fine-grained control you need to handle complex scenarios.

Use cases

Let's look at some real-life use cases where per-container restart policies can be beneficial.

In-place restarts for training jobs

In ML research, it's common to orchestrate a large number of long-running AI/ML training workloads. In these scenarios, workload failures are unavoidable. When a workload fails with a retriable exit code, you want the container to restart quickly without rescheduling the entire Pod, which consumes a significant amount of time and resources. Restarting the failed container "in-place" is critical for better utilization of compute resources. The container should only restart "in-place" if it failed due to a retriable error; otherwise, the container and Pod should terminate and possibly be rescheduled.

This can now be achieved with container-level restartPolicyRules. The workload can exit with different codes to represent retriable and non-retriable errors. With restartPolicyRules, the workload can be restarted in-place quickly, but only when the error is retriable.

Try-once init containers

Init containers are often used to perform initialization work for the main container, such as setting up environments and credentials. Sometimes, you want the main container to always be restarted, but you don't want to retry initialization if it fails.

With a container-level restartPolicy, this is now possible. The init container can be executed only once, and its failure would be considered a Pod failure. If the initialization succeeds, the main container can be always restarted.

Pods with multiple containers

For Pods that run multiple containers, you might have different restart requirements for each container. Some containers might have a clear definition of success and should only be restarted on failure. Others might need to be always restarted.

This is now possible with a container-level restartPolicy, allowing individual containers to have different restart policies.

How to use it

To use this new feature, you need to enable the ContainerRestartRules feature gate on your Kubernetes cluster control-plane and worker nodes running Kubernetes 1.34+. Once enabled, you can specify the restartPolicy and restartPolicyRules fields in your container definitions.

Here are some examples:

Example 1: Restarting on specific exit codes

In this example, the container should restart if and only if it fails with a retriable error, represented by exit code 42.

To achieve this, the container has restartPolicy: Never, and a restart policy rule that tells Kubernetes to restart the container in-place if it exits with code 42.

apiVersion: v1
kind: Pod
metadata:
  name: restart-on-exit-codes
  annotations:
    kubernetes.io/description: "This Pod restarts the container only when it exits with code 42."
spec:
  restartPolicy: Never
  containers:
  - name: restart-on-exit-codes
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 60 && exit 0']
    restartPolicy: Never  # Container restart policy must be specified if rules are specified
    restartPolicyRules: # Only restart the container if it exits with code 42
    - action: Restart
      exitCodes:
        operator: In
        values: [42]
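The decision the manifest above describes can be modeled in Python as follows. This is a sketch of the behavior as explained in this article, not kubelet code: the function is hypothetical, and it assumes that rules are checked in order, that the first matching rule decides, and that the container-level (or else Pod-level) restartPolicy applies when no rule matches.

```python
def should_restart(exit_code, container_policy, rules, pod_policy="Always"):
    """Decide whether a container that exited with `exit_code` is restarted."""
    for rule in rules or []:
        operator = rule["exitCodes"]["operator"]
        values = rule["exitCodes"]["values"]
        if operator == "In":
            matched = exit_code in values
        elif operator == "NotIn":
            matched = exit_code not in values
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
        if matched:
            return rule["action"] == "Restart"
    # No rule matched: the container-level policy overrides the Pod-level one.
    policy = container_policy or pod_policy
    if policy == "Always":
        return True
    if policy == "OnFailure":
        return exit_code != 0
    return False  # "Never"


rules = [{"action": "Restart", "exitCodes": {"operator": "In", "values": [42]}}]
print(should_restart(42, "Never", rules))  # retriable error
print(should_restart(1, "Never", rules))   # any other failure
```

With `restartPolicy: Never` and this single rule, exit code 42 is the only outcome that leads to an in-place restart.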

Example 2: A try-once init container

In this example, a Pod should always be restarted once the initialization succeeds. However, the initialization should only be tried once.

To achieve this, the Pod has an Always restart policy. The init-once init container will only try once. If it fails, the Pod will fail. This allows the Pod to fail if the initialization failed, but also keep running once the initialization succeeds.

apiVersion: v1
kind: Pod
metadata:
  name: fail-pod-if-init-fails
  annotations:
    kubernetes.io/description: "This Pod has an init container that runs only once. After initialization succeeds, the main container will always be restarted."
spec:
  restartPolicy: Always
  initContainers:
  - name: init-once  # This init container will only try once. If it fails, the Pod will fail.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Failing initialization" && sleep 10 && exit 1']
    restartPolicy: Never
  containers:
  - name: main-container # This container will always be restarted once initialization succeeds.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 1800 && exit 0']

Example 3: Containers with different restart policies

In this example, there are two containers with different restart requirements. One should always be restarted, while the other should only be restarted on failure.

This is achieved by using a different container-level restartPolicy on each of the two containers.

apiVersion: v1
kind: Pod
metadata:
  name: on-failure-pod
  annotations:
    kubernetes.io/description: "This Pod has two containers with different restart policies."
spec:
  containers:
  - name: restart-on-failure
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Not restarting after success" && sleep 10 && exit 0']
    restartPolicy: OnFailure
  - name: restart-always
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Always restarting" && sleep 1800 && exit 0']
    restartPolicy: Always

Learn more

Read the documentation for container restart policy.
Read the KEP for Container Restart Rules.

Roadmap

More actions and signals to restart Pods and containers are coming! Notably, there are plans to add support for restarting the entire Pod. Planning and discussions on these features are in progress. Feel free to share feedback or requests with the SIG Node community!

Your feedback is welcome!

This is an alpha feature, and the Kubernetes project would love to hear your feedback. Please try it out. This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to the SIG Node community!

You can reach SIG Node by several means:

Categories: CNCF Projects, Kubernetes

Kubernetes v1.34: User preferences (kuberc) are available for testing in kubectl 1.34

Kubernetes Blog - Thu, 08/28/2025 - 14:30

Have you ever wished you could enable interactive delete, by default, in kubectl? Or maybe, you'd like to have custom aliases defined, but not necessarily generate hundreds of them manually? Look no further. SIG-CLI has been working hard to add user preferences to kubectl, and we are happy to announce that this functionality is reaching beta as part of the Kubernetes v1.34 release.

How it works

A full description of this functionality is available in our official documentation, but this blog post will answer both of the questions from the beginning of this article.

Before we dive into details, let's quickly cover what the user preferences file looks like and where to place it. By default, kubectl looks for a kuberc file in your default kubeconfig directory, which is $HOME/.kube. Alternatively, you can specify its location using the --kuberc option or the KUBERC environment variable.

Just like every Kubernetes manifest, a kuberc file starts with an apiVersion and kind:

apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
# the user preferences will follow here

Defaults

Let's start by setting default values for kubectl command options. Our goal is to always use interactive delete, which means we want the --interactive option for kubectl delete to always be set to true. This can be achieved with the following addition to our kuberc file:

defaults:
- command: delete
  options:
  - name: interactive
    default: "true"

In the above example, I'm introducing the defaults section, which allows users to define default values for kubectl options. In this case, we're setting the interactive option for kubectl delete to be true by default. This default can be overridden if a user explicitly provides a different value such as kubectl delete --interactive=false, in which case the explicit option takes precedence.
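The precedence rule described above (explicit options win over defaults) can be modeled in a few lines of Python. This is an illustrative sketch, not kubectl's implementation; the function name and the dictionary representation of options are assumptions.

```python
def effective_options(command, explicit, defaults):
    """Merge kuberc-style defaults with explicitly provided options.

    defaults: parsed `defaults` section (list of {"command", "options"} dicts).
    explicit: options the user typed on the command line, as a dict.
    """
    merged = {}
    for entry in defaults:
        if entry["command"] == command:
            for option in entry["options"]:
                merged[option["name"]] = option["default"]
    # Anything the user provided explicitly overrides the default.
    merged.update(explicit)
    return merged


defaults = [
    {"command": "delete", "options": [{"name": "interactive", "default": "true"}]},
]

print(effective_options("delete", {}, defaults))
print(effective_options("delete", {"interactive": "false"}, defaults))
```

Defaults are scoped to a command, so the `delete` entry has no effect on any other kubectl command.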

Another default that SIG-CLI highly encourages is using Server-Side Apply. To do so, you can add the following snippet to your preferences:

# continuing defaults section
- command: apply
  options:
  - name: server-side
    default: "true"

Aliases

The ability to define aliases allows us to save precious seconds when typing commands. I bet that you most likely have one defined for kubectl, because typing seven letters is definitely longer than just pressing k.

For this reason, the ability to define aliases was a must-have when we decided to implement user preferences, alongside defaulting. To define an alias for any of the built-in commands, expand your kuberc file with the following addition:

aliases:
- name: gns
  command: get
  prependArgs:
  - namespace
  options:
  - name: output
    default: json

There's a lot going on above, so let me break this down. First, we're introducing a new section: aliases. Here, we're defining a new alias gns, which is mapped to the get command. Next, we're defining arguments (the namespace resource) that will be inserted right after the command name. Additionally, we're setting the --output=json option for this alias. The structure of the options block is identical to the one in the defaults section.

You probably noticed that we've introduced a mechanism for prepending arguments, and you might wonder if there is a complementary setting for appending them (in other words, adding to the end of the command, after user-provided arguments). This can be achieved through the appendArgs block, which is presented below:

# continuing aliases section
- name: runx
  command: run
  options:
  - name: image
    default: busybox
  - name: namespace
    default: test-ns
  appendArgs:
  - --
  - custom-arg

Here, we're introducing another alias: runx, which invokes the kubectl run command, passing the --image and --namespace options with predefined values, and also appending -- and custom-arg at the end of the invocation.
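
Putting the fragments above together, a complete kuberc file might look like the following sketch. The apiVersion and kind lines are an assumption about the beta schema shipped with kubectl v1.34, and the pod name test-pod in the comments is purely illustrative; check both against the official documentation before relying on them.

```yaml
# Assumed header for the beta schema; verify against your kubectl version.
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
defaults:
# kubectl delete always asks for confirmation unless --interactive=false is given
- command: delete
  options:
  - name: interactive
    default: "true"
# kubectl apply uses Server-Side Apply unless --server-side=false is given
- command: apply
  options:
  - name: server-side
    default: "true"
aliases:
# "kubectl gns" is meant to behave like "kubectl get namespace --output=json"
- name: gns
  command: get
  prependArgs:
  - namespace
  options:
  - name: output
    default: json
# "kubectl runx test-pod" is meant to behave like
# "kubectl run test-pod --image=busybox --namespace=test-ns -- custom-arg"
- name: runx
  command: run
  options:
  - name: image
    default: busybox
  - name: namespace
    default: test-ns
  appendArgs:
  - --
  - custom-arg
```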

Debugging

We hope that kubectl user preferences will open up new possibilities for our users. Whenever you're in doubt, feel free to run kubectl with increased verbosity. At -v=5, you should get all the possible debugging information from this feature, which will be crucial when reporting issues.

To learn more, I encourage you to read through our official documentation and the actual proposal.

Get involved

The kubectl user preferences feature has reached beta, and we are very interested in your feedback. We'd love to hear what you like about it and what problems you'd like to see it solve. Feel free to join the SIG-CLI Slack channel, or open an issue against the kubectl repository. You can also join us at our community meetings, which happen every other Wednesday, and share your stories with us.

Categories: CNCF Projects, Kubernetes

Metal3.io becomes a CNCF incubating project

CNCF Blog Projects Category - Wed, 08/27/2025 - 15:00

The CNCF Technical Oversight Committee (TOC) has voted to accept Metal3.io as a CNCF incubating project. Metal3.io joins a growing ecosystem of technologies tackling real-world challenges at the edge of cloud native infrastructure.

What is Metal3.io?

The Metal3.io project (pronounced: “Metal Kubed”) provides components for bare metal host management with Kubernetes. You can enroll your bare metal machines, provision operating system images, and then, if you like, deploy Kubernetes clusters to them. From there, operating and upgrading your Kubernetes clusters can be handled by Metal3.io. Moreover, Metal3.io is itself a Kubernetes application, so it runs on Kubernetes and uses Kubernetes resources and APIs as its interface.

Metal3.io is also one of the providers for the Kubernetes subproject Cluster API. Cluster API provides infrastructure-agnostic Kubernetes lifecycle management, and Metal3.io brings the bare metal implementation.

Key Milestones and Ecosystem Growth

The project was started in 2019 by Red Hat and was quickly joined by Ericsson. Metal3.io then joined the CNCF sandbox in September 2020.

Metal3.io has steadily matured and grown during the sandbox phase, with:

57 active contributing organizations, led by Ericsson and Red Hat.
An active community organizing weekly online meetings with working group updates, issue triaging, design discussions, etc.
Organizations such as Fujitsu, Ikea, SUSE, Ericsson, and Red Hat among the growing list of adopters.
New features and API iterations, including IP address management, node reuse, firmware settings, and updates management both at provisioning time and on day 2, as well as remediation for the bare metal hosts.
A new operator, the Ironic Standalone Operator, introduced to replace the shell-based deployment method for Ironic.
Robust security processes, including regular scans of dependencies, a vulnerability disclosure process, and automated dependency updates.

Integrations Across the Cloud Native Landscape

Metal3.io connects seamlessly with many CNCF projects, including:

Kubernetes: Metal3.io builds on the success of Kubernetes and makes use of CustomResourceDefinitions
Cluster API: Turn the bare metal servers into Kubernetes clusters
Cert-manager: Certificates for webhooks, etc.
Ironic: Handles the hardware for Metal3.io by interacting with baseboard management controllers
Prometheus: Metal3.io exposes metrics in a format that Prometheus can scrape

Technical Components

Baremetal Operator (BMO): Exposes parts of the Ironic API as a Kubernetes native API
Cluster API Provider Metal³ (CAPM3): Provides integration with Cluster API
IP Address Manager (IPAM): Handles IP addresses and pools
Ironic Standalone Operator (IrSO): Makes it easy to deploy Ironic on Kubernetes
Ironic-Image: Container image for Ironic

Community Highlights

1523 GitHub Stars
8368 merged pull requests
1434 issues
186 contributors
187 Releases

Maintainer Perspective

“As a maintainer of the Metal3.io project, I’m proud of its growth towards becoming one of the leading solutions for running Kubernetes on bare metal. I take pride in how it has evolved beyond provisioning bare metal only to support broader lifecycle needs, ensuring users can sustain and operate their bare metal deployments effectively. Equally rewarding has been seeing the community come together to establish strong processes and governance, positioning Metal3.io for CNCF incubation.”

—Kashif Khan, Maintainer, Metal3.io

“Metal3.io is a testament to the power of collaboration across open source communities. It marries the battle-tested hardware support of the Ironic project with the Kubernetes API paradigm, using a lightweight Kubernetes-native deployment model. I am delighted to see it begin incubation with CNCF. I have no doubt that the forum the Metal3.io project provides will continue to drive progress in integration between Kubernetes and bare metal.”

—Zane Bitter, Maintainer, Metal3.io

From the TOC

“Metal3.io addresses a critical need for cloud native infrastructure by making bare metal as manageable and Kubernetes-native as any other platform. The project’s steady growth, technical maturity, and strong integration with the Kubernetes ecosystem made it a clear choice for incubation. We’re excited to support Metal3.io as it continues to empower organizations deploying Kubernetes at the edge and beyond.”

— Ricardo Rocha, TOC Sponsor

Looking Ahead

Metal3.io’s roadmap for 2025 includes:

New API revisions for CAPM3, BMO, and IPAM
Maturing IPAM as a Cluster API IPAM provider
Multi-tenancy support
Support for architectures other than x86_64, e.g., ARM
Improving DHCP-less provisioning
Simplifying Ironic deployment with IrSO

As a CNCF-hosted project, Metal3.io is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. Metal3.io joins incubating technologies ArtifactHUB, Backstage, Buildpacks, Chaos Mesh, Cloud Custodian, Container Network Interface (CNI), Contour, Cortex, Crossplane, Dragonfly, Emissary-Ingress, Flatcar, gRPC, Karmada, Keptn, Keycloak, Knative, Kubeflow, Kubescape, KubeVela, KubeVirt, Kyverno, Litmus, Longhorn, NATS, Notary, OpenCost, OpenFeature, OpenKruise, OpenTelemetry, OpenYurt, Operator Framework, Strimzi, Thanos, Volcano, and wasmCloud. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.

We look forward to seeing how Metal3.io continues to evolve with the backing of the CNCF community.

Learn more: https://www.cncf.io/projects/metal%C2%B3/

Categories: CNCF Projects

Kubernetes v1.34: Of Wind & Will (O' WaW)

Kubernetes Blog - Wed, 08/27/2025 - 14:30

Editors: Agustina Barbetta, Alejandro Josue Leon Bellido, Graziano Casto, Melony Qin, Dipesh Rawat

Similar to previous releases, the release of Kubernetes v1.34 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 58 enhancements. Of those enhancements, 23 have graduated to Stable, 22 have entered Beta, and 13 have entered Alpha.

There are also some deprecations and removals in this release; make sure to read about those.

Release theme and logo

Three bears sail a wooden ship with a flag featuring a paw and a helm symbol on the sail, as wind blows across the ocean

A release powered by the wind around us — and the will within us.

Every release cycle, we inherit winds that we don't really control — the state of our tooling, documentation, and the historical quirks of our project. Sometimes these winds fill our sails, sometimes they push us sideways or die down.

What keeps Kubernetes moving isn't the perfect winds, but the will of our sailors who adjust the sails, man the helm, chart the courses and keep the ship steady. The release happens not because conditions are always ideal, but because of the people who build it, the people who release it, and the bears ^, cats, dogs, wizards, and curious minds who keep Kubernetes sailing strong — no matter which way the wind blows.

This release, Of Wind & Will (O' WaW), honors the winds that have shaped us, and the will that propels us forward.

^ Oh, and you wonder why bears? Keep wondering!

Spotlight on key updates

Kubernetes v1.34 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: The core of DRA is GA

Dynamic Resource Allocation (DRA) enables more powerful ways to select, allocate, share, and configure GPUs, TPUs, NICs and other devices.

Since the v1.30 release, DRA has been based around claiming devices using structured parameters that are opaque to the core of Kubernetes. This enhancement took inspiration from dynamic provisioning for storage volumes. DRA with structured parameters relies on a set of supporting API kinds under resource.k8s.io: ResourceClaim, DeviceClass, ResourceClaimTemplate, and ResourceSlice. It also extends the .spec for Pods with a new resourceClaims field.
The resource.k8s.io/v1 APIs have graduated to stable and are now available by default.
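
As a rough illustration of how these kinds fit together, the sketch below defines a ResourceClaimTemplate and a Pod that references it through the resourceClaims field. The DeviceClass name gpu.example.com, the object names, and the container image are hypothetical; a real cluster needs a DRA driver that publishes matching devices.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          # Hypothetical DeviceClass installed by a DRA driver
          deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-demo
spec:
  # A ResourceClaim is generated for this Pod from the template
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    resources:
      claims:
      - name: gpu   # refers to the entry under spec.resourceClaims
```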

This work was done as part of KEP #4381 led by WG Device Management.

Beta: Projected ServiceAccount tokens for `kubelet` image credential providers

The kubelet credential providers, used for pulling private container images, traditionally relied on long-lived Secrets stored on the node or in the cluster. This approach increased security risks and management overhead, as these credentials were not tied to the specific workload and did not rotate automatically.
To solve this, the kubelet can now request short-lived, audience-bound ServiceAccount tokens for authenticating to container registries. This allows image pulls to be authorized based on the Pod's own identity rather than a node-level credential.
The primary benefit is a significant security improvement. It eliminates the need for long-lived Secrets for image pulls, reducing the attack surface and simplifying credential management for both administrators and developers.

This work was done as part of KEP #4412 led by SIG Auth and SIG Node.

Alpha: Support for KYAML, a Kubernetes dialect of YAML

KYAML aims to be a safer and less ambiguous YAML subset, and was designed specifically for Kubernetes. Starting with kubectl v1.34, you can use KYAML as a new output format for kubectl, whatever version of Kubernetes your cluster runs.

KYAML addresses specific challenges with both YAML and JSON. YAML's significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (for example: "The Norway Bug"). Meanwhile, JSON lacks comment support and has strict requirements for trailing commas and quoted keys.

You can write KYAML and pass it as an input to any version of kubectl, because all KYAML files are also valid YAML. With kubectl v1.34, you can also request KYAML output (as in kubectl get -o kyaml …) by setting the environment variable KUBECTL_KYAML=true. If you prefer, you can still request the output in JSON or YAML format.
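
To give a feel for the style, here is a small hand-written manifest in the KYAML flavor: flow-style braces and brackets instead of significant indentation, double-quoted string values, and comments and trailing commas still allowed. The object name and data values are made up for illustration, and the exact formatting that kubectl emits may differ.

```yaml
---
{
  apiVersion: "v1",
  kind: "ConfigMap",
  metadata: {
    name: "country-settings",
    labels: {
      app: "demo",
    },
  },
  data: {
    # Quoted, so this stays the string "NO" instead of being read as a boolean
    country: "NO",
    replicas: "2",
  },
}
```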

This work was done as part of KEP #5295 led by SIG CLI.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.34 release.

Delayed creation of Job’s replacement Pods

By default, Job controllers create replacement Pods immediately when a Pod starts terminating, causing both Pods to run simultaneously. This can cause resource contention in constrained clusters, where the replacement Pod may struggle to find available nodes until the original Pod fully terminates. The situation can also trigger unwanted cluster autoscaler scale-ups. Additionally, some machine learning frameworks like TensorFlow and JAX require only one Pod per index to run at a time, making simultaneous Pod execution problematic. This feature introduces .spec.podReplacementPolicy in Jobs. You may choose to create replacement Pods only when the Pod is fully terminated (has .status.phase: Failed). To do this, set .spec.podReplacementPolicy: Failed.
Introduced as alpha in v1.28, this feature has graduated to stable in v1.34.
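
A minimal sketch of a Job that opts in to this behavior is shown below. The Job name, the image, and the completion counts are placeholders chosen for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-training
spec:
  completionMode: Indexed
  completions: 4
  parallelism: 4
  # Create a replacement only once the old Pod has reached phase Failed,
  # so at most one Pod per index runs at a time
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:1.0   # placeholder image
```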

This work was done as part of KEP #3939 led by SIG Apps.

Recovery from volume expansion failure

This feature allows users to cancel volume expansions that are unsupported by the underlying storage provider, and retry volume expansion with smaller values that may succeed.
Introduced as alpha in v1.23, this feature has graduated to stable in v1.34.

This work was done as part of KEP #1790 led by SIG Storage.

VolumeAttributesClass for volume modification

VolumeAttributesClass has graduated to stable in v1.34. VolumeAttributesClass is a generic, Kubernetes-native API for modifying volume parameters like provisioned IO. It allows workloads to vertically scale their volumes online to balance cost and performance, if supported by their provider.
Like all new volume features in Kubernetes, this API is implemented via the container storage interface (CSI). Your provisioner-specific CSI driver must support the new ModifyVolume API, which is the CSI side of this feature.
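
The following sketch shows the general shape: a VolumeAttributesClass and a PersistentVolumeClaim that selects it. The driver name, the parameter key, and the StorageClass are hypothetical, since the supported parameters depend entirely on the CSI driver.

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: fast-io
driverName: csi.example.com   # hypothetical CSI driver
parameters:
  iops: "4000"                # parameter keys are driver-specific
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: example-sc   # hypothetical StorageClass
  # Pointing this at a different class asks the driver to modify the volume
  volumeAttributesClassName: fast-io
  resources:
    requests:
      storage: 100Gi
```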

This work was done as part of KEP #3751 led by SIG Storage.

Structured authentication configuration

Kubernetes v1.29 introduced a configuration file format to manage API server client authentication, moving away from the previous reliance on a large set of command-line options. The AuthenticationConfiguration kind allows administrators to support multiple JWT authenticators, CEL expression validation, and dynamic reloading. This change significantly improves the manageability and auditability of the cluster's authentication settings, and it has graduated to stable in v1.34.
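
As a hedged sketch, a file with two JWT authenticators and one CEL validation rule could look like the following. The issuer URLs, audiences, and claim names are invented for illustration, and the v1 apiVersion is an assumption based on the graduation to stable; the file is passed to the API server with the --authentication-config flag.

```yaml
apiVersion: apiserver.config.k8s.io/v1   # assumed stable version
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://issuer-a.example.com    # hypothetical issuer
    audiences:
    - example-cluster
  claimValidationRules:
  # CEL expression evaluated against the token's claims
  - expression: "claims.hd == 'example.com'"
    message: "the hd claim must be example.com"
  claimMappings:
    username:
      claim: sub
      prefix: "issuer-a:"
- issuer:
    url: https://issuer-b.example.com    # second, independent issuer
    audiences:
    - example-cluster
  claimMappings:
    username:
      claim: email
      prefix: "issuer-b:"
```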

This work was done as part of KEP #3331 led by SIG Auth.

Finer-grained authorization based on selectors

Kubernetes authorizers, including webhook authorizers and the built-in node authorizer, can now make authorization decisions based on field and label selectors in incoming requests. When you send list, watch or deletecollection requests with selectors, the authorization layer can now evaluate access with that additional context.

For example, you can write an authorization policy that only allows listing Pods bound to a specific .spec.nodeName. The client (perhaps the kubelet on a particular node) must specify the field selector that the policy requires, otherwise the request is forbidden. This change makes it feasible to set up least privilege rules, provided that the client knows how to conform to the restrictions you set. Kubernetes v1.34 now supports more granular control in environments like per-node isolation or custom multi-tenant setups.

This work was done as part of KEP #4601 led by SIG Auth.

Restrict anonymous requests with fine-grained controls

Instead of fully enabling or disabling anonymous access, you can now configure a strict list of endpoints where unauthenticated requests are allowed. This provides a safer alternative for clusters that rely on anonymous access to health or bootstrap endpoints like /healthz, /readyz, or /livez.

With this feature, accidental RBAC misconfigurations that grant broad access to anonymous users can be avoided without requiring changes to external probes or bootstrapping tools.
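
A minimal sketch of such a configuration, assuming the same AuthenticationConfiguration file that is used for structured authentication, might be:

```yaml
apiVersion: apiserver.config.k8s.io/v1   # assumed stable version
kind: AuthenticationConfiguration
anonymous:
  enabled: true
  # Unauthenticated requests are only accepted on these paths
  conditions:
  - path: /healthz
  - path: /readyz
  - path: /livez
```

With a configuration like this, an anonymous request to any other path should be rejected as unauthenticated, whatever RBAC bindings exist for anonymous users.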

This work was done as part of KEP #4633 led by SIG Auth.

More efficient requeueing through plugin-specific callbacks

The kube-scheduler can now make more accurate decisions about when to retry scheduling Pods that were previously unschedulable. Each scheduling plugin can now register callback functions that tell the scheduler whether an incoming cluster event is likely to make a rejected Pod schedulable again.

This reduces unnecessary retries and improves overall scheduling throughput, especially in clusters using dynamic resource allocation. The feature also lets certain plugins skip the usual backoff delay when it is safe to do so, making scheduling faster in specific cases.

This work was done as part of KEP #4247 led by SIG Scheduling.

Ordered Namespace deletion

Semi-random resource deletion order can create security gaps or unintended behavior, such as Pods persisting after their associated NetworkPolicies are deleted.
This improvement introduces a more structured deletion process for Kubernetes namespaces to ensure secure and deterministic resource removal. By enforcing a structured deletion sequence that respects logical and security dependencies, this approach ensures Pods are removed before other resources.
This feature was introduced in Kubernetes v1.33 and graduated to stable in v1.34. The graduation improves security and reliability by mitigating risks from non-deterministic deletions, including the vulnerability described in CVE-2024-7598.

This work was done as part of KEP #5080 led by SIG API Machinery.

Streaming list responses

Handling large list responses in Kubernetes previously posed a significant scalability challenge. When clients requested extensive resource lists, such as thousands of Pods or Custom Resources, the API server was required to serialize the entire collection of objects into a single, large memory buffer before sending it. This process created substantial memory pressure and could lead to performance degradation, impacting the overall stability of the cluster.
To address this limitation, a streaming encoding mechanism for collections (list responses) has been introduced. For the JSON and Kubernetes Protobuf response formats, that streaming mechanism is automatically active and the associated feature gate is stable. The primary benefit of this approach is the avoidance of large memory allocations on the API server, resulting in a much smaller and more predictable memory footprint. Consequently, the cluster becomes more resilient and performant, especially in large-scale environments where frequent requests for extensive resource lists are common.

This work was done as part of KEP #5116 led by SIG API Machinery.

Resilient watch cache initialization

Watch cache is a caching layer inside kube-apiserver that maintains an eventually consistent cache of cluster state stored in etcd. In the past, issues could occur when the watch cache was not yet initialized during kube-apiserver startup or when it required re-initialization.

To address these issues, the watch cache initialization process has been made more resilient to failures, improving control plane robustness and ensuring controllers and clients can reliably establish watches. This improvement was introduced as beta in v1.31 and is now stable.

This work was done as part of KEP #4568 led by SIG API Machinery and SIG Scalability.

Relaxing DNS search path validation

Previously, the strict validation of a Pod's DNS search path in Kubernetes often created integration challenges in complex or legacy network environments. This restrictiveness could block configurations that were necessary for an organization's infrastructure, forcing administrators to implement difficult workarounds.
To address this, relaxed DNS validation was introduced as alpha in v1.32 and has now graduated to stable in v1.34. A common use case involves Pods that need to communicate with both internal Kubernetes services and external domains. By setting a single dot (.) as the first entry in the searches list of the Pod's .spec.dnsConfig, administrators can prevent the system's resolver from appending the cluster's internal search domains to external queries. This avoids generating unnecessary DNS requests to the internal DNS server for external hostnames, improving efficiency and preventing potential resolution errors.

This work was done as part of KEP #4427 led by SIG Network.

Support for Direct Service Return (DSR) in Windows `kube-proxy`

DSR provides performance optimizations by allowing return traffic routed through load balancers to bypass the load balancer and respond directly to the client, reducing load on the load balancer and improving overall latency. For information on DSR on Windows, read Direct Server Return (DSR) in a nutshell.
Initially introduced in v1.14, this feature has graduated to stable in v1.34.

This work was done as part of KEP #5100 led by SIG Windows.

Sleep action for Container lifecycle hooks

A Sleep action for containers’ PreStop and PostStart lifecycle hooks was introduced to provide a straightforward way to manage graceful shutdowns and improve overall container lifecycle management.
The Sleep action allows containers to pause for a specified duration after starting or before termination. Using a negative or zero sleep duration returns immediately, resulting in a no-op.
The Sleep action was introduced in Kubernetes v1.29, with zero value support added in v1.32. Both features graduated to stable in v1.34.
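
For example, a container could pause for ten seconds before termination so that in-flight traffic can drain. The Pod name, image, and durations below are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sleep-hook-demo
spec:
  # Must be long enough to cover the sleep plus the actual shutdown
  terminationGracePeriodSeconds: 30
  containers:
  - name: web
    image: registry.example.com/web:1.0   # placeholder image
    lifecycle:
      preStop:
        sleep:
          seconds: 10
```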

This work was done as part of KEP #3960 and KEP #4818 led by SIG Node.

Linux node swap support

Historically, the lack of swap support in Kubernetes could lead to workload instability, as nodes under memory pressure often had to terminate processes abruptly. This particularly affected applications with large but infrequently accessed memory footprints and prevented more graceful resource management.

To address this, configurable per-node swap support was introduced in v1.22. It has progressed through alpha and beta stages and has graduated to stable in v1.34. The primary mode, LimitedSwap, allows Pods to use swap within their existing memory limits, providing a direct solution to the problem. By default, the kubelet is configured with NoSwap mode, which means Kubernetes workloads cannot use swap.

This feature improves workload stability and allows for more efficient resource utilization. It enables clusters to support a wider variety of applications, especially in resource-constrained environments, though administrators must consider the potential performance impact of swapping.
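
A sketch of the kubelet configuration that opts a node in to LimitedSwap is shown below; it assumes swap has already been provisioned on the node's operating system.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allow the kubelet to start on a node that has swap enabled
failSwapOn: false
memorySwap:
  # The default is NoSwap; LimitedSwap lets eligible Pods use swap
  swapBehavior: LimitedSwap
```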

This work was done as part of KEP #2400 led by SIG Node.

Allow special characters in environment variables

The environment variable validation rules in Kubernetes have been relaxed to allow nearly all printable ASCII characters in variable names, excluding =. This change supports scenarios where workloads require nonstandard characters in variable names, for example, frameworks like .NET Core that use : to represent nested configuration keys.

The relaxed validation applies to environment variables defined directly in Pod spec, as well as those injected using envFrom references to ConfigMaps and Secrets.
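
As an illustration, a Pod could now declare a colon-separated variable name directly. The Pod name, image, and the particular configuration key are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dotnet-config-demo
spec:
  containers:
  - name: api
    image: registry.example.com/dotnet-api:1.0   # placeholder image
    env:
    # A name containing ":" was rejected by the older, stricter validation
    - name: "Logging:LogLevel:Default"
      value: "Warning"
```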

This work was done as part of KEP #4369 led by SIG Node.

Taint management is separated from Node lifecycle

Historically, the TaintManager's logic for evicting Pods from tainted nodes was tightly coupled with the node lifecycle controller, which applies NoSchedule and NoExecute taints to nodes based on their condition (NotReady, Unreachable, etc.). This tight coupling made the code harder to maintain and test, and it also limited the flexibility of the taint-based eviction mechanism. This KEP refactors the TaintManager into its own separate controller within the Kubernetes controller manager. It is an internal architectural improvement designed to increase code modularity and maintainability. This change allows the logic for taint-based evictions to be tested and evolved independently, but it has no direct user-facing impact on how taints are used.

This work was done as part of KEP #3902 led by SIG Scheduling and SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.34 release.

Pod-level resource requests and limits

Defining resource needs for Pods with multiple containers has been challenging, as requests and limits could only be set on a per-container basis. This forced developers to either over-provision resources for each container or meticulously divide the total desired resources, making configuration complex and often leading to inefficient resource allocation. To simplify this, the ability to specify resource requests and limits at the Pod level was introduced. This allows developers to define an overall resource budget for a Pod, which is then shared among its constituent containers. This feature was introduced as alpha in v1.32 and has graduated to beta in v1.34, with HPA now supporting pod-level resource specifications.

The primary benefit is a more intuitive and straightforward way to manage resources for multi-container Pods. It ensures that the total resources used by all containers do not exceed the Pod's defined limits, leading to better resource planning, more accurate scheduling, and more efficient utilization of cluster resources.
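
A minimal sketch of a two-container Pod that shares one Pod-level budget could look like this; the names, images, and quantities are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources-demo
spec:
  # Budget for the whole Pod, shared by all containers below
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
  containers:
  - name: app
    image: registry.example.com/app:1.0       # placeholder image
  - name: sidecar
    image: registry.example.com/sidecar:1.0   # placeholder image
```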

This work was done as part of KEP #2837 led by SIG Scheduling and SIG Autoscaling.

`.kuberc` file for `kubectl` user preferences

A .kuberc configuration file allows you to define preferences for kubectl, such as default options and command aliases. Unlike the kubeconfig file, the .kuberc configuration file does not contain cluster details, usernames or passwords.
This feature was introduced as alpha in v1.33, gated behind the environment variable KUBECTL_KUBERC. It has graduated to beta in v1.34 and is enabled by default.

This work was done as part of KEP #3104 led by SIG CLI.

External ServiceAccount token signing

Traditionally, Kubernetes manages ServiceAccount tokens using static signing keys that are loaded from disk at kube-apiserver startup. This feature introduces an ExternalJWTSigner gRPC service for out-of-process signing, enabling Kubernetes distributions to integrate with external key management solutions (for example, HSMs, cloud KMSes) for ServiceAccount token signing instead of static disk-based keys.

Introduced as alpha in v1.32, this external JWT signing capability advances to beta and is enabled by default in v1.34.

This work was done as part of KEP #740 led by SIG Auth.

DRA features in beta

Admin access for secure resource monitoring

DRA supports controlled administrative access via the adminAccess field in ResourceClaims or ResourceClaimTemplates, allowing cluster operators to access devices already in use by others for monitoring or diagnostics. This privileged mode is limited to users authorized to create such objects in namespaces labeled resource.k8s.io/admin-access: "true", ensuring regular workloads remain unaffected. Graduating to beta in v1.34, this feature provides secure introspection capabilities while preserving workload isolation through namespace-based authorization checks.
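
The sketch below shows one way this could be set up: a namespace carrying the required label and a ResourceClaim in it that requests admin access. The namespace, claim, and DeviceClass names are hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: device-monitoring
  labels:
    # Only namespaces with this label may hold admin-access claims
    resource.k8s.io/admin-access: "true"
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-inspection
  namespace: device-monitoring
spec:
  devices:
    requests:
    - name: all-gpus
      exactly:
        deviceClassName: gpu.example.com   # hypothetical DeviceClass
        allocationMode: All
        # Grants access to devices even if they are in use by other claims
        adminAccess: true
```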

This work was done as part of KEP #5018 led by WG Device Management and SIG Auth.

Prioritized alternatives in ResourceClaims and ResourceClaimTemplates

While a workload might run best on a single high-performance GPU, it might also be able to run on two mid-level GPUs.
With the feature gate DRAPrioritizedList (now enabled by default), ResourceClaims and ResourceClaimTemplates get a new field named firstAvailable. This field is an ordered list that allows users to specify that a request may be satisfied in different ways, including allocating nothing at all if specific hardware is not available. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available in the cluster.
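
The GPU scenario above could be expressed roughly as follows. The DeviceClass names and subrequest names are invented for illustration.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-with-fallback
spec:
  spec:
    devices:
      requests:
      - name: gpu
        # Alternatives are tried in order; the first that can be satisfied wins
        firstAvailable:
        - name: one-large
          deviceClassName: large-gpu.example.com    # hypothetical DeviceClass
        - name: two-medium
          deviceClassName: medium-gpu.example.com   # hypothetical DeviceClass
          allocationMode: ExactCount
          count: 2
```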

This work was done as part of KEP #4816 led by WG Device Management.

The `kubelet` reports allocated DRA resources

The kubelet's API has been updated to report on Pod resources allocated through DRA. This allows node monitoring agents to discover the allocated DRA resources for Pods on a node. Additionally, it enables node components to use the PodResourcesAPI and leverage this DRA information when developing new features and integrations.
Starting from Kubernetes v1.34, this feature is enabled by default.

This work was done as part of KEP #3695 led by WG Device Management.

`kube-scheduler` non-blocking API calls

The kube-scheduler makes blocking API calls during scheduling cycles, creating performance bottlenecks. This feature introduces asynchronous API handling through a prioritized queue system with request deduplication, allowing the scheduler to continue processing Pods while API operations complete in the background. Key benefits include reduced scheduling latency, prevention of scheduler thread starvation during API delays, and immediate retry capability for unschedulable Pods. The implementation maintains backward compatibility and adds metrics for monitoring pending API operations.

This work was done as part of KEP #5229 led by SIG Scheduling.

Mutating admission policies

MutatingAdmissionPolicies offer a declarative, in-process alternative to mutating admission webhooks. This feature leverages CEL's object instantiation and JSON Patch strategies, combined with Server Side Apply’s merge algorithms.
This significantly simplifies admission control by allowing administrators to define mutation rules directly in the API server.
Introduced as alpha in v1.32, mutating admission policies have graduated to beta in v1.34.

This work was done as part of KEP #3962 led by SIG API Machinery.

Snapshottable API server cache

The kube-apiserver's caching mechanism (watch cache) efficiently serves requests for the latest observed state. However, list requests for previous states (for example, via pagination or by specifying a resourceVersion) often bypass this cache and are served directly from etcd. This direct etcd access significantly increases performance costs and can lead to stability issues, particularly with large resources, due to memory pressure from transferring large data blobs.
With the ListFromCacheSnapshot feature gate enabled by default, kube-apiserver will attempt to serve the response from a snapshot if one is available with a resourceVersion older than the one requested. The kube-apiserver starts with no snapshots, creates a new snapshot on every watch event, and keeps them until it detects that etcd has been compacted or the cache is full with events older than 75 seconds. If the provided resourceVersion is unavailable, the server will fall back to etcd.

This work was done as part of KEP #4988 led by SIG API Machinery.

Tooling for declarative validation of Kubernetes-native types

Prior to this release, validation rules for the APIs built into Kubernetes were written entirely by hand, which made them difficult for maintainers to discover, understand, improve or test. There was no single way to find all the validation rules that might apply to an API. Declarative validation benefits Kubernetes maintainers by making API development, maintenance, and review easier while enabling programmatic inspection for better tooling and documentation. For people using Kubernetes libraries to write their own code (for example: a controller), the new approach streamlines adding new fields through IDL tags, rather than complex validation functions. This change helps speed up API creation by automating validation boilerplate, and provides more relevant error messages by performing validation on versioned types.
This enhancement (which graduated to beta in v1.33 and continues as beta in v1.34) brings CEL-based validation rules to native Kubernetes types. It allows for more granular and declarative validation to be defined directly in the type definitions, improving API consistency and developer experience.

This work was done as part of KEP #5073 led by SIG API Machinery.

Streaming informers for list requests

The streaming informers feature, which has been in beta since v1.32, gains further beta refinements in v1.34. This capability allows list requests to return data as a continuous stream of objects from the API server’s watch cache, rather than assembling paged results directly from etcd. By reusing the same mechanics used for watch operations, the API server can serve large datasets while keeping memory usage steady and avoiding allocation spikes that can affect stability.

In this release, the kube-apiserver and kube-controller-manager both take advantage of the new WatchList mechanism by default. For the kube-apiserver, this means list requests are streamed more efficiently, while the kube-controller-manager benefits from a more memory-efficient and predictable way to work with informers. Together, these improvements reduce memory pressure during large list operations and improve reliability under sustained load.
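
To make the difference from a paged list concrete, here is a minimal sketch of a client consuming a stream of individual objects. The event shapes are simplified stand-ins, and the function names are hypothetical; the sketch assumes the server marks the end of the initial state with a bookmark event, and it does not use any real client library.

```python
def stream_initial_state(objects, resource_version):
    """Yield one event per object, then a bookmark marking the end of the initial state."""
    for obj in objects:
        yield {"type": "ADDED", "object": obj}
    yield {"type": "BOOKMARK", "resourceVersion": resource_version}


def sync_store(events):
    """Build a local store from the stream, one object at a time."""
    store = {}
    for event in events:
        if event["type"] == "BOOKMARK":
            # The initial state is complete; later events would be ordinary watch events.
            return store, event["resourceVersion"]
        obj = event["object"]
        store[obj["name"]] = obj
    raise RuntimeError("stream ended before the initial state was complete")
```

Because the producer is a generator, only one object is in flight at a time, which is the property that keeps memory usage steady for large collections.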

This work was done as part of KEP #3157 led by SIG API Machinery and SIG Scalability.

Graceful node shutdown handling for Windows nodes

The kubelet on Windows nodes can now detect system shutdown events and begin graceful termination of running Pods. This mirrors existing behavior on Linux and helps ensure workloads exit cleanly during planned shutdowns or restarts.
When the system begins shutting down, the kubelet reacts by using standard termination logic. It respects the configured lifecycle hooks and grace periods, giving Pods time to stop before the node powers off. The feature relies on Windows pre-shutdown notifications to coordinate this process. This enhancement improves workload reliability during maintenance, restarts, or system updates. It is now in beta and enabled by default.

This work was done as part of KEP #4802 led by SIG Windows.

In-place Pod resize improvements

Graduated to beta and enabled by default in v1.33, in-place Pod resizing receives further improvements in v1.34. These include support for decreasing memory usage and integration with Pod-level resources.

This feature remains in beta in v1.34. For detailed usage instructions and examples, refer to the documentation: Resize CPU and Memory Resources assigned to Containers.

This work was done as part of KEP #1287 led by SIG Node and SIG Autoscaling.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.34 release.

Pod certificates for mTLS authentication

Authenticating workloads within a cluster, especially for communication with the API server, has primarily relied on ServiceAccount tokens. While effective, these tokens aren't always ideal for establishing a strong, verifiable identity for mutual TLS (mTLS) and can present challenges when integrating with external systems that expect certificate-based authentication.
Kubernetes v1.34 introduces a built-in mechanism for Pods to obtain X.509 certificates via PodCertificateRequests. The kubelet can request and manage certificates for Pods, which can then be used to authenticate to the Kubernetes API server and other services using mTLS. The primary benefit is a more robust and flexible identity mechanism for Pods. It provides a native way to implement strong mTLS authentication without relying solely on bearer tokens, aligning Kubernetes with standard security practices and simplifying integrations with certificate-aware observability and security tooling.

This work was done as part of KEP #4317 led by SIG Auth.

"Restricted" Pod security standard now forbids remote probes

The host field within probes and lifecycle handlers allows users to specify an entity other than the podIP for the kubelet to probe. However, this opens up a route for misuse and for attacks that bypass security controls, since the host field could be set to any value, including security sensitive external hosts, or localhost on the node. In Kubernetes v1.34, Pods only meet the Restricted Pod security standard if they either leave the host field unset, or if they don't even use this kind of probe. You can use Pod security admission, or a third party solution, to enforce that Pods meet this standard. Because these are security controls, check the documentation to understand the limitations and behavior of the enforcement mechanism you choose.
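
As an illustration of the rule, the following sketch scans a Pod manifest (already parsed into a Python dictionary) for probes and lifecycle handlers that set a host. It is not the Pod security admission implementation; the function names are hypothetical, and the check only covers the httpGet and tcpSocket handler types.

```python
PROBE_FIELDS = ("livenessProbe", "readinessProbe", "startupProbe")
LIFECYCLE_HOOKS = ("postStart", "preStop")
HANDLERS_WITH_HOST = ("httpGet", "tcpSocket")


def _handler_sets_host(handler):
    # A probe or lifecycle handler is a mapping such as {"httpGet": {...}}.
    return any((handler.get(kind) or {}).get("host") for kind in HANDLERS_WITH_HOST)


def host_field_violations(pod):
    """Return "container.field" paths whose handler sets a non-empty host."""
    spec = pod.get("spec", {})
    violations = []
    for group in ("containers", "initContainers", "ephemeralContainers"):
        for container in spec.get(group) or []:
            handlers = {name: container.get(name) for name in PROBE_FIELDS}
            lifecycle = container.get("lifecycle") or {}
            for hook in LIFECYCLE_HOOKS:
                handlers[f"lifecycle.{hook}"] = lifecycle.get(hook)
            for path, handler in handlers.items():
                if handler and _handler_sets_host(handler):
                    violations.append(f"{container['name']}.{path}")
    return violations
```

An empty result means that no probe or lifecycle handler in the Pod sets the host field; each entry in a non-empty result points at a handler to review.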

This work was done as part of KEP #4940 led by SIG Auth.

Use `.status.nominatedNodeName` to express Pod placement

When the kube-scheduler takes time to bind Pods to Nodes, cluster autoscalers may not understand that a Pod will be bound to a specific Node. Consequently, they may mistakenly consider the Node as underutilized and delete it.
To address this issue, the kube-scheduler can use .status.nominatedNodeName not only to indicate ongoing preemption but also to express Pod placement intentions. By enabling the NominatedNodeNameForExpectation feature gate, the scheduler uses this field to indicate where a Pod will be bound. This exposes internal reservations to help external components make informed decisions.
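
From the point of view of an external component, the idea can be sketched as follows. This is a hypothetical helper, not code from any cluster autoscaler: it treats a node as a scale-down candidate only if no Pod is bound to it (.spec.nodeName) and no Pod is nominated for it (.status.nominatedNodeName).

```python
def nodes_with_expected_pods(pods):
    """Collect nodes that have a Pod bound to them or nominated for them."""
    reserved = set()
    for pod in pods:
        bound = pod.get("spec", {}).get("nodeName")
        nominated = pod.get("status", {}).get("nominatedNodeName")
        if bound or nominated:
            reserved.add(bound or nominated)
    return reserved


def scale_down_candidates(node_names, pods):
    """Nodes with no bound and no nominated Pods, in the original order."""
    reserved = nodes_with_expected_pods(pods)
    return [name for name in node_names if name not in reserved]
```

A real autoscaler weighs many more signals, such as utilization and disruption budgets; the sketch only shows how the nominated node can be taken into account.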

This work was done as part of KEP #5278 led by SIG Scheduling.

DRA features in alpha

Resource health status for DRA

It can be difficult to know when a Pod is using a device that has failed or is temporarily unhealthy, which makes troubleshooting Pod crashes challenging or impossible.
Resource Health Status for DRA improves observability by exposing the health status of devices allocated to a Pod in the Pod’s status. This makes it easier to identify the cause of Pod issues related to unhealthy devices and respond appropriately.
To enable this functionality, the ResourceHealthStatus feature gate must be enabled, and the DRA driver must implement the DRAResourceHealth gRPC service.

This work was done as part of KEP #4680 led by WG Device Management.

Extended resource mapping

Extended resource mapping provides a simpler alternative to DRA's expressive and flexible approach by offering a straightforward way to describe resource capacity and consumption. This feature enables cluster administrators to advertise DRA-managed resources as extended resources, allowing application developers and operators to continue using the familiar container’s .spec.resources syntax to consume them.
This enables existing workloads to adopt DRA without modifications, simplifying the transition to DRA for both application developers and cluster administrators.

This work was done as part of KEP #5004 led by WG Device Management.

DRA consumable capacity

Kubernetes v1.33 added support for resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource. However, this approach couldn't handle scenarios where device drivers manage fine-grained, dynamic portions of a device resource based on user demand, or share those resources independently of ResourceClaims, which are restricted by their spec and namespace.
Enabling the DRAConsumableCapacity feature gate (introduced as alpha in v1.34) allows resource drivers to share the same device, or even a slice of a device, across multiple ResourceClaims or across multiple DeviceRequests. The feature also extends the scheduler to support allocating portions of device resources, as defined in the capacity field. This DRA feature improves device sharing across namespaces and claims, tailoring it to Pod needs. It enables drivers to enforce capacity limits, enhances scheduling, and supports new use cases like bandwidth-aware networking and multi-tenant sharing.
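
The bookkeeping behind sharing one device across several claims can be sketched in a few lines. The SharedDevice class below is purely illustrative and is not part of the DRA API; the names, the single numeric capacity, and the one-allocation-per-claim rule are assumptions made for the example.

```python
class SharedDevice:
    """A device whose capacity can be split across several claims."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.allocations = {}  # claim name -> amount consumed

    @property
    def remaining(self):
        return self.capacity - sum(self.allocations.values())

    def allocate(self, claim, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        # Reject duplicate claims and requests that exceed the remaining capacity.
        if claim in self.allocations or amount > self.remaining:
            return False
        self.allocations[claim] = amount
        return True

    def release(self, claim):
        self.allocations.pop(claim, None)
```

With a capacity of 10 units, a first claim for 6 succeeds, a second claim for 6 is rejected, and a second claim for 4 fits, which mirrors the kind of capacity limit a driver might enforce for bandwidth-aware networking.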

This work was done as part of KEP #5075 led by WG Device Management.

Device binding conditions

The Kubernetes scheduler gets more reliable by delaying binding a Pod to a Node until its required external resources, such as attachable devices or FPGAs, are confirmed to be ready.
This delay mechanism is implemented in the PreBind phase of the scheduling framework. During this phase, the scheduler checks whether all required device conditions are satisfied before proceeding with binding. This enables coordination with external device controllers, ensuring more robust, predictable scheduling.

This work was done as part of KEP #5007 led by WG Device Management.

Container restart rules

Currently, all containers within a Pod follow the same .spec.restartPolicy when they exit or crash. However, Pods that run multiple containers might have different restart requirements for each container. For example, for init containers used to perform initialization, you may not want to retry initialization if they fail. Similarly, in ML research environments with long-running training workloads, containers that fail with retriable exit codes should restart quickly in place, rather than triggering Pod recreation and losing progress.
Kubernetes v1.34 introduces the ContainerRestartRules feature gate. When enabled, a restartPolicy can be specified for each container within a Pod. A restartPolicyRules list can also be defined to override restartPolicy based on the last exit code. This provides the fine-grained control needed to handle complex scenarios and better utilization of compute resources.
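
The decision this enables can be sketched as a small function. This is an illustration, not the kubelet's logic: the rule shape (an action plus an exitCodes matcher with an operator and values) is modeled loosely on the feature, and the assumption that the first matching rule wins is made for the example.

```python
def should_restart(exit_code, restart_policy, rules=()):
    """Decide whether a container restarts, checking rules before the policy."""
    for rule in rules:
        codes = rule["exitCodes"]
        matched = exit_code in codes["values"]
        if codes["operator"] == "NotIn":
            matched = not matched
        if matched:
            # Assumption: the first matching rule overrides restartPolicy.
            return rule["action"] == "Restart"
    if restart_policy == "Always":
        return True
    if restart_policy == "OnFailure":
        return exit_code != 0
    return False  # "Never"
```

With a rule that restarts on exit code 42 and a restartPolicy of Never, a container that exits with 42 is restarted in place, while any other exit code leaves it stopped.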

This work was done as part of KEP #5307 led by SIG Node.

Load environment variables from files created in runtime

Application developers have long requested greater flexibility in declaring environment variables. Traditionally, environment variables are declared on the API server side via static values, ConfigMaps, or Secrets.

Behind the EnvFiles feature gate, Kubernetes v1.34 introduces the ability to declare environment variables at runtime. One container (typically an init container) can generate the variable and store it in a file, and a subsequent container can start with the environment variable loaded from that file. This approach eliminates the need to "wrap" the target container's entry point, enabling more flexible in-Pod container orchestration.
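
The handoff between the two containers can be sketched as a write followed by a keyed read. This is a simplified illustration, not the kubelet's parser: the KEY=VALUE file format, the file name, and the function names are assumptions, and a temporary directory stands in for the volume shared by the containers.

```python
import os
import tempfile


def write_env_file(path, values):
    """What an init container might do: store generated values as KEY=VALUE lines."""
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in values.items():
            handle.write(f"{key}={value}\n")


def read_env_key(path, key):
    """Look up one key, skipping blank lines and # comments."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    raise KeyError(key)


with tempfile.TemporaryDirectory() as shared_volume:
    env_file = os.path.join(shared_volume, "config.env")
    # Values that are only known at runtime, for example a training rank.
    write_env_file(env_file, {"RANK": "3", "MASTER_ADDR": "10.0.0.7"})
    rank = read_env_key(env_file, "RANK")
```

The point of the feature is that the second step happens before the target container starts, so the application sees an ordinary environment variable and needs no wrapper script.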

This feature particularly benefits AI/ML training workloads, where each Pod in a training Job requires initialization with runtime-defined values.

This work was done as part of KEP #3721 led by SIG Node.

Graduations, deprecations, and removals in v1.34

Graduations to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 23 enhancements promoted to stable:

Deprecations and removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.34 includes a couple of deprecations.

Manual cgroup driver configuration is deprecated

Historically, configuring the correct cgroup driver has been a pain point for users running Kubernetes clusters. Kubernetes v1.28 added a way for the kubelet to query the CRI implementation and find which cgroup driver to use. That automated detection is now strongly recommended and support for it has graduated to stable in v1.34. If your CRI container runtime does not support the ability to report the cgroup driver it needs, you should upgrade or change your container runtime. The cgroupDriver configuration setting in the kubelet configuration file is now deprecated. The corresponding command-line option --cgroup-driver was previously deprecated, as Kubernetes recommends using the configuration file instead. Both the configuration setting and command-line option will be removed in a future release; that removal will not happen before the v1.36 minor release.

This work was done as part of KEP #4033 led by SIG Node.

Kubernetes to end containerd 1.x support in v1.36

While Kubernetes v1.34 still supports containerd 1.7 and other LTS releases of containerd, as a consequence of automated cgroup driver detection, the Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.x. The last Kubernetes release to offer this support will be v1.35 (aligned with containerd 1.7 EOL). This is an early warning: if you are using containerd 1.x, consider switching to 2.0+ soon. You can monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated.

This work was done as part of KEP #4033 led by SIG Node.

`PreferClose` traffic distribution is deprecated

The spec.trafficDistribution field within a Kubernetes Service allows users to express preferences for how traffic should be routed to Service endpoints.

KEP-3015 deprecates PreferClose and introduces two additional values: PreferSameZone and PreferSameNode. PreferSameZone is an alias for the existing PreferClose to clarify its semantics. PreferSameNode allows connections to be delivered to a local endpoint when possible, falling back to a remote endpoint when not possible.
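
The preference-with-fallback behavior can be sketched as a filter over the Service's endpoints. This is an illustration only, not kube-proxy's implementation: the endpoint dictionaries and the pick_endpoints function are hypothetical, and the sketch assumes that an unsatisfied preference falls back to the full set of endpoints.

```python
def pick_endpoints(endpoints, client_node, client_zone, preference):
    """Return the endpoints a client would prefer, falling back to all of them."""
    if preference == "PreferSameNode":
        local = [e for e in endpoints if e["node"] == client_node]
        if local:
            return local
    # PreferClose is treated the same as PreferSameZone.
    if preference in ("PreferSameZone", "PreferClose"):
        same_zone = [e for e in endpoints if e["zone"] == client_zone]
        if same_zone:
            return same_zone
    return list(endpoints)
```

For a client on a node that hosts an endpoint, PreferSameNode returns only that local endpoint; on a node without one, the same call returns every endpoint, so traffic is still delivered.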

This feature was introduced in v1.33 behind the PreferSameTrafficDistribution feature gate. It has graduated to beta in v1.34 and is enabled by default.

This work was done as part of KEP #3015 led by SIG Network.

Release notes

Check out the full details of the Kubernetes v1.34 release in our release notes.

Availability

Kubernetes v1.34 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.34 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We honor the memory of Rodolfo "Rodo" Martínez Vega, a dedicated contributor whose passion for technology and community building left a mark on the Kubernetes community. Rodo served as a member of the Kubernetes Release Team across multiple releases, including v1.22-v1.23 and v1.25-v1.30, demonstrating unwavering commitment to the project's success and stability.
Beyond his Release Team contributions, Rodo was deeply involved in fostering the Cloud Native LATAM community, helping to bridge language and cultural barriers in the space. His work on the Spanish version of Kubernetes documentation and the CNCF Glossary exemplified his dedication to making knowledge accessible to Spanish-speaking developers worldwide. Rodo's legacy lives on through the countless community members he mentored, the releases he helped deliver, and the vibrant LATAM Kubernetes community he helped cultivate.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.34 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Vyom Yadav, for guiding us through a successful release cycle, for his hands-on approach to solving challenges, and for bringing the energy and care that drives our community forward.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.34 release cycle, which spanned 15 weeks from 19th May 2025 to 27th August 2025, Kubernetes received contributions from as many as 106 different companies and 491 individuals. In the wider cloud native ecosystem, the figure goes up to 370 companies, counting 2235 total contributors.

Note that a "contribution" is counted when someone makes a commit, performs a code review, creates an issue or PR, reviews a PR (including blogs and documentation), or comments on issues and PRs.
If you are interested in contributing, visit Getting Started on our contributor website.

Source for this data:

Event Update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

August 2025

KCD - Kubernetes Community Days: Colombia: Aug 28, 2025 | Bogotá, Colombia

September 2025

KCD - Kubernetes Community Days: San Francisco Bay Area: Sep 9, 2025 | San Francisco, USA
KCD - Kubernetes Community Days: Washington DC: Sep 16, 2025 | Washington, D.C., USA
KCD - Kubernetes Community Days: Sofia: Sep 18, 2025 | Sofia, Bulgaria
KCD - Kubernetes Community Days: El Salvador: Sep 20, 2025 | San Salvador, El Salvador

October 2025

KCD - Kubernetes Community Days: Warsaw: Oct 9, 2025 | Warsaw, Poland
KCD - Kubernetes Community Days: Edinburgh: Oct 21, 2025 | Edinburgh, United Kingdom
KCD - Kubernetes Community Days: Sri Lanka: Oct 26, 2025 | Colombo, Sri Lanka

November 2025

KCD - Kubernetes Community Days: Porto: Nov 3, 2025 | Porto, Portugal
KubeCon + CloudNativeCon North America 2025: Nov 10-13, 2025 | Atlanta, USA
KCD - Kubernetes Community Days: Hangzhou: Nov 14, 2025 | Hangzhou, China

December 2025

KCD - Kubernetes Community Days: Suisse Romande: Dec 4, 2025 | Geneva, Switzerland

You can find the latest event details here.

Upcoming Release Webinar

Join members of the Kubernetes v1.34 Release Team on Wednesday, September 24th 2025 at 4:00 PM (UTC), to learn about the highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Follow us on Bluesky @Kubernetesio for the latest updates
Join the community discussion on Discuss
Join the community on Slack
Post questions (or answer questions) on Stack Overflow
Share your Kubernetes story
Read more about what’s happening with Kubernetes on the blog
Learn more about the Kubernetes Release Team

Categories: CNCF Projects, Kubernetes
