CNCF Projects

Kubernetes v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager

Kubernetes Blog - Tue, 12/30/2025 - 13:30

Up to and including Kubernetes v1.34, the route controller in Cloud Controller Manager (CCM) implementations built using the k8s.io/cloud-provider library reconciles routes at a fixed interval. This causes unnecessary API requests to the cloud provider when there are no changes to routes. Other controllers implemented through the same library already use watch-based mechanisms, leveraging informers to avoid unnecessary API calls. A new feature gate is being introduced in v1.35 to allow changing the behavior of the route controller to use watch-based informers.

What's new?

The feature gate CloudControllerManagerWatchBasedRoutesReconciliation has been introduced to k8s.io/cloud-provider at the alpha stage by SIG Cloud Provider. To enable this feature, pass --feature-gates=CloudControllerManagerWatchBasedRoutesReconciliation=true to the CCM implementation you are using.

About the feature gate

When enabled, this feature gate triggers the route reconciliation loop whenever a node is added or deleted, or when a node's .spec.podCIDRs or .status.addresses fields are updated.

An additional reconcile is performed at a random interval between 12h and 24h; the interval is chosen when the controller starts.
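The trigger conditions described above can be sketched as a small predicate. This is a minimal illustration in Python, not the Go code in k8s.io/cloud-provider; the Node dataclass and the function names needs_route_reconcile and pick_resync_interval are hypothetical and exist only for this example.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

HOUR = 3600  # seconds


@dataclass
class Node:
    """Hypothetical, trimmed-down view of a Node object."""
    name: str
    pod_cidrs: List[str] = field(default_factory=list)   # mirrors .spec.podCIDRs
    addresses: List[str] = field(default_factory=list)   # mirrors .status.addresses


def needs_route_reconcile(old: Optional[Node], new: Optional[Node]) -> bool:
    """Return True when a node event should trigger route reconciliation."""
    if old is None or new is None:
        # Node added (no old object) or deleted (no new object).
        return True
    # Node updated: only the two route-relevant fields matter.
    return old.pod_cidrs != new.pod_cidrs or old.addresses != new.addresses


def pick_resync_interval(rng: random.Random) -> float:
    """Pick the periodic reconcile interval (in seconds) once, at start-up."""
    return rng.uniform(12 * HOUR, 24 * HOUR)
```

An update that touches only labels or annotations leaves both fields unchanged, so in this model it would not trigger a reconcile.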

This feature gate does not modify the logic within the reconciliation loop. Therefore, users of a CCM implementation should not experience significant changes to their existing route configurations.

How can I learn more?

For more details, refer to KEP-5237.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.35: Introducing Workload Aware Scheduling

Kubernetes Blog - Mon, 12/29/2025 - 13:30

Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod, as it often requires considering all Pods together instead of scheduling each one independently. For example, when scheduling a machine learning batch job, you often need to place each worker strategically, such as on the same rack, to make the entire process as efficient as possible. At the same time, the Pods that are part of such a workload are very often identical from the scheduling perspective, which fundamentally changes how this process should look.

There are many custom schedulers adapted to perform workload scheduling efficiently, but considering how common and important workload scheduling is to Kubernetes users, especially in the AI era with the growing number of use cases, it is high time to make workloads a first-class citizen for kube-scheduler and support them natively.

Workload aware scheduling

The recent 1.35 release of Kubernetes delivered the first tranche of workload aware scheduling improvements. These are part of a wider effort to improve the scheduling and management of workloads. The effort will span many SIGs and releases, and is intended to gradually expand the capabilities of the system toward the north star goal: seamless workload scheduling and management in Kubernetes, including, but not limited to, preemption and autoscaling.

Kubernetes v1.35 introduces the Workload API that you can use to describe the desired shape as well as the scheduling-oriented requirements of the workload. It comes with an initial implementation of gang scheduling that instructs the kube-scheduler to schedule gang Pods in an all-or-nothing fashion. Finally, we improved scheduling of identical Pods (which typically make up a gang) to speed up the process thanks to the opportunistic batching feature.

Workload API

The new Workload API resource is part of the scheduling.k8s.io/v1alpha1 API group. This resource acts as a structured, machine-readable definition of the scheduling requirements of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource determines how a group of Pods should be scheduled and how its placement should be managed throughout its lifecycle.

A Workload allows you to define a group of Pods and apply a scheduling policy to them. Here is what a gang scheduling configuration looks like. You can define a podGroup named workers and apply the gang policy with a minCount of 4.

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4

When you create your Pods, you link them to this Workload using the new workloadRef field:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...

How gang scheduling works

The gang policy enforces all-or-nothing placement. Without gang scheduling, a Job might be partially scheduled, consuming resources without being able to run, leading to resource wastage and potential deadlocks.

When you create Pods that are part of a gang-scheduled pod group, the scheduler's GangScheduling plugin manages the lifecycle independently for each pod group (or replica key):

1. When you create your Pods (or a controller makes them for you), the scheduler blocks them from scheduling until:
   - The referenced Workload object is created.
   - The referenced pod group exists in a Workload.
   - The number of pending Pods in that group meets your minCount.
2. Once enough Pods arrive, the scheduler tries to place them. However, instead of binding them to nodes immediately, the Pods wait at a Permit gate.
3. The scheduler checks if it has found valid assignments for the entire group (at least the minCount).
   - If there is room for the group, the gate opens, and all Pods are bound to nodes.
   - If only a subset of the group's Pods was successfully scheduled within a timeout (set to 5 minutes), the scheduler rejects all of the Pods in the group. They go back to the queue, freeing up the reserved resources for other workloads.
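
The all-or-nothing decision described above can be sketched as follows. This is a simplified model under stated assumptions, not the kube-scheduler's GangScheduling plugin: the function name try_schedule_gang is hypothetical, CPU is the only resource considered, placement is a naive first-fit, and the 5 minute timeout is not modeled.

```python
from typing import Dict, Optional


def try_schedule_gang(
    pod_cpu: Dict[str, int],
    node_free_cpu: Dict[str, int],
    min_count: int,
) -> Optional[Dict[str, str]]:
    """Return pod -> node assignments if at least min_count Pods fit, else None."""
    if len(pod_cpu) < min_count:
        # Not enough pending Pods yet: the group stays blocked.
        return None

    free = dict(node_free_cpu)   # work on a copy: reservations are tentative
    reserved: Dict[str, str] = {}
    for pod, cpu in pod_cpu.items():
        for node, available in free.items():
            if available >= cpu:
                free[node] = available - cpu
                reserved[pod] = node   # this Pod now waits at the permit gate
                break

    if len(reserved) >= min_count:
        return reserved   # gate opens: every reserved Pod is bound
    return None           # reject the whole group; tentative reservations are dropped
```

Because the function only mutates a copy of the node capacities, a rejected gang leaves the cluster state untouched, which mirrors the idea of freeing reserved resources for other workloads.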

We'd like to point out that while this is a first implementation, the Kubernetes project firmly intends to improve and expand the gang scheduling algorithm in future releases. Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang, workload-level preemption, and more, moving towards the north star goal.

Opportunistic batching

In addition to explicit gang scheduling, v1.35 introduces opportunistic batching. This is a Beta feature that improves scheduling latency for identical Pods.

Unlike gang scheduling, this feature does not require the Workload API or any explicit opt-in on the user's part. It works opportunistically within the scheduler by identifying Pods that have identical scheduling requirements (container images, resource requests, affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations for subsequent identical Pods in the queue, significantly speeding up the process.
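
To illustrate the idea of reusing feasibility calculations, here is a minimal sketch. It is not the kube-scheduler implementation: the Signature tuple, the function name schedule_with_batching, and the find_feasible_nodes callback are hypothetical, and the sketch ignores the fact that placing one Pod consumes node resources and can change the result for the next one.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical scheduling signature: (image, CPU request, node selector).
Signature = Tuple[str, int, str]


def schedule_with_batching(
    pods: List[Tuple[str, Signature]],
    find_feasible_nodes: Callable[[Signature], List[str]],
) -> Dict[str, List[str]]:
    """Compute feasible nodes per Pod, reusing results for consecutive identical Pods."""
    results: Dict[str, List[str]] = {}
    last_sig: Optional[Signature] = None
    last_nodes: List[str] = []
    for name, sig in pods:
        if sig != last_sig:
            # Different scheduling requirements: run the expensive computation.
            last_nodes = find_feasible_nodes(sig)
            last_sig = sig
        # Either freshly computed, or reused from the previous identical Pod.
        results[name] = list(last_nodes)
    return results
```

With three identical Pods followed by a different one, the callback runs twice rather than four times.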

Most users will benefit from this optimization automatically, without taking any special steps, provided their Pods meet the following criteria.

Restrictions

Opportunistic batching works under specific conditions. All fields used by the kube-scheduler to find a placement must be identical between Pods. Additionally, using some features disables the batching mechanism for those Pods to ensure correctness.

Note that you may need to review your kube-scheduler configuration to ensure it is not implicitly disabling batching for your workloads.

See the docs for more details about restrictions.

The north star vision

The project has a broad ambition to deliver workload aware scheduling. These new APIs and scheduling enhancements are just the first steps. In the near future, the effort aims to tackle:

Introducing a workload scheduling phase
Improved support for multi-node DRA and topology aware scheduling
Workload-level preemption
Improved integration between scheduling and autoscaling
Improved interaction with external workload schedulers
Managing placement of workloads throughout their entire lifecycle
Multi-workload scheduling simulations

And more. The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.

Getting started

To try the workload aware scheduling improvements:

Workload API: Enable the GenericWorkload feature gate on both kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha1 API group is enabled.
Gang scheduling: Enable the GangScheduling feature gate on kube-scheduler (requires the Workload API to be enabled).
Opportunistic batching: As a Beta feature, it is enabled by default in v1.35. You can disable it using the OpportunisticBatching feature gate on kube-scheduler if needed.

We encourage you to try out workload aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:

Reaching out via Slack (#sig-scheduling).
Commenting on the workload aware scheduling tracking issue
Filing a new issue in the Kubernetes repository.

Learn more

Read the KEPs for Workload API and gang scheduling and Opportunistic batching.
Track the Workload aware scheduling issue for recent updates.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA

Kubernetes Blog - Tue, 12/23/2025 - 13:30

On behalf of Kubernetes SIG Node, we are pleased to announce the graduation of fine-grained supplemental groups control to General Availability (GA) in Kubernetes v1.35!

The new Pod field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature in Kubernetes v1.31, and then graduated to beta in v1.33. Now, the feature is generally available. It allows you to implement more precise control over supplemental groups in Linux containers, which can strengthen your security posture, particularly when accessing volumes. Moreover, it enhances the transparency of UID/GID details in containers, offering improved security oversight.

If you are planning to upgrade your cluster from v1.32 or an earlier version, please be aware that some breaking behavioral changes were introduced in beta (v1.33). For more details, see the behavioral changes introduced in beta and the upgrade considerations sections of the previous blog for graduation to beta.

Motivation: Implicit group memberships defined in `/etc/group` in the container image

Even though the majority of Kubernetes cluster admins/users may not be aware of this, by default Kubernetes merges group information from the Pod with information defined in /etc/group in the container image.

Here's an example: a Pod manifest that specifies spec.securityContext.runAsUser: 1000, spec.securityContext.runAsGroup: 3000 and spec.securityContext.supplementalGroups: 4000 as part of the Pod's security context.

apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

What is the result of the id command in the example-container container? The output should be similar to this:

uid=1000 gid=3000 groups=3000,4000,50000

Where does group ID 50000 in the supplementary groups (the groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is the /etc/group file in the container image.

The /etc/group file in the container image contains something like the following:

user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image

This shows that the container's primary user 1000 belongs to the group 50000 in the last entry.

Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged to the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.

What's wrong with it?

The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details), because file permissions are controlled by UIDs/GIDs in Linux.

Fine-grained supplemental groups control in a Pod: `supplementalGroupsPolicy`

To tackle this problem, a Pod's .spec.securityContext now includes a supplementalGroupsPolicy field.

This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:

Merge: The group membership defined in /etc/group for the container's primary user will be merged. If not specified, this policy will be applied (i.e. as-is behavior for backward compatibility).
Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.

I'll explain how the Strict policy works. The following Pod manifest specifies supplementalGroupsPolicy: Strict:

apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

The result of the id command in the example-container container should be similar to this:

uid=1000 gid=3000 groups=3000,4000

You can see that the Strict policy excludes group 50000 from groups!

Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent the implicit supplementary groups in a Pod.
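
The difference between the two policies can be modeled with a short sketch. This is an illustration only, not how containerd or CRI-O compute groups: the function names are hypothetical, the container's user name is passed in directly rather than being resolved from /etc/passwd, and the ordering of the resulting list is an assumption.

```python
from typing import List, Optional


def image_groups_for_user(etc_group: str, user_name: str) -> List[int]:
    """Return the GIDs of /etc/group entries that list user_name as a member."""
    gids: List[int] = []
    for line in etc_group.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        members = [m for m in parts[3].split(",") if m]
        if user_name in members:
            gids.append(int(parts[2]))
    return gids


def supplementary_groups(
    policy: str,
    run_as_group: int,
    supplemental_groups: List[int],
    fs_group: Optional[int],
    etc_group: str,
    user_name: str,
) -> List[int]:
    """Model the groups attached to the first container process."""
    groups = [run_as_group] + list(supplemental_groups)
    if fs_group is not None:
        groups.append(fs_group)
    if policy == "Merge":
        # Merge also picks up memberships defined in the image's /etc/group.
        groups += image_groups_for_user(etc_group, user_name)
    elif policy != "Strict":
        raise ValueError(f"unknown policy: {policy}")
    return list(dict.fromkeys(groups))  # de-duplicate, keep first-seen order
```

Fed with the /etc/group contents shown earlier, the Merge policy yields 3000, 4000 and 50000, while Strict yields only 3000 and 4000.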

Note:

A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy only affects the initial process identity.

Read on for more details.

Attached process identity in Pod status

This feature also exposes the process identity attached to the first container process of the container via the .status.containerStatuses[].user.linux field. This is helpful for checking whether implicit group IDs are attached.

...
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
...

Note:

Please note that the values in the status.containerStatuses[].user.linux field are the process identity initially attached to the first container process in the container. If the container has sufficient privilege to make system calls related to process identity (e.g. setuid(2), setgid(2) or setgroups(2)), the container process can change its identity. Thus, the actual process identity is dynamic.

There are several ways to restrict these permissions in containers. We suggest the following simple solutions:

setting privileged: false and allowPrivilegeEscalation: false in your container's securityContext, or
conforming your Pod to the Restricted policy in the Pod Security Standards.

Also, the kubelet has no visibility into NRI plugins or the internal workings of the container runtime. A cluster administrator configuring nodes, or a highly privileged workload with the permissions of a local administrator, may change supplemental groups for any Pod. However, this is outside the scope of Kubernetes' control and should not be a concern for security-hardened nodes.

`Strict` policy requires up-to-date container runtimes

The high-level container runtime (e.g. containerd, CRI-O) plays a key role in calculating the supplementary group IDs that will be attached to the containers. Thus, supplementalGroupsPolicy: Strict requires a CRI runtime that supports this feature. The old behavior (supplementalGroupsPolicy: Merge) can work with a CRI runtime that does not support this feature, because this policy is fully backward compatible.

Here are some CRI runtimes that support this feature, and the versions you need to be running:

containerd: v2.0 or later
CRI-O: v1.31 or later

You can also check whether the feature is supported via the Node's .status.features.supplementalGroupsPolicy field. Please note that this field is different from status.declaredFeatures, introduced in KEP-5328: Node Declared Features (formerly Node Capabilities).

apiVersion: v1
kind: Node
...
status:
  features:
    supplementalGroupsPolicy: true
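
If you want to check this field programmatically, a minimal sketch could look like the following. It assumes the Node object has already been loaded into a dictionary (for example from the JSON output of kubectl get node); the function name supports_supplemental_groups_policy is hypothetical.

```python
from typing import Any, Dict


def supports_supplemental_groups_policy(node: Dict[str, Any]) -> bool:
    """Return True only if the Node explicitly reports support for the feature."""
    status = node.get("status") or {}
    features = status.get("features") or {}
    # A missing field is treated as "not supported", which is the safe default.
    return features.get("supplementalGroupsPolicy") is True
```

Treating an absent field as unsupported avoids scheduling Strict Pods onto nodes whose runtime predates the feature.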

As container runtimes support this feature universally, various security policies may start enforcing the Strict behavior as the more secure option. It is a best practice to ensure that your Pods are ready for this enforcement and that all supplemental groups are transparently declared in the Pod spec, rather than in images.

Getting involved

This enhancement was driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

How can I learn more?

Configure a Security Context for a Pod or Container for the further details of supplementalGroupsPolicy
KEP-3619: Fine-grained SupplementalGroups control

Categories: CNCF Projects, Kubernetes

Kubernetes v1.35: Kubelet Configuration Drop-in Directory Graduates to GA

Kubernetes Blog - Mon, 12/22/2025 - 13:30

With the recent v1.35 release of Kubernetes, support for a kubelet configuration drop-in directory is generally available. The newly stable feature simplifies the management of kubelet configuration across large, heterogeneous clusters.

With v1.35, the kubelet command line argument --config-dir is production-ready and fully supported, allowing you to specify a directory containing kubelet configuration drop-in files. All files in that directory will be automatically merged with your main kubelet configuration. This allows cluster administrators to maintain a cohesive base configuration for kubelets while enabling targeted customizations for different node groups or use cases, and without complex tooling or manual configuration management.

The problem: managing kubelet configuration at scale

As Kubernetes clusters grow larger and more complex, they often include heterogeneous node pools with different hardware capabilities, workload requirements, and operational constraints. This diversity necessitates different kubelet configurations across node groups—yet managing these varied configurations at scale becomes increasingly challenging. Several pain points emerge:

Configuration drift: Different nodes may have slightly different configurations, leading to inconsistent behavior
Node group customization: GPU nodes, edge nodes, and standard compute nodes often require different kubelet settings
Operational overhead: Maintaining separate, complete configuration files for each node type is error-prone and difficult to audit
Change management: Rolling out configuration changes across heterogeneous node pools requires careful coordination

Before this support was added to Kubernetes, cluster administrators had to choose between using a single monolithic configuration file for all nodes, manually maintaining multiple complete configuration files, or relying on separate tooling. Each approach had its own drawbacks. This graduation to stable gives cluster administrators a fully supported fourth way to solve that challenge.

Example use cases

Managing heterogeneous node pools

Consider a cluster with multiple node types: standard compute nodes, high-capacity nodes (such as those with GPUs or large amounts of memory), and edge nodes with specialized requirements.

Base configuration

File: 00-base.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
 - "10.96.0.10"
clusterDomain: cluster.local

High-capacity node override

File: 50-high-capacity-nodes.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 50
systemReserved:
 memory: "4Gi"
 cpu: "1000m"

Edge node override

File: 50-edge-nodes.conf (edge compute typically has lower capacity)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
 memory.available: "500Mi"
 nodefs.available: "5%"

With this structure, high-capacity nodes apply both the base configuration and the capacity-specific overrides, while edge nodes apply the base configuration with edge-specific settings.
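
The layering described above can be modeled with a short sketch. This is a simplified illustration, not the kubelet's merge code: the function names are hypothetical, and the merge semantics (files applied in sorted name order, later values winning, nested maps merged key by key, lists replaced wholesale) are assumptions made for this example; consult the official documentation for the exact rules.

```python
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values from override win over values from base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            merged[key] = value
    return merged


def effective_config(drop_ins: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Apply drop-in files (file name -> parsed content) in sorted name order."""
    result: Dict[str, Any] = {}
    for name in sorted(drop_ins):
        result = merge_config(result, drop_ins[name])
    return result
```

In this model, a node that has both 00-base.conf and 50-high-capacity-nodes.conf ends up with the base clusterDNS and clusterDomain plus maxPods: 50 and the larger systemReserved values.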

Gradual configuration rollouts

When rolling out configuration changes, you can:

Add a new drop-in file with a high numeric prefix (e.g., 99-new-feature.conf)
Test the changes on a subset of nodes
Gradually roll out to more nodes
Once stable, merge changes into the base configuration

Viewing the merged configuration

Since configuration is now spread across multiple files, you can inspect the final merged configuration using the kubelet's /configz endpoint:

# Start kubectl proxy
kubectl proxy

# In another terminal, fetch the merged configuration
# Change the '<node-name>' placeholder before running the curl command
curl -X GET http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz | jq .

This shows the actual configuration the kubelet is using after all merging has been applied. The merged configuration also includes any configuration settings that were specified via kubelet command-line arguments.

For detailed setup instructions, configuration examples, and merging behavior, see the official documentation.

Good practices

When using the kubelet configuration drop-in directory:

Test configurations incrementally: Always test new drop-in configurations on a subset of nodes before rolling out cluster-wide to minimize risk
Version control your drop-ins: Store your drop-in configuration files in version control (or the configuration source from which these are generated) alongside your infrastructure as code to track changes and enable easy rollbacks
Use numeric prefixes for predictable ordering: Name files with numeric prefixes (e.g., 00-, 50-, 90-) to explicitly control merge order and make the configuration layering obvious to other administrators
Be mindful of temporary files: Some text editors automatically create backup files (such as .bak, .swp, or files with ~ suffix) in the same directory when editing. Ensure these temporary or backup files are not left in the configuration directory, as they may be processed by the kubelet
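
As a small aid for the last point, the following sketch flags file names that look like editor leftovers. The function name find_stray_files and the suffix list are assumptions for illustration; extend the list to match the editors used in your environment.

```python
from typing import Iterable, List

# Suffixes mentioned above; this list is illustrative, not exhaustive.
SUSPICIOUS_SUFFIXES = (".bak", ".swp", "~")


def find_stray_files(names: Iterable[str]) -> List[str]:
    """Return the file names that look like editor backup or swap files."""
    return sorted(n for n in names if n.endswith(SUSPICIOUS_SUFFIXES))
```

You could feed it the output of os.listdir() for your drop-in directory as part of a pre-deployment check.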

Acknowledgments

This feature was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, and document this feature across its journey from alpha in v1.28, through beta in v1.30, to GA in v1.35.

To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Get involved

If you have feedback or questions about kubelet configuration management, or want to share your experience using this feature, join the discussion:

SIG Node community page
Kubernetes Slack in the #sig-node channel
SIG Node mailing list

SIG Node would love to hear about your experiences using this feature in production!

Categories: CNCF Projects, Kubernetes

Avoiding Zombie Cluster Members When Upgrading to etcd v3.6

Kubernetes Blog - Sat, 12/20/2025 - 19:00

This article is a mirror of an original that was recently published to the official etcd blog. The key takeaway? Always upgrade to etcd v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired, and avoids zombie members.

Issue summary

Recently, the etcd community addressed an issue that may appear when users upgrade from v3.5 to v3.6. This bug can cause the cluster to report "zombie members", which are etcd nodes that were removed from the database cluster some time ago, and are re-appearing and joining database consensus. The etcd cluster is then inoperable until these zombie members are removed.

In etcd v3.5 and earlier, the v2store was the source of truth for membership data, even though the v3store was also present. As a part of our v2store deprecation plan, in v3.6 the v3store is the source of truth for cluster membership. Through a bug report we found out that, in some older clusters, v2store and v3store could become inconsistent. This inconsistency manifests after upgrading as seeing old, removed "zombie" cluster members re-appearing in the cluster.

The fix and upgrade path

We’ve added a mechanism in etcd v3.5.26 to automatically sync v3store from v2store, ensuring that affected clusters are repaired before upgrading to 3.6.x.

To support the many users currently upgrading to 3.6, we have provided the following safe upgrade path:

Upgrade your cluster to v3.5.26 or later.
Wait and confirm that all members are healthy post-update.
Upgrade to v3.6.

We are unable to provide a safe workaround path for users who have some obstacle preventing updating to v3.5.26. As such, if v3.5.26 is not available from your packaging source or vendor, you should delay upgrading to v3.6 until it is.
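
The upgrade path above can be encoded as a simple version gate, for example in automation that guards an upgrade pipeline. This is a hedged sketch: the function names are hypothetical, it only handles plain major.minor.patch version strings (with an optional leading v or a suffix after a hyphen), and it does not check member health for you.

```python
from typing import Tuple

MIN_SAFE_3_5 = (3, 5, 26)  # first v3.5 release with the automatic repair


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse strings such as 'v3.5.26' or '3.6.0-rc.1' into a comparable tuple."""
    if version.startswith("v"):
        version = version[1:]
    major, minor, patch = version.split("-")[0].split(".")
    return int(major), int(minor), int(patch)


def next_upgrade_step(current: str) -> str:
    """Suggest the next step on the safe upgrade path for the given version."""
    ver = parse_version(current)
    if ver >= (3, 6, 0):
        return "already on v3.6 or later"
    if ver >= MIN_SAFE_3_5:
        return "confirm all members are healthy, then upgrade to v3.6"
    return "upgrade to v3.5.26 or later first"
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically, where "3.5.9" would sort after "3.5.26".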

Additional technical detail

Information below is offered for reference only. Users can follow the safe upgrade path without knowledge of the following details.

This issue is encountered with clusters that have been running in production on etcd v3.5.25 or earlier. It is a side effect of adding and removing members from the cluster, or recovering the cluster from failure. This means that the issue is more likely the older the etcd cluster is, but it cannot be ruled out for any user regardless of the age of the cluster.

etcd maintainers, working with issue reporters, have found three possible triggers for the issue based on symptoms and an analysis of etcd code and logs:

Bug in etcdctl snapshot restore (v3.4 and older versions): When restoring a snapshot using etcdctl snapshot restore, etcdctl was supposed to remove existing members before adding the new ones. In v3.4, due to a bug, old members were not removed, resulting in zombie members. Refer to the comment on etcdctl.
--force-new-cluster in v3.5 and earlier versions: In rare cases, forcibly creating a new single-member cluster did not fully remove old members, leaving zombies. The issue was resolved in v3.5.22. Please refer to this PR in the Raft project for detailed technical information.
--unsafe-no-sync enabled: If --unsafe-no-sync is enabled, in rare cases etcd might persist a membership change to v3store but crash before writing it to the WAL, causing inconsistency between v2store and v3store. This is a problem for single-member clusters. For multi-member clusters, forcibly creating a new single-member cluster from the crashed node’s data may lead to zombie members.

Note

--unsafe-no-sync is generally not recommended, as it may break the guarantees given by the consensus protocol.

Importantly, there may be other triggers for v2store and v3store membership data becoming inconsistent that we have not yet found. This means that you cannot assume that you are safe just because you have not performed any of the three actions above. Once users are upgraded to etcd v3.6, v3store becomes the source of membership data, and further inconsistency is not possible.

Advanced users who want to verify the consistency between v2store and v3store can follow the steps described in this comment. This check is not required to fix the issue, nor does SIG etcd recommend bypassing the v3.5.26 update regardless of the results of the check.

Key takeaway

Always upgrade to v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired and avoids zombie members.

Acknowledgements

We would like to thank Christian Baumann for reporting this long-standing upgrade issue. His report and follow-up work helped bring the issue to our attention so that we could investigate and resolve it upstream.

Categories: CNCF Projects, Kubernetes

Kubernetes 1.35: In-Place Pod Resize Graduates to Stable

Kubernetes Blog - Fri, 12/19/2025 - 13:30

This release marks a major step: more than 6 years after its initial conception, the In-Place Pod Resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, and graduated to beta in Kubernetes v1.33, is now stable (GA) in Kubernetes 1.35!

This graduation is a major milestone for improving resource efficiency and flexibility for workloads running on Kubernetes.

What is in-place Pod Resize?

In the past, the CPU and memory resources allocated to a container in a Pod were immutable. This meant changing them required deleting and recreating the entire Pod. For stateful services, batch jobs, or latency-sensitive workloads, this was an incredibly disruptive operation.

In-Place Pod Resize makes CPU and memory requests and limits mutable, allowing you to adjust these resources within a running Pod, often without requiring a container restart.

Key Concepts:

Desired Resources: A container's spec.containers[*].resources field now represents the desired resources. For CPU and memory, these fields are now mutable.
Actual Resources: The status.containerStatuses[*].resources field reflects the resources currently configured for a running container.
Triggering a Resize: You can request a resize by updating the desired requests and limits in the Pod's specification using the new resize subresource.
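
The relationship between desired and actual resources can be illustrated with a short sketch that compares the two fields on a Pod object loaded as a dictionary. The function name pending_resizes is hypothetical, and the comparison is purely textual, so equivalent quantities written differently (such as "1" and "1000m" CPU) would be reported as different.

```python
from typing import Any, Dict


def pending_resizes(pod: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return containers whose desired resources differ from the actual ones."""
    # spec.containers[*].resources holds the desired resources.
    desired = {c["name"]: c.get("resources", {}) for c in pod["spec"]["containers"]}
    # status.containerStatuses[*].resources holds the actual resources.
    statuses = pod.get("status", {}).get("containerStatuses", [])
    actual = {s["name"]: s.get("resources", {}) for s in statuses}
    return {
        name: {"desired": want, "actual": actual[name]}
        for name, want in desired.items()
        if name in actual and actual[name] != want
    }
```

An empty result means that, for every container with a reported status, the running configuration already matches the spec.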

How can I start using in-place Pod Resize?

Detailed usage instructions and examples are provided in the official documentation: Resize CPU and Memory Resources assigned to Containers.

How does this help me?

In-place Pod Resize is a foundational building block that unlocks seamless, vertical autoscaling and improvements to workload efficiency.

Resources adjusted without disruption: Workloads sensitive to latency or restarts can have their resources modified in-place without downtime or loss of state.
More powerful autoscaling: Autoscalers can now adjust resources with less impact. For example, Vertical Pod Autoscaler (VPA)'s InPlaceOrRecreate update mode, which leverages this feature, has graduated to beta. This allows resources to be adjusted automatically and seamlessly based on usage with minimal disruption.
- See AEP-4016 for more details.
Address transient resource needs: Workloads that temporarily need more resources can be adjusted quickly. This enables features like the CPU Startup Boost (AEP-7862), where applications can request more CPU during startup and then automatically scale back down.

Here are a few examples of some use cases:

A game server that needs to adjust its size with shifting player count.
A pre-warmed worker that can be shrunk while unused but inflated with the first request.
Dynamically scale with load for efficient bin-packing.
Increased resources for JIT compilation on startup.

Changes between beta (1.33) and stable (1.35)

Since the initial beta in v1.33, development effort has primarily been around stabilizing the feature and improving its usability based on community feedback. Here are the primary changes for the stable release:

Memory limit decrease: Decreasing memory limits was previously prohibited. This restriction has been lifted, and memory limit decreases are now permitted. The Kubelet attempts to prevent OOM-kills by allowing the resize only if the current memory usage is below the new desired limit. However, this check is best-effort and not guaranteed.
Prioritized resizes: If a node doesn't have enough room to accept all resize requests, Deferred resizes are reattempted based on the following priority:
- PriorityClass
- QoS class
- Duration Deferred, with older requests prioritized first.
Pod Level Resources (Alpha): Support for in-place Pod Resize with Pod Level Resources has been introduced behind its own feature gate, which is alpha in v1.35.
Increased observability There are now new Kubelet metrics and Pod events specifically associated with In-Place Pod Resize to help users track and debug resource changes.
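The retry ordering and the memory limit check described above can be sketched in a few lines of Python. This is only an illustration, not the kubelet's code: the DeferredResize type, the function names, and the assumption that Guaranteed ranks ahead of Burstable and BestEffort are all invented for this example.

```python
from dataclasses import dataclass

# Assumed ranking of QoS classes for retries; the text does not spell it out.
QOS_RANK = {"Guaranteed": 0, "Burstable": 1, "BestEffort": 2}


@dataclass
class DeferredResize:
    """Hypothetical record of a resize request that is waiting for room."""
    pod: str
    priority: int            # value of the Pod's PriorityClass; higher wins
    qos_class: str           # "Guaranteed", "Burstable", or "BestEffort"
    deferred_seconds: float  # how long the request has been Deferred


def retry_order(resizes):
    """Sort by PriorityClass, then QoS class, then longest-deferred first."""
    return sorted(
        resizes,
        key=lambda r: (-r.priority, QOS_RANK[r.qos_class], -r.deferred_seconds),
    )


def memory_limit_decrease_allowed(current_usage_bytes, new_limit_bytes):
    """Best-effort check: allow the decrease only while usage is below the new limit."""
    return current_usage_bytes < new_limit_bytes


pending = [
    DeferredResize("cache", priority=100, qos_class="Burstable", deferred_seconds=30),
    DeferredResize("api", priority=1000, qos_class="Burstable", deferred_seconds=5),
    DeferredResize("db", priority=100, qos_class="Guaranteed", deferred_seconds=10),
    DeferredResize("batch", priority=100, qos_class="Burstable", deferred_seconds=90),
]
print([r.pod for r in retry_order(pending)])  # ['api', 'db', 'batch', 'cache']
```

Note that the usage check is a point-in-time comparison, which is why the text calls it best-effort: usage can still grow past the new limit after the resize is accepted.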

What's next?

The graduation of In-Place Pod Resize to stable opens the door for powerful integrations across the Kubernetes ecosystem. There are several areas for further improvement that are currently planned.

Integration with autoscalers and other projects

There are planned integrations with several autoscalers and other projects to improve workload efficiency at a larger scale. Some projects under discussion:

VPA CPU startup boost (AEP-7862): Allows applications to request more CPU at startup and scale back down after a specific period of time.
VPA Support for in-place updates (AEP-4016): VPA support for InPlaceOrRecreate has recently graduated to beta, with the eventual goal being to graduate the feature to stable. Support for InPlace mode is still being worked on; see this pull request.
Ray autoscaler: Plans to leverage In-Place Pod Resize to improve workload efficiency. See this Google Cloud blog post for more details.
Agent-sandbox "Soft-Pause": Investigating leveraging in-place Pod Resize for improved latency. See the GitHub issue for more details.
Runtime support: Java and Python runtimes do not support resizing memory without restart. There is an open conversation with the Java developers; see the bug.

If you have a project that could benefit from integration with in-place pod resize, please reach out using the channels listed in the feedback section!

Feature expansion

Today, In-Place Pod Resize is prohibited when used in combination with: swap, the static CPU Manager, and the static Memory Manager. Additionally, resources other than CPU and memory are still immutable. Expanding the set of supported features and resources is under consideration as more feedback about community needs comes in.

There are also plans to support workload preemption; if there is not enough room on the node for the resize of a high priority pod, the goal is to enable policies to automatically evict a lower-priority pod or upsize the node.

Improved stability

Resolve kubelet-scheduler race conditions There are known race conditions between the kubelet and scheduler with regard to in-place pod resize. Work is underway to resolve these issues over the next few releases. See the issue for more details.
Safer memory limit decrease The Kubelet's best-effort check for OOM-kill prevention can be made even safer by moving the memory usage check into the container runtime itself. See the issue for more details.

Providing feedback

As we look to build further on this foundational feature, please share your feedback on how to improve and extend it. You can share your feedback through GitHub issues, mailing lists, or Slack channels related to the Kubernetes #sig-node and #sig-autoscaling communities.

Thank you to everyone who contributed to making this long-awaited feature a reality!

Categories: CNCF Projects, Kubernetes

Kubernetes v1.35: Job Managed By Goes GA

Kubernetes Blog - Thu, 12/18/2025 - 13:30

In Kubernetes v1.35, the ability to specify an external Job controller (through .spec.managedBy) graduates to General Availability.

This feature allows external controllers to take full responsibility for Job reconciliation, unlocking powerful scheduling patterns like multi-cluster dispatching with MultiKueue.

Why delegate Job reconciliation?

The primary motivation for this feature is to support multi-cluster batch scheduling architectures, such as MultiKueue.

The MultiKueue architecture distinguishes between a Management Cluster and a pool of Worker Clusters:

The Management Cluster is responsible for dispatching Jobs but not executing them. It needs to accept Job objects to track status, but it skips the creation and execution of Pods.
The Worker Clusters receive the dispatched Jobs and execute the actual Pods.
Users usually interact with the Management Cluster. Because the status is automatically propagated back, they can observe the Job's progress "live" without accessing the Worker Clusters.
In the Worker Clusters, the dispatched Jobs run as regular Jobs managed by the built-in Job controller, with no .spec.managedBy set.

By using .spec.managedBy, the MultiKueue controller on the Management Cluster can take over the reconciliation of a Job. It copies the status from the "mirror" Job running on the Worker Cluster back to the Management Cluster.

Why not just disable the Job controller? While one could theoretically achieve this by disabling the built-in Job controller entirely, this is often impossible or impractical for two reasons:

Managed Control Planes: In many cloud environments, the Kubernetes control plane is locked, and users cannot modify controller manager flags.
Hybrid Cluster Role: Users often need a "hybrid" mode where the Management Cluster dispatches some heavy workloads to remote clusters but still executes smaller or control-plane-related Jobs in the Management Cluster. .spec.managedBy allows this granularity on a per-Job basis.

How `.spec.managedBy` works

The .spec.managedBy field indicates which controller is responsible for the Job. There are two modes of operation:

Standard: If unset or set to the reserved value kubernetes.io/job-controller, the built-in Job controller reconciles the Job as usual.
Delegation: If set to any other value, the built-in Job controller skips reconciliation entirely for that Job.

To prevent orphaned Pods or resource leaks, this field is immutable. You cannot transfer a running Job from one controller to another.
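As a rough illustration of the two modes and the immutability rule, here is a minimal Python sketch that works on Job objects represented as plain dictionaries. The function names and the example.com/external-controller value are hypothetical; only the reserved value kubernetes.io/job-controller comes from the description above.

```python
# Reserved value that selects the built-in Job controller.
BUILT_IN_JOB_CONTROLLER = "kubernetes.io/job-controller"


def built_in_controller_reconciles(job):
    """Return True when the built-in Job controller should reconcile this Job."""
    managed_by = job.get("spec", {}).get("managedBy")
    return managed_by is None or managed_by == BUILT_IN_JOB_CONTROLLER


def validate_update(old_job, new_job):
    """Reject updates that change spec.managedBy, because the field is immutable."""
    old = old_job.get("spec", {}).get("managedBy")
    new = new_job.get("spec", {}).get("managedBy")
    if old != new:
        raise ValueError("spec.managedBy is immutable")


# Standard mode: the field is unset.
local_job = {"metadata": {"name": "local"}, "spec": {}}
# Delegation mode: any other value (hypothetical controller name).
delegated_job = {
    "metadata": {"name": "remote"},
    "spec": {"managedBy": "example.com/external-controller"},
}

print(built_in_controller_reconciles(local_job))      # True
print(built_in_controller_reconciles(delegated_job))  # False
```

In a real cluster the immutability check is enforced by API validation, not by client code; the validate_update helper only mirrors the rule for illustration.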

If you are looking into implementing an external controller, be aware that your controller needs to conform to the definitions for the Job API. To enforce this conformance, a significant part of the effort went into introducing extensive Job status validation rules. Navigate to the How can you learn more? section for more details.

Ecosystem Adoption

The .spec.managedBy field is rapidly becoming the standard interface for delegating control in the Kubernetes batch ecosystem.

Various custom workload controllers are adding this field (or an equivalent) to allow MultiKueue to take over their reconciliation and orchestrate them across clusters:

While it is possible to use .spec.managedBy to implement a custom Job controller from scratch, we haven't observed that yet. The feature is specifically designed to support delegation patterns, like MultiKueue, without reinventing the wheel.

How can you learn more?

If you want to dig deeper:

Read the user-facing documentation for:

Deep dive into the design history:

The Kubernetes Enhancement Proposal (KEP) for the Job's managed-by mechanism, including the introduction of the extensive Job status validation rules.
The Kueue KEP for MultiKueue.

Explore how MultiKueue uses .spec.managedBy in practice in the task guide for running Jobs across clusters.

Acknowledgments

As with any Kubernetes feature, a lot of people helped shape this one through design discussions, reviews, test runs, and bug reports.

We would like to thank, in particular:

Maciej Szulik - for guidance, mentorship, and reviews.
Filip Křepinský - for guidance, mentorship, and reviews.

Get involved

This work was sponsored by the Kubernetes Batch Working Group in close collaboration with SIG Apps, and with strong input from the SIG Scheduling community.

If you are interested in batch scheduling, multi-cluster solutions, or further improving the Job API:

Join us in the Batch WG and SIG Apps meetings.
Subscribe to the WG Batch Slack channel.

Categories: CNCF Projects, Kubernetes

Cilium releases 2025 annual report: A decade of cloud native networking

CNCF Blog Projects Category - Thu, 12/18/2025 - 11:00

A decade on from its first commit in 2015, 2025 marks a significant milestone for the Cilium project. The community has published the 2025 Cilium Annual Report: A Decade of Cloud Native Networking, which reflects on the project’s evolution, key milestones, and notable developments over the past year.

What began as an experimental container networking effort has grown into a mature, widely adopted platform, bringing together cloud native networking, observability, and security through an eBPF-based architecture. As Cilium enters its second decade, the community continues to grow in both size and momentum, with sustained high-volume development, widespread production adoption, and expanding use cases including virtual machines and large-scale AI infrastructure.

We invite you to explore the 2025 Annual Report and celebrate a decade of cloud native networking with the community.

For any questions or feedback, please reach out to [email protected].

Categories: CNCF Projects

Kubernetes v1.35: Timbernetes (The World Tree Release)

Kubernetes Blog - Wed, 12/17/2025 - 13:30

Editors: Aakanksha Bhende, Arujjwal Negi, Chad M. Crowell, Graziano Casto, Swathi Rao

Similar to previous releases, the release of Kubernetes v1.35 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 60 enhancements, including 17 stable, 19 beta, and 22 alpha features.

There are also some deprecations and removals in this release; make sure to read about those.

Release theme and logo

a storybook hex badge with a glowing world tree whose branches cradle Earth and a white Kubernetes wheel; three cheerful squirrels stand below—a wizard in a plum robe holding an LGTM scroll, a warrior with an axe and blue Kubernetes shield, and a lantern-carrying rogue in a navy cloak—on green grass above a gold ribbon reading World Tree Release, backed by soft mountains and cloud-swept sky

2025 began in the shimmer of Octarine: The Color of Magic (v1.33) and rode the gusts Of Wind & Will (v1.34). We close the year with our hands on the World Tree, inspired by Yggdrasil, the tree of life that binds many realms. Like any great tree, Kubernetes grows ring by ring and release by release, shaped by the care of a global community.

At its center sits the Kubernetes wheel wrapped around the Earth, grounded by the resilient maintainers, contributors and users who keep showing up. Between day jobs, life changes, and steady open-source stewardship, they prune old APIs, graft new features and keep one of the world’s largest open source projects healthy.

Three squirrels guard the tree: a wizard holding the LGTM scroll for reviewers, a warrior with an axe and Kubernetes shield for the release crews who cut new branches, and a rogue with a lantern for the triagers who bring light to dark issue queues.

Together, they stand in for a much larger adventuring party. Kubernetes v1.35 adds another growth ring to the World Tree, a fresh cut shaped by many hands, many paths and a community whose branches reach higher as its roots grow deeper.

Spotlight on key updates

Kubernetes v1.35 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: In-place update of Pod resources

Kubernetes has graduated in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust CPU and memory resources without restarting Pods or Containers. Previously, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications. Earlier Kubernetes releases allowed you to change only infrastructure resource settings (requests and limits) for existing Pods. The new in-place functionality allows for smoother, nondisruptive vertical scaling, improves efficiency, and can also simplify development.

This work was done as part of KEP #1287 led by SIG Node.

Beta: Pod certificates for workload identity and security

Previously, delivering certificates to pods required external controllers (cert-manager, SPIFFE/SPIRE), CRD orchestration, and Secret management, with rotation handled by sidecars or init containers. Kubernetes v1.35 enables native workload identity with automated certificate rotation, drastically simplifying service mesh and zero-trust architectures.

Now, the kubelet generates keys, requests certificates via PodCertificateRequest, and writes credential bundles directly to the Pod's filesystem. The kube-apiserver enforces node restriction at admission time, eliminating the most common pitfall for third-party signers: accidentally violating node isolation boundaries. This enables pure mTLS flows with no bearer tokens in the issuance path.

This work was done as part of KEP #4317 led by SIG Auth.

Alpha: Node declared features before scheduling

When control planes enable new features but nodes lag behind (permitted by Kubernetes skew policy), the scheduler can place pods requiring those features onto incompatible older nodes. The node-declaration features framework allows nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it supports, publishing this information to the control plane via a new .status.declaredFeatures field. Then, the kube-scheduler, admission controllers, and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints to ensure that Pods run only on compatible nodes.

This work was done as part of KEP #5328 led by SIG Node.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.35 release.

PreferSameNode traffic distribution

The trafficDistribution field for Services has been updated to provide more explicit control over traffic routing. A new option, PreferSameNode, has been introduced to let services strictly prioritize endpoints on the local node if available, falling back to remote endpoints otherwise.
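The preference-with-fallback behavior can be pictured with a small Python sketch. It is a simplification for illustration only: the select_endpoints function and the dictionary shape of an endpoint are invented here, and the real selection happens inside the Service proxy.

```python
def select_endpoints(endpoints, local_node):
    """Prefer endpoints on the local node; fall back to all endpoints otherwise."""
    local = [e for e in endpoints if e["nodeName"] == local_node]
    return local if local else list(endpoints)


endpoints = [
    {"ip": "10.0.1.5", "nodeName": "node-a"},
    {"ip": "10.0.2.7", "nodeName": "node-b"},
    {"ip": "10.0.2.8", "nodeName": "node-b"},
]

# A client on node-b only sees the two local endpoints.
print([e["ip"] for e in select_endpoints(endpoints, "node-b")])  # ['10.0.2.7', '10.0.2.8']
# A client on node-c has no local endpoint, so every endpoint stays eligible.
print(len(select_endpoints(endpoints, "node-c")))  # 3
```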

Simultaneously, the existing PreferClose option has been renamed to PreferSameZone. This change makes the API self-explanatory by explicitly indicating that traffic is preferred within the current availability zone. While PreferClose is preserved for backward compatibility, PreferSameZone is now the standard for zonal routing, ensuring that both node-level and zone-level preferences are clearly distinguished.

This work was done as part of KEP #3015 led by SIG Network.

Job API managed-by mechanism

The Job API now includes a managedBy field that allows an external controller to handle Job status synchronization. This feature, which graduates to stable in Kubernetes v1.35, is primarily driven by MultiKueue, a multi-cluster dispatching system where a Job created in a management cluster is mirrored and executed in a worker cluster, with status updates propagated back. To enable this workflow, the built-in Job controller must not act on a particular Job resource so that the Kueue controller can manage status updates instead.

The goal is to allow clean delegation of Job synchronization to another controller. It does not aim to pass custom parameters to that controller or modify CronJob concurrency policies.

This work was done as part of KEP #4368 led by SIG Apps.

Reliable Pod update tracking with `.metadata.generation`

Historically, the Pod API lacked the metadata.generation field found in other Kubernetes objects such as Deployments. Because of this omission, controllers and users had no reliable way to verify whether the kubelet had actually processed the latest changes to a Pod's specification. This ambiguity was particularly problematic for features like In-Place Pod Vertical Scaling, where it was difficult to know exactly when a resource resize request had been enacted.

Kubernetes v1.33 added .metadata.generation fields for Pods, as an alpha feature. That field is now stable in the v1.35 Pod API, which means that every time a Pod's spec is updated, the .metadata.generation value is incremented. As part of this improvement, the Pod API also gained a .status.observedGeneration field, which reports the generation that the kubelet has successfully seen and processed. Pod conditions also each contain their own individual observedGeneration field that clients can report and / or observe.
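A client can compare the two fields to decide whether the kubelet has caught up with the latest spec. The following Python sketch assumes the Pod has already been fetched and decoded into a dictionary; the helper name is hypothetical.

```python
def kubelet_has_observed_latest_spec(pod):
    """Return True once status.observedGeneration has caught up with metadata.generation."""
    generation = pod.get("metadata", {}).get("generation")
    observed = pod.get("status", {}).get("observedGeneration")
    if generation is None or observed is None:
        # One of the fields is missing, so nothing can be concluded yet.
        return False
    return observed >= generation


# The spec was updated twice, but the kubelet has only processed the first update.
resizing_pod = {"metadata": {"generation": 3}, "status": {"observedGeneration": 2}}
settled_pod = {"metadata": {"generation": 3}, "status": {"observedGeneration": 3}}

print(kubelet_has_observed_latest_spec(resizing_pod))  # False
print(kubelet_has_observed_latest_spec(settled_pod))   # True
```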

Because this feature has graduated to stable in v1.35, it is available for all workloads.

This work was done as part of KEP #5067 led by SIG Node.

Configurable NUMA node limit for topology manager

The topology manager historically used a hard-coded limit of 8 for the maximum number of NUMA nodes it can support, preventing state explosion during affinity calculation. (There's an important detail here; a NUMA node is not the same as a Node in the Kubernetes API.) This limit on the number of NUMA nodes prevented Kubernetes from fully utilizing modern high-end servers, which increasingly feature CPU architectures with more than 8 NUMA nodes.

Kubernetes v1.31 introduced a new, beta max-allowable-numa-nodes option to the topology manager policy configuration. In Kubernetes v1.35, that option is stable. Cluster administrators who enable it can use servers with more than 8 NUMA nodes.

Although the configuration option is stable, the Kubernetes community is aware of the poor performance for large NUMA hosts, and there is a proposed enhancement (KEP-5726) that aims to improve on it. You can learn more about this by reading Control Topology Management Policies on a node.

This work was done as part of KEP #4622 led by SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.35 release.

Expose node topology labels via Downward API

Accessing node topology information, such as region and zone, from within a Pod has typically required querying the Kubernetes API server. While functional, this approach creates complexity and security risks by necessitating broad RBAC permissions or sidecar containers just to retrieve infrastructure metadata. Kubernetes v1.35 promotes the capability to expose node topology labels directly via the Downward API to beta.

The kubelet can now inject standard topology labels, such as topology.kubernetes.io/zone and topology.kubernetes.io/region, into Pods as environment variables or projected volume files. The primary benefit is a safer and more efficient way for workloads to be topology-aware. This allows applications to natively adapt to their availability zone or region without dependencies on the API server, strengthening security by upholding the principle of least privilege and simplifying cluster configuration.

Note: Kubernetes now injects available topology labels to every Pod so that they can be used as inputs to the downward API. With the v1.35 upgrade, most cluster administrators will see several new labels added to each Pod; this is expected as part of the design.
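To make this concrete, here is a hedged Python sketch of both sides: the env entries a container could declare (shown as a Python structure mirroring the manifest) and the application code that reads them. The variable names NODE_ZONE and NODE_REGION and the read_topology helper are assumptions made for this example; the fieldPath uses the existing Downward API syntax for Pod labels.

```python
import os

# Entries for spec.containers[].env, written as Python data for illustration.
# NODE_ZONE and NODE_REGION are hypothetical names chosen for this example.
TOPOLOGY_ENV = [
    {
        "name": "NODE_ZONE",
        "valueFrom": {
            "fieldRef": {"fieldPath": "metadata.labels['topology.kubernetes.io/zone']"}
        },
    },
    {
        "name": "NODE_REGION",
        "valueFrom": {
            "fieldRef": {"fieldPath": "metadata.labels['topology.kubernetes.io/region']"}
        },
    },
]


def read_topology(environ=None):
    """Read the injected topology values; missing values come back as None."""
    environ = os.environ if environ is None else environ
    return {"zone": environ.get("NODE_ZONE"), "region": environ.get("NODE_REGION")}


# Simulated environment, as the application would see it inside the container.
print(read_topology({"NODE_ZONE": "us-east-1a", "NODE_REGION": "us-east-1"}))
```

The application never talks to the API server here, which is the point of the feature: no extra RBAC permissions are needed just to learn the zone or region.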

This work was done as part of KEP #4742 led by SIG Node.

Native support for storage version migration

In Kubernetes v1.35, the native support for storage version migration graduates to beta and is enabled by default. This move integrates the migration logic directly into the core Kubernetes control plane ("in-tree"), eliminating the dependency on external tools.

Historically, administrators relied on manual "read/write loops"—often piping kubectl get into kubectl replace—to update schemas or re-encrypt data at rest. This method was inefficient and prone to conflicts, especially for large resources like Secrets. With this release, the built-in controller automatically handles update conflicts and consistency tokens, providing a safe, streamlined, and reliable way to ensure stored data remains current with minimal operational overhead.

This work was done as part of KEP #4192 led by SIG API Machinery.

Mutable Volume attach limits

A CSI (Container Storage Interface) driver is a Kubernetes plugin that provides a consistent way for storage systems to be exposed to containerized workloads. The CSINode object records details about all CSI drivers installed on a node. However, a mismatch can arise between the reported and actual attachment capacity on nodes. When volume slots are consumed after a CSI driver starts up, the kube-scheduler may assign stateful pods to nodes without sufficient capacity, leaving those pods stuck in a ContainerCreating state.

Kubernetes v1.35 makes CSINode.spec.drivers[*].allocatable.count mutable so that a node’s available volume attachment capacity can be updated dynamically. It also allows CSI drivers to control how frequently the allocatable.count value is updated on all nodes by introducing a configurable refresh interval, defined through the CSIDriver object. Additionally, it automatically updates CSINode.spec.drivers[*].allocatable.count on detecting a failure in volume attachment due to insufficient capacity. This feature graduated to beta in v1.34 with the feature flag MutableCSINodeAllocatableCount disabled by default. It remains in beta for v1.35 to allow time for feedback, but the feature flag is now enabled by default.

This work was done as part of KEP #4876 led by SIG Storage.

Opportunistic batching

Historically, the Kubernetes scheduler processes pods sequentially with time complexity of O(num pods × num nodes), which can result in redundant computation for compatible pods. This KEP introduces an opportunistic batching mechanism that aims to improve performance by identifying such compatible Pods via Pod scheduling signature and batching them together, allowing shared filtering and scoring results across them.

The pod scheduling signature ensures that two pods with the same signature are “the same” from a scheduling perspective. It takes into account not only the pod and node attributes, but also the other pods in the system and global data about the pod placement. This means that any pod with the given signature will get the same scores/feasibility results from any arbitrary set of nodes.

The batching mechanism consists of two operations that can be invoked whenever needed: create and nominate. Create leads to the creation of a new set of batch information from the scheduling results of Pods that have a valid signature. Nominate uses the batching information from create to set the nominated node name for a new Pod whose signature matches the canonical Pod’s signature.

This work was done as part of KEP #5598 led by SIG Scheduling.

`maxUnavailable` for StatefulSets

A StatefulSet runs a group of Pods and maintains a sticky identity for each of those Pods. This is critical for stateful workloads requiring stable network identifiers or persistent storage. When a StatefulSet's .spec.updateStrategy.<type> is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.

Kubernetes v1.24 added a new alpha field to a StatefulSet's rollingUpdate configuration settings, called maxUnavailable. That field wasn't part of the Kubernetes API unless your cluster administrator explicitly opted in. In Kubernetes v1.35 that field is beta and is available by default. You can use it to define the maximum number of pods that can be unavailable during an update. This setting is most effective in combination with .spec.podManagementPolicy set to Parallel. You can set maxUnavailable as either a positive number (example: 2) or a percentage of the desired number of Pods (example: 10%). If this field is not specified, it will default to 1, to maintain the previous behavior of only updating one Pod at a time. This improvement allows stateful applications (that can tolerate more than one Pod being down) to finish updating faster.
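The way a number or percentage turns into a Pod count can be sketched as follows. This is an assumption-laden illustration, not the StatefulSet controller's code: the function name is hypothetical, the rounding direction for percentages is exposed as a parameter because the text above does not state it, and the floor of 1 is an assumption that matches the stated default.

```python
def resolve_max_unavailable(value, replicas, round_up=False):
    """Turn maxUnavailable (int, "N%", or None) into an absolute Pod count."""
    if value is None:
        # Unset defaults to 1, which keeps the one-Pod-at-a-time behavior.
        return 1
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids floating-point surprises at boundaries.
        count = -(-scaled // 100) if round_up else scaled // 100
    else:
        count = int(value)
    # Assumed floor: never resolve to zero, otherwise the rollout could not progress.
    return max(count, 1)


print(resolve_max_unavailable(None, 5))    # 1
print(resolve_max_unavailable(2, 10))      # 2
print(resolve_max_unavailable("10%", 30))  # 3
print(resolve_max_unavailable("10%", 5))   # 1
```

Check the API reference for your Kubernetes version to confirm how percentages are rounded before relying on an exact count.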

This work was done as part of KEP #961 led by SIG Apps.

Configurable credential plugin policy in `kuberc`

The optional kuberc file is a way to separate server configurations and cluster credentials from user preferences without disrupting already running CI pipelines with unexpected outputs.

As part of the v1.35 release, kuberc gains additional functionality which allows users to configure credential plugin policy. This change introduces two fields: credentialPluginPolicy, which allows or denies all plugins, and credentialPluginAllowlist, which specifies a list of allowed plugins.

This work was done as part of KEP #3104 as a cooperation between SIG Auth and SIG CLI.

KYAML

YAML is a human-readable format of data serialization. In Kubernetes, YAML files are used to define and configure resources, such as Pods, Services, and Deployments. However, complex YAML is difficult to read. YAML's significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (see: The Norway Bug). While JSON is an alternative, it lacks support for comments and has strict requirements for trailing commas and quoted keys.

KYAML is a safer and less ambiguous subset of YAML designed specifically for Kubernetes. Introduced as an opt-in alpha feature in v1.34, this feature graduated to beta in Kubernetes v1.35 and has been enabled by default. It can be disabled by setting the environment variable KUBECTL_KYAML=false.

KYAML addresses challenges pertaining to both YAML and JSON. All KYAML files are also valid YAML files. This means you can write KYAML and pass it as an input to any version of kubectl. This also means that you don’t need to write in strict KYAML for the input to be parsed.

This work was done as part of KEP #5295 led by SIG CLI.

Configurable tolerance for HorizontalPodAutoscalers

The Horizontal Pod Autoscaler (HPA) has historically relied on a fixed, global 10% tolerance for scaling actions. A drawback of this hardcoded value was that workloads requiring high sensitivity, such as those needing to scale on a 5% load increase, were often blocked from scaling, while others might oscillate unnecessarily.

With Kubernetes v1.35, the configurable tolerance feature graduates to beta and is enabled by default. This enhancement allows users to define a custom tolerance window on a per-resource basis within the HPA behavior field. By setting a specific tolerance (e.g., lowering it to 0.05 for 5%), operators gain precise control over autoscaling sensitivity, ensuring that critical workloads react quickly to small metric changes, without requiring cluster-wide configuration adjustments.
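The effect of the tolerance is easiest to see in the scaling calculation itself. The sketch below follows the replica formula from the HPA documentation (desired replicas are the current replicas scaled by the ratio of current to target metric, rounded up) and skips scaling when the ratio is within the tolerance of 1.0. The function name and the sample numbers are made up for this example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.10):
    """Return the replica count the HPA formula yields for a given tolerance."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # The metric is close enough to the target, so no scaling happens.
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Load is 8% above target: ignored with the default 10% tolerance...
print(desired_replicas(10, 108, 100))                  # 10
# ...but triggers a scale-up when the tolerance is lowered to 5%.
print(desired_replicas(10, 108, 100, tolerance=0.05))  # 11
```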

This work was done as part of KEP #4951 led by SIG Autoscaling.

Support for user namespaces in Pods

Kubernetes is adding support for user namespaces, allowing pods to run with isolated user and group ID mappings instead of sharing host IDs. This means containers can operate as root internally while actually being mapped to an unprivileged user on the host, reducing the risk of privilege escalation in the event of a compromise. The feature improves pod-level security and makes it safer to run workloads that need root inside the container. Over time, support has expanded to both stateless and stateful Pods through id-mapped mounts.

This work was done as part of KEP #127 led by SIG Node.

VolumeSource: OCI artifact and/or image

When creating a Pod, you often need to provide data, binaries, or configuration files for your containers. Traditionally, this meant including the content in the main container image or using a custom init container to download and unpack files into an emptyDir. Both these approaches are still valid. Kubernetes v1.31 added support for the image volume type, allowing Pods to declaratively pull and unpack OCI container image artifacts into a volume. This lets you package and deliver data-only artifacts such as configs, binaries, or machine learning models using standard OCI registry tools.

With this feature, you can fully separate your data from your container image and remove the need for extra init containers or startup scripts. The image volume type has been in beta since v1.33 and is enabled by default in v1.35. Please note that using this feature requires a compatible container runtime, such as containerd v2.1 or later.

This work was done as part of KEP #4639 led by SIG Node.

Enforced `kubelet` credential verification for cached images

The imagePullPolicy: IfNotPresent setting currently allows a Pod to use a container image that is already cached on a node, even if the Pod itself does not possess the credentials to pull that image. A drawback of this behavior is that it creates a security vulnerability in multi-tenant clusters: if a Pod with valid credentials pulls a sensitive private image to a node, a subsequent unauthorized Pod on the same node can access that image simply by relying on the local cache.

This KEP introduces a mechanism where the kubelet enforces credential verification for cached images. Before allowing a Pod to use a locally cached image, the kubelet checks if the Pod has the valid credentials to pull it. This ensures that only authorized workloads can use private images, regardless of whether they are already present on the node, significantly hardening the security posture for shared clusters.

In Kubernetes v1.35, this feature has graduated to beta and is enabled by default. Users can still disable it by setting the KubeletEnsureSecretPulledImages feature gate to false. Additionally, the imagePullCredentialsVerificationPolicy flag allows operators to configure the desired security level, ranging from a mode that prioritizes backward compatibility to a strict enforcement mode that offers maximum security.

This work was done as part of KEP #2535 led by SIG Node.

Fine-grained Container restart rules

Historically, the restartPolicy field was defined strictly at the Pod level, forcing the same behavior on all containers within a Pod. A drawback of this global setting was the lack of granularity for complex workloads, such as AI/ML training jobs. These often required restartPolicy: Never for the Pod to manage job completion, yet individual containers would benefit from in-place restarts for specific, retriable errors (like network glitches or GPU init failures).

Kubernetes v1.35 addresses this by enabling restartPolicy and restartPolicyRules within the container API itself. This allows users to define restart strategies for individual regular and init containers that operate independently of the Pod's overall policy. For example, a container can now be configured to restart automatically only if it exits with a specific error code, avoiding the expensive overhead of rescheduling the entire Pod for a transient failure.

In this release, the feature has graduated to beta and is enabled by default. Users can immediately leverage restartPolicyRules in their container specifications to optimize recovery times and resource utilization for long-running workloads, without altering the broader lifecycle logic of their Pods.
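As a rough mental model, the decision for an exited container could look like the Python sketch below. The rule shape used here (an action plus an exitCodes block with an operator and a list of values) is an assumption about the container API and should be checked against the API reference; the should_restart function and the exit code 42 are purely illustrative.

```python
def should_restart(container, exit_code, pod_restart_policy="Never"):
    """Decide whether an exited container is restarted in place."""
    # Assumed rule shape: {"action": ..., "exitCodes": {"operator": ..., "values": [...]}}
    for rule in container.get("restartPolicyRules", []):
        codes = rule["exitCodes"]
        if codes["operator"] == "In":
            matched = exit_code in codes["values"]
        else:  # treated as "NotIn" in this sketch
            matched = exit_code not in codes["values"]
        if matched:
            # The first matching rule wins.
            return rule["action"] == "Restart"
    # No rule matched: fall back to the container policy, then the Pod policy.
    policy = container.get("restartPolicy", pod_restart_policy)
    if policy == "Always":
        return True
    if policy == "OnFailure":
        return exit_code != 0
    return False


trainer = {
    "name": "trainer",
    "restartPolicy": "Never",
    "restartPolicyRules": [
        {"action": "Restart", "exitCodes": {"operator": "In", "values": [42]}}
    ],
}

print(should_restart(trainer, 42))  # True: retriable error, restart in place
print(should_restart(trainer, 1))   # False: other failures are left to the Job logic
```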

This work was done as part of KEP #5307 led by SIG Node.

CSI driver opt-in for service account tokens via secrets field

Providing ServiceAccount tokens to Container Storage Interface (CSI) drivers has traditionally relied on injecting them into the volume_context field. This approach presents a significant security risk because volume_context is intended for non-sensitive configuration data and is frequently logged in plain text by drivers and debugging tools, potentially leaking credentials.

Kubernetes v1.35 introduces an opt-in mechanism for CSI drivers to receive ServiceAccount tokens via the dedicated secrets field in the NodePublishVolume request. Drivers can now enable this behavior by setting the serviceAccountTokenInSecrets field to true in their CSIDriver object, instructing the kubelet to populate the token securely.

The primary benefit is the prevention of accidental credential exposure in logs and error messages. This change ensures that sensitive workload identities are handled via the appropriate secure channels, aligning with best practices for secret management while maintaining backward compatibility for existing drivers.

This work was done as part of KEP #5538 led by SIG Auth in cooperation with SIG Storage.

Deployment status: count of terminating replicas

Historically, the Deployment status provided details on available and updated replicas but lacked explicit visibility into Pods that were in the process of shutting down. A drawback of this omission was that users and controllers could not easily distinguish between a stable Deployment and one that still had Pods executing cleanup tasks or adhering to long grace periods.

Kubernetes v1.35 promotes the terminatingReplicas field within the Deployment status to beta. This field provides a count of Pods that have a deletion timestamp set but have not yet been removed from the system. This feature is a foundational step in a larger initiative to improve how Deployments handle Pod replacement, laying the groundwork for future policies regarding when to create new Pods during a rollout.

The primary benefit is improved observability for lifecycle management tools and operators. By exposing the number of terminating Pods, external systems can now make more informed decisions, such as waiting for a complete shutdown before proceeding with subsequent tasks, without needing to manually query and filter individual Pod lists.
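For comparison, the manual filtering that this field replaces can be sketched as follows. The Pod objects are plain dictionaries shaped like API responses, and the helper name count_terminating is hypothetical.

```python
def count_terminating(pods: list[dict]) -> int:
    """Count Pods that have a deletion timestamp set but still exist."""
    return sum(
        1 for pod in pods
        if pod.get("metadata", {}).get("deletionTimestamp")
    )


# Hypothetical Pod list as a client might receive it from the API server.
pods = [
    {"metadata": {"name": "web-1"}},
    {"metadata": {"name": "web-2", "deletionTimestamp": "2025-12-17T10:00:00Z"}},
    {"metadata": {"name": "web-3", "deletionTimestamp": "2025-12-17T10:00:05Z"}},
]

print(count_terminating(pods))
```

With terminatingReplicas in the Deployment status, a tool can read this count directly instead of listing and filtering Pods itself.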

This work was done as part of KEP #3973 led by SIG Apps.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.35 release.

Gang scheduling support in Kubernetes

Scheduling interdependent workloads, such as AI/ML training jobs or HPC simulations, has traditionally been challenging because the default Kubernetes scheduler places Pods individually. This often leads to partial scheduling where some Pods start while others wait indefinitely for resources, resulting in deadlocks and wasted cluster capacity.

Kubernetes v1.35 introduces native support for so-called "gang scheduling" via the new Workload API and PodGroup concept. This feature implements an "all-or-nothing" scheduling strategy, ensuring that a defined group of Pods is scheduled only if the cluster has sufficient resources to accommodate the entire group simultaneously.

The primary benefit is improved reliability and efficiency for batch and parallel workloads. By preventing partial deployments, it eliminates resource deadlocks and ensures that expensive cluster capacity is utilized only when a complete job can run, significantly optimizing the orchestration of large-scale data processing tasks.
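The all-or-nothing idea can be illustrated with a toy placement routine. This is only a sketch under simplifying assumptions (a single CPU dimension and greedy first-fit placement); it is not the kube-scheduler's algorithm, and the function name schedule_gang is made up for the example.

```python
def schedule_gang(pod_cpu_requests: list[int], node_free_cpu: dict[str, int]):
    """Place every Pod of the group or none of them.

    Returns {pod_index: node_name} when the whole group fits, otherwise None.
    """
    free = dict(node_free_cpu)  # tentative copy; the caller's view is untouched
    placement = {}
    # Place the largest requests first to reduce fragmentation.
    order = sorted(range(len(pod_cpu_requests)), key=lambda i: -pod_cpu_requests[i])
    for idx in order:
        need = pod_cpu_requests[idx]
        node = next((name for name, cpu in free.items() if cpu >= need), None)
        if node is None:
            return None  # one Pod does not fit, so the whole group is rejected
        free[node] -= need
        placement[idx] = node
    return placement


nodes = {"node-a": 4, "node-b": 2}
print(schedule_gang([2, 2, 2], nodes))     # whole group fits
print(schedule_gang([2, 2, 2, 2], nodes))  # group is rejected as a unit
```

Because placement happens on a tentative copy, a rejected group leaves no partially started Pods holding capacity.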

This work was done as part of KEP #4671 led by SIG Scheduling.

Constrained impersonation

Historically, the impersonate verb in Kubernetes RBAC functioned on an all-or-nothing basis: once a user was authorized to impersonate a target identity, they gained all associated permissions. A drawback of this broad authorization was that it violated the principle of least privilege, preventing administrators from restricting impersonators to specific actions or resources.

Kubernetes v1.35 introduces a new alpha feature, constrained impersonation, which adds a secondary authorization check to the impersonation flow. When enabled via the ConstrainedImpersonation feature gate, the API server verifies not only the basic impersonate permission but also checks if the impersonator is authorized for the specific action using new verb prefixes (e.g., impersonate-on:<mode>:<verb>). This allows administrators to define fine-grained policies—such as permitting a support engineer to impersonate a cluster admin solely to view logs, without granting full administrative access.

This work was done as part of KEP #5284 led by SIG Auth.

Flagz for Kubernetes components

Verifying the runtime configuration of Kubernetes components, such as the API server or kubelet, has traditionally required privileged access to the host node or process arguments. To address this, the /flagz endpoint was introduced to expose command-line options via HTTP. However, its output was initially limited to plain text, making it difficult for automated tools to parse and validate configurations reliably.

In Kubernetes v1.35, the /flagz endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request a versioned JSON response using standard HTTP content negotiation, while the original plain text format remains available for human inspection. This update significantly improves observability and compliance workflows, allowing external systems to programmatically audit component configurations without fragile text parsing or direct infrastructure access.

This work was done as part of KEP #4828 led by SIG Instrumentation.

Statusz for Kubernetes components

Troubleshooting Kubernetes components like the kube-apiserver or kubelet has traditionally involved parsing unstructured logs or text output, which is brittle and difficult to automate. While a basic /statusz endpoint existed previously, it lacked a standardized, machine-readable format, limiting its utility for external monitoring systems.

In Kubernetes v1.35, the /statusz endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request this format using standard HTTP content negotiation to retrieve precise status data—such as version information and health indicators—without relying on fragile text parsing. This improvement provides a reliable, consistent interface for automated debugging and observability tools across all core components.

This work was done as part of KEP #4827 led by SIG Instrumentation.

CCM: watch-based route controller reconciliation using informers

Managing network routes within cloud environments has traditionally relied on the Cloud Controller Manager (CCM) periodically polling the cloud provider's API to verify and update route tables. This fixed-interval reconciliation approach can be inefficient, often generating a high volume of unnecessary API calls and introducing latency between a node state change and the corresponding route update.

For the Kubernetes v1.35 release, the cloud-controller-manager library introduces a watch-based reconciliation strategy for the route controller. Instead of relying on a timer, the controller now utilizes informers to watch for specific Node events, such as additions, deletions, or relevant field updates, and triggers route synchronization only when a change actually occurs.

The primary benefit is a significant reduction in cloud provider API usage, which lowers the risk of hitting rate limits and reduces operational overhead. Additionally, this event-driven model improves the responsiveness of the cluster's networking layer by ensuring that route tables are updated immediately following changes in cluster topology.
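The feature announcement names .spec.podCIDRs and .status.addresses as the Node fields whose updates trigger a reconcile. A small Python sketch of such an event filter might look like this; the function name and the dictionary-shaped Node objects are illustrative assumptions, while the real controller is written in Go on top of informers.

```python
from typing import Optional


def needs_route_reconcile(old: Optional[dict], new: Optional[dict]) -> bool:
    """Decide whether a Node event should trigger route reconciliation."""
    if old is None or new is None:
        return True  # Node added (no old object) or deleted (no new object)
    cidrs_changed = old.get("spec", {}).get("podCIDRs") != new.get("spec", {}).get("podCIDRs")
    addrs_changed = old.get("status", {}).get("addresses") != new.get("status", {}).get("addresses")
    return cidrs_changed or addrs_changed


before = {
    "metadata": {"labels": {"zone": "a"}},
    "spec": {"podCIDRs": ["10.0.1.0/24"]},
    "status": {"addresses": [{"type": "InternalIP", "address": "192.0.2.10"}]},
}
relabeled = {**before, "metadata": {"labels": {"zone": "b"}}}
recidred = {**before, "spec": {"podCIDRs": ["10.0.2.0/24"]}}

print(needs_route_reconcile(before, relabeled))  # label change only
print(needs_route_reconcile(before, recidred))   # pod CIDR change
```

Updates that touch only unrelated fields, such as labels, are ignored, which is where the savings in cloud provider API calls come from.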

This work was done as part of KEP #5237 led by SIG Cloud Provider.

Extended toleration operators for threshold-based placement

Kubernetes v1.35 introduces SLA-aware scheduling by enabling workloads to express reliability requirements. The feature adds numeric comparison operators to tolerations, allowing pods to match or avoid nodes based on SLA-oriented taints such as service guarantees or fault-domain quality.

The primary benefit is enhancing the scheduler with more precise placement. Critical workloads can demand higher-SLA nodes, while lower priority workloads can opt into lower SLA ones. This improves utilization and reduces cost without compromising reliability.

This work was done as part of KEP #5471 led by SIG Scheduling.

Mutable container resources when Job is suspended

Running batch workloads often involves trial and error with resource limits. Currently, the Job specification is immutable, meaning that if a Job fails due to an Out of Memory (OOM) error or insufficient CPU, the user cannot simply adjust the resources; they must delete the Job and create a new one, losing the execution history and status.

Kubernetes v1.35 introduces the capability to update resource requests and limits for Jobs that are in a suspended state. Enabled via the MutableJobPodResourcesForSuspendedJobs feature gate, this enhancement allows users to pause a failing Job, modify its Pod template with appropriate resource values, and then resume execution with the corrected configuration.
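The rule can be modeled in a few lines of Python. This sketch only mimics the validation idea (resource changes are accepted while spec.suspend is true); the helper name update_job_resources and the dictionary-shaped Job are assumptions for illustration and do not reflect the API server's implementation.

```python
import copy


def update_job_resources(job: dict, new_resources: dict) -> dict:
    """Return a copy of the Job with updated container resources.

    new_resources maps container names to a resources block. The change is
    rejected unless the Job is suspended.
    """
    if not job["spec"].get("suspend", False):
        raise ValueError("resources can only be changed while the Job is suspended")
    updated = copy.deepcopy(job)
    for container in updated["spec"]["template"]["spec"]["containers"]:
        if container["name"] in new_resources:
            container["resources"] = new_resources[container["name"]]
    return updated


job = {
    "spec": {
        "suspend": True,
        "template": {"spec": {"containers": [
            {"name": "worker", "resources": {"limits": {"memory": "512Mi"}}}
        ]}},
    }
}

fixed = update_job_resources(job, {"worker": {"limits": {"memory": "2Gi"}}})
print(fixed["spec"]["template"]["spec"]["containers"][0]["resources"])
```

After the update, setting suspend back to false resumes the same Job object with the corrected values.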

The primary benefit is a smoother recovery workflow for misconfigured jobs. By allowing in-place corrections during suspension, users can resolve resource bottlenecks without disrupting the Job's lifecycle identity or losing track of its completion status, significantly improving the developer experience for batch processing.

This work was done as part of KEP #5440 led by SIG Apps.

Other notable changes

Continued innovation in Dynamic Resource Allocation (DRA)

The core functionality graduated to stable in v1.34, with the ability to turn it off. In v1.35 it is always enabled. Several alpha features have also been significantly improved and are ready for testing. We encourage users to provide feedback on these capabilities to help clear the path for their promotion to beta in upcoming releases.

Extended Resource Requests via DRA

Several functional gaps compared to Extended Resource requests via Device Plugins were addressed, for example scoring and reuse of devices in init containers.

Device Taints and Tolerations

The new "None" effect can be used to report a problem without immediately affecting scheduling or running pods. DeviceTaintRule now provides status information about an ongoing eviction. The "None" effect can be used for a "dry run" before actually evicting pods:

Create DeviceTaintRule with "effect: None".
Check the status to see how many pods would be evicted.
Replace "effect: None" with "effect: NoExecute".

Partitionable Devices

Devices belonging to the same partitionable device may now be defined in different ResourceSlices. You can read more in the official documentation.

Consumable Capacity, Device Binding Conditions

Several bugs were fixed and/or more tests added. You can learn more about Consumable Capacity and Binding Conditions in the official documentation.

Comparable resource version semantics

Kubernetes v1.35 changes the way that clients are allowed to interpret resource versions.

Before v1.35, the only supported comparison that clients could make was to check for string equality: if two resource versions were equal, they were the same. Clients could also provide a resource version to the API server and ask the control plane to do internal comparisons, such as streaming all events since a particular resource version.

In v1.35, all in-tree resource versions meet a new stricter definition: the values are a special form of decimal number. And, because they can be compared, clients can do their own operations to compare two different resource versions. For example, this means that a client reconnecting after a crash can detect when it has lost updates, as distinct from the case where there has been an update but no lost changes in the meantime.

This change in semantics enables other important use cases such as storage version migration, performance improvements to informers (a client helper concept), and controller reliability. All of those cases require knowing whether one resource version is newer than another.
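As an illustration of what a client-side comparison might look like, the following Python sketch treats resource versions as decimal integers, as described above. The function name is hypothetical, and values that are not plain decimal digits are rejected rather than guessed at, since API servers outside the Kubernetes tree may not follow the stricter definition.

```python
def compare_resource_versions(a: str, b: str) -> int:
    """Return -1, 0, or 1 if a is older than, equal to, or newer than b."""
    for value in (a, b):
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"not a decimal resource version: {value!r}")
    x, y = int(a), int(b)  # Python integers have arbitrary precision
    return (x > y) - (x < y)


print(compare_resource_versions("9", "10"))    # numeric order, not string order
print(compare_resource_versions("205", "205"))
```

Comparing the raw strings would order "9" after "10", which is why the values need to be interpreted as numbers.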

This work was done as part of KEP #5504 led by SIG API Machinery.

Graduations, deprecations, and removals in v1.35

Graduations to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 15 enhancements promoted to stable.

Deprecations, removals and community updates

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.35 includes a couple of deprecations.

Ingress NGINX retirement

For years, the Ingress NGINX controller has been a popular choice for routing traffic into Kubernetes clusters. It was flexible, widely adopted, and served as the standard entry point for countless applications.

However, maintaining the project has become unsustainable. With a severe shortage of maintainers and mounting technical debt, the community recently made the difficult decision to retire it. This isn't strictly part of the v1.35 release, but it's such an important change that we wanted to highlight it here.

Consequently, the Kubernetes project announced that Ingress NGINX will receive only best-effort maintenance until March 2026. After this date, it will be archived with no further updates. The recommended path forward is to migrate to the Gateway API, which offers a more modern, secure, and extensible standard for traffic management.

You can find more in the official blog post.

Removal of cgroup v1 support

When it comes to managing resources on Linux nodes, Kubernetes has historically relied on cgroups (control groups). While the original cgroup v1 was functional, it was often inconsistent and limited. That is why Kubernetes introduced support for cgroup v2 back in v1.25, offering a much cleaner, unified hierarchy and better resource isolation.

Because cgroup v2 is now the modern standard, Kubernetes is ready to retire the legacy cgroup v1 support in v1.35. This is an important notice for cluster administrators: if you are still running nodes on older Linux distributions that don't support cgroup v2, your kubelet will fail to start. To avoid downtime, you will need to migrate those nodes to systems where cgroup v2 is enabled.

To learn more, read about cgroup v2;
you can also track the switchover work via KEP-5573: Remove cgroup v1 support.
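One way to check a node before upgrading is to look for the cgroup.controllers file at the root of the cgroup mount, which is present on the cgroup v2 unified hierarchy. The Python sketch below assumes the conventional mount point /sys/fs/cgroup; the function name is made up for this example, and the path is a parameter so the check can be pointed elsewhere.

```python
from pathlib import Path


def uses_cgroup_v2(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """Return True if the given mount looks like a cgroup v2 unified hierarchy."""
    # cgroup v2 exposes cgroup.controllers at the root of its hierarchy.
    return (Path(cgroup_root) / "cgroup.controllers").is_file()


if __name__ == "__main__":
    if uses_cgroup_v2():
        print("cgroup v2 detected")
    else:
        print("cgroup v2 not detected: check this node before upgrading")
```

Running this on each node (or an equivalent check in your node tooling) gives a quick inventory of hosts that still need migration.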

Deprecation of ipvs mode in kube-proxy

Years ago, Kubernetes adopted the ipvs mode in kube-proxy to provide faster load balancing than the standard iptables. While it offered a performance boost, keeping it in sync with evolving networking requirements created too much technical debt and complexity.

Because of this maintenance burden, Kubernetes v1.35 deprecates ipvs mode. Although the mode remains available in this release, kube-proxy will now emit a warning on startup when configured to use it. The goal is to streamline the codebase and focus on modern standards. For Linux nodes, you should begin transitioning to nftables, which is now the recommended replacement.

You can find more in KEP-5495: Deprecate ipvs mode in kube-proxy.

Final call for containerd v1.X

While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases, this is the final version with such support. The SIG Node community has designated v1.35 as the last release to support the containerd v1.X series.

This serves as an important reminder: before upgrading to the next Kubernetes version, you must switch to containerd 2.0 or later. To help identify which nodes need attention, you can monitor the kubelet_cri_losing_support metric within your cluster.

You can find more in the official blog post or in KEP-4033: Discover cgroup driver from CRI.

Improved Pod stability during kubelet restarts

Previously, restarting the kubelet service often caused a temporary disruption in Pod status. During a restart, the kubelet would reset container states, causing healthy Pods to be marked as NotReady and removed from load balancers, even if the application itself was still running correctly.

To address this reliability issue, this behavior has been corrected to ensure seamless node maintenance. The kubelet now properly restores the state of existing containers from the runtime upon startup. This ensures that your workloads remain Ready and traffic continues to flow uninterrupted during kubelet restarts or upgrades.

You can find more in KEP-4781: Fix inconsistent container ready state after kubelet restart.

Release notes

Check out the full details of the Kubernetes v1.35 release in our release notes.

Availability

Kubernetes v1.35 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.35 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We honor the memory of Han Kang, a long-time contributor and respected engineer whose technical excellence and infectious enthusiasm left a lasting impact on the Kubernetes community. Han was a significant force within SIG Instrumentation and SIG API Machinery, earning a 2021 Kubernetes Contributor Award for his critical work and sustained commitment to the project's core stability. Beyond his technical contributions, Han was deeply admired for his generosity as a mentor and his passion for building connections among people. He was known for "opening doors" for others, whether guiding new contributors through their first pull requests or supporting colleagues with patience and kindness. Han’s legacy lives on through the engineers he inspired, the robust systems he helped build, and the warm, collaborative spirit he fostered within the cloud native ecosystem.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.35 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. We are incredibly grateful to our Release Lead, Drew Hagen, whose hands-on guidance and vibrant energy not only navigated us through complex challenges but also fueled the community spirit behind this successful release.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.35 release cycle, which spanned 14 weeks from 15th September 2025 to 17th December 2025, Kubernetes received contributions from as many as 85 different companies and 419 individuals. In the wider cloud native ecosystem, the figure goes up to 281 companies, counting 1769 total contributors.

Note that a "contribution" is counted when someone makes a commit, reviews code or a PR (including blogs and documentation), creates an issue or PR, or comments on issues and PRs.
If you are interested in contributing, visit Getting Started on our contributor website.

Sources for this data:

Events update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

February 2026

KCD - Kubernetes Community Days: New Delhi: Feb 21, 2026 | New Delhi, India
KCD - Kubernetes Community Days: Guadalajara: Feb 23, 2026 | Guadalajara, Mexico

March 2026

KubeCon + CloudNativeCon Europe 2026: Mar 23-26, 2026 | Amsterdam, Netherlands

May 2026

KCD - Kubernetes Community Days: Toronto: May 13, 2026 | Toronto, Canada
KCD - Kubernetes Community Days: Helsinki: May 20, 2026 | Helsinki, Finland

June 2026

KubeCon + CloudNativeCon China 2026: Jun 10-11, 2026 | Hong Kong
KubeCon + CloudNativeCon India 2026: Jun 18-19, 2026 | Mumbai, India
KCD - Kubernetes Community Days: Kuala Lumpur: Jun 27, 2026 | Kuala Lumpur, Malaysia

July 2026

KubeCon + CloudNativeCon Japan 2026: Jul 29-30, 2026 | Yokohama, Japan

You can find the latest event details here.

Upcoming release webinar

Join members of the Kubernetes v1.35 Release Team on Wednesday, January 14, 2026, at 5:00 PM (UTC) to learn about the highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Follow us on Bluesky @Kubernetesio for the latest updates
Join the community discussion on Discuss
Join the community on Slack
Post questions (or answer questions) on Stack Overflow
Share your Kubernetes story
Read more about what’s happening with Kubernetes on the blog
Learn more about the Kubernetes Release Team

Categories: CNCF Projects, Kubernetes

Avoiding Zombie Cluster Members When Upgrading to etcd v3.6

etcd Blog - Tue, 12/16/2025 - 19:00

Summary: SIG-etcd has patched another potential issue blocking upgrades from v3.5 to v3.6. If you are upgrading, make sure to update to v3.5.26 or later first.

Issue Summary

Recently, the etcd community addressed an issue that may appear when users upgrade from v3.5 to v3.6. This bug can cause the cluster to report “zombie members”, which are etcd nodes that were removed from the database cluster some time ago, and are re-appearing and joining database consensus. The etcd cluster is then inoperable until these zombie members are removed.

Categories: CNCF Projects

Introducing the Experimental info() Function

Prometheus Blog - Mon, 12/15/2025 - 19:00

Enriching metrics with metadata labels can be surprisingly tricky in Prometheus, even if you're a PromQL wiz! The PromQL join query traditionally used for this is inherently quite complex because it has to specify the labels to join on, the info metric to join with, and the labels to enrich with. The new, still experimental info() function promises a simpler way, making label enrichment as simple as wrapping your query in a single function call.

In Prometheus 3.0, we introduced the info() function, a powerful new way to enrich your time series with labels from info metrics. What's special about info() versus the traditional join query technique is that it relieves you from having to specify identifying labels, which info metric(s) to join with, and the ("data" or "non-identifying") labels to enrich with. Note that "identifying labels" in this particular context refers to the set of labels that identify the info metrics in question, and are shared with associated non-info metrics. They are the labels you would join on in a Prometheus join query. Conceptually, they can be compared to foreign keys in relational databases.

Beyond the main functionality, info() also solves a subtle yet critical problem that has plagued join queries for years: The "churn problem" that causes queries to fail when non-identifying info metric labels change, combined with missing staleness marking (as is the case with OTLP ingestion).

Whether you're working with OpenTelemetry resource attributes, Kubernetes labels, or any other metadata, the info() function makes your PromQL queries cleaner, more reliable, and easier to understand.

The Problem: Complex Joins and The Churn Problem

Let us start by looking at what we have had to do until now. Imagine you're monitoring HTTP request durations via OpenTelemetry and want to break them down by Kubernetes cluster. You push your metrics to Prometheus' OTLP endpoint. Your metrics have job and instance labels, but the cluster name lives in a separate target_info metric, as the k8s_cluster_name label. Here's what the traditional approach looks like:

sum by (http_status_code, k8s_cluster_name) (
    rate(http_server_request_duration_seconds_count[2m])
  * on (job, instance) group_left (k8s_cluster_name)
    target_info
)

While this works, there are several issues:

1. Complexity: You need to know:

Which info metric contains your labels (target_info)
Which labels are the "identifying" labels to join on (job, instance)
Which data labels you want to add (k8s_cluster_name)
The proper PromQL join syntax (on, group_left)

This requires expert-level PromQL knowledge and makes queries harder to read and maintain.

2. The Churn Problem (The Critical Issue):

Here's the subtle but serious problem: What happens when an OTel resource attribute changes in a Kubernetes container, while the identifying resource attributes stay the same? An example could be the resource attribute k8s.pod.labels.app.kubernetes.io/version. Then the corresponding target_info label k8s_pod_labels_app_kubernetes_io_version changes, and Prometheus sees a completely new target_info time series.

As the OTLP endpoint doesn't mark the old target_info series as stale, both the old and new series can exist simultaneously for up to 5 minutes (the default lookback delta). During this overlap period, your join query finds two distinct matching target_info time series and fails with a "many-to-many matching" error.

This could in practice mean your dashboards break and your alerts stop firing when infrastructure changes are happening, perhaps precisely when you would need visibility the most.

The Info Function Presents a Solution

The previous join query can be converted to use the info function as follows:

sum by (http_status_code, k8s_cluster_name) (
  info(rate(http_server_request_duration_seconds_count[2m]))
)

Much more comprehensible, isn't it? As for the churn problem, the real magic happens under the hood: info() automatically selects the time series with the latest sample, eliminating churn-related join failures entirely. Note that this call to info() returns all data labels from target_info, but it doesn't matter because we aggregate them away with sum.
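To see why picking the latest sample sidesteps the churn problem, consider this toy Python model of the two approaches. It is a simplification for illustration, not Prometheus code: series are plain dictionaries, the join copies all data labels instead of a group_left list, and returning the input unchanged when no info series matches is an assumption.

```python
IDENTIFYING = ("job", "instance")


def _matches(sample_labels: dict, info_series: list[dict]) -> list[dict]:
    """Info series whose identifying labels equal those of the sample."""
    return [s for s in info_series
            if all(s["labels"].get(k) == sample_labels.get(k) for k in IDENTIFYING)]


def _data_labels(series: dict) -> dict:
    """Non-identifying labels of an info series."""
    return {k: v for k, v in series["labels"].items()
            if k not in IDENTIFYING and k != "__name__"}


def traditional_join(sample_labels: dict, info_series: list[dict]):
    """Model of a join on (job, instance): fails when two info series match."""
    found = _matches(sample_labels, info_series)
    if len(found) > 1:
        raise ValueError("many-to-many matching")
    if not found:
        return None  # a join drops samples that have no match
    return {**sample_labels, **_data_labels(found[0])}


def info_enrich(sample_labels: dict, info_series: list[dict]) -> dict:
    """Model of info(): use the matching info series with the latest sample."""
    found = _matches(sample_labels, info_series)
    if not found:
        return dict(sample_labels)  # assumption: returned unchanged
    latest = max(found, key=lambda s: s["timestamp"])
    return {**sample_labels, **_data_labels(latest)}


base = {"__name__": "target_info", "job": "shop/checkout", "instance": "pod-1",
        "k8s_cluster_name": "us-east-0"}
old = {"labels": {**base, "version": "1.0"}, "timestamp": 100}
new = {"labels": {**base, "version": "1.1"}, "timestamp": 160}
sample = {"job": "shop/checkout", "instance": "pod-1", "http_status_code": "200"}

print(info_enrich(sample, [old, new])["version"])
```

During the overlap window both old and new match the sample, so traditional_join raises while info_enrich quietly uses the newer series.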

Basic Syntax

info(v instant-vector, [data-label-selector instant-vector])

v: The instant vector to enrich with metadata labels
data-label-selector (optional): Label matchers in curly braces to filter which labels to include

In its most basic form, omitting the second parameter, info() adds all data labels from target_info:

info(rate(http_server_request_duration_seconds_count[2m]))

The second parameter, on the other hand, lets you control which data labels to include from target_info:

info(
  rate(http_server_request_duration_seconds_count[2m]),
  {k8s_cluster_name=~".+"}
)

In the example above, info() includes the k8s_cluster_name data label from target_info. Because the selector matches any non-empty string, it will include any k8s_cluster_name label value.

It's also possible to filter which k8s_cluster_name label values to include:

info(
  rate(http_server_request_duration_seconds_count[2m]),
  {k8s_cluster_name="us-east-0"}
)

Selecting Different Info Metrics

By default, info() uses the target_info metric. However, you can select different info metrics (like build_info or node_uname_info) by including a __name__ matcher in the data-label-selector:

# Use build_info instead of target_info
info(up, {__name__="build_info"})

# Use multiple info metrics (combines labels from both)
info(up, {__name__=~"(target|build)_info"})

# Select build_info and only include the version label
info(up, {__name__="build_info", version=~".+"})

Note: The current implementation always uses job and instance as the identifying labels for joining, regardless of which info metric you select. This works well for most standard info metrics but may have limitations with custom info metrics that use different identifying labels. An example of an info metric whose identifying labels differ from job and instance is kube_pod_labels: its identifying labels are namespace and pod. The intention is that a future info() will know which metrics in the TSDB are info metrics and what the identifying labels of each one are, and will automatically use all of them unless the selection is explicitly restricted by a name matcher like the above.

Real-World Use Cases

OpenTelemetry Integration

The primary driver for the info() function is OpenTelemetry (OTel) integration. When using Prometheus as an OTel backend, resource attributes (metadata about the metrics producer) are automatically converted to the target_info metric:

service.instance.id → instance label
service.name → job label
service.namespace → prefixed to job (i.e., <namespace>/<service.name>)
All other resource attributes → data labels on target_info

This means that, so long as at least one of the service.instance.id and service.name resource attributes is included, every OTel metric you send to Prometheus over OTLP can be enriched with resource attributes using info():

# Add all OTel resource attributes
info(rate(http_server_request_duration_seconds_sum[5m]))

# Add only specific attributes
info(
  rate(http_server_request_duration_seconds_sum[5m]),
  {k8s_cluster_name=~".+", k8s_namespace_name=~".+", k8s_pod_name=~".+"}
)
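The attribute-to-label mapping listed above can be sketched in Python as follows. This is an illustrative approximation only: the function name is hypothetical, and the sanitization rule (replace every character outside letters, digits, and underscore with an underscore) is an assumption inferred from the k8s_pod_labels_app_kubernetes_io_version example, so the actual translation in Prometheus may differ in details.

```python
import re


def otel_resource_to_labels(attrs: dict) -> tuple[dict, dict]:
    """Split OTel resource attributes into identifying and data labels."""
    identifying = {}
    name = attrs.get("service.name")
    namespace = attrs.get("service.namespace")
    if name is not None:
        # service.namespace, when present, is prefixed to the job label.
        identifying["job"] = f"{namespace}/{name}" if namespace else name
    if "service.instance.id" in attrs:
        identifying["instance"] = attrs["service.instance.id"]

    consumed = {"service.name", "service.namespace", "service.instance.id"}
    data = {
        re.sub(r"[^a-zA-Z0-9_]", "_", key): str(value)
        for key, value in attrs.items()
        if key not in consumed
    }
    return identifying, data


resource = {
    "service.name": "checkout",
    "service.namespace": "shop",
    "service.instance.id": "pod-1",
    "k8s.cluster.name": "us-east-0",
    "k8s.pod.labels.app.kubernetes.io/version": "1.2",
}
identifying, data = otel_resource_to_labels(resource)
print(identifying)
print(data)
```

The identifying labels end up on both the metric and target_info, while the data labels live only on target_info until info() copies them over.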

Build Information

Enrich your metrics with build-time information:

# Add version and branch information to request rates
sum by (job, http_status_code, version, branch) (
  info(
    rate(http_server_request_duration_seconds_count[2m]),
    {__name__="build_info"}
  )
)

Filter on Producer Version

Pick only metrics from certain producer versions:

sum by (job, http_status_code, version) (
  info(
    rate(http_server_request_duration_seconds_count[2m]),
    {__name__="build_info", version=~"2\\..+"}
  )
)

Before and After: Side-by-Side Comparison

Let's see how the info() function simplifies real queries:

Example 1: OpenTelemetry Resource Attribute Enrichment

Traditional approach:

sum by (http_status_code, k8s_cluster_name, k8s_namespace_name, k8s_container_name) (
    rate(http_server_request_duration_seconds_count[2m])
  * on (job, instance) group_left (k8s_cluster_name, k8s_namespace_name, k8s_container_name)
    target_info
)

With info():

sum by (http_status_code, k8s_cluster_name, k8s_namespace_name, k8s_container_name) (
  info(rate(http_server_request_duration_seconds_count[2m]))
)

The intent is much clearer with info(): we're enriching http_server_request_duration_seconds_count with Kubernetes-related OpenTelemetry resource attributes.

Example 2: Filtering by Label Value

Traditional approach:

sum by (http_status_code, k8s_cluster_name) (
    rate(http_server_request_duration_seconds_count[2m])
  * on (job, instance) group_left (k8s_cluster_name)
    target_info{k8s_cluster_name=~"us-.*"}
)

With info():

sum by (http_status_code, k8s_cluster_name) (
  info(
    rate(http_server_request_duration_seconds_count[2m]),
    {k8s_cluster_name=~"us-.*"}
  )
)

Here we filter to only include metrics from clusters in the US (whose names start with us-). The info() version integrates the filter naturally into the data-label-selector.

Technical Benefits

Beyond the fundamental UX benefits, the info() function provides several technical advantages:

1. Automatic Churn Handling

As previously mentioned, info() automatically picks the matching info time series with the latest sample when multiple versions exist. This eliminates the "many-to-many matching" errors that plague traditional join queries during churn.

How it works: When non-identifying info metric labels change (e.g., a pod is re-created), there's a brief period where both old and new series might exist. The info() function simply selects whichever has the most recent sample, ensuring your queries keep working.

2. Better Performance

The info() function is more efficient than traditional joins:

Only selects matching info series
Avoids unnecessary label matching operations
Optimized query execution path

Getting Started

The info() function is experimental and must be enabled via a feature flag:

prometheus --enable-feature=promql-experimental-functions

Once enabled, you can start using it immediately.

Current Limitations and Future Plans

The current implementation is an MVP (Minimum Viable Product) designed to validate the approach and gather user feedback. The implementation has some intentional limitations:

Current Constraints

Default info metric: Only considers target_info by default
- Workaround: You can use __name__ matchers like {__name__=~"(target|build)_info"} in the data-label-selector, though this still assumes job and instance as identifying labels
Fixed identifying labels: Always assumes job and instance are the identifying labels for joining
- This unfortunately makes info() unsuitable for certain scenarios, e.g. including data labels from kube_pod_labels, but it's a problem we want to solve in the future

Future Development

These limitations are meant to be temporary. The experimental status allows us to:

Gather real-world usage feedback
Understand which use cases matter the most
Iterate on the design before committing to a final API

A future version of the info() function should:

Consider all info metrics by default (not just target_info)
Automatically understand identifying labels based on info metric metadata

Important: Because this is an experimental feature, the behavior may change in future Prometheus versions, or the function could potentially be removed from PromQL entirely based on user feedback.

Giving Feedback

Your feedback will directly shape the future of this feature and help us determine whether it should become a permanent part of PromQL. Feedback may be provided e.g. through our community connections or by opening a Prometheus issue.

We encourage you to try the info() function and share your feedback:

What use cases does it solve for you?
What additional functionality would you like to see?
How could the API be improved?
Do you see improved performance?

Conclusion

The experimental info() function represents a significant step forward in making PromQL more accessible and reliable. By simplifying metadata label enrichment and automatically handling the churn problem, it removes two major pain points for Prometheus users, especially those adopting OpenTelemetry.

Please feel welcome to share your thoughts with the Prometheus community on GitHub Discussions or get in touch with us on the CNCF Slack #prometheus channel.

Happy querying!

Categories: CNCF Projects

Building platforms using kro for composition

CNCF Blog Projects Category - Mon, 12/15/2025 - 07:00

Recent industry developments, such as Amazon’s announcement of the new EKS capabilities, highlight a trend toward supporting platforms with managed GitOps, cloud resource operators, and composition tooling. In particular, the involvement of Kube Resource Orchestrator (kro)—a young, cross-cloud initiative—reflects growing ecosystem interest in simplifying Kubernetes-native resource grouping. Its inclusion in the capabilities package signals that major cloud providers recognize the value of the SIG Cloud Provider–maintained project and its potential role in future platform-engineering workflows.

This is a win for platform engineers. The composition of Kubernetes resources is becoming increasingly important as declarative Infrastructure as Code (IaC) tooling expands the number of objects we manage. For example, the CNCF graduated project Crossplane and cloud-specific alternatives such as AWS Controllers for Kubernetes (ACK), which is packaged with EKS Capabilities, can each add hundreds or even thousands of new CRDs to a cluster.

With composition available as a managed service, platform teams can focus on their mission to build what is unique to their business but common to their teams. They achieve this by combining composition with encapsulation of all associated processes and decoupled delivery across any target environment.

The rise of Kubernetes-native composition

The core value of kro lies in the idea of a ResourceGraphDefinition. Each definition abstracts many Kubernetes objects behind a single API. This API specifies what users may configure when requesting an instance, which resources are created per request, how those sub-resources depend on each other, and what status should be exposed back to the users and dependent resources. kro then acts as a controller that responds to these definitions by creating a new user-facing CRD and managing requests against it through an optimized resource DAG. This abstraction can reduce reliance on tools such as Helm, Kustomize, or hand-written operators when creating consistent patterns.
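To make the dependency-graph idea concrete, here is a minimal Python sketch that orders sub-resources so each one is created after the resources it depends on. It uses a generic topological sort from the standard library; the source does not describe kro's actual algorithm, and the resource names and the `creation_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical sub-resources of one definition, mapped to what each depends on.
dependencies = {
    "namespace": set(),
    "configmap": {"namespace"},
    "deployment": {"namespace", "configmap"},
    "service": {"deployment"},
}


def creation_order(deps):
    """Return resource names in an order that satisfies every dependency."""
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(deps).static_order())


order = creation_order(dependencies)
print(order)
```

Deletion would typically walk the same order in reverse, so dependents are removed before the resources they rely on.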

The collaboration among the cloud vendors contributing to kro, and their investment in it, is a bright sign for our industry. However, challenges remain for end users adopting these frameworks. It can often feel like they are trapped in the “How to draw an owl” meme, where kro helps teams sketch the ovals for the head and body, but drawing the rest of the platform owl requires a big leap for the platform engineers doing the work.

Where kro fits in platform design

Effective platforms demonstrate results across three outcomes based on time to delivery:

Time for a user to get a new service they depend on to deliver their value
Time to patch all instances of an existing service or capability
Time to introduce a new business-compliant capability

Across the industry, we see platforms not only improving these metrics but fundamentally shifting beliefs about what is possible. Users are getting the tools they need to take new ideas to production in minutes, not months. A handful of engineers are managing continuous compliance and regular patching. Specialists bring their requirements directly to users without a central team bottleneck.

Universally successful platforms that deliver on these outcomes are designed around three principles:

Composition over simple abstraction

Composition enables teams to build up from low-level components to high-value capabilities through common abstraction APIs. kro’s ResourceGraphDefinitions provide an additional Kubernetes-native approach alongside Crossplane compositions, Helm charts, and Kratix Promises.

Encapsulation of configuration, policy, and process

Enterprise platforms must provide more than resources. They need clear ways to capture all the weird and wonderful (business-critical) requirements and processes they have built over the years. Yes, this can mean declarative code, but also imperative API calls, operational workflows that incorporate manual steps, legacy integrations with off-line systems, and, of course, interactions with non-Kubernetes resources. Safe composition depends on the ability to apply a single testable change that covers all affected systems.

Decoupled delivery across many environments

Organizations of sufficient scale and complexity need to support complex topologies, including multi-cluster Kubernetes and non-Kubernetes-based infrastructure. Platforms need to enable timely upgrades across their entire topology to reduce CVE risk while managing diverse and specialized compute, including modern options like GPUs and Functions-as-a-Service (FaaS), as well as legacy options such as mainframes or Red Hat Virtualization.

Achieving overall scalability, auditability, and resilience requires prioritizing each in the proper context. Centralized planning gives control. Decentralized delivery allows scale. A platform should enable the definition of rules and enforcement in a central orchestrator, then rely on distributed deployment engines to deliver the capability in the correct places and form. This avoids the limits of tightly coupled orchestration and reduces the operational burden of scale.

kro is strong in the first principle. It offers a clear, Kubernetes-native composition that lets teams package complex deployments, hide unnecessary details, and encode organizational defaults. Features such as CEL templating demonstrate a focus on helping engineers manage dependencies across Kubernetes objects when creating higher-level abstractions.

Where platforms need more than kro

It is important to acknowledge that kro does not aim to address the second or third principles. This is not a criticism. It reflects a focused scope, following the Unix philosophy of doing one thing well while integrating cleanly with the wider ecosystem.

kro is a powerful mechanism for packaging resource definitions and orchestrating them within a single cluster. It does not try to manage resources across clusters, handle workflows such as approvals, or integrate with systems such as ServiceNow, mainframes, or proprietary APIs that require imperative actions. The power comes from its Kubernetes-native design, which makes it easy to integrate with scheduling tools, with Kyverno or OPA for policy as code (PaC), and with IaC controllers such as Crossplane.

The harder challenge is how to meet all three principles in a sustainable way. How can you make platform changes that are both quick and safe? The simplest answer is to enable encapsulated and testable packages that allow changes across infrastructure, configuration, policy, and process from a single implementation.

Platform orchestration frameworks—such as Kratix and others in the ecosystem—aim to address these workflow and multi-environment needs. Kratix provides a Kubernetes-native framework for delivering managed services that reflect organizational standards, with support for long-running workflows, integration with enterprise systems, and managed delivery to clusters, airgapped hardware, or mainframes. kro provides composition rather than orchestration, which allows these tools to complement each other.

Looking ahead at a growing ecosystem

The project’s multi-vendor contributions and Kubernetes SIG governance reflect growing community engagement around kro. Many contributors highlight the value of a portable, Kubernetes-native model for grouping and orchestrating resources, and the importance of reducing manual dependency management for platform teams.

The next stage for organizations is understanding how kro fits into their broader architecture. kro is an important tool for composition. Ultimately, platform value comes from tying that composition to capabilities that encapsulate configuration, policy, process workflows, and decoupled deployment across diverse environments.

Emerging standards will help organizations meet the core tests of platform value: safe self-service, consistent compliance, simple fleet upgrades, and a contribution model that scales. With standards come tools that enable platform engineers to continue to reuse capabilities, collaborate more effectively, and deliver predictable behavior across clusters and clouds.

Categories: CNCF Projects

Lima v2.0: New features for secure AI workflows

CNCF Blog Projects Category - Thu, 12/11/2025 - 07:00

On November 6th, the Lima project team shipped the second major release of Lima. In this release, the team are expanding the project focus to cover AI as well as containers.

What is Lima?

Lima (Linux Machines) is a command line tool to launch a local Linux virtual machine, with the primary focus on running containers on a laptop.

The project began in May 2021, with the aim of promoting containerd including nerdctl (contaiNERD CTL) to Mac users. The project joined the CNCF in September 2022 as a Sandbox project, and was promoted to the Incubating level in October 2025. Through the growth of the project, the scope has expanded to support non-container workloads and non-macOS hosts too.

See also: “Lima becomes a CNCF incubating project”.

Updates in v2.0

Plugins

Lima now provides the plugin infrastructure that allows third-parties to implement new features without modifying Lima itself:

VM driver plugins: for additional hypervisors.
CLI plugins: for additional subcommands of the `limactl` command.
URL scheme plugins: for additional URL schemes to be passed as `limactl create SCHEME:SPEC`.

The plugin interfaces are still experimental and subject to change. The interfaces will be stabilized in future releases.

GPU acceleration

Lima now supports the VM driver for krunkit, providing GPU acceleration for Linux VMs running on macOS hosts.

The following screenshot shows that llama.cpp running in Lima detects the Apple M4 Max processor as a virtualized GPU.

Model Context Protocol

Lima now provides Model Context Protocol (MCP) tools for reading, writing, and executing local files using a VM sandbox:

glob
list_directory
read_file
run_shell_command
search_file_content
write_file

Lima’s MCP tools are inspired by Google Gemini CLI’s built-in tools, and can be used as a secure alternative for those built-in tools. See the configuration guide here: https://lima-vm.io/docs/config/ai/outside/mcp/

Other improvements

The `limactl start` command now accepts the `--progress` flag to show the progress of the provisioning scripts.
The `limactl (create|edit|start)` commands now accept the `--mount-only DIR` flag to only mount the specified host directory. In Lima v1.x, this had to be specified in a very complex syntax: `--set ".mounts=[{\"location\":\"$(pwd)\", \"writable\":true}]"`.
The `limactl shell` command now accepts the `--preserve-env` flag to propagate the environment variables from the host to the guest.
UDP ports are now forwarded by default in addition to TCP ports.
Multiple host users can now run Lima simultaneously. This allows running Lima as a separate user account for enhanced security, using “Alcoholless” Homebrew.

See also the release note: https://github.com/lima-vm/lima/releases/tag/v2.0.0 .

We appreciate all the contributors who made this release possible, especially Ansuman Sahoo who contributed the VM driver plugin subsystem and the krunkit VM driver, through the Google Summer of Code (GSoC) 2025.

Expanding the focus to hardening AI

While Lima was originally made for promoting containerd to Mac users, it has proven useful for a variety of other use cases as well. One of the most notable emerging use cases is to run an AI coding agent inside a VM in order to isolate the agent from direct access to host files and commands. This setup ensures that even if an AI agent is deceived by malicious instructions retrieved from the Internet (e.g., fake package installations), any potential damage is confined within the VM or limited to files specified to be mounted from the host.

There are two kinds of scenarios to run an AI agent with Lima: AI inside Lima, and AI outside Lima.

AI inside Lima

This is the most common scenario: just run an AI agent inside Lima. The documentation features several examples of hardening AI agents running in Lima.

A local LLM can be used too, with the GPU acceleration feature available in the krunkit VM driver.

AI outside Lima

This scenario refers to running an AI agent as a host process outside Lima. Lima covers this scenario by providing the MCP tools that intercept file accesses and command executions.

Getting started: AI inside Lima

This section introduces how to run an AI agent (Gemini CLI) inside Lima so as to prevent the AI from directly accessing host files and commands.

If you are using Homebrew, Lima can be installed using:

brew install lima

For other installation methods, see https://lima-vm.io/docs/installation/ .

An instance of the Lima virtual machine can be created and started by running `limactl start`. However, as the default configuration mounts the entire home directory from the host, it is highly recommended to limit the mount scope to the current directory (.), especially when running an AI agent:

mkdir -p ~/test
cd ~/test
limactl start --mount-only .

To allow writing to the mount directory, append the `:w` suffix to the mount specification:

limactl start --mount-only .:w

For example, you can run AI agents such as Gemini CLI. It can be installed and executed inside Lima using the `lima` command as follows:

lima sudo snap install node --classic
lima sudo npm install -g @google/gemini-cli
lima gemini

Gemini CLI can arbitrarily read, write, and execute files inside the VM; however, it cannot access host files other than the mounted ones.

To run other AI agents, see https://lima-vm.io/docs/examples/ai/.

Categories: CNCF Projects

Istio at KubeCon + CloudNativeCon North America 2025: Community highlights and project progress

CNCF Blog Projects Category - Mon, 12/08/2025 - 05:36

KubeCon + CloudNativeCon North America 2025 lit up Atlanta from November 10–13, bringing together one of the largest gatherings of open-source practitioners, platform engineers, and maintainers across the cloud native ecosystem. For the Istio community, the week was defined by packed rooms, long hallway conversations, and a genuine sense of shared progress across service mesh, Gateway API, security, and AI-driven platforms.

Before the main conference began, the community kicked things off with Istio Day on November 10, a colocated event filled with deep technical sessions, migration stories, and future-looking discussions that set the tone for the rest of the week.

Istio Day at KubeCon + CloudNativeCon NA

Istio Day brought together practitioners, contributors, and adopters for an afternoon of learning, sharing, and open conversations about where service mesh, and Istio, are headed next.

Istio Day opened with welcome remarks from the program co-chairs, setting the tone for an afternoon focused on real-world mesh evolution and the rapid growth of the Istio community. The agenda highlighted three major themes driving Istio’s future: AI-driven traffic patterns, the advancement of Ambient Mesh (including multicluster adoption), and modernizing traffic entry with Gateway API. Speakers across the ecosystem shared practical lessons on scaling, migration, reliability, and operating increasingly complex workloads with Istio.

The co-chairs closed the day by recognizing the speakers, contributors, and a community continuing to push service-mesh innovation forward. Recordings of all sessions are available at the CNCF YouTube channel.

Istio at KubeCon + CloudNativeCon

Outside of Istio Day, the project was highly visible across KubeCon + CloudNativeCon Atlanta, with maintainers, end users, and contributors sharing technical deep dives, production stories, and cutting-edge research. Istio appeared not only across expo booths and breakout sessions, but also throughout several of the keynotes, where companies showcased how Istio plays a critical role in powering their platforms at scale.

The week hit its stride when the Istio community reconvened for the Istio Project Update, where project leads shared the latest releases and roadmap advances. In Istio: Set Sailing With Istio Without Sidecars, attendees explored how the sidecar-less Ambient Mesh architecture is rapidly moving from experiment to adoption, opening new possibilities for simpler deployments and leaner data planes.

The session Lessons Applied Building a Next-Generation AI Proxy took the crowd behind the scenes of how mesh technologies adapt to AI-driven traffic patterns. Over at Automated Rightsizing for Istio DaemonSet Workloads (Poster Session), practitioners gathered to compare strategies for optimizing control-plane resources, tuning for high scale, and reducing cost without sacrificing performance.

The narrative of traffic-management evolution featured prominently in Gateway API: Table Stakes and its faster sibling Know Before You Go! Speedrun Intro to Gateway API. Meanwhile, Return of the Mesh: Gateway API’s Epic Quest for Unity scaled that conversation: how traffic, API, mesh, and routing converge into one architecture that simplifies complexity rather than multiplies it.

For long-term reflection, 5 Key Lessons From 8 Years of Building Kgateway delivered hard-earned wisdom from years of system design. In GAMMA in Action: How Careem Migrated To Istio Without Downtime, the real-world migration story—a major production rollout that stayed up during transition—provided a roadmap for teams seeking safe mesh adoption at scale.

Safety and rollout risks took center stage in Taming Rollout Risks in Distributed Web Apps: A Location-Aware Gradual Deployment Approach, where strategies for regional rollouts, steering traffic, and minimizing user impact were laid out.

Finally, operations and day-two reality were tackled in End-to-End Security With gRPC in Kubernetes and On-Call the Easy Way With Agents, reminding everyone that mesh isn’t just about architecture, but about how teams run software safely, reliably, and confidently.

Community spaces: ContribFest, Maintainer Track and the Project Pavilion

At the Project Pavilion, the Istio kiosk was constantly buzzing, drawing users with questions about Ambient Mesh, AI workloads, and deployment best practices.

The Maintainer Track brought contributors together to collaborate on roadmap topics, triage issues, and discuss key areas of investment for the next year.

At ContribFest, new contributors joined maintainers to work through good-first issues, discuss contribution pathways, and get their first PRs lined up.

Istio maintainers recognized at the CNCF Community Awards

This year’s CNCF Community Awards were a proud moment for the project. Two Istio maintainers received well-deserved recognition:

Daniel Hawton — “Chop Wood, Carry Water” Award

John Howard — Top Committer Award

Beyond these awards, Istio was also represented prominently in conference leadership. Faseela K, one of the KubeCon + CloudNativeCon NA co-chairs and an Istio maintainer, participated in a keynote panel on Cloud Native for Good. During closing remarks, it was also announced that Lin Sun, another long-time Istio maintainer, will serve as an upcoming KubeCon + CloudNativeCon co-chair.

What we heard in Atlanta

Across sessions, kiosks, and hallways, a few themes emerged:

Ambient Mesh is moving quickly from exploration to real-world adoption.
AI workloads are reshaping traffic patterns and operational practices.
Multicluster deployments are becoming standard, with stronger focus on identity and failover.
Gateway API is solidifying as the future of modern traffic management.
Contributor growth is accelerating, supported by ContribFest and hands-on community guidance.

Looking ahead

KubeCon + CloudNativeCon NA 2025 showcased a vibrant, rapidly growing community taking on some of the toughest challenges in cloud infrastructure—from AI traffic management to zero-downtime migrations, from planet-scale control planes to the next generation of sidecar-less mesh. As we look ahead to 2026, the momentum from Atlanta makes one thing clear: the future of service mesh is bright, and the Istio community is leading it together.

See you in Amsterdam!

Categories: CNCF Projects

CoreDNS-1.13.2 Release

CoreDNS Blog - Sun, 12/07/2025 - 19:00

This release adds initial support for DoH3 and includes several core performance and stability fixes, including reduced allocations, a resolved data race in uniq, and safer QUIC listener initialization. Plugin updates improve forwarder reliability, extend GeoIP schema support, and fix issues in secondary, nomad, and kubernetes. Cache and file plugins also receive targeted performance tuning. Deprecations: The GeoIP plugin currently returns 0 for missing latitude/longitude, even though 0,0 is a real location.

Categories: CNCF Projects

Linkerd Edge Release Roundup: December 2025

Linkerd Blog - Sun, 12/07/2025 - 19:00

Welcome to the excessively large December 2025 Edge Release Roundup post, where we dive into the most recent edge releases to help keep everyone up to date on the latest and greatest! This post covers edge releases from September through November 2025 (the runup to KubeCon was hectic around here).

How to give feedback

Edge releases are a snapshot of our current development work on main; by definition, they always have the most recent features but they may have incomplete features, features that end up getting rolled back later, or (like all software) even bugs. That said, edge releases are intended for production use, and go through a rigorous set of automated and manual tests before being released. Once released, we also document whether the release is recommended for broad use – and when needed, we go back and update the recommendations.

Categories: CNCF Projects

Visualizing Target Relabeling Rules in Prometheus 3.8.0

Prometheus Blog - Mon, 12/01/2025 - 19:00

Prometheus' target relabeling feature allows you to adjust the labels of a discovered target or even drop the target entirely. Relabeling rules, while powerful, can be hard to understand and debug. Your rules have to match the expected labels that your service discovery mechanism returns, and getting any step wrong could label your target incorrectly or accidentally drop it.

To help you figure out where things go wrong (or right), Prometheus 3.8.0 just added a relabeling visualizer to the Prometheus server's web UI that allows you to inspect how each relabeling rule is applied to a discovered target's labels. Let's take a look at how it works!

Using the relabeling visualizer

If you head to any Prometheus server's "Service discovery" page (for example: https://demo.promlabs.com/service-discovery), you will now see a new "show relabeling" button for each discovered target:

Clicking this button shows you how each relabeling rule is applied to that particular target in sequence:

The visualizer shows you:

The initial labels of the target as discovered by the service discovery mechanism.
The details of each relabeling rule, including its action type and other parameters.
How the labels change after each relabeling rule is applied, with changes, additions, and deletions highlighted in color.
Whether the target is ultimately kept or dropped after all relabeling rules have been applied.
The final output labels of the target if it is kept.

To debug your relabeling rules, you can now read this diagram from top to bottom and find the exact step where the labels change in an unexpected way or where the target gets dropped. This should help you identify misconfigurations in your relabeling rules more easily.
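The step-by-step view can be approximated offline with a small Python model of how rules are applied in sequence. This is a simplified sketch that covers only the keep, drop, and replace actions; it is not Prometheus's code. The `apply_rules` helper, the `__meta_env` label, and the rule values are hypothetical, and replacements use Python's `\g<1>` group syntax where Prometheus configuration uses `$1`.

```python
import re


def apply_rules(labels, rules):
    """Apply rules in order; return (final labels or None if dropped, trace)."""
    current = dict(labels)
    trace = [("initial", dict(current))]
    for rule in rules:
        # Source label values are joined with ";" and matched against the whole regex.
        value = ";".join(current.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule["regex"], value)
        action = rule["action"]
        if (action == "keep" and match is None) or (action == "drop" and match is not None):
            trace.append((action, None))  # the target is dropped at this step
            return None, trace
        if action == "replace" and match is not None:
            current[rule["target_label"]] = match.expand(rule["replacement"])
        trace.append((action, dict(current)))
    return current, trace


discovered = {"__address__": "10.0.0.5:9100", "__meta_env": "prod"}
rules = [
    {"action": "keep", "source_labels": ["__meta_env"], "regex": "prod|staging"},
    {"action": "replace", "source_labels": ["__address__"], "regex": r"([^:]+):\d+",
     "target_label": "host", "replacement": r"\g<1>"},
]

final, trace = apply_rules(discovered, rules)
for action, snapshot in trace:
    print(action, snapshot)
```

Reading the printed trace from top to bottom mirrors the visualizer: each line is the label set after one rule, and a `None` snapshot marks the step where the target was dropped. The model leaves out details such as the removal of `__`-prefixed labels after relabeling.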

Conclusion

The new relabeling visualizer in the Prometheus server's web UI is a powerful tool to help you understand and debug your target relabeling configurations. By providing a step-by-step view of how each relabeling rule affects a target's labels, it makes it easier to identify and fix issues in your setup. Update your Prometheus servers to 3.8.0 now to give it a try!

Categories: CNCF Projects

Announcing Kyverno release 1.16

CNCF Blog Projects Category - Wed, 11/26/2025 - 15:44

Kyverno 1.16 delivers major advancements in policy as code for Kubernetes, centered on a new generation of CEL-based policies now available in beta with a clear path to GA. This release introduces partial support for namespaced CEL policies to confine enforcement and minimize RBAC, aligning with least-privilege best practices. Observability is significantly enhanced with full metrics for CEL policies and native event generation, enabling precise visibility and faster troubleshooting. Security and governance get sharper controls through fine-grained policy exceptions tailored for CEL policies, and validation use cases broaden with the integration of an HTTP authorizer into ValidatingPolicy. Finally, we’re debuting the Kyverno SDK, laying the foundation for ecosystem integrations and custom tooling.

CEL policy types

CEL policies in beta

CEL policy types are introduced as v1beta1. The promotion plan provides a clear, non‑breaking path: v1 will be made available in 1.17 with GA targeted for 1.18. This release includes the cluster‑scoped family (Validating, Mutating, Generating, Deleting, ImageValidating) at v1beta1 and adds namespaced variants for validation, deleting, and image validation; namespaced Generating and Mutating will follow in 1.17. PolicyException and GlobalContextEntry will advance in step to keep versions aligned; see the promotion roadmap in this tracking issue.

Namespaced policies

Kyverno 1.16 introduces namespaced CEL policy types— NamespacedValidatingPolicy, NamespacedDeletingPolicy, and NamespacedImageValidatingPolicy—which mirror their cluster-scoped counterparts but apply only within the policy’s namespace. This lets teams enforce guardrails with least-privilege RBAC and without central changes, improving multi-tenancy and safety during rollout. Choose namespaced types for team-owned namespaces and cluster-scoped types for global controls.

Observability upgrades

CEL policies now have comprehensive, native observability for faster diagnosis:

Validating policy execution latency. Metrics: kyverno_validating_policy_execution_duration_seconds_count, …_sum, …_bucket
- What it measures: Time spent evaluating validating policies per admission/background execution as a Prometheus histogram.
- Key labels: policy_name, policy_background_mode, policy_validation_mode (enforce/audit), resource_kind, resource_namespace, resource_request_operation (create/update/delete), execution_cause (admission_request/background_scan), result (PASS/FAIL).
Mutating policy execution latency. Metrics: kyverno_mutating_policy_execution_duration_seconds_count, …_sum, …_bucket
- What it measures: Time spent executing mutating policies (admission/background) as a Prometheus histogram.
- Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
Generating policy execution latency. Metrics: kyverno_generating_policy_execution_duration_seconds_count, …_sum, …_bucket
- What it measures: Time spent executing generating policies when evaluating requests or during background scans.
- Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
Image-validating policy execution latency. Metrics: kyverno_image_validating_policy_execution_duration_seconds_count, …_sum, …_bucket
- What it measures: Time spent evaluating image-related validating policies (e.g., image verification) as a Prometheus histogram.
- Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
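Because these metrics are Prometheus histograms, latency percentiles can be estimated from the cumulative `_bucket` counts. The Python sketch below illustrates the linear-interpolation idea behind PromQL's histogram_quantile in simplified form. The `estimate_quantile` helper and the bucket boundaries and counts are hypothetical, not values Kyverno is known to emit.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from sorted (upper_bound, cumulative_count) pairs.

    The last pair is expected to be the +Inf bucket holding the total count.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf") or count == lower_count:
                # Nothing to interpolate within; fall back to the lower bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for one policy, in seconds.
latency_buckets = [
    (0.005, 10),
    (0.01, 40),
    (0.025, 90),
    (0.05, 100),
    (float("inf"), 100),
]

p50 = estimate_quantile(0.5, latency_buckets)
p90 = estimate_quantile(0.9, latency_buckets)
print(f"p50={p50:.4f}s p90={p90:.4f}s")
```

The estimate is only as fine-grained as the bucket layout, so a percentile that lands in a wide bucket carries correspondingly wide uncertainty.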

CEL policies now emit Kubernetes Events for passes, violations, errors, and compile/load issues with rich context (policy/rule, resource, user, mode). This provides instant, kubectl-visible feedback and easier correlation with admission decisions and metrics during rollout and troubleshooting.

Fine-grained policy exceptions

Image-based exceptions

This exception uses the images attribute to let Pods in the ci namespace run images that match the provided patterns, while keeping the no-latest rule enforced for all other images. It narrows the bypass to specific namespaces and teams for auditability.

apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
 name: allow-ci-latest-images
 namespace: ci
spec:
 policyRefs:
   - name: restrict-image-tag
     kind: ValidatingPolicy
 images:
   - "ghcr.io/kyverno/*:latest"
 matchConditions:
   - expression: "has(object.metadata.labels.team) && object.metadata.labels.team == 'platform'"

The following ValidatingPolicy references exceptions.allowedImages, which skips validation checks for allow-listed images:

apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
 name: restrict-image-tag
spec:
 matchConstraints:
   resourceRules:
   # Pods belong to the core API group, which is written as an empty string.
   - apiGroups:   [""]
     apiVersions: [v1]
     operations:  [CREATE, UPDATE]
     resources:   [pods]
 validations:
   # A container passes if its image is covered by an exception,
   # or if it explicitly disables privilege escalation.
   - message: "Containers must not allow privilege escalation unless they are in the allowed images list."
     expression: >
       object.spec.containers.all(container,
         string(container.image) in exceptions.allowedImages ||
         (
           has(container.securityContext) &&
           has(container.securityContext.allowPrivilegeEscalation) &&
           container.securityContext.allowPrivilegeEscalation == false
         )
       )

Value-based exceptions

This exception supplies a list of values, via allowedValues, that a CEL validation can consult for a constrained set of targets, so teams can proceed without weakening the entire policy.

apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
 name: allow-debug-annotation
 namespace: dev
spec:
 policyRefs:
   - name: check-security-context
     kind: ValidatingPolicy
 allowedValues:
   - "debug-mode-temporary"
 matchConditions:
   - expression: "object.metadata.name.startsWith('experiments-')"

Here is a policy that uses the allowed values above. It denies resources unless every added capability is either in the policy's built-in allowed list or present in exceptions.allowedValues.

apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
  name: check-security-context
spec:
  matchConstraints:
    resourceRules:
      - apiGroups:   [apps]
        apiVersions: [v1]
        operations:  [CREATE, UPDATE]
        resources:   [deployments]
  variables:
    - name: allowedCapabilities
      expression: "['AUDIT_WRITE','CHOWN','DAC_OVERRIDE','FOWNER','FSETID','KILL','MKNOD','NET_BIND_SERVICE','SETFCAP','SETGID','SETPCAP','SETUID','SYS_CHROOT']"
  validations:
    # For a Deployment, containers live under the Pod template
    - expression: >-
        object.spec.template.spec.containers.all(container,
          container.?securityContext.?capabilities.?add.orValue([]).all(capability,
            capability in exceptions.allowedValues ||
            capability in variables.allowedCapabilities))
      message: >-
        Any capabilities added beyond the allowed list (AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER,
        FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT)
        are disallowed.

Configurable reporting status

This exception sets reportResult: pass, so when it matches, Policy Reports show “pass” rather than the default “skip”, improving dashboards and SLO signals during planned waivers.

apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
 name: exclude-skipped-deployment-2
 labels:
   polex.kyverno.io/priority: "0.2"
spec:
 policyRefs:
 - name: "with-multiple-exceptions"
   kind: ValidatingPolicy
 matchConditions:
   - name: "check-name"
     expression: "object.metadata.name == 'skipped-deployment'"
 reportResult: pass

Kyverno Authz Server

Beyond enriching admission-time validation, Kyverno now extends policy decisions to your service edge. The Kyverno Authz Server applies Kyverno policies to authorize requests for Envoy (via the External Authorization filter) and for plain HTTP services as a standalone HTTP authorization server, returning allow/deny decisions based on the same policy engine you use in Kubernetes. This unifies policy enforcement across admission, gateways, and services, enabling consistent guardrails and faster adoption without duplicating logic. See the project page for details: kyverno/kyverno-authz.

Introducing the Kyverno SDK

Alongside embedding CEL policy evaluation in controllers, CLIs, and CI, there’s now a companion SDK for service-edge authorization. The SDK lets you load Kyverno policies, compile them, and evaluate incoming requests to produce allow/deny decisions with structured results—powering Envoy External Authorization and plain HTTP services without duplicating policy logic. It’s designed for gateways, sidecars, and app middleware with simple Go APIs, optional metrics/hooks, and a path to unify admission-time and runtime enforcement. Note that kyverno-authz is still in active development; start with non-critical paths and add strict timeouts as you evaluate. See the SDK package for details: kyverno SDK.

Other features and enhancements

Label-based reporting configuration

Kyverno now supports label-based report suppression. Add the label reports.kyverno.io/disabled (any value, e.g., “true”) to any policy (ClusterPolicy, CEL policy types, ValidatingAdmissionPolicy, or MutatingAdmissionPolicy) to prevent all reporting (both ephemeral reports and PolicyReports) for that policy. This lets teams silence noisy or staging policies without changing enforcement; remove the label to resume reporting.
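
As a rough illustration of the label described above, the metadata of a policy with reporting suppressed might look like the fragment below. The policy name is hypothetical; only the label key comes from the text, and the spec is omitted because the label does not change enforcement.

```yaml
# Sketch only: suppress all reporting for one policy via a label.
apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
  name: noisy-staging-policy              # hypothetical policy name
  labels:
    reports.kyverno.io/disabled: "true"   # any value works, per the text above
# spec omitted: matching and validations stay exactly as they were
```

Removing the label again resumes reporting for that policy.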

Use Kyverno CEL libraries in policy matchConditions

Kyverno 1.16 enables Kyverno CEL libraries in policy matchConditions, not just in rule bodies, so you can target when rules run using richer, context-aware checks. These expressions are evaluated by Kyverno but are not used to build admission webhook matchConditions—webhook routing remains unchanged.

Getting started and backward compatibility

Upgrading to Kyverno 1.16

To upgrade to Kyverno 1.16, you can use Helm:

helm repo update
helm upgrade --install kyverno kyverno/kyverno -n kyverno --version 3.6.0

Backward compatibility

Kyverno 1.16 remains fully backward compatible with existing ClusterPolicy resources. You can continue running current policies and adopt the new policy types incrementally; once CEL policy types reach GA, the legacy ClusterPolicy API will enter a formal deprecation process following our standard, non‑breaking schedule.

Roadmap

We’re building on 1.16 with a clear, low‑friction path forward. In 1.17, CEL policy types will be available as v1, and migration tooling and docs will focus on making upgrades routine. We will continue to expand CEL libraries, samples, and performance optimizations, and to mature the SDK and kyverno‑authz so that admission‑time and runtime enforcement paths are unified. See the release board for the in‑flight work and timelines: Release 1.17.0 Project Board.

Conclusion

Kyverno 1.16 marks a pivotal step toward a unified, CEL‑powered policy platform: you can adopt the new policy types in beta today, move enforcement closer to teams with namespaced policies, and gain sharper visibility with native Events and detailed latency metrics. Fine‑grained exceptions make rollouts safer without weakening guardrails, while label‑based report suppression and CEL in matchConditions reduce noise and let you target policy execution precisely.

Looking ahead, the path to v1 and GA is clear, and the ecosystem is expanding with the Kyverno Authz Server and SDK to bring the same policy engine to gateways and services. Upgrade when ready, start with audits where useful, and tell us what you build—your feedback will shape the final polish and the journey to GA.

Categories: CNCF Projects

Kubernetes v1.35 Sneak Peek

Kubernetes Blog - Tue, 11/25/2025 - 19:00

As the release of Kubernetes v1.35 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the project's overall health. This blog post outlines planned changes for the v1.35 release that the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes cluster(s), and to keep you up to date with the latest developments. The information below is based on the current status of the v1.35 release and is subject to change before the final release date.

Deprecations and removals for Kubernetes v1.35

cgroup v1 support

On Linux nodes, container runtimes typically rely on cgroups (short for "control groups"). Support for using cgroup v2 has been stable in Kubernetes since v1.25, providing an alternative to the original v1 cgroup support. While cgroup v1 provided the initial resource control mechanism, it suffered from well-known inconsistencies and limitations. Adding support for cgroup v2 allowed use of a unified control group hierarchy, improved resource isolation, and served as the foundation for modern features, making legacy cgroup v1 support ready for removal. The removal of cgroup v1 support will only impact cluster administrators running nodes on older Linux distributions that do not support cgroup v2; on those nodes, the kubelet will fail to start. Administrators must migrate their nodes to systems with cgroup v2 enabled. More details on compatibility requirements will be available in a blog post soon after the v1.35 release.

To learn more, read about cgroup v2; you can also track the switchover work via KEP-5573: Remove cgroup v1 support.

Deprecation of ipvs mode in kube-proxy

Many releases ago, the Kubernetes project implemented an ipvs mode in kube-proxy. It was adopted as a way to provide high-performance service load balancing, with better performance than the existing iptables mode. However, maintaining feature parity between ipvs and other kube-proxy modes became difficult, due to technical complexity and diverging requirements. This created significant technical debt and made the ipvs backend impractical to support alongside newer networking capabilities.

The Kubernetes project intends to deprecate kube-proxy ipvs mode in the v1.35 release, to streamline the kube-proxy codebase. For Linux nodes, the recommended kube-proxy mode is already nftables.

You can find more in KEP-5495: Deprecate ipvs mode in kube-proxy

Kubernetes is deprecating containerd v1.y support

While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases of containerd, as a consequence of automated cgroup driver detection, the Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.X. Kubernetes v1.35 is the last release to offer this support (aligned with containerd 1.7 EOL).

This is a final warning: if you are using containerd 1.X, you must switch to 2.0 or later before upgrading Kubernetes to the next version. You can monitor the kubelet_cri_losing_support metric to determine whether any nodes in your cluster are using a containerd version that will soon be unsupported.

You can find more in the official blog post or in KEP-4033: Discover cgroup driver from CRI

Featured enhancements of Kubernetes v1.35

The following enhancements are some of those likely to be included in the v1.35 release. This is not a commitment, and the release content is subject to change.

Node declared features

When scheduling Pods, Kubernetes uses node labels, taints, and tolerations to match workload requirements with node capabilities. However, managing feature compatibility becomes challenging during cluster upgrades due to version skew between the control plane and nodes. This can lead to Pods being scheduled on nodes that lack required features, resulting in runtime failures.

The node declared features framework will introduce a standard mechanism for nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it can support, publishing this information to the control plane through a new .status.declaredFeatures field. Then, the kube-scheduler, admission controllers and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints, ensuring that Pods run only on compatible nodes.

This approach reduces manual node labeling, improves scheduling accuracy, and prevents incompatible pod placements proactively. It also integrates with the Cluster Autoscaler for informed scale-up decisions. Feature declarations are temporary and tied to Kubernetes feature gates, enabling safe rollout and cleanup.

Targeting alpha in v1.35, node declared features aims to solve version skew scheduling issues by making node capabilities explicit, enhancing reliability and cluster stability in heterogeneous version environments.

To learn more about this before the official documentation is published, you can read KEP-5328.

In-place update of Pod resources

Kubernetes is graduating in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust cpu and memory resources without restarting Pods or Containers. Before the feature existed, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications. Recent Kubernetes releases already allowed you to change resource settings (requests and limits) for existing Pods while the feature was in alpha and beta. In-place updates allow for smoother vertical scaling, improve efficiency, and can also simplify solution development.

The Container Runtime Interface (CRI) has also been improved, extending the UpdateContainerResources API for Windows and future runtimes while allowing ContainerStatus to report real-time resource configurations. Together, these changes make scaling in Kubernetes faster, more flexible, and disruption-free. The feature was introduced as alpha in v1.27, graduated to beta in v1.33, and is targeting graduation to stable in v1.35.
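
To make the idea concrete, here is a minimal sketch of a Pod that declares how each resource may be resized. The Pod name, image, and resource values are placeholders chosen for illustration; resizePolicy is the per-container field that states whether a change to a resource requires a container restart.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo                        # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.0           # placeholder image
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired       # CPU can change without restarting the container
        - resourceName: memory
          restartPolicy: RestartContainer  # a memory change restarts this container
      resources:
        requests:
          cpu: 500m
          memory: 128Mi
        limits:
          cpu: "1"
          memory: 256Mi
```

A resize is then requested by updating the container's resources through the Pod's resize subresource, rather than by recreating the Pod.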

You can find more in KEP-1287: In-place Update of Pod Resources

Pod certificates

When running microservices, Pods often require a strong cryptographic identity to authenticate with each other using mutual TLS (mTLS). While Kubernetes provides Service Account tokens, these are designed for authenticating to the API server, not for general-purpose workload identity.

Before this enhancement, operators had to rely on complex, external projects like SPIFFE/SPIRE or cert-manager to provision and rotate certificates for their workloads. But what if you could issue a unique, short-lived certificate to your Pods natively and automatically? KEP-4317 is designed to enable such native workload identity. It opens up various possibilities for securing pod-to-pod communication by allowing the kubelet to request and mount certificates for a Pod via a projected volume.

This provides a built-in mechanism for workload identity, complete with automated certificate rotation, significantly simplifying the setup of service meshes and other zero-trust network policies. This feature was introduced as alpha in v1.34 and is targeting beta in v1.35.
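
The following is a sketch of what requesting a certificate through a projected volume can look like. The signer name, Pod name, image, and paths are hypothetical, and the field names follow the alpha API, so they should be checked against KEP-4317 and the current documentation before use.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mtls-client                        # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.0           # placeholder image
      volumeMounts:
        - name: workload-cert
          mountPath: /var/run/workload-cert
          readOnly: true
  volumes:
    - name: workload-cert
      projected:
        sources:
          - podCertificate:
              signerName: example.com/my-signer          # hypothetical signer
              keyType: ED25519
              credentialBundlePath: credentialbundle.pem # key and certificate chain in one file
```

The application reads the bundle from the mounted path, and the kubelet takes care of refreshing it as the certificate is rotated.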

You can find more in KEP-4317: Pod Certificates

Numeric values for taints

Kubernetes is enhancing taints and tolerations by adding numeric comparison operators, such as Gt (Greater Than) and Lt (Less Than).

Previously, tolerations supported only exact (Equal) or existence (Exists) matches, which were not suitable for numeric properties such as reliability SLAs.

With this change, a Pod can use a toleration to "opt-in" to nodes that meet a specific numeric threshold. For example, a Pod can require a Node with an SLA taint value greater than 950 (operator: Gt, value: "950").

This approach is more powerful than Node Affinity because it supports the NoExecute effect, allowing Pods to be automatically evicted if a node's numeric value drops below the tolerated threshold.
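
A minimal sketch of such a toleration is shown below, using the operator and value from the example above. The taint key, Pod name, and image are hypothetical, and because the feature is new, the exact semantics should be confirmed against KEP-5471.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sla-sensitive                # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.0     # placeholder image
  tolerations:
    - key: example.com/sla           # hypothetical taint key
      operator: Gt                   # numeric comparison instead of Equal or Exists
      value: "950"                   # tolerate only taints whose value is greater than 950
      effect: NoExecute              # the Pod is evicted if the node stops meeting the threshold
```

Using NoExecute here is what distinguishes this from Node Affinity: the constraint keeps being enforced after the Pod has been scheduled.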

You can find more in KEP-5471: Enable SLA-based Scheduling

User namespaces

When running Pods, you can use securityContext to drop privileges, but containers inside the pod often still run as root (UID 0). This simplicity poses a significant challenge, as that container UID 0 maps directly to the host's root user.

Before this enhancement, a container breakout vulnerability could grant an attacker full root access to the node. But what if you could dynamically remap the container's root user to a safe, unprivileged user on the host? KEP-127 specifically allows such native support for Linux User Namespaces. It opens up various possibilities for pod security by isolating container and host user/group IDs. This allows a process to have root privileges (UID 0) within its namespace, while running as a non-privileged, high-numbered UID on the host.

Released as alpha in v1.25 and beta in v1.30, this feature continues to progress through beta maturity, paving the way for truly "rootless" containers that drastically reduce the attack surface for a whole class of security vulnerabilities.
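
As a small illustration, a Pod opts in to its own user namespace by setting hostUsers to false. The Pod name and image below are placeholders, and the node's container runtime and kernel need to support user namespaces for this to work.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo                  # hypothetical name
spec:
  hostUsers: false                   # do not share the host's user namespace
  containers:
    - name: app
      image: example.com/app:1.0     # placeholder image
      command: ["sleep", "infinity"]
```

Inside the container the process can still appear as UID 0, while on the host it maps to an unprivileged UID.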

You can find more in KEP-127: User Namespaces

Support for mounting OCI images as volumes

When provisioning a Pod, you often need to bundle data, binaries, or configuration files for your containers. Before this enhancement, people often included that kind of data directly into the main container image, or required a custom init container to download and unpack files into an emptyDir. You can still take either of those approaches, of course.

But what if you could populate a volume directly from a data-only artifact in an OCI registry, just like pulling a container image? Kubernetes v1.31 added support for the image volume type, allowing Pods to pull and unpack OCI container image artifacts into a volume declaratively.

This allows for seamless distribution of data, binaries, or ML models using standard registry tooling, completely decoupling data from the container image and eliminating the need for complex init containers or startup scripts. This volume type has been in beta since v1.33 and will likely be enabled by default in v1.35.
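
A minimal sketch of a Pod that mounts an OCI artifact as a volume follows. The artifact reference, Pod name, image, and mount path are hypothetical; the image volume source itself takes a reference and an optional pullPolicy.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-demo            # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.0     # placeholder application image
      volumeMounts:
        - name: model-data
          mountPath: /models         # artifact contents appear here
  volumes:
    - name: model-data
      image:
        reference: example.com/models/demo:v1   # hypothetical data-only OCI artifact
        pullPolicy: IfNotPresent
```

The application image and the data artifact can then be versioned and published independently.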

You can try out the beta version of image volumes, or you can learn more about the plans from KEP-4639: OCI Volume Source.

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.35 as part of the CHANGELOG for that release.

The Kubernetes v1.35 release is planned for December 17, 2025. Stay tuned for updates!

You can also see the announcements of changes in the release notes for:

Get involved

Follow us on Bluesky @kubernetes.io for the latest updates
Join the community discussion on Discuss
Join the community on Slack
Post questions (or answer questions) on Server Fault or Stack Overflow
Share your Kubernetes story
Read more about what’s happening with Kubernetes on the blog
Learn more about the Kubernetes Release Team

Categories: CNCF Projects, Kubernetes

Kubernetes Configuration Good Practices

Kubernetes Blog - Mon, 11/24/2025 - 19:00

Configuration is one of those things in Kubernetes that seems small until it's not. Configuration is at the heart of every Kubernetes workload. A missing quote, a wrong API version or a misplaced YAML indent can ruin your entire deploy.

This blog brings together tried-and-tested configuration best practices: the small habits that make your Kubernetes setup clean, consistent and easier to manage. Whether you are just starting out or already deploying apps daily, these are the little things that keep your cluster stable and your future self sane.

This blog is inspired by the original Configuration Best Practices page, which has evolved through contributions from many members of the Kubernetes community.

General configuration practices

Use the latest stable API version

Kubernetes evolves fast. Older APIs eventually get deprecated and stop working. So, whenever you are defining resources, make sure you are using the latest stable API version. You can always check with

kubectl api-resources

This simple step saves you from future compatibility issues.

Store configuration in version control

Never apply manifest files directly from your desktop. Always keep them in a version control system like Git; it's your safety net. If something breaks, you can instantly roll back to a previous commit, compare changes or recreate your cluster setup without panic.

Write configs in YAML not JSON

Write your configuration files using YAML rather than JSON. Both work technically, but YAML is just easier for humans. It's cleaner to read, less noisy, and widely used in the community.

YAML has some sneaky gotchas with boolean values: Use only true or false. Don't write yes, no, on or off. They might work in one version of YAML but break in another. To be safe, quote anything that looks like a Boolean (for example "yes").
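
For instance, a ConfigMap is a place where this matters, because its data values must be strings. The sketch below uses hypothetical names and keys; the point is only the quoting.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags        # hypothetical name
data:
  # Quoted, so every parser reads these as strings rather than booleans.
  ENABLE_CACHE: "true"
  LEGACY_MODE: "no"
  NOTIFY: "on"
```

Where a field really is a boolean, such as hostNetwork, write a bare true or false instead.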

Keep configuration simple and minimal

Avoid setting default values that are already handled by Kubernetes. Minimal manifests are easier to debug, cleaner to review and less likely to break things later.

If your Deployment, Service and ConfigMap all belong to one app, put them in a single manifest file.
It's easier to track changes and apply them as a unit. See the Guestbook all-in-one.yaml file for an example of this syntax.

You can even apply entire directories with:

kubectl apply -f configs/

One command and boom, everything in that folder gets deployed.

Add helpful annotations

Manifest files are not just for machines; they are for humans too. Use annotations to describe why something exists or what it does. A quick one-liner can save hours when debugging later and also allows better collaboration.

The most helpful annotation to set is kubernetes.io/description. It's like using a comment, except that it gets copied into the API so that everyone else can see it even after you deploy.
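
A short sketch of what that looks like in practice is below. The object name, description text, and data are made up for illustration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-settings    # hypothetical name
  annotations:
    # Stored with the object, so it is visible to anyone who inspects it later.
    kubernetes.io/description: "Timeouts for the checkout service; raised after a payment provider slowdown."
data:
  PAYMENT_TIMEOUT_SECONDS: "30"
```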

Managing Workloads: Pods, Deployments, and Jobs

A common early mistake in Kubernetes is creating Pods directly. Pods work, but they don't reschedule themselves if something goes wrong.

Naked Pods (Pods not managed by a controller, such as a Deployment or a StatefulSet) are fine for testing, but in real setups, they are risky.

Why? Because if the node hosting that Pod dies, the Pod dies with it and Kubernetes won't bring it back automatically.

Use Deployments for apps that should always be running

A Deployment, which both creates a ReplicaSet to ensure that the desired number of Pods is always available, and specifies a strategy to replace Pods (such as RollingUpdate), is almost always preferable to creating Pods directly. You can roll out a new version, and if something breaks, roll back instantly.
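
As an illustration, a small Deployment with an explicit RollingUpdate strategy could look like the sketch below. The name, labels, replica count, and image tag are example values, not recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                          # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # at most one Pod down during a rollout
      maxSurge: 1                    # at most one extra Pod during a rollout
  selector:
    matchLabels:
      app.kubernetes.io/name: web
  template:
    metadata:
      labels:
        app.kubernetes.io/name: web  # must match the selector above
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
```

If a new version misbehaves, kubectl rollout undo deployment/web returns to the previous revision.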

Use Jobs for tasks that should finish

A Job is perfect when you need something to run once and then stop, like a database migration or a batch processing task. It will retry if its Pods fail and report success when it's done.
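
A minimal sketch of such a Job follows. The name, image, and command are hypothetical, and the retry and cleanup values are only examples.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate                   # hypothetical name
spec:
  backoffLimit: 4                    # retry failed Pods up to four times
  ttlSecondsAfterFinished: 3600      # clean up the finished Job after an hour
  template:
    spec:
      restartPolicy: Never           # let the Job controller create replacement Pods
      containers:
        - name: migrate
          image: example.com/migrator:1.0            # placeholder image
          command: ["/bin/migrate", "--to-latest"]   # hypothetical command
```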

Service Configuration and Networking

Services are how your workloads talk to each other inside (and sometimes outside) your cluster. Without them, your pods exist but can't reach anyone. Let's make sure that doesn't happen.

Create Services before workloads that use them

When Kubernetes starts a Pod, it automatically injects environment variables for existing Services. So, if a Pod depends on a Service, create a Service before its corresponding backend workloads (Deployments or StatefulSets), and before any workloads that need to access it.

For example, if a Service named foo exists, all containers will get the following variables in their initial environment:

FOO_SERVICE_HOST=<the host the Service runs on>
FOO_SERVICE_PORT=<the port the Service runs on>

DNS based discovery doesn't have this problem, but it's a good habit to follow anyway.

Use DNS for Service discovery

If your cluster has the DNS add-on (most do), every Service automatically gets a DNS entry. That means you can access it by name instead of IP:

curl http://my-service.default.svc.cluster.local

It's one of those features that makes Kubernetes networking feel magical.

Avoid `hostPort` and `hostNetwork` unless absolutely necessary

You'll sometimes see these options in manifests:

hostPort: 8080
hostNetwork: true

But here's the thing: they tie your Pods to specific nodes, making them harder to schedule and scale, because each <hostIP, hostPort, protocol> combination must be unique. If you don't specify the hostIP and protocol explicitly, Kubernetes will use 0.0.0.0 as the default hostIP and TCP as the default protocol. Unless you're debugging or building something like a network plugin, avoid them.

If you just need local access for testing, try kubectl port-forward:

kubectl port-forward deployment/web 8080:80

See Use Port Forwarding to access applications in a cluster to learn more. Or if you really need external access, use a type: NodePort Service. That's the safer, Kubernetes-native way.
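
For reference, a NodePort Service might be sketched as follows. The Service name, selector label, and port numbers are assumptions made for the example.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                          # hypothetical name
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: web      # assumed label on the backing Pods
  ports:
    - port: 80                       # port exposed inside the cluster
      targetPort: 80                 # container port that receives the traffic
      nodePort: 30080                # optional; must fall in the NodePort range (30000-32767 by default)
```

Unlike hostPort, this does not pin any Pod to a particular node: the port is opened on every node and traffic is routed to wherever the Pods run.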

Use headless Services for internal discovery

Sometimes, you don't want Kubernetes to load balance traffic. You want to talk directly to each Pod. That's where headless Services come in.

You create one by setting clusterIP: None. Instead of a single IP, DNS gives you a list of all Pod IPs, perfect for apps that manage connections themselves.
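
A minimal sketch of a headless Service is shown below. The name, selector label, and port are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db                           # hypothetical name
spec:
  clusterIP: None                    # headless: no virtual IP, no load balancing
  selector:
    app.kubernetes.io/name: db       # assumed label on the backing Pods
  ports:
    - name: tcp
      port: 5432
```

A DNS lookup for this Service then returns the addresses of the individual ready Pods rather than a single Service IP.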

Working with labels effectively

Labels are key/value pairs that are attached to objects such as Pods. Labels help you organize, query and group your resources. They don't do anything by themselves, but they make everything else from Services to Deployments work together smoothly.

Use semantic labels

Good labels help you understand what's what, even months later. Define and use labels that identify semantic attributes of your application or Deployment. For example:

labels:
 app.kubernetes.io/name: myapp
 app.kubernetes.io/component: web
 tier: frontend
 phase: test

app.kubernetes.io/name : what the app is
tier : which layer it belongs to (frontend/backend)
phase : which stage it's in (test/prod)

You can then use these labels to make powerful selectors. For example:

kubectl get pods -l tier=frontend

This will list all frontend Pods across your cluster, no matter which Deployment they came from. Basically you are not manually listing Pod names; you are just describing what you want. See the guestbook app for examples of this approach.

Use common Kubernetes labels

Kubernetes actually recommends a set of common labels. It's a standardized way to name things across your different workloads or projects. Following this convention makes your manifests cleaner, and it means that tools such as Headlamp, dashboard, or third-party monitoring systems can all automatically understand what's running.

Manipulate labels for debugging

Since controllers (like ReplicaSets or Deployments) use labels to manage Pods, you can remove a label to “detach” a Pod temporarily.

Example:

kubectl label pod mypod app-

The app- part removes the label key app. Once that happens, the controller won’t manage that Pod anymore. It’s like isolating it for inspection, a “quarantine mode” for debugging. To interactively remove or add labels, use kubectl label.

You can then check logs, exec into it and once done, delete it manually. That’s a super underrated trick every Kubernetes engineer should know.

Handy kubectl tips

These small tips make life much easier when you are working with multiple manifest files or clusters.

Apply entire directories

Instead of applying one file at a time, apply the whole folder:

# Using server-side apply is also a good practice
kubectl apply -f configs/ --server-side

This command looks for .yaml, .yml and .json files in that folder and applies them all together. It's faster, cleaner and helps keep things grouped by app.

Use label selectors to get or delete resources

You don't always need to type out resource names one by one. Instead, use selectors to act on entire groups at once:

kubectl get pods -l app=myapp
kubectl delete pod -l phase=test

It's especially useful in CI/CD pipelines, where you want to clean up test resources dynamically.

Quickly create Deployments and Services

For quick experiments, you don't always need to write a manifest. You can spin up a Deployment right from the CLI:

kubectl create deployment webapp --image=nginx

Then expose it as a Service:

kubectl expose deployment webapp --port=80

This is great when you just want to test something before writing full manifests. Also, see Use a Service to Access an Application in a cluster for an example.

Conclusion

Cleaner configuration leads to calmer cluster administrators. If you stick to a few simple habits: keep configuration simple and minimal, version-control everything, use consistent labels, and avoid relying on naked Pods, you'll save yourself hours of debugging down the road.

The best part? Clean configurations stay readable. Even after months, you or anyone on your team can glance at them and know exactly what’s happening.

Categories: CNCF Projects, Kubernetes
