Kubernetes

Kubernetes v1.36: Server-Side Sharded List and Watch

Kubernetes Blog - Wed, 05/06/2026 - 14:35

As Kubernetes clusters grow to tens of thousands of nodes, controllers that watch high-cardinality resources like Pods face a scaling wall. Every replica of a horizontally scaled controller receives the full stream of events from the API server, paying the CPU, memory, and network cost to deserialize everything, only to discard the objects it is not responsible for. Scaling out the controller does not reduce per-replica cost; it multiplies it.

Kubernetes v1.36 introduces server-side sharded list and watch as an alpha feature (KEP-5866). With this feature enabled, the API server filters events at the source so that each controller replica receives only the slice of the resource collection it owns.

The problem with client-side sharding

Some controllers, such as kube-state-metrics, already support horizontal sharding. Each replica is assigned a portion of the keyspace and discards objects that do not belong to it. While this works functionally, it does not reduce the volume of data flowing from the API server:

N replicas x full event stream: every replica deserializes and processes every event, then throws away what it does not need.
Network bandwidth scales with replicas, not with shard size.
CPU spent on deserialization is wasted for the discarded fraction.

Server-side sharded list and watch solves this by moving the filtering upstream into the API server. Each replica tells the API server which hash range it owns, and the API server only sends matching events.

How it works

The feature adds a shardSelector field to ListOptions. Clients specify a hash range using the shardRange() function:

shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')

The API server computes a deterministic 64-bit FNV-1a hash of the specified field and returns only objects whose hash falls within the range [start, end). This applies to both list responses and watch event streams. The hash function produces the same result across all API server instances, so the feature is safe to use with multiple API server replicas.

Currently supported field paths are object.metadata.uid and object.metadata.namespace.

Using sharded watches in controllers

Controllers typically use informers to list and watch resources. To shard the workload, each replica injects the shardSelector into the ListOptions used by its informers via WithTweakListOptions:

import (
 metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 "k8s.io/client-go/informers"
)

shardSelector := "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"

factory := informers.NewSharedInformerFactoryWithOptions(client, resyncPeriod,
 informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
 opts.ShardSelector = shardSelector
 }),
)

For a 2-replica deployment, the selectors split the hash space in half:

// Replica 0: lower half of the hash space
"shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"

// Replica 1: upper half of the hash space
"shardRange(object.metadata.uid, '0x8000000000000000', '0x10000000000000000')"

A single replica can also cover non-contiguous ranges using ||:

"shardRange(object.metadata.uid, '0x0000000000000000', '0x4000000000000000') || " +
 "shardRange(object.metadata.uid, '0x8000000000000000', '0xc000000000000000')"

Verifying server support

When the API server honors a shard selector, the list response includes a shardInfo field in the response metadata that echoes back the applied selector:

{
 "kind": "PodList",
 "apiVersion": "v1",
 "metadata": {
 "resourceVersion": "10245",
 "shardInfo": {
 "selector": "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"
 }
 },
 "items": [...]
}

If shardInfo is absent, the server did not honor the shard selector and the client received the complete, unfiltered collection. In this case, the client should be prepared to handle the full result set, for example by applying client-side filtering to discard objects outside its assigned shard range.

Getting involved

This feature is in alpha and requires enabling the ShardedListAndWatch feature gate on the API server. We are looking for feedback from controller authors and operators running large clusters.

If you have questions or feedback, join the #sig-api-machinery channel on Kubernetes Slack.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Declarative Validation Graduates to GA

Kubernetes Blog - Tue, 05/05/2026 - 14:35

In Kubernetes v1.36, Declarative Validation for Kubernetes native types has reached General Availability (GA).

For users, this means more reliable, predictable, and better-documented APIs. By moving to a declarative model, the project also unlocks the future ability to publish validation rules via OpenAPI and integrate with ecosystem tools like Kubebuilder. For contributors and ecosystem developers, this replaces thousands of lines of handwritten validation code with a unified, maintainable framework.

This post covers why this migration was necessary, how the declarative validation framework works, and what new capabilities come with this GA release.

The Motivation: Escaping the "Handwritten" Technical Debt

For years, the validation of Kubernetes native APIs relied almost entirely on handwritten Go code. If a field needed to be bounded by a minimum value, or if two fields needed to be mutually exclusive, developers had to write explicit Go functions to enforce those constraints.

As the Kubernetes API surface expanded, this approach led to several systemic issues:

Technical Debt: The project accumulated roughly 18,000 lines of boilerplate validation code. This code was difficult to maintain, error-prone, and required intense scrutiny during code reviews.
Inconsistency: Without a centralized framework, validation rules were sometimes applied inconsistently across different resources.
Opaque APIs: Handwritten validation logic was difficult to discover or analyze programmatically. This meant clients and tooling couldn't predictably know validation rules without consulting the source code or encountering errors at runtime.

The solution proposed by SIG API Machinery was Declarative Validation: using Interface Definition Language (IDL) tags (specifically +k8s: marker tags) directly within types.go files to define validation rules.

Enter `validation-gen`

At the core of the declarative validation feature is a new code generator called validation-gen. Just as Kubernetes uses generators for deep copies, conversions, and defaulting, validation-gen parses +k8s: tags and automatically generates the corresponding Go validation functions.

These generated functions are then registered seamlessly with the API scheme. The generator is designed as an extensible framework, allowing developers to plug in new "Validators" by describing the tags they parse and the Go logic they should produce.

A Comprehensive Suite of +k8s: Tags

The declarative validation framework introduces a comprehensive suite of marker tags that provide rich validation capabilities highly optimized for Go types. For a full list of supported tags, check out the official documentation. Here is a catalog of some of the most common tags you will now see in the Kubernetes codebase:

Presence: +k8s:optional, +k8s:required
Basic Constraints: +k8s:minimum=0, +k8s:maximum=100, +k8s:maxLength=16, +k8s:format=k8s-short-name
Collections: +k8s:listType=map, +k8s:listMapKey=type
Unions: +k8s:unionMember, +k8s:unionDiscriminator
Immutability: +k8s:immutable, +k8s:update=[NoSet, NoModify, NoClear]

Example Usage:

type ReplicationControllerSpec struct {
 // +k8s:optional
 // +k8s:minimum=0
 Replicas *int32 `json:"replicas,omitempty"`
}

By placing these tags directly above the field definitions, the constraints are self-documenting and immediately visible to anyone reading the type definitions.

Advanced Capabilities: "Ambient Ratcheting"

One of the most substantial outcomes of this work is that validation ratcheting is now a standard, ambient part of the API. In the past, if we needed to tighten validation, we had to first add handwritten ratcheting code, wait a release, and then tighten the validation to avoid breaking existing objects.

With declarative validation, this safety mechanism is built-in. If a user updates an existing object, the validation framework compares the incoming object with the oldObject. If a specific field's value is semantically equivalent to its prior state (i.e., the user didn't change it), the new validation rule is bypassed. This "ambient ratcheting" means we can loosen or tighten validation immediately and in the least disruptive way possible.

Scaling API Reviews with `kube-api-linter`

Reaching GA required absolute confidence in the generated code, but our vision extends beyond just validation. Declarative validation is a key part of a comprehensive approach to making API review easier, more consistent, and highly scalable.

By moving validation rules out of opaque Go functions and into structured markers, we are empowering tools like kube-api-linter. This linter can now statically analyze API types and enforce API conventions automatically, significantly reducing the manual burden on SIG API Machinery reviewers and providing immediate feedback to contributors.

What's next?

With the release of Kubernetes v1.36, Declarative Validation graduates to General Availability (GA). As a stable feature, the associated DeclarativeValidation feature gate is now enabled by default. It has become the primary mechanism for adding new validation rules to Kubernetes native types.

Looking forward, the project is committed to adopting declarative validation even more extensively. This includes migrating the remaining legacy handwritten validation code for established APIs and requiring its use for all new APIs and new fields. This ongoing transition will continue to shrink the codebase's complexity while enhancing the consistency and reliability of the entire Kubernetes API surface.

Beyond the core migration, declarative validation also unlocks an exciting future for the broader ecosystem. Because validation rules are now defined as structured markers rather than opaque Go code, they can be parsed and reflected in the OpenAPI schemas published by the Kubernetes API server. This paves the way for tools like kubectl, client libraries, and IDEs to perform rich client-side validation before a request is ever sent to the cluster. The same declarative framework can also be consumed by ecosystem tools like Kubebuilder, enabling a more consistent developer experience for authors of Custom Resource Definitions (CRDs).

Getting involved

The migration to declarative validation is an ongoing effort. While the framework itself is GA, there is still work to be done migrating older APIs to the new declarative format.

If you are interested in contributing to the core of Kubernetes API Machinery, this is a fantastic place to start. Check out the validation-gen documentation, look for issues tagged with sig/api-machinery, and join the conversation in the #sig-api-machinery and #sig-api-machinery-dev-tools channels on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/). You can also attend the SIG API Machinery meetings to get involved directly.

Acknowledgments

A huge thank you to everyone who helped bring this feature to GA:

And the many others across the Kubernetes community who contributed along the way.

Welcome to the declarative future of Kubernetes validation!

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Admission Policies That Can't Be Deleted

Kubernetes Blog - Mon, 05/04/2026 - 14:35

If you've ever tried to enforce a security policy across a fleet of Kubernetes clusters, you've probably run into a frustrating chicken-and-egg problem. Your admission policies are API objects, which means they don't exist until someone creates them, and they can be deleted by anyone with the right permissions. There's always a window during cluster bootstrap where your policies aren't active yet, and there's no way to prevent a privileged user from removing them.

Kubernetes v1.36 introduces an alpha feature that addresses this: manifest-based admission control. It lets you define admission webhooks and CEL-based policies as files on disk, loaded by the API server at startup, before it serves any requests.

The gap we're closing

Most Kubernetes policy enforcement today works through the API. You create a ValidatingAdmissionPolicy or a webhook configuration as an API object, and the admission controller picks it up. This works well in steady state, but it has some fundamental limitations.

During cluster bootstrap, there's a gap between when the API server starts serving requests and when your policies are created and active. If you're restoring from a backup or recovering from an etcd failure, that gap can be significant.

There's also a self-protection problem. Admission webhooks and policies can't intercept operations on their own configuration resources. Kubernetes skips invoking webhooks on types like ValidatingWebhookConfiguration to avoid circular dependencies. That means a sufficiently privileged user can delete your critical admission policies, and there's nothing in the admission chain to stop them.

We - Kubernetes SIG API Machinery - wanted a way to say "these policies are always on, full stop."

How it works

You add a staticManifestsDir field to the AdmissionConfiguration file that you already pass to the API server via --admission-control-config-file. Point it at a directory, drop your policy YAML files in there, and the API server loads them before it starts serving.

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionPolicy
 configuration:
 apiVersion: apiserver.config.k8s.io/v1
 kind: ValidatingAdmissionPolicyConfiguration
 staticManifestsDir: "/etc/kubernetes/admission/validating-policies/"

The manifest files are standard Kubernetes resource definitions. The only requirement is that all the objects that these manifests define must have names ending in .static.k8s.io. This reserved suffix prevents collisions with API-based configurations and makes it easy to tell where an admission decision came from when you're looking at metrics or audit logs.

Here's a complete example that denies privileged containers outside kube-system:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
 name: "deny-privileged.static.k8s.io"
 annotations:
 kubernetes.io/description: "Deny launching privileged pods, anywhere this policy is applied"
spec:
 failurePolicy: Fail
 matchConstraints:
 resourceRules:
 - apiGroups: [""]
 apiVersions: ["v1"]
 operations: ["CREATE", "UPDATE"]
 resources: ["pods"]
 variables:
 - name: allContainers
 expression: >-
 object.spec.containers +
 (has(object.spec.initContainers) ? object.spec.initContainers : []) +
 (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
 validations:
 - expression: >-
 !variables.allContainers.exists(c,
 has(c.securityContext) && has(c.securityContext.privileged) &&
 c.securityContext.privileged == true)
 message: "Privileged containers are not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
 name: "deny-privileged-binding.static.k8s.io"
 annotations:
 kubernetes.io/description: "Bind deny-privileged policy to all namespaces except kube-system"
spec:
 policyName: "deny-privileged.static.k8s.io"
 validationActions:
 - Deny
 matchResources:
 namespaceSelector:
 matchExpressions:
 - key: "kubernetes.io/metadata.name"
 operator: NotIn
 values: ["kube-system"]

Protecting what couldn't be protected before

The part we're most excited about is the ability to intercept operations on admission configuration resources themselves.

With API-based admission, webhooks and policies are never invoked on types like ValidatingAdmissionPolicy or ValidatingWebhookConfiguration. That restriction exists for good reason: if a webhook could reject changes to its own configuration, you could end up locked out with no way to fix it through the API.

Manifest-based policies don't have that problem. If a bad policy is blocking something it shouldn't, you fix the file on disk and the API server picks up the change. There's no circular dependency because the recovery path doesn't go through the API.

This means you can write a manifest-based policy that prevents deletion of your critical API-based admission policies. For platform teams managing shared clusters, this is a significant improvement. You can now guarantee that your baseline security policies can't be removed by a cluster admin, accidentally or otherwise.

Here's what that looks like in practice. This policy prevents any modification or deletion of admission resources that carry the platform.example.com/protected: "true" label:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
 name: "protect-policies.static.k8s.io"
 annotations:
 kubernetes.io/description: "Prevent modification or deletion of protected admission resources"
spec:
 failurePolicy: Fail
 matchConstraints:
 resourceRules:
 - apiGroups: ["admissionregistration.k8s.io"]
 apiVersions: ["*"]
 operations: ["DELETE", "UPDATE"]
 resources:
 - "validatingadmissionpolicies"
 - "validatingadmissionpolicybindings"
 - "validatingwebhookconfigurations"
 - "mutatingwebhookconfigurations"
 validations:
 - expression: >-
 !has(oldObject.metadata.labels) ||
 !('platform.example.com/protected' in oldObject.metadata.labels) ||
 oldObject.metadata.labels['platform.example.com/protected'] != 'true'
 message: "Protected admission resources cannot be modified or deleted"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
 name: "protect-policies-binding.static.k8s.io"
 annotations:
 kubernetes.io/description: "Bind protect-policies policy to all admission resources"
spec:
 policyName: "protect-policies.static.k8s.io"
 validationActions:
 - Deny

With this in place, any API-based admission policy or webhook configuration labeled platform.example.com/protected: "true" is shielded from tampering. The protection itself lives on disk and can't be removed through the API.

A few things to know

Manifest-based configurations are intentionally self-contained. They can't reference API resources, which means no paramKind for policies, no Service references for admission webhooks (instead they are URL-only), and bindings may only reference policies in the same manifest set. These restrictions exist because the configurations need to work without any cluster state, including at startup before etcd is available.

If you run multiple API server instances, each one loads its own manifest files independently. There's no cross-server synchronization built in. This is the same model as other file-based API server configurations like encryption at rest. When this feature is enabled, Kubernetes exposes a configuration hash as a label on relevant metrics, so you can detect drift.

Files are watched for changes at runtime, so you don't need to restart the API server to update policies. If you update a manifest file, the API server validates the new configuration and swaps it in atomically. If validation fails, it keeps the previous good configuration and logs the error. This means you can roll out policy changes across your fleet using standard configuration management tools (Ansible, Puppet, or even mounted ConfigMaps) without any API server downtime.

The initial load at startup is stricter: if any manifest is invalid, the API server won't start. This is intentional. At startup, failing fast is safer than running without your expected policies.

Try it out

To try this in Kubernetes v1.36:

Enable the ManifestBasedAdmissionControlConfig feature gate for each kube-apiserver.
Create a directory with your static manifest files. If you need to mount that in to the Pod where the API server runs, do that too. Read-only is fine.
Configure staticManifestsDir in your AdmissionConfiguration with the directory path.
Start the API server with --admission-control-config-file pointing to your AdmissionConfiguration file.

The full documentation is at Manifest-Based Admission Control, and you can follow KEP-5793 for ongoing progress.

We'd love to hear your feedback. Reach out on the #sig-api-machinery channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you're interested in contributing to this feature or other SIG API Machinery projects, join us on #sig-api-machinery on Kubernetes Slack. You're also welcome to attend the SIG API Machinery meetings, held every other Wednesday.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Admission Policies That Can't Be Deleted

Kubernetes Blog - Mon, 05/04/2026 - 14:35

If you've ever tried to enforce a security policy across a fleet of Kubernetes clusters, you've probably run into a frustrating chicken-and-egg problem. Your admission policies are API objects, which means they don't exist until someone creates them, and they can be deleted by anyone with the right permissions. There's always a window during cluster bootstrap where your policies aren't active yet, and there's no way to prevent a privileged user from removing them.

Kubernetes v1.36 introduces an alpha feature that addresses this: manifest-based admission control. It lets you define admission webhooks and CEL-based policies as files on disk, loaded by the API server at startup, before it serves any requests.

The gap we're closing

Most Kubernetes policy enforcement today works through the API. You create a ValidatingAdmissionPolicy or a webhook configuration as an API object, and the admission controller picks it up. This works well in steady state, but it has some fundamental limitations.

During cluster bootstrap, there's a gap between when the API server starts serving requests and when your policies are created and active. If you're restoring from a backup or recovering from an etcd failure, that gap can be significant.

There's also a self-protection problem. Admission webhooks and policies can't intercept operations on their own configuration resources. Kubernetes skips invoking webhooks on types like ValidatingWebhookConfiguration to avoid circular dependencies. That means a sufficiently privileged user can delete your critical admission policies, and there's nothing in the admission chain to stop them.

We - Kubernetes SIG API Machinery - wanted a way to say "these policies are always on, full stop."

How it works

You add a staticManifestsDir field to the AdmissionConfiguration file that you already pass to the API server via --admission-control-config-file. Point it at a directory, drop your policy YAML files in there, and the API server loads them before it starts serving.

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionPolicy
 configuration:
 apiVersion: apiserver.config.k8s.io/v1
 kind: ValidatingAdmissionPolicyConfiguration
 staticManifestsDir: "/etc/kubernetes/admission/validating-policies/"

The manifest files are standard Kubernetes resource definitions. The only requirement is that all the objects that these manifests define must have names ending in .static.k8s.io. This reserved suffix prevents collisions with API-based configurations and makes it easy to tell where an admission decision came from when you're looking at metrics or audit logs.

Here's a complete example that denies privileged containers outside kube-system:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
 name: "deny-privileged.static.k8s.io"
 annotations:
 kubernetes.io/description: "Deny launching privileged pods, anywhere this policy is applied"
spec:
 failurePolicy: Fail
 matchConstraints:
 resourceRules:
 - apiGroups: [""]
 apiVersions: ["v1"]
 operations: ["CREATE", "UPDATE"]
 resources: ["pods"]
 variables:
 - name: allContainers
 expression: >-
 object.spec.containers +
 (has(object.spec.initContainers) ? object.spec.initContainers : []) +
 (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
 validations:
 - expression: >-
 !variables.allContainers.exists(c,
 has(c.securityContext) && has(c.securityContext.privileged) &&
 c.securityContext.privileged == true)
 message: "Privileged containers are not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
 name: "deny-privileged-binding.static.k8s.io"
 annotations:
 kubernetes.io/description: "Bind deny-privileged policy to all namespaces except kube-system"
spec:
 policyName: "deny-privileged.static.k8s.io"
 validationActions:
 - Deny
 matchResources:
 namespaceSelector:
 matchExpressions:
 - key: "kubernetes.io/metadata.name"
 operator: NotIn
 values: ["kube-system"]

Protecting what couldn't be protected before

The part we're most excited about is the ability to intercept operations on admission configuration resources themselves.

With API-based admission, webhooks and policies are never invoked on types like ValidatingAdmissionPolicy or ValidatingWebhookConfiguration. That restriction exists for good reason: if a webhook could reject changes to its own configuration, you could end up locked out with no way to fix it through the API.

Manifest-based policies don't have that problem. If a bad policy is blocking something it shouldn't, you fix the file on disk and the API server picks up the change. There's no circular dependency because the recovery path doesn't go through the API.

This means you can write a manifest-based policy that prevents deletion of your critical API-based admission policies. For platform teams managing shared clusters, this is a significant improvement. You can now guarantee that your baseline security policies can't be removed by a cluster admin, accidentally or otherwise.

Here's what that looks like in practice. This policy prevents any modification or deletion of admission resources that carry the platform.example.com/protected: "true" label:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
 name: "protect-policies.static.k8s.io"
 annotations:
 kubernetes.io/description: "Prevent modification or deletion of protected admission resources"
spec:
 failurePolicy: Fail
 matchConstraints:
 resourceRules:
 - apiGroups: ["admissionregistration.k8s.io"]
 apiVersions: ["*"]
 operations: ["DELETE", "UPDATE"]
 resources:
 - "validatingadmissionpolicies"
 - "validatingadmissionpolicybindings"
 - "validatingwebhookconfigurations"
 - "mutatingwebhookconfigurations"
 validations:
 - expression: >-
 !has(oldObject.metadata.labels) ||
 !('platform.example.com/protected' in oldObject.metadata.labels) ||
 oldObject.metadata.labels['platform.example.com/protected'] != 'true'
 message: "Protected admission resources cannot be modified or deleted"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
 name: "protect-policies-binding.static.k8s.io"
 annotations:
 kubernetes.io/description: "Bind protect-policies policy to all admission resources"
spec:
 policyName: "protect-policies.static.k8s.io"
 validationActions:
 - Deny

With this in place, any API-based admission policy or webhook configuration labeled platform.example.com/protected: "true" is shielded from tampering. The protection itself lives on disk and can't be removed through the API.

A few things to know

Manifest-based configurations are intentionally self-contained. They can't reference API resources, which means no paramKind for policies, no Service references for admission webhooks (instead they are URL-only), and bindings may only reference policies in the same manifest set. These restrictions exist because the configurations need to work without any cluster state, including at startup before etcd is available.

If you run multiple API server instances, each one loads its own manifest files independently. There's no cross-server synchronization built in. This is the same model as other file-based API server configurations like encryption at rest. When this feature is enabled, Kubernetes exposes a configuration hash as a label on relevant metrics, so you can detect drift.

Files are watched for changes at runtime, so you don't need to restart the API server to update policies. If you update a manifest file, the API server validates the new configuration and swaps it in atomically. If validation fails, it keeps the previous good configuration and logs the error. This means you can roll out policy changes across your fleet using standard configuration management tools (Ansible, Puppet, or even mounted ConfigMaps) without any API server downtime.

The initial load at startup is stricter: if any manifest is invalid, the API server won't start. This is intentional. At startup, failing fast is safer than running without your expected policies.

Try it out

To try this in Kubernetes v1.36:

Enable the ManifestBasedAdmissionControlConfig feature gate for each kube-apiserver.
Create a directory with your static manifest files. If you need to mount that in to the Pod where the API server runs, do that too. Read-only is fine.
Configure staticManifestsDir in your AdmissionConfiguration with the directory path.
Start the API server with --admission-control-config-file pointing to your AdmissionConfiguration file.

The full documentation is at Manifest-Based Admission Control, and you can follow KEP-5793 for ongoing progress.

We'd love to hear your feedback. Reach out on the #sig-api-machinery channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you're interested in contributing to this feature or other SIG API Machinery projects, join us on #sig-api-machinery on Kubernetes Slack. You're also welcome to attend the SIG API Machinery meetings, held every other Wednesday.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Admission Policies That Can't Be Deleted

Kubernetes Blog - Mon, 05/04/2026 - 14:35

If you've ever tried to enforce a security policy across a fleet of Kubernetes clusters, you've probably run into a frustrating chicken-and-egg problem. Your admission policies are API objects, which means they don't exist until someone creates them, and they can be deleted by anyone with the right permissions. There's always a window during cluster bootstrap where your policies aren't active yet, and there's no way to prevent a privileged user from removing them.

Kubernetes v1.36 introduces an alpha feature that addresses this: manifest-based admission control. It lets you define admission webhooks and CEL-based policies as files on disk, loaded by the API server at startup, before it serves any requests.

The gap we're closing

Most Kubernetes policy enforcement today works through the API. You create a ValidatingAdmissionPolicy or a webhook configuration as an API object, and the admission controller picks it up. This works well in steady state, but it has some fundamental limitations.

During cluster bootstrap, there's a gap between when the API server starts serving requests and when your policies are created and active. If you're restoring from a backup or recovering from an etcd failure, that gap can be significant.

There's also a self-protection problem. Admission webhooks and policies can't intercept operations on their own configuration resources. Kubernetes skips invoking webhooks on types like ValidatingWebhookConfiguration to avoid circular dependencies. That means a sufficiently privileged user can delete your critical admission policies, and there's nothing in the admission chain to stop them.

We - Kubernetes SIG API Machinery - wanted a way to say "these policies are always on, full stop."

How it works

You add a staticManifestsDir field to the AdmissionConfiguration file that you already pass to the API server via --admission-control-config-file. Point it at a directory, drop your policy YAML files in there, and the API server loads them before it starts serving.

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionPolicy
 configuration:
 apiVersion: apiserver.config.k8s.io/v1
 kind: ValidatingAdmissionPolicyConfiguration
 staticManifestsDir: "/etc/kubernetes/admission/validating-policies/"

The manifest files are standard Kubernetes resource definitions. The only requirement is that all the objects that these manifests define must have names ending in .static.k8s.io. This reserved suffix prevents collisions with API-based configurations and makes it easy to tell where an admission decision came from when you're looking at metrics or audit logs.

Here's a complete example that denies privileged containers outside kube-system:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
 name: "deny-privileged.static.k8s.io"
 annotations:
 kubernetes.io/description: "Deny launching privileged pods, anywhere this policy is applied"
spec:
 failurePolicy: Fail
 matchConstraints:
 resourceRules:
 - apiGroups: [""]
 apiVersions: ["v1"]
 operations: ["CREATE", "UPDATE"]
 resources: ["pods"]
 variables:
 - name: allContainers
 expression: >-
 object.spec.containers +
 (has(object.spec.initContainers) ? object.spec.initContainers : []) +
 (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
 validations:
 - expression: >-
 !variables.allContainers.exists(c,
 has(c.securityContext) && has(c.securityContext.privileged) &&
 c.securityContext.privileged == true)
 message: "Privileged containers are not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
 name: "deny-privileged-binding.static.k8s.io"
 annotations:
 kubernetes.io/description: "Bind deny-privileged policy to all namespaces except kube-system"
spec:
 policyName: "deny-privileged.static.k8s.io"
 validationActions:
 - Deny
 matchResources:
 namespaceSelector:
 matchExpressions:
 - key: "kubernetes.io/metadata.name"
 operator: NotIn
 values: ["kube-system"]

Protecting what couldn't be protected before

The part we're most excited about is the ability to intercept operations on admission configuration resources themselves.

With API-based admission, webhooks and policies are never invoked on types like ValidatingAdmissionPolicy or ValidatingWebhookConfiguration. That restriction exists for good reason: if a webhook could reject changes to its own configuration, you could end up locked out with no way to fix it through the API.

Manifest-based policies don't have that problem. If a bad policy is blocking something it shouldn't, you fix the file on disk and the API server picks up the change. There's no circular dependency because the recovery path doesn't go through the API.

This means you can write a manifest-based policy that prevents deletion of your critical API-based admission policies. For platform teams managing shared clusters, this is a significant improvement. You can now guarantee that your baseline security policies can't be removed by a cluster admin, accidentally or otherwise.

Here's what that looks like in practice. This policy prevents any modification or deletion of admission resources that carry the platform.example.com/protected: "true" label:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
 name: "protect-policies.static.k8s.io"
 annotations:
 kubernetes.io/description: "Prevent modification or deletion of protected admission resources"
spec:
 failurePolicy: Fail
 matchConstraints:
 resourceRules:
 - apiGroups: ["admissionregistration.k8s.io"]
 apiVersions: ["*"]
 operations: ["DELETE", "UPDATE"]
 resources:
 - "validatingadmissionpolicies"
 - "validatingadmissionpolicybindings"
 - "validatingwebhookconfigurations"
 - "mutatingwebhookconfigurations"
 validations:
 - expression: >-
 !has(oldObject.metadata.labels) ||
 !('platform.example.com/protected' in oldObject.metadata.labels) ||
 oldObject.metadata.labels['platform.example.com/protected'] != 'true'
 message: "Protected admission resources cannot be modified or deleted"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
 name: "protect-policies-binding.static.k8s.io"
 annotations:
 kubernetes.io/description: "Bind protect-policies policy to all admission resources"
spec:
 policyName: "protect-policies.static.k8s.io"
 validationActions:
 - Deny

With this in place, any API-based admission policy or webhook configuration labeled platform.example.com/protected: "true" is shielded from tampering. The protection itself lives on disk and can't be removed through the API.

A few things to know

Manifest-based configurations are intentionally self-contained. They can't reference API resources, which means no paramKind for policies, no Service references for admission webhooks (instead they are URL-only), and bindings may only reference policies in the same manifest set. These restrictions exist because the configurations need to work without any cluster state, including at startup before etcd is available.

If you run multiple API server instances, each one loads its own manifest files independently. There's no cross-server synchronization built in. This is the same model as other file-based API server configurations like encryption at rest. When this feature is enabled, Kubernetes exposes a configuration hash as a label on relevant metrics, so you can detect drift.

Files are watched for changes at runtime, so you don't need to restart the API server to update policies. If you update a manifest file, the API server validates the new configuration and swaps it in atomically. If validation fails, it keeps the previous good configuration and logs the error. This means you can roll out policy changes across your fleet using standard configuration management tools (Ansible, Puppet, or even mounted ConfigMaps) without any API server downtime.

The initial load at startup is stricter: if any manifest is invalid, the API server won't start. This is intentional. At startup, failing fast is safer than running without your expected policies.

Try it out

To try this in Kubernetes v1.36:

Enable the ManifestBasedAdmissionControlConfig feature gate for each kube-apiserver.
Create a directory with your static manifest files. If you need to mount that in to the Pod where the API server runs, do that too. Read-only is fine.
Configure staticManifestsDir in your AdmissionConfiguration with the directory path.
Start the API server with --admission-control-config-file pointing to your AdmissionConfiguration file.

The full documentation is at Manifest-Based Admission Control, and you can follow KEP-5793 for ongoing progress.

We'd love to hear your feedback. Reach out on the #sig-api-machinery channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you're interested in contributing to this feature or other SIG API Machinery projects, join us on #sig-api-machinery on Kubernetes Slack. You're also welcome to attend the SIG API Machinery meetings, held every other Wednesday.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Pod-Level Resource Managers (Alpha)

Kubernetes Blog - Fri, 05/01/2026 - 14:35

Kubernetes v1.36 introduces Pod-Level Resource Managers as an alpha feature, bringing a more flexible and powerful resource management model to performance-sensitive workloads. This enhancement extends the kubelet's Topology, CPU, and Memory Managers to support pod-level resource specifications (.spec.resources), evolving them from a strictly per-container allocation model to a pod-centric one.

Why do we need pod-level resource managers?

When running performance-critical workloads such as machine learning (ML) training, high-frequency trading applications, or low-latency databases, you often need exclusive, NUMA-aligned resources for your primary application containers to ensure predictable performance.

However, modern Kubernetes pods rarely consist of just one container. They frequently include sidecar containers for logging, monitoring, service meshes, or data ingestion.

Before this feature, this created a trade-off, to get NUMA-aligned, exclusive resources for your main application, you had to allocate exclusive, integer-based CPU resources to every container in the pod. This might be wasteful for lightweight sidecars. If you didn't do this, you forfeited the pod's Guaranteed Quality of Service (QoS) class entirely, losing the performance benefits.

Introducing pod-level resource managers

Enabling pod-level resources support for the resource managers (via the PodLevelResourceManagers and PodLevelResources feature gates) allows the kubelet to create hybrid resource allocation models. This brings flexibility and efficiency to high-performance workloads without sacrificing NUMA alignment.

Real-world use cases

Here are a few practical scenarios demonstrating how this feature can be applied, depending on the configured Topology Manager scope:

1. Tightly-coupled database (Topology manager's pod scope)

Consider a latency-sensitive database pod that includes a main database container, a local metrics exporter, and a backup agent sidecar.

When configured with the pod Topology Manager scope, the kubelet performs a single NUMA alignment based on the entire pod's budget. The database container gets its exclusive CPU and memory slices from that NUMA node. The remaining resources from the pod's budget form a new pod shared pool. The metrics exporter and backup agent run in this pod shared pool. They share resources with each other, but they are strictly isolated from the database's exclusive slices and the rest of the node.

This allows you to safely co-locate auxiliary containers on the same NUMA node as your primary workload without wasting dedicated cores on them.

apiVersion: v1
kind: Pod
metadata:
 name: tightly-coupled-database
spec:
 # Pod-level resources establish the overall budget and NUMA alignment size.
 resources:
 requests:
 cpu: "8"
 memory: "16Gi"
 limits:
 cpu: "8"
 memory: "16Gi"
 initContainers:
 - name: metrics-exporter
 image: metrics-exporter:v1
 restartPolicy: Always
 - name: backup-agent
 image: backup-agent:v1
 restartPolicy: Always
 containers:
 - name: database
 image: database:v1
 # This Guaranteed container gets an exclusive 6 CPU slice from the pod's budget.
 # The remaining 2 CPUs and 4Gi memory form the pod shared pool for the sidecars.
 resources:
 requests:
 cpu: "6"
 memory: "12Gi"
 limits:
 cpu: "6"
 memory: "12Gi"

2. ML workload with infrastructure sidecars (Topology manager's container scope)

Imagine a pod running a GPU-accelerated ML training workload alongside a generic service mesh sidecar.

Under the container Topology Manager scope, the kubelet evaluates each container individually. You can grant the ML container exclusive, NUMA-aligned CPUs and Memory for maximum performance. Meanwhile, the service mesh sidecar doesn't need to be NUMA-aligned; it can run in the general node-wide shared pool. The collective resource consumption is still safely bounded by the overall pod limits, but you only allocate NUMA-aligned, exclusive resources to the specific containers that actually require them.

apiVersion: v1
kind: Pod
metadata:
 name: ml-workload
spec:
 # Pod-level resources establish the overall budget constraint.
 resources:
 requests:
 cpu: "4"
 memory: "8Gi"
 limits:
 cpu: "4"
 memory: "8Gi"
 initContainers:
 - name: service-mesh-sidecar
 image: service-mesh:v1
 restartPolicy: Always
 containers:
 - name: ml-training
 image: ml-training:v1
 # Under the 'container' scope, this Guaranteed container receives exclusive,
 # NUMA-aligned resources, while the sidecar runs in the node's shared pool.
 resources:
 requests:
 cpu: "3"
 memory: "6Gi"
 limits:
 cpu: "3"
 memory: "6Gi"

CPU quotas (CFS) and isolation

When running these mixed workloads within a pod, isolation is enforced differently depending on the allocation:

Exclusive containers: Containers granted exclusive CPU slices have their CPU CFS quota enforcement disabled at the container level, allowing them to run without being throttled by the Linux scheduler.
Pod shared pool containers: Containers falling into the pod shared pool have CPU CFS quotas enforced at the pod level, ensuring they do not consume more than the leftover pod budget.

How to enable Pod-Level Resource Managers

Using this feature requires Kubernetes v1.36 or newer. To enable it, you must configure the kubelet with the appropriate feature gates and policies:

Enable the PodLevelResources and PodLevelResourceManagers feature gates.
Configure the Topology Manager with a policy other than none (i.e., best-effort, restricted, or single-numa-node).
Set the Topology Manager scope to either pod or container using the topologyManagerScope field in the KubeletConfiguration.
Configure the CPU Manager with the static policy.
Configure the Memory Manager with the Static policy.

Observability

To help cluster administrators monitor and debug these new allocation models, we have introduced several new kubelet metrics when the feature gate is enabled:

resource_manager_allocations_total: Counts the total number of exclusive resource allocations performed by a manager. The source label ("pod" or "node") distinguishes between allocations drawn from the node-level pool versus a pre-allocated pod-level pool.
resource_manager_allocation_errors_total: Counts errors encountered during exclusive resource allocation, distinguished by the intended allocation source ("pod" or "node").
resource_manager_container_assignments: Tracks the cumulative number of containers running with specific assignment types. The assignment_type label ("node_exclusive", "pod_exclusive", "pod_shared") provides visibility into how workloads are distributed.

Current limitations and caveats

While this feature opens up new possibilities, there are a few things to keep in mind during its alpha phase. Be sure to review the Limitations and caveats in the official documentation for full details on compatibility, requirements, and downgrade instructions.

Getting started and providing feedback

For a deep dive into the technical details and configuration of this feature, check out the official concept documentation:

Pod-level resource managers

To learn more about the overall pod-level resources feature and how to assign resources to pods, see:

Assign Pod-level CPU and memory resources

As this feature moves through Alpha, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Pod-Level Resource Managers (Alpha)

Kubernetes Blog - Fri, 05/01/2026 - 14:35

Kubernetes v1.36 introduces Pod-Level Resource Managers as an alpha feature, bringing a more flexible and powerful resource management model to performance-sensitive workloads. This enhancement extends the kubelet's Topology, CPU, and Memory Managers to support pod-level resource specifications (.spec.resources), evolving them from a strictly per-container allocation model to a pod-centric one.

Why do we need pod-level resource managers?

When running performance-critical workloads such as machine learning (ML) training, high-frequency trading applications, or low-latency databases, you often need exclusive, NUMA-aligned resources for your primary application containers to ensure predictable performance.

However, modern Kubernetes pods rarely consist of just one container. They frequently include sidecar containers for logging, monitoring, service meshes, or data ingestion.

Before this feature, this created a trade-off, to get NUMA-aligned, exclusive resources for your main application, you had to allocate exclusive, integer-based CPU resources to every container in the pod. This might be wasteful for lightweight sidecars. If you didn't do this, you forfeited the pod's Guaranteed Quality of Service (QoS) class entirely, losing the performance benefits.

Introducing pod-level resource managers

Enabling pod-level resources support for the resource managers (via the PodLevelResourceManagers and PodLevelResources feature gates) allows the kubelet to create hybrid resource allocation models. This brings flexibility and efficiency to high-performance workloads without sacrificing NUMA alignment.

Real-world use cases

Here are a few practical scenarios demonstrating how this feature can be applied, depending on the configured Topology Manager scope:

1. Tightly-coupled database (Topology manager's pod scope)

Consider a latency-sensitive database pod that includes a main database container, a local metrics exporter, and a backup agent sidecar.

When configured with the pod Topology Manager scope, the kubelet performs a single NUMA alignment based on the entire pod's budget. The database container gets its exclusive CPU and memory slices from that NUMA node. The remaining resources from the pod's budget form a new pod shared pool. The metrics exporter and backup agent run in this pod shared pool. They share resources with each other, but they are strictly isolated from the database's exclusive slices and the rest of the node.

This allows you to safely co-locate auxiliary containers on the same NUMA node as your primary workload without wasting dedicated cores on them.

apiVersion: v1
kind: Pod
metadata:
 name: tightly-coupled-database
spec:
 # Pod-level resources establish the overall budget and NUMA alignment size.
 resources:
 requests:
 cpu: "8"
 memory: "16Gi"
 limits:
 cpu: "8"
 memory: "16Gi"
 initContainers:
 - name: metrics-exporter
 image: metrics-exporter:v1
 restartPolicy: Always
 - name: backup-agent
 image: backup-agent:v1
 restartPolicy: Always
 containers:
 - name: database
 image: database:v1
 # This Guaranteed container gets an exclusive 6 CPU slice from the pod's budget.
 # The remaining 2 CPUs and 4Gi memory form the pod shared pool for the sidecars.
 resources:
 requests:
 cpu: "6"
 memory: "12Gi"
 limits:
 cpu: "6"
 memory: "12Gi"

2. ML workload with infrastructure sidecars (Topology manager's container scope)

Imagine a pod running a GPU-accelerated ML training workload alongside a generic service mesh sidecar.

Under the container Topology Manager scope, the kubelet evaluates each container individually. You can grant the ML container exclusive, NUMA-aligned CPUs and Memory for maximum performance. Meanwhile, the service mesh sidecar doesn't need to be NUMA-aligned; it can run in the general node-wide shared pool. The collective resource consumption is still safely bounded by the overall pod limits, but you only allocate NUMA-aligned, exclusive resources to the specific containers that actually require them.

apiVersion: v1
kind: Pod
metadata:
 name: ml-workload
spec:
 # Pod-level resources establish the overall budget constraint.
 resources:
 requests:
 cpu: "4"
 memory: "8Gi"
 limits:
 cpu: "4"
 memory: "8Gi"
 initContainers:
 - name: service-mesh-sidecar
 image: service-mesh:v1
 restartPolicy: Always
 containers:
 - name: ml-training
 image: ml-training:v1
 # Under the 'container' scope, this Guaranteed container receives exclusive,
 # NUMA-aligned resources, while the sidecar runs in the node's shared pool.
 resources:
 requests:
 cpu: "3"
 memory: "6Gi"
 limits:
 cpu: "3"
 memory: "6Gi"

CPU quotas (CFS) and isolation

When running these mixed workloads within a pod, isolation is enforced differently depending on the allocation:

Exclusive containers: Containers granted exclusive CPU slices have their CPU CFS quota enforcement disabled at the container level, allowing them to run without being throttled by the Linux scheduler.
Pod shared pool containers: Containers falling into the pod shared pool have CPU CFS quotas enforced at the pod level, ensuring they do not consume more than the leftover pod budget.

How to enable Pod-Level Resource Managers

Using this feature requires Kubernetes v1.36 or newer. To enable it, you must configure the kubelet with the appropriate feature gates and policies:

Enable the PodLevelResources and PodLevelResourceManagers feature gates.
Configure the Topology Manager with a policy other than none (i.e., best-effort, restricted, or single-numa-node).
Set the Topology Manager scope to either pod or container using the topologyManagerScope field in the KubeletConfiguration.
Configure the CPU Manager with the static policy.
Configure the Memory Manager with the Static policy.

Observability

To help cluster administrators monitor and debug these new allocation models, we have introduced several new kubelet metrics when the feature gate is enabled:

resource_manager_allocations_total: Counts the total number of exclusive resource allocations performed by a manager. The source label ("pod" or "node") distinguishes between allocations drawn from the node-level pool versus a pre-allocated pod-level pool.
resource_manager_allocation_errors_total: Counts errors encountered during exclusive resource allocation, distinguished by the intended allocation source ("pod" or "node").
resource_manager_container_assignments: Tracks the cumulative number of containers running with specific assignment types. The assignment_type label ("node_exclusive", "pod_exclusive", "pod_shared") provides visibility into how workloads are distributed.

Current limitations and caveats

While this feature opens up new possibilities, there are a few things to keep in mind during its alpha phase. Be sure to review the Limitations and caveats in the official documentation for full details on compatibility, requirements, and downgrade instructions.

Getting started and providing feedback

For a deep dive into the technical details and configuration of this feature, check out the official concept documentation:

Pod-level resource managers

To learn more about the overall pod-level resources feature and how to assign resources to pods, see:

Assign Pod-level CPU and memory resources

As this feature moves through Alpha, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Pod-Level Resource Managers (Alpha)

Kubernetes Blog - Fri, 05/01/2026 - 14:35

Kubernetes v1.36 introduces Pod-Level Resource Managers as an alpha feature, bringing a more flexible and powerful resource management model to performance-sensitive workloads. This enhancement extends the kubelet's Topology, CPU, and Memory Managers to support pod-level resource specifications (.spec.resources), evolving them from a strictly per-container allocation model to a pod-centric one.

Why do we need pod-level resource managers?

When running performance-critical workloads such as machine learning (ML) training, high-frequency trading applications, or low-latency databases, you often need exclusive, NUMA-aligned resources for your primary application containers to ensure predictable performance.

However, modern Kubernetes pods rarely consist of just one container. They frequently include sidecar containers for logging, monitoring, service meshes, or data ingestion.

Before this feature, this created a trade-off, to get NUMA-aligned, exclusive resources for your main application, you had to allocate exclusive, integer-based CPU resources to every container in the pod. This might be wasteful for lightweight sidecars. If you didn't do this, you forfeited the pod's Guaranteed Quality of Service (QoS) class entirely, losing the performance benefits.

Introducing pod-level resource managers

Enabling pod-level resources support for the resource managers (via the PodLevelResourceManagers and PodLevelResources feature gates) allows the kubelet to create hybrid resource allocation models. This brings flexibility and efficiency to high-performance workloads without sacrificing NUMA alignment.

Real-world use cases

Here are a few practical scenarios demonstrating how this feature can be applied, depending on the configured Topology Manager scope:

1. Tightly-coupled database (Topology manager's pod scope)

Consider a latency-sensitive database pod that includes a main database container, a local metrics exporter, and a backup agent sidecar.

When configured with the pod Topology Manager scope, the kubelet performs a single NUMA alignment based on the entire pod's budget. The database container gets its exclusive CPU and memory slices from that NUMA node. The remaining resources from the pod's budget form a new pod shared pool. The metrics exporter and backup agent run in this pod shared pool. They share resources with each other, but they are strictly isolated from the database's exclusive slices and the rest of the node.

This allows you to safely co-locate auxiliary containers on the same NUMA node as your primary workload without wasting dedicated cores on them.

apiVersion: v1
kind: Pod
metadata:
 name: tightly-coupled-database
spec:
 # Pod-level resources establish the overall budget and NUMA alignment size.
 resources:
 requests:
 cpu: "8"
 memory: "16Gi"
 limits:
 cpu: "8"
 memory: "16Gi"
 initContainers:
 - name: metrics-exporter
 image: metrics-exporter:v1
 restartPolicy: Always
 - name: backup-agent
 image: backup-agent:v1
 restartPolicy: Always
 containers:
 - name: database
 image: database:v1
 # This Guaranteed container gets an exclusive 6 CPU slice from the pod's budget.
 # The remaining 2 CPUs and 4Gi memory form the pod shared pool for the sidecars.
 resources:
 requests:
 cpu: "6"
 memory: "12Gi"
 limits:
 cpu: "6"
 memory: "12Gi"

2. ML workload with infrastructure sidecars (Topology manager's container scope)

Imagine a pod running a GPU-accelerated ML training workload alongside a generic service mesh sidecar.

Under the container Topology Manager scope, the kubelet evaluates each container individually. You can grant the ML container exclusive, NUMA-aligned CPUs and Memory for maximum performance. Meanwhile, the service mesh sidecar doesn't need to be NUMA-aligned; it can run in the general node-wide shared pool. The collective resource consumption is still safely bounded by the overall pod limits, but you only allocate NUMA-aligned, exclusive resources to the specific containers that actually require them.

apiVersion: v1
kind: Pod
metadata:
 name: ml-workload
spec:
 # Pod-level resources establish the overall budget constraint.
 resources:
 requests:
 cpu: "4"
 memory: "8Gi"
 limits:
 cpu: "4"
 memory: "8Gi"
 initContainers:
 - name: service-mesh-sidecar
 image: service-mesh:v1
 restartPolicy: Always
 containers:
 - name: ml-training
 image: ml-training:v1
 # Under the 'container' scope, this Guaranteed container receives exclusive,
 # NUMA-aligned resources, while the sidecar runs in the node's shared pool.
 resources:
 requests:
 cpu: "3"
 memory: "6Gi"
 limits:
 cpu: "3"
 memory: "6Gi"

CPU quotas (CFS) and isolation

When running these mixed workloads within a pod, isolation is enforced differently depending on the allocation:

Exclusive containers: Containers granted exclusive CPU slices have their CPU CFS quota enforcement disabled at the container level, allowing them to run without being throttled by the Linux scheduler.
Pod shared pool containers: Containers falling into the pod shared pool have CPU CFS quotas enforced at the pod level, ensuring they do not consume more than the leftover pod budget.

How to enable Pod-Level Resource Managers

Using this feature requires Kubernetes v1.36 or newer. To enable it, you must configure the kubelet with the appropriate feature gates and policies:

Enable the PodLevelResources and PodLevelResourceManagers feature gates.
Configure the Topology Manager with a policy other than none (i.e., best-effort, restricted, or single-numa-node).
Set the Topology Manager scope to either pod or container using the topologyManagerScope field in the KubeletConfiguration.
Configure the CPU Manager with the static policy.
Configure the Memory Manager with the Static policy.

Observability

To help cluster administrators monitor and debug these new allocation models, we have introduced several new kubelet metrics when the feature gate is enabled:

resource_manager_allocations_total: Counts the total number of exclusive resource allocations performed by a manager. The source label ("pod" or "node") distinguishes between allocations drawn from the node-level pool versus a pre-allocated pod-level pool.
resource_manager_allocation_errors_total: Counts errors encountered during exclusive resource allocation, distinguished by the intended allocation source ("pod" or "node").
resource_manager_container_assignments: Tracks the cumulative number of containers running with specific assignment types. The assignment_type label ("node_exclusive", "pod_exclusive", "pod_shared") provides visibility into how workloads are distributed.

Current limitations and caveats

While this feature opens up new possibilities, there are a few things to keep in mind during its alpha phase. Be sure to review the Limitations and caveats in the official documentation for full details on compatibility, requirements, and downgrade instructions.

Getting started and providing feedback

For a deep dive into the technical details and configuration of this feature, check out the official concept documentation:

Pod-level resource managers

To learn more about the overall pod-level resources feature and how to assign resources to pods, see:

Assign Pod-level CPU and memory resources

As this feature moves through Alpha, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: In-Place Vertical Scaling for Pod-Level Resources Graduates to Beta

Kubernetes Blog - Thu, 04/30/2026 - 14:35

Following the graduation of Pod-Level Resources to Beta in v1.34 and the General Availability (GA) of In-Place Pod Vertical Scaling in v1.35, the Kubernetes community is thrilled to announce that In-Place Pod-Level Resources Vertical Scaling has graduated to Beta in v1.36!

This feature is now enabled by default via the InPlacePodLevelResourcesVerticalScaling feature gate. It allows users to update the aggregate Pod resource budget (.spec.resources) for a running Pod, often without requiring a container restart.

Why Pod-level in-place resize?

The Pod-level resource model simplified management for complex Pods (such as those with sidecars) by allowing containers to share a collective pool of resources. In v1.36, you can now adjust this aggregate boundary on-the-fly.

This is particularly useful for Pods where containers do not have individual limits defined. These containers automatically scale their effective boundaries to fit the newly resized Pod-level dimensions, allowing you to expand the shared pool during peak demand without manual per-container recalculations.

Resource inheritance and the `resizePolicy`

When a Pod-level resize is initiated, the Kubelet treats the change as a resize event for every container that inherits its limits from the Pod-level budget. To determine whether a restart is required, the Kubelet consults the resizePolicy defined within individual containers:

Non-disruptive Updates: If a container's restartPolicy is set to NotRequired, the Kubelet attempts to update the cgroup limits dynamically via the Container Runtime Interface (CRI).
Disruptive Updates: If set to RestartContainer, the container will be restarted to apply the new aggregate boundary safely.

Note: Currently, resizePolicy is not supported at the Pod level. The Kubelet always defers to individual container settings to decide if an update can be applied in-place or requires a restart.

Example: Scaling a shared resource pool

In this scenario, a Pod is defined with a 2 CPU pod-level limit. Because the individual containers do not have their own limits defined, they share this total pool.

1. Initial Pod specification

apiVersion: v1
kind: Pod
metadata:
 name: shared-pool-app
spec:
 resources: # Pod-level limits
 limits:
 cpu: "2"
 memory: "4Gi"
 containers:
 - name: main-app
 image: my-app:v1
 resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]
 - name: sidecar
 image: logger:v1
 resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]

2. The resize operation

To double the CPU capacity to 4 CPUs, apply a patch using the resize subresource:

kubectl patch pod shared-pool-app --subresource resize --patch \
 '{"spec":{"resources":{"limits":{"cpu":"4"}}}}'

Node-Level reality: feasibility and safety

Applying a resize patch is only the first step. The Kubelet performs several checks and follows a specific sequence to ensure node stability:

1. The feasibility check

Before admitting a resize, the Kubelet verifies if the new aggregate request fits within the Node's allocatable capacity. If the Node is overcommitted, the resize is not ignored; instead, the PodResizePending condition will reflect a Deferred or Infeasible status, providing immediate feedback on why the "envelope" hasn't grown.

2. Update sequencing

To prevent resource "overshoot", the Kubelet coordinates the cgroup updates in a specific order:

When Increasing: The Pod-level cgroup is expanded first, creating the "room" before the individual container cgroups are enlarged.
When Decreasing: The container cgroups are throttled first, and only then is the aggregate Pod-level cgroup shrunken.

Observability: tracking resize status

With the move to Beta, Kubernetes uses Pod Conditions to track the lifecycle of a resize:

PodResizePending: The spec is updated, but the Node hasn't admitted the change (e.g., due to capacity).
PodResizeInProgress: The Node has admitted the resize (status.allocatedResources) but the changes aren't yet fully applied to the cgroups (status.resources).

status:
 allocatedResources:
 cpu: "4"
 resources:
 limits:
 cpu: "4"
 conditions:
 - type: PodResizeInProgress
 status: "True"

Constraints and requirements

cgroup v2 Only: Required for accurate aggregate enforcement.
CRI Support: Requires a container runtime that supports the UpdateContainerResources CRI call (e.g., containerd v2.0+ or CRI-O).
Feature Gates: Requires PodLevelResources, InPlacePodVerticalScaling, InPlacePodLevelResourcesVerticalScaling, and NodeDeclaredFeatures.
Linux Only: Currently exclusive to Linux-based nodes.

What's next?

As we move toward General Availability (GA), the community is focusing on Vertical Pod Autoscaler (VPA) Integration, enabling VPA to issue Pod-level resource recommendations and trigger in-place actuation automatically.

Getting started and providing feedback

We encourage you to test this feature and provide feedback via the standard Kubernetes communication channels:

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Tiered Memory Protection with Memory QoS

Kubernetes Blog - Wed, 04/29/2026 - 14:35

On behalf of SIG Node, we are pleased to announce updates to the Memory QoS feature (alpha) in Kubernetes v1.36. Memory QoS uses the cgroup v2 memory controller to give the kernel better guidance on how to treat container memory. It was first introduced in v1.22 and updated in v1.27. In Kubernetes v1.36, we're introducing: opt-in memory reservation, tiered protection by QoS class, observability metrics, and kernel-version warning for memory.high.

What's new in v1.36

Opt-in memory reservation with `memoryReservationPolicy`

v1.36 separates throttling from reservation. Enabling the feature gate turns on memory.high throttling (the kubelet sets memory.high based on memoryThrottlingFactor, default 0.9), but memory reservation is now controlled by a separate kubelet configuration field:

None (default): no memory.min or memory.low is written. Throttling via memory.high still works.
TieredReservation: the kubelet writes tiered memory protection based on the Pod's QoS class:

Guaranteed Pods get hard protection via memory.min. For example, a Guaranteed Pod requesting 512 MiB of memory results in:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-pod6a4f2e3b_1c9d_4a5e_8f7b_2d3e4f5a6b7c.slice/memory.min
536870912

The kernel will not reclaim this memory under any circumstances. If it cannot honor the guarantee, it invokes the OOM killer on other processes to free pages.

Burstable Pods get soft protection via memory.low. For the same 512 MiB request on a Burstable Pod:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b3c7d2e_4f5a_6b7c_9d1e_3f4a5b6c7d8e.slice/memory.low
536870912

The kernel avoids reclaiming this memory under normal pressure, but may reclaim it if the alternative is a system-wide OOM.

BestEffort Pods get neither memory.min nor memory.low. Their memory remains fully reclaimable.

Comparison with v1.27 behavior

In earlier versions, enabling the MemoryQoS feature gate immediately set memory.min for every container with a memory request. memory.min is a hard reservation that the kernel will not reclaim, regardless of memory pressure.

Consider a node with 8 GiB of RAM where Burstable Pod requests total 7 GiB. In earlier versions, that 7 GiB would be locked as memory.min, leaving little headroom for the kernel, system daemons, or BestEffort workloads and increasing the risk of OOM kills.

With v1.36 tiered reservation, those Burstable requests map to memory.low instead of memory.min. Under normal pressure, the kernel still protects that memory, but under extreme pressure it can reclaim part of it to avoid system-wide OOM. Only Guaranteed Pods use memory.min, which keeps hard reservation lower.

With memoryReservationPolicy in v1.36, you can enable throttling first, observe workload behavior, and opt into reservation when your node has enough headroom.

Observability metrics

Two alpha-stability metrics are exposed on the kubelet /metrics endpoint:

Metric Description kubelet_memory_qos_node_memory_min_bytes Total memory.min across Guaranteed Pods kubelet_memory_qos_node_memory_low_bytes Total memory.low across Burstable Pods

These are useful for capacity planning. If kubelet_memory_qos_node_memory_min_bytes is creeping toward your node's physical memory, you know hard reservation is getting tight.

$ curl -sk https://localhost:10250/metrics | grep memory_qos
# HELP kubelet_memory_qos_node_memory_min_bytes [ALPHA] Total memory.min in bytes for Guaranteed pods
kubelet_memory_qos_node_memory_min_bytes 5.36870912e+08
# HELP kubelet_memory_qos_node_memory_low_bytes [ALPHA] Total memory.low in bytes for Burstable pods
kubelet_memory_qos_node_memory_low_bytes 2.147483648e+09

Kernel version check

On kernels older than 5.9, memory.high throttling can trigger the kernel livelock issue. The bug was fixed in kernel 5.9. In v1.36, when the feature gate is enabled, the kubelet checks the kernel version at startup and logs a warning if it is below 5.9. The feature continues to work — this is informational, not a hard block.

How Kubernetes maps Memory QoS to cgroup v2

Memory QoS uses four cgroup v2 memory controller interfaces:

memory.max: hard memory limit — unchanged from previous versions
memory.min: hard memory protection — with TieredReservation, set only for Guaranteed Pods
memory.low: soft memory protection — set for Burstable Pods with TieredReservation
memory.high: memory throttling threshold — unchanged from previous versions

The following table shows how Kubernetes container resources map to cgroup v2 interfaces when memoryReservationPolicy: TieredReservation is configured. With the default memoryReservationPolicy: None, no memory.min or memory.low values are set.

QoS Class memory.min memory.low memory.high memory.max Guaranteed Set to requests.memory
(hard protection) Not set Not set
(requests == limits, so throttling is not useful) Set to limits.memory Burstable Not set Set to requests.memory
(soft protection) Calculated based on
formula with throttling factor Set to limits.memory
(if specified) BestEffort Not set Not set Calculated based on
node allocatable memory Not set

Cgroup hierarchy

cgroup v2 requires that a parent cgroup's memory protection is at least as large as the sum of its children's. The kubelet maintains this by setting memory.min on the kubepods root cgroup to the sum of all Guaranteed and Burstable Pod memory requests, and memory.low on the Burstable QoS cgroup to the sum of all Burstable Pod memory requests. This way the kernel can enforce the per-container and per-pod protection values correctly.

The kubelet manages pod-level and QoS-class cgroups directly using the runc libcontainer library, while container-level cgroups are managed by the container runtime (containerd or CRI-O).

How do I use it?

Prerequisites

Kubernetes v1.36 or later
Linux with cgroup v2. Kernel 5.9 or higher is recommended — earlier kernels work but may experience the livelock issue. You can verify cgroup v2 is active by running mount | grep cgroup2.
A container runtime that supports cgroup v2 (containerd 1.6+, CRI-O 1.22+)

Configuration

To enable Memory QoS with tiered protection:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
 MemoryQoS: true
memoryReservationPolicy: TieredReservation # Options: None (default), TieredReservation
memoryThrottlingFactor: 0.9 # Optional: default is 0.9

If you want memory.high throttling without memory protection, omit memoryReservationPolicy or set it to None:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
 MemoryQoS: true
memoryReservationPolicy: None  # This is the default

How can I learn more?

Getting involved

This feature is driven by SIG Node. If you are interested in contributing or have feedback, you can find us on Slack (#sig-node), the mailing list, or at the regular SIG Node meetings. Please file bugs at kubernetes/kubernetes and enhancement proposals at kubernetes/enhancements.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Staleness Mitigation and Observability for Controllers

Kubernetes Blog - Tue, 04/28/2026 - 14:35

Staleness in Kubernetes controllers is a problem that affects many controllers, and is something may affect controller behavior in subtle ways. It is usually not until it is too late, when a controller in production has already taken incorrect action, that staleness is found to be an issue due to some underlying assumption made by the controller author. Some issues caused by staleness include controllers taking incorrect actions, controllers not taking action when they should, and controllers taking too long to take action. I am excited to announce that Kubernetes v1.36 includes new features that help mitigate staleness in controllers and provide better observability into controller behavior.

What is staleness?

Staleness in controllers comes from an outdated view of the world inside of the controller cache. In order to provide a fast user experience, controllers typically maintain a local cache of the state of the cluster. This cache is populated by watching the Kubernetes API server for changes to objects that the controller cares about. When the controller needs to take action, it will first check its cache to see if it has the latest information. If it does not, it will then update its cache by watching the API server for changes to objects that the controller cares about. This process is known as reconciliation.

However, there are some cases where the controller's cache may be outdated. For example, if the controller is restarted, it will need to rebuild its cache by watching the API server for changes to objects that the controller cares about. During this time, the controller's cache will be outdated, and it will not be able to take action. Additionally, if the API server is down, the controller's cache will not be updated, and it will not be able to take action. These are just a few examples of cases where the controller's cache may be outdated.

Improvements in 1.36

Kubernetes v1.36 includes improvements in both client-go as well as implementations of highly contended controllers in kube-controller-manager, using those client-go improvements.

client-go improvements

In client-go, the project added atomic FIFO processing (feature gate name AtomicFIFO), which is on top of the existing FIFO queue implementation. The new approach allows for the queue to atomically handle operations that are recieved in batches, such as the initial set of objects from a list operation that an informer uses to populate its cache. This ensures that the queue is always in a consistent state, even when events come out of order. Prior to this, events were added to the queue in the order that they were received, which could lead to an inconsistent state in the cache that does not accurately reflect the state of the cluster.

With this change, you can now ensure that the queue is always in a consistent state, even when events come out of order. To take advantage of this, clients using client-go can now introspect into the cache to determine the latest resource version that the controller cache has seen. This is done with the newly added function LastStoreSyncResourceVersion() implemented on the Store interface here. This function is the basis for the staleness mitigation features in kube-controller-manager.

kube-controller-manager improvements

In kube-controller-manager, the v1.36 release has added the ability for 4 different controllers to use this new capability. The controllers are:

DaemonSet controller
StatefulSet controller
ReplicaSet controller
Job controller

These controllers all act on pods, which in most cases are under the highest amount of contention in a cluster. The changes are on by default for these controllers, and can be disabled by setting the feature gates StaleControllerConsistency<API type> to false for the specific controller you wish to disable it for. For example, to disable the feature for the DaemonSet controller, you would set the feature gate StaleControllerConsistencyDaemonSet to false.

When the relevant feature gate is enabled, the controller will first check the latest resource version of the cache before taking action. If the latest resource version of the cache is lower than what the controller has written to the API server for the object it is trying to reconcile, the controller will not take action. This is because the controller's cache is outdated, and it does not have the latest information about the state of the cluster.

Use for informer authors

Informer authors using client-go can also immediately take advantage of these improvements. See an example of how to use this feature in the ReplicaSet informer. This PR shows how to use the new feature to check if the informer's cache is stale before taking action. The client-go library provides a ConsistencyStore data structure that queries the store and compares the latest resource version of the cache with the written resource version of the object.

The ReplicaSet controller tracks both the ReplicaSet's resource version and the resource version of the pods that the ReplicaSet manages. For a specific ReplicaSet, it tracks the latest written resource version of the pods that the ReplicaSet owns as well as any writes to the ReplicaSet itself. If the latest resource version of the cache is lower than what the controller has written to the API server for the object it is trying to reconcile, the controller will not take action. This is because the controller's cache is outdated, and it does not have the latest information about the state of the cluster.

An informer author can use the ConsistencyStore to track the latest resource version of the objects that the informer cares about. It provides 3 main functions:

type ConsistencyStore interface {
 // WroteAt records that the given object was written at the given resource version.
 WroteAt(owningObj runtime.Object, uid types.UID, groupResource schema.GroupResource, resourceVersion string)

 // EnsureReady returns true if the cache is up to date for the given object.
 // It is used prior to reconciliation to decide whether to reconcile or not.
 EnsureReady(namespacedName types.NamespacedName) bool

 // Clear removes the given object from the consistency store.
 // It is used when an object is deleted.
 Clear(namespacedName types.NamespacedName, uid types.UID)
}

WroteAt: This function is called by the controller when it writes to the API server for an object. It is used to record the latest resource version of the object that the controller has written to the API server. The owningObj is the object that the controller is reconciling, and the uid is the UID of that object. The resource version and GroupResource are the resource version and GroupResource of the object that the controller has written to the API server. The object is not explicitly tracked, since the controller only cares about waiting to catch up to the latest resource version of the written object.
EnsureReady: This function is called by the controller to ensure that the cache is up to date for the object. It is used prior to reconciliation to decide whether to reconcile or not. It returns true if the cache is up to date for the object, and false otherwise. It will use the information provided by WroteAt to determine if the cache is up to date.
Clear: This function is called by the controller when an object is deleted. It is used to remove the object from the consistency store. This is mostly used for cleanup when an object is deleted to prevent the consistency store from growing indefinitely.

The UID is used to distinguish between different objects that have the same name, such as when an object is deleted and then recreated. It is not needed for EnsureReady because the consistency store is only concerned with catching up to the latest resource version of the object, not the specific object. It is primarily used to ensure that the controller doesn't delete the entry for an object when it is recreated with a new UID.

With these 3 functions, an informer author can implement staleness mitigation in their controller.

Observability

In addition to the staleness mitigation features, the Kubernetes project has also added related instrumentation to kube-controller-manager in 1.36. These metrics are also enabled by default, and are controlled using the same set of feature gates.

Metrics

The following alpha metrics have been added to kube-controller-manager in 1.36:

stale_sync_skips_total: The number of times the controller has skipped a sync due to stale cache. This metric is exposed for each controller that uses the staleness mitigation feature with the subsystem of the controller.

This metric is exposed by the kube-controller-manager metrics endpoint, and can be used to monitor the health of the controller.

Along with this metric, client-go also emits metrics that expose the latest resource version of every shared informer with the subsystem of the informer. This allows you to see the latest resource version of each informer, and use that to determine if the controller's cache is stale, especially great for comparing against the resource version of the API server.

This metric is named store_resource_version and has the Group, Version, and Resource as labels.

What's next?

Kubernetes SIG API Machinery is excited to continue working on this feature and hope to bring it to more controllers in the future. We are also interested in hearing your feedback on this feature. Please let us know what you think in the comments below or by opening an issue on the Kubernetes GitHub repository.

We are also working with controller-runtime to enable this set of semantics for all controllers built with controller-runtime. This will allow any controller built with controller-runtime to gain the benefits of read your own writes, without having to implement the logic themselves.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Mutable Pod Resources for Suspended Jobs (beta)

Kubernetes Blog - Mon, 04/27/2026 - 14:35

Kubernetes v1.36 promotes the ability to modify container resource requests and limits in the pod template of a suspended Job to beta. First introduced as alpha in v1.35, this feature allows queue controllers and cluster administrators to adjust CPU, memory, GPU, and extended resource specifications on a Job while it is suspended, before it starts or resumes running.

Why mutable pod resources for suspended Jobs?

Batch and machine learning workloads often have resource requirements that are not precisely known at Job creation time. The optimal resource allocation depends on current cluster capacity, queue priorities, and the availability of specialized hardware like GPUs.

Before this feature, resource requirements in a Job's pod template were immutable once set. If a queue controller like Kueue determined that a suspended Job should run with different resources, the only option was to delete and recreate the Job, losing any associated metadata, status, or history. This feature also provides a way to let a specific Job instance for a CronJob progress slowly with reduced resources, rather than outright failing to run if the cluster is heavily loaded.

Consider a machine learning training Job initially requesting 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
 name: training-job-example-abcd123
 labels:
 app.kubernetes.io/name: trainer
spec:
 suspend: true
 template:
 metadata:
 annotations:
 kubernetes.io/description: "ML training, ID abcd123"
 spec:
 containers:
 - name: trainer
 image: example-registry.example.com/training:2026-04-23T150405.678
 resources:
 requests:
 cpu: "8"
 memory: "32Gi"
 example-hardware-vendor.com/gpu: "4"
 limits:
 cpu: "8"
 memory: "32Gi"
 example-hardware-vendor.com/gpu: "4"
 restartPolicy: Never

A queue controller managing cluster resources might determine that only 2 GPUs are available. With this feature, the controller can update the Job's resource requests before resuming it:

apiVersion: batch/v1
kind: Job
metadata:
 name: training-job-example-abcd123
 labels:
 app.kubernetes.io/name: trainer
spec:
 suspend: true
 template:
 metadata:
 annotations:
 kubernetes.io/description: "ML training, ID abcd123"
 spec:
 containers:
 - name: trainer
 image: example-registry.example.com/training:2026-04-23T150405.678
 resources:
 requests:
 cpu: "4"
 memory: "16Gi"
 example-hardware-vendor.com/gpu: "2"
 limits:
 cpu: "4"
 memory: "16Gi"
 example-hardware-vendor.com/gpu: "2"
 restartPolicy: Never

Once the resources are updated, the controller resumes the Job by setting spec.suspend to false, and the new Pods are created with the adjusted resource specifications.

How it works

The Kubernetes API server relaxes the immutability constraint on pod template resource fields specifically for suspended Jobs. No new API types have been introduced; the existing Job and pod template structures accommodate the change through relaxed validation.

The mutable fields are:

spec.template.spec.containers[*].resources.requests
spec.template.spec.containers[*].resources.limits
spec.template.spec.initContainers[*].resources.requests
spec.template.spec.initContainers[*].resources.limits

Resource updates are permitted when the following conditions are met:

The Job has spec.suspend set to true.
For a Job that was previously running and then suspended, all active Pods must have terminated (status.active equals 0) before resource mutations are accepted.

Standard resource validation still applies. For example, resource limits must be greater than or equal to requests, and extended resources must be specified as whole numbers where required.

What's new in beta

With the promotion to beta in Kubernetes v1.36, the MutablePodResourcesForSuspendedJobs feature gate is enabled by default. This means clusters running v1.36 can use this feature without any additional configuration on the API server.

Try it out

If your cluster is running Kubernetes v1.36 or later, this feature is available by default. For v1.35 clusters, enable the MutablePodResourcesForSuspendedJobs feature gate on the kube-apiserver.

You can test it by creating a suspended Job, updating its container resources using kubectl edit or a controller, and then resuming the Job:

# Create a suspended Job
kubectl apply -f my-job.yaml --server-side

# Edit the resource requests
kubectl edit job training-job-example-abcd123

# Resume the Job
kubectl patch job training-job-example-abcd123 -p '{"spec":{"suspend":false}}'

Considerations

Running Jobs that are suspended

If you suspend a Job that was already running, you must wait for all of that Job's active Pods to terminate before modifying resources. The API server rejects resource mutations while status.active is greater than zero. This prevents inconsistency between running Pods and the updated pod template.

Pod replacement policy

When using this feature with Jobs that may have failed Pods, consider setting podReplacementPolicy: Failed. This ensures that replacement Pods are only created after the previous Pods have fully terminated, preventing resource contention from overlapping Pods.

ResourceClaims

Dynamic Resource Allocation (DRA) resourceClaimTemplates remain immutable. If your workload uses DRA, you must recreate the claim templates separately to match any resource changes.

Getting involved

This feature was developed by SIG Apps This feature was developed by SIG Apps with input from WG Batch. Both groups welcome feedback as the feature progresses toward stable.

You can reach out through:

Slack channel #sig-apps.
Slack channel #wg-batch.
The KEP-5440 tracking issue.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Fine-Grained Kubelet API Authorization Graduates to GA

Kubernetes Blog - Fri, 04/24/2026 - 14:35

On behalf of Kubernetes SIG Auth and SIG Node, we are pleased to announce the graduation of fine-grained kubelet API authorization to General Availability (GA) in Kubernetes v1.36!

The KubeletFineGrainedAuthz feature gate was introduced as an opt-in alpha feature in Kubernetes v1.32, then graduated to beta (enabled by default) in v1.33. Now, the feature is generally available and the feature gate is locked to enabled. This feature enables more precise, least-privilege access control over the kubelet's HTTPS API, replacing the need to grant the overly broad nodes/proxy permission for common monitoring and observability use cases.

Motivation: the `nodes/proxy` problem

The kubelet exposes an HTTPS endpoint with several APIs that give access to data of varying sensitivity, including pod listings, node metrics, container logs, and, critically, the ability to execute commands inside running containers.

Prior to this feature, kubelet authorization used a coarse-grained model. When webhook authorization was enabled, almost all kubelet API paths were mapped to a single nodes/proxy subresource. This meant that any workload needing to read metrics or health status from the kubelet required nodes/proxy permission, the same permission that also grants the ability to execute arbitrary commands in any container running on the node.

What's wrong with that?

Granting nodes/proxy to monitoring agents, log collectors, or health-checking tools violates the principle of least privilege. If any of those workloads were compromised, an attacker would gain the ability to run commands in every container on the node. The nodes/proxy permission is effectively a node-level superuser capability, and granting it broadly dramatically increases the blast radius of a security incident.

This problem has been well understood in the community for years (see kubernetes/kubernetes#83465), and was the driving motivation behind this enhancement KEP-2862.

The `nodes/proxy GET` WebSocket RCE risk

The situation is more severe than it might appear at first glance. Security researchers demonstrated in early 2026 that nodes/proxy GET alone, which is the minimal read-only permission routinely granted to monitoring tools, can be abused to execute commands in any pod on reachable nodes.

The root cause is a mismatch between how WebSocket connections work and how the kubelet maps HTTP methods to RBAC verbs. The WebSocket protocol (RFC 6455) requires an HTTP GET request for the initial connection handshake. The kubelet maps this GET to the RBAC get verb and authorizes the request without performing a secondary check to confirm that CREATE permission is also present for the write operation that follows. Using a WebSocket client like websocat, an attacker can reach the kubelet's /exec endpoint directly on port 10250 and execute arbitrary commands:

websocat --insecure \
 --header "Authorization: Bearer $TOKEN" \
 --protocol v4.channel.k8s.io \
 "wss://$NODE_IP:10250/exec/default/nginx/nginx?output=1&error=1&command=id"

uid=0(root) gid=0(root) groups=0(root)

Fine-grained `kubelet` authorization: how it works

With KubeletFineGrainedAuthz, the kubelet now performs an additional, more specific authorization check before falling back to the nodes/proxy subresource. Several commonly used kubelet API paths are mapped to their own dedicated subresources:

kubelet API Resource Subresource /stats/* nodes stats /metrics/* nodes metrics /logs/* nodes log /pods nodes pods, proxy /runningPods/ nodes pods, proxy /healthz nodes healthz, proxy /configz nodes configz, proxy /spec/* nodes spec /checkpoint/* nodes checkpoint all others nodes proxy

For the endpoints that now have fine-grained subresources (/pods, /runningPods/, /healthz, /configz), the kubelet first sends a SubjectAccessReview for the specific subresource. If that check succeeds, the request is authorized. If it fails, the kubelet retries with the coarse-grained nodes/proxy subresource for backward compatibility.

This dual-check approach ensures a smooth migration path. Existing workloads with nodes/proxy permissions continue to work, while new deployments can adopt least-privilege access from day one.

What this means in practice

Consider a Prometheus node exporter or a monitoring DaemonSet that needs to scrape /metrics from the kubelet. Previously, you would need an RBAC ClusterRole like this:

# Old approach: overly broad
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: monitoring-agent
rules:
- apiGroups: [""]
 resources: ["nodes/proxy"]
 verbs: ["get"]

This grants the monitoring agent far more access than it needs. With fine-grained authorization, you can now scope the permissions precisely:

# New approach: least privilege
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: monitoring-agent
rules:
- apiGroups: [""]
 resources: ["nodes/metrics", "nodes/stats"]
 verbs: ["get"]

The monitoring agent can now read metrics and stats from the kubelet without ever being able to execute commands in containers.

Updated `system:kubelet-api-admin` `ClusterRole`

When RBAC authorization is enabled, the built-in system:kubelet-api-admin ClusterRole is automatically updated to include permissions for all the new fine-grained subresources. This ensures that cluster administrators who already use this role, including the API server's kubelet client, continue to have full access without any manual configuration changes.

The role now includes permissions for:

nodes/proxy
nodes/stats
nodes/metrics
nodes/log
nodes/spec
nodes/checkpoint
nodes/configz
nodes/healthz
nodes/pods

Upgrade considerations

Because the kubelet performs a dual authorization check (fine-grained first, then falling back to nodes/proxy), upgrading to v1.36 should be seamless for most clusters:

Existing workloads with nodes/proxy permissions continue to work without changes. The fallback to nodes/proxy ensures backward compatibility.
The API server always has nodes/proxy permissions via system:kubelet-api-admin, so kube-apiserver-to-kubelet communication is unaffected regardless of feature gate state.
Mixed-version clusters are handled gracefully. If a kubelet supports fine-grained authorization but the API server does not (or vice versa), nodes/proxy permissions serve as the fallback.

Verifying the feature is enabled

You can confirm that the feature is active on a given node by checking the kubelet metrics endpoint. Since the metrics endpoint on port 10250 requires authorization, you'll first need to create appropriate RBAC bindings for the pod or ServiceAccount making the request.

Step 1: Create a ServiceAccount and ClusterRole

apiVersion: v1
kind: ServiceAccount
metadata:
 name: kubelet-metrics-checker
 namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: kubelet-metrics-reader
rules:
- apiGroups: [""]
 resources: ["nodes/metrics"]
 verbs: ["get"]

Step 2: Bind the ClusterRole to the ServiceAccount

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
 name: kubelet-metrics-checker
subjects:
- kind: ServiceAccount
 name: kubelet-metrics-checker
 namespace: default
roleRef:
 kind: ClusterRole
 name: kubelet-metrics-reader
 apiGroup: rbac.authorization.k8s.io

Apply both manifests:

kubectl apply -f serviceaccount.yaml
kubectl apply -f clusterrole.yaml
kubectl apply -f clusterrolebinding.yaml

Step 3: Run a pod with the ServiceAccount and check the feature flag

kubectl run kubelet-check \
 --image=curlimages/curl \
 --serviceaccount=kubelet-metrics-checker \
 --restart=Never \
 --rm -it \
 -- sh

Then from within the pod, retrieve the node IP and query the metrics endpoint:

# Get the token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Query the kubelet metrics and filter for the feature gate
curl -sk \
 --header "Authorization: Bearer $TOKEN" \
 https://$NODE_IP:10250/metrics \
 | grep kubernetes_feature_enabled \
 | grep KubeletFineGrainedAuthz

If the feature is enabled, you should see output like:

kubernetes_feature_enabled{name="KubeletFineGrainedAuthz",stage="GA"} 1

Note: Replace $NODE_IP with the IP address of the node you want to check. You can retrieve node IPs with kubectl get nodes -o wide.

The journey from alpha to GA

Release Stage Details v1.32 Alpha Feature gate KubeletFineGrainedAuthz introduced, disabled by default v1.33 Beta Enabled by default; fine-grained checks for /pods, /runningPods/, /healthz, /configz v1.36 GA Feature gate locked to enabled; fine-grained kubelet authorization is always active

What's next?

With fine-grained kubelet authorization now GA, the Kubernetes community can begin recommending and eventually enforcing the use of specific subresources instead of nodes/proxy for monitoring and observability workloads. The urgency of this migration is underscored by research showing that nodes/proxy GET can be abused for unlogged remote code execution via the WebSocket protocol. This risk is present in the default RBAC configurations of dozens of widely deployed Helm charts. Over time, we expect:

Ecosystem adoption: Monitoring tools like Prometheus, Datadog agents, and other DaemonSets can update their default RBAC configurations to use nodes/metrics, nodes/stats, and nodes/pods instead of nodes/proxy. This directly eliminates the WebSocket RCE attack surface for those workloads.
Policy enforcement: Admission controllers and policy engines can flag or reject RBAC bindings that grant nodes/proxy when fine-grained alternatives exist, helping organizations adopt least-privilege access at scale.
Deprecation path: As adoption grows, nodes/proxy may eventually be deprecated for monitoring use cases, further reducing the attack surface of Kubernetes clusters.

Getting involved

This enhancement was driven by SIG Auth and SIG Node. If you are interested in contributing to the security and authorization features of Kubernetes, please join us:

SIG Auth
SIG Node
Slack: #sig-auth and #sig-node
KEP-2862: Fine-Grained Kubelet API Authorization

We look forward to hearing your feedback and experiences with this feature!

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: Fine-Grained Kubelet API Authorization Graduates to GA

Kubernetes Blog - Fri, 04/24/2026 - 14:35

On behalf of Kubernetes SIG Auth and SIG Node, we are pleased to announce the graduation of fine-grained kubelet API authorization to General Availability (GA) in Kubernetes v1.36!

The KubeletFineGrainedAuthz feature gate was introduced as an opt-in alpha feature in Kubernetes v1.32, then graduated to beta (enabled by default) in v1.33. Now, the feature is generally available and the feature gate is locked to enabled. This feature enables more precise, least-privilege access control over the kubelet's HTTPS API, replacing the need to grant the overly broad nodes/proxy permission for common monitoring and observability use cases.

Motivation: the `nodes/proxy` problem

The kubelet exposes an HTTPS endpoint with several APIs that give access to data of varying sensitivity, including pod listings, node metrics, container logs, and, critically, the ability to execute commands inside running containers.

Prior to this feature, kubelet authorization used a coarse-grained model. When webhook authorization was enabled, almost all kubelet API paths were mapped to a single nodes/proxy subresource. This meant that any workload needing to read metrics or health status from the kubelet required nodes/proxy permission, the same permission that also grants the ability to execute arbitrary commands in any container running on the node.

What's wrong with that?

Granting nodes/proxy to monitoring agents, log collectors, or health-checking tools violates the principle of least privilege. If any of those workloads were compromised, an attacker would gain the ability to run commands in every container on the node. The nodes/proxy permission is effectively a node-level superuser capability, and granting it broadly dramatically increases the blast radius of a security incident.

This problem has been well understood in the community for years (see kubernetes/kubernetes#83465), and was the driving motivation behind this enhancement KEP-2862.

The `nodes/proxy GET` WebSocket RCE risk

The situation is more severe than it might appear at first glance. Security researchers demonstrated in early 2026 that nodes/proxy GET alone, which is the minimal read-only permission routinely granted to monitoring tools, can be abused to execute commands in any pod on reachable nodes.

The root cause is a mismatch between how WebSocket connections work and how the kubelet maps HTTP methods to RBAC verbs. The WebSocket protocol (RFC 6455) requires an HTTP GET request for the initial connection handshake. The kubelet maps this GET to the RBAC get verb and authorizes the request without performing a secondary check to confirm that CREATE permission is also present for the write operation that follows. Using a WebSocket client like websocat, an attacker can reach the kubelet's /exec endpoint directly on port 10250 and execute arbitrary commands:

websocat --insecure \
 --header "Authorization: Bearer $TOKEN" \
 --protocol v4.channel.k8s.io \
 "wss://$NODE_IP:10250/exec/default/nginx/nginx?output=1&error=1&command=id"

uid=0(root) gid=0(root) groups=0(root)

Fine-grained `kubelet` authorization: how it works

With KubeletFineGrainedAuthz, the kubelet now performs an additional, more specific authorization check before falling back to the nodes/proxy subresource. Several commonly used kubelet API paths are mapped to their own dedicated subresources:

kubelet API Resource Subresource /stats/* nodes stats /metrics/* nodes metrics /logs/* nodes log /pods nodes pods, proxy /runningPods/ nodes pods, proxy /healthz nodes healthz, proxy /configz nodes configz, proxy /spec/* nodes spec /checkpoint/* nodes checkpoint all others nodes proxy

For the endpoints that now have fine-grained subresources (/pods, /runningPods/, /healthz, /configz), the kubelet first sends a SubjectAccessReview for the specific subresource. If that check succeeds, the request is authorized. If it fails, the kubelet retries with the coarse-grained nodes/proxy subresource for backward compatibility.

This dual-check approach ensures a smooth migration path. Existing workloads with nodes/proxy permissions continue to work, while new deployments can adopt least-privilege access from day one.

What this means in practice

Consider a Prometheus node exporter or a monitoring DaemonSet that needs to scrape /metrics from the kubelet. Previously, you would need an RBAC ClusterRole like this:

# Old approach: overly broad
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: monitoring-agent
rules:
- apiGroups: [""]
 resources: ["nodes/proxy"]
 verbs: ["get"]

This grants the monitoring agent far more access than it needs. With fine-grained authorization, you can now scope the permissions precisely:

# New approach: least privilege
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: monitoring-agent
rules:
- apiGroups: [""]
 resources: ["nodes/metrics", "nodes/stats"]
 verbs: ["get"]

The monitoring agent can now read metrics and stats from the kubelet without ever being able to execute commands in containers.

Updated `system:kubelet-api-admin` `ClusterRole`

When RBAC authorization is enabled, the built-in system:kubelet-api-admin ClusterRole is automatically updated to include permissions for all the new fine-grained subresources. This ensures that cluster administrators who already use this role, including the API server's kubelet client, continue to have full access without any manual configuration changes.

The role now includes permissions for:

nodes/proxy
nodes/stats
nodes/metrics
nodes/log
nodes/spec
nodes/checkpoint
nodes/configz
nodes/healthz
nodes/pods

Upgrade considerations

Because the kubelet performs a dual authorization check (fine-grained first, then falling back to nodes/proxy), upgrading to v1.36 should be seamless for most clusters:

Existing workloads with nodes/proxy permissions continue to work without changes. The fallback to nodes/proxy ensures backward compatibility.
The API server always has nodes/proxy permissions via system:kubelet-api-admin, so kube-apiserver-to-kubelet communication is unaffected regardless of feature gate state.
Mixed-version clusters are handled gracefully. If a kubelet supports fine-grained authorization but the API server does not (or vice versa), nodes/proxy permissions serve as the fallback.

Verifying the feature is enabled

You can confirm that the feature is active on a given node by checking the kubelet metrics endpoint. Since the metrics endpoint on port 10250 requires authorization, you'll first need to create appropriate RBAC bindings for the pod or ServiceAccount making the request.

Step 1: Create a ServiceAccount and ClusterRole

apiVersion: v1
kind: ServiceAccount
metadata:
 name: kubelet-metrics-checker
 namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: kubelet-metrics-reader
rules:
- apiGroups: [""]
 resources: ["nodes/metrics"]
 verbs: ["get"]

Step 2: Bind the ClusterRole to the ServiceAccount

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
 name: kubelet-metrics-checker
subjects:
- kind: ServiceAccount
 name: kubelet-metrics-checker
 namespace: default
roleRef:
 kind: ClusterRole
 name: kubelet-metrics-reader
 apiGroup: rbac.authorization.k8s.io

Apply both manifests:

kubectl apply -f serviceaccount.yaml
kubectl apply -f clusterrole.yaml
kubectl apply -f clusterrolebinding.yaml

Step 3: Run a pod with the ServiceAccount and check the feature flag

kubectl run kubelet-check \
 --image=curlimages/curl \
 --serviceaccount=kubelet-metrics-checker \
 --restart=Never \
 --rm -it \
 -- sh

Then from within the pod, retrieve the node IP and query the metrics endpoint:

# Get the token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Query the kubelet metrics and filter for the feature gate
curl -sk \
 --header "Authorization: Bearer $TOKEN" \
 https://$NODE_IP:10250/metrics \
 | grep kubernetes_feature_enabled \
 | grep KubeletFineGrainedAuthz

If the feature is enabled, you should see output like:

kubernetes_feature_enabled{name="KubeletFineGrainedAuthz",stage="GA"} 1

Note: Replace $NODE_IP with the IP address of the node you want to check. You can retrieve node IPs with kubectl get nodes -o wide.

The journey from alpha to GA

Release Stage Details v1.32 Alpha Feature gate KubeletFineGrainedAuthz introduced, disabled by default v1.33 Beta Enabled by default; fine-grained checks for /pods, /runningPods/, /healthz, /configz v1.36 GA Feature gate locked to enabled; fine-grained kubelet authorization is always active

What's next?

With fine-grained kubelet authorization now GA, the Kubernetes community can begin recommending and eventually enforcing the use of specific subresources instead of nodes/proxy for monitoring and observability workloads. The urgency of this migration is underscored by research showing that nodes/proxy GET can be abused for unlogged remote code execution via the WebSocket protocol. This risk is present in the default RBAC configurations of dozens of widely deployed Helm charts. Over time, we expect:

Ecosystem adoption: Monitoring tools like Prometheus, Datadog agents, and other DaemonSets can update their default RBAC configurations to use nodes/metrics, nodes/stats, and nodes/pods instead of nodes/proxy. This directly eliminates the WebSocket RCE attack surface for those workloads.
Policy enforcement: Admission controllers and policy engines can flag or reject RBAC bindings that grant nodes/proxy when fine-grained alternatives exist, helping organizations adopt least-privilege access at scale.
Deprecation path: As adoption grows, nodes/proxy may eventually be deprecated for monitoring use cases, further reducing the attack surface of Kubernetes clusters.

Getting involved

This enhancement was driven by SIG Auth and SIG Node. If you are interested in contributing to the security and authorization features of Kubernetes, please join us:

SIG Auth
SIG Node
Slack: #sig-auth and #sig-node
KEP-2862: Fine-Grained Kubelet API Authorization

We look forward to hearing your feedback and experiences with this feature!

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: User Namespaces in Kubernetes are finally GA

Kubernetes Blog - Thu, 04/23/2026 - 14:35

After several years of development, User Namespaces support in Kubernetes reached General Availability (GA) with the v1.36 release. This is a Linux-only feature.

For those of us working on low level container runtimes and rootless technologies, this has been a long awaited milestone. We finally reached the point where "rootless" security isolation can be used for Kubernetes workloads.

This feature also enables a critical pattern: running workloads with privileges and still being confined in the user namespace. When hostUsers: false is set, capabilities like CAP_NET_ADMIN become namespaced, meaning they grant administrative power over container local resources without affecting the host. This effectively enables new use cases that were not possible before without running a fully privileged container.

The Problem with UID 0

A process running as root inside a container is also seen from the kernel as root on the host. If an attacker manages to break out of the container, whether through a kernel vulnerability or a misconfigured mount, they are root on the host.

While there are many security measures in place for running containers, these measures don't change the underlying identity of the process, it still has some "parts" of root.

The engine: ID-mapped mounts

The road to GA wasn't just about the Kubernetes API; it was about making the kernel work for us. In the early stages, one of the biggest blockers was volume ownership. If you mapped a container to a high UID range, the Kubelet had to recursively chown every file in the attached volume so the container could read/write them. For large volumes, this was such an expensive operation that destroyed startup performance.

The key enabler was ID-mapped mounts (introduced in Linux 5.12 and refined in later versions). Instead of rewriting file ownership on disk, the kernel remaps it at mount time.

When a volume is mounted into a Pod with User Namespaces enabled, the kernel performs a transparent translation of the UIDs (user ids) and GIDs (group ids). To the container, the files appear owned by UID 0. On disk, file ownership is unchanged — no chown is needed. This is an O(1) operation, instant and efficient.

Using it in Kubernetes v1.36

Using user namespaces is straightforward: all you need to do is set hostUsers: false in your Pod spec. No changes to your container images, no complex configuration. The interface remains the same one introduced during the Alpha phase. In the spec for a Pod (or PodTemplate), you explicitly opt-out of the host user namespace:

apiVersion: v1
kind: Pod
metadata:
 name: isolated-workload
spec:
 hostUsers: false
 containers:
 - name: app
 image: fedora:42
 securityContext:
 runAsUser: 0

For more details on how user namespaces work in practice and demos of CVEs rated HIGH mitigated, see the previous blog posts: User Namespaces alpha, User Namespaces stateful pods in alpha, User Namespaces beta, and User Namespaces enabled by default.

Getting involved

If you're interested in user namespaces or want to contribute, here are some useful links:

Acknowledgments

This feature has been years in the making: the first KEP was opened 10 years ago by other contributors, and we have been actively working on it for the last 6 years. We'd like to thank everyone who contributed across SIG Node, the container runtimes, and the Linux kernel. Special thanks to the reviewers and early adopters who helped shape the design through multiple alpha and beta cycles.

Categories: CNCF Projects, Kubernetes

SELinux Volume Label Changes goes GA (and likely implications in v1.37)

Kubernetes Blog - Wed, 04/22/2026 - 14:35

If you run Kubernetes on Linux with SELinux in enforcing mode, plan ahead: a future release (anticipated to be v1.37) is expected to turn the SELinuxMount feature gate on by default. This makes volume setup faster for most workloads, but it can break applications that still depend on the older recursive relabeling model in subtle ways (for example, sharing one volume between privileged and unprivileged Pods on the same node). Kubernetes v1.36 is the right release to audit your cluster and fix or opt out of this change.

If your nodes do not use SELinux, nothing changes for you: the kubelet skips the whole SELinux logic when SELinux is unavailable or disabled in the Linux kernel. You can skip this article completely.

This blog builds on the earlier work described in the Kubernetes 1.27: Efficient SELinux Relabeling (Beta) post, where the SELinuxMountReadWriteOncePod feature gate was described. The problem to be addressed remains the same, however, this blog extends that same approach to all volumes.

The problem

Linux systems with Security Enhanced Linux (SELinux) enabled use labels attached to objects (for example, files and network sockets) to make access control decisions. Historically, the container runtime applies SELinux labels to a Pod and all its volumes. Kubernetes only passes the SELinux label from a Pod's securityContext fields to the container runtime.

The container runtime then recursively changes the SELinux label on all files that are visible to the Pod's containers. This can be time-consuming if there are many files on the volume, especially when the volume is on a remote filesystem.

Caution:

If a container uses subPath of a volume, only that subPath of the whole volume is relabeled. This allows two Pods that have two different SELinux labels to use the same volume, as long as they use different subpaths of it.

If a Pod does not have any SELinux label assigned in the Kubernetes API, the container runtime assigns a unique random label, so a process that potentially escapes the container boundary cannot access data of any other container on the host. The container runtime still recursively relabels all Pod volumes with this random SELinux label.

What Kubernetes is improving

Where the stack supports it, the kubelet can mount the volume with -o context=<label> so the kernel applies the correct label for all inodes on that mount without a recursive inode traversal. That path is gated by feature flags and requires, among other things, that the Pod expose enough of an SELinux label (for example spec.securityContext.seLinuxOptions.level) and that the volume driver opts in (for CSI, CSIDriver field spec.seLinuxMount: true).

The project rolled this out in phases:

ReadWriteOncePod volumes were handled under the SELinuxMountReadWriteOncePod feature gate, on by default since v1.28 and GA in v1.36.
Broader coverage was handled under the SELinuxMount flag, paired with the spec.securityContext.seLinuxChangePolicy field on Pods.

If a Pod and its volume meet all of the following conditions, Kubernetes will mount the volume directly with the right SELinux label. Such a mount will happen in a constant time and the container runtime will not need to recursively relabel any files on it. For such a mount to happen:

The operating system must support SELinux. Without SELinux support detected, the kubelet and the container runtime do not do anything with regard to SELinux.
The feature gate SELinuxMountReadWriteOncePod must be enabled. If you're running Kubernetes v1.36, the feature is enabled unconditionally.
The Pod must use a PersistentVolumeClaim with applicable accessModes:
- Either the volume has accessModes: ["ReadWriteOncePod"]
- or the volume can use any other access mode(s), provided that the feature gates SELinuxChangePolicy and SELinuxMount are both enabled and the Pod has spec.securityContext.seLinuxChangePolicy set to nil (default) or as MountOption.
The feature gate SELinuxMount is Beta and disabled by default in Kubernetes 1.36. All other SELinux-related feature gates are now General Availability (GA).

With any of these feature gates disabled, SELinux labels will always be applied by the container runtime via recursively traversing through the volume (or its subPaths).
The Pod must have at least seLinuxOptions.level assigned in its security context or all containers in that Pod must have it set in their container-level security contexts. Kubernetes will read the default user, role and type from the operating system defaults (typically system_u, system_r and container_t).

Without Kubernetes knowing at least the SELinux level, the container runtime will assign a random level after the volumes are mounted. The container runtime will still relabel the volumes recursively in that case.
The volume plugin or the CSI driver responsible for the volume supports mounting with SELinux mount options.

These in-tree volume plugins support mounting with SELinux mount options: fc and iscsi.

CSI drivers that support mounting with SELinux mount options must declare this capability in their CSIDriver instance by setting the seLinuxMount field.

Volumes managed by other volume plugins or CSI drivers that do not set seLinuxMount: true will be recursively relabeled by the container runtime.

The breaking change

The SELinuxMount feature gate changes what volumes can be shared among multiple Pods in a subtle way.

Both of these cases work with recursive relabeling:

Two Pods with different SELinux labels share the same volume, but each of them uses a different subPath to the volume.
A privileged Pod and an unprivileged Pod share the same volume.

The above scenarios will not work with modern, target behavior for Kubernetes mounting when SELinux is active. Instead, one of these Pods will be stuck in ContainerCreating until the other Pod is terminated.

The first case is very niche and hasn't been seen in practice. Although the second case is still quite rare, this setup has been observed in applications. Kubernetes v1.36 offers metrics and events to identify these Pods and allows cluster administrators to opt out of the mount option through the Pod field spec.securityContext.seLinuxChangePolicy.

`seLinuxChangePolicy`

The new Pod field spec.securityContext.seLinuxChangePolicy specifies how the SELinux label is applied to all Pod volumes. In Kubernetes v1.36, this field is part of the stable Pod API.

There are three choices available:

field not set (default): In Kubernetes v1.36, the behavior depends on whether the SELinuxMount feature gate is enabled. By default that feature gate is not enabled, and the SELinux label is applied recursively. If you enable that feature gate in your cluster, and all other conditions are met, labelling will be applied using the mount option.
Recursive: the SELinux label is applied recursively. This opts out from using the mount option.
MountOption: the SELinux label is applied using the mount option, if all other conditions are met. This choice is available only when the SELinuxMount feature gate is enabled.

SELinux warning controller (optional)

Kubernetes v1.36 provides a new controller within the control plane, selinux-warning-controller. This controller runs within the kube-controller-manager controller. To use it, you pass --controllers=*,selinux-warning-controller on the kube-controller-manager command line; you also must not have explicitly overridden the SELinuxChangePolicy feature gate to be disabled.

The controller watches all Pods in the cluster and emits an Event when it finds two Pods that share the same volume in a way that is not compatible with the SELinuxMount feature gate. All such conflicting Pods will receive an event, such as:

SELinuxLabel "system_u:system_r:container_t:s0:c98,c99" conflicts with pod my-other-pod that uses the same volume as this pod with SELinuxLabel "system_u:system_r:container_t:s0:c0,c1". If both pods land on the same node, only one of them may access the volume.

The actual Pod name may be censored when the conflicting Pods run in different namespaces to prevent leaking information across namespace boundaries.

The controller reports such an event even when these Pods don't run on the same node, to make sure all Pods work regardless of the Kubernetes scheduler decision. They could run on the same node next time.

In addition, the controller emits the metric selinux_warning_controller_selinux_volume_conflict that lists all current conflicts among Pods. The metric has labels that identify the conflicting Pods and their SELinux labels, such as:

selinux_warning_controller_selinux_volume_conflict{pod1_name="my-other-pod",pod1_namespace="default",pod1_value="system_u:object_r:container_file_t:s0:c0,c1",pod2_name="my-pod",pod2_namespace="default",pod2_value="system_u:object_r:container_file_t:s0:c0,c2",property="SELinuxLabel"} 1

There is a security consequence from enabling this opt-in controller: it may reveal namespace names, which are always present in the metric. The Kubernetes project assumes only cluster administrators can access kube-controller-manager metrics.

Suggested upgrade path

To ensure a smooth upgrade path from v1.36 to a release with SELinuxMount enabled (anticipated to be v1.37), we suggest you follow these steps:

Enable selinux-warning-controller in the kube-controller-manager.
Check the selinux_warning_controller_selinux_volume_conflict metric. It shows all potential conflicts between Pods. For each conflicting Pod (Deployment, StatefulSet, etc.), either apply the opt-out (set Pod's spec.securityContext.seLinuxChangePolicy: Recursive) or re-architect the application to remove such a conflict. For example, do your Pods really need to run as privileged?
Check the volume_manager_selinux_volume_context_mismatch_warnings_total metric. This metric is emitted by the kubelet when it actually starts a Pod that runs when SELinuxMount is disabled, but such a Pod won't start when SELinuxMount is enabled. This metric lists the number of Pods that will experience a true conflict. Unfortunately, this metric does not expose the exact Pod name as a label. The full Pod name is available only in the selinux_warning_controller_selinux_volume_conflict metric.
Once both metrics have been accounted for, upgrade to a Kubernetes version that has SELinuxMount enabled.

Consider using a MutatingAdmissionPolicy, a mutating webhook, or a policy engine like Kyverno or Gatekeeper to apply the opt-out to all Pods in a namespace or across the entire cluster.

When SELinuxMount is enabled, the kubelet will emit the metric volume_manager_selinux_volume_context_mismatch_errors_total with the number of Pods that could not start because their SELinux label conflicts with an existing Pod that uses the same volume. The exact Pod names should still be available in the selinux_warning_controller_selinux_volume_conflict metric, if the selinux-warning-controller is enabled.

Acknowledgements

If you run into issues, have feedback, or want to contribute, find us on the Kubernetes Slack in #sig-node and #sig-storage or join a SIG Node or SIG Storage meetings.

Categories: CNCF Projects, Kubernetes

SELinux Volume Label Changes goes GA (and likely implications in v1.37)

Kubernetes Blog - Wed, 04/22/2026 - 14:35

If you run Kubernetes on Linux with SELinux in enforcing mode, plan ahead: a future release (anticipated to be v1.37) is expected to turn the SELinuxMount feature gate on by default. This makes volume setup faster for most workloads, but it can break applications that still depend on the older recursive relabeling model in subtle ways (for example, sharing one volume between privileged and unprivileged Pods on the same node). Kubernetes v1.36 is the right release to audit your cluster and fix or opt out of this change.

If your nodes do not use SELinux, nothing changes for you: the kubelet skips the whole SELinux logic when SELinux is unavailable or disabled in the Linux kernel. You can skip this article completely.

This blog builds on the earlier work described in the Kubernetes 1.27: Efficient SELinux Relabeling (Beta) post, where the SELinuxMountReadWriteOncePod feature gate was described. The problem to be addressed remains the same, however, this blog extends that same approach to all volumes.

The problem

Linux systems with Security Enhanced Linux (SELinux) enabled use labels attached to objects (for example, files and network sockets) to make access control decisions. Historically, the container runtime applies SELinux labels to a Pod and all its volumes. Kubernetes only passes the SELinux label from a Pod's securityContext fields to the container runtime.

The container runtime then recursively changes the SELinux label on all files that are visible to the Pod's containers. This can be time-consuming if there are many files on the volume, especially when the volume is on a remote filesystem.

Caution:

If a container uses subPath of a volume, only that subPath of the whole volume is relabeled. This allows two Pods that have two different SELinux labels to use the same volume, as long as they use different subpaths of it.

If a Pod does not have any SELinux label assigned in the Kubernetes API, the container runtime assigns a unique random label, so a process that potentially escapes the container boundary cannot access data of any other container on the host. The container runtime still recursively relabels all Pod volumes with this random SELinux label.

What Kubernetes is improving

Where the stack supports it, the kubelet can mount the volume with -o context=<label> so the kernel applies the correct label for all inodes on that mount without a recursive inode traversal. That path is gated by feature flags and requires, among other things, that the Pod expose enough of an SELinux label (for example spec.securityContext.seLinuxOptions.level) and that the volume driver opts in (for CSI, CSIDriver field spec.seLinuxMount: true).

The project rolled this out in phases:

ReadWriteOncePod volumes were handled under the SELinuxMountReadWriteOncePod feature gate, on by default since v1.28 and GA in v1.36.
Broader coverage was handled under the SELinuxMount flag, paired with the spec.securityContext.seLinuxChangePolicy field on Pods.

If a Pod and its volume meet all of the following conditions, Kubernetes will mount the volume directly with the right SELinux label. Such a mount will happen in a constant time and the container runtime will not need to recursively relabel any files on it. For such a mount to happen:

The operating system must support SELinux. Without SELinux support detected, the kubelet and the container runtime do not do anything with regard to SELinux.
The feature gate SELinuxMountReadWriteOncePod must be enabled. If you're running Kubernetes v1.36, the feature is enabled unconditionally.
The Pod must use a PersistentVolumeClaim with applicable accessModes:
- Either the volume has accessModes: ["ReadWriteOncePod"]
- or the volume can use any other access mode(s), provided that the feature gates SELinuxChangePolicy and SELinuxMount are both enabled and the Pod has spec.securityContext.seLinuxChangePolicy set to nil (default) or as MountOption.
The feature gate SELinuxMount is Beta and disabled by default in Kubernetes 1.36. All other SELinux-related feature gates are now General Availability (GA).

With any of these feature gates disabled, SELinux labels will always be applied by the container runtime via recursively traversing through the volume (or its subPaths).
The Pod must have at least seLinuxOptions.level assigned in its security context or all containers in that Pod must have it set in their container-level security contexts. Kubernetes will read the default user, role and type from the operating system defaults (typically system_u, system_r and container_t).

Without Kubernetes knowing at least the SELinux level, the container runtime will assign a random level after the volumes are mounted. The container runtime will still relabel the volumes recursively in that case.
The volume plugin or the CSI driver responsible for the volume supports mounting with SELinux mount options.

These in-tree volume plugins support mounting with SELinux mount options: fc and iscsi.

CSI drivers that support mounting with SELinux mount options must declare this capability in their CSIDriver instance by setting the seLinuxMount field.

Volumes managed by other volume plugins or CSI drivers that do not set seLinuxMount: true will be recursively relabeled by the container runtime.

The breaking change

The SELinuxMount feature gate changes what volumes can be shared among multiple Pods in a subtle way.

Both of these cases work with recursive relabeling:

Two Pods with different SELinux labels share the same volume, but each of them uses a different subPath to the volume.
A privileged Pod and an unprivileged Pod share the same volume.

The above scenarios will not work with modern, target behavior for Kubernetes mounting when SELinux is active. Instead, one of these Pods will be stuck in ContainerCreating until the other Pod is terminated.

The first case is very niche and hasn't been seen in practice. Although the second case is still quite rare, this setup has been observed in applications. Kubernetes v1.36 offers metrics and events to identify these Pods and allows cluster administrators to opt out of the mount option through the Pod field spec.securityContext.seLinuxChangePolicy.

`seLinuxChangePolicy`

The new Pod field spec.securityContext.seLinuxChangePolicy specifies how the SELinux label is applied to all Pod volumes. In Kubernetes v1.36, this field is part of the stable Pod API.

There are three choices available:

field not set (default): In Kubernetes v1.36, the behavior depends on whether the SELinuxMount feature gate is enabled. By default that feature gate is not enabled, and the SELinux label is applied recursively. If you enable that feature gate in your cluster, and all other conditions are met, labelling will be applied using the mount option.
Recursive: the SELinux label is applied recursively. This opts out from using the mount option.
MountOption: the SELinux label is applied using the mount option, if all other conditions are met. This choice is available only when the SELinuxMount feature gate is enabled.

SELinux warning controller (optional)

Kubernetes v1.36 provides a new controller within the control plane, selinux-warning-controller. This controller runs within the kube-controller-manager controller. To use it, you pass --controllers=*,selinux-warning-controller on the kube-controller-manager command line; you also must not have explicitly overridden the SELinuxChangePolicy feature gate to be disabled.

The controller watches all Pods in the cluster and emits an Event when it finds two Pods that share the same volume in a way that is not compatible with the SELinuxMount feature gate. All such conflicting Pods will receive an event, such as:

SELinuxLabel "system_u:system_r:container_t:s0:c98,c99" conflicts with pod my-other-pod that uses the same volume as this pod with SELinuxLabel "system_u:system_r:container_t:s0:c0,c1". If both pods land on the same node, only one of them may access the volume.

The actual Pod name may be censored when the conflicting Pods run in different namespaces to prevent leaking information across namespace boundaries.

The controller reports such an event even when these Pods don't run on the same node, to make sure all Pods work regardless of the Kubernetes scheduler decision. They could run on the same node next time.

In addition, the controller emits the metric selinux_warning_controller_selinux_volume_conflict that lists all current conflicts among Pods. The metric has labels that identify the conflicting Pods and their SELinux labels, such as:

selinux_warning_controller_selinux_volume_conflict{pod1_name="my-other-pod",pod1_namespace="default",pod1_value="system_u:object_r:container_file_t:s0:c0,c1",pod2_name="my-pod",pod2_namespace="default",pod2_value="system_u:object_r:container_file_t:s0:c0,c2",property="SELinuxLabel"} 1

There is a security consequence from enabling this opt-in controller: it may reveal namespace names, which are always present in the metric. The Kubernetes project assumes only cluster administrators can access kube-controller-manager metrics.

Suggested upgrade path

To ensure a smooth upgrade path from v1.36 to a release with SELinuxMount enabled (anticipated to be v1.37), we suggest you follow these steps:

Enable selinux-warning-controller in the kube-controller-manager.
Check the selinux_warning_controller_selinux_volume_conflict metric. It shows all potential conflicts between Pods. For each conflicting Pod (Deployment, StatefulSet, etc.), either apply the opt-out (set Pod's spec.securityContext.seLinuxChangePolicy: Recursive) or re-architect the application to remove such a conflict. For example, do your Pods really need to run as privileged?
Check the volume_manager_selinux_volume_context_mismatch_warnings_total metric. This metric is emitted by the kubelet when it actually starts a Pod that runs when SELinuxMount is disabled, but such a Pod won't start when SELinuxMount is enabled. This metric lists the number of Pods that will experience a true conflict. Unfortunately, this metric does not expose the exact Pod name as a label. The full Pod name is available only in the selinux_warning_controller_selinux_volume_conflict metric.
Once both metrics have been accounted for, upgrade to a Kubernetes version that has SELinuxMount enabled.

Consider using a MutatingAdmissionPolicy, a mutating webhook, or a policy engine like Kyverno or Gatekeeper to apply the opt-out to all Pods in a namespace or across the entire cluster.

When SELinuxMount is enabled, the kubelet will emit the metric volume_manager_selinux_volume_context_mismatch_errors_total with the number of Pods that could not start because their SELinux label conflicts with an existing Pod that uses the same volume. The exact Pod names should still be available in the selinux_warning_controller_selinux_volume_conflict metric, if the selinux-warning-controller is enabled.

Acknowledgements

If you run into issues, have feedback, or want to contribute, find us on the Kubernetes Slack in #sig-node and #sig-storage or join a SIG Node or SIG Storage meetings.

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: ハル (Haru)

Kubernetes Blog - Tue, 04/21/2026 - 20:00

Editors: Chad M. Crowell, Kirti Goyal, Sophia Ugochukwu, Swathi Rao, Utkarsh Umre

Similar to previous releases, the release of Kubernetes v1.36 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 70 enhancements. Of those enhancements, 18 have graduated to Stable, 25 are entering Beta, and 25 have graduated to Alpha.

There are also some deprecations and removals in this release; make sure to read about those.

Release theme and logo

We open 2026 with Kubernetes v1.36, a release that arrives as the season turns and the light shifts on the mountain. ハル (Haru) is a sound in Japanese that carries many meanings; among those we hold closest are 春 (spring), 晴れ (hare, clear skies), and 遥か (haruka, far-off, distant). A season, a sky, and a horizon. You will find all three in what follows.

The logo, created by avocadoneko / Natsuho Ide, draws inspiration from Katsushika Hokusai's Thirty-six Views of Mount Fuji (富嶽三十六景, Fugaku Sanjūrokkei), the same series that gave the world The Great Wave off Kanagawa. Our v1.36 logo reimagines one of the series' most celebrated prints, Fine Wind, Clear Morning (凱風快晴, Gaifū Kaisei), also known as Red Fuji (赤富士, Aka Fuji): the mountain lit red in a summer dawn, bare of snow after the long thaw. Thirty-six views felt like a fitting number to sit with at v1.36, and a reminder that even Hokusai didn't stop there.1 Keeping watch over the scene is the Kubernetes helm, set into the sky alongside the mountain.

At the foot of Fuji sit Stella (left) and Nacho (right), two cats with the Kubernetes helm on their collars, standing in for the role of komainu, the paired lion-dog guardians that watch over Japanese shrines. Paired, because nothing is guarded alone. Stella and Nacho stand in for a very much larger set of paws: the SIGs and working groups, the maintainers and reviewers, the people behind docs, blogs, and translations, the release team, first-time contributors taking their first steps, and lifelong contributors returning season after season. Kubernetes v1.36 is, as always, held up by many hands.

Brushed across Red Fuji in the logo is the calligraphy 晴れに翔け (hare ni kake), "soar into clear skies". It is the first half of a couplet that was too long to fit on the mountain:

晴れに翔け、未来よ明け
hare ni kake, asu yo ake
"Soar into clear skies; toward tomorrow's sunrise."2

That is the wish we carry for this release: to soar into clear skies, for the release itself, for the project, and for everyone who ships it together. The dawn breaking over Red Fuji is not an ending but a passage: this release carries us to the next, and that one to the one after, on toward horizons far beyond what any single view can hold.

1. The series was so popular that Hokusai added ten more prints, bringing the total to forty-six.
2. 未来 means "the future" in its widest sense, not just tomorrow but everything still to come. It is usually read mirai; here it takes the informal reading asu.

Spotlight on key updates

Kubernetes v1.36 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: Fine-grained API authorization

On behalf of Kubernetes SIG Auth and SIG Node, we are pleased to announce the graduation of fine-grained kubelet API authorization to General Availability (GA) in Kubernetes v1.36!

The KubeletFineGrainedAuthz feature gate was introduced as an opt-in alpha feature in Kubernetes v1.32, then graduated to beta (enabled by default) in v1.33. Now, the feature is generally available. This feature enables more precise, least-privilege access control over the kubelet's HTTPS API replacing the need to grant the overly broad nodes/proxy permission for common monitoring and observability use cases.

This work was done as a part of KEP #2862 led by SIG Auth and SIG Node.

Beta: Resource health status

Before the v1.34 release, Kubernetes lacked a native way to report the health of allocated devices, making it difficult to diagnose Pod crashes caused by hardware failures. Building on the initial alpha release in v1.31 which focused on Device Plugins, Kubernetes v1.36 expands this feature by promoting the allocatedResourcesStatus field within the .status for each Pod (to beta). This field provides a unified health reporting mechanism for all specialized hardware.

Users can now run kubectl describe pod to determine if a container's crash loop is due to an Unhealthy or Unknown device status, regardless of whether the hardware was provisioned via traditional plugins or the newer DRA framework. This enhanced visibility allows administrators and automated controllers to quickly identify faulty hardware and streamline the recovery of high-performance workloads.

This work was done as part of KEP #4680 led by SIG Node.

Alpha: Workload Aware Scheduling (WAS) features

Previously, the Kubernetes scheduler and job controllers managed pods as independent units, often leading to fragmented scheduling or resource waste for complex, distributed workloads. Kubernetes v1.36 introduces a comprehensive suite of Workload Aware Scheduling (WAS) features in Alpha, natively integrating the Job controller with a revised Workload API and a new decoupled PodGroup API, to treat related pods as a single logical entity.

Kubernetes v1.35 already supported gang scheduling by requiring a minimum number of pods to be schedulable before any were bound to nodes. v1.36 goes further with a new PodGroup scheduling cycle that evaluates the entire group atomically, either all pods in the group are bound together, or none are.

This work was done across several KEPs (including #4671, #5547, #5832, #5732, and #5710) led by SIG Scheduling and SIG Apps.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.36 release.

Volume group snapshots

After several cycles in beta, VolumeGroupSnapshot support reaches General Availability (GA) in Kubernetes v1.36. This feature allows you to take crash-consistent snapshots across multiple PersistentVolumeClaims simultaneously. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash consistent recovery point.

This work was done as part of KEP #3476 led by SIG Storage.

Mutable volume attach limits

In Kubernetes v1.36, the mutable CSINode allocatable feature graduates to stable. This enhancement allows Container Storage Interface (CSI) drivers to dynamically update the reported maximum number of volumes that a node can handle.

With this update, the kubelet can dynamically update a node's volume limits and capacity information. The kubelet adjusts these limits based on periodic checks or in response to resource exhaustion errors from the CSI driver, without requiring a component restart. This ensures the Kubernetes scheduler maintains an accurate view of storage availability, preventing pod scheduling failures caused by outdated volume limits.

This work was done as part of KEP #4876 led by SIG Storage.

API for external signing of ServiceAccount tokens

In Kubernetes v1.36, the external ServiceAccount token signer feature for service accounts graduates to stable, making it possible to offload token signing to an external system while still integrating cleanly with the Kubernetes API. Clusters can now rely on an external JWT signer for issuing projected service account tokens that follow the standard service account token format, including support for extended expiration when needed. This is especially useful for clusters that already rely on external identity or key management systems, allowing Kubernetes to integrate without duplicating key management inside the control plane.

The kube-apiserver is wired to discover public keys from the external signer, cache them, and validate tokens it did not sign itself, so existing authentication and authorization flows continue to work as expected. Over the alpha and beta phases, the API and configuration for the external signer plugin, path validation, and OIDC discovery were hardened to handle real-world deployments and rotation patterns safely.

With GA in v1.36, external ServiceAccount token signing is now a fully supported option for platforms that centralize identity and signing, simplifying integration with external IAM systems and reducing the need to manage signing keys directly inside the control plane.

This work was done as part of KEP #740 led by SIG Auth.

DRA features graduating to Stable

Part of the Dynamic Resource Allocation (DRA) ecosystem reaches full production maturity in Kubernetes v1.36 as key governance and selection features graduate to Stable. The transition of DRA admin access to GA provides a permanent, secure framework for cluster administrators to access and manage hardware resources globally, while the stabilization of prioritized lists ensures that resource selection logic remains consistent and predictable across all cluster environments.

Now, organizations can confidently deploy mission-critical hardware automation with the guarantee of long-term API stability and backward compatibility. These features empower users to implement sophisticated resource-sharing policies and administrative overrides that are essential for large-scale GPU clusters and multi-tenant AI platforms, marking the completion of the core architectural foundation for next-generation resource management.

This work was done as part of KEPs #5018 and #4816 led by SIG Auth and SIG Scheduling.

Mutating admission policies

Declarative cluster management reaches a new level of sophistication in Kubernetes v1.36 with the graduation of MutatingAdmissionPolicies to Stable. This milestone provides a native, high-performance alternative to traditional webhooks by allowing administrators to define resource mutations directly in the API server using the Common Expression Language (CEL), fully replacing the need for external infrastructure for many common use cases.

Now, cluster operators can modify incoming requests without the latency and operational complexity associated with managing custom admission webhooks. By moving mutation logic into a declarative, versioned policy, organizations can achieve more predictable cluster behavior, reduced network overhead, and a hardened security model with the full guarantee of long-term API stability.

This work was done as part of KEP #3962 led by SIG API Machinery.

Declarative validation of Kubernetes native types with `validation-gen`

The development of custom resources reaches a new level of efficiency in Kubernetes v1.36 as declarative validation (with validation-gen) graduates to Stable. This milestone replaces the manual and often error-prone task of writing complex OpenAPI schemas by allowing developers to define sophisticated validation logic directly within Go struct tags using the Common Expression Language (CEL).

Instead of writing custom validation functions, Kubernetes contributors can now define validation rules using IDL marker comments (such as +k8s:minimum or +k8s:enum) directly within the API type definitions (types.go). The validation-gen tool parses these comments to automatically generate robust API validation code at compile-time. This reduces maintenance overhead and ensures that API validation remains consistent and synchronized with the source code.

This work was done as part of KEP #5073 led by SIG API Machinery.

Removal of gogo protobuf dependency for Kubernetes API types

Security and long-term maintainability for the Kubernetes codebase take a major step forward in Kubernetes v1.36 with the completion of the gogoprotobuf removal. This initiative has eliminated a significant dependency on the unmaintained gogoprotobuf library, which had become a source of potential security vulnerabilities and a blocker for adopting modern Go language features.

Instead of migrating to standard Protobuf generation, which presented compatibility risks for Kubernetes API types, the project opted to fork and internalize the required generation logic within k8s.io/code-generator. This approach successfully eliminates the unmaintained runtime dependencies from the Kubernetes dependency graph while preserving existing API behavior and serialization compatibility. For consumers of Kubernetes API Go types, this change reduces technical debt and prevents accidental misuse with standard protobuf libraries.

This work was done as part of KEP #5589 led by SIG API Machinery.

Node log query

Previously, Kubernetes required cluster administrators to log into nodes via SSH or implement a client-side reader for debugging issues pertaining to control-plane or worker nodes. While certain issues still require direct node access, issues with the kube-proxy or kubelet can be diagnosed by inspecting their logs. Node logs offer cluster administrators a method to view these logs using the kubelet API and kubectl plugin to simplify troubleshooting without logging into nodes, similar to debugging issues related to a pod or container. This method is operating system agnostic and requires the services or nodes to log to /var/log.

As this feature reaches GA in Kubernetes 1.36 after thorough performance validation on production workloads, it is enabled by default on the kubelet through the NodeLogQuery feature gate. In addition, the enableSystemLogQuery kubelet configuration option must also be enabled.

This work was done as a part of KEP #2258 led by SIG Windows.

Support User Namespaces in pods

Container isolation and node security reach a major maturity milestone in Kubernetes v1.36 as support for User Namespaces graduates to Stable. This long-awaited feature provides a critical layer of defense-in-depth by allowing the mapping of a container's root user to a non-privileged user on the host, ensuring that even if a process escapes the container, it possesses no administrative power over the underlying node.

Now, cluster operators can confidently enable this hardened isolation for production workloads to mitigate the impact of container breakout vulnerabilities. By decoupling the container's internal identity from the host's identity, Kubernetes provides a robust, standardized mechanism to protect multi-tenant environments and sensitive infrastructure from unauthorized access, all with the full guarantee of long-term API stability.

This work was done as part of KEP #127 led by SIG Node.

Support PSI based on cgroupv2

Node resource management and observability become more precise in Kubernetes v1.36 as the export of Pressure Stall Information (PSI) metrics graduates to Stable. This feature provides the kubelet with the ability to report "pressure" metrics for CPU, memory, and I/O, offering a more granular view of resource contention than traditional utilization metrics.

Cluster operators and autoscalers can use these metrics to distinguish between a system that is simply busy and one that is actively stalling due to resource exhaustion. By leveraging these signals, users can more accurately tune pod resource requests, improve the reliability of vertical autoscaling, and detect noisy neighbor effects before they lead to application performance degradation or node instability.

This work was done as part of KEP #4205 led by SIG Node.

Volume source: OCI artifact and/or image

The distribution of container data becomes more flexible in Kubernetes v1.36 as OCI volume source support graduates to Stable. This feature moves beyond the traditional requirement of mounting volumes from external storage providers or config maps by allowing the kubelet to pull and mount content directly from any OCI-compliant registry, such as a container image or an artifact repository.

Now, developers and platform engineers can package application data, models, or static assets as OCI artifacts and deliver them to pods using the same registries and versioning workflows they already use for container images. This convergence of image and volume management simplifies CI/CD pipelines, reduces dependency on specialized storage backends for read-only content, and ensures that data remains portable and securely accessible across any environment.

This work was done as part of KEP #4639 led by SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.36 release.

Staleness mitigation for controllers

Staleness in Kubernetes controllers is a problem that affects many controllers and can subtly affect controller behavior. It is usually not until it is too late, when a controller in production has already taken incorrect action, that staleness is found to be an issue due to some underlying assumption made by the controller author. This could lead to conflicting updates or data corruption upon controller reconciliation during times of cache staleness.

We are excited to announce that Kubernetes v1.36 includes new features that help mitigate controller staleness and provide better observability of controller behavior. This prevents reconciliation based on an outdated view of cluster state that can often lead to harmful behavior.

This work was done as part of KEP #5647 led by SIG API Machinery.

IP/CIDR validation improvements

In Kubernetes v1.36, the StrictIPCIDRValidation feature for API IP and CIDR fields graduates to beta, tightening validation to catch malformed addresses and prefixes that previously slipped through. This helps prevent subtle configuration bugs where Services, Pods, NetworkPolicies, or other resources reference invalid IPs, which could otherwise lead to confusing runtime behavior or security surprises.

Controllers are updated to canonicalize IPs they write back into objects and to warn when they encounter bad values that were already stored, so clusters can gradually converge on clean, consistent data. With beta, StrictIPCIDRValidation is ready for wider use, giving operators more reliable guardrails around IP-related configuration as they evolve networks and policies over time.

This work was done as a part of KEP #4858 led by SIG Network.

Separate kubectl user preferences from cluster configs

The .kuberc feature for customizing kubectl user preferences continues to be beta and is enabled by default. The ~/.kube/kuberc file allows users to store aliases, default flags, and other personal settings separately from kubeconfig files, which hold cluster endpoints and credentials. This separation prevents personal preferences from interfering with CI pipelines or shared kubeconfig files, while maintaining a consistent kubectl experience across different clusters and contexts.

In Kubernetes v1.36, .kuberc was expanded with the ability to define policies for credential plugins (allowlists or denylists) to enforce safer authentication practicies. Users can disable this functionality if needed by setting the KUBECTL_KUBERC=false or KUBERC=off environment variables.

This work was done as a part of KEP #3104 led by SIG CLI, with the help from SIG Auth.

Mutable container resources when Job is suspended

In Kubernetes v1.36, the MutablePodResourcesForSuspendedJobs feature graduates to beta and is enabled by default. This update relaxes Job validation to allow updates to container CPU, memory, GPU, and extended resource requests and limits while a Job is suspended.

This capability allows queue controllers and operators to adjust batch workload requirements based on real‑time cluster conditions. For example, a queueing system can suspend incoming Jobs, adjust their resource requirements to match available capacity or quota, and then unsuspend them. The feature strictly limits mutability to suspended Jobs (or Jobs whose pods have been terminated upon suspension) to prevent disruptive changes to actively running pods.

This work was done as a part of KEP #5440 led by SIG Apps.

Constrained impersonation

In Kubernetes v1.36, the ConstrainedImpersonation feature for user impersonation graduates to beta, tightening a historically all‑or‑nothing mechanism into something that can actually follow least‑privilege principles. When this feature is enabled, an impersonator must have two distinct sets of permissions: one to impersonate a given identity, and another to perform specific actions on that identity’s behalf. This prevents support tools, controllers, or node agents from using impersonation to gain broader access than they themselves are allowed, even if their impersonation RBAC is misconfigured. Existing impersonate rules keep working, but the API server prefers the new constrained checks first, making the transition incremental instead of a flag day. With beta in v1.36, ConstrainedImpersonation is tested, documented, and ready for wider adoption by platform teams that rely on impersonation for debugging, proxying, or node‑level controllers.

This work was done as a part of KEP #5284 led by SIG Auth.

DRA features in beta

The Dynamic Resource Allocation (DRA) framework reaches another maturity milestone in Kubernetes v1.36 as several core features graduate to beta and are enabled by default. This transition moves DRA beyond basic allocation by graduating partitionable devices and consumable capacity, allowing for more granular sharing of hardware like GPUs, while device taints and tolerations ensure that specialized resources are only utilized by the appropriate workloads.

Now, users benefit from a much more reliable and observable resource lifecycle through ResourceClaim device status and the ability to ensure device attachment before Pod scheduling. By integrating these features with extended resource support, Kubernetes provides a robust production-ready alternative to the legacy device plugin system, enabling complex AI and HPC workloads to manage hardware with unprecedented precision and operational safety.

This work was done across several KEPs (including #5004, #4817, #5055, #5075, #4815, and #5007) led by SIG Scheduling and SIG Node.

Statusz for Kubernetes components

In Kubernetes v1.36, the ComponentStatusz feature gate for core Kubernetes components graduates to beta, providing a /statusz endpoint (enabled by default) that surfaces real‑time build and version details for each component. This low‑overhead z-page exposes information like start time, uptime, Go version, binary version, emulation version, and minimum compatibility version, so operators and developers can quickly see exactly what is running without digging through logs or configs.

The endpoint offers a human‑readable text view by default, plus a versioned structured API (config.k8s.io/v1beta1) for programmatic access in JSON, YAML, or CBOR via explicit content negotiation. Access is granted to the system:monitoring group, keeping it aligned with existing protections on health and metrics endpoints and avoiding exposure of sensitive data.

With beta, ComponentStatusz is enabled by default across all core control‑plane components and node agents, backed by unit, integration, and end‑to‑end tests so it can be safely used in production for observability and debugging workflows.

This work was done as a part of KEP #4827 led by SIG Instrumentation.

Flagz for Kubernetes components

In Kubernetes v1.36, the ComponentFlagz feature gate for core Kubernetes components graduates to beta, standardizing a /flagz endpoint that exposes the effective command‑line flags each component was started with. This gives cluster operators and developers real‑time, in‑cluster visibility into component configuration, making it much easier to debug unexpected behavior or verify that a flag rollout actually took effect after a restart.

The endpoint supports both a human‑readable text view and a versioned structured API (initially config.k8s.io/v1beta1), so you can either curl it during an incident or wire it into automated tooling once you are ready. Access is granted to the system:monitoring group and sensitive values can be redacted, keeping configuration insight aligned with existing security practices around health and status endpoints.

With beta, ComponentFlagz is now enabled by default and implemented across all core control‑plane components and node agents, backed by unit, integration, and end‑to‑end tests to ensure the endpoint is reliable in production clusters.

This work was done as a part of KEP #4828 led by SIG Instrumentation.

Mixed version proxy (aka unknown version interoperability proxy)

In Kubernetes v1.36, the mixed version proxy feature graduates to beta, building on its alpha introduction in v1.28 to provide safer control-plane upgrades for mixed-version clusters. Each API request can now be routed to the apiserver instance that serves the requested group, version, and resource, reducing 404s and failures due to version skew.

The feature relies on peer-aggregated discovery, so apiservers share information about which resources and versions they expose, then use that data to transparently reroute requests when needed. New metrics on rerouted traffic and proxy behavior help operators understand how often requests are forwarded and to which peers. Together, these changes make it easier to run highly available, mixed-version API control planes in production while performing multi-step or partial control-plane upgrades.

This work was done as a part of KEP #4020 led by SIG API Machinery

Memory QoS with cgroups v2

Kubernetes now enhances memory QoS on Linux cgroup v2 nodes with smarter, tiered memory protection that better aligns kernel controls with pod requests and limits, reducing interference and thrashing for workloads sharing the same node. This iteration also refines how kubelet programs memory.high and memory.min, adds metrics and safeguards to avoid livelocks, and introduces configuration options so cluster operators can tune memory protection behavior for their environments.

This work was done as part of KEP #2570 led by SIG Node.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.36 release.

HPA scale to zero for custom metrics

Until now, the HorizontalPodAutoscaler (HPA) required a minimum of at least one replica to remain active, as it could only calculate scaling needs based on metrics (like CPU or Memory) from running pods. Kubernetes v1.36 continues the development of the HPA scale to zero feature (disabled by default) in Alpha, allowing workloads to scale down to zero replicas specifically when using Object or External metrics.

Now, users can experiment with significantly reducing infrastructure costs by completely idling heavy workloads when no work is pending. While the feature remains behind the HPAScaleToZero feature gate, it enables the HPA to stay active even with zero running pods, automatically scaling the deployment back up as soon as the external metric (e.g., queue length) indicates that new tasks have arrived.

This work was done as part of KEP #2021 led by SIG Autoscaling.

DRA features in Alpha

Historically, the Dynamic Resource Allocation (DRA) framework lacked seamless integration with high-level controllers and provided limited visibility into device-specific metadata or availability. Kubernetes v1.36 introduces a wave of DRA enhancements in Alpha, including native ResourceClaim support for workloads, and DRA native resources to provide the flexibility of DRA to cpu management.

Now, users can leverage the downward API to expose complex resource attributes directly to containers and benefit from improved resource availability visibility for more predictable scheduling. these updates, combined with support for list types in device attributes, transform DRA from a low-level primitive into a robust system capable of handling the sophisticated networking and compute requirements of modern AI and high-performance computing (HPC) stacks.

This work was done across several KEPs (including #5729, #5304, #5517, #5677, and #5491) led by SIG Scheduling and SIG Node.

Native histogram support for Kubernetes metrics

High-resolution monitoring reaches a new milestone in Kubernetes v1.36 with the introduction of native histogram support in Alpha. While classical Prometheus histograms relied on static, pre-defined buckets that often forced a compromise between data accuracy and memory usage, this update allows the control plane to export sparse histograms that dynamically adjust their resolution based on real-time data.

Now, cluster operators can capture precise latency distributions for the kube-apiserver and other core components without the overhead of manual bucket management. This architectural shift ensures more reliable SLIs and SLOs, providing high-fidelity heatmaps that remain accurate even during the most unpredictable workload spikes.

This work was done as part of KEP #5808 led by SIG Instrumentation.

Manifest based admission control config

Managing admission controllers moves toward a more declarative and consistent model in Kubernetes v1.36 with the introduction of manifest-based admission control configuration in Alpha. This change addresses the long-standing challenge of configuring admission plugins through disparate command-line flags or separate, complex config files by allowing administrators to define the desired state of admission control directly through a structured manifest.

Now, cluster operators can manage admission plugin settings with the same versioned, declarative workflows used for other Kubernetes objects, significantly reducing the risk of configuration drift and manual errors during cluster upgrades. By centralizing these configurations into a unified manifest, the kube-apiserver becomes easier to audit and automate, paving the way for more secure and reproducible cluster deployments.

This work was done as part of KEP #5793 led by SIG API Machinery.

CRI list streaming

With the introduction of CRI list streaming in Alpha, Kubernetes v1.36 uses new internal streaming operations. This enhancement addresses the memory pressure and latency spikes often seen on large-scale nodes by replacing traditional, monolithic List requests between the kubelet and the container runtime with a more efficient server-side streaming RPC.

Now, instead of waiting for a single, massive response containing all container or image data, the kubelet can process results incrementally as they are streamed. This shift significantly reduces the peak memory footprint of the kubelet and improves responsiveness on high-density nodes, ensuring that cluster management remains fluid even as the number of containers per node continues to grow.

This work was done as part of KEP #5825 led by SIG Node.

Other notable changes

Ingress NGINX retirement

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee have retired Ingress NGINX on March 24, 2026. Since that date, there have been no further releases, no bugfixes, and no updates to resolve any security vulnerabilities discovered. Existing deployments of Ingress NGINX will continue to function, and installation artifacts like Helm charts and container images will remain available.

For full details, see the official retirement announcement.

Faster SELinux labelling for volumes (GA)

Kubernetes v1.36 makes the SELinux volume mounting improvement generally available. This change replaced recursive file relabeling with mount -o context=XYZ option, applying the correct SELinux label to the entire volume at mount time. It brings more consistent performance and reduces Pod startup delays on SELinux-enforcing systems.

This feature was introduced as beta in v1.28 for ReadWriteOncePod volumes. In v1.32, it gained metrics and an opt-out option (securityContext.seLinuxChangePolicy: Recursive) to help catch conflicts. Now in v1.36, it reaches Stable and defaults to all volumes, with Pods or CSIDrivers opting in via spec.seLinuxMount.

However, we expect this feature to create the risk of breaking changes in the future Kubernetes releases, potentially due to sharing one volume between privileged and unprivileged Pods on the same node.

Developers have the responsibility of setting the seLinuxChangePolicy field and SELinux volume labels on Pods. Regardless of whether they are writing a Deployment, StatefulSet, DaemonSet or even a custom resource that includes a Pod template, being careless with these settings can lead to a range of problems such as Pods not starting up correctly when Pods share a volume.

Kubernetes v1.36 is the ideal release to audit your clusters. To learn more, check out SELinux Volume Label Changes goes GA (and likely implications in v1.37) blog.

For more details on this enhancement, refer to KEP-1710: Speed up recursive SELinux label change.

Graduations, deprecations, and removals in v1.36

Graduations to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 18 enhancements promoted to stable:

Deprecations removals, and community updates

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.36 includes a couple of deprecations.

Deprecation of Service .spec.externalIPs

With this release, the externalIPs field in Service spec is deprecated. This means the functionality exists, but will no longer function in a future version of Kubernetes. You should plan to migrate if you currently rely on that field. This field has been a known security headache for years, enabling man-in-the-middle attacks on your cluster traffic, as documented in CVE-2020-8554. From Kubernetes v1.36 and onwards, you will see deprecation warnings when using it, with full removal planned for v1.43.

If your Services still lean on externalIPs, consider using LoadBalancer services for cloud-managed ingress, NodePort for simple port exposure, or Gateway API for a more flexible and secure way to handle external traffic.

For more details on this field and its deprecation, refer to External IPs or read KEP-5707: Deprecate service.spec.externalIPs.

Removal of the `gitRepo` volume driver

The gitRepo volume type has been deprecated since v1.11. For Kubernetes v1.36, the gitRepo volume plugin is permanently disabled and cannot be turned back on. This change protects clusters from a critical security issue where using gitRepo could let an attacker run code as root on the node.

Although gitRepo has been deprecated for years and better alternatives have been recommended, it was still technically possible to use it in previous releases. From v1.36 onward, that path is closed for good, so any existing workloads depending on gitRepo will need to migrate to supported approaches such as init containers or external git-sync style tools.

For more details on this removal, refer to KEP-5040: Remove gitRepo volume driver

Release notes

Check out the full details of the Kubernetes v1.36 release in our release notes.

Availability

Kubernetes v1.36 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.36 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.36 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Ryota Sawada, for guiding us through a successful release cycle, for his hands-on approach to solving challenges, and for bringing the energy and care that drives our community forward.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.36 release cycle, which spanned 15 weeks from 12th January 2026 to 22nd April 2026, Kubernetes received contributions from as many as 106 different companies and 491 individuals. In the wider cloud native ecosystem, the figure goes up to 370 companies, counting 2235 total contributors.

Note that “contribution” counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs. If you are interested in contributing, visit Getting Started on our contributor website.

Source for this data:

Events Update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

April 2026

KCD - Kubernetes Community Days: Guadalajara: April 18, 2026 | Guadalajara, Mexico

May 2026

KCD - Kubernetes Community Days: Toronto: May 13, 2026 | Toronto, Canada
KCD - Kubernetes Community Days: Texas: May 15, 2026 | Austin, USA
KCD - Kubernetes Community Days: Istanbul: May 15, 2026 | Istanbul, Turkey
KCD - Kubernetes Community Days: Helsinki: May 20, 2026 | Helsinki, Finland
KCD - Kubernetes Community Days: Czech & Slovak: May 21, 2026 | Prague, Czechia

June 2026

KCD - Kubernetes Community Days: New York: June 10, 2026 | New York, USA
KubeCon + CloudNativeCon India 2026: June 18-19, 2026 | Mumbai, India

July 2026

KubeCon + CloudNativeCon Japan 2026: July 29-30, 2026 | Yokohama, Japan

September 2026

KubeCon + CloudNativeCon China 2026: September 8-9, 2026 | Shanghai, China

October 2026

KCD - Kubernetes Community Days: UK: Oct 19, 2026 | Edinburgh, UK

November 2026

KCD - Kubernetes Community Days: Porto: Nov 19, 2026 | Porto, Portugal
KubeCon + CloudNativeCon North America 2026: Nov 9-12, 2026 | Salt Lake, USA

You can find the latest event details here.

Upcoming Release Webinar

Join members of the Kubernetes v1.36 Release Team on Wednesday, May 20th 2026 at 4:00 PM (UTC) to learn about the release highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Follow us on Bluesky @kubernetes.io for the latest updates
Join the community discussion on Discuss
Join the community on Slack
Post questions (or answer questions) on Stack Overflow
Share your Kubernetes story
Read more about what’s happening with Kubernetes on the blog
Learn more about the Kubernetes Release Team

Categories: CNCF Projects, Kubernetes

Kubernetes v1.36: ハル (Haru)

Kubernetes Blog - Tue, 04/21/2026 - 20:00

Editors: Chad M. Crowell, Kirti Goyal, Sophia Ugochukwu, Swathi Rao, Utkarsh Umre

Similar to previous releases, the release of Kubernetes v1.36 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 70 enhancements. Of those enhancements, 18 have graduated to Stable, 25 are entering Beta, and 25 have graduated to Alpha.

There are also some deprecations and removals in this release; make sure to read about those.

Release theme and logo

We open 2026 with Kubernetes v1.36, a release that arrives as the season turns and the light shifts on the mountain. ハル (Haru) is a sound in Japanese that carries many meanings; among those we hold closest are 春 (spring), 晴れ (hare, clear skies), and 遥か (haruka, far-off, distant). A season, a sky, and a horizon. You will find all three in what follows.

The logo, created by avocadoneko / Natsuho Ide, draws inspiration from Katsushika Hokusai's Thirty-six Views of Mount Fuji (富嶽三十六景, Fugaku Sanjūrokkei), the same series that gave the world The Great Wave off Kanagawa. Our v1.36 logo reimagines one of the series' most celebrated prints, Fine Wind, Clear Morning (凱風快晴, Gaifū Kaisei), also known as Red Fuji (赤富士, Aka Fuji): the mountain lit red in a summer dawn, bare of snow after the long thaw. Thirty-six views felt like a fitting number to sit with at v1.36, and a reminder that even Hokusai didn't stop there.1 Keeping watch over the scene is the Kubernetes helm, set into the sky alongside the mountain.

At the foot of Fuji sit Stella (left) and Nacho (right), two cats with the Kubernetes helm on their collars, standing in for the role of komainu, the paired lion-dog guardians that watch over Japanese shrines. Paired, because nothing is guarded alone. Stella and Nacho stand in for a very much larger set of paws: the SIGs and working groups, the maintainers and reviewers, the people behind docs, blogs, and translations, the release team, first-time contributors taking their first steps, and lifelong contributors returning season after season. Kubernetes v1.36 is, as always, held up by many hands.

Brushed across Red Fuji in the logo is the calligraphy 晴れに翔け (hare ni kake), "soar into clear skies". It is the first half of a couplet that was too long to fit on the mountain:

晴れに翔け、未来よ明け
hare ni kake, asu yo ake
"Soar into clear skies; toward tomorrow's sunrise."2

That is the wish we carry for this release: to soar into clear skies, for the release itself, for the project, and for everyone who ships it together. The dawn breaking over Red Fuji is not an ending but a passage: this release carries us to the next, and that one to the one after, on toward horizons far beyond what any single view can hold.

1. The series was so popular that Hokusai added ten more prints, bringing the total to forty-six.
2. 未来 means "the future" in its widest sense, not just tomorrow but everything still to come. It is usually read mirai; here it takes the informal reading asu.

Spotlight on key updates

Kubernetes v1.36 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: Fine-grained API authorization

On behalf of Kubernetes SIG Auth and SIG Node, we are pleased to announce the graduation of fine-grained kubelet API authorization to General Availability (GA) in Kubernetes v1.36!

The KubeletFineGrainedAuthz feature gate was introduced as an opt-in alpha feature in Kubernetes v1.32, then graduated to beta (enabled by default) in v1.33. Now, the feature is generally available. This feature enables more precise, least-privilege access control over the kubelet's HTTPS API replacing the need to grant the overly broad nodes/proxy permission for common monitoring and observability use cases.

This work was done as a part of KEP #2862 led by SIG Auth and SIG Node.

Beta: Resource health status

Before the v1.34 release, Kubernetes lacked a native way to report the health of allocated devices, making it difficult to diagnose Pod crashes caused by hardware failures. Building on the initial alpha release in v1.31 which focused on Device Plugins, Kubernetes v1.36 expands this feature by promoting the allocatedResourcesStatus field within the .status for each Pod (to beta). This field provides a unified health reporting mechanism for all specialized hardware.

Users can now run kubectl describe pod to determine if a container's crash loop is due to an Unhealthy or Unknown device status, regardless of whether the hardware was provisioned via traditional plugins or the newer DRA framework. This enhanced visibility allows administrators and automated controllers to quickly identify faulty hardware and streamline the recovery of high-performance workloads.

This work was done as part of KEP #4680 led by SIG Node.

Alpha: Workload Aware Scheduling (WAS) features

Previously, the Kubernetes scheduler and job controllers managed pods as independent units, often leading to fragmented scheduling or resource waste for complex, distributed workloads. Kubernetes v1.36 introduces a comprehensive suite of Workload Aware Scheduling (WAS) features in Alpha, natively integrating the Job controller with a revised Workload API and a new decoupled PodGroup API, to treat related pods as a single logical entity.

Kubernetes v1.35 already supported gang scheduling by requiring a minimum number of pods to be schedulable before any were bound to nodes. v1.36 goes further with a new PodGroup scheduling cycle that evaluates the entire group atomically, either all pods in the group are bound together, or none are.

This work was done across several KEPs (including #4671, #5547, #5832, #5732, and #5710) led by SIG Scheduling and SIG Apps.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.36 release.

Volume group snapshots

After several cycles in beta, VolumeGroupSnapshot support reaches General Availability (GA) in Kubernetes v1.36. This feature allows you to take crash-consistent snapshots across multiple PersistentVolumeClaims simultaneously. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash consistent recovery point.

This work was done as part of KEP #3476 led by SIG Storage.

Mutable volume attach limits

In Kubernetes v1.36, the mutable CSINode allocatable feature graduates to stable. This enhancement allows Container Storage Interface (CSI) drivers to dynamically update the reported maximum number of volumes that a node can handle.

With this update, the kubelet can dynamically update a node's volume limits and capacity information. The kubelet adjusts these limits based on periodic checks or in response to resource exhaustion errors from the CSI driver, without requiring a component restart. This ensures the Kubernetes scheduler maintains an accurate view of storage availability, preventing pod scheduling failures caused by outdated volume limits.

This work was done as part of KEP #4876 led by SIG Storage.

API for external signing of ServiceAccount tokens

In Kubernetes v1.36, the external ServiceAccount token signer feature for service accounts graduates to stable, making it possible to offload token signing to an external system while still integrating cleanly with the Kubernetes API. Clusters can now rely on an external JWT signer for issuing projected service account tokens that follow the standard service account token format, including support for extended expiration when needed. This is especially useful for clusters that already rely on external identity or key management systems, allowing Kubernetes to integrate without duplicating key management inside the control plane.

The kube-apiserver is wired to discover public keys from the external signer, cache them, and validate tokens it did not sign itself, so existing authentication and authorization flows continue to work as expected. Over the alpha and beta phases, the API and configuration for the external signer plugin, path validation, and OIDC discovery were hardened to handle real-world deployments and rotation patterns safely.

With GA in v1.36, external ServiceAccount token signing is now a fully supported option for platforms that centralize identity and signing, simplifying integration with external IAM systems and reducing the need to manage signing keys directly inside the control plane.

This work was done as part of KEP #740 led by SIG Auth.

DRA features graduating to Stable

Part of the Dynamic Resource Allocation (DRA) ecosystem reaches full production maturity in Kubernetes v1.36 as key governance and selection features graduate to Stable. The transition of DRA admin access to GA provides a permanent, secure framework for cluster administrators to access and manage hardware resources globally, while the stabilization of prioritized lists ensures that resource selection logic remains consistent and predictable across all cluster environments.

Now, organizations can confidently deploy mission-critical hardware automation with the guarantee of long-term API stability and backward compatibility. These features empower users to implement sophisticated resource-sharing policies and administrative overrides that are essential for large-scale GPU clusters and multi-tenant AI platforms, marking the completion of the core architectural foundation for next-generation resource management.

This work was done as part of KEPs #5018 and #4816 led by SIG Auth and SIG Scheduling.

Mutating admission policies

Declarative cluster management reaches a new level of sophistication in Kubernetes v1.36 with the graduation of MutatingAdmissionPolicies to Stable. This milestone provides a native, high-performance alternative to traditional webhooks by allowing administrators to define resource mutations directly in the API server using the Common Expression Language (CEL), fully replacing the need for external infrastructure for many common use cases.

Now, cluster operators can modify incoming requests without the latency and operational complexity associated with managing custom admission webhooks. By moving mutation logic into a declarative, versioned policy, organizations can achieve more predictable cluster behavior, reduced network overhead, and a hardened security model with the full guarantee of long-term API stability.

This work was done as part of KEP #3962 led by SIG API Machinery.

Declarative validation of Kubernetes native types with `validation-gen`

The development of custom resources reaches a new level of efficiency in Kubernetes v1.36 as declarative validation (with validation-gen) graduates to Stable. This milestone replaces the manual and often error-prone task of writing complex OpenAPI schemas by allowing developers to define sophisticated validation logic directly within Go struct tags using the Common Expression Language (CEL).

Instead of writing custom validation functions, Kubernetes contributors can now define validation rules using IDL marker comments (such as +k8s:minimum or +k8s:enum) directly within the API type definitions (types.go). The validation-gen tool parses these comments to automatically generate robust API validation code at compile-time. This reduces maintenance overhead and ensures that API validation remains consistent and synchronized with the source code.

This work was done as part of KEP #5073 led by SIG API Machinery.

Removal of gogo protobuf dependency for Kubernetes API types

Security and long-term maintainability for the Kubernetes codebase take a major step forward in Kubernetes v1.36 with the completion of the gogoprotobuf removal. This initiative has eliminated a significant dependency on the unmaintained gogoprotobuf library, which had become a source of potential security vulnerabilities and a blocker for adopting modern Go language features.

Instead of migrating to standard Protobuf generation, which presented compatibility risks for Kubernetes API types, the project opted to fork and internalize the required generation logic within k8s.io/code-generator. This approach successfully eliminates the unmaintained runtime dependencies from the Kubernetes dependency graph while preserving existing API behavior and serialization compatibility. For consumers of Kubernetes API Go types, this change reduces technical debt and prevents accidental misuse with standard protobuf libraries.

This work was done as part of KEP #5589 led by SIG API Machinery.

Node log query

Previously, Kubernetes required cluster administrators to log into nodes via SSH or implement a client-side reader for debugging issues pertaining to control-plane or worker nodes. While certain issues still require direct node access, issues with the kube-proxy or kubelet can be diagnosed by inspecting their logs. Node logs offer cluster administrators a method to view these logs using the kubelet API and kubectl plugin to simplify troubleshooting without logging into nodes, similar to debugging issues related to a pod or container. This method is operating system agnostic and requires the services or nodes to log to /var/log.

As this feature reaches GA in Kubernetes 1.36 after thorough performance validation on production workloads, it is enabled by default on the kubelet through the NodeLogQuery feature gate. In addition, the enableSystemLogQuery kubelet configuration option must also be enabled.

This work was done as a part of KEP #2258 led by SIG Windows.

Support User Namespaces in pods

Container isolation and node security reach a major maturity milestone in Kubernetes v1.36 as support for User Namespaces graduates to Stable. This long-awaited feature provides a critical layer of defense-in-depth by allowing the mapping of a container's root user to a non-privileged user on the host, ensuring that even if a process escapes the container, it possesses no administrative power over the underlying node.

Now, cluster operators can confidently enable this hardened isolation for production workloads to mitigate the impact of container breakout vulnerabilities. By decoupling the container's internal identity from the host's identity, Kubernetes provides a robust, standardized mechanism to protect multi-tenant environments and sensitive infrastructure from unauthorized access, all with the full guarantee of long-term API stability.

This work was done as part of KEP #127 led by SIG Node.

Support PSI based on cgroupv2

Node resource management and observability become more precise in Kubernetes v1.36 as the export of Pressure Stall Information (PSI) metrics graduates to Stable. This feature provides the kubelet with the ability to report "pressure" metrics for CPU, memory, and I/O, offering a more granular view of resource contention than traditional utilization metrics.

Cluster operators and autoscalers can use these metrics to distinguish between a system that is simply busy and one that is actively stalling due to resource exhaustion. By leveraging these signals, users can more accurately tune pod resource requests, improve the reliability of vertical autoscaling, and detect noisy neighbor effects before they lead to application performance degradation or node instability.

This work was done as part of KEP #4205 led by SIG Node.

Volume source: OCI artifact and/or image

The distribution of container data becomes more flexible in Kubernetes v1.36 as OCI volume source support graduates to Stable. This feature moves beyond the traditional requirement of mounting volumes from external storage providers or config maps by allowing the kubelet to pull and mount content directly from any OCI-compliant registry, such as a container image or an artifact repository.

Now, developers and platform engineers can package application data, models, or static assets as OCI artifacts and deliver them to pods using the same registries and versioning workflows they already use for container images. This convergence of image and volume management simplifies CI/CD pipelines, reduces dependency on specialized storage backends for read-only content, and ensures that data remains portable and securely accessible across any environment.

This work was done as part of KEP #4639 led by SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.36 release.

Staleness mitigation for controllers

Staleness in Kubernetes controllers is a problem that affects many controllers and can subtly affect controller behavior. It is usually not until it is too late, when a controller in production has already taken incorrect action, that staleness is found to be an issue due to some underlying assumption made by the controller author. This could lead to conflicting updates or data corruption upon controller reconciliation during times of cache staleness.

We are excited to announce that Kubernetes v1.36 includes new features that help mitigate controller staleness and provide better observability of controller behavior. This prevents reconciliation based on an outdated view of cluster state that can often lead to harmful behavior.

This work was done as part of KEP #5647 led by SIG API Machinery.

IP/CIDR validation improvements

In Kubernetes v1.36, the StrictIPCIDRValidation feature for API IP and CIDR fields graduates to beta, tightening validation to catch malformed addresses and prefixes that previously slipped through. This helps prevent subtle configuration bugs where Services, Pods, NetworkPolicies, or other resources reference invalid IPs, which could otherwise lead to confusing runtime behavior or security surprises.

Controllers are updated to canonicalize IPs they write back into objects and to warn when they encounter bad values that were already stored, so clusters can gradually converge on clean, consistent data. With beta, StrictIPCIDRValidation is ready for wider use, giving operators more reliable guardrails around IP-related configuration as they evolve networks and policies over time.

This work was done as a part of KEP #4858 led by SIG Network.

Separate kubectl user preferences from cluster configs

The .kuberc feature for customizing kubectl user preferences continues to be beta and is enabled by default. The ~/.kube/kuberc file allows users to store aliases, default flags, and other personal settings separately from kubeconfig files, which hold cluster endpoints and credentials. This separation prevents personal preferences from interfering with CI pipelines or shared kubeconfig files, while maintaining a consistent kubectl experience across different clusters and contexts.

In Kubernetes v1.36, .kuberc was expanded with the ability to define policies for credential plugins (allowlists or denylists) to enforce safer authentication practicies. Users can disable this functionality if needed by setting the KUBECTL_KUBERC=false or KUBERC=off environment variables.

This work was done as a part of KEP #3104 led by SIG CLI, with the help from SIG Auth.

Mutable container resources when Job is suspended

In Kubernetes v1.36, the MutablePodResourcesForSuspendedJobs feature graduates to beta and is enabled by default. This update relaxes Job validation to allow updates to container CPU, memory, GPU, and extended resource requests and limits while a Job is suspended.

This capability allows queue controllers and operators to adjust batch workload requirements based on real‑time cluster conditions. For example, a queueing system can suspend incoming Jobs, adjust their resource requirements to match available capacity or quota, and then unsuspend them. The feature strictly limits mutability to suspended Jobs (or Jobs whose pods have been terminated upon suspension) to prevent disruptive changes to actively running pods.

This work was done as a part of KEP #5440 led by SIG Apps.

Constrained impersonation

In Kubernetes v1.36, the ConstrainedImpersonation feature for user impersonation graduates to beta, tightening a historically all‑or‑nothing mechanism into something that can actually follow least‑privilege principles. When this feature is enabled, an impersonator must have two distinct sets of permissions: one to impersonate a given identity, and another to perform specific actions on that identity’s behalf. This prevents support tools, controllers, or node agents from using impersonation to gain broader access than they themselves are allowed, even if their impersonation RBAC is misconfigured. Existing impersonate rules keep working, but the API server prefers the new constrained checks first, making the transition incremental instead of a flag day. With beta in v1.36, ConstrainedImpersonation is tested, documented, and ready for wider adoption by platform teams that rely on impersonation for debugging, proxying, or node‑level controllers.

This work was done as a part of KEP #5284 led by SIG Auth.

DRA features in beta

The Dynamic Resource Allocation (DRA) framework reaches another maturity milestone in Kubernetes v1.36 as several core features graduate to beta and are enabled by default. This transition moves DRA beyond basic allocation by graduating partitionable devices and consumable capacity, allowing for more granular sharing of hardware like GPUs, while device taints and tolerations ensure that specialized resources are only utilized by the appropriate workloads.

Now, users benefit from a much more reliable and observable resource lifecycle through ResourceClaim device status and the ability to ensure device attachment before Pod scheduling. By integrating these features with extended resource support, Kubernetes provides a robust production-ready alternative to the legacy device plugin system, enabling complex AI and HPC workloads to manage hardware with unprecedented precision and operational safety.

This work was done across several KEPs (including #5004, #4817, #5055, #5075, #4815, and #5007) led by SIG Scheduling and SIG Node.

Statusz for Kubernetes components

In Kubernetes v1.36, the ComponentStatusz feature gate for core Kubernetes components graduates to beta, providing a /statusz endpoint (enabled by default) that surfaces real‑time build and version details for each component. This low‑overhead z-page exposes information like start time, uptime, Go version, binary version, emulation version, and minimum compatibility version, so operators and developers can quickly see exactly what is running without digging through logs or configs.

The endpoint offers a human‑readable text view by default, plus a versioned structured API (config.k8s.io/v1beta1) for programmatic access in JSON, YAML, or CBOR via explicit content negotiation. Access is granted to the system:monitoring group, keeping it aligned with existing protections on health and metrics endpoints and avoiding exposure of sensitive data.

With beta, ComponentStatusz is enabled by default across all core control‑plane components and node agents, backed by unit, integration, and end‑to‑end tests so it can be safely used in production for observability and debugging workflows.

This work was done as a part of KEP #4827 led by SIG Instrumentation.

Flagz for Kubernetes components

In Kubernetes v1.36, the ComponentFlagz feature gate for core Kubernetes components graduates to beta, standardizing a /flagz endpoint that exposes the effective command‑line flags each component was started with. This gives cluster operators and developers real‑time, in‑cluster visibility into component configuration, making it much easier to debug unexpected behavior or verify that a flag rollout actually took effect after a restart.

The endpoint supports both a human‑readable text view and a versioned structured API (initially config.k8s.io/v1beta1), so you can either curl it during an incident or wire it into automated tooling once you are ready. Access is granted to the system:monitoring group and sensitive values can be redacted, keeping configuration insight aligned with existing security practices around health and status endpoints.

With beta, ComponentFlagz is now enabled by default and implemented across all core control‑plane components and node agents, backed by unit, integration, and end‑to‑end tests to ensure the endpoint is reliable in production clusters.

This work was done as a part of KEP #4828 led by SIG Instrumentation.

Mixed version proxy (aka unknown version interoperability proxy)

In Kubernetes v1.36, the mixed version proxy feature graduates to beta, building on its alpha introduction in v1.28 to provide safer control-plane upgrades for mixed-version clusters. Each API request can now be routed to the apiserver instance that serves the requested group, version, and resource, reducing 404s and failures due to version skew.

The feature relies on peer-aggregated discovery, so apiservers share information about which resources and versions they expose, then use that data to transparently reroute requests when needed. New metrics on rerouted traffic and proxy behavior help operators understand how often requests are forwarded and to which peers. Together, these changes make it easier to run highly available, mixed-version API control planes in production while performing multi-step or partial control-plane upgrades.

This work was done as a part of KEP #4020 led by SIG API Machinery

Memory QoS with cgroups v2

Kubernetes now enhances memory QoS on Linux cgroup v2 nodes with smarter, tiered memory protection that better aligns kernel controls with pod requests and limits, reducing interference and thrashing for workloads sharing the same node. This iteration also refines how kubelet programs memory.high and memory.min, adds metrics and safeguards to avoid livelocks, and introduces configuration options so cluster operators can tune memory protection behavior for their environments.

This work was done as part of KEP #2570 led by SIG Node.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.36 release.

HPA scale to zero for custom metrics

Until now, the HorizontalPodAutoscaler (HPA) required a minimum of at least one replica to remain active, as it could only calculate scaling needs based on metrics (like CPU or Memory) from running pods. Kubernetes v1.36 continues the development of the HPA scale to zero feature (disabled by default) in Alpha, allowing workloads to scale down to zero replicas specifically when using Object or External metrics.

Now, users can experiment with significantly reducing infrastructure costs by completely idling heavy workloads when no work is pending. While the feature remains behind the HPAScaleToZero feature gate, it enables the HPA to stay active even with zero running pods, automatically scaling the deployment back up as soon as the external metric (e.g., queue length) indicates that new tasks have arrived.

This work was done as part of KEP #2021 led by SIG Autoscaling.

DRA features in Alpha

Historically, the Dynamic Resource Allocation (DRA) framework lacked seamless integration with high-level controllers and provided limited visibility into device-specific metadata or availability. Kubernetes v1.36 introduces a wave of DRA enhancements in Alpha, including native ResourceClaim support for workloads, and DRA native resources to provide the flexibility of DRA to cpu management.

Now, users can leverage the downward API to expose complex resource attributes directly to containers and benefit from improved resource availability visibility for more predictable scheduling. these updates, combined with support for list types in device attributes, transform DRA from a low-level primitive into a robust system capable of handling the sophisticated networking and compute requirements of modern AI and high-performance computing (HPC) stacks.

This work was done across several KEPs (including #5729, #5304, #5517, #5677, and #5491) led by SIG Scheduling and SIG Node.

Native histogram support for Kubernetes metrics

High-resolution monitoring reaches a new milestone in Kubernetes v1.36 with the introduction of native histogram support in Alpha. While classical Prometheus histograms relied on static, pre-defined buckets that often forced a compromise between data accuracy and memory usage, this update allows the control plane to export sparse histograms that dynamically adjust their resolution based on real-time data.

Now, cluster operators can capture precise latency distributions for the kube-apiserver and other core components without the overhead of manual bucket management. This architectural shift ensures more reliable SLIs and SLOs, providing high-fidelity heatmaps that remain accurate even during the most unpredictable workload spikes.

This work was done as part of KEP #5808 led by SIG Instrumentation.

Manifest based admission control config

Managing admission controllers moves toward a more declarative and consistent model in Kubernetes v1.36 with the introduction of manifest-based admission control configuration in Alpha. This change addresses the long-standing challenge of configuring admission plugins through disparate command-line flags or separate, complex config files by allowing administrators to define the desired state of admission control directly through a structured manifest.

Now, cluster operators can manage admission plugin settings with the same versioned, declarative workflows used for other Kubernetes objects, significantly reducing the risk of configuration drift and manual errors during cluster upgrades. By centralizing these configurations into a unified manifest, the kube-apiserver becomes easier to audit and automate, paving the way for more secure and reproducible cluster deployments.

This work was done as part of KEP #5793 led by SIG API Machinery.

CRI list streaming

With the introduction of CRI list streaming in Alpha, Kubernetes v1.36 uses new internal streaming operations. This enhancement addresses the memory pressure and latency spikes often seen on large-scale nodes by replacing traditional, monolithic List requests between the kubelet and the container runtime with a more efficient server-side streaming RPC.

Now, instead of waiting for a single, massive response containing all container or image data, the kubelet can process results incrementally as they are streamed. This shift significantly reduces the peak memory footprint of the kubelet and improves responsiveness on high-density nodes, ensuring that cluster management remains fluid even as the number of containers per node continues to grow.

This work was done as part of KEP #5825 led by SIG Node.

Other notable changes

Ingress NGINX retirement

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee have retired Ingress NGINX on March 24, 2026. Since that date, there have been no further releases, no bugfixes, and no updates to resolve any security vulnerabilities discovered. Existing deployments of Ingress NGINX will continue to function, and installation artifacts like Helm charts and container images will remain available.

For full details, see the official retirement announcement.

Faster SELinux labelling for volumes (GA)

Kubernetes v1.36 makes the SELinux volume mounting improvement generally available. This change replaced recursive file relabeling with mount -o context=XYZ option, applying the correct SELinux label to the entire volume at mount time. It brings more consistent performance and reduces Pod startup delays on SELinux-enforcing systems.

This feature was introduced as beta in v1.28 for ReadWriteOncePod volumes. In v1.32, it gained metrics and an opt-out option (securityContext.seLinuxChangePolicy: Recursive) to help catch conflicts. Now in v1.36, it reaches Stable and defaults to all volumes, with Pods or CSIDrivers opting in via spec.seLinuxMount.

However, we expect this feature to create the risk of breaking changes in the future Kubernetes releases, potentially due to sharing one volume between privileged and unprivileged Pods on the same node.

Developers have the responsibility of setting the seLinuxChangePolicy field and SELinux volume labels on Pods. Regardless of whether they are writing a Deployment, StatefulSet, DaemonSet or even a custom resource that includes a Pod template, being careless with these settings can lead to a range of problems such as Pods not starting up correctly when Pods share a volume.

Kubernetes v1.36 is the ideal release to audit your clusters. To learn more, check out SELinux Volume Label Changes goes GA (and likely implications in v1.37) blog.

For more details on this enhancement, refer to KEP-1710: Speed up recursive SELinux label change.

Graduations, deprecations, and removals in v1.36

Graduations to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 18 enhancements promoted to stable:

Deprecations removals, and community updates

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.36 includes a couple of deprecations.

Deprecation of Service .spec.externalIPs

With this release, the externalIPs field in Service spec is deprecated. This means the functionality exists, but will no longer function in a future version of Kubernetes. You should plan to migrate if you currently rely on that field. This field has been a known security headache for years, enabling man-in-the-middle attacks on your cluster traffic, as documented in CVE-2020-8554. From Kubernetes v1.36 and onwards, you will see deprecation warnings when using it, with full removal planned for v1.43.

If your Services still lean on externalIPs, consider using LoadBalancer services for cloud-managed ingress, NodePort for simple port exposure, or Gateway API for a more flexible and secure way to handle external traffic.

For more details on this field and its deprecation, refer to External IPs or read KEP-5707: Deprecate service.spec.externalIPs.

Removal of the `gitRepo` volume driver

The gitRepo volume type has been deprecated since v1.11. For Kubernetes v1.36, the gitRepo volume plugin is permanently disabled and cannot be turned back on. This change protects clusters from a critical security issue where using gitRepo could let an attacker run code as root on the node.

Although gitRepo has been deprecated for years and better alternatives have been recommended, it was still technically possible to use it in previous releases. From v1.36 onward, that path is closed for good, so any existing workloads depending on gitRepo will need to migrate to supported approaches such as init containers or external git-sync style tools.

For more details on this removal, refer to KEP-5040: Remove gitRepo volume driver

Release notes

Check out the full details of the Kubernetes v1.36 release in our release notes.

Availability

Kubernetes v1.36 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.36 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.36 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Ryota Sawada, for guiding us through a successful release cycle, for his hands-on approach to solving challenges, and for bringing the energy and care that drives our community forward.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.36 release cycle, which spanned 15 weeks from 12th January 2026 to 22nd April 2026, Kubernetes received contributions from as many as 106 different companies and 491 individuals. In the wider cloud native ecosystem, the figure goes up to 370 companies, counting 2235 total contributors.

Note that “contribution” counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs. If you are interested in contributing, visit Getting Started on our contributor website.

Source for this data:

Events Update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

April 2026

KCD - Kubernetes Community Days: Guadalajara: April 18, 2026 | Guadalajara, Mexico

May 2026

KCD - Kubernetes Community Days: Toronto: May 13, 2026 | Toronto, Canada
KCD - Kubernetes Community Days: Texas: May 15, 2026 | Austin, USA
KCD - Kubernetes Community Days: Istanbul: May 15, 2026 | Istanbul, Turkey
KCD - Kubernetes Community Days: Helsinki: May 20, 2026 | Helsinki, Finland
KCD - Kubernetes Community Days: Czech & Slovak: May 21, 2026 | Prague, Czechia

June 2026

KCD - Kubernetes Community Days: New York: June 10, 2026 | New York, USA
KubeCon + CloudNativeCon India 2026: June 18-19, 2026 | Mumbai, India

July 2026

KubeCon + CloudNativeCon Japan 2026: July 29-30, 2026 | Yokohama, Japan

September 2026

KubeCon + CloudNativeCon China 2026: September 8-9, 2026 | Shanghai, China

October 2026

KCD - Kubernetes Community Days: UK: Oct 19, 2026 | Edinburgh, UK

November 2026

KCD - Kubernetes Community Days: Porto: Nov 19, 2026 | Porto, Portugal
KubeCon + CloudNativeCon North America 2026: Nov 9-12, 2026 | Salt Lake, USA

You can find the latest event details here.

Upcoming Release Webinar

Join members of the Kubernetes v1.36 Release Team on Wednesday, May 20th 2026 at 4:00 PM (UTC) to learn about the release highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Follow us on Bluesky @kubernetes.io for the latest updates
Join the community discussion on Discuss
Join the community on Slack
Post questions (or answer questions) on Stack Overflow
Share your Kubernetes story
Read more about what’s happening with Kubernetes on the blog
Learn more about the Kubernetes Release Team

Categories: CNCF Projects, Kubernetes

Gateway API v1.5: Moving features to Stable

Kubernetes Blog - Tue, 04/21/2026 - 12:30

The Kubernetes SIG Network community presents the release of Gateway API (v1.5)! Released on February 27, 2026, version 1.5 is our biggest release yet, and concentrates on moving existing Experimental features to Standard (Stable).

The Gateway API v1.5.1 patch release is already available.

The Gateway API v1.5 brings six widely-requested feature promotions to the Standard channel (Gateway API's GA release channel):

ListenerSet
TLSRoute
HTTPRoute CORS Filter
Client Certificate Validation
Certificate Selection for Gateway TLS Origination
ReferenceGrant

Special thanks for Gateway API Contributors for their efforts on this release.

New release process

As of Gateway API v1.5, the project has moved to a release train model, where on a feature freeze date, any features that are ready are shipped in the release.

This applies to both Experimental and Standard, and also applies to documentation -- if the documentation isn't ready to ship, the feature isn't ready to ship.

We are aiming for this to produce a more reliable release cadence (since we are basing our work off the excellent work done by SIG Release on Kubernetes itself). As part of this change, we've also introduced Release Manager and Release Shadow roles to our release team. Many thanks to Flynn (Buoyant) and Beka Modebadze (Google) for all the great work coordinating and filing the rough edges of our release process. They are both going to continue in this role for the next release as well.

New standard features

ListenerSet

Leads: Dave Protasowski, David Jumani

GEP-1713

Why ListenerSet?

Prior to ListenerSet, all listeners had to be specified directly on the Gateway object. While this worked well for simple use cases, it created challenges for more complex or multi-tenant environments:

Platform teams and application teams often needed to coordinate changes to the same Gateway
Safely delegating ownership of individual listeners was difficult
Extending existing Gateways required direct modification of the original resource

ListenerSet addresses these limitations by allowing listeners to be defined independently and then merged onto a target Gateway.

ListenerSets also enable attaching more than 64 listeners to a single, shared Gateway. This is critical for large scale deployments and scenarios with multiple hostnames per listener.

Even though the ListenerSet feature significantly enhances scalability, the listener field in Gateway remains a mandatory requirement and the Gateway must have at least one valid listener.

How it works

A ListenerSet attaches to a Gateway and contributes one or more listeners. The Gateway controller is responsible for merging listeners from the Gateway resource itself and any attached ListenerSet resources.

In this example, a central infrastructure team defines a Gateway with a default HTTP listener, while two different application teams define their own ListenerSet resources in separate namespaces. Both ListenerSets attach to the same Gateway and contribute additional HTTPS listeners.

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
 name: example-gateway
 namespace: infra
spec:
 gatewayClassName: example-gateway-class
 allowedListeners:
 namespaces:
 from: All # A selector lets you fine tune this
 listeners:
 - name: http
 protocol: HTTP
 port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: ListenerSet
metadata:
 name: team-a-listeners
 namespace: team-a
spec:
 parentRef:
 name: example-gateway
 namespace: infra
 listeners:
 - name: https-a
 protocol: HTTPS
 port: 443
 hostname: a.example.com
 tls:
 certificateRefs:
 - name: a-cert
---
apiVersion: gateway.networking.k8s.io/v1
kind: ListenerSet
metadata:
 name: team-b-listeners
 namespace: team-b
spec:
 parentRef:
 name: example-gateway
 namespace: infra
 listeners:
 - name: https-b
 protocol: HTTPS
 port: 443
 hostname: b.example.com
 tls:
 certificateRefs:
 - name: b-cert

TLSRoute

Leads: Rostislav Bobrovsky, Ricardo Pchevuzinske Katz

GEP-2643

The TLSRoute resource allows you to route requests by matching the Server Name Indication (SNI) presented by the client during the TLS handshake and directing the stream to the appropriate Kubernetes backends.

When working with TLSRoute, a Gateway's TLS listener can be configured in one of two modes: Passthrough or Terminate.

If you install Gateway API v1.5 Standard over v1.4 or earlier Experimental, your existing Experimental TLSRoutes will not be usable. This is because they will be stored in the v1alpha2 or v1alpha3 version, which is not included in the v1.5 Standard YAMLs. If this applies to you, either continue using Experimental for v1.5.1 and onward, or you'll need to download and migrate your TLSRoutes to v1, which is present in the Standard YAMLs.

Passthrough mode

The Passthrough mode is designed for strict security requirements. It is ideal for scenarios where traffic must remain encrypted end-to-end until it reaches the destination backend, when the external client and backend need to authenticate directly with each other, or when you can’t store certificates on the Gateway. This configuration is also applicable when an encrypted TCP stream is required instead of standard HTTP traffic.

In this mode, the encrypted byte stream is proxied directly to the destination backend. The Gateway has zero access to private keys or unencrypted data.

The following TLSRoute is attached to a listener that is configured in Passthrough mode. It will match only TLS handshakes with the foo.example.com SNI hostname and apply its routing rules to pass the encrypted TCP stream to the configured backend:

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
 name: example-gateway
spec:
 gatewayClassName: example-gateway-class
 listeners:
 - name: tls-passthrough
 protocol: TLS
 port: 8443
 tls:
 mode: Passthrough
---
apiVersion: gateway.networking.k8s.io/v1
kind: TLSRoute
metadata:
 name: foo-route
spec:
 parentRefs:
 - name: example-gateway
 sectionName: tls-passthrough
 hostnames:
 - "foo.example.com"
 rules:
 - backendRefs:
 - name: foo-svc
 port: 8443

Terminate mode

The Terminate mode provides the convenience of centralized TLS certificate management directly at the Gateway.

In this mode, the TLS session is fully terminated at the Gateway, which then routes the decrypted payload to the destination backend as a plain text TCP stream.

The following TLSRoute is attached to a listener that is configured in Terminate mode. It will match only TLS handshakes with the bar.example.com SNI hostname and apply its routing rules to pass the decrypted TCP stream to the configured backend:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
 name: example-gateway
spec:
 gatewayClassName: example-gateway-class
 listeners:
 - name: tls-terminate
 protocol: TLS
 port: 443
 tls:
 mode: Terminate
 certificateRefs:
 - name: tls-terminate-certificate
---
apiVersion: gateway.networking.k8s.io/v1
kind: TLSRoute
metadata:
 name: bar-route
spec:
 parentRefs:
 - name: example-gateway
 sectionName: tls-terminate
 hostnames:
 - "bar.example.com"
 rules:
 - backendRefs:
 - name: bar-svc
 port: 8080

HTTPRoute CORS filter

Leads: Damian Sawicki, Ricardo Pchevuzinske Katz, Norwin Schnyder, Huabing (Robin) Zhao, LiangLliu,

GEP-1767

Cross-origin resource sharing (CORS) is an HTTP-header based security mechanism that allows (or denies) a web page to access resources from a server on an origin different from the domain that served the web page. See our documentation page for more information. The HTTPRoute resource can be used to configure Cross-Origin Resource Sharing (CORS). The following HTTPRoute allows requests from https://app.example:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
 name: cors
spec:
 parentRefs:
 - name: same-namespace
 rules:
 - matches:
 - path:
 type: PathPrefix
 value: /cors-behavior-creds-false
 backendRefs:
 - name: infra-backend-v1
 port: 8080
 filters:
 - cors:
 allowOrigins:
 - https://app.example
 type: CORS

Instead of specifying a list of specific origins, you can also specify a single wildcard ("*"), which will allow any origin. It is also allowed to use semi-specified origins in the list, where the wildcard appears after the scheme and at the beginning of the hostname, e.g. https://*.bar.com:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
 name: cors
spec:
 parentRefs:
 - name: same-namespace
 rules:
 - matches:
 - path:
 type: PathPrefix
 value: /cors-behavior-creds-false
 backendRefs:
 - name: infra-backend-v1
 port: 8080
 filters:
 - cors:
 allowOrigins:
 - https://www.baz.com
 - https://*.bar.com
 - https://*.foo.com
 type: CORS

HTTPRoute filters allow for the configuration of CORS settings. See a list of supported options below:

allowCredentials: Specifies whether the browser is allowed to include credentials (such as cookies and HTTP authentication) in the CORS request.
allowMethods: The HTTP methods that are allowed for CORS requests.
allowHeaders: The HTTP headers that are allowed for CORS requests.
exposeHeaders: The HTTP headers that are exposed to the client.
maxAge: The maximum time in seconds that the browser should cache the preflight response.

A comprehensive example:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
 name: cors-allow-credentials
spec:
 parentRefs:
 - name: same-namespace
 rules:
 - matches:
 - path:
 type: PathPrefix
 value: /cors-behavior-creds-true
 backendRefs:
 - name: infra-backend-v1
 port: 8080
 filters:
 - cors:
 allowOrigins:
 - "https://www.foo.example.com"
 - "https://*.bar.example.com"
 allowMethods:
 - GET
 - OPTIONS
 allowHeaders:
 - "*"
 exposeHeaders:
 - "x-header-3"
 - "x-header-4"
 allowCredentials: true
 maxAge: 3600
 type: CORS

Gateway client certificate validation

Leads: Arko Dasgupta, Katarzyna Łach, Norwin Schnyder

GEP-91

Client certificate validation, also known as mutual TLS (mTLS), is a security mechanism where the client provides a certificate to the server to prove its identity. This is in contrast to standard TLS, where only the server presents a certificate to the client. In the context of the Gateway API, frontend mTLS means that the Gateway validates the client's certificate before allowing the connection to proceed to a backend service. This validation is done by checking the client certificate against a set of trusted Certificate Authorities (CAs) configured on the Gateway. The API was shaped this way to address a critical security vulnerability related to connection reuse and still provide some level of flexibility.

Configuration overview

Client validation is defined using the frontendValidation struct, which specifies how the Gateway should verify the client's identity.

caCertificateRefs: A list of references to Kubernetes objects (typically ConfigMap's) containing PEM-encoded CA certificate bundles used as trust anchors to validate the client's certificate.
mode: Defines the validation behavior.
- AllowValidOnly (Default): The Gateway accepts connections only if the client presents a valid certificate that passes validation against the specified CA bundle.
- AllowInsecureFallback: The Gateway accepts connections even if the client certificate is missing or fails verification. This mode typically delegates authorization to the backend and should be used with caution.

Validation can be applied globally to the Gateway or overridden for specific ports:

Default Configuration: This configuration applies to all HTTPS listeners on the Gateway, unless a per-port override is defined.
Per-Port Configuration: This allows for fine-grained control, overriding the default configuration for all listeners handling traffic on a specific port.

Example:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
 name: client-validation-basic
spec:
 gatewayClassName: acme-lb
 tls:
 frontend:
 default:
 validation:
 caCertificateRefs:
 - kind: ConfigMap
 group: ""
 name: foo-example-com-ca-cert
 perPort:
 - port: 8443
 tls:
 validation:
 caCertificateRefs:
 - kind: ConfigMap
 group: ""
 name: foo-example-com-ca-cert
 mode: "AllowInsecureFallback"
 listeners:
 - name: foo-https
 protocol: HTTPS
 port: 443
 hostname: foo.example.com
 tls:
 certificateRefs:
 - kind: Secret
 group: ""
 name: foo-example-com-cert
 - name: bar-https
 protocol: HTTPS
 port: 8443
 hostname: bar.example.com
 tls:
 certificateRefs:
 - kind: Secret
 group: ""
 name: bar-example-com-cert

Certificate selection for Gateway TLS origination

Leads: Marcin Kosieradzki, Rob Scott, Norwin Schnyder, Lior Lieberman, Katarzyna Lach

GEP-3155

Mutual TLS (mTLS) for upstream connections requires the Gateway to present a client certificate to the backend, in addition to verifying the backend's certificate. This ensures that the backend only accepts connections from authorized Gateways.

Gateway’s client certificate configuration

To configure the client certificate that the Gateway uses when connecting to backends, use the tls.backend.clientCertificateRef field in the Gateway resource. This configuration applies to the Gateway as a client for all upstream connections managed by that Gateway.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
 name: backend-tls
spec:
 gatewayClassName: acme-lb
 tls:
 backend:
 clientCertificateRef:
 kind: Secret
 group: "" # empty string means core API group
 name: foo-example-cert
 listeners:
 - name: foo-http
 protocol: HTTP
 port: 80
 hostname: foo.example.com

ReferenceGrant promoted to v1

The ReferenceGrant resource has not changed in more than a year, and we do not expect it to change further, so its version has been bumped to v1, and it is now officially in the Standard channel, and abides by the GA API contract (that is, no breaking changes).

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.30 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow the Getting Started Guide.

As of this writing, seven implementations are already fully conformant with Gateway API v1.5. In alphabetical order:

Get involved

Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

Check out the user guides to see what use-cases can be addressed.
Try out one of the existing Gateway controllers.
Or join us in the community and help us build the future of Gateway API together!

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.

This article was edited in April 2026 to correct the release date for Gateway API 1.5.0.

Categories: CNCF Projects, Kubernetes

You are here

The problem with client-side sharding

How it works

Using sharded watches in controllers

Verifying server support

Getting involved

The Motivation: Escaping the "Handwritten" Technical Debt

Enter validation-gen

A Comprehensive Suite of +k8s: Tags

Advanced Capabilities: "Ambient Ratcheting"

Scaling API Reviews with kube-api-linter

What's next?

Getting involved

Acknowledgments

The gap we're closing

How it works

Protecting what couldn't be protected before

A few things to know

Try it out

How to get involved

The gap we're closing

How it works

Protecting what couldn't be protected before

A few things to know

Try it out

How to get involved

The gap we're closing

How it works

Protecting what couldn't be protected before

A few things to know

Try it out

How to get involved

Why do we need pod-level resource managers?

Introducing pod-level resource managers

Real-world use cases

1. Tightly-coupled database (Topology manager's pod scope)

2. ML workload with infrastructure sidecars (Topology manager's container scope)

CPU quotas (CFS) and isolation

How to enable Pod-Level Resource Managers

Observability

Current limitations and caveats

Getting started and providing feedback

Why do we need pod-level resource managers?

Introducing pod-level resource managers

Real-world use cases

1. Tightly-coupled database (Topology manager's pod scope)

2. ML workload with infrastructure sidecars (Topology manager's container scope)

CPU quotas (CFS) and isolation

How to enable Pod-Level Resource Managers

Observability

Current limitations and caveats

Getting started and providing feedback

Why do we need pod-level resource managers?

Introducing pod-level resource managers

Real-world use cases

1. Tightly-coupled database (Topology manager's pod scope)

2. ML workload with infrastructure sidecars (Topology manager's container scope)

CPU quotas (CFS) and isolation

How to enable Pod-Level Resource Managers

Observability

Current limitations and caveats

Getting started and providing feedback

Why Pod-level in-place resize?

Resource inheritance and the resizePolicy

Example: Scaling a shared resource pool

1. Initial Pod specification

2. The resize operation

Node-Level reality: feasibility and safety

1. The feasibility check

2. Update sequencing

Observability: tracking resize status

Constraints and requirements

What's next?

Getting started and providing feedback

What's new in v1.36

Opt-in memory reservation with memoryReservationPolicy

Comparison with v1.27 behavior

Observability metrics

Kernel version check

Enter `validation-gen`

Scaling API Reviews with `kube-api-linter`

Resource inheritance and the `resizePolicy`

Opt-in memory reservation with `memoryReservationPolicy`

Motivation: the `nodes/proxy` problem

The `nodes/proxy GET` WebSocket RCE risk

Fine-grained `kubelet` authorization: how it works

Updated `system:kubelet-api-admin` `ClusterRole`

Motivation: the `nodes/proxy` problem

The `nodes/proxy GET` WebSocket RCE risk

Fine-grained `kubelet` authorization: how it works

Updated `system:kubelet-api-admin` `ClusterRole`