A decade of governance: Cloud Custodian at 10 and its role in the agentic AI era
What is Cloud Custodian? It is an open source, stateless policy engine used to manage public cloud environments, Kubernetes and infrastructure as code through a unified DSL. As an incubating project within CNCF, it allows organizations to define and enforce policies for FinOps, security, and compliance across multiple providers.
Why the 10th anniversary of Cloud Custodian matters now
Reaching a 10-year milestone is significant because Cloud Custodian has transitioned from a cloud management tool into a fundamental cost optimization and safety layer for the AI era. With the rise of agentic AI, where autonomous agents generate and deploy infrastructure code, real-time automated governance has become a necessity. Beyond agentic code, AI workloads like GPU fleets, model serving endpoints, and training pipelines introduce both a larger security attack surface and significantly higher cost exposure, where the risk of ungoverned resources is higher than ever.
Why Cloud Custodian is essential for AI governance
- Automated guardrails: Cloud Custodian provides the structured, programmable boundaries required when AI agents manage infrastructure and when high-cost AI workloads like GPU fleets and model serving endpoints are provisioned.
- Real-time enforcement: It closes cost and security risk windows by enforcing organizational and industry best practices as soon as AI-generated resources are deployed.
- Vendor neutrality: The project ensures consistent governance across AWS, Azure, GCP, Oracle Cloud, Kubernetes, and Terraform, preventing fragmented cost or security postures in complex AI workflows.
Reaching ten years is a testament to the community of maintainers and contributors who have built Cloud Custodian into a foundational tool for cloud governance as code. As we move into an era of AI-driven automation, the project’s ability to provide transparent, programmable guardrails ensures that even when code is generated by a machine, it adheres to human-defined standards of safety and efficiency.
How Cloud Custodian empowers the cloud native ecosystem
Cloud Custodian aligns with CNCF principles by focusing on declarative automation and community-led innovation.
- Declarative policy: Users describe the desired state of their cloud resources, and the engine handles enforcement.
- Action and remediation: Beyond detection, Cloud Custodian is built to fix and prevent issues through customizable remediation workflows — critical at the speed and complexity of AI-scale environments.
- Scalability: Designed for high-velocity environments, it manages thousands of resources without the overhead of stateful management.
- Proven reliability: A decade of production use has resulted in a robust library of thousands of community-vetted policy actions and filters.
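To make the declarative model above concrete, here is a minimal sketch of a Cloud Custodian policy file that pairs a filter with a remediation action. It assumes an AWS account and the aws.ec2 resource type; the policy name, the 5% CPU threshold, and the seven-day window are illustrative assumptions, not recommended values.

```yaml
policies:
  - name: stop-underutilized-ec2          # hypothetical policy name
    resource: aws.ec2
    description: Stop running instances whose average CPU stayed under 5% for 7 days.
    filters:
      # Only consider instances that are currently running.
      - "State.Name": running
      # CloudWatch metrics filter; threshold and window are illustrative.
      - type: metrics
        name: CPUUtilization
        days: 7
        period: 86400
        value: 5
        op: less-than
    actions:
      # Remediation step applied to every resource that matched the filters.
      - stop
```

A file like this is typically evaluated with `custodian run --output-dir out policy.yml`; adding `--dryrun` reports the matching resources without executing the action, which is a reasonable first step before enabling remediation.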
Frequently asked questions about Cloud Custodian
How does Cloud Custodian help with cost management?
It uses policies to reduce waste by eliminating idle or underutilized resources, including idle training jobs and GPU fleets. It also prevents costly misconfigurations such as oversized storage tiers, ensuring cloud environments stay efficient and well-governed.
Is Cloud Custodian compatible with multiple clouds?
Yes, it provides a unified DSL to manage resources across AWS, Azure, GCP, and OCI, ensuring a single source of truth for organizational policy.
Why is Cloud Custodian relevant for AI-generated code?
AI agents can ship code faster than humans can review it. Cloud Custodian acts as an automated safety net, ensuring all machine-deployed infrastructure follows security and compliance rules while catching costly misconfigurations before they become security gaps or budget overruns.
Next steps for the community
To celebrate this milestone and explore how Cloud Custodian is adapting to the latest industry shifts, we encourage the community to engage with the following resources:
- Read the full announcement: An Open Source Project Turns 10 and Finds Itself Tailor-Made for the Agentic AI Era
- View the documentation: Visit cloudcustodian.io for technical guides.
- Contribute: Join the maintainers and contributors at the Cloud Custodian GitHub repository.
Congratulations to the contributors who have made the last decade possible. Here is to ten years of governance and the road ahead.
Kubernetes v1.36: Moving Volume Group Snapshots to GA
Volume group snapshots were introduced as an Alpha feature with the Kubernetes v1.27 release, moved to Beta in v1.32, and to a second Beta in v1.34. We are excited to announce that in the Kubernetes v1.36 release, support for volume group snapshots has reached General Availability (GA).
The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaim objects for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash-consistent recovery point.
This feature is only supported for CSI volume drivers.
An overview of volume group snapshots
Some storage systems provide the ability to create a crash-consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).
Why add volume group snapshots to Kubernetes?
The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage. Underpinning all these features is the Kubernetes goal of workload portability.
There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This is extremely useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another. If snapshots for these volumes are taken at different times, the application will not be consistent and will not function properly if restored from those snapshots.
While you can quiesce the application first and take individual snapshots sequentially, this process can be time-consuming or sometimes impossible. Consistent group support provides crash consistency across all volumes in the group without the need for application quiescence.
Kubernetes APIs for volume group snapshots
Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:
- VolumeGroupSnapshot
- Created by a Kubernetes user (or automation) to request creation of a volume group snapshot for multiple persistent volume claims.
- VolumeGroupSnapshotContent
- Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the provisioned cluster resource (a group snapshot). The object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
- VolumeGroupSnapshotClass
- Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.
These three API kinds are defined as CustomResourceDefinitions (CRDs). For the GA release, the API version has been promoted to v1.
What's new in GA?
- The API version for VolumeGroupSnapshot, VolumeGroupSnapshotContent, and VolumeGroupSnapshotClass is promoted to groupsnapshot.storage.k8s.io/v1.
- Enhanced stability and bug fixes based on feedback from the beta releases, including the improvements introduced in v1beta2 for accurate restoreSize reporting.
How do I use Kubernetes volume group snapshots?
Creating a new group snapshot with Kubernetes
Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.
Label the PVCs you wish to group:
% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled
% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled
For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.
apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshot
metadata:
  name: snapshot-daily-20260422
  namespace: demo-namespace
spec:
  # Must match metadata.name of the VolumeGroupSnapshotClass.
  volumeGroupSnapshotClassName: csi-groupsnapclass
  source:
    selector:
      matchLabels:
        group: myGroup
The VolumeGroupSnapshotClass is required for dynamic provisioning:
apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshotClass
metadata:
  # Object names must be lowercase DNS subdomain names.
  name: csi-groupsnapclass
driver: example.csi.k8s.io
deletionPolicy: Delete
How to use group snapshot for restore
At restore time, request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. Repeat this for all volumes that are part of the group snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc-restored-2026-04-22
  namespace: demo-namespace
spec:
  storageClassName: example-sc
  dataSource:
    name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOncePod
  resources:
    requests:
      storage: 100Mi
As a storage vendor, how do I add support for group snapshots?
To implement the volume group snapshot feature, a CSI driver must:
- Implement a new group controller service.
- Implement group controller RPCs: CreateVolumeGroupSnapshot, DeleteVolumeGroupSnapshot, and GetVolumeGroupSnapshot.
- Add group controller capability CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT.
See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.
How can I learn more?
- The design spec for the volume group snapshot feature.
- The code repository for volume group snapshot APIs and controller.
- CSI documentation on the group snapshot feature.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to all the contributors who stepped up over the years to help the project reach GA:
- Ben Swartzlander (bswartz)
- Cici Huang (cici37)
- Darshan Murthy (darshansreenivas)
- Hemant Kumar (gnufied)
- James Defelice (jdef)
- Jan Šafránek (jsafrane)
- Madhu Rajanna (Madhu-1)
- Manish M Yathnalli (manishym)
- Michelle Au (msau42)
- Niels de Vos (nixpanic)
- Leonardo Cecchi (leonardoce)
- Rakshith R (Rakshith-R)
- Raunak Shah (RaunakShah)
- Saad Ali (saad-ali)
- Wei Duan (duanwei33)
- Xing Yang (xing-yang)
- Yati Padia (yati1998)
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA
Dynamic Resource Allocation (DRA) has fundamentally changed how platform administrators handle hardware accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA continues to mature, bringing a wave of feature graduations, critical usability improvements, and new capabilities that extend DRA to native resources like memory and CPU and add support for ResourceClaims in PodGroups.
Driver availability continues to expand. Beyond specialized compute accelerators, the ecosystem includes support for networking and other hardware types, reflecting a move toward a more robust, hardware-agnostic infrastructure.
Whether you are managing massive fleets of GPUs, need better handling of failures, or are simply looking for better ways to define resource fallback options, the upgrades to DRA in 1.36 have something for you. Let's dive into the new features and graduations!
Feature graduations
The community has been hard at work stabilizing core DRA concepts. In Kubernetes 1.36, several highly anticipated features have graduated to Beta and Stable.
Prioritized list (stable)
Hardware heterogeneity is a reality in most clusters. With the Prioritized list feature, you can confidently define fallback preferences when requesting devices. Instead of hardcoding a request for a specific device model, you can specify an ordered list of preferences (e.g., "Give me an H100, but if none are available, fall back to an A100"). The scheduler will evaluate these requests in order, drastically improving scheduling flexibility and cluster utilization.
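The fallback described above can be sketched as a ResourceClaim whose request lists alternatives under firstAvailable. This is a minimal sketch, not a definitive manifest: the DeviceClass name gpu.example.com, the attribute domain, the model attribute, and its values are all hypothetical and depend on what your DRA driver actually publishes.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-with-fallback                 # hypothetical name
  namespace: demo-namespace
spec:
  devices:
    requests:
      - name: gpu
        # Subrequests are tried in order; the first one that can be satisfied wins.
        firstAvailable:
          - name: preferred
            deviceClassName: gpu.example.com   # hypothetical DeviceClass
            selectors:
              - cel:
                  # Attribute name and value are assumptions about the driver.
                  expression: 'device.attributes["gpu.example.com"].model == "H100"'
          - name: fallback
            deviceClassName: gpu.example.com
            selectors:
              - cel:
                  expression: 'device.attributes["gpu.example.com"].model == "A100"'
```

Keeping the preference order in the claim rather than in application code means the same workload definition can run on clusters with different hardware mixes.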
Extended resource support (beta)
As DRA becomes the standard for resource allocation, bridging the gap with legacy systems is crucial. The DRA Extended resource feature allows users to request resources via traditional extended resources on a Pod. This allows for a gradual transition to DRA, meaning cluster operators can migrate clusters to DRA but let application developers adopt the ResourceClaim API on their own schedule.
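As a rough illustration of that transition path, the sketch below pairs a DeviceClass that advertises an extended resource name with a Pod that still requests the device the traditional way. The class name, driver name, extended resource name example.com/gpu, and container image are hypothetical, and the extendedResourceName field should be checked against the API reference for your cluster version.

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com                   # hypothetical class name
spec:
  selectors:
    - cel:
        expression: 'device.driver == "gpu.example.com"'
  # Maps this class to an extended resource name that Pods can request.
  extendedResourceName: example.com/gpu
---
apiVersion: v1
kind: Pod
metadata:
  name: legacy-style-pod                  # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      resources:
        limits:
          # No ResourceClaim in the Pod spec; the request uses the extended resource.
          example.com/gpu: 1
```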
Partitionable devices (beta)
Hardware accelerators are powerful, and sometimes a single workload doesn't need an entire device. The Partitionable devices feature provides native DRA support for dynamically carving physical hardware into smaller, logical instances (such as Multi-Instance GPUs) based on workload demands. This allows administrators to safely and efficiently share expensive accelerators across multiple Pods.
Device taints (beta)
Just as you can taint a Kubernetes Node, you can apply taints directly to specific DRA devices. Device taints and tolerations empower cluster administrators to manage hardware more effectively. You can taint faulty devices to prevent them from being allocated to standard claims, or reserve specific hardware for dedicated teams, specialized workloads, and experiments. Ultimately, only Pods with matching tolerations are permitted to claim these tainted devices.
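On the workload side, a toleration is attached to the device request inside a ResourceClaim. The following is a minimal sketch that assumes an administrator has already tainted some devices with a key such as example.com/reserved; that key, the DeviceClass name, and the claim name are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: reserved-gpu                      # hypothetical name
  namespace: demo-namespace
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com     # hypothetical DeviceClass
          tolerations:
            # Allows allocation of devices carrying this taint key,
            # whatever the taint value is.
            - key: example.com/reserved
              operator: Exists
              effect: NoSchedule
```

Claims without a matching toleration are not allocated tainted devices, which is what lets a team reserve hardware without changing other users' manifests.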
Device binding conditions (beta)
To improve scheduling reliability, the Kubernetes scheduler can use the Binding conditions feature to delay committing a Pod to a Node until its required external resources—such as attachable devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this prevents premature assignments that can lead to Pod failures, ensuring a much more robust and predictable deployment process.
Resource health status (beta)
Knowing when a device has failed or become unhealthy is critical for workloads running on specialized hardware. With Resource health status, Kubernetes exposes device health information directly in the Pod status, giving users and controllers crucial visibility to quickly identify and react to hardware failures. The feature includes support for human-readable health status messages, making it significantly easier to diagnose issues without the need to dive into complex driver logs.
New features
Beyond stabilizing existing capabilities, v1.36 introduces foundational new features that expand what DRA can do. These are alpha features, so they are behind feature gates that are disabled by default.
ResourceClaim support for workloads
To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the ResourceClaim support for workloads feature enables Kubernetes to seamlessly manage shared resources across massive sets of Pods. By associating ResourceClaims or ResourceClaimTemplates with PodGroups, this feature eliminates previous scaling bottlenecks, such as the limit on the number of pods that can share a claim, and removes the burden of manual claim management from specialized orchestrators.
Node allocatable resources
Why should DRA only be for external accelerators? In v1.36, we are introducing the first iteration of using the DRA APIs to manage node allocatable infrastructure resources (like CPU and memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA Node allocatable resources feature, users can leverage DRA's advanced placement, NUMA-awareness, and prioritization semantics for standard compute resources, paving the way for incredibly fine-grained performance tuning.
DRA resource availability visibility
One of the most requested features from cluster administrators has been better visibility into hardware capacity. The new Resource pool status feature allows you to query the availability of devices in DRA resource pools. By creating a ResourcePoolStatusRequest object, you get a point-in-time snapshot of device counts (total, allocated, available, and unavailable) for each pool managed by a given driver. This enables better integration with dashboards and capacity planning tools.
List types for attributes
ResourceClaim constraint evaluation has changed to work better with scalar and list values: matchAttribute now checks for a non-empty intersection, and distinctAttribute checks for pairwise disjoint values. An includes() function has also been introduced in CEL; it lets device selectors keep working more easily when an attribute changes between scalar and list representations. (The includes() function is only available in DRA contexts for expression evaluation.)
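For context, a matchAttribute constraint ties several requests in one claim to a shared attribute. The sketch below is illustrative only: the two DeviceClass names and the attribute dra.example.com/numaNode are hypothetical. With the semantics described above, a driver that publishes this attribute as a list would satisfy the constraint when the lists of the selected devices share at least one value.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-and-nic-aligned               # hypothetical name
  namespace: demo-namespace
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com     # hypothetical DeviceClass
      - name: nic
        exactly:
          deviceClassName: nic.example.com     # hypothetical DeviceClass
    constraints:
      # Both devices must agree on this attribute. For list-valued
      # attributes, v1.36 treats a non-empty intersection as a match.
      - requests: ["gpu", "nic"]
        matchAttribute: dra.example.com/numaNode
```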
Deterministic device selection
The Kubernetes scheduler has been updated to evaluate devices using lexicographical ordering based on resource pool and ResourceSlice names. This change empowers drivers to proactively influence the scheduling process, leading to improved throughput and more optimal scheduling decisions. The ResourceSlice controller toolkit automatically generates names that reflect the exact device ordering specified by the driver author.
Discoverable device metadata in containers
Workloads running on nodes with DRA devices often need to discover details about their allocated devices, such as PCI bus addresses or network interface configuration, without querying the Kubernetes API. With Device metadata, Kubernetes defines a standard protocol for how DRA drivers expose device attributes to containers as versioned JSON files at well-known paths. Drivers built with the DRA kubelet plugin library get this behavior transparently; they just provide the metadata and the library handles file layout, CDI bind-mounts, versioning, and lifecycle. This gives applications a consistent, driver-independent way to discover and consume device metadata, eliminating the need for custom controllers or looking up ResourceSlice objects to get metadata via attributes.
What’s next?
This release introduced a wealth of new Dynamic Resource Allocation (DRA) features, and the momentum is only building. As we look ahead, our roadmap focuses on maturing existing features toward beta and stable releases while hardening DRA’s performance, scalability, and reliability. A key priority over the coming cycles will be deep integration with workload-aware and topology-aware scheduling.
A big goal for us is to migrate users from Device Plugin to DRA, and we want you involved. Whether you are currently maintaining a driver or are just beginning to explore the possibilities, your input is vital. Partner with us to shape the next generation of resource management. Reach out today to collaborate on development, share feedback, or start building your first DRA driver.
Getting involved
A good starting point is joining the WG Device Management Slack channel and meetings, which happen at Americas/EMEA and EMEA/APAC friendly time slots.
Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.
Microcks becomes a CNCF incubating project
The CNCF Technical Oversight Committee (TOC) has voted to accept Microcks as a CNCF incubating project.
About Microcks
Modern software teams build applications as collections of interconnected APIs and microservices, and with that architecture comes a significant challenge: how do you develop and test services in isolation when so many depend on each other? Microcks solves this by providing an open source, cloud native platform for API mocking and testing.
With Microcks, teams can instantly turn their existing API contract documents, whether they’re OpenAPI specs, AsyncAPI specs, gRPC/Protobuf definitions, GraphQL schemas, Postman collections, or SOAP/WSDL projects, into live mock servers. Those same assets then power automated contract conformance tests against real implementations. The result is a unified, multi-protocol approach that spans both synchronous REST/RPC APIs and event-driven, asynchronous architectures — a combination that sets Microcks apart from narrower tooling.
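As an illustration of what such a contract document can look like, here is a minimal OpenAPI sketch with named examples. The API, path, and example data are hypothetical. It relies on the convention, as documented by the Microcks project to the best of our understanding, that a request example and a response example sharing the same name (here "eclair") are paired into one mock exchange; verify this against the Microcks documentation for your version.

```yaml
openapi: 3.0.3
info:
  title: Pastry API                       # hypothetical API
  version: 1.0.0
paths:
  /pastries/{name}:
    get:
      operationId: getPastry
      parameters:
        - name: name
          in: path
          required: true
          schema:
            type: string
          examples:
            # Request-side example; the key is the pairing name.
            eclair:
              value: eclair
      responses:
        "200":
          description: The requested pastry
          content:
            application/json:
              schema:
                type: object
                properties:
                  name:
                    type: string
                  price:
                    type: number
              examples:
                # Response-side example with the same key as the parameter example.
                eclair:
                  value:
                    name: eclair
                    price: 2.5
```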
Microcks’s key milestones and ecosystem development
Created in February 2015 by Laurent Broudoux, Microcks is a community-driven project with global contributors and adopters, including financial institutions (BNP Paribas, Société Générale, and Lombard Odier) and technology/consulting firms (Deloitte, Amway, and J.B. Hunt).
Since joining the CNCF Sandbox on June 22, 2023, Microcks has seen significant growth in adoption, contribution, development, and ecosystem reach.
Adoption has surged, with container image downloads exceeding 2.5 million in 2025 (triple the 2024 total). Over 34 organizations publicly adopt Microcks, with 13 added in 2025 alone. The project has high community interest, evidenced by 1,800 GitHub stars and 311 forks on the main repository, plus consistent documentation traffic growth.
The contributor base is expanding, totaling 645 across GitHub. The last quarter saw 51 active contributors with an “Excellent” 57% quarter-over-quarter retention rate. In 2025, 167 active contributors represented 35 organizations. Maintainers now include code owners from Yosemite Crew and AXA France, signaling growing community ownership.
Development health is strong: the project was active 342 of the last 365 days. The 12-month average is 288 new pull requests monthly, with an average issue resolution time of 11 days and PR merge lead time of 6 days. The core platform has had 19 releases, with the current stable version being 1.14.0.
Post-sandbox, Microcks has deepened integrations with CNCF projects like Dapr, OpenTelemetry, Keycloak, and AsyncAPI (The Linux Foundation). It integrates natively with Kubernetes and Helm for deployment and connects to CI/CD via Jenkins, GitHub Actions, and Tekton. Testcontainers modules for Java, Node.js, Go, Python, and .NET allow developers to embed Microcks in local test loops.
A word from the Maintainers
“When we first started Microcks ten years ago, the idea was simple: developers should be able to simulate any API dependency, regardless of protocol, without writing a single line of custom code. What we didn’t anticipate was how central that problem would become as the industry shifted to microservices, event-driven architectures, and now AI-powered APIs. Reaching CNCF incubation is a validation not just of the technology, but of the community that has shaped it; 645 contributors, 34 public adopters, and organizations are contributing back because they genuinely depend on the project. We’re grateful to CNCF for the neutral, collaborative home it provides, and we’re energized by what’s ahead: deeper AsyncAPI toolchain integration, AI and MCP simulation support, and continuing to make multi-protocol API testing effortless for every team that builds on Kubernetes.”
— Laurent Broudoux, Creator and Maintainer, Microcks
“The ‘better together’ principle has defined how we’ve built Microcks from the start, with a vendor-neutral design, integrated tools that developers already use, and shaped it by the organizations actually running it in production. In 2025 alone, more than 13 organizations joined our public adopters list, and we saw over 2.5 million container image downloads. That growth isn’t just a number: it reflects teams in financial services, cloud platforms, and enterprise software trusting Microcks at the center of their API DevOps workflows. CNCF incubation gives us the governance foundation and community reach to keep building in the open. The next chapter, including intelligent mocking for AI agents, MCP protocol support, and making contract testing a first-class citizen in every CI/CD pipeline, is one we’re excited to write alongside the community.”
— Yacine Kheddache, Maintainer and Community Lead, Microcks
Support from the TOC
The CNCF Technical Oversight Committee (TOC) provides technical leadership to the cloud native community. It defines and maintains the foundation’s technical vision, approves new projects, and stewards them across maturity levels. The TOC also aligns projects within the overall ecosystem, sets cross-cutting standards and best practices, and works with end users to ensure long-term sustainability. As part of its charter, the TOC evaluates and supports projects as they meet the requirements for incubation and continue progressing toward graduation.
“Microcks addresses a gap that any team building distributed systems on Kubernetes will recognize immediately: the difficulty of developing and testing services in isolation when everything depends on everything else. Across adopters, Microcks has consistently proven itself as the only open source solution capable of addressing API mocking at scale across multiple specifications, such as REST, GraphQL, AsyncAPI, and gRPC, natively on Kubernetes and without vendor lock-in. Microcks demonstrates the kind of engaged, sustainable community that CNCF incubation is designed to support. I look forward to seeing the project continue to grow within the ecosystem.”
— Katie Gamanji, CNCF TOC Sponsor
Main components
Microcks is composed of several modular components:
- Core Server: The main Microcks application, built with Java/Spring Boot, providing the API mocking engine, web UI, and REST API. It ingests API contract documents and serves dynamic mock responses.
- Async Minion: A lightweight companion service handling event-driven and asynchronous protocols (Apache Kafka, MQTT, AMQP, WebSocket, Google Pub/Sub, and more), extending mocking beyond HTTP.
- Operator: A Kubernetes Operator for lifecycle management and automated deployment of Microcks instances in Kubernetes environments, as well as full GitOps support for deploying mocks and executing tests.
- Helm Chart: A production-grade Helm chart for flexible, configurable Kubernetes deployments.
- Testcontainers Libraries: Community-maintained modules for Java, Node.js, Go, Python, and .NET that let developers embed Microcks directly in automated tests.
- CLI: A command-line tool for triggering API conformance tests from CI/CD pipelines, with integrations for Jenkins, GitHub Actions, Tekton, and others.
Project roadmap
The Microcks team is focused on several key development areas to enhance the platform. A major theme is integrating with AI and the Model Context Protocol (MCP), positioning Microcks as a crucial testing and simulation layer for AI-powered APIs and agents.
Microcks is also expanding its support for the AsyncAPI ecosystem, notably by incorporating Kafka contract testing into the acceptance testing infrastructure for the AsyncAPI Generator. Furthermore, the maintainers are committed to growing the Testcontainers ecosystem across more languages and frameworks.
Building on the 2025 OpenTelemetry integration, Microcks will feature continued observability enhancements. Finally, future work includes adding support for more event-driven protocols and advancing the JavaScript dispatcher to enable more dynamic and complex mocking scenarios.
The full project roadmap is maintained at https://github.com/orgs/microcks/projects/1.
As a CNCF-hosted project, Microcks is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. Microcks joins incubating technologies that standardize cloud native infrastructure, enhance observability, and streamline service-to-service communication. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.
To learn more about Microcks, visit microcks.io, explore the GitHub repository, or join the community on Discord.
Kubernetes v1.36: Server-Side Sharded List and Watch
As Kubernetes clusters grow to tens of thousands of nodes, controllers that watch high-cardinality resources like Pods face a scaling wall. Every replica of a horizontally scaled controller receives the full stream of events from the API server, paying the CPU, memory, and network cost to deserialize everything, only to discard the objects it is not responsible for. Scaling out the controller does not reduce per-replica cost; it multiplies it.
Kubernetes v1.36 introduces server-side sharded list and watch as an alpha feature (KEP-5866). With this feature enabled, the API server filters events at the source so that each controller replica receives only the slice of the resource collection it owns.
The problem with client-side sharding
Some controllers, such as kube-state-metrics, already support horizontal sharding. Each replica is assigned a portion of the keyspace and discards objects that do not belong to it. While this works functionally, it does not reduce the volume of data flowing from the API server:
- N replicas x full event stream: every replica deserializes and processes every event, then throws away what it does not need.
- Network bandwidth scales with replicas, not with shard size.
- CPU spent on deserialization is wasted for the discarded fraction.
Server-side sharded list and watch solves this by moving the filtering upstream into the API server. Each replica tells the API server which hash range it owns, and the API server only sends matching events.
How it works
The feature adds a shardSelector field to ListOptions. Clients specify a
hash range using the shardRange() function:
shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')
The API server computes a deterministic 64-bit FNV-1a hash of the specified
field and returns only objects whose hash falls within the range [start, end).
This applies to both list responses and watch event streams. The hash function
produces the same result across all API server instances, so the feature is
safe to use with multiple API server replicas.
Currently supported field paths are object.metadata.uid and
object.metadata.namespace.
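To make the range check concrete, here is a minimal Go sketch of the idea using the standard library's FNV-1a implementation. It assumes the hash is taken over the raw bytes of the field value; the exact input encoding used by the API server is not stated here, so treat the output as illustrative rather than as a prediction of which shard a real object lands in. The function names and the sample UID are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/big"
)

// hashKey returns the 64-bit FNV-1a hash of s.
// Assumption: the raw bytes of the field value are hashed.
func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// inShard reports whether hash lies in the half-open range [start, end).
// Bounds are hex strings such as "0x8000000000000000". big.Int is used
// because the exclusive upper bound 2^64 does not fit in a uint64.
func inShard(hash uint64, startHex, endHex string) (bool, error) {
	start, ok := new(big.Int).SetString(startHex, 0)
	if !ok {
		return false, fmt.Errorf("invalid start bound %q", startHex)
	}
	end, ok := new(big.Int).SetString(endHex, 0)
	if !ok {
		return false, fmt.Errorf("invalid end bound %q", endHex)
	}
	h := new(big.Int).SetUint64(hash)
	return h.Cmp(start) >= 0 && h.Cmp(end) < 0, nil
}

func main() {
	uid := "3f2c1b9e-0000-4000-8000-000000000001" // hypothetical Pod UID
	h := hashKey(uid)
	lower, _ := inShard(h, "0x0000000000000000", "0x8000000000000000")
	upper, _ := inShard(h, "0x8000000000000000", "0x10000000000000000")
	fmt.Printf("hash=%#x lower=%v upper=%v\n", h, lower, upper)
}
```

Because the range is half-open, two adjacent ranges that share a boundary never both match the same object.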
Using sharded watches in controllers
Controllers typically use informers to list and watch resources. To shard the
workload, each replica injects the shardSelector into the ListOptions used
by its informers via WithTweakListOptions:
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
)
shardSelector := "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"
factory := informers.NewSharedInformerFactoryWithOptions(client, resyncPeriod,
	informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
		opts.ShardSelector = shardSelector
	}),
)
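How each replica arrives at its own selector is up to the controller. One possible approach, sketched here under the assumption that replicas are numbered from 0 and each owns an equal-width slice of the 64-bit hash space, is to derive the bounds arithmetically. The helper names are hypothetical and not part of client-go.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// hexBound formats x as a 0x-prefixed hex string, zero-padded to at
// least 16 digits. The value 2^64 naturally needs 17 digits.
func hexBound(x *big.Int) string {
	s := x.Text(16)
	if len(s) < 16 {
		s = strings.Repeat("0", 16-len(s)) + s
	}
	return "0x" + s
}

// shardSelectors returns one shardRange() selector per replica, splitting
// the hash space [0, 2^64) into equal-width half-open ranges.
func shardSelectors(field string, replicas int) []string {
	if replicas <= 0 {
		return nil
	}
	space := new(big.Int).Lsh(big.NewInt(1), 64) // 2^64
	n := big.NewInt(int64(replicas))
	sels := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		// start = i * 2^64 / n, end = (i+1) * 2^64 / n
		start := new(big.Int).Mul(space, big.NewInt(int64(i)))
		start.Div(start, n)
		end := new(big.Int).Mul(space, big.NewInt(int64(i+1)))
		end.Div(end, n)
		sels = append(sels, fmt.Sprintf("shardRange(%s, '%s', '%s')",
			field, hexBound(start), hexBound(end)))
	}
	return sels
}

func main() {
	for i, sel := range shardSelectors("object.metadata.uid", 4) {
		fmt.Printf("replica %d: %s\n", i, sel)
	}
}
```

Each replica would then pass its own entry as the shardSelector value in the informer options.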
For a 2-replica deployment, the selectors split the hash space in half:
// Replica 0: lower half of the hash space
"shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"
// Replica 1: upper half of the hash space
"shardRange(object.metadata.uid, '0x8000000000000000', '0x10000000000000000')"
A single replica can also cover non-contiguous ranges using ||:
"shardRange(object.metadata.uid, '0x0000000000000000', '0x4000000000000000') || " +
"shardRange(object.metadata.uid, '0x8000000000000000', '0xc000000000000000')"
Verifying server support
When the API server honors a shard selector, the list response includes a
shardInfo field in the response metadata that echoes back the applied
selector:
{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "10245",
    "shardInfo": {
      "selector": "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"
    }
  },
  "items": [...]
}
If shardInfo is absent, the server did not honor the shard selector and the
client received the complete, unfiltered collection. In this case, the client
should be prepared to handle the full result set, for example by applying
client-side filtering to discard objects outside its assigned shard range.
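A client working with raw list responses could detect this case with a check along the following lines. This is a minimal sketch that decodes only the fields shown in the JSON above; the type and function names are hypothetical, and a typed client may surface the same information differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// listEnvelope decodes just enough of a list response to inspect shardInfo.
type listEnvelope struct {
	Metadata struct {
		ShardInfo *struct {
			Selector string `json:"selector"`
		} `json:"shardInfo"`
	} `json:"metadata"`
}

// appliedSelector returns the selector echoed back by the server.
// The boolean is false when shardInfo is absent, meaning the response is
// unfiltered and the caller should fall back to client-side filtering.
func appliedSelector(body []byte) (string, bool, error) {
	var env listEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return "", false, err
	}
	if env.Metadata.ShardInfo == nil {
		return "", false, nil
	}
	return env.Metadata.ShardInfo.Selector, true, nil
}

func main() {
	body := []byte(`{"kind":"PodList","apiVersion":"v1","metadata":{"resourceVersion":"10245"},"items":[]}`)
	sel, sharded, err := appliedSelector(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if !sharded {
		fmt.Println("server ignored the shard selector; filtering client-side")
		return
	}
	fmt.Println("server applied:", sel)
}
```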
Getting involved
This feature is in alpha and requires enabling the ShardedListAndWatch feature
gate on the API server. We are looking for feedback from controller authors and
operators running large clusters.
If you have questions or feedback, join the #sig-api-machinery channel on
Kubernetes Slack.
Kubernetes v1.36: Declarative Validation Graduates to GA
In Kubernetes v1.36, Declarative Validation for Kubernetes native types has reached General Availability (GA).
For users, this means more reliable, predictable, and better-documented APIs. By moving to a declarative model, the project also unlocks the future ability to publish validation rules via OpenAPI and integrate with ecosystem tools like Kubebuilder. For contributors and ecosystem developers, this replaces thousands of lines of handwritten validation code with a unified, maintainable framework.
This post covers why this migration was necessary, how the declarative validation framework works, and what new capabilities come with this GA release.
The Motivation: Escaping the "Handwritten" Technical Debt
For years, the validation of Kubernetes native APIs relied almost entirely on handwritten Go code. If a field needed to be bounded by a minimum value, or if two fields needed to be mutually exclusive, developers had to write explicit Go functions to enforce those constraints.
As the Kubernetes API surface expanded, this approach led to several systemic issues:
- Technical Debt: The project accumulated roughly 18,000 lines of boilerplate validation code. This code was difficult to maintain, error-prone, and required intense scrutiny during code reviews.
- Inconsistency: Without a centralized framework, validation rules were sometimes applied inconsistently across different resources.
- Opaque APIs: Handwritten validation logic was difficult to discover or analyze programmatically. This meant clients and tooling couldn't predictably know validation rules without consulting the source code or encountering errors at runtime.
The solution proposed by SIG API Machinery was Declarative Validation: using Interface Definition Language (IDL) tags (specifically +k8s: marker tags) directly within types.go files to define validation rules.
Enter validation-gen
At the core of the declarative validation feature is a new code generator called validation-gen. Just as Kubernetes uses generators for deep copies, conversions, and defaulting, validation-gen parses +k8s: tags and automatically generates the corresponding Go validation functions.
These generated functions are then registered seamlessly with the API scheme. The generator is designed as an extensible framework, allowing developers to plug in new "Validators" by describing the tags they parse and the Go logic they should produce.
A Comprehensive Suite of +k8s: Tags
The declarative validation framework introduces a comprehensive suite of marker tags that provide rich validation capabilities highly optimized for Go types. For a full list of supported tags, check out the official documentation. Here is a catalog of some of the most common tags you will now see in the Kubernetes codebase:
- Presence: +k8s:optional, +k8s:required
- Basic Constraints: +k8s:minimum=0, +k8s:maximum=100, +k8s:maxLength=16, +k8s:format=k8s-short-name
- Collections: +k8s:listType=map, +k8s:listMapKey=type
- Unions: +k8s:unionMember, +k8s:unionDiscriminator
- Immutability: +k8s:immutable, +k8s:update=[NoSet, NoModify, NoClear]
Example Usage:
type ReplicationControllerSpec struct {
	// +k8s:optional
	// +k8s:minimum=0
	Replicas *int32 `json:"replicas,omitempty"`
}
By placing these tags directly above the field definitions, the constraints are self-documenting and immediately visible to anyone reading the type definitions.
Advanced Capabilities: "Ambient Ratcheting"
One of the most substantial outcomes of this work is that validation ratcheting is now a standard, ambient part of the API. In the past, if we needed to tighten validation, we had to first add handwritten ratcheting code, wait a release, and then tighten the validation to avoid breaking existing objects.
With declarative validation, this safety mechanism is built-in. If a user updates an existing object, the validation framework compares the incoming object with the oldObject. If a specific field's value is semantically equivalent to its prior state (i.e., the user didn't change it), the new validation rule is bypassed. This "ambient ratcheting" means we can loosen or tighten validation immediately and in the least disruptive way possible.
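The ratcheting idea can be illustrated with a few lines of plain Go. This is a conceptual sketch only, not the code that validation-gen produces: the function names are hypothetical, and simple pointer-value equality stands in for the framework's semantic equivalence check.

```go
package main

import (
	"errors"
	"fmt"
)

// validateReplicas is the tightened rule: replicas, when set, must be >= 0.
func validateReplicas(v *int32) error {
	if v != nil && *v < 0 {
		return errors.New("replicas must be greater than or equal to 0")
	}
	return nil
}

// equalInt32Ptr reports whether two optional values are equivalent.
func equalInt32Ptr(a, b *int32) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// validateReplicasUpdate applies the rule on update, but skips it when the
// field is unchanged, so objects stored before the rule existed can still
// be updated in other ways.
func validateReplicasUpdate(newVal, oldVal *int32) error {
	if equalInt32Ptr(newVal, oldVal) {
		return nil // unchanged field: ratcheting bypasses the rule
	}
	return validateReplicas(newVal)
}

func main() {
	legacy := int32(-1) // hypothetical value stored before the rule was added
	fixed := int32(3)
	fmt.Println(validateReplicasUpdate(&legacy, &legacy)) // unchanged: allowed
	fmt.Println(validateReplicasUpdate(&legacy, &fixed))  // newly invalid: rejected
	fmt.Println(validateReplicasUpdate(&fixed, &legacy))  // repaired: allowed
}
```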
Scaling API Reviews with kube-api-linter
Reaching GA required absolute confidence in the generated code, but our vision extends beyond just validation. Declarative validation is a key part of a comprehensive approach to making API review easier, more consistent, and highly scalable.
By moving validation rules out of opaque Go functions and into structured markers, we are empowering tools like kube-api-linter. This linter can now statically analyze API types and enforce API conventions automatically, significantly reducing the manual burden on SIG API Machinery reviewers and providing immediate feedback to contributors.
What's next?
With the release of Kubernetes v1.36, Declarative Validation graduates to General Availability (GA). As a stable feature, the associated DeclarativeValidation feature gate is now enabled by default. It has become the primary mechanism for adding new validation rules to Kubernetes native types.
Looking forward, the project is committed to adopting declarative validation even more extensively. This includes migrating the remaining legacy handwritten validation code for established APIs and requiring its use for all new APIs and new fields. This ongoing transition will continue to shrink the codebase's complexity while enhancing the consistency and reliability of the entire Kubernetes API surface.
Beyond the core migration, declarative validation also unlocks an exciting future for the broader ecosystem. Because validation rules are now defined as structured markers rather than opaque Go code, they can be parsed and reflected in the OpenAPI schemas published by the Kubernetes API server. This paves the way for tools like kubectl, client libraries, and IDEs to perform rich client-side validation before a request is ever sent to the cluster. The same declarative framework can also be consumed by ecosystem tools like Kubebuilder, enabling a more consistent developer experience for authors of Custom Resource Definitions (CRDs).
Getting involved
The migration to declarative validation is an ongoing effort. While the framework itself is GA, there is still work to be done migrating older APIs to the new declarative format.
If you are interested in contributing to the core of Kubernetes API Machinery, this is a fantastic place to start. Check out the validation-gen documentation, look for issues tagged with sig/api-machinery, and join the conversation in the #sig-api-machinery and #sig-api-machinery-dev-tools channels on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/). You can also attend the SIG API Machinery meetings to get involved directly.
Acknowledgments
A huge thank you to everyone who helped bring this feature to GA:
- Tim Hockin
- Joe Betz
- Aaron Prindle
- Lalit Chauhan
- David Eads
- Darshan Murthy
- Jordan Liggitt
- Patrick Ohly
- Maciej Szulik
- Wojciech Tyczynski
- Joel Speed
- Bryce Palmer
And the many others across the Kubernetes community who contributed along the way.
Welcome to the declarative future of Kubernetes validation!
Announcing Kyverno release 1.18!
We’re excited to announce the release of Kyverno 1.18, our first release since graduating within the Cloud Native Computing Foundation.
This release builds on Kyverno’s growing role as a Kubernetes-native policy engine, with major investments in security, CLI capabilities, and policy engine reliability. It also continues our transition toward CEL-based policy types, setting the foundation for the future of policy as code.
TL;DR
Kyverno 1.18 delivers:
- Stronger security controls for HTTP-based policy execution and multiple CVE mitigations
- Significant CLI enhancements for testing and applying modern policy types
- Policy engine improvements for performance, observability, and scalability
- Enhancements to the policies Helm chart for better customization
There are no breaking changes in this release, but ClusterPolicy deprecation remains on track, and users should begin migrating to the newer policy types.
Security improvements
Security is a core pillar of Kyverno, and 1.18 introduces important safeguards for policy execution.
Safer HTTP execution
Kyverno policies can call external services via HTTP CEL libraries. In 1.18, this capability is significantly hardened:
- Blocklist/allowlist enforcement: by default, unsafe addresses such as loopback and metadata services are blocked. Users can configure an allow list and a block list for cluster-scoped and namespaced policies. Additionally, HTTP calls from namespaced policies are disabled by default and must be explicitly enabled using configuration flags. These changes help prevent SSRF-style abuse. See CVE-2026-4789 for details.
- Scoped token authorization: Previously, Kyverno HTTP calls included a token which could be used to impersonate Kyverno controllers. Now, HTTP calls include a separate scoped token that ensures that servers cannot misuse the token. See CVE-2026-41323 for details.
These changes reduce the risk of unintended external access while maintaining flexibility for advanced policy use cases.
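For readers unfamiliar with this class of check, the following Go sketch shows the general shape of a default block on loopback and link-local addresses (cloud metadata services commonly live at the link-local address 169.254.169.254). It is an illustration of the technique only, not Kyverno's implementation: the function name is hypothetical, Kyverno's actual default lists and configuration flags are not reproduced here, and a real check would also need to resolve hostnames before testing the address.

```go
package main

import (
	"fmt"
	"net/netip"
)

// blockedByDefault reports whether host is an IP literal in a range that
// an SSRF guard would typically refuse: loopback or link-local unicast.
func blockedByDefault(host string) (bool, error) {
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return false, fmt.Errorf("not an IP literal: %w", err)
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d like the IPv4 address it wraps
	return addr.IsLoopback() || addr.IsLinkLocalUnicast(), nil
}

func main() {
	for _, host := range []string{"127.0.0.1", "169.254.169.254", "203.0.113.10"} {
		blocked, err := blockedByDefault(host)
		if err != nil {
			fmt.Println(host, "error:", err)
			continue
		}
		fmt.Printf("%s blocked=%v\n", host, blocked)
	}
}
```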
CLI expansion and developer experience
Kyverno’s CLI continues to evolve as a critical tool for policy development and testing.
Expanded policy support
The kyverno apply and kyverno test commands now support:
- Cleanup policies
- HTTP and Envoy authorization policies
- mutateExisting rules in MutatingPolicy
- The --exceptions-with-policies flag for improved testing workflows
This significantly improves the ability to test modern policy types locally and in CI pipelines.
Reliability and usability improvements
Numerous fixes address:
- Error handling and reporting
- CRD compatibility without cluster connections
- Stability issues such as panics and file handle leaks
The result is a more predictable and developer-friendly experience when working with policies.
Policy engine improvements
Kyverno 1.18 includes several enhancements that improve how policies are executed and managed at scale.
Fine-grained success event filtering
A new successEventActions ConfigMap parameter allows users to control:
- Which success events are emitted
- How noisy or quiet policy reporting should be
This is especially valuable in large environments where event volume needs to be tuned.
Performance and scalability
Key improvements include:
- Memory-based HPA autoscaling for the admission controller
- TLS support on the /metrics endpoint
- Improved concurrency handling and reduced risk of race conditions
These changes make Kyverno more resilient in high-scale production environments.
CEL and policy execution enhancements
- Addition of a gzip CEL library for more advanced expressions
- Improved compilation and evaluation of policy variables and conditions
- Better alignment between policy types and execution engines
Image verification improvements
Several targeted improvements land for image verification:
- For ClusterPolicies, imageRegistryCredentials.secrets now accepts a namespace/name notation, and pod-level imagePullSecrets are automatically used as registry credentials, useful in multi-tenant environments where each namespace manages its own pull secrets.
- Reliability fixes for ImageValidatingPolicy, including better handling of signed timestamps and TSA certificate chains, Notary resolver fixes, correct matchImageReferences filtering, and improved autogen support for namespaced policies.
Policies Helm chart enhancements
The policies Helm chart continues to evolve with better customization and control.
New capabilities include:
- Support for excludes in ValidatingPolicies (namespace, subject, resource rules, match conditions)
- auditAnnotation configuration
- Per-policy annotation overrides
These improvements make it easier to tailor policies to specific organizational and operational needs.
Updated support policy
As Kyverno continues to grow in adoption, contributions, and overall project scope, we are evolving how we provide release support.
Starting with the 1.18 release, Kyverno will follow a “main + 1” patch support model.
This means:
- The current release (main) and the immediately previous release will be supported for patches. Patches are limited to critical and high severity CVEs, and other critical fixes. This provides roughly 3 months of community patch support.
- Older versions will no longer receive regular updates or fixes
Why this change
This adjustment allows the maintainer team to:
- Efficiently manage the AI-driven increase in security issues and PRs
- Maintain higher standards for security and responsiveness
- Focus efforts on current and actively used versions
- Keep the project sustainable and manageable as it scales
What this means for users
We recommend that users:
- Stay up to date with recent Kyverno releases
- Plan upgrades in alignment with the 3-month support window, or use a commercial distribution that provides higher SLAs and long-term support
- Reach out to the community if guidance is needed
This change ensures we can continue to deliver a secure, stable, and forward-moving project for everyone.
ClusterPolicy deprecation reminder
As a reminder, ClusterPolicy resources are planned for deprecation later this year.
We strongly encourage users to begin migrating to the newer policy types:
- ValidatingPolicy
- MutatingPolicy
- GeneratingPolicy
- ImageValidatingPolicy
- DeletingPolicy
What you should do
- Start migrating existing policies
- Test thoroughly using the CLI
- Report any gaps or issues
Community feedback is essential to ensuring a smooth transition and full feature parity. We ask that you please report issues and help us build full parity in the upcoming months.
Community updates
Kyverno’s graduation within the CNCF marks a major milestone for the project and its community.
Join the community
Kyverno community meetings now run at multiple global-friendly times:
- APAC / EU: Every other Wednesday 9:00 CET / India 13:30h / EU: 09:00h / Singapore: 16:00h / Australia: 18:00h
- USA/LATAM: Every other Wednesday 16:00 CET / India 20:30h / EU: 16:00h / NYC: 10:00h / SF: 7:00h
You can find all meetings on the CNCF Calendar using the Kyverno filter.
Additionally, we are working to create a space where community members can publish case studies and use cases to our community blog in hopes that this will serve as a space where everyone can learn from each other. Please keep an eye out for the announcements of when this section of the blog will be live and if you would like to submit a use case or case study, please reach out to [email protected] directly.
Getting started and upgrading
Kyverno 1.18 has no breaking changes, making it a safe and straightforward upgrade for most users.
Upgrade
- Review the release notes
- Test in staging environments
- Follow upgrade guidance in the documentation
Install
Install via the Kyverno website
Release Notes
What’s next
Looking ahead, the Kyverno roadmap focuses on:
- Continued investment in CEL-based policy types
- Improved policy authoring experience
- Scaling policy across multi-cluster environments
- Expanding into AI governance and policy-driven automation
Conclusion
Kyverno 1.18 is a meaningful step forward following our CNCF graduation.
With stronger security, expanded CLI capabilities, and continued investment in policy engine reliability and Kubernetes-native policy, Kyverno is helping teams move from policy enforcement to policy-driven operations at scale.
As the project continues to grow, we are also evolving how we operate to ensure long-term sustainability. Our move to the “main + 1” support model, which covers the current release and the one immediately before it, reflects a commitment to maintaining high-quality releases while keeping pace with the needs of a rapidly expanding community and ecosystem.
Upgrade to Kyverno 1.18, stay current with supported releases, begin your migration to the new policy types, and help us build the future of policy as code.
Kubernetes v1.36: Admission Policies That Can't Be Deleted
If you've ever tried to enforce a security policy across a fleet of Kubernetes clusters, you've probably run into a frustrating chicken-and-egg problem. Your admission policies are API objects, which means they don't exist until someone creates them, and they can be deleted by anyone with the right permissions. There's always a window during cluster bootstrap where your policies aren't active yet, and there's no way to prevent a privileged user from removing them.
Kubernetes v1.36 introduces an alpha feature that addresses this: manifest-based admission control. It lets you define admission webhooks and CEL-based policies as files on disk, loaded by the API server at startup, before it serves any requests.
The gap we're closing
Most Kubernetes policy enforcement today works through the API. You create a ValidatingAdmissionPolicy or a webhook configuration as an API object, and the admission controller picks it up. This works well in steady state, but it has some fundamental limitations.
During cluster bootstrap, there's a gap between when the API server starts serving requests and when your policies are created and active. If you're restoring from a backup or recovering from an etcd failure, that gap can be significant.
There's also a self-protection problem. Admission webhooks and policies can't intercept operations on their own configuration resources. Kubernetes skips invoking webhooks on types like ValidatingWebhookConfiguration to avoid circular dependencies. That means a sufficiently privileged user can delete your critical admission policies, and there's nothing in the admission chain to stop them.
We - Kubernetes SIG API Machinery - wanted a way to say "these policies are always on, full stop."
How it works
You add a staticManifestsDir field to the AdmissionConfiguration file
that you already pass to the API server via --admission-control-config-file.
Point it at a directory, drop your policy YAML files in there, and the API
server loads them before it starts serving.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionPolicy
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ValidatingAdmissionPolicyConfiguration
    staticManifestsDir: "/etc/kubernetes/admission/validating-policies/"
The manifest files are standard Kubernetes resource definitions. The only
requirement is that all the objects that these manifests define must have names ending in .static.k8s.io.
This reserved suffix prevents collisions with API-based configurations and
makes it easy to tell where an admission decision came from when you're
looking at metrics or audit logs.
Here's a complete example that denies privileged containers outside kube-system:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "deny-privileged.static.k8s.io"
  annotations:
    kubernetes.io/description: "Deny launching privileged pods, anywhere this policy is applied"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  variables:
  - name: allContainers
    expression: >-
      object.spec.containers +
      (has(object.spec.initContainers) ? object.spec.initContainers : []) +
      (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
  validations:
  - expression: >-
      !variables.allContainers.exists(c,
      has(c.securityContext) && has(c.securityContext.privileged) &&
      c.securityContext.privileged == true)
    message: "Privileged containers are not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "deny-privileged-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind deny-privileged policy to all namespaces except kube-system"
spec:
  policyName: "deny-privileged.static.k8s.io"
  validationActions:
  - Deny
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: "kubernetes.io/metadata.name"
        operator: NotIn
        values: ["kube-system"]
Protecting what couldn't be protected before
The part we're most excited about is the ability to intercept operations on admission configuration resources themselves.
With API-based admission, webhooks and policies are never invoked on types like ValidatingAdmissionPolicy or ValidatingWebhookConfiguration. That restriction exists for good reason: if a webhook could reject changes to its own configuration, you could end up locked out with no way to fix it through the API.
Manifest-based policies don't have that problem. If a bad policy is blocking something it shouldn't, you fix the file on disk and the API server picks up the change. There's no circular dependency because the recovery path doesn't go through the API.
This means you can write a manifest-based policy that prevents deletion of your critical API-based admission policies. For platform teams managing shared clusters, this is a significant improvement. You can now guarantee that your baseline security policies can't be removed by a cluster admin, accidentally or otherwise.
Here's what that looks like in practice. This policy prevents any
modification or deletion of admission resources that carry the
platform.example.com/protected: "true" label:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "protect-policies.static.k8s.io"
  annotations:
    kubernetes.io/description: "Prevent modification or deletion of protected admission resources"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["admissionregistration.k8s.io"]
      apiVersions: ["*"]
      operations: ["DELETE", "UPDATE"]
      resources:
      - "validatingadmissionpolicies"
      - "validatingadmissionpolicybindings"
      - "validatingwebhookconfigurations"
      - "mutatingwebhookconfigurations"
  validations:
  - expression: >-
      !has(oldObject.metadata.labels) ||
      !('platform.example.com/protected' in oldObject.metadata.labels) ||
      oldObject.metadata.labels['platform.example.com/protected'] != 'true'
    message: "Protected admission resources cannot be modified or deleted"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "protect-policies-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind protect-policies policy to all admission resources"
spec:
  policyName: "protect-policies.static.k8s.io"
  validationActions:
  - Deny
With this in place, any API-based admission policy or webhook configuration
labeled platform.example.com/protected: "true" is shielded from tampering.
The protection itself lives on disk and can't be removed through the API.
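If the three-clause CEL expression in that policy is hard to read at a glance, the same decision can be restated in Go. This is only a paraphrase of the expression's logic for illustration; the function name is hypothetical and nothing like it runs inside the API server.

```go
package main

import "fmt"

const protectedLabel = "platform.example.com/protected"

// changeAllowed mirrors the CEL validation: an UPDATE or DELETE is allowed
// unless the existing object carries the protected label with value "true".
// Indexing a nil or label-free map yields "", which covers the first two
// clauses of the CEL expression (no labels, or label key missing).
func changeAllowed(oldLabels map[string]string) bool {
	return oldLabels[protectedLabel] != "true"
}

func main() {
	fmt.Println(changeAllowed(nil))                                       // no labels at all
	fmt.Println(changeAllowed(map[string]string{protectedLabel: "true"})) // protected
	fmt.Println(changeAllowed(map[string]string{"team": "platform"}))     // unrelated label
}
```

Note that the check reads the labels of the existing object (oldObject), so removing the label and deleting the resource in a later request is itself blocked as an UPDATE.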
A few things to know
Manifest-based configurations are intentionally self-contained. They can't
reference API resources, which means no paramKind for policies, no
Service references for admission webhooks (instead they are URL-only),
and bindings may only reference
policies in the same manifest set. These restrictions exist because the
configurations need to work without any cluster state, including at startup
before etcd is available.
If you run multiple API server instances, each one loads its own manifest files independently. There's no cross-server synchronization built in. This is the same model as other file-based API server configurations like encryption at rest. When this feature is enabled, Kubernetes exposes a configuration hash as a label on relevant metrics, so you can detect drift.
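As an out-of-band complement, configuration management tooling could compute its own digest of the manifest set on each host and compare the results. The sketch below is one way to do that, assuming the manifests are already read into memory; it is not the hash the API server exposes on its metrics (how that value is computed is not described here), and the function name and file names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configDigest returns a SHA-256 digest over a set of manifest files,
// keyed by file name. Names are sorted so the result does not depend on
// map iteration order.
func configDigest(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		// Length-prefix the name and content so boundaries are unambiguous.
		fmt.Fprintf(h, "%d:%s%d:", len(name), name, len(files[name]))
		h.Write(files[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	serverA := map[string][]byte{"deny-privileged.yaml": []byte("kind: ValidatingAdmissionPolicy\n")}
	serverB := map[string][]byte{"deny-privileged.yaml": []byte("kind: ValidatingAdmissionPolicy\n# edited\n")}
	fmt.Println("drift detected:", configDigest(serverA) != configDigest(serverB))
}
```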
Files are watched for changes at runtime, so you don't need to restart the API server to update policies. If you update a manifest file, the API server validates the new configuration and swaps it in atomically. If validation fails, it keeps the previous good configuration and logs the error. This means you can roll out policy changes across your fleet using standard configuration management tools (Ansible, Puppet, or even mounted ConfigMaps) without any API server downtime.
The initial load at startup is stricter: if any manifest is invalid, the API server won't start. This is intentional. At startup, failing fast is safer than running without your expected policies.
Try it out
To try this in Kubernetes v1.36:
- Enable the ManifestBasedAdmissionControlConfig feature gate for each kube-apiserver.
- Create a directory with your static manifest files. If you need to mount that into the Pod where the API server runs, do that too. Read-only is fine.
- Configure staticManifestsDir in your AdmissionConfiguration with the directory path.
- Start the API server with --admission-control-config-file pointing to your AdmissionConfiguration file.
The full documentation is at Manifest-Based Admission Control, and you can follow KEP-5793 for ongoing progress.
We'd love to hear your feedback. Reach out on the #sig-api-machinery channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).
How to get involved
If you're interested in contributing to this feature or other SIG API Machinery projects, join us on #sig-api-machinery on Kubernetes Slack. You're also welcome to attend the SIG API Machinery meetings, held every other Wednesday.
Kubernetes v1.36: Admission Policies That Can't Be Deleted
If you've ever tried to enforce a security policy across a fleet of Kubernetes clusters, you've probably run into a frustrating chicken-and-egg problem. Your admission policies are API objects, which means they don't exist until someone creates them, and they can be deleted by anyone with the right permissions. There's always a window during cluster bootstrap where your policies aren't active yet, and there's no way to prevent a privileged user from removing them.
Kubernetes v1.36 introduces an alpha feature that addresses this: manifest-based admission control. It lets you define admission webhooks and CEL-based policies as files on disk, loaded by the API server at startup, before it serves any requests.
The gap we're closing
Most Kubernetes policy enforcement today works through the API. You create a ValidatingAdmissionPolicy or a webhook configuration as an API object, and the admission controller picks it up. This works well in steady state, but it has some fundamental limitations.
During cluster bootstrap, there's a gap between when the API server starts serving requests and when your policies are created and active. If you're restoring from a backup or recovering from an etcd failure, that gap can be significant.
There's also a self-protection problem. Admission webhooks and policies can't intercept operations on their own configuration resources. Kubernetes skips invoking webhooks on types like ValidatingWebhookConfiguration to avoid circular dependencies. That means a sufficiently privileged user can delete your critical admission policies, and there's nothing in the admission chain to stop them.
We in Kubernetes SIG API Machinery wanted a way to say "these policies are always on, full stop."
How it works
You add a staticManifestsDir field to the AdmissionConfiguration file
that you already pass to the API server via --admission-control-config-file.
Point it at a directory, drop your policy YAML files in there, and the API
server loads them before it starts serving.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionPolicy
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ValidatingAdmissionPolicyConfiguration
    staticManifestsDir: "/etc/kubernetes/admission/validating-policies/"
The manifest files are standard Kubernetes resource definitions. The only
requirement is that all the objects that these manifests define must have names ending in .static.k8s.io.
This reserved suffix prevents collisions with API-based configurations and
makes it easy to tell where an admission decision came from when you're
looking at metrics or audit logs.
Here's a complete example that denies privileged containers outside kube-system:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "deny-privileged.static.k8s.io"
  annotations:
    kubernetes.io/description: "Deny launching privileged pods, anywhere this policy is applied"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  variables:
  - name: allContainers
    expression: >-
      object.spec.containers +
      (has(object.spec.initContainers) ? object.spec.initContainers : []) +
      (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
  validations:
  - expression: >-
      !variables.allContainers.exists(c,
      has(c.securityContext) && has(c.securityContext.privileged) &&
      c.securityContext.privileged == true)
    message: "Privileged containers are not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "deny-privileged-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind deny-privileged policy to all namespaces except kube-system"
spec:
  policyName: "deny-privileged.static.k8s.io"
  validationActions:
  - Deny
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: "kubernetes.io/metadata.name"
        operator: NotIn
        values: ["kube-system"]
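One way to check that the policy above is active is to submit a Pod that it should reject. The manifest below is a minimal sketch for that purpose; the Pod name, the default namespace, and the busybox image are illustrative assumptions and not part of the feature.

```yaml
# Hypothetical test Pod: it requests a privileged container in a namespace
# other than kube-system, so the deny-privileged policy should match it.
apiVersion: v1
kind: Pod
metadata:
  name: privileged-test        # illustrative name
  namespace: default
spec:
  containers:
  - name: shell
    image: busybox:1.36        # any small image would do
    command: ["sleep", "3600"]
    securityContext:
      privileged: true         # the field the CEL validation inspects
```

If the manifest-based policy and its binding loaded correctly, creating this Pod should be denied with the policy's message, "Privileged containers are not allowed". The same Pod in kube-system falls outside the binding's namespace selector and would not be blocked by this policy.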
Protecting what couldn't be protected before
The part we're most excited about is the ability to intercept operations on admission configuration resources themselves.
With API-based admission, webhooks and policies are never invoked on types like ValidatingAdmissionPolicy or ValidatingWebhookConfiguration. That restriction exists for good reason: if a webhook could reject changes to its own configuration, you could end up locked out with no way to fix it through the API.
Manifest-based policies don't have that problem. If a bad policy is blocking something it shouldn't, you fix the file on disk and the API server picks up the change. There's no circular dependency because the recovery path doesn't go through the API.
This means you can write a manifest-based policy that prevents deletion of your critical API-based admission policies. For platform teams managing shared clusters, this is a significant improvement. You can now guarantee that your baseline security policies can't be removed by a cluster admin, accidentally or otherwise.
Here's what that looks like in practice. This policy prevents any
modification or deletion of admission resources that carry the
platform.example.com/protected: "true" label:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "protect-policies.static.k8s.io"
  annotations:
    kubernetes.io/description: "Prevent modification or deletion of protected admission resources"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["admissionregistration.k8s.io"]
      apiVersions: ["*"]
      operations: ["DELETE", "UPDATE"]
      resources:
      - "validatingadmissionpolicies"
      - "validatingadmissionpolicybindings"
      - "validatingwebhookconfigurations"
      - "mutatingwebhookconfigurations"
  validations:
  - expression: >-
      !has(oldObject.metadata.labels) ||
      !('platform.example.com/protected' in oldObject.metadata.labels) ||
      oldObject.metadata.labels['platform.example.com/protected'] != 'true'
    message: "Protected admission resources cannot be modified or deleted"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "protect-policies-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind protect-policies policy to all admission resources"
spec:
  policyName: "protect-policies.static.k8s.io"
  validationActions:
  - Deny
With this in place, any API-based admission policy or webhook configuration
labeled platform.example.com/protected: "true" is shielded from tampering.
The protection itself lives on disk and can't be removed through the API.
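To make the labeling concrete, here is a sketch of an ordinary API-based policy that opts in to that protection. The policy name, the Deployment rule, and the team label check are hypothetical; the only part the protection depends on is the platform.example.com/protected label.

```yaml
# Hypothetical API-based policy, created through the API as usual.
# The label below is what the manifest-based protect-policies rule looks for.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "require-team-label.example.com"     # illustrative name
  labels:
    platform.example.com/protected: "true"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
    message: "Deployments must carry a team label"
```

Because the protecting expression reads oldObject, an UPDATE that tries to strip the label is evaluated against the still-labeled stored object and should be denied as well. Changing a protected object would therefore presumably start with relaxing the manifest on disk.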
A few things to know
Manifest-based configurations are intentionally self-contained. They can't
reference API resources, which means no paramKind for policies, no
Service references for admission webhooks (instead they are URL-only),
and bindings may only reference
policies in the same manifest set. These restrictions exist because the
configurations need to work without any cluster state, including at startup
before etcd is available.
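As an illustration of the URL-only restriction, a manifest-based webhook configuration might look like the sketch below. The object name, webhook name, and endpoint URL are assumptions made for the example, and this post does not show how the directory holding webhook manifests is configured, so check the feature documentation for that.

```yaml
# Sketch of a manifest-based webhook configuration (names and URL are hypothetical).
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: "image-policy.static.k8s.io"            # must end in .static.k8s.io
webhooks:
- name: "image-policy.platform.example.com"
  clientConfig:
    # URL only: a service reference would need cluster state to resolve.
    url: "https://admission.platform.example.com/validate"
    # caBundle is omitted here, so the API server's system trust roots apply.
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
    scope: "Namespaced"
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
  timeoutSeconds: 5
```

The endpoint has to be reachable from every API server instance without relying on in-cluster Service discovery, which is the point of the restriction.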
If you run multiple API server instances, each one loads its own manifest files independently. There's no cross-server synchronization built in. This is the same model as other file-based API server configurations like encryption at rest. When this feature is enabled, Kubernetes exposes a configuration hash as a label on relevant metrics, so you can detect drift.
Files are watched for changes at runtime, so you don't need to restart the API server to update policies. If you update a manifest file, the API server validates the new configuration and swaps it in atomically. If validation fails, it keeps the previous good configuration and logs the error. This means you can roll out policy changes across your fleet using standard configuration management tools (Ansible, Puppet, or even mounted ConfigMaps) without any API server downtime.
The initial load at startup is stricter: if any manifest is invalid, the API server won't start. This is intentional. At startup, failing fast is safer than running without your expected policies.
Try it out
To try this in Kubernetes v1.36:
- Enable the ManifestBasedAdmissionControlConfig feature gate for each kube-apiserver.
- Create a directory with your static manifest files. If you need to mount that into the Pod where the API server runs, do that too. Read-only is fine.
- Configure staticManifestsDir in your AdmissionConfiguration with the directory path.
- Start the API server with --admission-control-config-file pointing to your AdmissionConfiguration file.
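For clusters where kube-apiserver runs as a static Pod, the steps above could translate into a fragment like the following. It is a sketch, not a complete manifest: the image tag, the admission-config.yaml file name, and the host path are assumptions, and your existing flags and volumes would stay in place alongside these additions.

```yaml
# Fragment of a kube-apiserver static Pod manifest (illustrative values).
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.36.0    # assumed tag
    command:
    - kube-apiserver
    - --feature-gates=ManifestBasedAdmissionControlConfig=true
    # Assumed file name; this file carries the staticManifestsDir setting.
    - --admission-control-config-file=/etc/kubernetes/admission/admission-config.yaml
    volumeMounts:
    - name: admission-config
      mountPath: /etc/kubernetes/admission
      readOnly: true            # read-only is sufficient
  volumes:
  - name: admission-config
    hostPath:
      path: /etc/kubernetes/admission
      type: Directory
```

Mounting the parent directory makes both the AdmissionConfiguration file and the validating-policies/ subdirectory from the earlier example visible inside the container.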
The full documentation is at Manifest-Based Admission Control, and you can follow KEP-5793 for ongoing progress.
We'd love to hear your feedback. Reach out on the #sig-api-machinery channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).
How to get involved
If you're interested in contributing to this feature or other SIG API Machinery projects, join us on #sig-api-machinery on Kubernetes Slack. You're also welcome to attend the SIG API Machinery meetings, held every other Wednesday.
May 1 Security Release Patches RBAC Bypass in Transactions
SIG-etcd released updates v3.6.11, v3.5.30, and v3.4.44 today. These patch releases fix a vulnerability that allows an authenticated user to bypass RBAC authorization checks when reading data via PrevKv or attaching leases inside Put requests nested in etcd transactions.
In addition, v3.6.11 and v3.5.30 contain a bug fix for an issue that prevented adding a new member when one member was down, even though quorum was still satisfied.
This vulnerability does not affect etcd as a part of the Kubernetes Control Plane. Kubernetes does not rely on etcd’s built-in authentication and authorization; the API server handles authentication and authorization itself. The issue only affects etcd clusters in other contexts, specifically ones with Auth enabled where it is required for access control in untrusted or partially trusted networks or with untrusted users.
Kubernetes v1.36: Pod-Level Resource Managers (Alpha)
Kubernetes v1.36 introduces
Pod-Level Resource Managers
as an alpha feature, bringing a more flexible and powerful resource management
model to performance-sensitive workloads. This enhancement extends the kubelet's
Topology, CPU, and Memory Managers to support pod-level resource specifications
(.spec.resources), evolving them from a strictly per-container allocation
model to a pod-centric one.
Why do we need pod-level resource managers?
When running performance-critical workloads such as machine learning (ML) training, high-frequency trading applications, or low-latency databases, you often need exclusive, NUMA-aligned resources for your primary application containers to ensure predictable performance.
However, modern Kubernetes pods rarely consist of just one container. They frequently include sidecar containers for logging, monitoring, service meshes, or data ingestion.
Before this feature, this created a trade-off: to get NUMA-aligned, exclusive resources for your main application, you had to allocate exclusive, integer-based CPU resources to every container in the pod, which is wasteful for lightweight sidecars. If you didn't do this, you forfeited the pod's Guaranteed Quality of Service (QoS) class entirely, losing the performance benefits.
Introducing pod-level resource managers
Enabling pod-level resources support for the resource managers (via the
PodLevelResourceManagers and PodLevelResources feature gates) allows the
kubelet to create hybrid resource allocation models. This brings flexibility
and efficiency to high-performance workloads without sacrificing NUMA alignment.
Real-world use cases
Here are a few practical scenarios demonstrating how this feature can be applied, depending on the configured Topology Manager scope:
1. Tightly-coupled database (Topology Manager's pod scope)
Consider a latency-sensitive database pod that includes a main database container, a local metrics exporter, and a backup agent sidecar.
When configured with the pod Topology Manager scope, the kubelet performs a
single NUMA alignment based on the entire pod's budget. The database container
gets its exclusive CPU and memory slices from that NUMA node. The remaining
resources from the pod's budget form a new pod shared pool. The metrics
exporter and backup agent run in this pod shared pool. They share resources with
each other, but they are strictly isolated from the database's exclusive slices
and the rest of the node.
This allows you to safely co-locate auxiliary containers on the same NUMA node as your primary workload without wasting dedicated cores on them.
apiVersion: v1
kind: Pod
metadata:
  name: tightly-coupled-database
spec:
  # Pod-level resources establish the overall budget and NUMA alignment size.
  resources:
    requests:
      cpu: "8"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "16Gi"
  initContainers:
  - name: metrics-exporter
    image: metrics-exporter:v1
    restartPolicy: Always
  - name: backup-agent
    image: backup-agent:v1
    restartPolicy: Always
  containers:
  - name: database
    image: database:v1
    # This Guaranteed container gets an exclusive 6 CPU slice from the pod's budget.
    # The remaining 2 CPUs and 4Gi memory form the pod shared pool for the sidecars.
    resources:
      requests:
        cpu: "6"
        memory: "12Gi"
      limits:
        cpu: "6"
        memory: "12Gi"
2. ML workload with infrastructure sidecars (Topology Manager's container scope)
Imagine a pod running a GPU-accelerated ML training workload alongside a generic service mesh sidecar.
Under the container Topology Manager scope, the kubelet evaluates each
container individually. You can grant the ML container exclusive, NUMA-aligned
CPUs and Memory for maximum performance. Meanwhile, the service mesh sidecar
doesn't need to be NUMA-aligned; it can run in the general node-wide shared
pool. The collective resource consumption is still safely bounded by the overall
pod limits, but you only allocate NUMA-aligned, exclusive resources to the
specific containers that actually require them.
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  # Pod-level resources establish the overall budget constraint.
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  initContainers:
  - name: service-mesh-sidecar
    image: service-mesh:v1
    restartPolicy: Always
  containers:
  - name: ml-training
    image: ml-training:v1
    # Under the 'container' scope, this Guaranteed container receives exclusive,
    # NUMA-aligned resources, while the sidecar runs in the node's shared pool.
    resources:
      requests:
        cpu: "3"
        memory: "6Gi"
      limits:
        cpu: "3"
        memory: "6Gi"
CPU quotas (CFS) and isolation
When running these mixed workloads within a pod, isolation is enforced differently depending on the allocation:
- Exclusive containers: Containers granted exclusive CPU slices have their CPU CFS quota enforcement disabled at the container level, allowing them to run without being throttled by the Linux scheduler.
- Pod shared pool containers: Containers falling into the pod shared pool have CPU CFS quotas enforced at the pod level, ensuring they do not consume more than the leftover pod budget.
How to enable Pod-Level Resource Managers
Using this feature requires Kubernetes v1.36 or newer. To enable it, you must configure the kubelet with the appropriate feature gates and policies:
- Enable the PodLevelResources and PodLevelResourceManagers feature gates.
- Configure the Topology Manager with a policy other than none (i.e., best-effort, restricted, or single-numa-node).
- Set the Topology Manager scope to either pod or container using the topologyManagerScope field in the KubeletConfiguration.
- Configure the CPU Manager with the static policy.
- Configure the Memory Manager with the Static policy.
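Put together, the settings above might look like the following KubeletConfiguration sketch. The reservation sizes and the single NUMA node entry are illustrative assumptions: the static CPU Manager policy needs a non-zero CPU reservation, and the Static Memory Manager policy expects reservedMemory to add up to kubeReserved plus systemReserved plus the hard eviction threshold, so adjust the numbers to your nodes.

```yaml
# Sketch of a kubelet configuration for pod-level resource managers.
# All sizes are example values, not recommendations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  PodLevelResources: true
  PodLevelResourceManagers: true
topologyManagerPolicy: single-numa-node   # any policy other than "none"
topologyManagerScope: pod                 # or "container"
cpuManagerPolicy: static
memoryManagerPolicy: Static
# Reservations required by the static CPU and Static memory policies.
kubeReserved:
  cpu: "1"
  memory: "1Gi"
systemReserved:
  cpu: "1"
  memory: "1Gi"
evictionHard:
  memory.available: "100Mi"
# 1Gi + 1Gi + 100Mi = 2148Mi, placed entirely on NUMA node 0 in this example.
reservedMemory:
- numaNode: 0
  limits:
    memory: "2148Mi"
```

Switching topologyManagerScope between pod and container selects which of the two use cases described earlier applies to a given node.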
Observability
To help cluster administrators monitor and debug these new allocation models, we have introduced several new kubelet metrics when the feature gate is enabled:
- resource_manager_allocations_total: Counts the total number of exclusive resource allocations performed by a manager. The source label ("pod" or "node") distinguishes between allocations drawn from the node-level pool versus a pre-allocated pod-level pool.
- resource_manager_allocation_errors_total: Counts errors encountered during exclusive resource allocation, distinguished by the intended allocation source ("pod" or "node").
- resource_manager_container_assignments: Tracks the cumulative number of containers running with specific assignment types. The assignment_type label ("node_exclusive", "pod_exclusive", "pod_shared") provides visibility into how workloads are distributed.
Current limitations and caveats
While this feature opens up new possibilities, there are a few things to keep in mind during its alpha phase. Be sure to review the Limitations and caveats in the official documentation for full details on compatibility, requirements, and downgrade instructions.
Getting started and providing feedback
For a deep dive into the technical details and configuration of this feature, check out the official concept documentation.
To learn more about the overall pod-level resources feature and how to assign resources to pods, see the Kubernetes documentation on pod-level resources.
As this feature moves through alpha, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels.
Kubernetes v1.36: Pod-Level Resource Managers (Alpha)
Kubernetes v1.36 introduces
Pod-Level Resource Managers
as an alpha feature, bringing a more flexible and powerful resource management
model to performance-sensitive workloads. This enhancement extends the kubelet's
Topology, CPU, and Memory Managers to support pod-level resource specifications
(.spec.resources), evolving them from a strictly per-container allocation
model to a pod-centric one.
Why do we need pod-level resource managers?
When running performance-critical workloads such as machine learning (ML) training, high-frequency trading applications, or low-latency databases, you often need exclusive, NUMA-aligned resources for your primary application containers to ensure predictable performance.
However, modern Kubernetes pods rarely consist of just one container. They frequently include sidecar containers for logging, monitoring, service meshes, or data ingestion.
Before this feature, this created a trade-off, to get NUMA-aligned, exclusive resources for your main application, you had to allocate exclusive, integer-based CPU resources to every container in the pod. This might be wasteful for lightweight sidecars. If you didn't do this, you forfeited the pod's Guaranteed Quality of Service (QoS) class entirely, losing the performance benefits.
Introducing pod-level resource managers
Enabling pod-level resources support for the resource managers (via the
PodLevelResourceManagers and PodLevelResources feature gates) allows the
kubelet to create hybrid resource allocation models. This brings flexibility
and efficiency to high-performance workloads without sacrificing NUMA alignment.
Real-world use cases
Here are a few practical scenarios demonstrating how this feature can be applied, depending on the configured Topology Manager scope:
1. Tightly-coupled database (Topology manager's pod scope)
Consider a latency-sensitive database pod that includes a main database container, a local metrics exporter, and a backup agent sidecar.
When configured with the pod Topology Manager scope, the kubelet performs a
single NUMA alignment based on the entire pod's budget. The database container
gets its exclusive CPU and memory slices from that NUMA node. The remaining
resources from the pod's budget form a new pod shared pool. The metrics
exporter and backup agent run in this pod shared pool. They share resources with
each other, but they are strictly isolated from the database's exclusive slices
and the rest of the node.
This allows you to safely co-locate auxiliary containers on the same NUMA node as your primary workload without wasting dedicated cores on them.
apiVersion: v1
kind: Pod
metadata:
name: tightly-coupled-database
spec:
# Pod-level resources establish the overall budget and NUMA alignment size.
resources:
requests:
cpu: "8"
memory: "16Gi"
limits:
cpu: "8"
memory: "16Gi"
initContainers:
- name: metrics-exporter
image: metrics-exporter:v1
restartPolicy: Always
- name: backup-agent
image: backup-agent:v1
restartPolicy: Always
containers:
- name: database
image: database:v1
# This Guaranteed container gets an exclusive 6 CPU slice from the pod's budget.
# The remaining 2 CPUs and 4Gi memory form the pod shared pool for the sidecars.
resources:
requests:
cpu: "6"
memory: "12Gi"
limits:
cpu: "6"
memory: "12Gi"
2. ML workload with infrastructure sidecars (Topology manager's container scope)
Imagine a pod running a GPU-accelerated ML training workload alongside a generic service mesh sidecar.
Under the container Topology Manager scope, the kubelet evaluates each
container individually. You can grant the ML container exclusive, NUMA-aligned
CPUs and Memory for maximum performance. Meanwhile, the service mesh sidecar
doesn't need to be NUMA-aligned; it can run in the general node-wide shared
pool. The collective resource consumption is still safely bounded by the overall
pod limits, but you only allocate NUMA-aligned, exclusive resources to the
specific containers that actually require them.
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  # Pod-level resources establish the overall budget constraint.
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  initContainers:
  - name: service-mesh-sidecar
    image: service-mesh:v1
    restartPolicy: Always
  containers:
  - name: ml-training
    image: ml-training:v1
    # Under the 'container' scope, this Guaranteed container receives exclusive,
    # NUMA-aligned resources, while the sidecar runs in the node's shared pool.
    resources:
      requests:
        cpu: "3"
        memory: "6Gi"
      limits:
        cpu: "3"
        memory: "6Gi"
CPU quotas (CFS) and isolation
When running these mixed workloads within a pod, isolation is enforced differently depending on the allocation:
- Exclusive containers: Containers granted exclusive CPU slices have their CPU CFS quota enforcement disabled at the container level, allowing them to run without being throttled by the Linux scheduler.
- Pod shared pool containers: Containers falling into the pod shared pool have CPU CFS quotas enforced at the pod level, ensuring they do not consume more than the leftover pod budget.
How to enable Pod-Level Resource Managers
Using this feature requires Kubernetes v1.36 or newer. To enable it, you must configure the kubelet with the appropriate feature gates and policies:
- Enable the PodLevelResources and PodLevelResourceManagers feature gates.
- Configure the Topology Manager with a policy other than none (i.e., best-effort, restricted, or single-numa-node).
- Set the Topology Manager scope to either pod or container using the topologyManagerScope field in the KubeletConfiguration.
- Configure the CPU Manager with the static policy.
- Configure the Memory Manager with the Static policy.
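The checklist above can be turned into a small validation sketch. The keys mirror KubeletConfiguration field names (`featureGates`, `topologyManagerPolicy`, `topologyManagerScope`, `cpuManagerPolicy`, `memoryManagerPolicy`); the helper name `missing_prerequisites` is hypothetical, and treating an absent feature gate as disabled is a simplifying assumption, since real gate defaults depend on the Kubernetes version.

```python
def missing_prerequisites(cfg: dict) -> list[str]:
    """List which of the requirements above a kubelet config (as a dict) does not meet."""
    problems = []

    gates = cfg.get("featureGates", {})
    for gate in ("PodLevelResources", "PodLevelResourceManagers"):
        # Assumption: a gate that is not listed is treated as disabled.
        if not gates.get(gate, False):
            problems.append(f"feature gate {gate} is not enabled")

    if cfg.get("topologyManagerPolicy", "none") == "none":
        problems.append("topologyManagerPolicy must not be 'none'")
    if cfg.get("topologyManagerScope", "container") not in ("pod", "container"):
        problems.append("topologyManagerScope must be 'pod' or 'container'")
    if cfg.get("cpuManagerPolicy", "none") != "static":
        problems.append("cpuManagerPolicy must be 'static'")
    if cfg.get("memoryManagerPolicy", "None") != "Static":
        problems.append("memoryManagerPolicy must be 'Static'")
    return problems


config = {
    "featureGates": {"PodLevelResources": True, "PodLevelResourceManagers": True},
    "topologyManagerPolicy": "single-numa-node",
    "topologyManagerScope": "pod",
    "cpuManagerPolicy": "static",
    "memoryManagerPolicy": "Static",
}
print(missing_prerequisites(config))  # []
```

Keep in mind that the static CPU Manager and Static Memory Manager policies have their own requirements (such as reserved CPUs and reserved memory), which this sketch does not check.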
Observability
To help cluster administrators monitor and debug these new allocation models, we have introduced several new kubelet metrics when the feature gate is enabled:
- resource_manager_allocations_total: Counts the total number of exclusive resource allocations performed by a manager. The source label ("pod" or "node") distinguishes between allocations drawn from the node-level pool versus a pre-allocated pod-level pool.
- resource_manager_allocation_errors_total: Counts errors encountered during exclusive resource allocation, distinguished by the intended allocation source ("pod" or "node").
- resource_manager_container_assignments: Tracks the cumulative number of containers running with specific assignment types. The assignment_type label ("node_exclusive", "pod_exclusive", "pod_shared") provides visibility into how workloads are distributed.
Current limitations and caveats
While this feature opens up new possibilities, there are a few things to keep in mind during its alpha phase. Be sure to review the Limitations and caveats in the official documentation for full details on compatibility, requirements, and downgrade instructions.
Getting started and providing feedback
For a deep dive into the technical details and configuration of this feature, check out the official concept documentation:
To learn more about the overall pod-level resources feature and how to assign resources to pods, see:
As this feature moves through Alpha, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:
Kubernetes v1.36: In-Place Vertical Scaling for Pod-Level Resources Graduates to Beta
Following the graduation of Pod-Level Resources to Beta in v1.34 and the General Availability (GA) of In-Place Pod Vertical Scaling in v1.35, the Kubernetes community is thrilled to announce that In-Place Pod-Level Resources Vertical Scaling has graduated to Beta in v1.36!
This feature is now enabled by default via the InPlacePodLevelResourcesVerticalScaling feature gate. It allows users to update the aggregate Pod resource budget (.spec.resources) for a running Pod, often without requiring a container restart.
Why Pod-level in-place resize?
The Pod-level resource model simplified management for complex Pods (such as those with sidecars) by allowing containers to share a collective pool of resources. In v1.36, you can now adjust this aggregate boundary on-the-fly.
This is particularly useful for Pods where containers do not have individual limits defined. These containers automatically scale their effective boundaries to fit the newly resized Pod-level dimensions, allowing you to expand the shared pool during peak demand without manual per-container recalculations.
Resource inheritance and the resizePolicy
When a Pod-level resize is initiated, the Kubelet treats the change as a resize event for every container that inherits its limits from the Pod-level budget. To determine whether a restart is required, the Kubelet consults the resizePolicy defined within individual containers:
- Non-disruptive Updates: If a container's restartPolicy is set to NotRequired, the Kubelet attempts to update the cgroup limits dynamically via the Container Runtime Interface (CRI).
- Disruptive Updates: If set to RestartContainer, the container will be restarted to apply the new aggregate boundary safely.
Note: Currently, resizePolicy is not supported at the Pod level. The Kubelet always defers to individual container settings to decide if an update can be applied in-place or requires a restart.
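As a rough illustration of that per-container decision, the following Python sketch walks a pod spec (as a plain dict) and reports which containers would restart when a pod-level limit changes. The function name is hypothetical, and two points are assumptions rather than statements about the Kubelet's implementation: a container that sets its own limit is treated as not inheriting from the pod-level budget, and a container with no matching resizePolicy rule is treated as NotRequired.

```python
def containers_needing_restart(pod_spec: dict, resource: str) -> list[str]:
    """Names of containers that would restart when the pod-level `resource` limit changes."""
    restart = []
    for container in pod_spec.get("containers", []):
        own_limits = container.get("resources", {}).get("limits", {})
        if resource in own_limits:
            # Assumption: a container with its own limit does not inherit the
            # pod-level limit, so the pod-level resize is not a resize event for it.
            continue
        policy = "NotRequired"  # assumed default when no rule matches
        for rule in container.get("resizePolicy", []):
            if rule.get("resourceName") == resource:
                policy = rule.get("restartPolicy", policy)
        if policy == "RestartContainer":
            restart.append(container["name"])
    return restart


pod_spec = {
    "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
    "containers": [
        {"name": "main-app",
         "resizePolicy": [{"resourceName": "cpu", "restartPolicy": "NotRequired"}]},
        # Hypothetical setting, chosen to show the disruptive path.
        {"name": "sidecar",
         "resizePolicy": [{"resourceName": "cpu", "restartPolicy": "RestartContainer"}]},
    ],
}
print(containers_needing_restart(pod_spec, "cpu"))  # ['sidecar']
```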
Example: Scaling a shared resource pool
In this scenario, a Pod is defined with a 2 CPU pod-level limit. Because the individual containers do not have their own limits defined, they share this total pool.
1. Initial Pod specification
apiVersion: v1
kind: Pod
metadata:
  name: shared-pool-app
spec:
  resources: # Pod-level limits
    limits:
      cpu: "2"
      memory: "4Gi"
  containers:
  - name: main-app
    image: my-app:v1
    resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]
  - name: sidecar
    image: logger:v1
    resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]
2. The resize operation
To double the CPU capacity to 4 CPUs, apply a patch using the resize subresource:
kubectl patch pod shared-pool-app --subresource resize --patch \
'{"spec":{"resources":{"limits":{"cpu":"4"}}}}'
Node-Level reality: feasibility and safety
Applying a resize patch is only the first step. The Kubelet performs several checks and follows a specific sequence to ensure node stability:
1. The feasibility check
Before admitting a resize, the Kubelet verifies if the new aggregate request fits within the Node's allocatable capacity. If the Node is overcommitted, the resize is not ignored; instead, the PodResizePending condition will reflect a Deferred or Infeasible status, providing immediate feedback on why the "envelope" hasn't grown.
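One way to picture the difference between the two outcomes is the sketch below. It is a simplified model, not the Kubelet's admission logic: the function name and the "Admitted" label are hypothetical, quantities are in millicores, and it assumes that Infeasible means the request can never fit on this node while Deferred means it does not fit right now but could once other Pods free capacity.

```python
def classify_resize(new_request_m: int, node_allocatable_m: int, other_pods_request_m: int) -> str:
    """Classify a pod-level CPU resize request (all values in millicores)."""
    if new_request_m > node_allocatable_m:
        # Larger than the node itself: waiting will not help.
        return "Infeasible"
    if new_request_m > node_allocatable_m - other_pods_request_m:
        # Fits the node in principle, but not alongside what is already requested.
        return "Deferred"
    return "Admitted"  # hypothetical label for "the resize can proceed"


print(classify_resize(4000, 8000, 3000))   # Admitted
print(classify_resize(4000, 8000, 6000))   # Deferred
print(classify_resize(16000, 8000, 0))     # Infeasible
```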
2. Update sequencing
To prevent resource "overshoot", the Kubelet coordinates the cgroup updates in a specific order:
- When Increasing: The Pod-level cgroup is expanded first, creating the "room" before the individual container cgroups are enlarged.
- When Decreasing: The container cgroups are throttled first, and only then is the aggregate Pod-level cgroup shrunk.
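The ordering rule can be sketched in a few lines of Python. This only models the sequence described above; the function name and the step labels are hypothetical, and the real Kubelet applies these changes through cgroups and the CRI rather than returning a list.

```python
def cgroup_update_order(current_limit: int, new_limit: int, containers: list[str]) -> list[str]:
    """Return the order in which cgroups are updated for a pod-level limit change."""
    pod_step = "pod"
    container_steps = [f"container:{name}" for name in containers]
    if new_limit > current_limit:
        # Growing: make room at the pod level first, then enlarge the containers.
        return [pod_step] + container_steps
    if new_limit < current_limit:
        # Shrinking: constrain the containers first, then shrink the pod envelope.
        return container_steps + [pod_step]
    return []  # nothing to do


print(cgroup_update_order(2, 4, ["main-app", "sidecar"]))
# ['pod', 'container:main-app', 'container:sidecar']
print(cgroup_update_order(4, 2, ["main-app", "sidecar"]))
# ['container:main-app', 'container:sidecar', 'pod']
```

In both directions the containers never hold a limit larger than the pod-level cgroup around them, which is what prevents the overshoot.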
Observability: tracking resize status
With the move to Beta, Kubernetes uses Pod Conditions to track the lifecycle of a resize:
- PodResizePending: The spec is updated, but the Node hasn't admitted the change (e.g., due to capacity).
- PodResizeInProgress: The Node has admitted the resize (status.allocatedResources) but the changes aren't yet fully applied to the cgroups (status.resources).
status:
  allocatedResources:
    cpu: "4"
  resources:
    limits:
      cpu: "4"
  conditions:
  - type: PodResizeInProgress
    status: "True"
Constraints and requirements
- cgroup v2 Only: Required for accurate aggregate enforcement.
- CRI Support: Requires a container runtime that supports the UpdateContainerResources CRI call (e.g., containerd v2.0+ or CRI-O).
- Feature Gates: Requires PodLevelResources, InPlacePodVerticalScaling, InPlacePodLevelResourcesVerticalScaling, and NodeDeclaredFeatures.
- Linux Only: Currently exclusive to Linux-based nodes.
What's next?
As we move toward General Availability (GA), the community is focusing on Vertical Pod Autoscaler (VPA) Integration, enabling VPA to issue Pod-level resource recommendations and trigger in-place actuation automatically.
Getting started and providing feedback
We encourage you to test this feature and provide feedback via the standard Kubernetes communication channels:
Announcing Vitess 24
Kubernetes v1.36: Tiered Memory Protection with Memory QoS
On behalf of SIG Node, we are pleased to announce updates to the Memory QoS
feature (alpha) in Kubernetes v1.36. Memory QoS uses the cgroup v2 memory
controller to give the kernel better guidance on how to treat container memory.
It was first introduced in v1.22 and updated in v1.27. In Kubernetes v1.36, we're introducing opt-in memory reservation, tiered
protection by QoS class, observability metrics, and a kernel-version warning for memory.high.
What's new in v1.36
Opt-in memory reservation with memoryReservationPolicy
v1.36 separates throttling from reservation. Enabling the feature gate turns on
memory.high throttling (the kubelet sets memory.high based on
memoryThrottlingFactor, default 0.9), but memory reservation is now controlled
by a separate kubelet configuration field:
- None (default): no memory.min or memory.low is written. Throttling via memory.high still works.
- TieredReservation: the kubelet writes tiered memory protection based on the Pod's QoS class:
Guaranteed Pods get hard protection via memory.min. For example, a
Guaranteed Pod requesting 512 MiB of memory results in:
$ cat /sys/fs/cgroup/kubepods.slice/kubepods-pod6a4f2e3b_1c9d_4a5e_8f7b_2d3e4f5a6b7c.slice/memory.min
536870912
The kernel will not reclaim this memory under any circumstances. If it cannot honor the guarantee, it invokes the OOM killer on other processes to free pages.
Burstable Pods get soft protection via memory.low. For the same 512 MiB
request on a Burstable Pod:
$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b3c7d2e_4f5a_6b7c_9d1e_3f4a5b6c7d8e.slice/memory.low
536870912
The kernel avoids reclaiming this memory under normal pressure, but may reclaim it if the alternative is a system-wide OOM.
BestEffort Pods get neither memory.min nor memory.low. Their memory
remains fully reclaimable.
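The tiering described above boils down to a small mapping from QoS class to cgroup file. The Python sketch below restates it; the function name is hypothetical, it models only which protection value is chosen (not how the kubelet writes it), and skipping Pods without a memory request is an assumption.

```python
MI = 1024 ** 2  # bytes in one MiB


def memory_protection(qos_class: str, memory_request_bytes: int, policy: str = "None") -> dict:
    """Return the cgroup v2 protection setting implied by memoryReservationPolicy."""
    if policy != "TieredReservation" or memory_request_bytes <= 0:
        return {}  # nothing is written for memory.min or memory.low
    if qos_class == "Guaranteed":
        return {"memory.min": memory_request_bytes}  # hard protection
    if qos_class == "Burstable":
        return {"memory.low": memory_request_bytes}  # soft protection
    return {}  # BestEffort stays fully reclaimable


print(memory_protection("Guaranteed", 512 * MI, "TieredReservation"))  # {'memory.min': 536870912}
print(memory_protection("Burstable", 512 * MI, "TieredReservation"))   # {'memory.low': 536870912}
print(memory_protection("Guaranteed", 512 * MI))                       # {}
```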
Comparison with v1.27 behavior
In earlier versions, enabling the MemoryQoS feature gate immediately set memory.min for every container with a memory request. memory.min is a hard reservation that the kernel will not reclaim, regardless of memory pressure.
Consider a node with 8 GiB of RAM where Burstable Pod requests total 7 GiB. In earlier versions, that 7 GiB would be locked as memory.min, leaving little headroom for the kernel, system daemons, or BestEffort workloads and increasing the risk of OOM kills.
With v1.36 tiered reservation, those Burstable requests map to memory.low instead of memory.min. Under normal pressure, the kernel still protects that memory, but under extreme pressure it can reclaim part of it to avoid system-wide OOM. Only Guaranteed Pods use memory.min, which keeps hard reservation lower.
With memoryReservationPolicy in v1.36, you can enable throttling first, observe workload behavior, and opt into reservation when your node has enough headroom.
Observability metrics
Two alpha-stability metrics are exposed on the kubelet /metrics endpoint:
- kubelet_memory_qos_node_memory_min_bytes: Total memory.min across Guaranteed Pods
- kubelet_memory_qos_node_memory_low_bytes: Total memory.low across Burstable Pods
These are useful for capacity planning. If kubelet_memory_qos_node_memory_min_bytes
is creeping toward your node's physical memory, you know hard reservation is
getting tight.
$ curl -sk https://localhost:10250/metrics | grep memory_qos
# HELP kubelet_memory_qos_node_memory_min_bytes [ALPHA] Total memory.min in bytes for Guaranteed pods
kubelet_memory_qos_node_memory_min_bytes 5.36870912e+08
# HELP kubelet_memory_qos_node_memory_low_bytes [ALPHA] Total memory.low in bytes for Burstable pods
kubelet_memory_qos_node_memory_low_bytes 2.147483648e+09
Kernel version check
On kernels older than 5.9, memory.high throttling can trigger the
kernel livelock issue. The bug was fixed
in kernel 5.9. In v1.36, when the feature gate is enabled, the kubelet checks the
kernel version at startup and logs a warning if it is below 5.9. The feature
continues to work — this is informational, not a hard block.
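The comparison itself is simple. The sketch below shows one way to test a kernel release string against 5.9 in Python; it is illustrative only and is not how the kubelet implements the check, and the function name is hypothetical.

```python
import re


def kernel_older_than(release: str, minimum: tuple[int, int] = (5, 9)) -> bool:
    """True if a kernel release string such as '5.4.0-150-generic' is below `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) < minimum


for release in ("5.4.0-150-generic", "5.9.0", "5.10.1", "6.8.0-45-generic"):
    note = "warn: memory.high livelock risk" if kernel_older_than(release) else "ok"
    print(release, note)
```

On a Linux node, `platform.release()` from the standard library returns a string in this form, so the same helper can be used to check a host before turning the feature gate on.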
How Kubernetes maps Memory QoS to cgroup v2
Memory QoS uses four cgroup v2 memory controller interfaces:
- memory.max: hard memory limit, unchanged from previous versions
- memory.min: hard memory protection; with TieredReservation, set only for Guaranteed Pods
- memory.low: soft memory protection, set for Burstable Pods with TieredReservation
- memory.high: memory throttling threshold, unchanged from previous versions
The following table shows how Kubernetes container resources map to cgroup v2
interfaces when memoryReservationPolicy: TieredReservation is configured.
With the default memoryReservationPolicy: None, no memory.min or
memory.low values are set.
| QoS class | memory.min | memory.low | memory.high | memory.max |
| --- | --- | --- | --- | --- |
| Guaranteed | Set to requests.memory (hard protection) | Not set | Not set (requests == limits, so throttling is not useful) | Set to limits.memory |
| Burstable | Not set | Set to requests.memory (soft protection) | Calculated based on formula with throttling factor | Set to limits.memory (if specified) |
| BestEffort | Not set | Not set | Calculated based on node allocatable memory | Not set |
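For the memory.high entries, the formula published with the v1.27 update of this feature was: floor[(requests.memory + memoryThrottlingFactor * (limits.memory or node allocatable memory - requests.memory)) / pageSize] * pageSize. The Python sketch below applies that formula, on the assumption that it still holds in v1.36 since memory.high is described as unchanged. The function name and the 4096-byte page size are illustrative choices.

```python
from typing import Optional

MI = 1024 ** 2
GI = 1024 ** 3


def memory_high(request_bytes: int, limit_bytes: Optional[int], node_allocatable_bytes: int,
                throttling_factor: float = 0.9, page_size: int = 4096) -> int:
    """memory.high per the v1.27 formula; intended for Burstable and BestEffort containers."""
    # Use the container limit if one is set, otherwise the node allocatable memory.
    ceiling = limit_bytes if limit_bytes is not None else node_allocatable_bytes
    raw = request_bytes + throttling_factor * (ceiling - request_bytes)
    # Round down to a whole number of pages.
    return int(raw // page_size) * page_size


# Burstable container: 512 MiB request, 1 GiB limit.
print(memory_high(512 * MI, 1 * GI, 8 * GI))  # 1020051456
```

With the default factor of 0.9, throttling starts 90% of the way from the request to the limit, which is roughly 973 MiB for this container.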
Cgroup hierarchy
cgroup v2 requires that a parent cgroup's memory protection is at least as
large as the sum of its children's. The kubelet maintains this by setting
memory.min on the kubepods root cgroup to the sum of all Guaranteed and
Burstable Pod memory requests, and memory.low on the Burstable QoS cgroup
to the sum of all Burstable Pod memory requests. This way the kernel can
enforce the per-container and per-pod protection values correctly.
The kubelet manages pod-level and QoS-class cgroups directly using the runc libcontainer library, while container-level cgroups are managed by the container runtime (containerd or CRI-O).
How do I use it?
Prerequisites
- Kubernetes v1.36 or later
- Linux with cgroup v2. Kernel 5.9 or higher is recommended; earlier kernels work but may experience the livelock issue. You can verify cgroup v2 is active by running mount | grep cgroup2.
- A container runtime that supports cgroup v2 (containerd 1.6+, CRI-O 1.22+)
Configuration
To enable Memory QoS with tiered protection:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
memoryReservationPolicy: TieredReservation # Options: None (default), TieredReservation
memoryThrottlingFactor: 0.9 # Optional: default is 0.9
If you want memory.high throttling without memory protection, omit
memoryReservationPolicy or set it to None:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
memoryReservationPolicy: None # This is the default
How can I learn more?
- KEP-2570: Memory QoS
- Pod Quality of Service Classes
- Managing Resources for Containers
- Kubernetes cgroups v2 support
- Linux kernel cgroups v2 documentation
Getting involved
This feature is driven by SIG Node. If you are interested in contributing or have feedback, you can find us on Slack (#sig-node), the mailing list, or at the regular SIG Node meetings. Please file bugs at kubernetes/kubernetes and enhancement proposals at kubernetes/enhancements.