
Introducing the UX Research Working Group

Prometheus Blog - Tue, 04/07/2026 - 20:00

Prometheus has always prioritized solving complex technical challenges to deliver a reliable, performant open-source monitoring system. Over time, however, users have expressed a variety of experience-related pain points. Those pain points range from onboarding and configuration to documentation, mental models, and interoperability across the ecosystem.

At PromCon 2025, a user research study was presented that highlighted several of these issues. Although the central area of investigation involved Prometheus and OpenTelemetry workflows, the broader takeaway was clear: Prometheus would benefit from a dedicated, ongoing effort to understand user needs and improve the overall user experience.

Recognizing this, the Prometheus team established a Working Group focused on improving user experience through design and user research. This group is meant to support all areas of Prometheus by bringing structured research, user insights, and usability perspectives into the community's development and decision-making processes.

How we can help Prometheus maintainers

Building something where the user needs are unclear? Maybe you're looking at two competing solutions and you'd like to understand the user tradeoffs alongside the technical ones.

That's where we can be of help.

The UX Working Group will partner with you to conduct user research or provide feedback on your plans for user outreach. That could include:

  • User research reports and summaries
  • User journeys, personas, wireframes, prototypes, and other UX artifacts
  • Recommendations for improving usability, onboarding, interoperability, and documentation
  • Prioritized lists of user pain points
  • Suggestions for community discussions or decision-making topics

To get started, tell us what you're trying to do, and we'll work with you to determine what type and scope of research is most appropriate.

How we can help Prometheus end users

We want to hear from you! Let us know if you're interested in participating in a research study and we'll contact you when we're working on one that's a good fit. Having an issue with the Prometheus user experience? We can help you open an issue and direct it to the appropriate community members.

Interested in helping?

New contributors to the working group are always welcome! Get in touch and let us know what you'd like to work on.

Where to find us

Drop us a message in Slack, join a meeting, or raise an issue in GitHub.

Categories: CNCF Projects

Kubernetes v1.36 Sneak Peek

Kubernetes Blog - Sun, 03/29/2026 - 20:00

Kubernetes v1.36 is coming at the end of April 2026. This release will include removals and deprecations, and it is packed with an impressive number of enhancements. Here are some of the features we are most excited about in this cycle!

Please note that this information reflects the current state of v1.36 development and may change before release.

The Kubernetes API removal and deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release. It will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
  • Beta or pre-release API versions must be supported for 3 releases after the deprecation.
  • Alpha or experimental API versions may be removed in any release without prior deprecation notice; this process can become a withdrawal in cases where a different implementation for the same feature is already in place.

Whether an API is removed as a result of a feature graduating from beta to stable, or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the deprecation guide.

A recent example of this principle in action is the retirement of the ingress-nginx project, announced by SIG Network and the Security Response Committee on March 24, 2026. As stewardship shifts away from the project, the community has been encouraged to evaluate alternative ingress controllers that align with current security and maintenance best practices. This transition reflects the same lifecycle discipline that underpins Kubernetes itself, ensuring continued evolution without abrupt disruption.

Ingress NGINX retirement

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee retired Ingress NGINX on March 24, 2026. Since that date, there have been no further releases, no bugfixes, and no updates to resolve any security vulnerabilities discovered. Existing deployments of Ingress NGINX will continue to function, and installation artifacts like Helm charts and container images will remain available.

For full details, see the official retirement announcement.

Deprecations and removals for Kubernetes v1.36

Deprecation of .spec.externalIPs in Service

The externalIPs field in the Service spec is being deprecated, removing a quick (but risky) way to route traffic for arbitrary external IP addresses to your Services. This field has been a known security liability for years, enabling man-in-the-middle attacks on your cluster traffic, as documented in CVE-2020-8554. From Kubernetes v1.36 onwards, you will see deprecation warnings when using it, with full removal planned for v1.43.

If your Services still lean on externalIPs, consider using LoadBalancer services for cloud-managed ingress, NodePort for simple port exposure, or Gateway API for a more flexible and secure way to handle external traffic.

For more details on this enhancement, refer to KEP-5707: Deprecate service.spec.externalIPs
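As a sketch of the migration, the two manifests below show a Service using the deprecated field and an equivalent cloud-managed replacement. The names, label selector, and IP address are illustrative only:

```yaml
# Before: a Service relying on the deprecated externalIPs field.
apiVersion: v1
kind: Service
metadata:
  name: legacy-web            # hypothetical name for illustration
spec:
  selector:
    app: web
  ports:
    - port: 80
  externalIPs:                # deprecated in v1.36, removal planned for v1.43
    - 203.0.113.10
---
# After: the same Service exposed via a cloud-managed load balancer instead.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer          # the cloud provider allocates the external IP
  selector:
    app: web
  ports:
    - port: 80
```

For on-premises clusters without a cloud load balancer, a NodePort Service or a Gateway API route serves the same purpose.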

Removal of gitRepo volume driver

The gitRepo volume type has been deprecated since v1.11. Starting with Kubernetes v1.36, the gitRepo volume plugin is permanently disabled and cannot be re-enabled. This change protects clusters from a critical security issue where using gitRepo could let an attacker run code as root on the node.

Although gitRepo has been deprecated for years and better alternatives have been recommended, it was still technically possible to use it in previous releases. From v1.36 onward, that path is closed for good, so any existing workloads depending on gitRepo will need to migrate to supported approaches such as init containers or external git-sync style tools.

For more details on this enhancement, refer to KEP-5040: Remove gitRepo volume driver
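A common migration path is an init container that clones the repository into a shared emptyDir volume. The Pod below is a minimal sketch; the image choices and repository URL are illustrative, not prescribed by the KEP:

```yaml
# A Pod that clones a repository with an init container instead of the
# removed gitRepo volume type.
apiVersion: v1
kind: Pod
metadata:
  name: git-clone-demo
spec:
  initContainers:
    - name: clone
      image: alpine/git            # any image with a git binary works
      args: ["clone", "--depth=1", "https://example.com/repo.git", "/workdir"]
      volumeMounts:
        - name: workdir
          mountPath: /workdir
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "ls /workdir && sleep 3600"]
      volumeMounts:
        - name: workdir
          mountPath: /workdir
  volumes:
    - name: workdir
      emptyDir: {}                 # replaces the gitRepo volume
```

For repositories that must stay in sync after startup, a git-sync sidecar accomplishes the same thing continuously.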

The following list of enhancements is likely to be included in the upcoming v1.36 release. This is not a commitment and the release content is subject to change.

Faster SELinux labelling for volumes (GA)

Kubernetes v1.36 makes the SELinux volume mounting improvement generally available. This change replaces recursive file relabeling with the mount option -o context=XYZ, applying the correct SELinux label to the entire volume at mount time. It brings more consistent performance and reduces Pod startup delays on SELinux-enforcing systems.

This feature was introduced as beta in v1.28 for ReadWriteOncePod volumes. In v1.32, it gained metrics and an opt-out option (securityContext.seLinuxChangePolicy: Recursive) to help catch conflicts. Now in v1.36, it reaches stable and applies to all volumes by default, with CSIDrivers opting in via spec.seLinuxMount.

However, we expect this feature to create the risk of breaking changes in future Kubernetes releases, due to the potential for mixing privileged and unprivileged Pods. Setting the seLinuxChangePolicy field and SELinux volume labels on Pods correctly is the responsibility of the Pod author. Developers have that responsibility whether they are writing a Deployment, StatefulSet, DaemonSet, or even a custom resource that includes a Pod template. Being careless with these settings can lead to a range of problems when Pods share volumes.

For more details on this enhancement, refer to KEP-1710: Speed up recursive SELinux label change
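The opt-out mentioned above is a Pod-level field. The sketch below shows a Pod falling back to recursive relabeling so it can share a volume with Pods running under a different SELinux context; the Pod name, image, and PVC are hypothetical:

```yaml
# Opting a Pod out of mount-time relabeling (falls back to the old
# recursive behavior); useful when Pods with different SELinux contexts
# must share the same volume.
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-pod          # illustrative name
spec:
  securityContext:
    seLinuxChangePolicy: Recursive # skip the mount -o context optimization
    seLinuxOptions:
      level: "s0:c123,c456"
  containers:
    - name: app
      image: registry.access.redhat.com/ubi9/ubi-minimal
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: shared-data     # hypothetical PVC
```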

External signing of ServiceAccount tokens

As a beta feature, Kubernetes already supports external signing of ServiceAccount tokens. This allows clusters to integrate with external key management systems or signing services instead of relying only on internally managed keys.

With this enhancement, the kube-apiserver can delegate token signing to external systems such as cloud key management services or hardware security modules. This improves security and simplifies key management for clusters that rely on centralized signing infrastructure. We expect this feature to graduate to stable (GA) in Kubernetes v1.36.

For more details on this enhancement, refer to KEP-740: Support external signing of service account tokens

DRA Driver support for Device taints and tolerations

Kubernetes v1.33 introduced support for taints and tolerations for physical devices managed through Dynamic Resource Allocation (DRA). Normally, any device can be used for scheduling. This enhancement allows DRA drivers to mark devices as tainted, ensuring they will not be used for scheduling. Alternatively, cluster administrators can create a DeviceTaintRule to mark devices that match certain selection criteria (such as all devices of a certain driver) as tainted. This improves scheduling control and helps ensure that specialized hardware resources are only used by workloads that explicitly request them.

In Kubernetes v1.36, this feature graduates to beta with more comprehensive testing complete, making it accessible by default without the need for a feature flag and open to user feedback.

To learn about taints and tolerations, see taints and tolerations.
For more details on this enhancement, refer to KEP-5055: DRA: device taints and tolerations.
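As a rough sketch of the administrator-driven path, a DeviceTaintRule might look like the manifest below. This follows the shape described in KEP-5055; the API version, field names, and the driver name are assumptions and may differ in the beta API:

```yaml
# Taint every device managed by a hypothetical DRA driver so that only
# workloads with a matching toleration can be scheduled onto them.
# Field names follow KEP-5055 and may change as the API graduates to beta.
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: gpu-under-maintenance
spec:
  deviceSelector:
    driver: gpu.example.com        # hypothetical DRA driver name
  taint:
    key: example.com/maintenance
    value: "true"
    effect: NoSchedule             # keep new allocations off these devices
```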

DRA support for partitionable devices

Kubernetes v1.36 expands Dynamic Resource Allocation (DRA) by introducing support for partitionable devices, allowing a single hardware accelerator to be split into multiple logical units that can be shared across workloads. This is especially useful for high-cost resources like GPUs, where dedicating an entire device to a single workload can lead to underutilization.

With this enhancement, platform teams can improve overall cluster efficiency by allocating only the required portion of a device to each workload, rather than reserving it entirely. This makes it easier to run multiple workloads on the same hardware while maintaining isolation and control, helping organizations get more value out of their infrastructure.

To learn more about this enhancement, refer to KEP-4815: DRA Partitionable Devices

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.36 as part of the CHANGELOG for that release.

The Kubernetes v1.36 release is planned for Wednesday, April 22, 2026. Stay tuned for updates!

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Categories: CNCF Projects, Kubernetes

Tekton Becomes a CNCF Incubating Project

CNCF Blog Projects Category - Tue, 03/24/2026 - 04:00

The CNCF Technical Oversight Committee (TOC) has voted to accept Tekton as a CNCF incubating project. 

What is Tekton?

Tekton is a powerful and flexible open source framework for creating continuous integration and delivery (CI/CD) systems that allows developers to build, test, and deploy across multiple cloud providers and on-premises systems by abstracting away the underlying implementation details.

While widely adopted for CI/CD, Tekton serves as a general-purpose, security-minded, Kubernetes-native workflow engine. Its composable primitives (Steps, Tasks and Pipelines) allow developers to orchestrate any type of sequential or parallel workload on Kubernetes. Tekton provides a standard, Kubernetes-native interface for defining these workflows, making them portable and reusable.

Tekton’s Key Milestones

The project has matured into a leading framework for Kubernetes-native CI/CD, reaching its stable v1.0 release for the core Pipelines component.

By joining the CNCF, Tekton aligns itself more closely with the ecosystem it powers. It integrates deeply with other CNCF projects like Argo CD (for GitOps) and SPIFFE/SPIRE (for identity), and also Sigstore via OpenSSF (for signing and verification), creating a robust supply chain security story.

Tekton is widely adopted in the industry and used by companies like Puppet and Ford Motor Company. Additionally, Tekton powers major commercial CI/CD offerings, including, but not limited to: Red Hat OpenShift Pipelines and IBM Cloud Continuous Delivery.

A Message from the Tekton Team

“One of the accomplishments I’m most proud of is the broad adoption of Tekton across open source projects, commercial products, and in-house platforms. Seeing teams rely on it in production and build on it within their own ecosystems has been especially rewarding. As a Kubernetes-native project that integrates naturally with other CNCF technologies, Tekton has benefited from close collaboration within the Cloud Native Computing Foundation community. I’m looking forward to deepening those partnerships, learning from our peers across CNCF projects, and meeting more Tekton users who are shaping what cloud native delivery looks like in practice.”

— Andrea Frittoli, Tekton Governing Board Member

“What I’m most proud of is how Tekton has shown that CI/CD can be a true Kubernetes-native primitive, not just another layer on top. Seeing projects like Shipwright—itself a CNCF project—and Konflux build on Tekton as their foundation validates that vision. Building all of this alongside a diverse, multi-vendor community with Red Hat, Google, IBM, and many individual contributors has been one of the most rewarding open source experiences of my career. I’m looking forward to what comes next. The future of Tekton is Trusted Artifacts changing how tasks share data, a simpler developer experience through Pipelines as Code, and deeper collaboration with CNCF projects like Sigstore and Argo CD. Tekton is fundamentally a Kubernetes project, and CNCF is its natural home.”

— Vincent Demeester, Tekton Governing Board Member

Support from TOC Sponsors

The CNCF Technical Oversight Committee (TOC) provides technical leadership to the cloud native community. It defines and maintains the foundation’s technical vision, approves new projects, and stewards them across maturity levels. The TOC also aligns projects within the overall ecosystem, sets cross-cutting standards and best practices and works with end users to ensure long-term sustainability. As part of its charter, the TOC evaluates and supports projects as they meet the requirements for incubation and continue progressing toward graduation.

“Tekton has proven itself as core infrastructure for Kubernetes-native delivery. Its move to incubation reflects strong multi-vendor governance and deep alignment with CNCF projects focused on GitOps, identity and software supply chain security.”

— Chad Beaudin, TOC Sponsor, Cloud Native Computing Foundation

“Tekton’s composable design and broad adoption make it an important part of the cloud native workflow landscape. The TOC’s vote recognizes a healthy contributor community and a clear roadmap.”
— Jeremy Rickard, TOC Sponsor, Cloud Native Computing Foundation

The Main Components of Tekton

  • Pipelines: The core building blocks (Tasks, Pipelines, Workspaces) for defining CI/CD workflows.
  • Triggers: Allows pipelines to be instantiated based on events (like Git pushes or pull requests).
  • CLI: A command-line interface for interacting with Tekton resources.
  • Dashboard: A web-based UI for visualizing and managing pipelines.
  • Chains: A supply chain security tool that automatically signs and attests artifacts built by Tekton.

Community Highlights

These community metrics signal strong momentum and healthy open source governance. For a CNCF project, this level of engagement builds trust with adopters, ensures long-term sustainability and reflects the collaborative innovation that defines the cloud native ecosystem. Tekton’s notable milestones include:

  • 11,000+ GitHub Stars (across all repositories)
  • 5,000+ Pull Requests
  • 2,500+ Issues
  • 600+ Contributors
  • 1.0 Stable Release of Pipelines

The Future of Tekton

The Tekton roadmap focuses on stability, security and scalability. Key initiatives from the project board and enhancement proposals (TEPs) include:

  • Supply Chain Security: Enhancing Tekton Chains to meet SLSA Level 3 requirements by default, including better provenance for build artifacts.
  • Trusted Artifacts: Introducing a secure and efficient way to pass data between tasks without relying on shared storage (PVCs), significantly improving performance and isolation (TEP-0139). 
  • Concise Syntax: Exploring less verbose syntax for referencing remote tasks and pipelines to improve developer experience (TEP-0154).
  • Advanced Scheduling: Integrating with Kueue for better job queuing and priority management of PipelineRuns.
  • Tekton Results: Moving the Results API to stable to provide long-term history and query capabilities for PipelineRuns and TaskRuns.
  • Catalog Evolution: Transitioning reusable tasks to Artifact Hub for better discoverability and standardized distribution.
  • Pipelines as Code: Continued investment in Git-based workflows, improving the “as code” experience for defining and managing pipelines.

For more details, see the Tekton Project Board and approved TEPs (Tekton Enhancement Proposals).

As a CNCF-hosted project, Tekton is committed to the principles of open source, neutrality and collaboration. We invite global developers and ecosystem partners to join us in advancing Kubernetes-native CI/CD. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.

Categories: CNCF Projects

Fluid Becomes a CNCF Incubating Project

CNCF Blog Projects Category - Tue, 03/24/2026 - 04:00

The CNCF Technical Oversight Committee (TOC) has voted to accept Fluid as a CNCF incubating project. 

What is Fluid?

Kubernetes provides a data access layer through the Container Storage Interface (CSI), enabling workloads to connect to storage systems. However, certain use cases often require additional capabilities such as dataset versioning, access controls, preprocessing, dynamic mounting, and data acceleration.

To help address these needs, Nanjing University, Alibaba Cloud, and the Alluxio community introduced Fluid, a cloud native data orchestration and acceleration system that treats “elastic datasets” as a first-class resource. By adding a data abstraction layer within Kubernetes environments, Fluid enhances data flow and management for data-intensive workloads.

Fluid’s vision is Data Anyway, Anywhere, Anytime:

  • Anyway: Fluid focuses on data accessibility. Storage vendors can flexibly and simply integrate various storage clients without needing deep or extensive knowledge of Kubernetes CSI or Golang programming.
  • Anywhere: Fluid facilitates efficient data access across diverse infrastructure by supporting heterogeneous computing environments (cloud, edge, and serverless). It accelerates access to various storage systems like HDFS, S3, GCS, and CubeFS by utilizing caching engines such as Alluxio, JuiceFS, and Vineyard.
  • Anytime: Runtime dynamic adjustment of data sources allows data scientists to add and remove storage data sources on-demand in Kubernetes environments without service interruption.

Fluid’s Key Milestones and Ecosystem Development

Fluid originated as a joint project from Nanjing University, Alibaba Cloud, and the Alluxio community in September 2020. The project aims to provide efficient, elastic, and transparent data access capabilities for data-intensive AI applications in cloud native environments. In May 2021, Fluid was officially accepted as a CNCF sandbox project.

Since joining the CNCF, Fluid has rapidly grown, continuously releasing multiple important updates, achieving significant breakthroughs in key capabilities such as elastic data cache scaling, unified access to heterogeneous data sources, and application-transparent scheduling, while also improving the operational efficiency of AI and big data workloads on cloud native platforms.

Fluid’s core design concepts and technological innovations have received high-level academic recognition, with related results published in top conferences and journals in the database and computer systems fields, such as IEEE TPDS 2023.

In December 2024, at KubeCon + CloudNativeCon North America, CNCF released the 2024 Technology Landscape Radar Report, where Fluid, along with projects such as Kubeflow, was listed as “Adopt,” becoming one of the de facto standards in the cloud native AI and big data field.

Image of Batch/AI/ML Radar 2024. Fluid was listed as "Adopt"

Now, Fluid has been widely adopted across multiple industries and regions worldwide, with users covering major cloud service providers, internet companies, and vertical technology companies. Some Fluid users include Xiaomi, Alibaba Group, NetEase, China Telecom, Horizon, Weibo, Bilibili, 360, Zuoyebang, Inceptio Technology, Huya, OPPO, Unisound, DP Technology, JoinQuant, among others. Use cases cover a wide range of application scenarios, including, but not limited to, Artificial Intelligence Generated Content (AIGC), large models, big data, hybrid cloud, cloud-based development machine management, and autonomous driving data simulation.

A Word from the Maintainers

“We are deeply honored to see Fluid promoted to an incubating project. Our original intention in initiating Fluid was to fill the gap between compute and storage in cloud native architectures, allowing data to flow freely in the cloud like ‘fluid.’ The vibrant community development and widespread user adoption validate our vision. We will continue to drive the evolution of cloud native data orchestration technology, especially when it comes to exploring intelligent scheduling and orchestration of KVCache for large model inference scenarios and dedicating ourselves to making data serve various applications more efficiently and intelligently.”

— Gu Rong (Nanjing University), Chair and Co-Founder of the Fluid Community

“From sandbox to incubation, the concept of ‘caches also needing elasticity’ has gained widespread recognition. In the future, we will continue to drive Fluid toward becoming the standard for cloud native data orchestration, allowing data scientists to focus on model innovation.”

— Che Yang (Alibaba Cloud), Fluid Community Maintainer and Co-Founder

“Fluid is a key bridge to connecting AI computing frameworks and distributed storage systems. Seeing Fluid grow from a sandbox to an incubating project makes us extremely proud. This milestone proves that building a standardized data abstraction layer on Kubernetes is keeping up with industry trends.”

— Fan Bin (Alluxio Inc.), Alluxio Open Source Community Maintainer

Support from TOC Sponsors

The TOC provides technical leadership to the cloud native community. It defines and maintains the foundation’s technical vision, approves new projects, and stewards them across maturity levels. The TOC also aligns projects within the overall ecosystem, sets cross-cutting standards and best practices, and works with end users to ensure long-term sustainability. As part of its charter, the TOC evaluates and supports projects as they meet the requirements for incubation and continue progressing toward graduation.

“Fluid’s progression to incubation reflects both its technical maturity and the clear demand we’re seeing for stronger data orchestration in cloud native environments. As AI and data-intensive workloads continue to grow on Kubernetes, projects like Fluid help bridge compute and storage in a way that is practical, scalable, and community-driven. The TOC looks forward to supporting the project’s continued evolution within the CNCF ecosystem.”

— Alex Chircop, CNCF TOC Member

“Fluid has demonstrated a strong level of maturity that aligns well with CNCF Incubation expectations. Adopter interviews showcase that Fluid has been deployed successfully in large-scale production environments for several years and provides standardized APIs that enable multiple applications to efficiently access and cache diverse datasets. Additionally, Fluid benefits from a healthy, engaged community, with a roadmap clearly shaped by adopter feedback.”

— Katie Gamanji, CNCF TOC Member

Main Components in Fluid

  • Dataset Controller: Responsible for dataset abstraction and management, maintaining the binding relationship and status between data and underlying storage.
  • Application Scheduler: Schedules application Pods to the most suitable nodes by taking data cache location into account.
  • Runtime Plugins: Pluggable runtime interface responsible for deployment, configuration, scaling, and failure recovery of specific caching engines (such as Alluxio, JuiceFS, Vineyard, etc.), with excellent extensibility.
  • Webhook: Utilizes the Mutating Admission Webhook mechanism to automatically inject sidecar or volume mount information into application pods, achieving zero intrusion into applications.
  • CSI Plugin: Enables lightweight, transparent dataset mounting support for application pods, enabling them to access cached or remote data via local file system paths.

Image showing the components of Fluid
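How these components fit together can be sketched with a Dataset bound to a caching runtime. The manifests below follow Fluid's Alluxio-backed pattern; the resource names, bucket path, and cache sizes are illustrative assumptions, not values from the announcement:

```yaml
# A Dataset backed by an S3-style bucket, accelerated by an AlluxioRuntime
# with an in-memory cache tier. Names, bucket, and quotas are illustrative.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-data
spec:
  mounts:
    - mountPoint: s3://example-bucket/training-data   # hypothetical source
      name: training-data
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo-data            # must match the Dataset name to bind to it
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM      # cache hot data in memory
        path: /dev/shm
        quota: 2Gi
```

Once the Dataset and runtime are bound, application Pods mount the dataset like any other volume, and the Webhook and CSI Plugin handle the injection transparently.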

Community Highlights

These community metrics signal strong momentum and healthy open source governance. For a CNCF project, this level of engagement builds trust with adopters, ensures long-term sustainability, and reflects the collaborative innovation that defines the cloud native ecosystem.

  • 1.9k GitHub Stars
  • 116 Pull Requests
  • 250 Issues
  • 979 Contributors
  • 28 Releases

The Journey Continues

Becoming a CNCF incubating project is a turning point in Fluid’s journey. Fluid will continue to deepen its data orchestration capabilities for generative AI and big data scenarios. To meet the exponential growth demands of GenAI applications, Fluid’s next goal is to evolve into an intelligent elastic data platform, allowing users to focus on model innovation and data value mining while Fluid handles the underlying data distribution, cache acceleration, resource management, and elastic scaling.

As a CNCF incubating project, Fluid will continue to uphold the principles of open source, neutrality, and collaboration, working together with global developers and ecosystem partners to enable data to flow and be efficiently used freely anywhere, anytime.

Hear from Users

“Fluid’s Anytime capability allows our data scientists to self-service data switching without restarting Pods, truly achieving data agility. This is the core reason we chose Fluid over a self-built solution.”

— Liu Bin, Technical Lead at DP Technology

“Fluid’s vendor neutrality and cross-namespace cache sharing capabilities help us avoid cloud vendor lock-in and save approximately 40% in cross-cloud bandwidth costs. It has been deeply integrated into all of our data workflows.”

— Zhao Ming, Head of Horizon AI Platform

“In LLM model inference, remote Safetensors file reading often leads to low I/O utilization. Fluid’s intelligent prefetching and local caching technology allows us to fully saturate bandwidth without modifying code, fully unleashing GPU computing power.”

— Zhang Xiang, Head of NetEase MaaS

As a CNCF-hosted project, Fluid is committed to the principles of open source, neutrality and collaboration. We invite global developers and ecosystem partners to join us in enabling data to flow and be efficiently used freely anywhere, anytime. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.

Categories: CNCF Projects

Cloud Native Computing Foundation Announces Kyverno’s Graduation

CNCF Blog Projects Category - Tue, 03/24/2026 - 04:00

Kyverno reaches graduation after demonstrating broad enterprise adoption as platform teams embrace declarative governance

Key Highlights:

  • Kyverno graduates from the Cloud Native Computing Foundation after demonstrating production readiness and strong adoption.
  • Kyverno’s declarative policy-as-code solution makes it easier for platform and security teams to define and enforce guardrails across Kubernetes and cloud native environments.
  • Since joining CNCF in 2020, the Kyverno community has grown significantly, expanding from 574 GitHub stars to more than 9,000 and attracting contributors and end users worldwide.

KUBECON + CLOUDNATIVECON EUROPE, AMSTERDAM, The Netherlands – March 24, 2026 – The Cloud Native Computing Foundation® (CNCF®), which builds sustainable ecosystems for cloud native software, today announced the graduation of Kyverno, a Kubernetes-native policy engine that enables organizations to define, manage and enforce policy-as-code across cloud native environments.


Originally created by Nirmata and contributed to the CNCF in 2020, Kyverno (which means “to govern” in Greek) has achieved the highest maturity level after demonstrating widespread production adoption and significant community growth. The project’s declarative policy-as-code solution makes it easier for platform and security teams to define and enforce guardrails across Kubernetes and cloud native environments.

“Kyverno’s graduation highlights how important policy-as-code has become for organizations running cloud native in production at scale,” said Chris Aniszczyk, CTO of CNCF. “The project makes it easier for platform teams to enforce governance and security practices using familiar Kubernetes constructs, and the strong community behind Kyverno shows how critical this capability is across the ecosystem.”

Since joining the CNCF, Kyverno has experienced exponential growth and adoption across the Kubernetes ecosystem. The project has grown from 574 to more than 9,000 GitHub stars, and Kyverno continues to attract a growing number of contributors and end users worldwide. Today, Kyverno helps platform and security teams enforce policy, security and operational guardrails across some of the world’s largest Kubernetes environments. Organizations such as Bloomberg, Coinbase, Deutsche Telekom, Groww, LinkedIn, Spotify, Vodafone and Wayfair publicly rely on Kyverno to help secure and manage their Kubernetes platforms.

The project offers multiple ways for organizations to integrate policy management into their workflows, including running as a Kubernetes admission controller, command-line interface (CLI), container image or software development kit (SDK). While Kyverno began as a Kubernetes-native admission controller, it has evolved into a broader policy engine used across the cloud native stack. Declarative policies can now be applied to a wide range of payloads and enforcement points. It integrates deeply with the broader CNCF ecosystem and is commonly used alongside projects such as Argo CD, Backstage, Flux and Kubernetes to help platform teams implement policy-driven governance as part of modern GitOps and platform engineering practices.
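A small example gives a feel for Kyverno's declarative style: policies are ordinary Kubernetes resources, with no separate policy language to learn. The ClusterPolicy below requires a label on Pods; the label key and message are illustrative choices, not from the announcement:

```yaml
# Require a 'team' label on every Pod; Enforce mode rejects admissions
# that do not match the pattern. Label key and message are illustrative.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-team
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label 'team' is required on all Pods."
        pattern:
          metadata:
            labels:
              team: "?*"         # any non-empty value
```

In Audit mode, the same policy would report violations without blocking admissions, which is a common first step when rolling out guardrails.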

To achieve graduation, Kyverno successfully completed a third party security audit and a comprehensive security assessment led by CNCF TAG Security & Compliance. The project also passed a formal governance review, demonstrating mature open source practices. Further, the community introduced contributor guidelines addressing the responsible use of AI-assisted development tools.

The CNCF Technical Oversight Committee (TOC) provides technical leadership to the cloud native community, defining its vision and stewarding projects through maturity levels up to graduation. Kyverno’s graduation was supported by TOC sponsor Karena Angell, who conducted thorough technical due diligence.

“Graduation is reserved for projects that demonstrate strong governance, sustained community growth and widespread production use,” said Karena Angell, chair of the Technical Oversight Committee, CNCF. “Kyverno met that bar through its technical maturity, security posture and the growing number of organizations relying on it to manage policy across Kubernetes environments.”

With its latest release, Kyverno has fully adopted Common Expression Language (CEL), aligning with the future direction of Kubernetes admission controls for improved performance and enhanced expressiveness. Upcoming releases will focus on extending policy enforcement to additional control points across the cloud native stack, including support for artificial intelligence and Model Context Protocol (MCP) gateways. These innovations will help organizations apply policy-as-code consistently across infrastructure, applications and emerging AI-driven workloads.
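As a hedged sketch of what CEL-based validation looks like, a Kyverno rule can embed CEL expressions directly. The policy below is illustrative; the field names follow the `validate.cel` subrule available in recent Kyverno releases, and the replica limit is an arbitrary example.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: limit-deployment-replicas   # hypothetical example policy
spec:
  validationFailureAction: Enforce
  rules:
    - name: max-replicas
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        cel:
          expressions:
            # CEL evaluates against the incoming object, mirroring
            # Kubernetes ValidatingAdmissionPolicy semantics
            - expression: "object.spec.replicas <= 5"
              message: "Deployments may not exceed 5 replicas."
```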

“As AI adoption accelerates, policy-as-code provides the essential guardrails for autonomous governance at scale without stifling innovation,” said Jim Bugwadia, Kyverno co-creator and CEO of Nirmata. “We built Kyverno to champion developer agility and self-service, and we are honored by its massive growth and success within the CNCF ecosystem.”

Learn more about Kyverno and join the community: https://kyverno.io 

Supporting Quotes

“Kyverno has become a core part of how I help platform teams take control of their Kubernetes environments. What used to require manual intervention and custom scripts is now policy-as-code that teams can own without learning a separate language. For organisations running Kubernetes at scale, Kyverno’s graduation reflects what I’ve seen firsthand – it’s production-ready, battle-tested and it makes platform teams faster.” 

– Steve Wade, Founder at Platform Fix and Ex-Technical Advisory Board Member at Cisco

“At Deutsche Telekom, Kyverno has played an important role in helping our platform teams implement Kubernetes-native policy management in a scalable and developer-friendly way. Its declarative approach to policy enforcement allows us to embed security, compliance and operational best practices directly into our Kubernetes environments without adding unnecessary complexity for application teams. The project’s strong community, rapid innovation and focus on usability have made Kyverno a valuable tool for organizations operating Kubernetes at scale. We’re excited to see the project reach this stage and look forward to its continued growth in the cloud native ecosystem.” 

– Mamta Bharti, VP of Engineering at Deutsche Telekom

“Kyverno has become a critical component of LinkedIn’s Kubernetes admission control pipeline, enforcing consistent security and configuration policies across 230+ clusters with 500K+ nodes. Its YAML-native approach means our platform teams can author and maintain policies without learning a new language. Kyverno has proven its reliability at enterprise scale, handling over 20K admission requests per minute under stress without degradation.” 

– Shan Velleru, Senior Software Engineer at LinkedIn

About Cloud Native Computing Foundation

Cloud native computing empowers organizations to build and run scalable applications with an open source software stack in public, private, and hybrid clouds. The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure, including Kubernetes, Prometheus, and Envoy. CNCF brings together the industry’s top developers, end users, and vendors and runs the largest open source developer conferences in the world. Supported by nearly 800 members, including the world’s largest cloud computing and software companies, as well as over 200 innovative startups, CNCF is part of the nonprofit Linux Foundation. For more information, please visit www.cncf.io.

###

The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see our trademark usage page. Linux is a registered trademark of Linus Torvalds.

Media Contact

Haley White

The Linux Foundation

[email protected] 

Categories: CNCF Projects

CNCF and SlashData Report Finds Platform Engineering Tools Maturing as Organizations Prepare for AI-Driven Infrastructure

CNCF Blog Projects Category - Tue, 03/24/2026 - 04:00

New CNCF Technology Radar survey shows which cloud native tools developers view as mature and ready for broad adoption

Key Highlights:

  • CNCF and SlashData release findings from the Q1 2026 CNCF Technology Radar survey based on responses from more than 400 professional developers.
  • CNCF and SlashData’s new report highlights which cloud native platform engineering tools surveyed developers view as mature, useful and ready for broad adoption.
  • Helm, Backstage and kro are the three technologies placed in the ‘Adopt’ position of the application delivery technology radar, based on survey responses.
  • Hybrid platform approaches are emerging as the dominant model for AI workflows, reflecting how organizations are adapting existing developer platforms to support AI workloads.

AMSTERDAM, KUBECON + CLOUDNATIVECON EUROPE– March 24, 2026 – The Cloud Native Computing Foundation®  (CNCF®), which builds sustainable ecosystems for cloud native software, released new findings from the Q1 2026 CNCF Technology Radar report with SlashData, uncovering how developers are evaluating platform engineering technologies for workflow automation, application delivery, security and compliance management. 

The survey findings provide an overview of how cloud native teams select internal platform tooling as organizations scale application delivery and prepare infrastructure for artificial intelligence (AI) workloads and increasingly automated development environments.

“Cloud native platforms have reached a point where developers are not just experimenting but standardizing on CNCF projects that make software delivery reliable at scale,” said Chris Aniszczyk, CTO, CNCF. “What’s especially notable about this research is how organizations are extending those same platforms to support AI workloads, showing how cloud native is the base layer powering the next era of applications.”

Platform Engineering Shapes AI Workflow Strategies


The report explores how organizations structure internal developer platforms (IDPs) and how these decisions influence their approach to AI workflows.

  • 28% of organizations report having a dedicated platform engineering team responsible for internal platforms. 
  • The most common IDP model, reported by 41% of organizations, is multi-team collaboration for managing platform capabilities. 
  • 35% of organizations report using a hybrid platform to integrate AI workloads, combining existing developer platforms with specialized AI tooling. 

These survey findings suggest that many organizations are integrating AI capabilities directly into their cloud native platforms, rather than creating entirely new infrastructure stacks.

Workflow Automation Tools Show Strong Developer Confidence

In the workflow automation category, developers identify several technologies as reliable options for production environments, placing Argo CD, Karmada, Buildpacks, GitHub Actions, and Jenkins in the ‘Adopt’ category.

  • GitHub Actions received high recommendations across maturity and usefulness metrics, with 91% of developers claiming that they would recommend it to peers. 
  • Jenkins demonstrated strong maturity scores, reflecting its long-standing role in CI/CD.
  • Developers gave Karmada and other newer tools high maturity ratings. Karmada achieved the highest usefulness rating among workflow automation tools.

The report also highlights that emerging tools are attracting developer interest, even as they continue to mature, suggesting strong developer enthusiasm for multicluster management solutions despite the perception that the technology is still evolving.

Security and Compliance Tooling Becomes Core Platform Infrastructure

According to the survey findings, security and compliance technologies are emerging as core components of modern developer platforms. Developers placed cert-manager, Keycloak, and Open Policy Agent (OPA) in the ‘Adopt’ category.

  • cert-manager received the highest maturity ratings, with 87% of developers rating it four to five stars for stability and reliability.
  • Tools addressing emerging areas such as software supply chain security are gaining attention but remain early in their maturity cycle. For example, in-toto and Sigstore showed lower maturity ratings with little negative sentiment.

These findings suggest that developers are still evaluating how these solutions fit into their development pipelines.

Application Delivery Platforms Continue to Standardize

In the application delivery category, Backstage, Helm, and kro were placed in the ‘Adopt’ position, reflecting strong developer confidence in these projects.

  • Helm received the highest maturity ratings among application delivery tools, with 94% of developers giving it four- or five-star ratings for reliability and stability.
  • Helm’s widespread usage across the ecosystem reinforces its role as a foundational component of Kubernetes application deployment.
  • Backstage and kro performed strongly in usefulness ratings. 

These findings indicate continued developer demand for tools that simplify Kubernetes complexity and improve developer experience across internal platforms.

“Developers are increasingly evaluating tools based on how well they fit into their internal platform architectures,” said Liam Bollmann-Dodd, principal market research consultant at SlashData. “What we see in this data is those technologies gaining traction are the ones that are reducing operational friction while enabling teams to standardize application delivery and management.”

Methodology

In Q4 2025, more than 400 professional developers using cloud native technologies were surveyed about their experiences with workflow automation, application delivery and security and compliance management tools. Respondents evaluated technologies they were familiar with based on their maturity, usefulness and the likelihood of recommending them.


About Cloud Native Computing Foundation

Cloud native computing empowers organizations to build and run scalable applications with an open source software stack in public, private, and hybrid clouds. The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure, including Kubernetes, Prometheus, and Envoy. CNCF brings together the industry’s top developers, end users, and vendors and runs the largest open source developer conferences in the world. Supported by nearly 800 members, including the world’s largest cloud computing and software companies, as well as over 200 innovative startups, CNCF is part of the nonprofit Linux Foundation. For more information, please visit www.cncf.io.

About SlashData

SlashData is an analyst firm with more than 20 years of experience in the software industry, working with the top Tech brands. SlashData helps platform and engineering leaders make better product, marketing and strategy decisions through best-in-class research, benchmarks, and foresight into how developers, tools, and software are changing. 

###

The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see our trademark usage page. Linux is a registered trademark of Linus Torvalds.

Media Contact

Haley White

The Linux Foundation

[email protected] 

Categories: CNCF Projects

Welcome llm-d to the CNCF: Evolving Kubernetes into SOTA AI infrastructure

CNCF Blog Projects Category - Tue, 03/24/2026 - 03:45

We are thrilled to announce that llm-d has officially been accepted as a Cloud Native Computing Foundation (CNCF) Sandbox project!

As generative AI transitions from research labs to production environments, platform engineering teams are facing a new frontier of infrastructure challenges. llm-d is joining the CNCF to lead the evolution of Kubernetes and the broader CNCF landscape into State of the Art (SOTA) AI infrastructure, treating distributed inference as a first-class cloud native workload. By joining the CNCF, llm-d secures the trusted stewardship and open governance of the Linux Foundation, giving organizations the confidence to build upon a truly neutral standard.

Launched in May 2025 as a collaborative effort between Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA, llm-d was founded with a clear vision: any model, any accelerator, any cloud. The project was joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI and university supporters at the University of California, Berkeley, and the University of Chicago. 

“At Mistral AI, we believe that optimizing inference goes beyond just the engine, and requires solving challenges like KV cache management and disaggregated serving to support next-generation models such as Mixture of Experts (MoE). Open collaboration on these issues is essential to building flexible, future-proof infrastructure. We’re supporting this effort by contributing to the llm-d ecosystem, including the development of a DisaggregatedSet operator for LeaderWorkerSet (LWS), to help advance open standards for AI serving.” – Mathis Felardos, Inference Software Engineer, Mistral AI

What llm-d brings to the CNCF landscape

The CNCF is the natural home for solving complex workload orchestration challenges. AI serving is highly stateful and latency-sensitive, with request costs varying dramatically based on prompt length, cache locality, and model phase. Traditional service routing and autoscaling mechanisms are unaware of this inference state, leading to inefficient placement, cache fragmentation, and unpredictable latency under load. llm-d solves this by providing a pre-integrated, Kubernetes-native distributed inference framework that bridges the gap between high-level control planes (like KServe) and low-level inference engines (like vLLM). llm-d plans to work with the CNCF AI Conformance program to ensure critical capabilities like disaggregated serving are interoperable across the ecosystem.

By building on open APIs and extensible gateway primitives, llm-d introduces several critical capabilities to the CNCF ecosystem:

  • Inference-Aware Traffic Management: Acting as a primary implementation of the Kubernetes Gateway API Inference Extension (GAIE), llm-d utilizes the Endpoint Picker (EPP) for programmable, prefix-cache-aware routing.
  • Native Kubernetes Orchestration: Leveraging primitives like LeaderWorkerSet (LWS), llm-d orchestrates complex multi-node replicas and wide expert parallelism, transforming bespoke AI infrastructure into manageable cloud native microservices.
  • Prefill/Decode Disaggregation: llm-d addresses the resource-utilization asymmetry between prompt processing and token generation by disaggregating these phases into independently scalable pods.

  • Advanced State Management: The project introduces hierarchical KV cache offloading across GPU, TPU, CPU, and storage tiers.
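As a rough sketch of how this looks in practice, an HTTPRoute can send traffic to an InferencePool instead of a plain Service, letting the Endpoint Picker choose the best vLLM pod per request. Resource names here are hypothetical, and the InferencePool API shown is an alpha version subject to change.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama           # hypothetical pool of vLLM pods
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama
  extensionRef:
    name: endpoint-picker    # the EPP doing cache-aware scheduling
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway   # hypothetical Gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama
```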

Figure: Kubernetes flow chart showing the inference gateway route to selected pods, body-based routing, the inference scheduler, and the variant autoscaler (featuring NVIDIA, Google, AMD, and Intel nodes).

SOTA inference performance on any accelerator

A core tenet of the cloud native philosophy is preventing vendor lock-in. For AI infrastructure, this means serving capabilities must be hardware-agnostic. 

We believe that democratizing SOTA inference with an accelerator-neutral mindset is the most important enabler for broad LLM adoption. The primary mission of llm-d is to achieve SOTA inference performance on any accelerator. By introducing model- and state-aware routing policies that align request placement with specific hardware characteristics, llm-d maximizes utilization and delivers measurable gains in critical inference metrics such as Time to First Token (TTFT), Time Per Output Token (TPOT), token throughput, and KV cache utilization. Whether you are running workloads on accelerators from NVIDIA, AMD, or Google, llm-d ensures that high-performance AI serving remains a core, composable capability of your stack.

Crucially, clear benchmarks that prove the value of these optimizations are core to the project. The AI industry often lacks standard, reproducible ways to measure inference performance, relying instead on marketing claims or commercial analysts. llm-d aims to be the neutral, de facto standard for defining and running inference benchmarks through rigorous, open benchmarking. For example, in a ‘Multi-tenant SaaS’ use case, shared customer contexts enable significant computational savings through prefix caching. As demonstrated in the most recent v0.5 release, llm-d’s inference scheduling maintains near-zero latency and massive throughput compared to a baseline Kubernetes service:


Figure 1: TTFT and throughput vs QPS on Qwen3-32B (8×vLLM pods, 16×NVIDIA H100). llm-d inference scheduling maintains near-zero TTFT and scales to ~120k tok/s, while baseline Kubernetes service degrades rapidly under load.

Bridging cloud native and AI native ecosystems

To build the ultimate AI infrastructure, we must bridge the gap between Kubernetes orchestration and frontier AI research. llm-d is actively building deep relationships with AI/ML leaders at large foundation model builders and AI natives, along with traditional enterprises that are rapidly integrating AI throughout their organizations.  Furthermore, we are committed to increasing collaboration with the PyTorch Foundation to ensure a seamless, end-to-end open ecosystem that connects model development and training directly to distributed cloud native serving.

Get involved: Follow the “well-lit paths”

At its core, llm-d follows a “well-lit paths” philosophy. Instead of leaving platform teams to piece together fragile black boxes, llm-d provides validated, production-ready deployment patterns—benchmarked recipes tested end-to-end under realistic load.

We invite developers, platform engineers, and AI researchers to join us in shaping the future of open AI infrastructure:

  • Explore the Well-Lit Paths: Visit the llm-d guides to start deploying SOTA inference stacks on your infrastructure today.
  • Learn More: Check out the official website at llm-d.ai.
  • Contribute: Join the community on Slack and get involved in our GitHub repositories at https://github.com/llm-d/

Welcome to the CNCF, llm-d! We look forward to building the future of AI infrastructure together.

Categories: CNCF Projects

Beyond Batch: Volcano Evolves into the AI-Native Unified Scheduling Platform

CNCF Blog Projects Category - Mon, 03/23/2026 - 04:00

The world of AI workloads is changing fast. A few years ago, “AI on Kubernetes” mostly meant running long training jobs. Today, with the rise of Large Language Models (LLMs), the focus has shifted to include complex inference services and Autonomous Agents. The industry consensus, backed by CNCF’s latest Annual Cloud Native Survey, is clear: Kubernetes has evolved to become the essential platform for intelligent systems. This shift from traditional training jobs to real-time inference and agents is transforming cloud native infrastructure.

This shift creates new challenges:

  • Complex Inference Demands: Serving LLMs requires high-performance GPU resources and sophisticated management to control costs and latency.
  • Distinct Agent Requirements: AI Agents introduce “bursty” traffic patterns, requiring instant startup times and state preservation—capabilities not natively optimized in Kubernetes.

The Volcano community is responding to these needs. With the release of Volcano v1.14, Kthena v0.3.0, and the new AgentCube, Volcano is transforming from a batch computing tool into a Full-Scenario, AI-Native Unified Scheduling Platform.

1. Volcano v1.14: Breaking Limits on Scale and Speed

As clusters expand and workloads diversify, scheduler bottlenecks can degrade performance. Volcano v1.14 introduces a major architectural evolution to address this.

Scalable Multi-Scheduler Architecture

Traditional setups often rely on static resource division, leading to wasted capacity. Volcano v1.14 introduces a Sharding Controller that dynamically calculates resource pools for different schedulers (Batch, Agent, etc.) in real-time.

  • Key Benefit: Enables running latency-sensitive Agent tasks alongside massive training jobs on the same cluster without resource contention, ensuring high cluster utilization and cost efficiency.

High-Throughput Agent Scheduling

Standard Kubernetes scheduling often struggles with the high churn rate of AI Agents. The new Agent Scheduler (Alpha) in v1.14 provides a high-performance fast path designed specifically for short-lived, high-concurrency tasks.

Enhanced Resource Efficiency

To optimize infrastructure costs, v1.14 adds support for generic Linux OSs (Ubuntu, CentOS) and democratizes enterprise features like CPU Throttling and Memory QoS. Additionally, native support for Ascend vNPU maximizes the utilization of diverse AI hardware.

2. Kthena v0.3.0: Efficient and Scalable LLM Serving

The CNCF survey has identified AI inference as the next major cloud native workload, representing the bulk of long-term cost, value, and complexity. Kthena v0.3.0 directly addresses this challenge, introducing a specialized Data Plane and Control Plane architecture to solve the speed and cost balance for serving large models.

Optimized Prefill-Decode Disaggregation

Separating “Prefill” and “Decode” phases improves efficiency but introduces heavy cross-node traffic.

  • Key Benefit: Kthena leverages Network Topology Awareness to co-locate interdependent tasks (e.g., on the same switch). Combined with a Smart Router that recognizes KV-Cache and LoRA adapters, it ensures requests are routed with minimal latency and maximum throughput.

Simplified Deployment with ModelBooster

Deploying large models typically involves managing fragmented Kubernetes resources.

  • Key Benefit: The new ModelBooster feature offers a declarative, one-stop deployment experience. Users define the model intent once, and Kthena automates the provisioning and lifecycle management of all underlying resources, significantly reducing operational complexity.

Cost-Efficient Heterogeneous Autoscaling

Running LLMs exclusively on top-tier GPUs can be cost-prohibitive.

  • Key Benefit: Kthena’s autoscaler supports Heterogeneous Scaling, allowing the mixing of different hardware types (e.g., high-end vs. cost-effective GPUs) within strict budget constraints, optimizing the balance between performance and expenditure.

3. AgentCube: Serverless Infrastructure for AI Agents

While Kubernetes provides a solid infrastructure foundation, it lacks specific primitives for AI Agents. AgentCube bridges this gap with specialized capabilities.

Instant Startup via Warm Pools

Agents require immediate responsiveness that standard container startup times cannot match.

  • Key Benefit: AgentCube utilizes a Warm Pool of lightweight MicroVM sandboxes. This mechanism reduces startup latency from seconds to milliseconds, delivering the snappy experience users expect.

Native Session Management

AI Agents require state persistence across multi-turn interactions, unlike typical stateless microservices.

  • Key Benefit: Built-in Session Management automatically routes conversations to the correct context, seamlessly enabling stateful interactions within a stateless Kubernetes environment.

Serverless Abstraction

Developers need to focus on agent logic rather than server management.

  • Key Benefit: AgentCube provides a streamlined API for requesting secure environments (like Code Interpreters). It handles the entire lifecycle—secure creation, execution, and automated recycling—offering a true serverless experience.

Conclusion

Volcano has evolved beyond batch jobs. With v1.14, Kthena, and AgentCube, we now provide a comprehensive platform for the entire AI lifecycle—from training foundation models to serving them at scale to powering the next generation of intelligent agents.

By embracing cloud native principles to deliver scalable, reliable infrastructure for the AI lifecycle, Volcano is contributing to the community’s goal of ensuring AI workloads behave predictably at scale. As organizations seek consistent and portable AI infrastructure (a concept championed by initiatives like the Kubernetes AI Conformance Program), Volcano is positioning itself as a core component of that solution.

We invite you to explore these new features and join us in building the future of AI infrastructure.

If you are attending KubeCon + CloudNativeCon Europe, we encourage you to stop by our booth, P-14A, in the Project Pavilion to say hi and learn more about the latest updates.

Categories: CNCF Projects

Metal3 at KubeCon + CloudNativeCon Europe 2026: Meet the CNCF’s Freshly Incubated Bare Metal Project

CNCF Blog Projects Category - Mon, 03/23/2026 - 04:00

Metal3 (pronounced “metal cubed”) entered 2026 as one of the newest incubating projects in the CNCF. As the foundational layer for infrastructure management in self-hosted Kubernetes clouds, Metal3 and its ‘stack’ offer essential solutions for cloud service providers, AI-focused distributed systems, edge cloud deployments, and telecom infrastructure. Given the increasing investment in compute infrastructure worldwide, Metal3 addresses a growing number of issues faced by the modern IT industry.

From the start, Metal3 set the ambitious goal of becoming the primary tool for Kubernetes bare metal cluster management across the broader cloud native ecosystem. Real-world feedback is necessary to achieve this, and the community remains committed to increasing the project’s visibility and adoption. Metal3 is at the forefront of automated bare metal lifecycle management and the community is aiming to assist others in achieving the same level of success.

If you’re attending, KubeCon + CloudNativeCon Europe is the perfect opportunity to get better acquainted with Metal3, ask questions, and connect with maintainers and community members. This year’s conference will be Metal3’s most active yet, with a record number of talks and touchpoints for anyone interested in learning about the project.

A packed Metal3 presence at KubeCon + CloudNativeCon Europe

Metal3 has organized a packed presence at the conference, offering a variety of opportunities for attendees to engage with the project. For a quick overview, a concise project status update will be delivered during the lightning talk. For those interested in deeper engagement, there are two in-depth sessions: one on the project’s governance and path to CNCF Incubation, and one on a real-world adoption use case from the Sylva Project. Additionally, you can meet maintainers and community members for questions and hallway-track conversations at the Metal3 kiosk on the Solutions Showcase floor.

Lightning talk

The first event of the week, a lightning talk, will take place on Monday, 23 March. In classic Metal3 fashion, the community will share a quick status report of the Metal3 project, focusing on future plans toward graduation and beyond, along with highlights of major developments on the roadmap.

If you’re new to Metal3, this session is a great entry point; it’s short, focused, and gives you the “what’s happening” overview you need before you take a deeper dive.

Two in-depth sessions: governance and adoption

In addition to the lightning talk, community members will be presenting two more in-depth sessions around Metal3 governance and adoption.

1) Metal3.io’s Path to CNCF Incubation: Governance, Processes, and Community

Presented by Metal3 maintainers, this session focuses on Metal3’s journey from CNCF Sandbox to Incubation through the lens of governance, processes, and community building.

Be sure to attend if you’re interested in:

  • How Metal3 is run as an open-source project
  • What changed (or matured) during incubation readiness
  • How decisions are made and contributions flow

2) Beyond the Cloud: Managing Bare Metal the Kubernetes Way Using Metal3.io: Sylva Project as a Use Case

This talk approaches Metal3 from the viewpoint of an adopter. The hosts will explain the operational reality and practical use cases of a telco project and Metal3’s role.

Don’t miss this session if you care about:

  • What adopting Metal3 looks like in practice
  • The value proposition of Kubernetes-native bare metal lifecycle management
  • Lessons learned and patterns from real usage in a telco project

Visit the Metal3 kiosk

You can also meet maintainers and community members at the Metal3 kiosk P-21B on the Solutions Showcase floor, from Tuesday, 24 March, to the morning of Thursday, 26 March. This is a great opportunity to connect directly with the people building and operating the project. Whether you have technical queries about implementation, operational questions about running Metal3 in production, governance-related inquiries about its CNCF journey, or if you are simply curious about the project’s future, the kiosk is one of the easiest ways to get answers and context quickly.

Join the conversation

Whether you’re attending KubeCon + CloudNativeCon Europe to learn, evaluate, contribute to, or compare approaches for managing the lifecycle of bare metal Kubernetes, this event is shaping up to be a key moment for Metal3. 

Stop by the kiosk, catch the lightning talk, and join one (or both!) of the longer sessions.

The community is eager to meet users and contributors and to discuss the future of bare metal Kubernetes. Whether you are already running Metal3 in production or just starting to explore, the community welcomes your input as an adopter, operator, or contributor and invites you to share your use cases and feedback. Learn more about how to get involved at: https://metal3.io/contribute.html 

See you at the conference!

Categories: CNCF Projects

Announcing Ingress2Gateway 1.0: Your Path to Gateway API

Kubernetes Blog - Fri, 03/20/2026 - 15:00

With the Ingress-NGINX retirement scheduled for March 2026, the Kubernetes networking landscape is at a turning point. For most organizations, the question isn't whether to migrate to Gateway API, but how to do so safely.

Migrating from Ingress to Gateway API is a fundamental shift in API design. Gateway API provides a modular, extensible API with strong support for Kubernetes-native RBAC. By contrast, the Ingress API is simple, and implementations such as Ingress-NGINX extend it through esoteric annotations, ConfigMaps, and CRDs. Migrating away from an Ingress controller such as Ingress-NGINX presents the daunting task of capturing all of its nuances and mapping that behavior to Gateway API.

Ingress2Gateway is an assistant that helps teams confidently move from Ingress to Gateway API. It translates Ingress resources/manifests along with implementation-specific annotations to Gateway API while warning you about untranslatable configuration and offering suggestions.

Today, SIG Network is proud to announce the 1.0 release of Ingress2Gateway. This milestone represents a stable, tested migration assistant for teams ready to modernize their networking stack.

Ingress2Gateway 1.0

Ingress-NGINX annotation support

The main improvement for the 1.0 release is more comprehensive Ingress-NGINX support. Before the 1.0 release, Ingress2Gateway only supported three Ingress-NGINX annotations. For the 1.0 release, Ingress2Gateway supports over 30 common annotations (CORS, backend TLS, regex matching, path rewrite, etc.).

Comprehensive integration testing

Each supported Ingress-NGINX annotation, and representative combinations of common annotations, is backed by controller-level integration tests that verify the behavioral equivalence of the Ingress-NGINX configuration and the generated Gateway API. These tests exercise real controllers in live clusters and compare runtime behavior (routing, redirects, rewrites, etc.), not just YAML structure.

The tests:

  • spin up an Ingress-NGINX controller
  • spin up multiple Gateway API controllers
  • apply Ingress resources that have implementation-specific configuration
  • translate Ingress resources to Gateway API with ingress2gateway and apply generated manifests
  • verify that the Gateway API controllers and the Ingress controller exhibit equivalent behavior.

A comprehensive test suite not only catches bugs in development, but also ensures the correctness of the translation, especially given surprising edge cases and unexpected defaults, so that you don't find out about them in production.

Notification & error handling

Migration is not a "one-click" affair. Surfacing subtleties and untranslatable behavior is as important as translating supported configuration. The 1.0 release cleans up the formatting and content of notifications, so it is clear what is missing and how you can fix it.

Using Ingress2Gateway

Ingress2Gateway is a migration assistant, not a one-shot replacement. Its goal is to

  • migrate supported Ingress configuration and behavior
  • identify unsupported configuration and suggest alternatives
  • help you reevaluate, and potentially discard, undesirable configuration

The rest of this section shows you how to safely migrate the following Ingress-NGINX configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "1G"
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1"
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "Request-Id: $req_id";
  name: my-ingress
  namespace: my-ns
spec:
  ingressClassName: nginx
  rules:
  - host: my-host.example.com
    http:
      paths:
      - backend:
          service:
            name: website-service
            port:
              number: 80
        path: /users/(\d+)
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - my-host.example.com
    secretName: my-secret

1. Install Ingress2Gateway

If you have a Go environment set up, you can install Ingress2Gateway with

go install github.com/kubernetes-sigs/[email protected]

Otherwise,

brew install ingress2gateway

You can also download the binary from GitHub or build from source.

2. Run Ingress2Gateway

You can pass Ingress manifests to Ingress2Gateway, or have the tool read directly from your cluster.

# Pass it files
ingress2gateway print --input-file my-manifest.yaml,my-other-manifest.yaml --providers=ingress-nginx > gwapi.yaml
# Use a namespace in your cluster
ingress2gateway print --namespace my-api --providers=ingress-nginx > gwapi.yaml
# Or your whole cluster
ingress2gateway print --providers=ingress-nginx --all-namespaces > gwapi.yaml

Note:

You can also pass --emitter <agentgateway|envoy-gateway|kgateway> to output implementation-specific extensions.

3. Review the output

This is the most critical step. The commands from the previous section output a Gateway API manifest to gwapi.yaml, and they also emit warnings that explain what did not translate exactly and what to review manually.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: nginx
  namespace: my-ns
spec:
  gatewayClassName: nginx
  listeners:
  - hostname: my-host.example.com
    name: my-host-example-com-http
    port: 80
    protocol: HTTP
  - hostname: my-host.example.com
    name: my-host-example-com-https
    port: 443
    protocol: HTTPS
    tls:
      certificateRefs:
      - group: ""
        kind: Secret
        name: my-secret
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: my-ingress-my-host-example-com
  namespace: my-ns
spec:
  hostnames:
  - my-host.example.com
  parentRefs:
  - name: nginx
    port: 443
  rules:
  - backendRefs:
    - name: website-service
      port: 80
    filters:
    - cors:
        allowCredentials: true
        allowHeaders:
        - DNT
        - Keep-Alive
        - User-Agent
        - X-Requested-With
        - If-Modified-Since
        - Cache-Control
        - Content-Type
        - Range
        - Authorization
        allowMethods:
        - GET
        - PUT
        - POST
        - DELETE
        - PATCH
        - OPTIONS
        allowOrigins:
        - '*'
        maxAge: 1728000
      type: CORS
    matches:
    - path:
        type: RegularExpression
        value: (?i)/users/(\d+).*
    name: rule-0
    timeouts:
      request: 10s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: my-ingress-my-host-example-com-ssl-redirect
  namespace: my-ns
spec:
  hostnames:
  - my-host.example.com
  parentRefs:
  - name: nginx
    port: 80
  rules:
  - filters:
    - requestRedirect:
        scheme: https
        statusCode: 308
      type: RequestRedirect

Ingress2Gateway successfully translated some annotations into their Gateway API equivalents. For example, the nginx.ingress.kubernetes.io/enable-cors annotation was translated into a CORS filter. But upon closer inspection, the nginx.ingress.kubernetes.io/proxy-{read,send}-timeout and nginx.ingress.kubernetes.io/proxy-body-size annotations do not map perfectly. The logs explain these omissions, as well as the reasoning behind each translation.

┌─ WARN ────────────────────────────────────────
│ Unsupported annotation nginx.ingress.kubernetes.io/configuration-snippet
│ source: INGRESS-NGINX
│ object: Ingress: my-ns/my-ingress
└─
┌─ INFO ────────────────────────────────────────
│ Using case-insensitive regex path matches. You may want to change this.
│ source: INGRESS-NGINX
│ object: HTTPRoute: my-ns/my-ingress-my-host-example-com
└─
┌─ WARN ────────────────────────────────────────
│ ingress-nginx only supports TCP-level timeouts; i2gw has made a best-effort translation to Gateway API timeouts.request. Please verify that this meets your needs. See documentation: https://gateway-api.sigs.k8s.io/guides/http-timeouts/
│ source: INGRESS-NGINX
│ object: HTTPRoute: my-ns/my-ingress-my-host-example-com
└─
┌─ WARN ────────────────────────────────────────
│ Failed to apply my-ns.my-ingress.metadata.annotations."nginx.ingress.kubernetes.io/proxy-body-size" from my-ns/my-ingress: Most Gateway API implementations have reasonable body size and buffering defaults
│ source: STANDARD_EMITTER
│ object: HTTPRoute: my-ns/my-ingress-my-host-example-com
└─
┌─ WARN ────────────────────────────────────────
│ Gateway API does not support configuring URL normalization (RFC 3986, Section 6). Please check if this matters for your use case and consult implementation-specific details.
│ source: STANDARD_EMITTER
└─

There is a warning that Ingress2Gateway does not support the nginx.ingress.kubernetes.io/configuration-snippet annotation. You will have to check your Gateway API implementation documentation to see if there is a way to achieve equivalent behavior.

The tool also notified us that Ingress-NGINX regex matches are case-insensitive prefix matches, which is why there is a match pattern of (?i)/users/(\d+).*. Most organizations will want to change this behavior to be an exact case-sensitive match by removing the leading (?i) and the trailing .* from the path pattern.

Ingress2Gateway made a best-effort translation from the nginx.ingress.kubernetes.io/proxy-{send,read}-timeout annotations to a 10 second request timeout in our HTTP route. If requests for this service should be much shorter, say 3 seconds, you can make the corresponding changes to your Gateway API manifests.

Also, nginx.ingress.kubernetes.io/proxy-body-size does not have a Gateway API equivalent, and was thus not translated. However, most Gateway API implementations have reasonable defaults for maximum body size and buffering, so this might not be a problem in practice. Further, some emitters might offer support for this annotation through implementation-specific extensions. For example, adding the --emitter agentgateway, --emitter envoy-gateway, or --emitter kgateway flag to the previous ingress2gateway print command would have resulted in additional implementation-specific configuration in the generated Gateway API manifests that attempted to capture the body size configuration.

We also see a warning about URL normalization. Gateway API implementations such as Agentgateway, Envoy Gateway, Kgateway, and Istio have some level of URL normalization, but the behavior varies across implementations and is not configurable through standard Gateway API. You should check and test the URL normalization behavior of your Gateway API implementation to ensure it is compatible with your use case.

To match Ingress-NGINX default behavior, Ingress2Gateway also added a listener on port 80 and an HTTP request redirect filter that redirects HTTP traffic to HTTPS. If you do not want to serve HTTP traffic at all, remove the listener on port 80 and the corresponding HTTPRoute.

Caution:

Always thoroughly review the generated output and logs.

After manually applying these changes, the Gateway API manifests might look as follows.

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: nginx
  namespace: my-ns
spec:
  gatewayClassName: nginx
  listeners:
  - hostname: my-host.example.com
    name: my-host-example-com-https
    port: 443
    protocol: HTTPS
    tls:
      certificateRefs:
      - group: ""
        kind: Secret
        name: my-secret
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: my-ingress-my-host-example-com
  namespace: my-ns
spec:
  hostnames:
  - my-host.example.com
  parentRefs:
  - name: nginx
    port: 443
  rules:
  - backendRefs:
    - name: website-service
      port: 80
    filters:
    - cors:
        allowCredentials: true
        allowHeaders:
        - DNT
        ...
        allowMethods:
        - GET
        ...
        allowOrigins:
        - '*'
        maxAge: 1728000
      type: CORS
    matches:
    - path:
        type: RegularExpression
        value: /users/(\d+)
    name: rule-0
    timeouts:
      request: 3s

4. Verify

Now that you have Gateway API manifests, you should thoroughly test them in a development cluster. In this case, you should at least double-check that your Gateway API implementation's maximum body size defaults are appropriate for you and verify that a three-second timeout is enough.

After validating behavior in a development cluster, deploy your Gateway API configuration alongside your existing Ingress. We strongly suggest that you then gradually shift traffic using weighted DNS, your cloud load balancer, or traffic-splitting features of your platform. This way, you can quickly recover from any misconfiguration that made it through your tests.

Finally, when you have shifted all your traffic to your Gateway API controller, delete your Ingress resources and uninstall your Ingress controller.

Conclusion

The Ingress2Gateway 1.0 release is just the beginning, and we hope that you use Ingress2Gateway to safely migrate to Gateway API. As we approach the March 2026 Ingress-NGINX retirement, we invite the community to help us increase our configuration coverage, expand testing, and improve UX.

Resources about Gateway API

The scope of Gateway API can be daunting. Here are some resources to help you work with Gateway API:

Categories: CNCF Projects, Kubernetes

Running Agents on Kubernetes with Agent Sandbox

Kubernetes Blog - Fri, 03/20/2026 - 14:00

The landscape of artificial intelligence is undergoing a massive architectural shift. In the early days of generative AI, interacting with a model was often treated as a transient, stateless function call: a request that spun up, executed for perhaps 50 milliseconds, and terminated.

Today, the world is witnessing AI v2 eating AI v1. The ecosystem is moving from short-lived, isolated tasks to deploying multiple, coordinated AI agents that run constantly. These autonomous agents need to maintain context, use external tools, write and execute code, and communicate with one another over extended periods.

As platform engineering teams look for the right infrastructure to host these new AI workloads, one platform stands out as the natural choice: Kubernetes. However, mapping these unique agentic workloads to traditional Kubernetes primitives requires a new abstraction.

This is where the new Agent Sandbox project (currently in development under SIG Apps) comes into play.

The Kubernetes advantage (and the abstraction gap)

Kubernetes is the de facto standard for orchestrating cloud-native applications precisely because it solves the challenges of extensibility, robust networking, and ecosystem maturity. However, as AI evolves from short-lived inference requests to long-running, autonomous agents, we are seeing the emergence of a new operational pattern.

Unlike typical stateless, replicated services, AI agents are isolated, stateful, singleton workloads. They act as a digital workspace or execution environment for an LLM. An agent needs a persistent identity and a secure scratchpad for writing and executing (often untrusted) code. Crucially, because these long-lived agents are expected to be mostly idle except for brief bursts of activity, they require a lifecycle that supports mechanisms like suspension and rapid resumption.

While you could theoretically approximate this by stringing together a StatefulSet of size 1, a headless Service, and a PersistentVolumeClaim for every single agent, managing this at scale becomes an operational nightmare.

Because of these unique properties, traditional Kubernetes primitives don't perfectly align.

Introducing Kubernetes Agent Sandbox

To bridge this gap, SIG Apps is developing agent-sandbox. The project introduces a declarative, standardized API specifically tailored for singleton, stateful workloads like AI agent runtimes.

At its core, the project introduces the Sandbox CRD. It acts as a lightweight, single-container environment built entirely on Kubernetes primitives, offering:

  • Strong isolation for untrusted code: When an AI agent generates and executes code autonomously, security is paramount. The Sandbox custom resource natively supports different runtimes, like gVisor or Kata Containers. This provides the necessary kernel and network isolation required for multi-tenant, untrusted execution.
  • Lifecycle management: Unlike traditional web servers optimized for steady, stateless traffic, an AI agent operates as a stateful workspace that may be idle for hours between tasks. Agent Sandbox supports scaling these idle environments to zero to save resources, while ensuring they can resume exactly where they left off.
  • Stable identity: Coordinated multi-agent systems require stable networking. Every Sandbox is given a stable hostname and network identity, allowing distinct agents to discover and communicate with each other seamlessly.
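To make this concrete, a Sandbox object might look roughly like the following. This is a hypothetical sketch: the API group, version, and field names are assumptions based on the description above, so check the agent-sandbox repository for the current schema before relying on it.

```yaml
# Hypothetical sketch -- apiVersion and field names are assumptions,
# not the project's final API.
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: research-agent
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor          # stronger isolation for untrusted code
      containers:
      - name: agent
        image: example.com/agent-runtime:latest   # placeholder image
```

The key idea is that a single declarative object stands in for the StatefulSet, headless Service, and PersistentVolumeClaim you would otherwise wire together by hand.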

Scaling agents with extensions

Because the AI space is moving incredibly quickly, we built an Extensions API layer that enables even faster iteration and development.

Starting a new pod adds about a second of overhead. That's perfectly fine when deploying a new version of a microservice, but when an agent is invoked after being idle, a one-second cold start breaks the continuity of the interaction. It forces the user or the orchestrating service to wait for the environment to provision before the model can even begin to think or act. SandboxWarmPool solves this by maintaining a pool of pre-provisioned Sandbox pods, effectively eliminating cold starts. Users or orchestration services can simply issue a SandboxClaim against a SandboxTemplate, and the controller immediately hands over a pre-warmed, fully isolated environment to the agent.
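As an illustrative sketch of the claim flow described above (the API group and field names are assumptions, not the extension's exact schema), an orchestrator might claim a pre-warmed environment like this:

```yaml
# Hypothetical sketch -- resource and field names are assumptions.
apiVersion: agents.x-k8s.io/v1alpha1
kind: SandboxClaim
metadata:
  name: task-42
spec:
  templateRef:
    name: python-repl        # a pre-registered SandboxTemplate backed by a warm pool
```

On admission, the controller would bind the claim to an already-provisioned Sandbox from the SandboxWarmPool instead of scheduling a cold pod.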

Quick start

Ready to try it yourself? You can install the Agent Sandbox core components and extensions directly into your learning or sandbox cluster, using your chosen release.

We recommend you use the latest release as the project is moving fast.

# Replace "vX.Y.Z" with a specific version tag (e.g., "v0.1.0") from
# https://github.com/kubernetes-sigs/agent-sandbox/releases
export VERSION="vX.Y.Z"

# Install the core components:
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml

# Install the extensions components (optional):
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml

# Install the Python SDK (optional):
# Create a virtual Python environment
python3 -m venv .venv
source .venv/bin/activate
# Install from PyPI
pip install k8s-agent-sandbox

Once installed, you can try out the Python SDK for AI agents or deploy one of the ready-to-use examples to see how easy it is to spin up an isolated agent environment.

The future of agents is cloud native

Whether it’s a 50-millisecond stateless task, or a multi-week, mostly-idle collaborative process, extending Kubernetes with primitives designed specifically for isolated stateful singletons allows us to leverage all the robust benefits of the cloud-native ecosystem.

The Agent Sandbox project is open source and community-driven. If you are building AI platforms, developing agentic frameworks, or are interested in Kubernetes extensibility, we invite you to get involved:

Categories: CNCF Projects, Kubernetes

Crossplane and AI: The case for API-first infrastructure

CNCF Blog Projects Category - Fri, 03/20/2026 - 07:00

AI-assisted development has changed the way engineers create and commit code. But writing code is no longer the bottleneck. The bottleneck is everything that happens after git push.

Infrastructure provisioning, policy enforcement, day-two operations, drift detection, compliance, and cross-team coordination still require multiple manual steps, and no new tool will fix that. This is an architecture problem. AI needs APIs, not UIs, and most platforms still aren’t built that way.

Current platforms

Talk to almost any organization, and you’ll hear that the desired state lives in Git, while the actual state lives in cloud providers. Policies are buried in pipeline configs. Organizational knowledge exists in wikis no one reads and in engineers who eventually leave.

This has worked up to now because humans worked with humans to navigate the context switching and informal coordination required to get the job done. People fill in the gaps, ask the questions, and translate intent across systems.

But in a world where AI agents are embedded into our organizations, this workflow breaks down. The agent hits a wall, not because it lacks capability, but because the platform wasn’t built for programmatic access. It was built for humans who can compensate for inconsistency.

Agents require a unified, structured, machine-readable interface. They need explicit governance rules, readable historical patterns, and discoverable dependencies. Without that structure, autonomy stalls.

[Image: “One API, Everything the agent needs” — a comparison of human engineer activities with AI agent activities, for example “checks Slack” (human) versus “calls platform API” (agent).]

Platforms built on declarative control

Kubernetes introduced a simple but powerful control pattern that changes this entirely. Every resource follows a consistent schema:

apiVersion: example.crossplane.io/v1
kind: Database
metadata:
  name: user-db
spec:
  engine: postgres
  storage: 100Gi

Desired state lives in spec, actual state is reflected in status, and controllers observe the difference and reconcile continuously. That reconciliation is consistent and automatic; no human is required to coordinate convergence.

Crossplane extends this model beyond containers to all infrastructure and applications: cloud databases, object storage, networking, SaaS systems, clusters, and custom platform APIs. The result isn’t just infrastructure-as-code. It’s your entire platform, infrastructure, and applications as a single API. That difference matters.

The three core elements that make this work in practice:

  • Desired state: the declarative specification of what we think the world should be. (Example: The frontend service should have 3 replicas with 2 GB of memory each.)
  • Actual state: the operational reality of what exists in the infrastructure. (Example: The frontend service has 2 healthy replicas, 1 pending.)
  • Policy: the rules and governance that constrain operations. (Example: Production changes require approval between 9 AM and 5 PM PST.)

Controllers continuously reconcile desired state with actual state, and policy is enforced at execution rather than left to manual review. Context becomes part of the system, not something external to it.

Why this model works for agents

An AI agent interacting with a Crossplane-managed platform doesn’t need to orchestrate workflows across multiple systems. It interacts with a single API surface.

It can discover resource types via the Kubernetes API, inspect status fields for real-time operational state, watch resources for change events, and submit declarative intent. Since reconciliation handles mechanical execution, agents don’t need to coordinate step-by-step logic; they just declare intent and let controllers handle convergence.
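For example, an agent could read back the Database object defined earlier and find the actual state recorded alongside the declared intent. The status fields shown here are illustrative rather than Crossplane’s exact output:

```yaml
apiVersion: example.crossplane.io/v1
kind: Database
metadata:
  name: user-db
spec:                 # desired state, as declared
  engine: postgres
  storage: 100Gi
status:               # actual state, written back by the controller (illustrative)
  conditions:
  - type: Ready
    status: "True"
    reason: Available
```

The agent never issues imperative provisioning steps; it only compares spec and status and, if needed, submits a new desired state.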

This separation of concerns is critical. Controllers handle mechanics, while agents focus on higher-level reasoning. Without a control plane, agents become fragile orchestrators. With one, they become declarative participants.

When the entire platform is accessible through a single, consistent API, the agent has everything it needs. No Slack messages and no tribal knowledge required.

Policy at the point of execution

In fragmented platforms, governance follows lots of procedures: reviews, tickets, Slack threads. In a Kubernetes-native control plane, governance is architectural.

RBAC controls who can act. Admission controllers validate changes before they’re persisted. Policy engines such as OPA and Kyverno enforce constraints at runtime. Crossplane compositions encode organizational patterns directly into APIs. Every change flows through the same enforcement path, no hidden approval steps, no undocumented exception paths.
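As a concrete sketch of runtime policy enforcement, a Kyverno ClusterPolicy could constrain the hypothetical Database resource from the earlier example so that only approved engines are ever admitted, for humans and agents alike:

```yaml
# Sketch: reject any Database whose engine is not on the approved list.
# The Database kind is the illustrative resource from earlier in this post.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: approved-database-engines
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-engine
    match:
      any:
      - resources:
          kinds:
          - Database
    validate:
      message: "Only postgres and mysql engines are approved."
      pattern:
        spec:
          engine: "postgres | mysql"
```

Because the check runs at admission, a non-compliant request is rejected before it is persisted; there is no out-of-band review step for an agent to miss.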

This removes ambiguity for agents entirely. The system defines what is allowed. Agents operate within clearly defined boundaries, and the platform enforces them automatically.

Crossplane 2.0: Full-stack control

With Crossplane 2.0, compositions can include any Kubernetes resource, not just managed infrastructure. That means a single composite API can provision infrastructure, deploy applications, configure networking, set up observability, and define operational workflows, all in one place.

apiVersion: platform.acme.io/v1
kind: Microservice
metadata:
  namespace: team-api
  name: user-service
spec:
  image: acme/user-service:v1.2.3
  database:
    engine: postgres
    size: medium
  ingress:
    subdomain: users

Behind that abstraction may live RDS instances, security groups, deployments, services, ingress rules, and monitoring resources. To a human developer or an AI agent, it’s a single API. That consistency is what enables automation to scale safely.

Day-two operations follow the same pattern. Crossplane’s Operation types bring declarative control to scheduled upgrades, backups, maintenance, and event-driven automation:

apiVersion: ops.crossplane.io/v1alpha1
kind: CronOperation
metadata:
  name: weekly-db-maintenance
spec:
  schedule: "0 2 * * 0"
  operationTemplate:
    spec:
      pipeline:
        - step: upgrade
          functionRef:
            name: function-database-upgrade

Operational workflows are now first-class API objects. Agents can inspect them, trigger them, observe their status, and propose modifications. No need for hidden runbooks.

Where to start

This doesn’t require a start-from-scratch migration. Bring core infrastructure under declarative control first. Your existing resources don’t need to be replaced; they just need to be unified behind a consistent API.

For teams using AI-assisted development, engineers express intent and iterate quickly as tools accelerate implementation. As deployment decouples from release, with changes shipping behind feature flags and systems reconciling toward the desired state, the platform must be deterministic and self-correcting, not reliant on someone catching drift or running the right command at the right time.

That is what a declarative control plane provides. Crossplane ensures that intent has somewhere safe, structured, and deterministic to land. Without it, AI will always be bolted onto human-centric workflows. With it, agents become first-class participants in infrastructure operations.

And that starts with a consistent API. Get started by checking out the Crossplane Docs, attending a community meeting, or watching CNCF’s Cloud Native Live on Crossplane 2.0 – AI-Driven Control Loops for Platform Engineering.

Categories: CNCF Projects

Announcing etcd-operator v0.2.0

etcd Blog - Thu, 03/19/2026 - 20:00

Introduction

Today, we are excited to announce the release of etcd-operator v0.2.0! This release brings important new features and improvements that enhance security, reliability, and operability for managing etcd clusters.

New Features

Certificate Management

Version 0.2.0 introduces built-in certificate management to secure all TLS communication:

  • Between etcd members (inter-member communication)
  • Between clients and etcd members

TLS is only configured when explicitly enabled by the user. Once enabled, etcd-operator automatically provisions and manages certificates based on the selected provider.
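As a purely illustrative sketch of opting in (the resource kind and every field name under spec here are assumptions, not the operator’s actual v0.2.0 schema — consult the etcd-operator API reference for the real one), enabling managed TLS might look something like:

```yaml
# Hypothetical sketch -- field names are assumptions, not the real schema.
apiVersion: operator.etcd.io/v1alpha1
kind: EtcdCluster
metadata:
  name: example
spec:
  size: 3
  version: v3.6.9
  tls:                 # TLS is opt-in; omitting this block leaves it disabled
    provider: auto     # assumed: the operator provisions and rotates certificates
```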

Categories: CNCF Projects

March 20 Security Release Patches Auth Vulnerabilities

etcd Blog - Thu, 03/19/2026 - 20:00

SIG-etcd released updates 3.6.9, 3.5.28, and 3.4.42 today. These patch releases fix several vulnerabilities which allow unauthorized users to bypass authentication or authorization controls that are part of etcd Auth using the gRPC API.

These vulnerabilities do not affect etcd as a part of the Kubernetes control plane. They only affect etcd clusters in other contexts, specifically clusters with Auth enabled, where it is required for access control in untrusted or partially trusted networks, or with untrusted users.

Categories: CNCF Projects

Securing Production Debugging in Kubernetes

Kubernetes Blog - Wed, 03/18/2026 - 14:00

During production debugging, the fastest route is often broad access such as cluster-admin (a ClusterRole that grants administrator-level access), shared bastions/jump boxes, or long-lived SSH keys. It works in the moment, but it comes with two common problems: auditing becomes difficult, and temporary exceptions have a way of becoming routine.

This post offers my recommendations for good practices applicable to existing Kubernetes environments with minimal tooling changes:

  • Least privilege with RBAC
  • Short-lived, identity-bound credentials
  • An SSH-style handshake model for cloud native debugging

A good architecture for securing production debugging workflows is a just-in-time secure shell gateway (often deployed as an on-demand pod in the cluster). It acts as an SSH-style “front door” that makes temporary access actually temporary. Engineers authenticate with short-lived, identity-bound credentials and establish a session to the gateway, which uses the Kubernetes API and RBAC to control what they can do, such as pods/log, pods/exec, and pods/portforward. Sessions expire automatically, and both the gateway logs and Kubernetes audit logs capture who accessed what and when, without shared bastion accounts or long-lived keys.

1) Using an access broker on top of Kubernetes RBAC

RBAC controls who can do what in Kubernetes. Many Kubernetes environments rely primarily on RBAC for authorization, although Kubernetes also supports other authorization modes such as Webhook authorization. You can enforce access directly with Kubernetes RBAC, or put an access broker in front of the cluster that still relies on Kubernetes permissions under the hood. In either model, Kubernetes RBAC remains the source of truth for what the Kubernetes API allows and at what scope.

An access broker adds controls that RBAC does not cover well. For example, it can decide whether a request is auto-approved or requires manual approval, whether a user can run a command, and which commands are allowed in a session. It can also manage group membership so that you grant permissions to groups instead of individual users. Kubernetes RBAC can allow actions such as pods/exec, but it cannot restrict which commands run inside an exec session.

With that model, Kubernetes RBAC defines the allowed actions for a user or group (for example, an on-call team in a single namespace). I recommend you only define access rules that grant rights to groups or to ServiceAccounts, never to individual users. The broker or identity provider then adds or removes users from that group as needed.

The broker can also enforce extra policy on top, like which commands are permitted in an interactive session and which requests can be auto-approved versus require manual approval. That policy can live in a JSON or XML file and be maintained through code review, so updates go through a formal pull request and are reviewed like any other production change.
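A hypothetical broker policy file might look like the following. The schema here is illustrative only: real brokers each define their own format, and the group name, command list, and field names are assumptions.

```json
{
  "rules": [
    {
      "group": "oncall-payments",
      "namespace": "payments",
      "autoApprove": ["pods/log", "pods/portforward"],
      "requireApproval": ["pods/exec"],
      "allowedCommands": ["cat", "ls", "curl", "jq"]
    }
  ]
}
```

Keeping this file in version control means that widening access (say, adding a command to allowedCommands) leaves the same review trail as any other production change.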

Example: a namespaced on-call debug Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: oncall-debug
  namespace: <namespace>
rules:
  # Discover what’s running
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]

  # Read logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]

  # Interactive debugging actions
  - apiGroups: [""]
    resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]

  # Understand rollout/controller state
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]

  # Optional: allow kubectl debug ephemeral containers
  - apiGroups: [""]
    resources: ["pods/ephemeralcontainers"]
    verbs: ["update"]

Bind the Role to a group (rather than individual users) so membership can be managed through your identity provider:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: oncall-debug
  namespace: <namespace>
subjects:
  - kind: Group
    name: oncall-<team-name>
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: oncall-debug
  apiGroup: rbac.authorization.k8s.io

2) Short-lived, identity-bound credentials

The goal is to use short-lived, identity-bound credentials that clearly tie a session to a real person and expire quickly. These credentials can include the user’s identity and the scope of what they’re allowed to do. They’re typically signed using a private key that stays with the engineer, such as a hardware-backed key (for example, a YubiKey), so they cannot be forged without access to that key.

You can implement this with Kubernetes-native authentication (for example, client certificates or an OIDC-based flow), or have the access broker from the previous section issue short-lived credentials on the user’s behalf. In many setups, Kubernetes still uses RBAC to enforce permissions based on the authenticated identity and groups/claims. If you use an access broker, it can also encode additional scope constraints in the credential and enforce them during the session, such as which cluster or namespace the session applies to and which actions (or approved commands) are allowed against pods or nodes. In either case, the credentials should be signed by a certificate authority (CA), and that CA should be rotated on a regular schedule (for example, quarterly) to limit long-term risk.

Option A: short-lived OIDC tokens

A lot of managed Kubernetes clusters already give you short-lived tokens. The main thing is to make sure your kubeconfig refreshes them automatically instead of copying a long-lived token into the file.

For example:

users:
- name: oncall
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1
      command: cred-helper
      args: ["--cluster=prod", "--ttl=30m"]
      interactiveMode: Never  # required for the v1 exec credential API

Option B: Short-lived client certificates (X.509)

If your API server (or your access broker from the previous section) is set up to trust a client CA, you can use short-lived client certificates for debugging access. The idea is:

  • The private key is created and kept on the engineer’s machine (ideally hardware-backed, like a non-exportable key in a YubiKey/PIV token).
  • A short-lived certificate is issued with a TTL, often via the CertificateSigningRequest API or the access broker from the previous section.
  • RBAC maps the authenticated identity to a minimal Role.

This is straightforward to operationalize with the Kubernetes CertificateSigningRequest API.

Generate a key and CSR locally:

# Generate a private key.
# This could instead be generated within a hardware token;
# OpenSSL and several similar tools include support for that.
openssl genpkey -algorithm Ed25519 -out oncall.key

openssl req -new -key oncall.key -out oncall.csr \
 -subj "/CN=user/O=oncall-payments"

Create a CertificateSigningRequest with a short expiration:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: oncall-<user>-20260218
spec:
  request: <base64-encoded oncall.csr>
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 1800  # 30 minutes
  usages:
    - client auth

After the CSR is approved and signed, you extract the issued certificate and use it together with the private key to authenticate, for example via kubectl.
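Assuming the CSR above was submitted under the name oncall-<user>-20260218, the remaining steps can be sketched with standard kubectl commands; in practice an access broker would automate the approval and hand the certificate back to the engineer:

```shell
# Approve the pending CSR (requires approval rights; this is where a
# broker's manual-approval policy would apply).
kubectl certificate approve oncall-<user>-20260218

# Extract the issued, short-lived certificate.
kubectl get csr oncall-<user>-20260218 \
  -o jsonpath='{.status.certificate}' | base64 -d > oncall.crt

# Authenticate with the certificate plus the locally held private key.
kubectl config set-credentials oncall \
  --client-certificate=oncall.crt \
  --client-key=oncall.key \
  --embed-certs=true
```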

3) Use a just-in-time access gateway to run debugging commands

Once you have short-lived credentials, you can use them to open a secure shell session to a just-in-time access gateway, often exposed over SSH and created on demand. If the gateway is exposed over SSH, a common pattern is to issue the engineer a short-lived OpenSSH user certificate for the session. The gateway trusts your SSH user CA, authenticates the engineer at connection time, and then applies the approved session policy before making Kubernetes API calls on the user’s behalf. OpenSSH certificates are separate from Kubernetes X.509 client certificates, so these are usually treated as distinct layers.
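To make the SSH-certificate layer concrete, here is a minimal local sketch using stock OpenSSH tooling. The file names, identity, and principal (ssh_user_ca, alice@example.com, oncall) are illustrative; in a real deployment the CA key would be held by the broker, ideally hardware-backed, and never written to disk like this.

```shell
# Create an SSH user CA (stands in for the broker's CA).
ssh-keygen -q -t ed25519 -N '' -C 'ssh-user-ca' -f ssh_user_ca

# The engineer's key pair (normally hardware-backed).
ssh-keygen -q -t ed25519 -N '' -C 'engineer' -f oncall_key

# Issue a certificate valid for 30 minutes, bound to the engineer's
# identity (-I, recorded in audit logs) and the "oncall" principal (-n).
ssh-keygen -s ssh_user_ca -I 'alice@example.com' -n oncall \
  -V +30m oncall_key.pub

# Inspect the result (written to oncall_key-cert.pub): note the
# validity window and principal list.
ssh-keygen -L -f oncall_key-cert.pub
```

The gateway then trusts ssh_user_ca via TrustedUserCAKeys in its sshd configuration, so only certificates signed by the broker, and still inside their validity window, can open a session.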

The resulting session should also be scoped so it cannot be reused outside of what was approved. For example, the gateway or broker can limit it to a specific cluster and namespace, and optionally to a narrower target such as a pod or node. That way, even if someone tries to reuse the access, it will not work outside the intended scope. After the session is established, the gateway executes only the allowed actions and records what happened for auditing.

Example: Namespace-scoped role bindings

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jit-debug
  namespace: <namespace>
  annotations:
    kubernetes.io/description: >
      Colleagues performing semi-privileged debugging, with access provided
      just in time and on demand.
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jit-debug
  namespace: <namespace>
subjects:
  - kind: Group
    name: jit:oncall:<namespace>  # mapped from the short-lived credential (cert/OIDC)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: jit-debug
  apiGroup: rbac.authorization.k8s.io

These RBAC objects, and the rules they define, allow debugging only within the specified namespace; attempts to access other namespaces are not allowed.

Example: Cluster-scoped role binding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: jit-cluster-read
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: jit-cluster-read
subjects:
  - kind: Group
    name: jit:oncall:cluster
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: jit-cluster-read
  apiGroup: rbac.authorization.k8s.io

These RBAC rules grant cluster-wide read access (for example, to nodes and namespaces) and should be used only for workflows that truly require cluster-scoped resources.

Finer-grained restrictions like “only this pod/node” or “only these commands” are typically enforced by the access gateway/broker during the session, but Kubernetes also offers other options, such as ValidatingAdmissionPolicy for restricting writes and webhook authorization for custom authorization across verbs.
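As a sketch of the ValidatingAdmissionPolicy route (assuming Kubernetes 1.30+, where the API is GA): the policy below would restrict ephemeral debug containers to an approved image prefix. The registry path is illustrative, and a separate ValidatingAdmissionPolicyBinding is still required to put the policy into effect.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: restrict-debug-images
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["pods/ephemeralcontainers"]
  validations:
    - expression: >-
        !has(object.spec.ephemeralContainers) ||
        object.spec.ephemeralContainers.all(c,
          c.image.startsWith('registry.example.com/debug/'))
      message: "Ephemeral debug containers must use approved debug images."
```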

In environments with stricter access controls, you can add an extra, short-lived session mediation layer to separate session establishment from privileged actions. Both layers are ephemeral, use identity-bound expiring credentials, and produce independent audit trails. The mediation layer handles session setup/forwarding, while the execution layer performs only RBAC-authorized Kubernetes actions. This separation can reduce exposure by narrowing responsibilities, scoping credentials per step, and enforcing end-to-end session expiry.

Disclaimer: The views expressed in this post are solely those of the author and do not reflect the views of the author’s employer or any other organization.

Categories: CNCF Projects, Kubernetes

Running Rook at Petabyte Scale Across Multiple Regions

Rook Blog - Wed, 03/18/2026 - 10:50

This post describes how SAP’s cloud infrastructure team uses Rook to manage a multi-region Ceph fleet — from bare metal provisioning to rolling upgrades — as part of building a digitally sovereign storage backbone for Europe.

The 120 Petabyte Challenge

When you are responsible for a target of 120 Petabytes of storage across 30 Regions, manual operations don’t scale.

For years, SAP Cloud Infrastructure relied on a mix of proprietary appliances and legacy OpenStack Swift. But as we architected our next-generation cloud stack (internally part of the Apeiro project), we faced a non-negotiable constraint: Digital Sovereignty. Our stack had to be completely free of hyperscaler lock-in, running on our own hardware in our own data centers.

This created a concrete engineering challenge: build a storage layer that is API-first, fully open-source, and capable of self-management at a global scale. We chose Ceph for the storage engine — and Rook for the automation layer that makes it manageable.

Why Rook

Managing Ceph at this scale without an operator would mean building and maintaining custom tooling for OSD lifecycle, daemon placement, upgrade orchestration, and failure recovery across every region. Rook gives us all of this as a declarative Kubernetes-native interface, which means our existing GitOps and CI/CD workflows extend naturally to storage. Instead of writing region-specific runbooks, we write Helm values.

Architecture: The Separation of Metal and Software

Our platform, CobaltCore, is built on top of Gardener and Metal-API, both part of the ApeiroRA reference architecture. In this stack, storage isn’t a static resource — it’s a programmable Kubernetes object. We run storage on dedicated nodes, separate from application workloads. At our density (16 NVMe drives per node), co-locating workloads would create unacceptable I/O interference, so storage nodes do one thing: serve data.

The Metal Layer

Metal-API and Gardener manage the physical lifecycle of bare-metal servers: inventory, provisioning, firmware, and OS deployment. This allows Rook to focus purely on the software layer without worrying about the underlying physical state.

The Declarative Storage Layer (Rook)

Once nodes are handed over, Rook takes control. We use a strict GitOps workflow to ensure consistency across the fleet:

  • Base Blueprint: A central Helm chart defines global best practices and standard Ceph configurations.
  • Region Overlay: Region-specific resources (CephBlockPools, RGW placement rules) are injected via localized values.yaml files.
  • Automation: Rook handles the rest: bootstrapping daemons, configuring CRUSH failure domains, and provisioning RGW endpoints.
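A region overlay in this workflow might look roughly like the following values.yaml fragment, assuming the layout of the upstream rook-ceph-cluster Helm chart; the pool and storage class names are illustrative, not SAP's actual configuration:

```yaml
# Region overlay values.yaml, layered over the base blueprint chart
cephBlockPools:
  - name: replicapool-eu01
    spec:
      failureDomain: host
      replicated:
        size: 3
    storageClass:
      enabled: true
      name: ceph-block-eu01
      isDefault: false
      reclaimPolicy: Delete
```

The base chart carries the global defaults; only the deltas live in each region's overlay, which keeps fleet-wide changes to a single reviewed commit.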

Standard Storage Node Spec:

  • Server: Dell PowerEdge R7615
  • CPU: AMD EPYC 9554P (64 cores)
  • RAM: 384 GB
  • Storage: 16x 14 TB NVMe
  • Network: 100 GbE (redundant)

Validation: Establishing the Performance Envelope

Before committing to production capacity planning, we needed to establish the performance envelope of our RGW tier. We ran a breakpoint test on a typical Ceph Squid cluster (28 nodes, 362 OSDs) to find the stable operating range, saturation threshold, and hard ceiling.

Test Setup

  • Workload: 2M objects (4 KB each), 20 k6 clients, single “premium” NVMe bucket.
  • Method: Ramping load over 30 minutes until p90 latency exceeded 500 ms.

Results

Request rate ramp: successful requests (pink) peak at 171K ops/sec before the test exits. Failed requests (blue) spike briefly near saturation.
  • Saturation Point: The cluster entered saturation around 90K GET/s — latency percentiles begin diverging and request queues start building.
  • Breaking Point: Peak of 171K GET/s (measured on RGW) before the runners hit the latency exit condition.

Note: isolated 503s appeared as early as ~33K GET/s on a single RGW instance, likely caused by uneven load distribution rather than cluster-wide saturation.

Reading the charts

Client-side latency (k6): flat near zero through moderate load, stepping up as the cluster approaches saturation, and reaching 1.5s+ at the breaking point.

The client-side latency chart tells the story most directly. Average request duration stays flat near zero well into the ramp — then steps up sharply as the cluster enters saturation and eventually hits its ceiling.

RGW-side GET latency: all percentiles stay flat and sub-ms through moderate load. Around 90K GET/s, p99 begins climbing while median remains low — a classic saturation signal.

Comparing the two charts reveals where the system saturates. At peak load, RGW reports p99 latency of ~210ms — but clients observe 1.5 seconds. The gap is connection queueing: requests waiting to be picked up by RGW Beast frontend threads. RGW’s internal metrics only measure processing time after a request is accepted, not time spent in the queue.

The RGW latency chart also shows that RADOS operation latency climbs under load, which means RGW threads stay occupied longer, contributing to the queue buildup. At the breaking point, request queues filled and RGWs began returning 503s across all instances.

This is a read-focused baseline — our primary workload is read-heavy. The saturation point of 90K GET/s gives us a conservative operating ceiling for per-region capacity planning.

Operational Reality: Making Day 2 Uneventful

The true test of any storage system is what happens when things break or need upgrading. At our scale, the goal is to make operations boring.

Zero-Downtime Upgrades

Rook has reduced storage maintenance from a coordinated event to a background task. Since the first cluster went live in May 2024, we have maintained a continuous upgrade cadence with zero customer-facing downtime and zero data loss:

  • GardenLinux: Monthly rolling updates across all regions.
  • Kubernetes: Quarterly version upgrades.
  • Rook: Quarterly upgrades (v1.14 through v1.18), with additional upgrades when a needed feature ships in a new release.
  • Ceph: Major version migration from Reef v18 to Squid v19. A rolling upgrade of the largest cluster (~816 OSDs) completes in approximately 2 days.

Drive Failures

With ~2,800 OSDs in the fleet, drive failures are a routine event. When a drive fails, Ceph (RADOS) automatically handles data recovery and rebalancing across the remaining OSDs — no operator action is needed to protect data. On the Kubernetes side, Rook detects the failed OSD pod and manages its lifecycle. The full drive replacement cycle (removing the failed OSD, clearing the device, provisioning a new OSD on the replacement drive) still involves operational steps on our side, but Ceph’s self-healing ensures data durability is never at risk while the replacement is carried out.

Current Status and What’s Next

As of early 2026, the fleet spans 10 live regions (with an 11th newly provisioned):

  • Storage Nodes: 251
  • Total OSDs: ~2,800
  • Raw Capacity: ~37 PiB

Region sizes range from 13-node / 96-OSD deployments to 59-node / 816-OSD clusters — the same Rook-based GitOps workflow handles both.

The next phase is bringing high-performance Block Storage (RBD) into this declarative model to fully retire our remaining proprietary SANs.

  • Target: 30 Regions.
  • Target Capacity: 120 PB.

We are active contributors to the Rook project and continue collaborating with the maintainers as we scale toward these targets. The Rook Slack community has been a valuable resource throughout this journey.

This work is part of ApeiroRA — an open initiative developing a reference blueprint for sovereign cloud-edge infrastructure. All components use enterprise-friendly open-source licenses under neutral community governance. ApeiroRA welcomes participants — whether you want to adopt the blueprints, contribute components, or shape the architecture. Get started at the documentation portal.

Authors: SAP Engineering Team, CLYSO Engineering Team.

Running Rook at Petabyte Scale Across Multiple Regions was originally published in Rook Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Categories: CNCF Projects

The Invisible Rewrite: Modernizing the Kubernetes Image Promoter

Kubernetes Blog - Mon, 03/16/2026 - 20:00

Every container image you pull from registry.k8s.io got there through kpromo, the Kubernetes image promoter. It copies images from staging registries to production, signs them with cosign, replicates signatures across more than 20 regional mirrors, and generates SLSA provenance attestations. If this tool breaks, no Kubernetes release ships. Over the past few weeks, we rewrote its core from scratch, deleted 20% of the codebase, made it dramatically faster, and nobody noticed. That was the whole point.

A bit of history

The image promoter started in late 2018 as an internal Google project by Linus Arver. The goal was simple: replace the manual, Googler-gated process of copying container images into k8s.gcr.io with a community-owned, GitOps-based workflow. Push to a staging registry, open a PR with a YAML manifest, get it reviewed and merged, and automation handles the rest. KEP-1734 formalized this proposal.

In early 2019, the code moved to kubernetes-sigs/k8s-container-image-promoter and grew quickly. Over the next few years, Stephen Augustus consolidated multiple tools (cip, gh2gcs, krel promote-images, promobot-files) into a single CLI called kpromo. The repository was renamed to promo-tools. Adolfo Garcia Veytia (Puerco) added cosign signing and SBOM support. Tyler Ferrara built vulnerability scanning. Carlos Panato kept the project in a healthy and releasable state. 42 contributors made about 3,500 commits across more than 60 releases.

It worked. But by 2025 the codebase carried the weight of seven years of incremental additions from multiple SIGs and subprojects. The README said it plainly: you will see duplicated code, multiple techniques for accomplishing the same thing, and several TODOs.

The problems we needed to solve

Production promotion jobs for Kubernetes core images regularly took over 30 minutes and frequently failed with rate limit errors. The core promotion logic had grown into a monolith that was hard to extend and difficult to test, making new features like provenance or vulnerability scanning painful to add.

On the SIG Release roadmap, two work items had been sitting for a while: "Rewrite artifact promoter" and "Make artifact validation more robust". We had discussed these at SIG Release meetings and KubeCons, and the open research spikes on project board #171 captured eight questions that needed answers before we could move forward.

One issue to answer them all

In February 2026, we opened issue #1701 ("Rewrite artifact promoter pipeline") and answered all eight spikes in a single tracking issue. The rewrite was deliberately phased so that each step could be reviewed, merged, and validated independently. Here is what we did:

Phase 1: Rate Limiting (#1702). Rewrote rate limiting to properly throttle all registry operations with adaptive backoff.

Phase 2: Interfaces (#1704). Put registry and auth operations behind clean interfaces so they can be swapped out and tested independently.

Phase 3: Pipeline Engine (#1705). Built a pipeline engine that runs promotion as a sequence of distinct phases instead of one large function.

Phase 4: Provenance (#1706). Added SLSA provenance verification for staging images.

Phase 5: Scanner and SBOMs (#1709). Added vulnerability scanning and SBOM support. Flipped the default to the new pipeline engine. At this point we cut v4.2.0 and let it soak in production before continuing.

Phase 6: Split Signing from Replication (#1713). Separated image signing from signature replication into their own pipeline phases, eliminating the rate limit contention that caused most production failures.

Phase 7: Remove Legacy Pipeline (#1712). Deleted the old code path entirely.

Phase 8: Remove Legacy Dependencies (#1716). Deleted the audit subsystem, deprecated tools, and e2e test infrastructure.

Phase 9: Delete the Monolith (#1718). Removed the old monolithic core and its supporting packages. Thousands of lines deleted across phases 7 through 9.

Each phase shipped independently. v4.3.0 followed the next day with the legacy code fully removed.

With the new architecture in place, a series of follow-up improvements landed: parallelized registry reads (#1736), retry logic for all network operations (#1742), per-request timeouts to prevent pipeline hangs (#1763), HTTP connection reuse (#1759), local registry integration tests (#1746), the removal of deprecated credential file support (#1758), a rework of attestation handling to use cosign's OCI APIs and the removal of deprecated SBOM support (#1764), and a dedicated promotion record predicate type registered with the in-toto attestation framework (#1767). These would have been much harder to land without the clean separation the rewrite provided. v4.4.0 shipped all of these improvements and enabled provenance generation and verification by default.

The new pipeline

The promotion pipeline now has seven clearly separated phases:

Setup --> Plan --> Provenance --> Validate --> Promote --> Sign --> Attest

  • Setup: Validate options, prewarm the TUF cache.
  • Plan: Parse manifests, read registries, compute which images need promotion.
  • Provenance: Verify SLSA attestations on staging images.
  • Validate: Check cosign signatures; exit here for dry runs.
  • Promote: Copy images server-side, preserving digests.
  • Sign: Sign promoted images with keyless cosign.
  • Attest: Generate promotion provenance attestations using a dedicated in-toto predicate type.

Phases run sequentially, so each one gets exclusive access to the full rate limit budget. No more contention. Signature replication to mirror registries is no longer part of this pipeline and runs as a dedicated periodic Prow job instead.

Making it fast

With the architecture in place, we turned to performance.

Parallel registry reads (#1736): The plan phase reads 1,350 registries. We parallelized this and the plan phase dropped from about 20 minutes to about 2 minutes.

Two-phase tag listing (#1761): Instead of checking all 46,000 image groups across more than 20 mirrors, we first check only the source repositories. About 57% of images have no signatures at all because they were promoted before signing was enabled. We skip those entirely, cutting API calls roughly in half.

Source check before replication (#1727): Before iterating all mirrors for a given image, we check if the signature exists on the primary registry first. In steady state where most signatures are already replicated, this reduced the work from about 17 hours to about 15 minutes.

Per-request timeouts (#1763): We observed intermittent hangs where a stalled connection blocked the pipeline for over 9 hours. Every network operation now has its own timeout and transient failures are retried automatically.

Connection reuse (#1759): We started reusing HTTP connections and auth state across operations, eliminating redundant token negotiations. This closed a long-standing request from 2023.

By the numbers

Here is what the rewrite looks like in aggregate.

  • Over 40 PRs merged, 3 releases shipped (v4.2.0, v4.3.0, v4.4.0)
  • Over 10,000 lines added and over 16,000 lines deleted, a net reduction of about 5,000 lines (20% smaller codebase)
  • Performance drastically improved across the board
  • Robustness improved with retry logic, per-request timeouts, and adaptive rate limiting
  • 19 long-standing issues closed

The codebase shrank by a fifth while gaining provenance attestations, a pipeline engine, vulnerability scanning integration, parallelized operations, retry logic, integration tests against local registries, and a standalone signature replication mode.

No user-facing changes

This was a hard requirement. The kpromo cip command accepts the same flags and reads the same YAML manifests. The post-k8sio-image-promo Prow job continued working throughout. The promotion manifests in kubernetes/k8s.io did not change. Nobody had to update their workflows or configuration.

We caught two regressions early in production. One (#1731) caused a registry key mismatch that made every image appear as "lost" so that nothing was promoted. Another (#1733) set the default thread count to zero, blocking all goroutines. Both were fixed within hours. The phased release strategy (v4.2.0 with the new engine, v4.3.0 with legacy code removed) gave us a clear rollback path that we fortunately never needed.

What comes next

Signature replication across all mirror registries remains the most expensive part of the promotion cycle. Issue #1762 proposes eliminating it entirely by having archeio (the registry.k8s.io redirect service) route signature tag requests to a single canonical upstream instead of per-region backends. Another option would be to move signing closer to the registry infrastructure itself. Both approaches need further discussion with the SIG Release and infrastructure teams, but either one would remove thousands of API calls per promotion cycle and simplify the codebase even further.

Thank you

This project has been a community effort spanning seven years. Thank you to Linus, Stephen, Adolfo, Carlos, Ben, Marko, Lauri, Tyler, Arnaud, and many others who contributed code, reviews, and planning over the years. The SIG Release and Release Engineering communities provided the context, the discussions, and the patience for a rewrite of infrastructure that every Kubernetes release depends on.

If you want to get involved, join us in #release-management on the Kubernetes Slack or check out the repository.

Categories: CNCF Projects, Kubernetes

Project Harbor at KubeCon + CloudNativeCon Europe 2026 in Amsterdam

Harbor Blog - Sun, 03/15/2026 - 06:00
The cloud-native community is once again gathering for one of the most anticipated events of the year — KubeCon + CloudNativeCon Europe 2026, taking place in the beautiful and dynamic city of Amsterdam, Netherlands. As always, Project Harbor will be proudly represented, bringing the latest innovations in container registry technology, security, and software supply chain management to the global Kubernetes community.
Categories: CNCF Projects

Cloud-Native AI Model Management and Distribution for Inference Workloads

Harbor Blog - Wed, 03/11/2026 - 04:00
Authors: Wenbo Qi (Gaius), Dragonfly/ModelPack Maintainer; Chenyu Zhang (Chlins), Harbor/ModelPack Maintainer; Feynman Zhou, ORAS Maintainer, CNCF Ambassador
Reviewers: Sascha Grunert, CRI-O Maintainer; Wei Fu, containerd Maintainer

The weight of AI models: Why infrastructure always arrives slowly

As AI adoption accelerates across industries, organizations face a critical bottleneck that is often overlooked until it becomes a serious obstacle: reliably managing and distributing large model weight files at scale. A model’s weights serve as the central artifact that bridges both training and inference pipelines — yet the infrastructure surrounding this artifact is frequently an afterthought.
Categories: CNCF Projects

Sustaining open source in the age of generative AI

CNCF Blog Projects Category - Tue, 03/10/2026 - 07:00

Open source has always evolved alongside shifts in technology.

From distributed version control to CI/CD, and from containers to Kubernetes, each wave of tooling has reshaped how we build, collaborate, and contribute. Generative AI appears to be the newest wave, and it introduces a tension that open source communities can no longer afford to ignore.

AI has made it simple to generate contributions. It has not, however, made the necessary review process any simpler.

Recently, the Kyverno project introduced an AI Usage Policy. This decision was not driven by resistance to AI. It was driven by something far more practical: the scaling limits of human attention.

Where this conversation began

Like many governance changes in open source, this one didn’t begin with theory. It began with a Slack message.

“20 PRs opened in 15 minutes ?”

What followed was a mixture of humor, curiosity, and a familiar undertone many maintainers recognize immediately as discomfort.

“Were they good PRs?”
“Maybe they were generated by bots?”
“Are any of them helpful, or are they mostly noise?”

One maintainer captured the sentiment perfectly:

“Just seeing this number is discouraging enough.”

Another jokingly suggested we might need a:

“Respect the maintainers’ life policy.”

Behind the jokes was something deeply real. Our Maintainers and our project at large were feeling the weight of something very new, very real, and clearly on the verge of changing how open source projects like ours will be maintained.

The maintainer reality few people see

Modern AI tools are extraordinary productivity amplifiers.

They generate code, documentation, tests, refactors, and design suggestions in seconds. But while output scales infinitely, review does not. The bottleneck in open source has never been code generation.

It has always been human cognition.

Every pull request, regardless of how it was produced must still be:

  • Read
  • Understood
  • Evaluated for correctness
  • Assessed for security implications
  • Considered for long-term maintainability
  • More often than not, commented on, questioned, or simply clarified
  • Viewed by more than one set of eyes
  • Merged

In open source, there is always a human in the loop. That human is typically a maintainer, a reviewer, or a combination of both.

When low-effort or poorly understood AI-generated PRs flood a project, the burden of validation shifts entirely onto the humans who bear the majority of the weight in this loop. Even the most well-intentioned contributions become costly when they lack clarity, context, demonstrated understanding, and ownership.

Low-effort AI contributions don’t just exhaust maintainers, they quietly tax every thoughtful contributor waiting in the queue.

AI boomers, AI rizz, and the reality of change

We’re currently living through a fascinating cultural split in the developer ecosystem.

On one side, we see what might playfully be called “AI boomers”: folks deeply skeptical of AI, hesitant to adopt it, or resistant to its growing presence in development workflows. While it might be hard to believe, there are many such people working in and contributing to open source software development.

On the other side, we see contributors with undeniable “AI rizz.” These are enthusiastic adopters of AI eager to automate, generate, accelerate, and experiment with AI and AI tooling in the open source space and everywhere else possible.

Both reactions are understandable.

Both are human.

But history has taught us something consistent about technological change:

Projects, like businesses, that refuse to adapt rarely remain relevant.

It’s become clear that AI is not a passing trend. It is a structural shift in how software is created. Resisting it entirely is unlikely to be sustainable and blindly embracing it without guardrails is equally risky.

AI as acceleration vs. AI as substitution

Open source contributions have traditionally served as one of the most powerful learning engines in our industry. Developers deepen expertise, explore systems, build portfolios, and give back to the communities they rely on.

But the arrival of AI has changed how many contributors produce work. Unfortunately, this hasn’t happened in a globally productive way; rather, it has happened in a way that undermines the one thing a meaningful contribution requires:

Understanding.

Using AI to bypass understanding is not acceleration. It’s debt for both the contributor and the project.

Superficially correct code that cannot be explained, reasoned about, or defended introduces risk. It also deprives contributors of the very growth that open source participation has historically enabled.

Across open source communities, we’re hearing the same message shared with AI-touting contributors: AI can amplify learning, but it cannot replace learning.

Ownership still matters — perhaps more than ever

During an internal discussion about AI-generated contributions, Jim Bugwadia, Nirmata CEO and Kyverno founder, made a deceptively simple observation about what needs to happen with AI generated and assisted contributions:

“Own your commit.”

In a world of AI-assisted development, that idea expands naturally.

If AI helped generate your contribution, you must also own your prompt and whatever is generated by it.

Ownership means:

  • Understanding intent
  • Verifying correctness
  • Taking responsibility for outcomes
  • Standing behind the change

AI can generate output, but it cannot and should not assume accountability. The idea of having a human in the loop isn’t something that can or should ever be only maintainer-facing. To be fair, it must be contributor-facing too.

Disclosure as trust infrastructure

Transparency has always been foundational to open source collaboration.

AI introduces new complexities around licensing, copyright, provenance, and tool terms of service. Legal frameworks are still evolving, and uncertainty remains a defining characteristic of this space.

Disclosure is not about tools or bureaucracy.

Disclosure is about accountability. It is trust infrastructure.

Requiring contributors to disclose meaningful AI usage helps preserve:

  • Transparency
  • Reviewer trust
  • Licensing integrity
  • Contribution clarity
  • Responsible authorship

This approach aligns with guidance from the Linux Foundation and discussions across the broader CNCF community, both of which acknowledge that AI-generated content can be contributed provided contributors ensure compliance with licensing, attribution, and intellectual property obligations.

When AI meets open source: Kyverno’s approach

Kyverno is not a hobby project. Our project is used globally, in production, across organizations ranging from startups to enterprise-scale companies. Adoption continues to grow, and the project is actively moving toward CNCF Graduation.

Kyverno itself exists to create:

  • Clarity
  • Safety
  • Consistency
  • Sustainable workflows 

All through policy as code.

In this case, we are applying the same philosophy to something new: AI usage.

If policy as code provides guardrails and golden paths in platform engineering, then we should be considering how to provide similar guidance in the AI-assisted development space.
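For readers unfamiliar with what "guardrails as code" looks like in practice, here is a minimal sketch of a Kyverno validation policy. The policy name, label key, and message are illustrative, not taken from any real Kyverno ruleset:

```yaml
# Illustrative Kyverno ClusterPolicy: reject Pods that lack a `team` label.
# The specific rule is an example only; the point is that expectations
# are encoded declaratively and enforced automatically.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label `team` is required on all Pods."
        pattern:
          metadata:
            labels:
              team: "?*"
```

The same pattern applies to AI usage: rather than relying on every contributor to remember unwritten norms, the project writes the expectations down where they can be checked.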

Developers can't sustainably use AI within open source ecosystems if projects fail to define clear expectations for how it should be used.

AI-friendly does not mean AI-unbounded

There is an important distinction emerging across open source communities: Being AI-friendly does not mean accepting unreviewed AI output.

Maintainers themselves are often enthusiastic adopters of AI tools and rightly so. Across projects, maintainers are using AI to:

  • Accelerate repetitive tasks
  • Improve documentation
  • Generate scaffolding
  • Explore design alternatives

One emerging pattern is the use of AGENT.md-style configurations, designed to guide how AI tools interact with repositories and project conventions.

Kyverno is actively exploring similar approaches. The goal is not simply to manage AI-assisted contributions, but to improve their quality at the source.
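As a hypothetical illustration, an AGENT.md-style file might encode repository conventions like the following. The file contents, paths, and make targets below are invented for this example and do not describe Kyverno's actual configuration:

```markdown
# AGENT.md (illustrative example)

## Project conventions for AI coding tools
- Run `make fmt` and `make lint` before proposing any change.
- Do not hand-edit generated files; regenerate them with `make codegen`.
- New features require a test case and a documentation update.

## Contribution expectations
- Disclose meaningful AI assistance in the pull request description.
- The human author must be able to explain every line in review.
```

Encoding conventions this way means AI tools, and the humans directing them, start from the project's expectations rather than discovering them during review.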

Discomfort, growth, and privilege

AI is forcing open source communities to confront unfamiliar challenges:

  • Scaling review processes
  • Defining authorship norms
  • Navigating licensing uncertainty
  • Re-thinking contributor workflows

Discomfort is inevitable. But as Jim often reminds our team:

“Discomfort in newness is typically a sign of growth.”

The pressure to navigate these new challenges and answer these pressing questions is not a burden. Rising to this challenge is a privilege. It means:

  • Our project matters
  • The ecosystem is evolving
  • We’re participating in shaping the future

A shared challenge across open source

Kyverno’s AI policy work was informed by thoughtful discussions and examples across the ecosystem. We dove into a variety of projects, each reflecting different constraints and priorities for us to keep in mind as we embark on our own journey.

Moving forward, what matters most is that communities and community members from different projects and industries around the globe engage deliberately with these questions rather than simply responding reactively to the tooling.

Open source sustainability increasingly depends on shared governance patterns, not isolated experimentation.

An invitation to the ecosystem

AI is not going away, nor should it.

The question is not whether AI belongs in open source. The question is how we integrate it responsibly.

Sustainable open source in the AI era requires:

  • Human ownership
  • Transparent authorship
  • Respect for reviewer time
  • Context-aware contributions
  • Community-driven guardrails

AI is a powerful tool. But open source remains, at its core, a human system.

While AI changes the tools and accelerates output, it does not change the responsibility.

Acknowledgements and influences

Kyverno's AI Usage Policy was shaped by the openness and thoughtfulness of many communities and leaders across the ecosystem.

Open source benefits enormously when governance knowledge is shared. Thanks to everyone who has already shared and to those who will help us continue to adapt our AI policies as we grow our project.
