CNCF Blog Projects Category
Istio at KubeCon + CloudNativeCon North America 2025: Community highlights and project progress
KubeCon + CloudNativeCon North America 2025 lit up Atlanta from November 10–13, bringing together one of the largest gatherings of open-source practitioners, platform engineers, and maintainers across the cloud native ecosystem. For the Istio community, the week was defined by packed rooms, long hallway conversations, and a genuine sense of shared progress across service mesh, Gateway API, security, and AI-driven platforms.
Before the main conference began, the community kicked things off with Istio Day on November 10, a colocated event filled with deep technical sessions, migration stories, and future-looking discussions that set the tone for the rest of the week.
Istio Day at KubeCon + CloudNativeCon NA
Istio Day brought together practitioners, contributors, and adopters for an afternoon of learning, sharing, and open conversations about where service mesh, and Istio, are headed next.
Istio Day opened with welcome remarks from the program co-chairs, setting the tone for an afternoon focused on real-world mesh evolution and the rapid growth of the Istio community. The agenda highlighted three major themes driving Istio’s future: AI-driven traffic patterns, the advancement of Ambient Mesh—including multicluster adoption, and modernizing traffic entry with Gateway API. Speakers across the ecosystem shared practical lessons on scaling, migration, reliability, and operating increasingly complex workloads with Istio.
The co-chairs closed the day by recognizing the speakers, contributors, and a community continuing to push service-mesh innovation forward. Recordings of all sessions are available at the CNCF YouTube channel.
Istio at KubeCon + CloudNativeCon
Outside of Istio Day, the project was highly visible across KubeCon + CloudNativeCon Atlanta, with maintainers, end users, and contributors sharing technical deep dives, production stories, and cutting-edge research. Istio appeared not only across expo booths and breakout sessions, but also throughout several of the keynotes, where companies showcased how Istio plays a critical role in powering their platforms at scale.
The week's momentum hit its stride when the Istio community reconvened for the Istio Project Update, where project leads shared the latest releases and roadmap advances. In Istio: Set Sailing With Istio Without Sidecars, attendees explored how the sidecar-less Ambient Mesh architecture is rapidly moving from experiment to adoption, opening new possibilities for simpler deployments and leaner data planes.
The session Lessons Applied Building a Next-Generation AI Proxy took the crowd behind the scenes of how mesh technologies adapt to AI-driven traffic patterns. Over at Automated Rightsizing for Istio DaemonSet Workloads (Poster Session), practitioners compared strategies for optimizing control-plane resources, tuning for high scale, and reducing cost without sacrificing performance.
The narrative of traffic-management evolution featured prominently in Gateway API: Table Stakes and its faster sibling Know Before You Go! Speedrun Intro to Gateway API. Meanwhile, Return of the Mesh: Gateway API’s Epic Quest for Unity scaled that conversation: how traffic, API, mesh, and routing converge into one architecture that simplifies complexity rather than multiplies it.
For long-term reflection, 5 Key Lessons From 8 Years of Building Kgateway delivered hard-earned wisdom from years of system design. In GAMMA in Action: How Careem Migrated To Istio Without Downtime, the real-world migration story—a major production rollout that stayed up during transition—provided a roadmap for teams seeking safe mesh adoption at scale.
Safety and rollout risks took center stage in Taming Rollout Risks in Distributed Web Apps: A Location-Aware Gradual Deployment Approach, where strategies for regional rollouts, steering traffic, and minimizing user impact were laid out.
Finally, operations and day-two reality were tackled in End-to-End Security With gRPC in Kubernetes and On-Call the Easy Way With Agents, reminding everyone that mesh isn’t just about architecture, but about how teams run software safely, reliably, and confidently.
Community spaces: ContribFest, Maintainer Track and the Project Pavilion
At the Project Pavilion, the Istio kiosk was constantly buzzing, drawing users with questions about Ambient Mesh, AI workloads, and deployment best practices.
The Maintainer Track brought contributors together to collaborate on roadmap topics, triage issues, and discuss key areas of investment for the next year.
At ContribFest, new contributors joined maintainers to work through good-first issues, discuss contribution pathways, and get their first PRs lined up.
Istio maintainers recognized at the CNCF Community Awards
This year’s CNCF Community Awards were a proud moment for the project. Two Istio maintainers received well-deserved recognition:
Daniel Hawton — “Chop Wood, Carry Water” Award
John Howard — Top Committer Award
Beyond these awards, Istio was also represented prominently in conference leadership. Faseela K, one of the KubeCon + CloudNativeCon NA co-chairs and an Istio maintainer, participated in a keynote panel on Cloud Native for Good. During closing remarks, it was also announced that Lin Sun, another long-time Istio maintainer, will serve as an upcoming KubeCon + CloudNativeCon co-chair.
What we heard in Atlanta
Across sessions, kiosks, and hallways, a few themes emerged:
- Ambient Mesh is moving quickly from exploration to real-world adoption.
- AI workloads are reshaping traffic patterns and operational practices.
- Multicluster deployments are becoming standard, with stronger focus on identity and failover.
- Gateway API is solidifying as the future of modern traffic management.
- Contributor growth is accelerating, supported by ContribFest and hands-on community guidance.
Looking ahead
KubeCon + CloudNativeCon NA 2025 showcased a vibrant, rapidly growing community taking on some of the toughest challenges in cloud infrastructure—from AI traffic management to zero-downtime migrations, from planet-scale control planes to the next generation of sidecar-less mesh. As we look ahead to 2026, the momentum from Atlanta makes one thing clear: the future of service mesh is bright, and the Istio community is leading it together.
See you in Amsterdam!
Announcing Kyverno release 1.16
Kyverno 1.16 delivers major advancements in policy as code for Kubernetes, centered on a new generation of CEL-based policies now available in beta with a clear path to GA. This release introduces partial support for namespaced CEL policies to confine enforcement and minimize RBAC, aligning with least-privilege best practices. Observability is significantly enhanced with full metrics for CEL policies and native event generation, enabling precise visibility and faster troubleshooting. Security and governance get sharper controls through fine-grained policy exceptions tailored for CEL policies, and validation use cases broaden with the integration of an HTTP authorizer into ValidatingPolicy. Finally, we’re debuting the Kyverno SDK, laying the foundation for ecosystem integrations and custom tooling.
CEL policy types
CEL policies in beta
CEL policy types are introduced at v1beta1. The promotion plan provides a clear, non-breaking path: v1 will be made available in 1.17, with GA targeted for 1.18. This release includes the cluster-scoped family (Validating, Mutating, Generating, Deleting, ImageValidating) at v1beta1 and adds namespaced variants for validation, deletion, and image validation; namespaced Generating and Mutating will follow in 1.17. PolicyException and GlobalContextEntry will advance in step to keep versions aligned; see the promotion roadmap in this tracking issue.
Namespaced policies
Kyverno 1.16 introduces namespaced CEL policy types— NamespacedValidatingPolicy, NamespacedDeletingPolicy, and NamespacedImageValidatingPolicy—which mirror their cluster-scoped counterparts but apply only within the policy’s namespace. This lets teams enforce guardrails with least-privilege RBAC and without central changes, improving multi-tenancy and safety during rollout. Choose namespaced types for team-owned namespaces and cluster-scoped types for global controls.
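As a rough illustration, a namespaced policy mirrors the schema of its cluster-scoped counterpart but lives inside a namespace; the name, namespace, and rule below are illustrative assumptions rather than an excerpt from the release:

apiVersion: policies.kyverno.io/v1beta1
kind: NamespacedValidatingPolicy
metadata:
  name: require-team-label        # name, namespace, and rule are illustrative
  namespace: payments
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [apps]
      apiVersions: [v1]
      operations: [CREATE, UPDATE]
      resources: [deployments]
  validations:
  - expression: "has(object.metadata.labels.team)"
    message: "Deployments in this namespace must carry a 'team' label."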
Observability upgrades
CEL policies now have comprehensive, native observability for faster diagnosis:
- Validating policy execution latency: kyverno_validating_policy_execution_duration_seconds_count, …_sum, …_bucket
  - What it measures: Time spent evaluating validating policies per admission/background execution, as a Prometheus histogram.
  - Key labels: policy_name, policy_background_mode, policy_validation_mode (enforce/audit), resource_kind, resource_namespace, resource_request_operation (create/update/delete), execution_cause (admission_request/background_scan), result (PASS/FAIL).
- Mutating policy execution latency: kyverno_mutating_policy_execution_duration_seconds_count, …_sum, …_bucket
  - What it measures: Time spent executing mutating policies (admission/background), as a Prometheus histogram.
  - Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
- Generating policy execution latency: kyverno_generating_policy_execution_duration_seconds_count, …_sum, …_bucket
  - What it measures: Time spent executing generating policies when evaluating requests or during background scans.
  - Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
- Image-validating policy execution latency: kyverno_image_validating_policy_execution_duration_seconds_count, …_sum, …_bucket
  - What it measures: Time spent evaluating image-related validating policies (e.g., image verification), as a Prometheus histogram.
  - Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
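Because these are standard Prometheus histograms, they can be consumed with the usual tooling. For example, a minimal alerting-rule sketch (the group name, alert name, and threshold are assumptions) might look like:

groups:
- name: kyverno-cel-policies            # rule-group name is an assumption
  rules:
  - alert: KyvernoValidatingPolicySlow  # alert name and threshold are illustrative
    expr: |
      histogram_quantile(0.99,
        sum(rate(kyverno_validating_policy_execution_duration_seconds_bucket[5m])) by (le, policy_name)
      ) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 execution latency for validating policy {{ $labels.policy_name }} is above 1s"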
CEL policies now emit Kubernetes Events for passes, violations, errors, and compile/load issues with rich context (policy/rule, resource, user, mode). This provides instant, kubectl-visible feedback and easier correlation with admission decisions and metrics during rollout and troubleshooting.
Fine-grained policy exceptions
Image-based exceptions
This exception allows Pods in the ci namespace to use images that match the patterns listed in the images attribute, while keeping the no-latest rule enforced for all other images. It narrows the bypass to a specific namespace and team for auditability.
apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
  name: allow-ci-latest-images
  namespace: ci
spec:
  policyRefs:
  - name: restrict-image-tag
    kind: ValidatingPolicy
  images:
  - "ghcr.io/kyverno/*:latest"
  matchConditions:
  - expression: "has(object.metadata.labels.team) && object.metadata.labels.team == 'platform'"
The following ValidatingPolicy references exceptions.allowedImages, which skips the validation check for allow-listed images:
apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
  name: restrict-image-tag
spec:
  rules:
  - name: broker-config
    matchConstraints:
      resourceRules:
      - apiGroups: [""]
        apiVersions: [v1]
        operations: [CREATE, UPDATE]
        resources: [pods]
    validations:
    - message: "Containers must not allow privilege escalation unless they are in the allowed images list."
      expression: >
        object.spec.containers.all(container,
          string(container.image) in exceptions.allowedImages ||
          (
            has(container.securityContext) &&
            has(container.securityContext.allowPrivilegeEscalation) &&
            container.securityContext.allowPrivilegeEscalation == false
          )
        )
Value-based exceptions
This exception declares a list of allowed values via allowedValues, which a CEL validation can consult for a constrained set of targets, so teams can proceed without weakening the entire policy.
apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
  name: allow-debug-annotation
  namespace: dev
spec:
  policyRefs:
  - name: check-security-context
    kind: ValidatingPolicy
  allowedValues:
  - "debug-mode-temporary"
  matchConditions:
  - expression: "object.metadata.name.startsWith('experiments-')"
Here is the policy that leverages the allowed values above. It denies resources unless the values in question are present in exceptions.allowedValues or the policy's own allowed list.
apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
  name: check-security-context
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [apps]
      apiVersions: [v1]
      operations: [CREATE, UPDATE]
      resources: [deployments]
  variables:
  - name: allowedCapabilities
    expression: "['AUDIT_WRITE','CHOWN','DAC_OVERRIDE','FOWNER','FSETID','KILL','MKNOD','NET_BIND_SERVICE','SETFCAP','SETGID','SETPCAP','SETUID','SYS_CHROOT']"
  validations:
  - expression: >-
      object.spec.containers.all(container,
        container.?securityContext.?capabilities.?add.orValue([]).all(capability,
          capability in exceptions.allowedValues ||
          capability in variables.allowedCapabilities))
    message: >-
      Any capabilities added beyond the allowed list (AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER,
      FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT)
      are disallowed.
Configurable reporting status
This exception sets reportResult: pass, so when it matches, Policy Reports show “pass” rather than the default “skip”, improving dashboards and SLO signals during planned waivers.
apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
  name: exclude-skipped-deployment-2
  labels:
    polex.kyverno.io/priority: "0.2"
spec:
  policyRefs:
  - name: "with-multiple-exceptions"
    kind: ValidatingPolicy
  matchConditions:
  - name: "check-name"
    expression: "object.metadata.name == 'skipped-deployment'"
  reportResult: pass
Kyverno Authz Server
Beyond enriching admission-time validation, Kyverno now extends policy decisions to your service edge. The Kyverno Authz Server applies Kyverno policies to authorize requests for Envoy (via the External Authorization filter) and for plain HTTP services as a standalone HTTP authorization server, returning allow/deny decisions based on the same policy engine you use in Kubernetes. This unifies policy enforcement across admission, gateways, and services, enabling consistent guardrails and faster adoption without duplicating logic. See the project page for details: kyverno/kyverno-authz.
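For orientation, wiring Envoy's External Authorization filter to such a server could look roughly like the following sketch, assuming the gRPC variant of the filter and an Envoy cluster named kyverno-authz-server (both assumptions; consult the kyverno-authz documentation for the exact configuration):

# Envoy HTTP filter fragment (sketch); cluster name and timeout are assumptions
http_filters:
- name: envoy.filters.http.ext_authz
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
    transport_api_version: V3
    grpc_service:
      envoy_grpc:
        cluster_name: kyverno-authz-server
      timeout: 0.25s
    failure_mode_allow: false   # deny requests if the authorization server is unreachable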
Introducing the Kyverno SDK
Alongside embedding CEL policy evaluation in controllers, CLIs, and CI, there’s now a companion SDK for service-edge authorization. The SDK lets you load Kyverno policies, compile them, and evaluate incoming requests to produce allow/deny decisions with structured results—powering Envoy External Authorization and plain HTTP services without duplicating policy logic. It’s designed for gateways, sidecars, and app middleware with simple Go APIs, optional metrics/hooks, and a path to unify admission-time and runtime enforcement. Note that kyverno-authz is still in active development; start with non-critical paths and add strict timeouts as you evaluate. See the SDK package for details: kyverno SDK.
Other features and enhancements
Label-based reporting configuration
Kyverno now supports label-based report suppression. Add the label reports.kyverno.io/disabled (any value, e.g., “true”) to any policy— ClusterPolicy, CEL policy types, ValidatingAdmissionPolicy, or MutatingAdmissionPolicy—to prevent all reporting (both ephemeral and PolicyReports) for that policy. This lets teams silence noisy or staging policies without changing enforcement; remove the label to resume reporting.
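For example, suppressing reports for a (hypothetically named) policy only requires the label on its metadata; the rest of the policy is unchanged:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: staging-only-checks               # policy name is illustrative
  labels:
    reports.kyverno.io/disabled: "true"   # any value disables reporting for this policy
spec:
  # ... existing rules unchanged; enforcement still applies, only reporting is suppressed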
Use Kyverno CEL libraries in policy matchConditions
Kyverno 1.16 enables Kyverno CEL libraries in policy matchConditions, not just in rule bodies, so you can target when rules run using richer, context-aware checks. These expressions are evaluated by Kyverno but are not used to build admission webhook matchConditions—webhook routing remains unchanged.
Getting started and backward compatibility
Upgrading to Kyverno 1.16
To upgrade to Kyverno 1.16, you can use Helm:
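# If the Kyverno chart repository has not been added yet (assumption about your environment):
helm repo add kyverno https://kyverno.github.io/kyverno/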
helm repo update
helm upgrade --install kyverno kyverno/kyverno -n kyverno --version 3.6.0
Backward compatibility
Kyverno 1.16 remains fully backward compatible with existing ClusterPolicy resources. You can continue running current policies and adopt the new policy types incrementally; once CEL policy types reach GA, the legacy ClusterPolicy API will enter a formal deprecation process following our standard, non‑breaking schedule.
Roadmap
We're building on 1.16 with a clear, low-friction path forward. In 1.17, CEL policy types will be available as v1, and migration tooling and docs will focus on making upgrades routine. We will continue to expand CEL libraries, samples, and performance optimizations, and the SDK and kyverno-authz will keep maturing to unify admission-time and runtime enforcement paths. See the release board for in-flight work and timelines: Release 1.17.0 Project Board.
Conclusion
Kyverno 1.16 marks a pivotal step toward a unified, CEL‑powered policy platform: you can adopt the new policy types in beta today, move enforcement closer to teams with namespaced policies, and gain sharper visibility with native Events and detailed latency metrics. Fine‑grained exceptions make rollouts safer without weakening guardrails, while label‑based report suppression and CEL in matchConditions reduce noise and let you target policy execution precisely.
Looking ahead, the path to v1 and GA is clear, and the ecosystem is expanding with the Kyverno Authz Server and SDK to bring the same policy engine to gateways and services. Upgrade when ready, start with audits where useful, and tell us what you build—your feedback will shape the final polish and the journey to GA.
OpenFGA Becomes a CNCF Incubating Project
The CNCF Technical Oversight Committee (TOC) has voted to accept OpenFGA as a CNCF incubating project.
What is OpenFGA?
OpenFGA is an authorization engine that addresses the challenge of implementing complex access control at scale in modern software applications. Inspired by Google’s global access control system, Zanzibar, OpenFGA leverages Relationship-Based Access Control (ReBAC). This allows developers to define permissions based on relationships between users and objects (e.g., who can view which document). By serving as an external service with an API and multiple SDKs, it centralizes and abstracts the authorization logic out of the application code. This separation of concerns significantly improves developer velocity by simplifying security implementation and ensures that access rules are consistent, scalable, and easy to audit across all services, solving a critical complexity problem for developers building distributed systems.
OpenFGA’s History
OpenFGA was developed by a group of Okta employees, and is the foundation for the Auth0 FGA commercial offering.
The project was accepted as a CNCF Sandbox project in September 2022. Since then, it has been deployed by hundreds of companies and received multiple contributions. Some major moments and updates include:
- 37 companies publicly acknowledge using it in production.
- Engineers from Grafana Labs and GitPod have become official maintainers.
- OpenFGA was invited to present on the Maintainer Track at KubeCon + CloudNativeCon Europe 2025.
- A MySQL storage adapter was contributed by TwinTag, and a SQLite storage adapter was contributed by Grafana Labs.
- OpenFGA started hosting a monthly OpenFGA community meeting in April 2023
- Several developer experience improvements, like:
- New SDKs for Python and Java
- IDE integrations with VS Code and IntelliJ
- A CLI with support for model testing
- A Terraform Provider was donated to the project by Maurice Ackel
- A new caching implementation and multiple performance improvements shipped over the last year.
- OpenFGA added the ListObjects endpoint, which retrieves all resources a user has a specific relation with, and the ListUsers endpoint, which retrieves all users that have a specific relation with a resource.
Further, OpenFGA integrates with multiple CNCF projects:
- OpenTelemetry for tracing and telemetry
- Helm for deployment
- Grafana dashboards for monitoring
- Prometheus for metrics collection
- ArtifactHub for Helm chart distribution
Maintainer Perspective
“Seeing companies successfully deploy OpenFGA in production demonstrates its viability as an authorization solution. Our focus now is on growth. CNCF Incubation provides increased credibility and visibility – attracting a broader set of contributors and helping secure long-term sustainability. We anticipate this phase will help us collectively build the definitive, centralized service for fine-grained authorization that the cloud native ecosystem can continue to trust.”
— Andres Aguiar, OpenFGA Maintainer and Director of Product at Okta
“When Grafana adopted OpenFGA the community was incredibly welcoming, and we’ve been fortunate to collaborate on enhancements like SQLite support. We are excited to work with CNCF to continue the evolution of the OpenFGA platform.”
— Dan Cech, Distinguished Engineer, Grafana Labs
From the TOC
“Authorization is one of the most complex and critical problems in distributed systems, and OpenFGA provides a clean, scalable solution that developers can actually adopt. Its ReBAC model and API-first approach simplify how teams think about access control, removing layers of custom logic from applications. What impressed me most during the due diligence process was the project’s momentum—strong community growth, diverse maintainers, and real-world production deployments. OpenFGA is quickly becoming a foundational building block for secure, cloud native applications.”
— Ricardo Aravena, CNCF TOC Sponsor
“As the TOC Sponsor for OpenFGA’s incubation, I’ve had the opportunity to work closely with the maintainers and see their deep technical rigor and commitment to excellence firsthand. OpenFGA reflects the kind of thoughtful engineering and collaboration that drives the CNCF ecosystem forward. By externalizing authorization through a developer-friendly API, OpenFGA empowers teams to scale security with the same agility as their infrastructure. Throughout the incubation process, the maintainers have been exceptionally responsive and precise in addressing feedback, demonstrating the project’s maturity and readiness for broader adoption. With growing adoption and strong technical foundations, I’m excited to see how the OpenFGA community continues to expand its capabilities and help organizations strengthen access control across cloud native environments.”
— Faseela Kundattil, CNCF TOC Sponsor
Main Components
Some main components of the project include:
- The OpenFGA server designed to answer authorization requests fast and at scale
- SDKs for Go, .NET, JS, Java, Python
- A CLI to interact with the OpenFGA server and test authorization models
- Helm Charts to deploy to Kubernetes
- Integrations with VS Code and Jetbrains
Notable Milestones
- 4,300+ GitHub Stars
- 2,246 Pull Requests
- 459 Issues
- 96 contributors to the main repository, 652 across all repositories
- 89 Releases
Looking Ahead
OpenFGA is a database, and as with any database, there will always be work to improve performance for every type of query. Future roadmap goals are to make it simpler for maintainers to contribute to SDKs; launch new SDKs for Ruby, Rust, and PHP; add support for the AuthZen standard; add new visualization options and open source the OpenFGA playground tool; improve observability; add streaming API endpoints for better performance; and include more robust error handling with new write-conflict options.
You can learn more about OpenFGA here.
As a CNCF-hosted project, OpenFGA is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. OpenFGA joins incubating technologies Backstage, Buildpacks, cert-manager, Chaos Mesh, CloudEvents, Container Network Interface (CNI), Contour, Cortex, CubeFS, Dapr, Dragonfly, Emissary-Ingress, Falco, gRPC, in-toto, Keptn, Keycloak, Knative, KubeEdge, Kubeflow, KubeVela, KubeVirt, Kyverno, Litmus, Longhorn, NATS, Notary, OpenFeature, OpenKruise, OpenMetrics, OpenTelemetry, Operator Framework, Thanos, and Volcano. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.
Self-hosted human and machine identities in Keycloak 26.4
Keycloak is a leading open source solution in the cloud-native ecosystem for Identity and Access Management, a key component of accessing applications and their data.
With the release of Keycloak 26.4, we’ve added features for both machine and human identities. New features focus on security enhancement, deeper integration, and improved server administration. See below for the release highlights, or dive deeper in our Keycloak 26.4 release announcement.
Keycloak recently surpassed 30k GitHub stars and 1,350 contributors. If you’re attending KubeCon + CloudNativeCon North America in Atlanta, stop by and say hi—we’d love to hear how you’re using Keycloak!
What’s New in 26.4
Passwordless user authentication with Passkeys
Keycloak now offers full support for Passkeys. As secure, passwordless authentication becomes the new standard, we’ve made passkeys simple to configure. For environments that are unable to adopt passkeys, Keycloak continues to support OTP and recovery codes. You can find a passkey walkthrough on the Keycloak blog.
Tightened OpenID Connect security with FAPI 2 and DPoP
Keycloak 26.4 implements the Financial-grade API (FAPI) 2.0 standard, ensuring strong security best practices. This includes support for Demonstrating Proof-of-Possession (DPoP), which is a safer way to handle tokens in public OpenID Connect clients.
Simplified deployments across multiple availability zones
Deployment across multiple availability zones or data centers is simplified in 26.4:
- Split-brain detection
- Full support in the Keycloak Operator
- Latency optimizations when Keycloak nodes run in different data centers
Keycloak docs contain a full step-by-step guide, and we published a blog post on how to scale to 2,000 logins/sec and 10,000 token refreshes/sec.
Authenticating applications with Kubernetes service account tokens or SPIFFE
When applications interact with Keycloak via OpenID Connect, each confidential server-side application needs credentials. This usually brings the operational churn of distributing and rotating them regularly.
With 26.4, you can use Kubernetes service account tokens, which are automatically distributed to each Pod when running on Kubernetes. This removes the need to distribute and rotate an extra pair of credentials. For use cases inside and outside Kubernetes, you can also use SPIFFE.
To test this preview feature:
- Enable the features client-auth-federated:v1, spiffe:v1, and kubernetes-service-accounts:v1 (a minimal sketch of one way to do this follows the list).
- Register a Kubernetes or SPIFFE identity provider in Keycloak.
- For a client registered in Keycloak, configure the Client Authenticator in the Credentials tab as Signed JWT – Federated, referencing the identity provider created in the previous step and the expected subject in the JWT.
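As a minimal sketch of the first step for a containerized deployment (assuming Keycloak options are supplied via environment variables; the image tag and the surrounding Deployment are illustrative):

# Fragment of a Keycloak container spec (illustrative; not a complete Deployment)
containers:
- name: keycloak
  image: quay.io/keycloak/keycloak:26.4   # image tag is an assumption
  args: ["start"]
  env:
  - name: KC_FEATURES                     # maps to the --features server option
    value: "client-auth-federated:v1,spiffe:v1,kubernetes-service-accounts:v1"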
Looking ahead
Keycloak’s roadmap includes:
- Improved MCP support to better support AI agents
- Improved token exchange
- Automating administrative tasks using workflows triggered by authentication events.
You can follow our journey at keycloak.org and get involved. Our nightly builds give you early access to Keycloak’s latest features.
Connecting distributed Kubernetes with Cilium and SD-WAN: Building an intelligent network fabric
Learn how Kubernetes-native traffic management and SD-WAN integration can deliver consistent security, observability, and performance across distributed clusters.
The challenge of distributed Kubernetes networking
Modern businesses are rapidly adopting distributed architectures to meet growing demands for performance, resilience, and global reach. This shift is driven by emerging workloads that demand distributed infrastructure: AI/ML model training distributed across GPU clusters, real-time edge analytics processing IoT data streams, and global enterprise operations that require seamless connectivity across on-premises workloads, data centers, cloud providers, and edge locations.
Today businesses are increasingly struggling to ensure secure, reliable and high-performance global connectivity while maintaining visibility across this distributed infrastructure. How do you maintain consistent end-to-end policies when applications traverse multiple network boundaries? How do you optimize performance for latency-sensitive critical applications when they could be running anywhere? And how do you gain clear visibility into application communication across this complex, multi-cluster, multi-cloud landscape?
This is where a modern, integrated approach to networking becomes essential, one that understands both the intricacies of Kubernetes and the demands of wide-area connectivity. Let’s explore a proposal for seamlessly bridging your Kubernetes clusters, regardless of location, while intelligently managing the underlying network paths. Such an integrated approach solves several critical business needs:
- Unified security posture: Consistent policy enforcement from the wide-area network down to individual microservices.
- Optimized performance: Intelligent traffic routing that adapts to real-time conditions and application requirements.
- Global visibility: End-to-end observability across all layers of the network stack.
In this post we discuss how to interconnect Cilium with a Software-Defined Wide Area Network (SD-WAN) fabric to extend Kubernetes-native traffic management and security policies into the underlying network interconnect. Learn how such integration simplifies operations while delivering the performance and security modern distributed workloads demand.
Towards an intelligent network fabric
Imagine a globally distributed service deployed across dozens of locations worldwide. Latency-critical microservices are deployed at the edge, critical workloads run on-premises for data protection, while elastic services leverage public cloud scalability. These components must constantly communicate across cluster boundaries: IoT streams flow to central management, customer data replicates across regions for sovereignty compliance, and real-time analytics span multiple sites.
Bridging Kubernetes and SD-WAN with Cilium
Enter Cilium, a universal networking layer connecting Kubernetes workloads, VMs, and physical servers across clouds, data centers, and edge locations. Simply mark a service as “global” and Cilium ensures its availability throughout your distributed multi-cluster infrastructure (Figure 1). But even single-cluster Kubernetes deployments may benefit from an intelligent WAN interconnect, when different nodes and physical servers of the same cluster may run at multiple geographically diverse locations (Figure 2). No matter at which location a service is running, Cilium intelligently routes and balances load across the entire deployment.
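As a small, hedged sketch of what marking a service as “global” looks like (the service name, namespace, and ports are illustrative; service.cilium.io/global is the Cluster Mesh annotation for sharing a service across clusters):

apiVersion: v1
kind: Service
metadata:
  name: checkout                      # service name and namespace are illustrative
  namespace: shop
  annotations:
    service.cilium.io/global: "true"  # expose this service across all connected clusters
spec:
  selector:
    app: checkout
  ports:
  - port: 80
    targetPort: 8080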
Yet a critical gap remains: controlling how traffic traverses the underlying network interconnect. Modern wide-area SDNs, such as a Cisco Catalyst SD-WAN deployment, could easily deliver the intelligent interconnect these services need by providing performance guarantees and traffic differentiation for SD-WAN tunnels between sites. Unfortunately, leveraging these capabilities in a Kubernetes-native way currently remains a challenge.
Figure 1: Multi-cluster scenario: An SD-WAN connects multiple Kubernetes clusters.
Figure 2: Single-cluster scenario: An SD-WAN fabric interconnects geographically distributed nodes of a single Kubernetes cluster.
We suggest leveraging the concept of a Kubernetes operator to bridge this divide. Continuously monitoring the Kubernetes API, the operator could translate service requirements into SD-WAN policies, automatically mapping inter-cluster traffic to appropriate network paths based on your declared intent. Simply annotate your Kubernetes service manifests, and the operator handles the rest. For the purposes of this guide we will use a minimalist SD-WAN operator. Other SD-WAN operators (such as the AWI Catalyst SD-WAN Operator) offer varying degrees of Kubernetes integration.
The role of a Kubernetes operator
Need guaranteed performance for business-critical global services? One service annotation will route traffic through a dedicated SD-WAN tunnel, bypassing congestion and bottlenecks. Require encryption for sensitive data flows? Another annotation ensures tamper-resistant paths between clusters. In general, such an intelligent cloud SD-WAN interconnect would provide the following features:
- Map services to specific SD-WAN tunnels for optimized routing (see below).
- Provide end-to-end Service Level Objectives (SLOs) across sites and nodes.
- Implement comprehensive monitoring to track service health and performance across the entire network.
- Enable selective traffic inspection or blocking on the interconnect for enhanced security and compliance.
- Isolate tenants’ inter-cluster traffic in distributed multi-tenant workloads.
A Kubernetes operator can bring these capabilities, and many more, into the Kubernetes ecosystem, maintaining the declarative, GitOps-friendly approach cloud-native teams expect.
Enforcing traffic policies with Cilium and Cisco Catalyst SD-WAN
In this guide, we demonstrate how an operator can enforce granular traffic policies for Kubernetes services using Cilium and Cisco Catalyst SD-WAN. The setup ensures secure, prioritized routing for business-critical services while allowing best-effort traffic to use default paths.
We will assume that SD-WAN connectivity is established between the clusters/nodes so that the SD-WAN interconnects all Kubernetes deployment sites (nodes/clusters) and routes pod-to-pod traffic seamlessly across the WAN. We further assume Cilium is configured in Native Routing Mode so that we have full visibility into the traffic that travels between the clusters/nodes in the SD-WAN.
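For reference, a minimal Cilium Helm values sketch for native routing might look like the following (the CIDR is an assumption, and the exact values depend on your Cilium version and topology):

# values.yaml fragment for the Cilium Helm chart (sketch, not a complete configuration)
routingMode: native                 # route pod traffic natively instead of tunneling it
ipv4NativeRoutingCIDR: 10.0.0.0/8   # pod CIDR range reachable without masquerading (assumption)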
Once installed, the SD-WAN operator will automatically generate SD-WAN policies based on your Kubernetes service configurations. This seamless integration ensures that your network policies adapt dynamically as your Kubernetes environment evolves.
To illustrate, let’s look at a demo environment (see Figure 3) featuring:
- A Kubernetes cluster with two nodes deployed in the “single-cluster scenario” (Figure 2).
- Nodes interconnected via Cilium, running over two distinct SD-WAN tunnels.
In this setup, as you define or update Kubernetes services, the operator will automatically program the underlying SD-WAN fabric to enforce the appropriate connectivity and security policies for your workloads.
Figure 3: A simplified demo setup with a single-cluster Kubernetes with two nodes interconnected with Cilium and a modern SD-WAN implementation (such as Cisco Catalyst SD-WAN). Note: This pattern extends seamlessly to multi-cluster deployments.
End-to-end policy enforcement example
Within the cluster, we will deploy two services, each with specific connectivity and security requirements:
- Best-effort service: Designed for non-sensitive workloads, this service leverages standard network connectivity. It is ideal for applications where best-effort delivery is sufficient and there are no stringent security or performance requirements.
- Business service: This service is responsible for handling business-critical traffic that requires reliable performance. To maintain stringent service levels, all traffic for the Business Service must be routed exclusively through the dedicated Business WAN (bizinternet) SD-WAN tunnel. This approach provides optimized network performance and strong isolation from general-purpose traffic, keeping critical communications secure and uninterrupted.
By tailoring network policies to the unique needs of each service, we achieve both operational efficiency for routine workloads and robust protection for sensitive business data.
By default, all traffic crossing the cluster boundary uses the default tunnel. To ensure that the Business Service uses the Business WAN, we just need to add a Kubernetes annotation to the corresponding Kubernetes Service:
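The exact annotation key is defined by the operator in use; with the minimalist demo operator, the annotated Service could look roughly like this (the annotation key and all names below are assumptions for illustration):

apiVersion: v1
kind: Service
metadata:
  name: business-service                        # service name is illustrative
  annotations:
    sdwan.operator.example/tunnel: bizinternet  # hypothetical annotation key; steers traffic to the Business WAN tunnel
spec:
  selector:
    app: business
  ports:
  - port: 443
    targetPort: 8443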
How does this work? The SD-WAN operator watches Service objects, extracts endpoint IPs/ports from the Business Service pods, and dynamically programs the SD-WAN to enforce the business tunnel policy (see Figure 4).
Figure 4: The Kubernetes objects read by the SD-WAN operator to instantiate the SD-WAN rules for the business service.
Meanwhile, Figure 5 illustrates the SD-WAN configuration generated by the SD-WAN operator. The configuration highlights two key aspects:
- Business WAN tunnel enforcement: All traffic destined for the pods of the Business Service is strictly routed through the dedicated bizinternet SD-WAN tunnel. This ensures that business-critical traffic receives optimized performance as it traverses the network.
- Traffic identification: The SD-WAN dynamically identifies Business Service traffic by inspecting the source and destination IP addresses and ports of the service endpoints. This granular detection enables precise policy enforcement, ensuring that only the intended traffic is routed through the secure business tunnel.
Together, these mechanisms provide robust security and fine-grained control over service-specific traffic flows within and across your Kubernetes clusters.
Figure 5: The SD-WAN configuration issued by the operator
Conclusion
By using a Kubernetes operator, it is possible to integrate Cilium and a modern SD-WAN implementation (such as Cisco Catalyst SD-WAN) into a single end-to-end framework that intelligently connects distributed workloads with controlled security and performance. Key takeaways:
- Annotation-driven end-to-end policies: Kubernetes service annotations simplify policy definition, enabling developers to declare intent without needing SD-WAN expertise.
- Automated SD-WAN programming: An SD-WAN operator bridges Kubernetes and the SD-WAN, translating service configurations into real-time network policies.
- Secure multi-tenancy: Critical services are isolated in dedicated tunnels. At the same time, best-effort traffic shares the default tunnel, optimizing security and cost.
Future directions: Observability and SLO awareness
This demo operator, however, demonstrates only the first step by providing just the bare intelligent connectivity features. Future work includes exploring end-to-end observability, monitoring, and tracing tools. Today, Hubble provides an observability layer for Cilium that can show flows from a Kubernetes perspective, while Cisco Catalyst SD-WAN Manager and Cisco Catalyst SD-WAN Analytics provide extended network observability and visibility; the missing piece is a single pane of glass. In addition, further work might consider exposing the SD-WAN SLOs to Kubernetes for automatic service mapping, and extending the framework to new use cases.
Learn more
Feel free to reach out to the authors at the contact details below. Visit cilium.io to learn more about Cilium. More details on Cisco Catalyst SD-WAN can be found on:
https://www.cisco.com/site/us/en/solutions/networking/sdwan/catalyst/index.html
The creators of Cilium and Cisco Catalyst SD-WAN are also hiring! Check out https://jobs.cisco.com/jobs/SearchJobs/sdwan or https://jobs.cisco.com/jobs/SearchJobs/isovalent for their listings.
Authors:
Gábor Rétvári
- Twitter: @littleredspam
- LinkedIn: https://www.linkedin.com/in/GaborRetvari/
Tamás Lévai
- E-mail: [email protected]
- LinkedIn: https://www.linkedin.com/in/tamaslevai/
Kyverno vs Kubernetes policies: How Kyverno complements and completes Kubernetes policy types
Originally posted on Nirmata.com on October 1, 2025
How Kyverno extends and integrates with Kubernetes policies
With the addition of ValidatingAdmissionPolicy and MutatingAdmissionPolicy in Kubernetes, do you still need Kyverno? This post answers the question by providing ten reasons why Kyverno is essential even when you are using Kubernetes policy types.
Introduction
Prior to Kyverno, policy management in Kubernetes was complex and cumbersome. While the need for Policy as Code was clear, initial implementations required learning complex languages and did not implement the full policy as code lifecycle.
Kyverno was created by Nirmata and donated to the CNCF in November 2020. It rapidly gained popularity due to its embrace of Kubernetes resources for policy declarations, its easy-to-use syntax, and the breadth of features that addressed all aspects of policy as code.
Recently, Kubernetes has also introduced native policy types which can be executed directly in the Kubernetes API server. This move validates that policies are a must have for Kubernetes, and now allows critical policies to be executed directly in the API server.
The Kubernetes API server is a critical resource that needs to be extremely efficient. To safely execute policies in the API server, the Kubernetes authors chose CEL (Common Expressions Language) to embed logic in policy YAML declarations. In addition to a familiar syntax, CEL programs can be pre-compiled and execution costs can be pre-calculated.
With these changes in Kubernetes, Kyverno has also evolved to stay true to its mission of providing the best policy engine and tools for Kubernetes-native policy as code.
Kyverno now supports five new policy types, two of which, ValidatingPolicy and MutatingPolicy, are extensions of Kubernetes policy types ValidatingAdmissionPolicy and MutatingAdmissionPolicy, respectively.
NOTE: I will use the term “Kubernetes Policies” to refer to ValidatingAdmissionPolicies and MutatingAdmissionPolicies.
Here is a summary of the Kyverno policy types:
- ValidatingPolicy: This policy type checks if a resource’s configuration adheres to predefined rules and can either enforce or audit compliance. This policy type is an extension of the Kubernetes ValidatingAdmissionPolicy.
- ImageValidatingPolicy: A specialized validating policy that verifies a container image’s signatures and attestations to ensure its integrity and trustworthiness.
- MutatingPolicy: This policy type modifies a resource’s configuration as it’s being created or updated, applying changes like adding labels, annotations, or sidecar containers. This policy is an extension of the Kubernetes MutatingAdmissionPolicy.
- GeneratingPolicy: This policy creates or clones new resources in response to a trigger event, such as automatically generating a NetworkPolicy when a new Namespace is created.
- DeletingPolicy: This policy automatically deletes existing resources that match specific criteria on a predefined schedule, often used for garbage collection or enforcing retention policies.
So, when should you choose to use Kyverno policies vs the Kubernetes policy types? The right answer is that if you believe that declarative Policy as Code is the right way to manage Kubernetes configuration complexity, you will need both!
As you will see below, Kyverno provides critical features that are missing in Kubernetes policies and also helps with policy management at scale.
1. Applying policies on existing resources
When new policies are created, they need to be applied on existing resources. Kubernetes Policies only apply on resource changes, and hence policy violations on existing resources are not reported.
Kyverno applies policies, including Kubernetes Policy types, to all resources.
2. Reapplying policies on changes
Like code, policies change over time. This can be to adapt to updated or new features, or to fix issues in the policy. When a policy changes, it must be re-applied to all resources. Kubernetes Policies are embedded in the API server and not reapplied when the policy changes.
3. Applying policies off cluster (shift-left)
Providing feedback to developers as early as possible in a deployment pipeline is highly desirable and has tangible benefits of time and cost savings. The Kyverno CLI can apply Kyverno and Kubernetes Policy types in CI/CD and IaC pipelines.
4. Testing policy as code
Like all software, policies must be thoroughly tested prior to deployment. Kyverno provides tools for testing Kyverno and Kubernetes policy types. You can use the Kyverno CLI for unit tests, and Kyverno Chainsaw for e2e behavioral tests.
5. Reporting policy results
Kyverno provides integrated reporting: reports are namespaced Kubernetes resources and are hence available to application owners via the Kubernetes API and other tools. Kyverno reports are generated for both Kyverno and Kubernetes policy types.
6. Managing fine-grained policy exceptions
Kyverno allows configuring policy exceptions to exclude some resources from policies. Kyverno exceptions are Kubernetes resources making it possible to view and manage via the Kubernetes API using standard tools.
Kyverno exceptions can specify an image, so you can exclude certain containers in a pod while still applying the policy to other containers. Exceptions can also declare specific values that are allowed. Exceptions can also be time-bound by adding a TTL (time to live).
These powerful capabilities allow enforcing policies broadly and then using exceptions to exclude certain resources, or even parts of a resource declaration.
7. Complex policy logic
Kubernetes policies are designed for simple checks, and can only apply to the admission payload. This is often insufficient, as policies may need to look up other resources or even reference external data. These types of checks are not possible with Kubernetes policies. Additionally, Kubernetes MutatingAdmissionPolicies cannot match sub-resources and apply to a resource.
Kyverno supports features for complex policies, including API lookups and external data management. Kyverno also offers an extended CEL library with useful functions necessary for complex policies.
8. Image verification
Kyverno offers built-in verification of OCI (Open Container Initiative) image and artifact signatures, using Sigstore’s Cosign or CNCF’s Notary projects. This allows implementing software supply chain security use cases and achieving high levels of SLSA (Supply-chain Levels for Software Artifacts).
9. Policy-based automation
Besides validating and mutating resources, policies are an essential tool for automating several complex platform engineering tasks. For example, policies can be used to automatically generate secure defaults, or resources like network policies, on flexible triggers such as a namespace creation or when a label is added. This allows a tight control loop, and can be used to replace custom controllers with declarative and scalable policy as code.
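As a sketch of that pattern using the classic ClusterPolicy generate rule (the new GeneratingPolicy type expresses the same idea in CEL form; the policy name and the default-deny payload below are illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-deny            # policy name is illustrative
spec:
  rules:
  - name: default-deny-ingress
    match:
      any:
      - resources:
          kinds:
          - Namespace               # trigger: a new Namespace is created
    generate:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      name: default-deny-ingress
      namespace: "{{request.object.metadata.name}}"   # place the policy in the new namespace
      synchronize: true
      data:
        spec:
          podSelector: {}
          policyTypes:
          - Ingress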
10. Kyverno everywhere
While Kubernetes Policy types can only be applied to Kubernetes resources, Kyverno policies can be applied to any JSON or YAML payload including Terraform or OpenTofu manifests, other IaC manifests such as CDK, and build artifacts such as Dockerfiles.
Kyverno enables a unified policy as code approach, which is essential for platform engineering teams that manage both Kubernetes clusters, and pipelines for CI/CD and IaC.
Conclusion
Kyverno is fully compatible with Kubernetes policies, and is designed to seamlessly support and extend Kubernetes policy types. It applies Kubernetes policies to existing resources and can also provide policy reporting and exception management for Kubernetes policies.
Like Kubernetes policies, Kyverno policies also use the Common Expressions Language (CEL) and extend the Kubernetes policy declarations with additional fields, and extended CEL libraries, required for complex policies and advanced policy as code use cases.
This allows having a mix of Kubernetes and Kyverno policies managed by Kyverno. You can get started with Kubernetes policies and then upgrade to Kyverno policies for advanced use cases.
If you have existing Kubernetes policies, you can use Kyverno to apply them to existing resources, produce reports, apply the policies off-cluster, and perform unit and behavioral tests.
If you are starting out, you can use Kyverno policy types. Wherever possible, Kyverno will automatically generate and manage Kubernetes policies for optimal performance. Complex policies that cannot be handled in the API server are executed by Kyverno during admission control and periodically as background scans.
Regardless of where you start, with Kyverno you get a powerful and complete policy as code solution for Kubernetes and all your policy-based authorization needs!
Karmada v1.15 Released! Enhanced Resource Awareness for Multi-Template Workloads
Karmada is an open multi-cloud and multi-cluster container orchestration engine designed to help users deploy and operate business applications in a multi-cloud environment. With its compatibility with the native Kubernetes API, Karmada can smoothly migrate single-cluster workloads while still maintaining coordination with the surrounding Kubernetes ecosystem tools.
Karmada v1.15 has been released. This version includes the following new features:
- Precise resource awareness for multi-template workloads
- Enhanced cluster-level failover functionality
- Structured logging
- Significant performance improvements for Karmada controllers and schedulers
Overview of New Features
Precise Resource Awareness for Multi-Template Workloads
Karmada utilizes a resource interpreter to retrieve the replica count and resource requests of workloads. Based on this data, it calculates the total resource requirements of the workloads, thereby enabling advanced capabilities such as resource-aware scheduling and federated quota management. This mechanism works well for traditional single-template workloads. However, many AI and big data application workloads (e.g., FlinkDeployments, PyTorchJobs, and RayJobs) consist of multiple Pod templates or components, each with unique resource demands. Since the resource interpreter can only process resource requests from a single template and fails to accurately reflect differences between multiple templates, the resource calculation for multi-template workloads is not precise enough.
In this version, Karmada has strengthened its resource awareness for multi-template workloads. By extending the resource interpreter, Karmada can now obtain the replica count and resource requests of different templates within the same workload, ensuring data accuracy. This improvement also provides more reliable and granular data support for federated quota management of multi-template workloads.
Suppose you deploy a FlinkDeployment with the following resource-related configuration:
spec:
  jobManager:
    replicas: 1
    resource:
      cpu: 1
      memory: 1024m
  taskManager:
    replicas: 1
    resource:
      cpu: 2
      memory: 2048m

Through ResourceBinding, you can view the replica count and resource requests of each template in the FlinkDeployment parsed by the resource interpreter.
spec:
  components:
  - name: jobmanager
    replicaRequirements:
      resourceRequest:
        cpu: "1"
        memory: "1.024"
    replicas: 1
  - name: taskmanager
    replicaRequirements:
      resourceRequest:
        cpu: "2"
        memory: "2.048"
    replicas: 1

At this point, the resource usage of the FlinkDeployment calculated by FederatedResourceQuota is as follows:
status:
  overallUsed:
    cpu: "3"
    memory: 3072m

Note: This feature is currently in the Alpha stage and requires enabling the MultiplePodTemplatesScheduling feature gate to use.
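For context, the quota object that produces such a status is a namespaced FederatedResourceQuota; a minimal sketch (the name, namespace, and limits here are illustrative assumptions) might look like:

apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
  name: flink-quota          # name, namespace, and limits are illustrative
  namespace: flink-jobs
spec:
  overall:
    cpu: "10"
    memory: 20Gi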
As multi-template workloads are widely adopted in cloud-native environments, Karmada is committed to providing stronger support for them. In upcoming versions, we will further enhance scheduling support for multi-template workloads based on this feature and offer more granular resource-aware scheduling—stay tuned for more updates!
For more information about this feature, please refer to: Multi-Pod Template Support.
Enhanced Cluster-Level Failover Functionality
In previous versions, Karmada provided basic cluster-level failover capabilities, allowing cluster-level application migration to be triggered through custom failure conditions. To meet the requirement of preserving the running state of stateful applications during cluster failover, Karmada v1.15 supports an application state preservation policy for cluster failover. For big data processing applications (e.g., Flink), this capability enables restarting from the pre-failure checkpoint and seamlessly resuming data processing to the state before the restart, thus avoiding duplicate data processing.
The community has introduced a new StatePreservation field under .spec.failover.cluster in the PropagationPolicy/ClusterPropagationPolicy API. This field is used to define policies for preserving and restoring state data of stateful applications during failover. Combined with this policy, when an application is migrated from a failed cluster to another cluster, key data can be extracted from the original resource configuration.
The state preservation policy StatePreservation includes a series of StatePreservationRule configurations. It uses JSONPath to specify the segments of state data that need to be preserved and leverages the associated AliasLabelName to pass the data to the migrated cluster.
Taking a Flink application as an example: in a Flink application, jobID is a unique identifier used to distinguish and manage different Flink jobs. When a cluster fails, the Flink application can use jobID to restore the state of the job before the failure and continue execution from the failure point. The specific configuration and steps are as follows:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  #...
  failover:
    cluster:
      purgeMode: Directly
      statePreservation:
        rules:
        - aliasLabelName: application.karmada.io/cluster-failover-jobid
          jsonPath: "{ .jobStatus.jobID }"
- Before migration, the Karmada controller extracts the job ID according to the path configured by the user.
- During migration, the Karmada controller injects the extracted job ID into the Flink application configuration in the form of a label, such as application.karmada.io/cluster-failover-jobid: <jobID>.
- Kyverno running in the member cluster intercepts the Flink application creation request, obtains the checkpoint data storage path of the job based on the jobID (e.g., /<shared-path>/<job-namespace>/<jobId>/checkpoints/xxx), and then configures initialSavepointPath to indicate starting from the savepoint.
- The Flink application starts based on the checkpoint data under initialSavepointPath, thereby inheriting the final state saved before migration.
This capability is widely applicable to stateful applications that can start from a specific savepoint. These applications can follow the above process to implement state persistence and restoration for cluster-level failover.
Note: This feature is currently in the Alpha stage and requires enabling the StatefulFailoverInjection feature gate to use.
Function Constraints:
- The application must be restricted to run in a single cluster.
- The migration cleanup policy (PurgeMode) is limited to Directly—this means ensuring that the failed application is deleted from the old cluster before being restored in the new cluster to guarantee data consistency.
Structured Logging
Logs are critical tools for recording events, states, and behaviors during system operation, and are widely used for troubleshooting, performance monitoring, and security auditing. Karmada components provide rich runtime logs to help users quickly locate issues and trace execution scenarios. In previous versions, Karmada only supported unstructured text logs, which were difficult to parse and query efficiently, limiting its integration capabilities in modern observability systems.
Karmada v1.15 introduces support for structured logging, which can be configured to output in JSON format using the --logging-format=json startup flag. An example of a structured log entry is as follows:
{ "ts": "<log timestamp>", "logger": "cluster_status_controller", "level": "info", "msg": "Syncing cluster status", "clusterName": "member1" }

The introduction of structured logging significantly improves the usability and observability of logs:
- Efficient Integration: Integration with mainstream logging systems such as Elastic, Loki, and Splunk, without relying on complex regular expressions or log parsers.
- Efficient Query: Structured fields support fast retrieval and analysis, significantly improving troubleshooting efficiency.
- Enhanced Observability: Key context information (e.g., cluster name, log level) is presented as structured fields, facilitating cross-component and cross-time event correlation for accurate issue localization.
- Maintainability: Structured logging makes it easier for developers and operators to maintain, parse, and evolve log formats as the system changes.
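Returning to the startup flag mentioned above: as a minimal sketch, assuming karmada-controller-manager runs as a Deployment in the karmada-system namespace (and omitting the component's usual flags such as kubeconfig settings), JSON logging and an Alpha feature gate would be enabled like this. The image tag and binary path reflect a typical installation and should be adjusted to yours.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: karmada-controller-manager
  namespace: karmada-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: karmada-controller-manager
  template:
    metadata:
      labels:
        app: karmada-controller-manager
    spec:
      containers:
        - name: karmada-controller-manager
          image: docker.io/karmada/karmada-controller-manager:v1.15.0
          command:
            - /bin/karmada-controller-manager
            - --logging-format=json                        # emit structured JSON logs
            - --feature-gates=ControllerPriorityQueue=true # Alpha gate described later in this post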
Significant Performance Improvements for Karmada Controllers and Schedulers
In this version, the Karmada performance optimization team has continued to focus on improving the performance of Karmada’s key components, achieving significant progress in both controllers and schedulers.
On the controller side, the introduction of the controller-runtime priority queue allows the controller to prioritize user-triggered resource changes after a restart or leader transition, significantly reducing downtime during service restarts and failovers.
The test environment included 5,000 Deployments, 2,500 Policies, and 5,000 ResourceBindings. Deployments and Policies were updated while the controller restarted with a large number of pending events still in the work queue. Test results showed that the controller immediately responded to and prioritized these update events, verifying the effectiveness of this optimization.
Note: This feature is currently in the Alpha stage and requires enabling the ControllerPriorityQueue feature gate to use.
On the scheduler side, reducing redundant computation in the scheduling process and decreasing the number of remote calls has significantly improved the scheduling efficiency of the Karmada scheduler.
Tests were conducted to record the time taken to schedule 5,000 ResourceBindings with the precise scheduling component karmada-scheduler-estimator enabled. The results are as follows:
- The scheduler throughput QPS increased from approximately 15 to about 22, representing a 46% performance improvement.
- The number of gRPC requests decreased from approximately 10,000 to around 5,000, a reduction of 50%.
These tests confirm that the performance of Karmada controllers and schedulers has been greatly improved in version 1.15. In the future, we will continue to conduct systematic performance optimizations for controllers and schedulers.
For the detailed test report, please refer to [Performance] Overview of performance improvements for v1.15.
Acknowledging Our Contributors
The Karmada v1.15 release includes 269 code commits from 39 contributors. We would like to extend our sincere gratitude to all the contributors:
@abhi0324, @abhinav-1305, @Arhell, @Bhaumik10, @CaesarTY, @cbaenziger, @deefreak, @dekaihu, @devarsh10, @greenmoon55, @iawia002, @jabellard, @jennryaz, @liaolecheng, @linyao22, @LivingCcj, @liwang0513, @mohamedawnallah, @mohit-nagaraj, @mszacillo, @RainbowMango, @ritzdevp, @ryanwuer, @samzong, @seanlaii, @SunsetB612, @tessapham, @wangbowen1401, @warjiang, @wenhuwang, @whitewindmills, @whosefriendA, @XiShanYongYe-Chang, @zach593, @zclyne, @zhangsquared, @zhuyulicfc49, @zhzhuang-zju, @zzklachlan
References:
[1] Karmada: https://karmada.io/
[2] Karmada v1.15: https://github.com/karmada-io/karmada/releases/tag/v1.15.0
[3] Multi-Pod Template Support: https://github.com/karmada-io/karmada/tree/master/docs/proposals/scheduling/multi-podtemplate-support
[4] [Performance] Overview of performance improvements for v1.15: https://github.com/karmada-io/karmada/issues/6516
[5] Karmada GitHub: https://github.com/karmada-io/karmada
Fluentd to Fluent Bit: A Migration Guide
Fluentd was created over 14 years ago and continues to be one of the most widely deployed technologies for log collection in the enterprise. Fluentd's distributed plugin architecture and highly permissive licensing made it an ideal fit for the Cloud Native Computing Foundation (CNCF), where it is now a graduated project.
However, enterprises drowning in telemetry data are now requiring solutions that have higher performance, more native support for evolving schemas and formats, and increased flexibility in processing. Enter Fluent Bit.
When and why to migrate?
Fluent Bit, which began as a sub-project within the Fluent ecosystem, expanded beyond Fluentd to support all telemetry types: logs, metrics, and traces. Fluent Bit is now the more popular of the two, with over 15 billion deployments, and is used by Amazon, Google, Oracle, and Microsoft, to name a few.
Fluent Bit also is fully aligned with OpenTelemetry signals, format and protocol, which ensures that users will be able to continue handling telemetry data as it grows and evolves.
Among the most frequent questions we get as the maintainers of the projects are:
- How do we migrate?
- What should we watch out for?
- And what business value do we get for migrating?
This article aims to answer these questions with examples. We want to help make it an easy decision to migrate from Fluentd to Fluent Bit.
Why Migrate?
Here is a quick list of the reasons users switch from Fluentd to Fluent Bit:
- Higher performance for the same resources you are already using
- Full OpenTelemetry support for logs, metrics, and traces as well as Prometheus support for metrics
- Simpler configuration and routing ability to multiple locations
- Higher velocity for adding custom processing rules
- Integrated monitoring to better understand performance and dataflows
Fluentd vs. Fluent Bit: What Are the Differences?
Background
To understand the differences between the projects, it helps to understand the background of each and the era it was built for. Fluentd is written primarily in Ruby and was initially designed to help users push data to big data platforms such as Hadoop. The project follows a distributed plugin architecture, where plugins are installed after the main binary is installed and deployed.
Fluent Bit, on the other hand, is written in C, with a focus on hyper performance in smaller systems (containers, embedded Linux). The project learned from Fluentd's plugin model and instead opts for fully embedded plugins that are part of the core binary.
Performance
The obvious difference and main value of switching from Fluentd to Fluent Bit is the performance. With Fluent Bit, the amount of logs you can process with the same resources could be anywhere from 10 to 40 times greater depending on the plugin you are using.
Fluent Bit was written from the ground up to be hyper performant, with a focus on shipping data as fast as possible for analysis. Later on, it proved efficient enough that more edge processing could be added without compromising the mission of keeping the agent as fast as possible.
Routing
Other parts of Fluent Bit evolved from challenges encountered with Fluentd, such as buffering and routing. With Fluentd, multirouting was an afterthought, and users needed to "copy" the data streams to route data to multiple destinations.
This made configuration management a nightmare, in addition to essentially duplicating the resource requirements for routing that data.
In Fluent Bit, buffers are stored once, and multiple plugins can "subscribe" to a stream of data. Data is stored once and subscribed to many times, allowing multirouting without the performance trade-offs or configuration fatigue.
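As a rough sketch of this model in Fluent Bit's YAML configuration, a single tail input can be matched by multiple outputs without duplicating the data upstream; the file path, host, and tag below are placeholders rather than recommended values.

service:
  flush: 1
pipeline:
  inputs:
    # One input, buffered once.
    - name: tail
      path: /var/log/app/*.log
      tag: app.logs
  outputs:
    # Multiple outputs "subscribe" to the same tag.
    - name: opensearch
      match: app.logs
      host: opensearch.example.com
      port: 9200
    - name: stdout
      match: app.logs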
Telemetry signal focus
While Fluentd was initially a data shipper, it grew into a logging agent used within projects such as Kubernetes and at companies like Splunk. Fluent Bit, on the other hand, started as an embedded metrics collector, with log file collection added later. As Fluent Bit adoption grew beyond Fluentd's, capabilities such as OpenTelemetry logs/metrics/traces, Prometheus scrape and remote write support, and eBPF and profiling support were all added.
Today Fluent Bit is aligned with OpenTelemetry schema, formats and protocols and meant to be a lightweight implementation that is highly performant.
Custom processing
Fluentd and Fluent Bit have many of the same processor names, but when it comes to custom processing the options are quite different.
With Fluentd, the option is `enable_ruby`, which allows custom Ruby scripts within a configuration to perform actions. This works effectively for small tasks; however, it incurs a large performance penalty as the logic grows more complicated.
With Fluent Bit, custom processing is done in Lua, which gives tremendous flexibility. Unlike Fluentd's Ruby scripting, Fluent Bit's Lua processor is quite performant and can be used at scale (100+ TB/day).
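For instance, attaching a Lua filter to a stream takes only a few lines of configuration; the script path, function name, and tags here are hypothetical.

pipeline:
  inputs:
    - name: dummy               # generates synthetic records, handy for trying the filter out
      tag: app.demo
  filters:
    - name: lua
      match: app.*
      # transform.lua is assumed to define transform(tag, timestamp, record), returning a
      # status code, the timestamp, and the (possibly modified) record.
      script: /fluent-bit/scripts/transform.lua
      call: transform
  outputs:
    - name: stdout
      match: app.*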
Custom plugins
Both projects allow custom plugins to help you connect with your source or destination. With Fluentd, these custom plugins are “Ruby Gems” that you can download and install into existing or new installations or deployments. With Fluent Bit, custom plugins are written and compiled in Go. There are also new initiatives for writing custom plugins in any language you want and compiling them into WebAssembly.
One lesson we learned from Fluentd's distributed plugin architecture is that the number of plugins can grow very quickly. However, the quality and maintenance required generally left many of those plugins abandoned and unsupported. With Fluent Bit, plugins are all incorporated into the source code itself, which ensures compatibility with every release.
Custom plugins still remain independent of the main repository. However, we are looking at ways to allow these to also share the same benefit of native C plugins within the main GitHub repository.
Monitoring
Understanding how data traverses your environment is generally a top request from users who deploy Fluentd or Fluent Bit. With Fluentd, enabling this could require complicated configuration via "monitor_agent" or a third-party Prometheus exporter plugin. These monitoring plugins also add maintenance overhead for Fluentd, which can affect performance.
Fluent Bit has monitoring as part of its core functionality and is retrievable via a native plugin (`fluentbit_metrics`) or scrapeable on an HTTP port. Fluent Bit’s metrics also incorporate more information than Fluentd’s, which allows you to understand bytes, records, storage and connection information.
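As a small sketch, the built-in HTTP server and the `fluentbit_metrics` input can be enabled as follows; the ports are commonly used defaults, but treat the whole snippet as illustrative.

service:
  http_server: on
  http_listen: 0.0.0.0
  http_port: 2020                 # internal metrics also exposed for scraping here
pipeline:
  inputs:
    - name: fluentbit_metrics     # turns Fluent Bit's own metrics into a stream
      tag: internal_metrics
      scrape_interval: 2
  outputs:
    - name: prometheus_exporter   # exposes those metrics in Prometheus format
      match: internal_metrics
      host: 0.0.0.0
      port: 2021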
How to get started with a Fluentd to Fluent Bit migration
The next question we’re answering is: How do you get started?
The first important step is to understand how Fluentd is deployed, what processing happens in the environment and where data is flowing.
What you don’t need to worry about:
- Architecture support: Both applications support x86 and ARM.
- Platform support: Fluent Bit supports the same platforms as Fluentd does today, and more. Legacy systems may differ; however, those are no longer maintained in either open source project.
- Regular expressions: If you built a large library of regular expressions using the Onigmo parser library, you can rest comfortably knowing that Fluent Bit supports it.
Deployment
Deployed as an Agent (Linux or Windows Package)
When Fluentd is deployed as an agent on Linux or Windows, its primary function is to collect local log files or Windows event logs and route them to a particular destination. Thankfully, Fluent Bit's local collection capabilities are equal to Fluentd's, including the ability to resume on failure, track the last log lines collected, and buffer locally (see the sketch below).
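A minimal sketch of such an agent, with assumed paths: the tail input's db file is what lets Fluent Bit resume from the last read position after a failure, and filesystem storage provides local buffering; the forward destination is a placeholder.

service:
  storage.path: /var/lib/fluent-bit/buffer    # on-disk buffer for restarts and backpressure
pipeline:
  inputs:
    - name: tail
      path: /var/log/app/*.log
      db: /var/lib/fluent-bit/tail.db          # remembers file offsets so collection resumes
      storage.type: filesystem
  outputs:
    - name: forward                            # e.g. ship to an aggregator
      match: '*'
      host: aggregator.example.com
      port: 24224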
Deployed in Kubernetes as a DaemonSet
If Fluentd is running as a DaemonSet in your Kubernetes cluster, you should first check the image that is running. As Fluentd has distributed plugins, the DaemonSet image may have specific plugins included, which ensures you can go directly from reading Kubernetes logs to the end destination.
For example, a Fluentd DaemonSet image may include OpenSearch and Kafka output plugins; validate that Fluent Bit supports the same plugins before migrating (see the sketch below). Fluent Bit also supports Kubernetes enrichment on all logs, adding data such as namespace, pod name, and labels.
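A sketch of an equivalent Fluent Bit pipeline for that kind of DaemonSet image, assuming CRI-formatted container logs and placeholder OpenSearch and Kafka endpoints:

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: cri
  filters:
    - name: kubernetes            # enriches records with namespace, pod, labels, and more
      match: kube.*
  outputs:
    - name: opensearch
      match: kube.*
      host: opensearch.example.com
      port: 9200
    - name: kafka
      match: kube.*
      brokers: kafka.example.com:9092
      topics: kubernetes-logs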
Deployed as an Aggregator / Collector
If Fluentd is deployed as an aggregator or collector, receiving logs from syslog, network devices, or HTTP requests, first verify that Fluent Bit has the same capability. For example, Fluent Bit has syslog, TCP, HTTP, and UDP input plugins that cover the majority of these use cases.
In addition, Fluent Bit can also receive OpenTelemetry (HTTP/1 and gRPC), Prometheus remote write, HTTP gzip, and Splunk HTTP Event Collector (HEC) traffic as additional inbound signals (see the sketch below).
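For example, an aggregator-style Fluent Bit instance might open several listeners at once; the ports below are common defaults used purely for illustration, and the stdout output is a stand-in for a real destination.

pipeline:
  inputs:
    - name: syslog
      mode: tcp
      listen: 0.0.0.0
      port: 5140
      tag: syslog
    - name: http
      listen: 0.0.0.0
      port: 8888
      tag: http.ingest
    - name: opentelemetry         # accepts OTLP over HTTP
      listen: 0.0.0.0
      port: 4318
      tag: otel
  outputs:
    - name: stdout
      match: '*'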
Adding a Telemetry Pipeline
When migrating from Fluentd to Fluent Bit, we would also recommend adding a Telemetry Pipeline between the agents and the destinations. This allows you to move larger pieces of processing logic out of the Fluentd agents and downstream into the pipeline.
Configuration
The configuration syntax of Fluentd and Fluent Bit is vastly different. While both have started to support YAML more recently, most legacy Fluentd configurations are still written in Fluentd's XML-like domain-specific configuration language.
Some general notes:
- Validate a single plugin at a time, then expand to a single route (such as system logs to OpenSearch).
- Buffering and thread settings are not as important within Fluent Bit.
- Security settings should be similar.
When in doubt, reaching out to the Fluent community is useful in helping with some of the more granular settings.
Custom Plugins
When migrating, it’s important to ensure that Fluent Bit supports all plugins (sources and destinations). You should also check that it supports particular settings around authentication, authorization or access. This will be a manual process that can take some time. However, this will also allow you a chance to revisit decisions on specific data formats or plugin settings that you made in the past.
Custom Processing Logic
If you have labels, filters, or other processing logic within Fluentd, note down the functionality you are trying to achieve. While simply swapping those filters over may seem easiest, you should also look at ways to migrate that logic directly into Fluent Bit processors. If you have a significant amount of custom Ruby, large language models (LLMs) can help convert it into suitable Lua.
Migrating Portions at a Time
You don’t need to migrate all your functionality at once. Because Fluent Bit is lightweight and performant, you can look at ways to have each agent handle different portions of the workload. Over time you can follow the logic above to continue migrating without having to worry about log collection disruptions.
Conclusion
While migrating from Fluentd to Fluent Bit might seem like an enormous task, you have many options for how to approach the migration and where to focus for the highest impact. Of course, migrations are also a great time to re-evaluate existing logic and even introduce new architecture patterns such as a telemetry pipeline.
If you are looking for guided or assisted help, let me know. I have helped many folks migrate from Fluentd to Fluent Bit and even assisted with modernizing certain portions to a telemetry pipeline.
Frequently Asked Questions
Why migrate from Fluentd to Fluent Bit?
With Fluent Bit you will get higher performance for the same resources you are already using; full OpenTelemetry support for logs, metrics, and traces as well as Prometheus support for metrics; simpler configuration and routing ability to multiple locations; higher velocity for adding custom processing rules; and integrated monitoring to better understand performance and dataflows.
What are some differences between Fluentd and Fluent Bit?
Fluentd is written primarily in Ruby and was initially designed to help users push data to big data platforms such as Hadoop. Fluent Bit, meanwhile, is written in C, with a focus on hyper performance in smaller systems (containers, embedded Linux).
Can Fluentd and Fluent Bit work together?
Yes, Fluent Bit and Fluentd can work together, which means it's possible to capture data from more sources by using Fluentd and feed it into a Fluent Bit deployment. The Forward protocol is a defined standard that both Fluent Bit and Fluentd implement (see the sketch below). Some external products have also adopted this protocol so they can be connected directly to Fluent Bit.
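As a small sketch of that interop, Fluent Bit can listen for Forward-protocol traffic sent by an existing Fluentd deployment; the port is the conventional default and the stdout output is a placeholder.

pipeline:
  inputs:
    - name: forward               # accepts data sent by Fluentd's forward output
      listen: 0.0.0.0
      port: 24224
  outputs:
    - name: stdout
      match: '*'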
Autonomous Testing of etcd’s Robustness
As a critical component of many production systems, including Kubernetes, the etcd project’s first priority is reliability. Ensuring consistency and data safety requires our project contributors to continuously improve testing methodologies. In this article, we describe how we use advanced simulation testing to uncover subtle bugs, validate the robustness of our releases, and increase our confidence in etcd’s stability. We’ll share our key findings and how they have improved etcd.
Enhancing etcd’s Robustness Testing
Many critical software systems depend on etcd to be correct and consistent, most notably as the primary datastore for Kubernetes. After some issues with the v3.5 release, the etcd maintainers developed a new robustness testing framework to better test for correctness under various failure scenarios. To further enhance our testing capabilities, we integrated a deterministic simulation testing platform from Antithesis into our workflow.
The platform works by running the entire etcd cluster inside a deterministic hypervisor. This specialized environment gives the testing software complete control over every source of non-determinism, such as network behavior, thread scheduling, and system clocks. This means any bug it discovers can be perfectly and reliably reproduced.
Within this simulated environment, the testing methodology shifts away from traditional, scenario-based tests. Instead of writing tests imperatively with strict assertions for one specific outcome, this approach uses declarative, property-based assertions about system behavior. These properties are high-level invariants about the system that must always hold true. For example, “data consistency is never violated” or “a watch event is never dropped.”
The platform then treats these properties not as passive checks, but as targets to break. It combines automated exploration with targeted fault injection, actively searching for the precise sequence of events and failures that will cause a property to be violated. This active search for violations is what allows the platform to uncover subtle bugs that result from complex combinations of factors. Antithesis refers to this approach as Autonomous Testing.
This builds upon etcd's existing robustness tests, which also use a property-based approach. However, without a deterministic environment or automated exploration, the original framework resembled throwing darts while blindfolded and hoping to hit the bullseye: a bug might be found, but the process relied heavily on random chance and was difficult to reproduce. Antithesis's deterministic simulation and active exploration remove the blindfold, enabling a systematic and reproducible search for bugs.
How We Tested
Our goals for this testing effort were to:
- Validate the robustness of etcd v3.6.
- Improve etcd’s software quality by finding and fixing bugs.
- Enhance our existing testing framework with autonomous testing.
We ran our existing robustness tests on the Antithesis simulation platform, testing a 3-node and a 1-node etcd cluster against a variety of faults, including:
- Network faults: latency, congestion, and partitions.
- Container-level faults: thread pauses, process kills, clock jitter, and CPU throttling.
We tested older versions of etcd with known bugs to validate the testing methodology, as well as our stable releases (3.4, 3.5, 3.6) and the main development branch. In total, we ran 830 wall-clock hours of testing, which simulated 4.5 years of usage.
What We Found
The results were impressive. The simulation testing not only found all the known bugs we tested for but also uncovered several new issues in our main development branch.
Here are some of the key findings:
- A critical watch bug was discovered that our existing tests had missed. This bug was present in all stable releases of etcd.
- All known bugs were found, giving us confidence in the ability of the combined testing approach to find regressions.
- Our own testing was improved by revealing a flaw in our linearization checker model.
Issues in the Main Development Branch
| Description | Report Link | Status | Impact | Details |
| --- | --- | --- | --- | --- |
| Watch on future revision might receive old events | Triage Report | Fixed in 3.6.2 (#20281) | Medium | New bug discovered by Antithesis |
| Watch on future revision might receive old notifications | Triage Report | Fixed in 3.6.2 (#20221) | Medium | New bug discovered by both Antithesis and robustness tests |
| Panic when two snapshots are received in a short period | Triage Report | Open | Low | Previously discovered by robustness tests |
| Panic from db page expected to be 5 | Triage Report | Open | Low | New bug discovered by Antithesis |
| Operation time based on watch response is incorrect | Triage Report | Fixed test on main branch (#19998) | Low | Bug in robustness tests discovered by Antithesis |
Known Issues
Antithesis also successfully found and reproduced these known issues in older releases – the “Brown M&Ms” set by the etcd maintainers.
| Description | Report Link |
| --- | --- |
| Watch dropping an event when compacting on delete | Triage Report |
| Revision decreasing caused by crash during compaction | Triage Report |
| Watch progress notification not synced with stream | Triage Report |
| Inconsistent revision caused by crash during defrag | Triage Report |
| Watchable runlock bug | Triage Report |
Conclusion
The integration of this advanced simulation testing into our development workflow has been a success. It has allowed us to find and fix critical bugs, improve our existing testing framework, and increase our confidence in the reliability of etcd. We will continue to leverage this technology to ensure that etcd remains a stable and trusted distributed key-value store for the community.
CNCF’s Helm Project Remains Fully Open Source and Unaffected by Recent Vendor Deprecations
Recently, users may have seen the news about Broadcom (Bitnami) regarding upcoming deprecations of their publicly available container images and Helm Charts. These changes, which will take effect by September 29, 2025, mark a shift to a paid subscription model for Bitnami Secure Images and the removal of many free-to-use artifacts from public registries.
We want to be clear: these changes do not impact the Helm project itself.
Helm is a graduated project that will remain under the CNCF. It continues to be fully open source, Apache 2.0 licensed, and governed by a neutral community. The CNCF community retains ownership of all project intellectual property per our IP policy, ensuring no single vendor can alter its open governance model.
While “Helm charts” refer broadly to a packaging format that anyone can use to deploy applications on Kubernetes, Bitnami Helm Charts are a specific vendor-maintained implementation. Developed and maintained by the Bitnami team (now part of Broadcom), these charts are known for their ease of use, security features, and reliability. Bitnami’s decision to deprecate its public chart and image repositories is entirely separate from the Helm project itself.
Users currently depending on Bitnami Helm Charts should begin exploring migration or mirroring strategies to avoid potential disruption.
The Helm community is actively working to support users during this transition, including guidance on:
- Updating chart dependencies
- Exploring alternative chart sources
- Migrating to maintained open image repositories
We encourage users to follow the Helm blog and Helm GitHub for updates and support resources.
CNCF remains committed to maintaining the integrity of our open source projects and supporting communities through transitions like this. This event also reinforces the importance of vendor neutrality and resilient infrastructure design—a principle at the heart of our mission.
For any media inquiries, please contact: [email protected]
Metal3.io becomes a CNCF incubating project
The CNCF Technical Oversight Committee (TOC) has voted to accept Metal3.io as a CNCF incubating project. Metal3.io joins a growing ecosystem of technologies tackling real-world challenges at the edge of cloud native infrastructure.
What is Metal3.io?
The Metal3.io project (pronounced: “Metal Kubed”) provides components for bare metal host management with Kubernetes. You can enroll your bare metal machines, provision operating system images, and then, if you like, deploy Kubernetes clusters to them. From there, operating and upgrading your Kubernetes clusters can be handled by Metal3.io. Moreover, Metal3.io is itself a Kubernetes application, so it runs on Kubernetes and uses Kubernetes resources and APIs as its interface.
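As a minimal, illustrative sketch of that workflow, enrolling a machine means creating a BareMetalHost resource that points at its baseboard management controller (BMC) and the OS image to provision; all names, addresses, credentials, and URLs below are placeholders.

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: metal3
spec:
  online: true
  bootMACAddress: "00:11:22:33:44:55"
  bmc:
    address: ipmi://192.168.111.10:623       # BMC endpoint for this host
    credentialsName: worker-0-bmc-secret     # Secret containing the BMC username/password
  image:
    url: http://images.example.com/ubuntu-22.04.qcow2
    checksum: http://images.example.com/ubuntu-22.04.qcow2.sha256sum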
Metal3.io is also one of the providers for the Kubernetes subproject Cluster API. Cluster API provides infrastructure-agnostic Kubernetes lifecycle management, and Metal3.io brings the bare metal implementation.
Key Milestones and Ecosystem Growth
The project was started in 2019 by Red Hat and was quickly joined by Ericsson. Metal3.io then joined the CNCF sandbox in September 2020.
Metal3.io has steadily matured and grown during the sandbox phase, with:
- 57 active contributing organizations, led by Ericsson and Red Hat.
- An active community organizing weekly online meetings with working group updates, issue triaging, design discussions, etc.
- Organizations such as Fujitsu, Ikea, SUSE, Ericsson, and Red Hat among the growing list of adopters.
- New features and API iterations, including IP address management, node reuse, firmware settings and updates management both at provisioning time and on day 2, as well as remediation for bare metal hosts.
- A new operator, called the Ironic Standalone Operator, has been introduced to replace the shell-based deployment method for Ironic.
- Added robust security processes, regular scans of dependencies, a vulnerability disclosure process, and automated dependency updates.
Integrations Across the Cloud Native Landscape
Metal3.io connects seamlessly with many CNCF projects, including:
- Kubernetes: Metal3.io builds on the success of Kubernetes and makes use of CustomResourceDefinitions
- Cluster API: Turn the bare metal servers into Kubernetes clusters
- Cert-manager: Certificates for webhooks, etc.
- Ironic: Handles the hardware for Metal3.io by interacting with baseboard management controllers
- Prometheus: Metal3.io exposes metrics in a format that Prometheus can scrape
Technical Components
- Baremetal Operator (BMO): Exposes parts of the Ironic API as a Kubernetes native API
- Cluster API Provider Metal³ (CAPM3): Provides integration with Cluster API
- IP Address Manager (IPAM): Handles IP addresses and pools
- Ironic Standalone Operator (IrSO): Makes it easy to deploy Ironic on Kubernetes
- Ironic-Image: Container image for Ironic
Community Highlights
- 1523 GitHub Stars
- 8368 merged pull requests
- 1434 issues
- 186 contributors
- 187 Releases
Maintainer Perspective
“As a maintainer of the Metal3.io project, I’m proud of its growth towards becoming one of the leading solutions for running Kubernetes on bare metal. I take pride in how it has evolved beyond provisioning bare metal only to support broader lifecycle needs, ensuring users can sustain and operate their bare metal deployments effectively. Equally rewarding has been seeing the community come together to establish strong processes and governance, positioning Metal3.io for CNCF incubation.”
—Kashif Khan, Maintainer, Metal3.io
“Metal3.io is a testament to the power of collaboration across open source communities. It marries the battle-tested hardware support of the Ironic project with the Kubernetes API paradigm, using a lightweight Kubernetes-native deployment model. I am delighted to see it begin incubation with CNCF. I have no doubt that the forum the Metal3.io project provides will continue to drive progress in integration between Kubernetes and bare metal.”
—Zane Bitter, Maintainer, Metal3.io
From the TOC
“Metal3.io addresses a critical need for cloud native infrastructure by making bare metal as manageable and Kubernetes-native as any other platform. The project’s steady growth, technical maturity, and strong integration with the Kubernetes ecosystem made it a clear choice for incubation. We’re excited to support Metal3.io as it continues to empower organizations deploying Kubernetes at the edge and beyond.”
— Ricardo Rocha, TOC Sponsor
Looking Ahead
Metal3.io’s roadmap for 2025 includes:
- New API revisions for CAPM3, BMO, and IPAM
- Maturing IPAM as a Cluster API IPAM provider
- Multi-tenancy support
- Support for architectures other than x86_64, i.e., ARM
- Improve DHCP-less provisioning
- Simplifying Ironic deployment with IrSO
As a CNCF-hosted project, Metal3.io is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. Metal3.io joins incubating technologies ArtifactHUB, Backstage, Buildpacks, Chaos Mesh, Cloud Custodian, Container Network Interface (CNI), Contour, Cortex, Crossplane, Dragonfly, Emissary-Ingress, Flatcar, gRPC, Karmada, Keptn, Keycloak, Knative, Kubeflow, Kubescape, KubeVela, KubeVirt, Kyverno, Litmus, Longhorn, NATS, Notary, OpenCost, OpenFeature, OpenKruise, OpenTelemetry, OpenYurt, Operator Framework, Strimzi, Thanos, Volcano, and wasmCloud. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.
We look forward to seeing how Metal3.io continues to evolve with the backing of the CNCF community.
Learn more: https://www.cncf.io/projects/metal%C2%B3/