CNCF Projects

Istio at KubeCon + CloudNativeCon North America 2025: Community highlights and project progress

CNCF Blog Projects Category - Mon, 12/08/2025 - 05:36

KubeCon + CloudNativeCon North America 2025 lit up Atlanta from November 10–13, bringing together one of the largest gatherings of open-source practitioners, platform engineers, and maintainers across the cloud native ecosystem. For the Istio community, the week was defined by packed rooms, long hallway conversations, and a genuine sense of shared progress across service mesh, Gateway API, security, and AI-driven platforms.

Before the main conference began, the community kicked things off with Istio Day on November 10, a colocated event filled with deep technical sessions, migration stories, and future-looking discussions that set the tone for the rest of the week.

Istio Day at KubeCon + CloudNativeCon NA

Istio Day brought together practitioners, contributors, and adopters for an afternoon of learning, sharing, and open conversations about where service mesh, and Istio, are headed next.

Istio Day

Istio Day opened with welcome remarks from the program co-chairs, setting the tone for an afternoon focused on real-world mesh evolution and the rapid growth of the Istio community. The agenda highlighted three major themes driving Istio’s future: AI-driven traffic patterns, the advancement of Ambient Mesh (including multicluster adoption), and modernizing traffic entry with Gateway API. Speakers across the ecosystem shared practical lessons on scaling, migration, reliability, and operating increasingly complex workloads with Istio.

The co-chairs closed the day by recognizing the speakers, contributors, and a community continuing to push service-mesh innovation forward. Recordings of all sessions are available at the CNCF YouTube channel.

 Is Your Service Mesh AI Ready

Istio at KubeCon + CloudNativeCon

Outside of Istio Day, the project was highly visible across KubeCon + CloudNativeCon Atlanta, with maintainers, end users, and contributors sharing technical deep dives, production stories, and cutting-edge research. Istio appeared not only across expo booths and breakout sessions, but also throughout several of the keynotes, where companies showcased how Istio plays a critical role in powering their platforms at scale.

Istio at KubeCon Keynotes

The week’s momentum hit its stride when the Istio community reconvened for the Istio Project Update, where project leads shared the latest releases and roadmap advances. In Istio: Set Sailing With Istio Without Sidecars, attendees explored how the sidecar-less Ambient Mesh architecture is rapidly moving from experiment to adoption, opening new possibilities for simpler deployments and leaner data planes.

The session Lessons Applied Building a Next-Generation AI Proxy took the crowd behind the scenes of how mesh technologies adapt to AI-driven traffic patterns. Over at Automated Rightsizing for Istio DaemonSet Workloads (Poster Session), practitioners gathered to compare strategies for optimizing control-plane resources, tuning for high scale, and reducing cost without sacrificing performance.

The narrative of traffic-management evolution featured prominently in Gateway API: Table Stakes and its faster sibling Know Before You Go! Speedrun Intro to Gateway API. Meanwhile, Return of the Mesh: Gateway API’s Epic Quest for Unity scaled that conversation: how traffic, API, mesh, and routing converge into one architecture that simplifies complexity rather than multiplies it.

For long-term reflection, 5 Key Lessons From 8 Years of Building Kgateway delivered hard-earned wisdom from years of system design. In GAMMA in Action: How Careem Migrated To Istio Without Downtime, the real-world migration story—a major production rollout that stayed up during transition—provided a roadmap for teams seeking safe mesh adoption at scale.

Safety and rollout risks took center stage in Taming Rollout Risks in Distributed Web Apps: A Location-Aware Gradual Deployment Approach, where strategies for regional rollouts, steering traffic, and minimizing user impact were laid out.

Finally, operations and day-two reality were tackled in End-to-End Security With gRPC in Kubernetes and On-Call the Easy Way With Agents, reminding everyone that mesh isn’t just about architecture, but about how teams run software safely, reliably, and confidently.

Community spaces: ContribFest, Maintainer Track and the Project Pavilion

At the Project Pavilion, the Istio kiosk was constantly buzzing, drawing users with questions about Ambient Mesh, AI workloads, and deployment best practices.

Istio at the Project Pavilion

The Maintainer Track brought contributors together to collaborate on roadmap topics, triage issues, and discuss key areas of investment for the next year.

Istio maintainers

At ContribFest, new contributors joined maintainers to work through good-first issues, discuss contribution pathways, and get their first PRs lined up.

Istio ContribFest Collaboration


Istio maintainers recognized at the CNCF Community Awards

This year’s CNCF Community Awards were a proud moment for the project. Two Istio maintainers received well-deserved recognition:

Daniel Hawton — “Chop Wood, Carry Water” Award

John Howard — Top Committer Award

Istio at CNCF Community Awards

Beyond these awards, Istio was also represented prominently in conference leadership. Faseela K, one of the KubeCon + CloudNativeCon NA co-chairs and an Istio maintainer, participated in a keynote panel on Cloud Native for Good. During closing remarks, it was also announced that Lin Sun, another long-time Istio maintainer, will serve as an upcoming KubeCon + CloudNativeCon co-chair.

Istio Leadership on Keynote Stage

What we heard in Atlanta

Across sessions, kiosks, and hallways, a few themes emerged:

  • Ambient Mesh is moving quickly from exploration to real-world adoption.
  • AI workloads are reshaping traffic patterns and operational practices.
  • Multicluster deployments are becoming standard, with stronger focus on identity and failover.
  • Gateway API is solidifying as the future of modern traffic management.
  • Contributor growth is accelerating, supported by ContribFest and hands-on community guidance.

Looking ahead

KubeCon + CloudNativeCon NA 2025 showcased a vibrant, rapidly growing community taking on some of the toughest challenges in cloud infrastructure—from AI traffic management to zero-downtime migrations, from planet-scale control planes to the next generation of sidecar-less mesh. As we look ahead to 2026, the momentum from Atlanta makes one thing clear: the future of service mesh is bright, and the Istio community is leading it together.

See you in Amsterdam!

Categories: CNCF Projects

Linkerd Edge Release Roundup: December 2025

Linkerd Blog - Sun, 12/07/2025 - 19:00

Welcome to the excessively large December 2025 Edge Release Roundup post, where we dive into the most recent edge releases to help keep everyone up to date on the latest and greatest! This post covers edge releases from September through November 2025 (the runup to KubeCon was hectic around here).

How to give feedback

Edge releases are a snapshot of our current development work on main; by definition, they always have the most recent features but they may have incomplete features, features that end up getting rolled back later, or (like all software) even bugs. That said, edge releases are intended for production use, and go through a rigorous set of automated and manual tests before being released. Once released, we also document whether the release is recommended for broad use – and when needed, we go back and update the recommendations.

Categories: CNCF Projects

Visualizing Target Relabeling Rules in Prometheus 3.8.0

Prometheus Blog - Mon, 12/01/2025 - 19:00

Prometheus' target relabeling feature allows you to adjust the labels of a discovered target or even drop the target entirely. Relabeling rules, while powerful, can be hard to understand and debug. Your rules have to match the expected labels that your service discovery mechanism returns, and getting any step wrong could label your target incorrectly or accidentally drop it.

To help you figure out where things go wrong (or right), Prometheus 3.8.0 just added a relabeling visualizer to the Prometheus server's web UI that allows you to inspect how each relabeling rule is applied to a discovered target's labels. Let's take a look at how it works!

Using the relabeling visualizer

If you head to any Prometheus server's "Service discovery" page (for example: https://demo.promlabs.com/service-discovery), you will now see a new "show relabeling" button for each discovered target:

Service discovery page screenshot

Clicking this button shows you how each relabeling rule is applied to that particular target in sequence:

The visualizer shows you:

  • The initial labels of the target as discovered by the service discovery mechanism.
  • The details of each relabeling rule, including its action type and other parameters.
  • How the labels change after each relabeling rule is applied, with changes, additions, and deletions highlighted in color.
  • Whether the target is ultimately kept or dropped after all relabeling rules have been applied.
  • The final output labels of the target if it is kept.

To debug your relabeling rules, you can now read this diagram from top to bottom and find the exact step where the labels change in an unexpected way or where the target gets dropped. This should help you identify misconfigurations in your relabeling rules more easily.
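For orientation, here is a minimal sketch of the kind of rule chain the visualizer steps through. It is not taken from the Prometheus post: the job name and the use of Kubernetes pod discovery are assumptions chosen for illustration.

```yaml
scrape_configs:
  - job_name: example-pods              # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Step 1: keep only pods annotated prometheus.io/scrape: "true";
      # every other discovered target is dropped here.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Step 2: copy the discovered namespace into a regular "namespace" label
      # (the default action is "replace").
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Step 3: drop anything running in kube-system.
      - source_labels: [namespace]
        action: drop
        regex: kube-system
```

Read against the visualizer, each of the three rules would appear as one step, showing either the label added in step 2 or the point at which the target is dropped.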

Conclusion

The new relabeling visualizer in the Prometheus server's web UI is a powerful tool to help you understand and debug your target relabeling configurations. By providing a step-by-step view of how each relabeling rule affects a target's labels, it makes it easier to identify and fix issues in your setup. Update your Prometheus servers to 3.8.0 now to give it a try!

Categories: CNCF Projects

Announcing Kyverno release 1.16

CNCF Blog Projects Category - Wed, 11/26/2025 - 15:44

Kyverno 1.16 delivers major advancements in policy as code for Kubernetes, centered on a new generation of CEL-based policies now available in beta with a clear path to GA. This release introduces partial support for namespaced CEL policies to confine enforcement and minimize RBAC, aligning with least-privilege best practices. Observability is significantly enhanced with full metrics for CEL policies and native event generation, enabling precise visibility and faster troubleshooting. Security and governance get sharper controls through fine-grained policy exceptions tailored for CEL policies, and validation use cases broaden with the integration of an HTTP authorizer into ValidatingPolicy. Finally, we’re debuting the Kyverno SDK, laying the foundation for ecosystem integrations and custom tooling.

CEL policy types

CEL policies in beta

CEL policy types are introduced as v1beta1. The promotion plan provides a clear, non-breaking path: v1 will be made available in 1.17, with GA targeted for 1.18. This release includes the cluster-scoped family (Validating, Mutating, Generating, Deleting, ImageValidating) at v1beta1 and adds namespaced variants of the validating, deleting, and image-validating types; namespaced Generating and Mutating will follow in 1.17. PolicyException and GlobalContextEntry will advance in step to keep versions aligned; see the promotion roadmap in this tracking issue.

Namespaced policies

Kyverno 1.16 introduces three namespaced CEL policy types (NamespacedValidatingPolicy, NamespacedDeletingPolicy, and NamespacedImageValidatingPolicy), which mirror their cluster-scoped counterparts but apply only within the policy’s namespace. This lets teams enforce guardrails with least-privilege RBAC and without central changes, improving multi-tenancy and safety during rollout. Choose namespaced types for team-owned namespaces and cluster-scoped types for global controls.
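As a rough sketch of what a team-owned policy could look like, the manifest below assumes that NamespacedValidatingPolicy accepts the same spec fields as the cluster-scoped ValidatingPolicy shown elsewhere in this post. The policy name, namespace, label rule, and the validationActions setting are illustrative assumptions, so check them against the Kyverno 1.16 reference before use.

```yaml
apiVersion: policies.kyverno.io/v1beta1
kind: NamespacedValidatingPolicy
metadata:
  name: require-team-label        # hypothetical policy name
  namespace: team-a               # the policy only applies inside this namespace
spec:
  validationActions: [Audit]      # assumed field; report violations without blocking
  matchConstraints:
    resourceRules:
      - apiGroups:   [apps]
        apiVersions: [v1]
        operations:  [CREATE, UPDATE]
        resources:   [deployments]
  validations:
    - message: "Deployments in this namespace must carry a 'team' label."
      expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
```

Because the resource is namespaced, the team only needs RBAC on this policy kind in its own namespace rather than cluster-wide permissions.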

Observability upgrades

CEL policies now have comprehensive, native observability for faster diagnosis:

  • Validating policy execution latency
    • Metrics: kyverno_validating_policy_execution_duration_seconds_count, …_sum, …_bucket
    • What it measures: Time spent evaluating validating policies per admission/background execution as a Prometheus histogram.
    • Key labels: policy_name, policy_background_mode, policy_validation_mode (enforce/audit), resource_kind, resource_namespace, resource_request_operation (create/update/delete), execution_cause (admission_request/background_scan), result (PASS/FAIL).
  • Mutating policy execution latency
    • Metrics: kyverno_mutating_policy_execution_duration_seconds_count, …_sum, …_bucket
    • What it measures: Time spent executing mutating policies (admission/background) as a Prometheus histogram.
    • Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
  • Generating policy execution latency
    • Metrics: kyverno_generating_policy_execution_duration_seconds_count, …_sum, …_bucket
    • What it measures: Time spent executing generating policies when evaluating requests or during background scans.
    • Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
  • Image-validating policy execution latency
    • Metrics: kyverno_image_validating_policy_execution_duration_seconds_count, …_sum, …_bucket
    • What it measures: Time spent evaluating image-related validating policies (e.g., image verification) as a Prometheus histogram.
    • Key labels: policy_name, policy_background_mode, resource_kind, resource_namespace, resource_request_operation, execution_cause, result.
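Since these metrics are standard Prometheus histograms, one way to put them to use is a recording rule that tracks tail latency per policy. The rule file below is a sketch: the group name, the recorded series name, the 95th percentile, and the 5-minute window are illustrative choices, not something Kyverno ships.

```yaml
groups:
  - name: kyverno-cel-policy-latency            # hypothetical group name
    rules:
      # p95 evaluation latency per validating policy over the last 5 minutes.
      - record: kyverno:validating_policy_execution_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, policy_name) (
              rate(kyverno_validating_policy_execution_duration_seconds_bucket[5m])
            )
          )
```

The same pattern should apply to the mutating, generating, and image-validating histograms by swapping the metric name.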

CEL policies now emit Kubernetes Events for passes, violations, errors, and compile/load issues with rich context (policy/rule, resource, user, mode). This provides instant, kubectl-visible feedback and easier correlation with admission decisions and metrics during rollout and troubleshooting.

Fine-grained policy exceptions

Image-based exceptions

This exception allows Pods in the ci namespace to use images that match the patterns listed under the images attribute, while the no-latest rule stays enforced for all other images. The matchConditions expression narrows the bypass further to Pods labeled team: platform, which keeps the exception scoped to specific namespaces and teams for auditability.

apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
 name: allow-ci-latest-images
 namespace: ci
spec:
 policyRefs:
   - name: restrict-image-tag
     kind: ValidatingPolicy
 images:
   - "ghcr.io/kyverno/*:latest"
 matchConditions:
   - expression: "has(object.metadata.labels.team) && object.metadata.labels.team == 'platform'"

The following ValidatingPolicy references exceptions.allowedImages, which skips the validation check for allowlisted images:

apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
  name: restrict-image-tag
spec:
  matchConstraints:
    resourceRules:
      # Pods belong to the core ("") API group, not apps.
      - apiGroups:   [""]
        apiVersions: [v1]
        operations:  [CREATE, UPDATE]
        resources:   [pods]
  validations:
    - message: "Containers must not allow privilege escalation unless they are in the allowed images list."
      # A container passes if its image is exempted or it disables privilege escalation.
      expression: >
        object.spec.containers.all(container,
          string(container.image) in exceptions.allowedImages ||
          (
            has(container.securityContext) &&
            has(container.securityContext.allowPrivilegeEscalation) &&
            container.securityContext.allowPrivilegeEscalation == false
          )
        )

Value-based exceptions

This exception allows a list of values via allowedValues used by a CEL validation for a constrained set of targets so teams can proceed without weakening the entire policy.

apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
 name: allow-debug-annotation
 namespace: dev
spec:
 policyRefs:
   - name: check-security-context
     kind: ValidatingPolicy
 allowedValues:
   - "debug-mode-temporary"
 matchConditions:
   - expression: "object.metadata.name.startsWith('experiments-')"

Here is the policy that uses these allowed values. It denies a resource when a container adds a capability that appears neither in exceptions.allowedValues nor in the policy's own allowed list.

apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
  name: check-security-context
spec:
  matchConstraints:
    resourceRules:
      - apiGroups:   [apps]
        apiVersions: [v1]
        operations:  [CREATE, UPDATE]
        resources:   [deployments]
  variables:
    # Capabilities any workload may add without an exception.
    - name: allowedCapabilities
      expression: "['AUDIT_WRITE','CHOWN','DAC_OVERRIDE','FOWNER','FSETID','KILL','MKNOD','NET_BIND_SERVICE','SETFCAP','SETGID','SETPCAP','SETUID','SYS_CHROOT']"
  validations:
    # Deployments keep their containers under spec.template.spec.
    - expression: >-
        object.spec.template.spec.containers.all(container,
          container.?securityContext.?capabilities.?add.orValue([]).all(capability,
            capability in exceptions.allowedValues ||
            capability in variables.allowedCapabilities))
      message: >-
        Any capabilities added beyond the allowed list (AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER,
        FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT)
        are disallowed.

Configurable reporting status 

This exception sets reportResult: pass, so when it matches, Policy Reports show “pass” rather than the default “skip”, improving dashboards and SLO signals during planned waivers.

apiVersion: policies.kyverno.io/v1beta1
kind: PolicyException
metadata:
 name: exclude-skipped-deployment-2
 labels:
   polex.kyverno.io/priority: "0.2"
spec:
 policyRefs:
 - name: "with-multiple-exceptions"
   kind: ValidatingPolicy
 matchConditions:
   - name: "check-name"
     expression: "object.metadata.name == 'skipped-deployment'"
 reportResult: pass

Kyverno Authz Server

Beyond enriching admission-time validation, Kyverno now extends policy decisions to your service edge. The Kyverno Authz Server applies Kyverno policies to authorize requests for Envoy (via the External Authorization filter) and for plain HTTP services as a standalone HTTP authorization server, returning allow/deny decisions based on the same policy engine you use in Kubernetes. This unifies policy enforcement across admission, gateways, and services, enabling consistent guardrails and faster adoption without duplicating logic. See the project page for details: kyverno/kyverno-authz.

Introducing the Kyverno SDK

Alongside embedding CEL policy evaluation in controllers, CLIs, and CI, there’s now a companion SDK for service-edge authorization. The SDK lets you load Kyverno policies, compile them, and evaluate incoming requests to produce allow/deny decisions with structured results—powering Envoy External Authorization and plain HTTP services without duplicating policy logic. It’s designed for gateways, sidecars, and app middleware with simple Go APIs, optional metrics/hooks, and a path to unify admission-time and runtime enforcement. Note that kyverno-authz is still in active development; start with non-critical paths and add strict timeouts as you evaluate. See the SDK package for details: kyverno SDK.

Other features and enhancements 

Label-based reporting configuration

Kyverno now supports label-based report suppression. Add the label reports.kyverno.io/disabled (any value, e.g., “true”) to any policy (ClusterPolicy, CEL policy types, ValidatingAdmissionPolicy, or MutatingAdmissionPolicy) to prevent all reporting (both ephemeral reports and PolicyReports) for that policy. This lets teams silence noisy or staging policies without changing enforcement; remove the label to resume reporting.
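A sketch of the label in place on a CEL policy follows. Only the label key comes from the release notes; the policy name, match rules, validation expression, and the validationActions setting are hypothetical placeholders.

```yaml
apiVersion: policies.kyverno.io/v1beta1
kind: ValidatingPolicy
metadata:
  name: staging-require-team-label      # hypothetical policy name
  labels:
    # Any value suppresses reporting for this policy; enforcement is unchanged.
    reports.kyverno.io/disabled: "true"
spec:
  validationActions: [Audit]            # assumed field, shown for completeness
  matchConstraints:
    resourceRules:
      - apiGroups:   [apps]
        apiVersions: [v1]
        operations:  [CREATE, UPDATE]
        resources:   [deployments]
  validations:
    - message: "Deployments must carry a 'team' label."
      expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
```

Removing the label from metadata.labels should resume reporting for the policy.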

Use Kyverno CEL libraries in policy matchConditions

Kyverno 1.16 enables Kyverno CEL libraries in policy matchConditions, not just in rule bodies, so you can target when rules run using richer, context-aware checks. These expressions are evaluated by Kyverno but are not used to build admission webhook matchConditions—webhook routing remains unchanged.

Getting started and backward compatibility

Upgrading to Kyverno 1.16

To upgrade to Kyverno 1.16, you can use Helm: 

helm repo update
helm upgrade --install kyverno kyverno/kyverno -n kyverno --version 3.6.0

Backward compatibility

Kyverno 1.16 remains fully backward compatible with existing ClusterPolicy resources. You can continue running current policies and adopt the new policy types incrementally; once CEL policy types reach GA, the legacy ClusterPolicy API will enter a formal deprecation process following our standard, non‑breaking schedule.

Roadmap

We’re building on 1.16 with a clear, low-friction path forward. In 1.17, CEL policy types will be available as v1, and migration tooling and docs will focus on making upgrades routine. We will continue to expand CEL libraries, samples, and performance optimizations, and to mature the SDK and kyverno-authz so that admission-time and runtime enforcement paths are unified. See the release board for the in-flight work and timelines: Release 1.17.0 Project Board

Conclusion

Kyverno 1.16 marks a pivotal step toward a unified, CEL‑powered policy platform: you can adopt the new policy types in beta today, move enforcement closer to teams with namespaced policies, and gain sharper visibility with native Events and detailed latency metrics. Fine‑grained exceptions make rollouts safer without weakening guardrails, while label‑based report suppression and CEL in matchConditions reduce noise and let you target policy execution precisely. 

Looking ahead, the path to v1 and GA is clear, and the ecosystem is expanding with the Kyverno Authz Server and SDK to bring the same policy engine to gateways and services. Upgrade when ready, start with audits where useful, and tell us what you build—your feedback will shape the final polish and the journey to GA.

Categories: CNCF Projects

Kubernetes v1.35 Sneak Peek

Kubernetes Blog - Tue, 11/25/2025 - 19:00

As the release of Kubernetes v1.35 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the project's overall health. This blog post outlines planned changes for the v1.35 release that the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes cluster(s), and to keep you up to date with the latest developments. The information below is based on the current status of the v1.35 release and is subject to change before the final release date.

Deprecations and removals for Kubernetes v1.35

cgroup v1 support

On Linux nodes, container runtimes typically rely on cgroups (short for "control groups"). Support for using cgroup v2 has been stable in Kubernetes since v1.25, providing an alternative to the original v1 cgroup support. While cgroup v1 provided the initial resource control mechanism, it suffered from well-known inconsistencies and limitations. Adding support for cgroup v2 allowed use of a unified control group hierarchy, improved resource isolation, and served as the foundation for modern features, making legacy cgroup v1 support ready for removal. The removal of cgroup v1 support will only impact cluster administrators running nodes on older Linux distributions that do not support cgroup v2; on those nodes, the kubelet will fail to start. Administrators must migrate their nodes to systems with cgroup v2 enabled. More details on compatibility requirements will be available in a blog post soon after the v1.35 release.

To learn more, read about cgroup v2; you can also track the switchover work via KEP-5573: Remove cgroup v1 support.

Deprecation of ipvs mode in kube-proxy

Many releases ago, the Kubernetes project implemented an ipvs mode in kube-proxy. It was adopted as a way to provide high-performance service load balancing, with better performance than the existing iptables mode. However, maintaining feature parity between ipvs and other kube-proxy modes became difficult, due to technical complexity and diverging requirements. This created significant technical debt and made the ipvs backend impractical to support alongside newer networking capabilities.

The Kubernetes project intends to deprecate kube-proxy ipvs mode in the v1.35 release, to streamline the kube-proxy codebase. For Linux nodes, the recommended kube-proxy mode is already nftables.

You can find more in KEP-5495: Deprecate ipvs mode in kube-proxy
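For clusters that manage kube-proxy through a configuration file, switching modes is a one-field change. The fragment below is a minimal sketch; how the file reaches kube-proxy (for example a ConfigMap managed by your installer) depends on your distribution and is not covered here.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Replaces "ipvs" (or "iptables") with the nftables backend on Linux nodes.
mode: nftables
```

It is worth rolling the change out to a small set of nodes first, since the nftables backend has its own kernel requirements.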

Kubernetes is deprecating containerd v1.y support

While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases of containerd, as a consequence of automated cgroup driver detection, the Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.X. Kubernetes v1.35 is the last release to offer this support (aligned with containerd 1.7 EOL).

This is a final warning: if you are using containerd 1.X, you must switch to 2.0 or later before upgrading Kubernetes to the next version. You can monitor the kubelet_cri_losing_support metric to determine whether any nodes in your cluster are using a containerd version that will soon be unsupported.

You can find more in the official blog post or in KEP-4033: Discover cgroup driver from CRI

The following enhancements are some of those likely to be included in the v1.35 release. This is not a commitment, and the release content is subject to change.

Node declared features

When scheduling Pods, Kubernetes uses node labels, taints, and tolerations to match workload requirements with node capabilities. However, managing feature compatibility becomes challenging during cluster upgrades due to version skew between the control plane and nodes. This can lead to Pods being scheduled on nodes that lack required features, resulting in runtime failures.

The node declared features framework will introduce a standard mechanism for nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it can support, publishing this information to the control plane through a new .status.declaredFeatures field. Then, the kube-scheduler, admission controllers and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints, ensuring that Pods run only on compatible nodes.

This approach reduces manual node labeling, improves scheduling accuracy, and prevents incompatible pod placements proactively. It also integrates with the Cluster Autoscaler for informed scale-up decisions. Feature declarations are temporary and tied to Kubernetes feature gates, enabling safe rollout and cleanup.

Targeting alpha in v1.35, node declared features aims to solve version skew scheduling issues by making node capabilities explicit, enhancing reliability and cluster stability in heterogeneous version environments.

To learn more about this before the official documentation is published, you can read KEP-5328.

In-place update of Pod resources

Kubernetes is graduating in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust cpu and memory resources without restarting Pods or Containers. Before the feature existed, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications. Recent Kubernetes releases already allowed you to change resource settings (requests and limits) for existing Pods while the feature was in alpha and beta. In-place updates allow for smoother vertical scaling, improve efficiency, and can also simplify solution development.

The Container Runtime Interface (CRI) has also been improved, extending the UpdateContainerResources API for Windows and future runtimes while allowing ContainerStatus to report real-time resource configurations. Together, these changes make scaling in Kubernetes faster, more flexible, and disruption-free. The feature was introduced as alpha in v1.27, graduated to beta in v1.33, and is targeting graduation to stable in v1.35.

You can find more in KEP-1287: In-place Update of Pod Resources
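To make the behavior concrete, here is a minimal Pod sketch that declares how each resource should react to a resize. The Pod name, image, and resource values are placeholders; the resizePolicy field and its restartPolicy values come from the Pod API.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo                         # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resizePolicy:
        # CPU changes are applied to the running container.
        - resourceName: cpu
          restartPolicy: NotRequired
        # Memory changes restart this container (not the whole Pod).
        - resourceName: memory
          restartPolicy: RestartContainer
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 512Mi
```

A resize is then requested by updating the container's resources through the Pod's resize subresource, for example with kubectl patch and its --subresource resize option.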

Pod certificates

When running microservices, Pods often require a strong cryptographic identity to authenticate with each other using mutual TLS (mTLS). While Kubernetes provides Service Account tokens, these are designed for authenticating to the API server, not for general-purpose workload identity.

Before this enhancement, operators had to rely on complex, external projects like SPIFFE/SPIRE or cert-manager to provision and rotate certificates for their workloads. But what if you could issue a unique, short-lived certificate to your Pods natively and automatically? KEP-4317 is designed to enable such native workload identity. It opens up various possibilities for securing pod-to-pod communication by allowing the kubelet to request and mount certificates for a Pod via a projected volume.

This provides a built-in mechanism for workload identity, complete with automated certificate rotation, significantly simplifying the setup of service meshes and other zero-trust network policies. This feature was introduced as alpha in v1.34 and is targeting beta in v1.35.

You can find more in KEP-4317: Pod Certificates
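The sketch below shows roughly how a Pod might request a certificate through a projected volume. The field names follow the alpha API as documented for v1.34 and may change as the feature moves to beta; the signer name, image, and mount path are hypothetical, and a signer implementation must exist in the cluster for the certificate to be issued.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-cert-demo                          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0      # placeholder image
      volumeMounts:
        - name: workload-cert
          mountPath: /var/run/workload-cert
          readOnly: true
  volumes:
    - name: workload-cert
      projected:
        sources:
          - podCertificate:
              signerName: example.com/workload-signer   # hypothetical signer
              keyType: ED25519
              # Private key and certificate chain are written to this one file.
              credentialBundlePath: credentialbundle.pem
```

The kubelet handles requesting and rotating the credential, so the application only needs to re-read the mounted file.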

Numeric values for taints

Kubernetes is enhancing taints and tolerations by adding numeric comparison operators, such as Gt (Greater Than) and Lt (Less Than).

Previously, tolerations supported only exact (Equal) or existence (Exists) matches, which were not suitable for numeric properties such as reliability SLAs.

With this change, a Pod can use a toleration to "opt-in" to nodes that meet a specific numeric threshold. For example, a Pod can require a Node with an SLA taint value greater than 950 (operator: Gt, value: "950").

This approach is more powerful than Node Affinity because it supports the NoExecute effect, allowing Pods to be automatically evicted if a node's numeric value drops below the tolerated threshold.

You can find more in KEP-5471: Enable SLA-based Scheduling
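Using the threshold from the example above, a toleration could look like the following sketch. The taint key is a hypothetical name, the Pod and image are placeholders, and the Gt operator is only accepted when the corresponding alpha feature is enabled in the cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sla-sensitive-app                    # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0    # placeholder image
  tolerations:
    # Tolerate the taint only on nodes whose SLA value is greater than 950.
    - key: node.example.com/sla              # hypothetical taint key
      operator: Gt
      value: "950"
      effect: NoExecute
```

A matching node would carry a taint such as node.example.com/sla=990:NoExecute; if the node's value later falls to 950 or below, the toleration no longer matches and the Pod is evicted.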

User namespaces

When running Pods, you can use securityContext to drop privileges, but containers inside the pod often still run as root (UID 0). This simplicity poses a significant challenge, as that container UID 0 maps directly to the host's root user.

Before this enhancement, a container breakout vulnerability could grant an attacker full root access to the node. But what if you could dynamically remap the container's root user to a safe, unprivileged user on the host? KEP-127 specifically allows such native support for Linux User Namespaces. It opens up various possibilities for pod security by isolating container and host user/group IDs. This allows a process to have root privileges (UID 0) within its namespace, while running as a non-privileged, high-numbered UID on the host.

Released as alpha in v1.25 and beta in v1.30, this feature continues to progress through beta maturity, paving the way for truly "rootless" containers that drastically reduce the attack surface for a whole class of security vulnerabilities.

You can find more in KEP-127: User Namespaces
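Opting a Pod into a user namespace is a single field in the Pod spec. The manifest below is a minimal sketch with a placeholder name and image; it assumes the node's kernel, filesystem, and container runtime meet the feature's requirements.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo                 # hypothetical name
spec:
  # false = run this Pod in its own user namespace, so UID 0 in the
  # container maps to an unprivileged UID on the host.
  hostUsers: false
  containers:
    - name: shell
      image: debian
      command: ["sleep", "infinity"]
```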

Support for mounting OCI images as volumes

When provisioning a Pod, you often need to bundle data, binaries, or configuration files for your containers. Before this enhancement, people often included that kind of data directly into the main container image, or required a custom init container to download and unpack files into an emptyDir. You can still take either of those approaches, of course.

But what if you could populate a volume directly from a data-only artifact in an OCI registry, just like pulling a container image? Kubernetes v1.31 added support for the image volume type, allowing Pods to pull and unpack OCI container image artifacts into a volume declaratively.

This allows for seamless distribution of data, binaries, or ML models using standard registry tooling, completely decoupling data from the container image and eliminating the need for complex init containers or startup scripts. This volume type has been in beta since v1.33 and will likely be enabled by default in v1.35.

You can try out the beta version of image volumes, or you can learn more about the plans from KEP-4639: OCI Volume Source.

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.35 as part of the CHANGELOG for that release.

The Kubernetes v1.35 release is planned for December 17, 2025. Stay tuned for updates!

You can also see the announcements of changes in the release notes for:

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Categories: CNCF Projects, Kubernetes

Kubernetes Configuration Good Practices

Kubernetes Blog - Mon, 11/24/2025 - 19:00

Configuration is one of those things in Kubernetes that seems small until it's not. Configuration is at the heart of every Kubernetes workload. A missing quote, a wrong API version or a misplaced YAML indent can ruin your entire deploy.

This blog brings together tried-and-tested configuration best practices: the small habits that make your Kubernetes setup clean, consistent and easier to manage. Whether you are just starting out or already deploying apps daily, these are the little things that keep your cluster stable and your future self sane.

This blog is inspired by the original Configuration Best Practices page, which has evolved through contributions from many members of the Kubernetes community.

General configuration practices

Use the latest stable API version

Kubernetes evolves fast. Older APIs eventually get deprecated and stop working. So, whenever you are defining resources, make sure you are using the latest stable API version. You can always check with

kubectl api-resources

This simple step saves you from future compatibility issues.

Store configuration in version control

Never apply manifest files directly from your desktop. Always keep them in a version control system like Git; it's your safety net. If something breaks, you can instantly roll back to a previous commit, compare changes or recreate your cluster setup without panic.

Write configs in YAML not JSON

Write your configuration files using YAML rather than JSON. Both work technically, but YAML is just easier for humans. It's cleaner to read, less noisy, and widely used in the community.

YAML has some sneaky gotchas with boolean values: Use only true or false. Don't write yes, no, on or off. They might work in one version of YAML but break in another. To be safe, quote anything that looks like a Boolean (for example "yes").
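
If you want to catch these values before they bite, a small check like the one below can help. It is only a sketch: the word list follows the boolean spellings from the YAML 1.1 type description, and the function name is made up for this example.

```python
# Words that a YAML 1.1 parser may read as booleans when left unquoted.
_BOOL_WORDS = ("y", "n", "yes", "no", "on", "off", "true", "false")

# YAML 1.1 accepts lowercase, Capitalized and UPPERCASE forms only.
AMBIGUOUS_SCALARS = {
    form
    for word in _BOOL_WORDS
    for form in (word, word.capitalize(), word.upper())
}


def should_quote(value: str) -> bool:
    """Return True if a string value would look like a boolean when unquoted."""
    return value in AMBIGUOUS_SCALARS


for candidate in ("yes", "Off", "nginx"):
    print(candidate, should_quote(candidate))
```

Running a check like this over the string values in a manifest points at the places where quoting is the safe choice.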

Keep configuration simple and minimal

Avoid setting default values that are already handled by Kubernetes. Minimal manifests are easier to debug, cleaner to review and less likely to break things later.

If your Deployment, Service and ConfigMap all belong to one app, put them in a single manifest file. It's easier to track changes and apply them as a unit. See the Guestbook all-in-one.yaml file for an example of this syntax.

You can even apply entire directories with:

kubectl apply -f configs/

One command and, boom, everything in that folder gets deployed.

Add helpful annotations

Manifest files are not just for machines; they are for humans too. Use annotations to describe why something exists or what it does. A quick one-liner can save hours when debugging later and also allows better collaboration.

The most helpful annotation to set is kubernetes.io/description. It's like using a comment, except that it gets copied into the API so that everyone else can see it even after you deploy.

Managing Workloads: Pods, Deployments, and Jobs

A common early mistake in Kubernetes is creating Pods directly. Pods work, but they don't reschedule themselves if something goes wrong.

Naked Pods (Pods not managed by a controller, such as a Deployment or a StatefulSet) are fine for testing, but in real setups, they are risky.

Why? Because if the node hosting that Pod dies, the Pod dies with it and Kubernetes won't bring it back automatically.

Use Deployments for apps that should always be running

A Deployment, which both creates a ReplicaSet to ensure that the desired number of Pods is always available, and specifies a strategy to replace Pods (such as RollingUpdate), is almost always preferable to creating Pods directly. You can roll out a new version, and if something breaks, roll back instantly.

Use Jobs for tasks that should finish

A Job is perfect when you need something to run once and then stop, like a database migration or a batch processing task. It will retry if the Pod fails and report success when it's done.

Service Configuration and Networking

Services are how your workloads talk to each other inside (and sometimes outside) your cluster. Without them, your pods exist but can't reach anyone. Let's make sure that doesn't happen.

Create Services before workloads that use them

When Kubernetes starts a Pod, it automatically injects environment variables for existing Services. So, if a Pod depends on a Service, create a Service before its corresponding backend workloads (Deployments or StatefulSets), and before any workloads that need to access it.

For example, if a Service named foo exists, all containers will get the following variables in their initial environment:

FOO_SERVICE_HOST=<the host the Service runs on>
FOO_SERVICE_PORT=<the port the Service runs on>

DNS-based discovery doesn't have this problem, but it's a good habit to follow anyway.
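
To see where names like FOO_SERVICE_HOST come from, here is a minimal sketch of the naming rule: the Service name is upper-cased and dashes become underscores. The function name and the IP address are invented for the example.

```python
def service_env_vars(service_name: str, host: str, port: int) -> dict:
    """Build the host/port variables a Pod would see for an existing Service."""
    prefix = service_name.upper().replace("-", "_")
    return {
        f"{prefix}_SERVICE_HOST": host,
        f"{prefix}_SERVICE_PORT": str(port),
    }


# Hypothetical cluster IP and port for a Service named "foo".
print(service_env_vars("foo", "10.0.0.11", 6379))
```

Because these variables are only set when a container starts, a Service created later never shows up in the environment of Pods that are already running.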

Use DNS for Service discovery

If your cluster has the DNS add-on (most do), every Service automatically gets a DNS entry. That means you can access it by name instead of IP:

curl http://my-service.default.svc.cluster.local

It's one of those features that makes Kubernetes networking feel magical.
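
The name in that curl command follows a simple pattern, sketched below. The helper is hypothetical, and cluster.local is assumed as the cluster domain because it is the usual default; your cluster may be configured differently.

```python
def service_dns_name(service: str, namespace: str = "default",
                     cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


print(service_dns_name("my-service"))  # my-service.default.svc.cluster.local
```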

Avoid hostPort and hostNetwork unless absolutely necessary

You'll sometimes see these options in manifests:

hostPort: 8080
hostNetwork: true

But here's the thing: they tie your Pods to specific nodes, making them harder to schedule and scale, because each <hostIP, hostPort, protocol> combination must be unique. If you don't specify the hostIP and protocol explicitly, Kubernetes will use 0.0.0.0 as the default hostIP and TCP as the default protocol. Unless you're debugging or building something like a network plugin, avoid them.
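
The uniqueness rule is easier to picture with a small conflict check. This is a sketch, not the scheduler's code: the type and function names are invented, and treating 0.0.0.0 as overlapping with every other address is an assumption made for the example.

```python
from typing import NamedTuple


class HostPort(NamedTuple):
    port: int
    host_ip: str = "0.0.0.0"  # default when hostIP is not set
    protocol: str = "TCP"     # default when protocol is not set


def conflicts(a: HostPort, b: HostPort) -> bool:
    """Return True if two hostPort requests cannot share one node."""
    if a.port != b.port or a.protocol != b.protocol:
        return False
    wildcard = "0.0.0.0"
    # Assumption: the wildcard address overlaps with any specific address.
    return a.host_ip == b.host_ip or wildcard in (a.host_ip, b.host_ip)


print(conflicts(HostPort(8080), HostPort(8080)))                  # True
print(conflicts(HostPort(8080), HostPort(8080, protocol="UDP")))  # False
```

Two Pods that both ask for hostPort 8080 with the defaults can never land on the same node, which is exactly what makes them harder to schedule.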

If you just need local access for testing, try kubectl port-forward:

kubectl port-forward deployment/web 8080:80

See Use Port Forwarding to access applications in a cluster to learn more. Or if you really need external access, use a type: NodePort Service. That's the safer, Kubernetes-native way.

Use headless Services for internal discovery

Sometimes, you don't want Kubernetes to load balance traffic. You want to talk directly to each Pod. That's where headless Services come in.

You create one by setting clusterIP: None. Instead of a single IP, DNS gives you a list of all Pod IPs, perfect for apps that manage connections themselves.
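
From the client side, that list of Pod IPs is just what a normal DNS lookup returns. The sketch below collects the distinct addresses for a name; the function name is invented, and the resolver is passed in as a parameter so the logic can be tried outside a cluster.

```python
import socket


def pod_ips(hostname: str, port: int = 80, resolve=socket.getaddrinfo) -> list:
    """Return the distinct IP addresses that a DNS name resolves to."""
    results = resolve(hostname, port, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({sockaddr[0] for *_, sockaddr in results})


# Inside a cluster, with a hypothetical headless Service named "db":
#   pod_ips("db.default.svc.cluster.local", 5432)
```

An application that manages its own connections could call something like this periodically and open one connection per address.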

Working with labels effectively

Labels are key/value pairs that are attached to objects such as Pods. Labels help you organize, query and group your resources. They don't do anything by themselves, but they make everything else from Services to Deployments work together smoothly.

Use semantic labels

Good labels help you understand what's what, even months later. Define and use labels that identify semantic attributes of your application or Deployment. For example:

labels:
  app.kubernetes.io/name: myapp
  app.kubernetes.io/component: web
  tier: frontend
  phase: test

  • app.kubernetes.io/name : what the app is
  • tier : which layer it belongs to (frontend/backend)
  • phase : which stage it's in (test/prod)

You can then use these labels to make powerful selectors. For example:

kubectl get pods -l tier=frontend

This will list all frontend Pods across your cluster, no matter which Deployment they came from. Basically you are not manually listing Pod names; you are just describing what you want. See the guestbook app for examples of this approach.
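
Under the hood, an equality-based selector is just a subset check on key/value pairs. Here is a minimal sketch of that idea; the Pod names and the helper functions are made up for the example.

```python
def matches(labels: dict, selector: dict) -> bool:
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(pods: dict, selector: dict) -> list:
    """Return the names of all Pods whose labels match the selector."""
    return sorted(name for name, labels in pods.items() if matches(labels, selector))


# Hypothetical Pods from two different Deployments.
pods = {
    "shop-web-1": {"app.kubernetes.io/name": "shop", "tier": "frontend", "phase": "test"},
    "blog-web-1": {"app.kubernetes.io/name": "blog", "tier": "frontend", "phase": "prod"},
    "shop-db-1": {"app.kubernetes.io/name": "shop", "tier": "backend", "phase": "test"},
}

print(select(pods, {"tier": "frontend"}))  # ['blog-web-1', 'shop-web-1']
```

Adding more pairs to the selector narrows the result, just like chaining labels with commas after -l.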

Use common Kubernetes labels

Kubernetes actually recommends a set of common labels. It's a standardized way to name things across your different workloads or projects. Following this convention makes your manifests cleaner, and it means that tools such as Headlamp, dashboard, or third-party monitoring systems can all automatically understand what's running.

Manipulate labels for debugging

Since controllers (like ReplicaSets or Deployments) use labels to manage Pods, you can remove a label to “detach” a Pod temporarily.

Example:

kubectl label pod mypod app-

The app- part removes the label key app. Once that happens, the controller won’t manage that Pod anymore. It’s like isolating it for inspection, a “quarantine mode” for debugging. To interactively remove or add labels, use kubectl label.

You can then check logs, exec into it and once done, delete it manually. That’s a super underrated trick every Kubernetes engineer should know.
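
Here is a toy model of why this works. A controller counts the Pods that match its selector and makes up any shortfall; the function below is a simplified sketch of that bookkeeping, not the real controller logic, and all names are hypothetical.

```python
def replacements_needed(pods: dict, selector: dict, desired_replicas: int) -> int:
    """Count how many new Pods a controller would create to reach the target."""
    owned = [
        name for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]
    return max(0, desired_replicas - len(owned))


pods = {"web-1": {"app": "web"}, "web-2": {"app": "web"}, "web-3": {"app": "web"}}
selector = {"app": "web"}

before = replacements_needed(pods, selector, 3)
del pods["web-2"]["app"]  # same effect as: kubectl label pod web-2 app-
after = replacements_needed(pods, selector, 3)

print(before, after)  # 0 1
```

The quarantined Pod keeps running, while the controller brings up a replacement to get back to the desired count.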

Handy kubectl tips

These small tips make life much easier when you are working with multiple manifest files or clusters.

Apply entire directories

Instead of applying one file at a time, apply the whole folder:

# Using server-side apply is also a good practice
kubectl apply -f configs/ --server-side

This command looks for .yaml, .yml and .json files in that folder and applies them all together. It's faster, cleaner and helps keep things grouped by app.
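
If you are curious which files in a folder would be picked up, the extension filter can be sketched like this. The helper name and file names are invented for the example.

```python
from pathlib import PurePath

MANIFEST_SUFFIXES = {".yaml", ".yml", ".json"}


def manifest_files(filenames: list) -> list:
    """Keep only the file names with a manifest extension."""
    return [name for name in filenames if PurePath(name).suffix in MANIFEST_SUFFIXES]


print(manifest_files(["deploy.yaml", "svc.yml", "cm.json", "README.md"]))
```

Anything else in the folder, such as a README, is simply ignored.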

Use label selectors to get or delete resources

You don't always need to type out resource names one by one. Instead, use selectors to act on entire groups at once:

kubectl get pods -l app=myapp
kubectl delete pod -l phase=test

It's especially useful in CI/CD pipelines, where you want to clean up test resources dynamically.

Quickly create Deployments and Services

For quick experiments, you don't always need to write a manifest. You can spin up a Deployment right from the CLI:

kubectl create deployment webapp --image=nginx

Then expose it as a Service:

kubectl expose deployment webapp --port=80

This is great when you just want to test something before writing full manifests. Also, see Use a Service to Access an Application in a cluster for an example.

Conclusion

Cleaner configuration leads to calmer cluster administrators. If you stick to a few simple habits: keep configuration simple and minimal, version-control everything, use consistent labels, and avoid relying on naked Pods, you'll save yourself hours of debugging down the road.

The best part? Clean configurations stay readable. Even after months, you or anyone on your team can glance at them and know exactly what’s happening.

Categories: CNCF Projects, Kubernetes

Helm 4 Released

Helm Blog - Sun, 11/16/2025 - 19:00

On Wednesday November 12th, during the Helm 4 presentation at KubeCon + CloudNativeCon, Helm v4.0.0 was released. This is the first new major version of Helm in 6 years.

What's New

Helm v3 has served the Kubernetes community well for many years. During that time we saw new ways to use Helm, new applications installed via charts, the rise of Artifact Hub, and numerous tools that build on top of Helm. We also saw where we wanted to add features but the internal architecture of Helm didn't provide a path forward without breaking public APIs in the SDK. Helm 4 makes those changes to enable new features now and into the future.

Some of the new features include:

  • Redesigned plugin system that supports WebAssembly based plugins
  • Post-renderers are now plugins
  • Server side apply is now supported
  • Improved resource watching, to support waiting, based on kstatus
  • Local Content-based caching (e.g. for charts)
  • Logging via slog enabling SDK logging to integrate with modern loggers
  • Reproducible/Idempotent builds of chart archives
  • Updated SDK API including support for multiple chart API versions (new experimental v3 chart API version coming soon)

You can learn about more of the changes in the Helm 4 Overview.

Helm v3 Support

When a major version of software comes out, it takes a while to make the transition. Helm v3 will continue to be supported to enable a clean transition period. The dates of continued support are:

  • Bug fixes until July 8th 2026.
  • Security fixes until November 11th 2026.

Helm releases updates on Wednesdays (typically the 2nd Wednesday in a month) and these dates correspond with release schedule dates. During this time there will be NO features backported other than updates to the Kubernetes client libraries that enable support of new Kubernetes versions.
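
As a quick sanity check on how these dates line up with the schedule, the sketch below computes the second Wednesday of a month with Python's standard library. The function name is ours; it is not part of Helm.

```python
import datetime


def second_wednesday(year: int, month: int) -> datetime.date:
    """Return the date of the second Wednesday of the given month."""
    first = datetime.date(year, month, 1)
    # weekday(): Monday is 0, so Wednesday is 2.
    days_to_first_wednesday = (2 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_to_first_wednesday + 7)


print(second_wednesday(2025, 11))  # 2025-11-12, the Helm v4.0.0 release date
print(second_wednesday(2026, 7))   # 2026-07-08, end of v3 bug fixes
print(second_wednesday(2026, 11))  # 2026-11-11, end of v3 security fixes
```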

Learn More

You can learn about the Helm changes in the overview or find all the changes in the full changelog. The documentation shares many more details, covering both the ways Helm has stayed the same and the new features you can take advantage of.

Categories: CNCF Projects, Kubernetes

Ingress NGINX Retirement: What You Need to Know

Kubernetes Blog - Tue, 11/11/2025 - 13:30

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee are announcing the upcoming retirement of Ingress NGINX. Best-effort maintenance will continue until March 2026. Afterward, there will be no further releases, no bugfixes, and no updates to resolve any security vulnerabilities that may be discovered. Existing deployments of Ingress NGINX will continue to function and installation artifacts will remain available.

We recommend migrating to one of the many alternatives. Consider migrating to Gateway API, the modern replacement for Ingress. If you must continue using Ingress, many alternative Ingress controllers are listed in the Kubernetes documentation. Continue reading for further information about the history and current state of Ingress NGINX, as well as next steps.

About Ingress NGINX

Ingress is the original user-friendly way to direct network traffic to workloads running on Kubernetes. (Gateway API is a newer way to achieve many of the same goals.) In order for an Ingress to work in your cluster, there must be an Ingress controller running. There are many Ingress controller choices available, which serve the needs of different users and use cases. Some are cloud-provider specific, while others have more general applicability.

Ingress NGINX was an Ingress controller, developed early in the history of the Kubernetes project as an example implementation of the API. It became very popular due to its tremendous flexibility, breadth of features, and independence from any particular cloud or infrastructure provider. Since those days, many other Ingress controllers have been created within the Kubernetes project by community groups, and by cloud native vendors. Ingress NGINX has continued to be one of the most popular, deployed as part of many hosted Kubernetes platforms and within innumerable independent users’ clusters.

History and Challenges

The breadth and flexibility of Ingress NGINX has caused maintenance challenges. Changing expectations about cloud native software have also added complications. What were once considered helpful options have sometimes come to be considered serious security flaws, such as the ability to add arbitrary NGINX configuration directives via the "snippets" annotations. Yesterday’s flexibility has become today’s insurmountable technical debt.

Despite the project’s popularity among users, Ingress NGINX has always struggled with insufficient or barely-sufficient maintainership. For years, the project has had only one or two people doing development work, on their own time, after work hours and on weekends. Last year, the Ingress NGINX maintainers announced their plans to wind down Ingress NGINX and develop a replacement controller together with the Gateway API community. Unfortunately, even that announcement failed to generate additional interest in helping maintain Ingress NGINX or develop InGate to replace it. (InGate development never progressed far enough to create a mature replacement; it will also be retired.)

Current State and Next Steps

Currently, Ingress NGINX is receiving best-effort maintenance. SIG Network and the Security Response Committee have exhausted our efforts to find additional support to make Ingress NGINX sustainable. To prioritize user safety, we must retire the project.

In March 2026, Ingress NGINX maintenance will be halted, and the project will be retired. After that time, there will be no further releases, no bugfixes, and no updates to resolve any security vulnerabilities that may be discovered. The GitHub repositories will be made read-only and left available for reference.

Existing deployments of Ingress NGINX will not be broken. Existing project artifacts such as Helm charts and container images will remain available.

In most cases, you can check whether you use Ingress NGINX by running kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions.

We would like to thank the Ingress NGINX maintainers for their work in creating and maintaining this project–their dedication remains impressive. This Ingress controller has powered billions of requests in datacenters and homelabs all around the world. In a lot of ways, Kubernetes wouldn’t be where it is without Ingress NGINX, and we are grateful for so many years of incredible effort.

SIG Network and the Security Response Committee recommend that all Ingress NGINX users begin migration to Gateway API or another Ingress controller immediately. Many options are listed in the Kubernetes documentation: Gateway API, Ingress. Additional options may be available from vendors you work with.

Categories: CNCF Projects, Kubernetes

OpenFGA Becomes a CNCF Incubating Project

CNCF Blog Projects Category - Tue, 11/11/2025 - 09:00

The CNCF Technical Oversight Committee (TOC) has voted to accept OpenFGA as a CNCF incubating project. 

What is OpenFGA?

OpenFGA is an authorization engine that addresses the challenge of implementing complex access control at scale in modern software applications. Inspired by Google’s global access control system, Zanzibar, OpenFGA leverages Relationship-Based Access Control (ReBAC). This allows developers to define permissions based on relationships between users and objects (e.g., who can view which document). By serving as an external service with an API and multiple SDKs, it centralizes and abstracts the authorization logic out of the application code. This separation of concerns significantly improves developer velocity by simplifying security implementation and ensures that access rules are consistent, scalable, and easy to audit across all services, solving a critical complexity problem for developers building distributed systems.
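
To make the relationship-based idea concrete, here is a toy in-memory sketch. It is not the OpenFGA API or its modeling language: the tuple layout, the IMPLIED_BY table, and the check function are all assumptions made for illustration.

```python
# Relationship tuples: (user, relation, object).
TUPLES = {
    ("user:anne", "owner", "document:roadmap"),
    ("user:bob", "viewer", "document:roadmap"),
}

# Hypothetical model: an owner can also view; nothing implies ownership.
IMPLIED_BY = {
    "viewer": {"viewer", "owner"},
    "owner": {"owner"},
}


def check(user: str, relation: str, obj: str, tuples=TUPLES) -> bool:
    """Answer 'does this user have this relation with this object?'"""
    granting_relations = IMPLIED_BY.get(relation, {relation})
    return any((user, granted, obj) in tuples for granted in granting_relations)


print(check("user:anne", "viewer", "document:roadmap"))  # True, via owner
print(check("user:bob", "owner", "document:roadmap"))    # False
```

The application only asks the question; the rules that answer it live outside the application code, which is the separation the paragraph above describes.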

OpenFGA’s History

OpenFGA was developed by a group of Okta employees, and is the foundation for the Auth0 FGA commercial offering. 

The project was accepted as a CNCF Sandbox project in September 2022. Since then, it has been deployed by hundreds of companies and received multiple contributions. Some major moments and updates include:

  • 37 companies publicly acknowledge using it in production.
  • Engineers from Grafana Labs and GitPod have become official maintainers.
  • OpenFGA was invited to present on the Maintainer’s track at KubeCon + CloudNativeCon Europe 2025.
  • A MySQL storage adapter was contributed by TwinTag and a SQLite storage adapter was contributed by Grafana Labs.
  • OpenFGA started hosting a monthly OpenFGA community meeting in April 2023.
  • Several developer experience improvements, like:
    • New SDKs for Python and Java
    • IDE integrations with VS Code and IntelliJ
    • A CLI with support for model testing
    • A Terraform Provider was donated to the project by Maurice Ackel
  • A new caching implementation and multiple performance improvements shipped over the last year.
  • OpenFGA also added the ListObjects endpoint to retrieve all resources that a user has a specific relation with. Additionally, OpenFGA added the ListUsers endpoint to retrieve all users that have a specific relation with a resource.
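
The two list endpoints mentioned in the last item answer opposite questions over the same relationship data. The sketch below shows that symmetry with a toy in-memory list; the function names mirror the endpoint names for readability only, and the signatures are simplified assumptions rather than the real API.

```python
# Relationship tuples: (user, relation, object). All values are hypothetical.
TUPLES = [
    ("user:anne", "viewer", "document:roadmap"),
    ("user:anne", "viewer", "document:budget"),
    ("user:bob", "viewer", "document:roadmap"),
    ("user:bob", "editor", "document:budget"),
]


def list_objects(user: str, relation: str, tuples=TUPLES) -> list:
    """All objects the user has the given relation with."""
    return sorted(o for u, r, o in tuples if u == user and r == relation)


def list_users(obj: str, relation: str, tuples=TUPLES) -> list:
    """All users that have the given relation with the object."""
    return sorted(u for u, r, o in tuples if o == obj and r == relation)


print(list_objects("user:anne", "viewer"))
print(list_users("document:roadmap", "viewer"))
```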

Further, OpenFGA integrates with multiple CNCF projects:

Maintainer Perspective

“Seeing companies successfully deploy OpenFGA in production demonstrates its viability as an authorization solution. Our focus now is on growth. CNCF Incubation provides increased credibility and visibility – attracting a broader set of contributors and helping secure long-term sustainability. We anticipate this phase supporting us collectively build the definitive and centralized service for fine-grained authorization that the cloud native ecosystem can continue to trust.”

Andres Aguiar, OpenFGA Maintainer and Director of Product at Okta

“When Grafana adopted OpenFGA the community was incredibly welcoming, and we’ve been fortunate to collaborate on enhancements like SQLite support. We are excited to work with CNCF to continue the evolution of the OpenFGA platform.”

Dan Cech, Distinguished Engineer, Grafana Labs

From the TOC

“Authorization is one of the most complex and critical problems in distributed systems, and OpenFGA provides a clean, scalable solution that developers can actually adopt. Its ReBAC model and API-first approach simplify how teams think about access control, removing layers of custom logic from applications. What impressed me most during the due diligence process was the project’s momentum—strong community growth, diverse maintainers, and real-world production deployments. OpenFGA is quickly becoming a foundational building block for secure, cloud native applications.”

Ricardo Aravena, CNCF TOC Sponsor

“As the TOC Sponsor for OpenFGA’s incubation, I’ve had the opportunity to work closely with the maintainers and see their deep technical rigor and commitment to excellence firsthand. OpenFGA reflects the kind of thoughtful engineering and collaboration that drives the CNCF ecosystem forward. By externalizing authorization through a developer-friendly API, OpenFGA empowers teams to scale security with the same agility as their infrastructure. Throughout the incubation process, the maintainers have been exceptionally responsive and precise in addressing feedback, demonstrating the project’s maturity and readiness for broader adoption. With growing adoption and strong technical foundations, I’m excited to see how the OpenFGA community continues to expand its capabilities and help organizations strengthen access control across cloud native environments.”

Faseela Kundattil, CNCF TOC Sponsor

Main Components

Some main components of the project include:

  • The OpenFGA server designed to answer authorization requests fast and at scale
  • SDKs for Go, .NET, JS, Java, Python
  • A CLI to interact with the OpenFGA server and test authorization models
  • Helm Charts to deploy to Kubernetes
  • Integrations with VS Code and JetBrains

Notable Milestones

Looking Ahead

OpenFGA is a database, and as with any database, there will always be work to improve performance for every type of query. Future goals of the roadmap are to make it simpler for maintainers to contribute to SDKs; launch new SDKs for Ruby, Rust, and PHP; add support for the AuthZen standard; add new visualization options and open source the OpenFGA playground tool; improve observability; add streaming API endpoints for better performance; and include more robust error handling with new write-conflict options.

You can learn more about OpenFGA here.

As a CNCF-hosted project, OpenFGA is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. OpenFGA joins incubating technologies Backstage, Buildpacks, cert-manager, Chaos Mesh, CloudEvents, Container Network Interface (CNI), Contour, Cortex, CubeFS, Dapr, Dragonfly, Emissary-Ingress, Falco, gRPC, in-toto, Keptn, Keycloak, Knative, KubeEdge, Kubeflow, KubeVela, KubeVirt, Kyverno, Litmus, Longhorn, NATS, Notary, OpenFeature, OpenKruise, OpenMetrics, OpenTelemetry, Operator Framework, Thanos, and Volcano. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.

Categories: CNCF Projects

Announcing the 2025 Steering Committee Election Results

Kubernetes Blog - Sun, 11/09/2025 - 15:10

The 2025 Steering Committee Election is now complete. The Kubernetes Steering Committee consists of 7 seats, 4 of which were up for election in 2025. Incoming committee members serve a term of 2 years, and all members are elected by the Kubernetes Community.

The Steering Committee oversees the governance of the entire Kubernetes project. With that great power comes great responsibility. You can learn more about the steering committee’s role in their charter.

Thank you to everyone who voted in the election; your participation helps support the community’s continued health and success.

Results

Congratulations to the elected committee members whose two year terms begin immediately (listed in alphabetical order by GitHub handle):

They join continuing members:

Maciej Szulik and Paco Xu are returning Steering Committee Members.

Big thanks!

Thank you and congratulations on a successful election to this round’s election officers:

Thanks to the Emeritus Steering Committee Members. Your service is appreciated by the community:

And thank you to all the candidates who came forward to run for election.

Get involved with the Steering Committee

This governing body, like all of Kubernetes, is open to all. You can follow along with Steering Committee meeting notes and weigh in by filing an issue or creating a PR against their repo. They have an open meeting on the first Monday at 8am PT of every month. They can also be contacted at their public mailing list [email protected].

You can see what the Steering Committee meetings are all about by watching past meetings on the YouTube Playlist.

This post was adapted from one written by the Contributor Comms Subproject. If you want to write stories about the Kubernetes community, learn more about us.

Categories: CNCF Projects, Kubernetes

Self-hosted human and machine identities in Keycloak 26.4

CNCF Blog Projects Category - Fri, 11/07/2025 - 10:00

Keycloak is a leading open source solution in the cloud-native ecosystem for Identity and Access Management, a key component of accessing applications and their data.

With the release of Keycloak 26.4, we’ve added features for both machine and human identities. New features focus on security enhancement, deeper integration, and improved server administration. See below for the release highlights, or dive deeper in our Keycloak 26.4 release announcement.

Keycloak recently surpassed 30k GitHub stars and 1,350 contributors. If you’re attending KubeCon + CloudNativeCon North America in Atlanta, stop by and say hi—we’d love to hear how you’re using Keycloak!

What’s New in 26.4

Passwordless user authentication with Passkeys

Keycloak now offers full support for Passkeys. As secure, passwordless authentication becomes the new standard, we’ve made passkeys simple to configure. For environments that are unable to adopt passkeys, Keycloak continues to support OTP and recovery codes. You can find a passkey walkthrough on the Keycloak blog.

Tightened OpenID Connect security with FAPI 2 and DPoP

Keycloak 26.4 implements the Financial-grade API (FAPI) 2.0 standard, ensuring strong security best practices. This includes support for Demonstrating Proof-of-Possession (DPoP), which is a safer way to handle tokens in public OpenID Connect clients.

Simplified deployments across multiple availability zones

Deployment across multiple availability zones or data centers is simplified in 26.4:

  • Split-brain detection
  • Full support in the Keycloak Operator 
  • Latency optimizations when Keycloak nodes run in different data centers

Keycloak docs contain a full step-by-step guide, and we published a blog post on how to scale to 2,000 logins/sec and 10,000 token refreshes/sec. 

Authenticating applications with Kubernetes service account tokens or SPIFFE

When applications interact with Keycloak around OpenID Connect, each confidential server-side application needs credentials. This usually comes with the churn to distribute and rotate them regularly.

With 26.4, you can use Kubernetes service account tokens, which are automatically distributed to each Pod when running on Kubernetes. This removes the need to distribute and rotate an extra pair of credentials. For use cases inside and outside Kubernetes, you can also use SPIFFE.

To test this preview feature:

  1. Enable the features client-auth-federated:v1,spiffe:v1, and kubernetes-service-accounts:v1.
  2. Register a Kubernetes or SPIFFE identity provider in Keycloak.
  3. For a client registered in Keycloak, configure the Client Authenticator in the Credentials tab as Signed JWT – Federated, referencing the identity provider created in the previous step and the expected subject in the JWT.  
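
When filling in the expected subject in step 3, it can help to look at the sub claim of an actual token. The sketch below decodes a JWT payload for inspection only, without verifying the signature, and assumes the subject follows the Kubernetes system:serviceaccount:<namespace>:<name> form. Both function names are made up for the example.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying it (inspection only)."""
    payload_b64 = token.split(".")[1]
    # base64url payloads omit padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def service_account_subject(namespace: str, name: str) -> str:
    """Subject string for a Kubernetes service account."""
    return f"system:serviceaccount:{namespace}:{name}"


# Usage with a token read from the Pod's projected volume (path not shown):
#   jwt_claims(token)["sub"] == service_account_subject("default", "my-app")
```

Never use an unverified decode like this for an authorization decision; it is only a way to see which subject a token carries.
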

Looking ahead

Keycloak’s roadmap includes:

You can follow our journey at keycloak.org and get involved. Our nightly builds give you early access to Keycloak’s latest features.

Categories: CNCF Projects

Gateway API 1.4: New Features

Kubernetes Blog - Thu, 11/06/2025 - 12:00

Ready to rock your Kubernetes networking? The Kubernetes SIG Network community presented the General Availability (GA) release of Gateway API (v1.4.0)! Released on October 6, 2025, version 1.4.0 reinforces the path for modern, expressive, and extensible service networking in Kubernetes.

Gateway API v1.4.0 brings three new features to the Standard channel (Gateway API's GA release channel):

  • BackendTLSPolicy for TLS between gateways and backends
  • supportedFeatures in GatewayClass status
  • Named rules for Routes

and introduces three new experimental features:

  • Mesh resource for service mesh configuration
  • Default gateways to ease configuration burden
  • externalAuth filter for HTTPRoute

Graduations to Standard Channel

Backend TLS policy

Leads: Candace Holman, Norwin Schnyder, Katarzyna Łach

GEP-1897: BackendTLSPolicy

BackendTLSPolicy is a new Gateway API type for specifying the TLS configuration of the connection from the Gateway to backend pod(s). Prior to the introduction of BackendTLSPolicy, there was no API specification that allowed encrypted traffic on the hop from Gateway to backend.

The BackendTLSPolicy validation configuration requires a hostname. This hostname serves two purposes: it is used as the SNI header when connecting to the backend, and it is used for authentication, because the certificate presented by the backend must match this hostname unless subjectAltNames is explicitly specified.

If subjectAltNames (SANs) are specified, the hostname is only used for SNI, and authentication is performed against the SANs instead. If you still need to authenticate against the hostname value in this case, you MUST add it to the subjectAltNames list.
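
The hostname and SAN rules above can be summarized in a few lines. This is a sketch of the decision only, not code from any Gateway API implementation; the function and key names are invented for the example.

```python
def tls_verification_plan(hostname: str, subject_alt_names=None) -> dict:
    """Decide what is sent as SNI and which names the backend cert must match."""
    sans = list(subject_alt_names or [])
    return {
        # The hostname is always used for SNI.
        "sni": hostname,
        # With SANs present, the hostname is no longer checked unless listed.
        "authenticate_against": sans if sans else [hostname],
    }


print(tls_verification_plan("auth.example.com"))
print(tls_verification_plan("auth.example.com", ["internal.example.com"]))
```

In the second call the certificate is checked against internal.example.com only, which is why the hostname has to be repeated in subjectAltNames if it should still count.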

BackendTLSPolicy validation configuration also requires either caCertificateRefs or wellKnownCACertificates. caCertificateRefs refer to one or more (up to 8) PEM-encoded TLS certificate bundles. If there are no specific certificates to use, then depending on your implementation, you may use wellKnownCACertificates, set to "System" to tell the Gateway to use an implementation-specific set of trusted CA Certificates.

In this example, the BackendTLSPolicy is configured to use certificates defined in the auth-cert ConfigMap to connect with a TLS-encrypted upstream connection where pods backing the auth service are expected to serve a valid certificate for auth.example.com. It uses subjectAltNames with a Hostname type, but you may also use a URI type.

apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-auth
spec:
  targetRefs:
  - kind: Service
    name: auth
    group: ""
    sectionName: "https"
  validation:
    caCertificateRefs:
    - group: "" # core API group
      kind: ConfigMap
      name: auth-cert
    # hostname is required; with subjectAltNames set, it is used only for SNI
    hostname: "auth.example.com"
    subjectAltNames:
    - type: "Hostname"
      hostname: "auth.example.com"
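
The URI type mentioned above can be sketched as follows. This is a minimal illustration, not a tested manifest: the policy name and the SPIFFE-style identity are hypothetical, and the hostname is listed again under subjectAltNames because, once SANs are set, the top-level hostname is only used for SNI.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-auth-sans   # hypothetical name
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: auth
    sectionName: "https"
  validation:
    caCertificateRefs:
    - group: "" # core API group
      kind: ConfigMap
      name: auth-cert
    # Used only for SNI, because subjectAltNames is set below.
    hostname: "auth.example.com"
    subjectAltNames:
    # Repeated here so the certificate is still checked against the hostname.
    - type: "Hostname"
      hostname: "auth.example.com"
    # Hypothetical workload identity, shown only to illustrate the URI type.
    - type: "URI"
      uri: "spiffe://example.com/ns/default/sa/auth"
```

With this configuration the backend certificate is validated against the SANs rather than against the top-level hostname.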

In this example, the BackendTLSPolicy is configured to use system certificates to connect with a TLS-encrypted backend connection where Pods backing the dev Service are expected to serve a valid certificate for dev.example.com.

apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-dev
spec:
  targetRefs:
  - kind: Service
    name: dev
    group: ""
    sectionName: "btls"
  validation:
    wellKnownCACertificates: "System"
    hostname: dev.example.com

More information on the configuration of TLS in Gateway API can be found in Gateway API - TLS Configuration.

Status information about the features that an implementation supports

Leads: Lior Lieberman, Beka Modebadze

GEP-2162: Supported features in GatewayClass Status

GatewayClass status has a new field, supportedFeatures. This addition allows implementations to declare the set of features they support. This provides a clear way for users and tools to understand the capabilities of a given GatewayClass.

This feature's name for conformance tests (and GatewayClass status reporting) is SupportedFeatures. Implementations must populate the supportedFeatures field in the .status of the GatewayClass before the GatewayClass is accepted, or in the same operation.

Here’s an example of supportedFeatures published under a GatewayClass's .status:

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
...
status:
  conditions:
  - lastTransitionTime: "2022-11-16T10:33:06Z"
    message: Handled by Foo controller
    observedGeneration: 1
    reason: Accepted
    status: "True"
    type: Accepted
  supportedFeatures:
  - name: HTTPRoute
  - name: HTTPRouteHostRewrite
  - name: HTTPRoutePortRedirect
  - name: HTTPRouteQueryParamMatching

The graduation of SupportedFeatures to Standard helped improve the conformance testing process for Gateway API. The conformance test suite will now automatically run tests based on the features populated in the GatewayClass's status. This creates a strong, verifiable link between an implementation's declared capabilities and the test results, making it easier for implementers to run the correct conformance tests and for users to trust the conformance reports.

This means that when the supportedFeatures field is populated in the GatewayClass status, there is no need for additional conformance test flags such as --supported-features, --exempt-features, or --all-features. Note that Mesh features are an exception: they can be tested for conformance by using Conformance Profiles, or by manually providing any combination of feature-related flags, until the dedicated resource graduates from the experimental channel.

Named rules for Routes

GEP-995: Adding a new name field to all xRouteRule types (HTTPRouteRule, GRPCRouteRule, etc.)

Leads: Guilherme Cassolato

This enhancement enables route rules to be explicitly identified and referenced across the Gateway API ecosystem. Some of the key use cases include:

  • Status: Allowing status conditions to reference specific rules directly by name.
  • Observability: Making it easier to identify individual rules in logs, traces, and metrics.
  • Policies: Enabling policies (GEP-713) to target specific route rules via the sectionName field in their targetRef[s].
  • Tooling: Simplifying filtering and referencing of route rules in tools such as gwctl, kubectl, and general-purpose utilities like jq and yq.
  • Internal configuration mapping: Facilitating the generation of internal configurations that reference route rules by name within gateway and mesh implementations.

This follows the same well-established pattern already adopted for Gateway listeners, Service ports, Pods (and containers), and many other Kubernetes resources.

While the new name field is optional (so existing resources remain valid), its use is strongly encouraged. Implementations are not expected to assign a default value, but they may enforce constraints such as immutability.

Finally, keep in mind that the name format is validated, and other fields (such as sectionName) may impose additional, indirect constraints.
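
As a rough illustration of named rules, the sketch below gives each rule of an HTTPRoute a name. The route, Gateway, Service names, and paths are all hypothetical. The second document is only a fragment showing how a policy's targetRefs could point at one rule through sectionName; whether a given policy kind honors sectionName for route rules depends on that policy and the implementation.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route              # hypothetical
spec:
  parentRefs:
  - name: example-gateway        # hypothetical
  rules:
  - name: checkout               # the new optional rule name
    matches:
    - path:
        type: PathPrefix
        value: /checkout
    backendRefs:
    - name: checkout-svc
      port: 8080
  - name: catalog
    matches:
    - path:
        type: PathPrefix
        value: /catalog
    backendRefs:
    - name: catalog-svc
      port: 8080
---
# Fragment only: the targetRefs stanza of a hypothetical policy
# that applies just to the "checkout" rule.
targetRefs:
- group: gateway.networking.k8s.io
  kind: HTTPRoute
  name: store-route
  sectionName: checkout
```

Because the names are part of the spec, tools such as jq or yq can select a rule by name instead of by its position in the list.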

Experimental channel changes

Enabling external Auth for HTTPRoute

Giving Gateway API the ability to enforce authentication and maybe authorization as well at the Gateway or HTTPRoute level has been a highly requested feature for a long time. (See the GEP-1494 issue for some background.)

This Gateway API release adds an Experimental filter in HTTPRoute that tells the Gateway API implementation to call out to an external service to authenticate (and, optionally, authorize) requests.

This filter is based on the Envoy ext_authz API, and allows talking to an Auth service that uses either gRPC or HTTP for its protocol.

Both methods allow the configuration of what headers to forward to the Auth service, with the HTTP protocol allowing some extra information like a prefix path.

An HTTP example might look like this (note that this example requires the Experimental channel to be installed and an implementation that supports External Auth to actually understand the config):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: require-auth
  namespace: default
spec:
  parentRefs:
  - name: your-gateway-here
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /admin
    filters:
    - type: ExternalAuth
      externalAuth:
        protocol: HTTP
        backendRef:
          name: auth-service
        http:
          # These headers are always sent for the HTTP protocol,
          # but are included here for illustrative purposes
          allowedHeaders:
          - Host
          - Method
          - Path
          - Content-Length
          - Authorization
    backendRefs:
    - name: admin-backend
      port: 8080

This allows the backend Auth service to use the supplied headers to make a determination about the authentication for the request.

When a request is allowed, the external Auth service will respond with a 200 HTTP response code, and optionally extra headers to be included in the request that is forwarded to the backend. When the request is denied, the Auth service will respond with a 403 HTTP response.

Since the Authorization header is used in many authentication methods, this filter can be used to implement Basic, OAuth, JWT, and other common authentication and authorization methods.
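
For comparison, a gRPC-based Auth service might be configured along these lines. This is a hedged sketch: the route name and the backend port are hypothetical, and the grpc stanza with its allowedHeaders list is assumed to mirror the http stanza, so check the field names against the API reference for your installed version before relying on them.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: require-auth-grpc        # hypothetical name
  namespace: default
spec:
  parentRefs:
  - name: your-gateway-here
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /admin
    filters:
    - type: ExternalAuth
      externalAuth:
        protocol: GRPC
        backendRef:
          name: auth-service
          port: 9000             # hypothetical port
        grpc:
          # Assumed to mirror http.allowedHeaders; verify before use.
          allowedHeaders:
          - Authorization
    backendRefs:
    - name: admin-backend
      port: 8080
```

As with the HTTP example, this requires the Experimental channel and an implementation that supports External Auth.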

Mesh resource

Lead(s): Flynn

GEP-3949: Mesh-wide configuration and supported features

Gateway API v1.4.0 introduces a new experimental Mesh resource, which provides a way to configure mesh-wide settings and discover the features supported by a given mesh implementation. This resource is analogous to the Gateway resource and will initially be mainly used for conformance testing, with plans to extend its use to off-cluster Gateways in the future.

The Mesh resource is cluster-scoped and, as an experimental feature, is named XMesh and resides in the gateway.networking.x-k8s.io API group. A key field is controllerName, which specifies the mesh implementation responsible for the resource. The resource's status stanza indicates whether the mesh implementation has accepted it and lists the features the mesh supports.

One of the goals of this GEP is to avoid making it more difficult for users to adopt a mesh. To simplify adoption, mesh implementations are expected to create a default Mesh resource upon startup if one with a matching controllerName doesn't already exist. This avoids the need for manual creation of the resource to begin using a mesh.

The new XMesh API kind, within the gateway.networking.x-k8s.io/v1alpha1 API group, provides a central point for mesh configuration and feature discovery.

A minimal XMesh object specifies the controllerName:

apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XMesh
metadata:
  name: one-mesh-to-mesh-them-all
spec:
  controllerName: one-mesh.example.com/one-mesh

The mesh implementation populates the status field to confirm it has accepted the resource and to list its supported features:

status:
  conditions:
  - type: Accepted
    status: "True"
    reason: Accepted
  supportedFeatures:
  - name: MeshHTTPRoute
  - name: OffClusterGateway

Introducing default Gateways

Lead(s): Flynn

GEP-3793: Allowing Gateways to program some routes by default.

For application developers, one common piece of feedback has been the need to explicitly name a parent Gateway for every single north-south Route. While this explicitness prevents ambiguity, it adds friction, especially for developers who just want to expose their application to the outside world without worrying about the underlying infrastructure's naming scheme. To address this, we have introduced the concept of Default Gateways.

For application developers: Just "use the default"

As an application developer, you often don't care about the specific Gateway your traffic flows through; you just want it to work. With this enhancement, you can now create a Route and simply ask it to use a default Gateway.

This is done by setting the new useDefaultGateways field in your Route's spec.

Here’s a simple HTTPRoute that uses a default Gateway:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  useDefaultGateways: All
  rules:
  - backendRefs:
    - name: my-service
      port: 80

That's it! No more need to hunt down the correct Gateway name for your environment. Your Route is now a "defaulted Route."

For cluster operators: You're still in control

This feature doesn't take control away from cluster operators ("Chihiro"). In fact, they have explicit control over which Gateways can act as a default. A Gateway will only accept these defaulted Routes if it is configured to do so.

You can also use a ValidatingAdmissionPolicy to either require Routes to rely on a default Gateway or forbid them from doing so.

As a cluster operator, you can designate a Gateway as a default by setting the (new) .spec.defaultScope field:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-default-gateway
  namespace: default
spec:
  defaultScope: All
  # ... other gateway configuration

Operators can choose to have no default Gateways, or even multiple.
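
The ValidatingAdmissionPolicy option mentioned above could look roughly like the following, which forbids HTTPRoutes from opting in to default Gateways. It is a sketch under assumptions: the policy and binding names are hypothetical, only HTTPRoute is covered, and the CEL expression relies solely on the useDefaultGateways: All value shown earlier. It also assumes a cluster where the admissionregistration.k8s.io/v1 version of ValidatingAdmissionPolicy is available.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: forbid-default-gateways            # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["gateway.networking.k8s.io"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["httproutes"]
  validations:
  # Reject any HTTPRoute that asks to be bound to default Gateways.
  - expression: "!has(object.spec.useDefaultGateways) || object.spec.useDefaultGateways != 'All'"
    message: "Routes in this cluster must name their parent Gateways explicitly."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: forbid-default-gateways-binding    # hypothetical name
spec:
  policyName: forbid-default-gateways
  validationActions: ["Deny"]
```

Inverting the expression would give the opposite policy, requiring every matching Route to rely on a default Gateway.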

How it works and key details

  • To maintain a clean, GitOps-friendly workflow, a default Gateway does not modify the spec.parentRefs of your Route. Instead, the binding is reflected in the Route's status field. You can always inspect the status.parents stanza of your Route to see exactly which Gateway or Gateways have accepted it. This preserves your original intent and avoids conflicts with CD tools.

  • The design explicitly supports having multiple Gateways designated as defaults within a cluster. When this happens, a defaulted Route will bind to all of them. This enables cluster operators to perform zero-downtime migrations and testing of new default Gateways.

  • You can create a single Route that handles both north-south traffic (traffic entering or leaving the cluster, via a default Gateway) and east-west/mesh traffic (traffic between services within the cluster), by explicitly referencing a Service in parentRefs.
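
The last point can be sketched as a single Route that combines both bindings. The names reuse the earlier example and are illustrative only; whether the Service parentRef takes effect depends on a mesh implementation being installed.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  # North-south: bind to every Gateway that has opted in as a default.
  useDefaultGateways: All
  # East-west: also attach to the Service itself for mesh traffic.
  parentRefs:
  - group: ""
    kind: Service
    name: my-service
  rules:
  - backendRefs:
    - name: my-service
      port: 80
```

Only the Service appears in spec.parentRefs; the default Gateways that accept the Route show up in status.parents instead.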

Default Gateways represent a significant step forward in making the Gateway API simpler and more intuitive for everyday use cases, bridging the gap between the flexibility needed by operators and the simplicity desired by developers.

Configuring client certificate validation

Lead(s): Arko Dasgupta, Katarzyna Łach

GEP-91: Address connection coalescing security issue

This release brings updates for configuring client certificate validation, addressing a critical security vulnerability related to connection reuse. HTTP connection coalescing is a web performance optimization that allows a client to reuse an existing TLS connection for requests to different domains. While this reduces the overhead of establishing new connections, it introduces a security risk in the context of API gateways. The ability to reuse a single TLS connection across multiple Listeners brings the need to introduce shared client certificate configuration in order to avoid unauthorized access.

Why SNI-based mTLS is not the answer

One might think that using Server Name Indication (SNI) to differentiate between Listeners would solve this problem. However, TLS SNI is not a reliable mechanism for enforcing security policies in a connection coalescing scenario. A client could use a single TLS connection for multiple peer connections, as long as they are all covered by the same certificate. This means that a client could establish a connection by indicating one peer identity (using SNI), and then reuse that connection to access a different virtual host that is listening on the same IP address and port. That reuse, which is controlled by client side heuristics, could bypass mutual TLS policies that were specific to the second listener configuration.

Here's an example to help explain it:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: wildcard-tls-gateway
spec:
  gatewayClassName: example
  listeners:
  - name: foo-https
    protocol: HTTPS
    port: 443
    hostname: foo.example.com
    tls:
      certificateRefs:
      - group: "" # core API group
        kind: Secret
        name: foo-example-com-cert # SAN: foo.example.com
  - name: wildcard-https
    protocol: HTTPS
    port: 443
    hostname: "*.example.com"
    tls:
      certificateRefs:
      - group: "" # core API group
        kind: Secret
        name: wildcard-example-com-cert # SAN: *.example.com

I have configured a Gateway with two listeners whose hostnames overlap. My intention is for the foo-https listener to be accessible only by clients presenting the foo-example-com-cert certificate. In contrast, the wildcard-https listener should allow access to a broader audience using any certificate valid for the *.example.com domain.

Consider a scenario where a client initially connects to foo.example.com. The server requests and successfully validates the foo-example-com-cert certificate, establishing the connection. Subsequently, the same client wishes to access other sites within this domain, such as bar.example.com, which is handled by the wildcard-https listener. Due to connection reuse, clients can access wildcard-https backends without an additional TLS handshake on the existing connection. This process functions as expected.

However, a critical security vulnerability arises when the order of access is reversed. If a client first connects to bar.example.com and presents a valid bar.example.com certificate, the connection is successfully established. If this client then attempts to access foo.example.com, the existing connection's client certificate will not be re-validated. This allows the client to bypass the specific certificate requirement for the foo backend, leading to a serious security breach.

The solution: per-port TLS configuration

The updated Gateway API gains a tls field in the .spec of a Gateway that allows you to define a default client certificate validation configuration for all Listeners and then, if needed, override it on a per-port basis. This provides a flexible and powerful way to manage your TLS policies.

Here’s a look at the updated API definitions (shown as Go source code):

// GatewaySpec defines the desired state of Gateway.
type GatewaySpec struct {
    ...
    // GatewayTLSConfig specifies frontend tls configuration for gateway.
    TLS *GatewayTLSConfig `json:"tls,omitempty"`
}

// GatewayTLSConfig specifies frontend tls configuration for gateway.
type GatewayTLSConfig struct {
    // Default specifies the default client certificate validation configuration
    Default TLSConfig `json:"default"`

    // PerPort specifies tls configuration assigned per port.
    PerPort []TLSPortConfig `json:"perPort,omitempty"`
}

// TLSPortConfig describes a TLS configuration for a specific port.
type TLSPortConfig struct {
    // The Port indicates the Port Number to which the TLS configuration will be applied.
    Port PortNumber `json:"port"`

    // TLS store the configuration that will be applied to all Listeners handling
    // HTTPS traffic and matching given port.
    TLS TLSConfig `json:"tls"`
}

Breaking changes

Standard GRPCRoute - .spec field required (technicality)

The promotion of GRPCRoute to Standard introduces a minor but technically breaking change regarding the presence of the top-level .spec field. As part of achieving Standard status, the Gateway API has tightened the OpenAPI schema validation within the GRPCRoute CustomResourceDefinition (CRD) to explicitly ensure the spec field is required for all GRPCRoute resources. This change enforces stricter conformance to Kubernetes object standards and enhances the resource's stability and predictability. While it is highly unlikely that users were attempting to define a GRPCRoute without any specification, any existing automation or manifests that might have relied on a relaxed interpretation allowing a completely absent spec field will now fail validation and must be updated to include the .spec field, even if empty.

Experimental CORS support in HTTPRoute - breaking change for allowCredentials field

The Gateway API subproject has introduced a breaking change to the Experimental CORS support in HTTPRoute, concerning the allowCredentials field within the CORS policy. This field's definition has been strictly aligned with the upstream CORS specification, which dictates that the corresponding Access-Control-Allow-Credentials header must represent a Boolean value. Previously, the implementation might have been overly permissive, potentially accepting non-standard or string representations such as "true" due to relaxed schema validation. Users who were configuring CORS rules must now review their manifests and ensure the value for allowCredentials strictly conforms to the new, more restrictive schema. Any existing HTTPRoute definitions that do not adhere to this stricter validation will now be rejected by the API server, requiring a configuration update to maintain functionality.

Improving the development and usage experience

As part of this release, we have improved some of the developer experience workflow:

  • Added Kube API Linter to the CI/CD pipelines, reducing the burden on API reviewers and the number of common mistakes.
  • Improved the execution time of CRD tests by using envtest.

Additionally, to improve the Gateway API usage experience, we removed some ambiguities and old tech debt from our documentation website:

  • The API reference is now explicit when a field is experimental.
  • The GEP (GatewayAPI Enhancement Proposal) navigation bar is automatically generated, reflecting the real status of the enhancements.

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow the Getting Started Guide.

As of this writing, seven implementations are already conformant with Gateway API v1.4.0. In alphabetical order:

Get involved

Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.

Categories: CNCF Projects, Kubernetes

Helm @ KubeCon + CloudNativeCon NA '25

Helm Blog - Mon, 11/03/2025 - 19:00

The Helm team is headed to KubeCon + CloudNativeCon NA '25 in Atlanta, Georgia next week and it's truly a special one for us! This time around, as we celebrate our 10th birthday (fun fact, Helm was launched at the first KubeCon in 2015), we will also be releasing the highly anticipated Helm 4! Join us for a series of exciting activities throughout the week -- read on for more details!

Helm Booth in Project Pavilion

Don't miss out on meeting our project maintainers at the Helm booth - we'll be hanging out the second half of each day. Drop by to ask questions, learn about what's in Helm 4, and pick up special Hazel swag celebrating Helm's 10th birthday!

  • TUES Nov 11 @ 03:30 PM - 07:45 PM ET

  • WED Nov 12 @ 02:00 PM - 05:00 PM ET

  • THUR Nov 13 @ 12:30 PM - 02:00 PM ET

LOCATION: 8B, back side of Flux and across from Argo

Tuesday November 11, 2025

Simplifying Advanced AI Model Serving on Kubernetes Using Helm Charts

TIME: 12:00 PM - 12:30 PM ET

LOCATION: Building B | Level 4 | B401-402

SPEAKERS: Ajay Vohra & Caron Zhang

The AI model serving landscape on Kubernetes presents practitioners with an overwhelming array of technology choices, from inference servers like Ray Serve and Triton Inference Server, to inference engines like vLLM, to orchestration platforms like Ray Cluster and KServe. While this diversity drives innovation, it also creates complexity. Teams often prematurely standardize on limited technology stacks to manage this complexity.

This talk introduces an innovative Helm-based approach that abstracts the complexity of AI model serving while preserving the flexibility to leverage the best tools for each use case. Our solution is accelerator agnostic, and provides a consistent YAML interface for deploying and experimenting with various serving technologies.

We'll demonstrate this approach through two concrete examples of multi-node, multi-accelerator model serving with auto scaling: 1/ Ray Serve + vLLM + Ray Cluster, and 2/ LeaderWorkerSet + Triton Inference Server + vLLM + Ray Cluster + HPA.

Wednesday November 12, 2025

Maintainer Track: Introducing Helm 4

TIME: 11:00 AM - 11:30 AM ET

LOCATION: Building C | Level 3 | Georgia Ballroom 1

SPEAKERS: Matt Farina & Robert Sirchia (Helm Maintainers)

The wait is over! After six years with Helm v3, Helm v4 is finally here. In this session you'll learn about Helm v4, why there were six years between major versions (from backwards compatible feature development to maintainer ups and downs), what's new in Helm v4, how long Helm v3 will still be supported, and what comes next. Could that include a Helm v5?

Contribfest: Hands-On With Helm 4: Wasm Plugins, OCI, and Resource Sequencing. Oh My!

TIME: 02:15 PM - 03:30 PM ET

LOCATION: Building B | Level 2 | B207

SPEAKERS: Andrew Block, Scott Rigby, & George Jenkins (Helm Maintainers)

Join Helm maintainers for an interactive session contributing to core Helm and building integrations with some of Helm 4's emerging features. We'll guide contributors through creating Helm 4's newest enhancements including WebAssembly plugins, enhancements to how OCI content is managed, and implementing resource sequencing for controlled deployment order. Attendees will explore how to build Download/Postrender/CLI plugins in WebAssembly, develop capabilities related to changes to Helm's management of OCI content including repository prefixes and aliases, and use approaches for sequencing chart deployments beyond Helm's traditional mechanisms.

This session is geared toward anyone interested in Helm development including leveraging and building upon some of the latest features associated with Helm 4!

Helm 4 Release Party

TIME: 06:00 PM - 09:00 PM ET

LOCATION: Max Lager's Wood-Fired Grill & Brewery

Replicated and the CNCF are throwing a Helm 4 Release Party to celebrate the release of Helm 4! Drop by for a low country boil and hang out with the Helm project maintainers for the night! See the invitation here, and don't forget to save your spot – RSVP here.

Thursday November 13, 2025

Mission Abort: Intercepting Dangerous Deletes Before Helm Hits Apply

TIME: 01:45 PM - 02:15 PM ET

LOCATION: Building B | Level 5 | Thomas Murphy Ballroom 4

SPEAKERS: Payal Godhani

What if your next Helm deployment silently deletes a LoadBalancer, a Gateway, or an entire namespace? We’ve lived that nightmare—multiple times. In this talk, we’ll share how we turned painful Sev1 outages into a resilient, guardrail-first deployment strategy. By integrating Helm Diff and Argo CD Diff, we built a system that scans every deployment for destructive changes—like the removal of LoadBalancers, KGateways, Services, PVCs, or Namespaces—and blocks them unless explicitly approved. This second-layer approval acts as a safety circuit for your release pipelines. No guesswork. No blind deploys. Just real-time visibility into what’s about to break—before it actually does. Whether you’re managing a single cluster or an entire fleet, this talk will show you how to stop fearing Helm and start trusting it again. Because resilience isn’t about avoiding failure—it’s about learning, adapting, and building guardrails that protect everyone.

Categories: CNCF Projects, Kubernetes

Announcing Vitess 23.0.0

Vitess Blog - Mon, 11/03/2025 - 19:00

We’re excited to release Vitess 23.0.0, the latest major version of Vitess, bringing new defaults, better operational tooling, and refined metrics. This release builds on the strong foundation of version 22 and is designed to make deployment and observability smoother, while continuing to scale MySQL workloads horizontally with confidence.

✅ Why This Release Matters

For production users of Vitess, this release is meaningful in several ways:

Categories: CNCF Projects

Announcing Linkerd 2.19: Post-quantum cryptography

Linkerd Blog - Thu, 10/30/2025 - 20:00

Today we’re happy to announce Linkerd 2.19! This release introduces a significant state-of-the-art security improvement for Linkerd: a modernized TLS stack that uses post-quantum key exchange algorithms by default.

Linkerd has now seen almost a decade of continuous improvement and evolution. Our goal is to build a service mesh that our users can rely on for 100 years. To do this, we partner with users like Grammarly to ensure that Linkerd can accelerate the full scale and scope of modern software environments, and then we feed those lessons directly back into the product. The Linkerd 2.19 release is the third major version since the announcement of Buoyant’s profitability and Linkerd project sustainability a year ago, and continues our laser focus on operational simplicity: delivering the notoriously complex service mesh feature set in a way that is manageable, scalable, and performant.

Categories: CNCF Projects

How Non-Developers Can Contribute to Prometheus

Prometheus Blog - Thu, 10/30/2025 - 20:00

My first introduction to the Prometheus project was through the Linux Foundation mentorship program, where I conducted UX research. I remember the anxiety I felt when I was selected as a mentee. I was new not just to Prometheus, but to observability entirely. I worried I was in over my head, working in a heavily developer-focused domain with no development background.

That anxiety turned out to be unfounded. I went on to make meaningful contributions to the project, and I've learned that what I experienced is nearly universal among non-technical contributors to open source.

If you're feeling that same uncertainty, this post is for you. I'll share the challenges you're likely to face (or already face), why your contributions matter, and how to find your place in the Prometheus community.

The Challenges Non-Technical Contributors Face

As a non-technical contributor, I've had my share of obstacles in open source. And from conversations with others navigating these spaces, I've found the struggles are remarkably consistent. Here are the most common barriers:

1. The Technical Intimidation Factor

I've felt out of place in open source spaces, mostly because technical contributors vastly outnumber non-technical ones. Even the non-technical people often have technical backgrounds or have been around long enough to understand what's happening.

When every conversation references concepts you don't know, it's easy to feel intimidated. You join meetings and stay silent throughout. You respond in the chat instead of unmuting because you don't trust yourself to speak up in a recorded meeting where everyone else seems fluent in a technical language you're still learning.

2. Unclear Value Proposition

Open source projects rarely spell out their non-technical needs the way a job posting would. You would hardly find an issue titled "Need: someone to interview users and write case studies" or "Wanted: community manager to organize monthly meetups." Instead, you’re more likely to see a backlog of GitHub issues about bugs, feature requests, and code refactoring.

Even if you have valuable skills, you don't know where they're needed, how to articulate your value, or whether your contributions will be seen as mission-critical or just nice-to-have. Without a clear sense of how you fit in, it's difficult to reach out confidently. You end up sending vague messages like "I'd love to help! Let me know if there's anything I can do", which rarely leads anywhere because maintainers are busy and don't have time to figure out how to match your skills to their needs.

3. Lack of Visible Non-Technical Contributors

One of the things that draws me to an open source community or project is finding other people like me. I think it's the same way for most people. Representation matters. It's hard to be what you can't see.

It’s even more difficult to find non-technical contributors because their contributions are often invisible in the ways projects typically showcase work. GitHub contribution graphs count commits. Changelogs list code changes and bug fixes. You only get the "contributor" label when you've created a pull request that got merged. So, even when people are organizing events, supporting users, or conducting research, their work doesn't show up in the same prominent ways code does.

4. The Onboarding Gap

A typical "Contributing Guide" will walk you through setting up a development environment, creating a branch, running tests, and submitting a pull request. But it rarely explains how to contribute documentation improvements, where design feedback should go, or how community support is organized.

You see "Join our community" with a link to a Slack workspace. But between joining and making your first contribution, there's a significant gap. There are hundreds of people in dozens of channels. Who's a maintainer and who's just another community member? Which channel is appropriate for your questions? Who should you tag when you need guidance?

Why These Gaps Exist

It's worth acknowledging that most of the time, these gaps aren't intentional. Projects don't set out to exclude non-technical contributors or make it harder for them to participate.

In most cases, a small group of developers build something useful and decide to open-source it. They invite people they know who might need it (often other developers) to contribute. The project grows organically within those networks. It becomes a community of developers building tools for developers, and certain functions simply don't feel necessary yet. Marketing? The word spreads naturally through tech circles. Community management? The community is small and self-organizing. UX design? They're developers comfortable with command-line interfaces, so they may not fully consider the experience of using a graphical interface.

None of this is malicious. It's just that the project evolved in a context where those skills weren't obviously needed.

The shift happens when someone, often a non-technical contributor who sees the potential, steps in and says: "You've built something valuable and grown an impressive community. But here's what you might be missing. Here's how documentation could lower the barrier to entry. Here's how community management could retain contributors. Here's how user research could guide your roadmap."

Why Non-Technical Contributions Matter

Prometheus is a powerful monitoring system backed by a large community of developers. But like any open source project, it needs more than code to thrive.

It needs accessible documentation. From my experience working with engineers, most would rather focus on building than writing docs, and understandably so. Engineers who know the system inside out often write documentation that assumes knowledge newcomers don't have. What makes perfect sense to someone who built the feature can feel impenetrable to someone encountering it for the first time. A technical writer testing the product from an end user's perspective, not a builder's, can bridge that gap and lower the barrier to entry.

It needs organization. The GitHub issues backlog has hundreds of open items that haven't been triaged. Maintainers spend valuable time parsing what users actually need instead of building solutions. A project manager or someone with triage experience could turn that chaos into a clear roadmap, allowing maintainers to spend their time building solutions.

It needs community support. Imagine a user who joins the Slack workspace, excited to contribute. They don't know where to start. They ask a question that gets buried in the stream of messages. They quietly leave. The project just lost a potential contributor because no one was there to welcome them and point them in the right direction.

These are the situations non-technical contributions can help prevent. Good documentation lowers the barrier to entry, which means more adoption, more feedback, and better features. Active community management retains contributors who would otherwise drift away, which means distributed knowledge and less maintainer burnout. Organization and triage turn scattered input into actionable priorities.

The Prometheus maintainers are doing exceptional work building a robust, scalable monitoring system. But they can't do everything, and they shouldn't have to. The question now isn't whether non-technical contributions matter. It's whether we create the space for them to happen.

Practical Ways You Can Contribute to Prometheus

If you're ready to contribute to Prometheus but aren't sure where to start, here are some areas where non-technical skills are actively needed.

1. Join the UX Efforts

Prometheus is actively working to improve its user experience, and the community now has a UX Working Group dedicated to this effort.

If you're a UX researcher, designer, or someone with an eye for usability, this is an excellent time to get involved. The working group is still taking shape, with ongoing discussions about priorities and processes. Join the Slack channel to participate in these conversations and watch for upcoming announcements about specific ways to contribute.

I can tell you from experience that the community is receptive to UX contributions, and your work will have a real impact.

2. Write for the Prometheus Blog

If you're a technical writer or content creator, the Prometheus blog is a natural entry point. The blog publishes tutorials, case studies, best practices, community updates, and, more generally, any content that helps users get more value from Prometheus.

Check out the blog content guide to understand what makes a strong blog proposal and how to publish a post on the blog. There's an audience eager to learn from your experience.

3. Improve and Maintain Documentation

Documentation is one of those perpetual needs in open source. There's always something that could be clearer, more complete, or better organized. The Prometheus docs repo is no exception.

You can contribute by fixing typos and broken links, expanding getting-started guides, creating tutorials for common monitoring scenarios, or triaging issues to help prioritize what needs attention. Even if you don't consider yourself a technical writer, if you've ever been confused by the docs and figured something out, you can help make it clearer for the next person.

4. Help Organize PromCon

PromCon is Prometheus's annual conference, and it takes significant coordination to pull off. The organizing team handles everything from speaker selection and scheduling to venue logistics and sponsor relationships.

If you have experience in event planning, sponsor outreach, marketing, or communications, the PromCon organizers would welcome your help. Reach out to the organizing team or watch for announcements in the Prometheus community channels.

5. Advocate and Amplify

Finally, one of the simplest but most impactful things you can do is talk about Prometheus. Write blog posts about how you're using Prometheus. Give talks at local meetups or conferences. Share tips and learnings on social media. Create video tutorials or live streams. Recommend Prometheus to teams evaluating monitoring solutions.

Every piece of content, every conference talk, every social media post expands Prometheus's reach and helps new users discover it.

How to Get Started

If you're ready to contribute to Prometheus, here's what I've learned from my own experience navigating the community as a non-technical contributor:

Start by introducing yourself. When you join the #prometheus-dev Slack channel, say hello. Slack doesn't always make it obvious when someone new joins, so if you stay silent, people simply won't know you're there. A simple introduction—your name, what you do, what brought you to Prometheus—is enough to make your presence known.

Attend community meetings. Check out the community calendar and sync the meetings that interest you. Even if you don't understand everything being discussed at first (and that's completely normal), stay. The more you sit in, the more you'll learn about the community's needs and the more opportunities to contribute you'll find.

Observe before you act. It's tempting to jump in with ideas immediately, but spending time as an observer first pays off. Read through Slack discussions and conversations in GitHub issues. Browse the documentation. Notice what kinds of contributions are being made. You'll start to see patterns: recurring questions, documentation gaps, areas where help is needed. That's where your opportunity lies.

Ask questions. Everyone was new once. If something isn't clear, ask. If you don't get a response right away, give it some time—people are busy—then follow up. The community is welcoming, but you have to make yourself visible.

The Prometheus community has room for you. Now you know exactly where to begin.

Categories: CNCF Projects

Connecting distributed Kubernetes with Cilium and SD-WAN: Building an intelligent network fabric 

CNCF Blog Projects Category - Sat, 10/25/2025 - 10:00

Learn how Kubernetes-native traffic management and SD-WAN integration can deliver consistent security, observability, and performance across distributed clusters.

The challenge of distributed Kubernetes networking

Modern businesses are rapidly adopting distributed architectures to meet growing demands for performance, resilience, and global reach. This shift is driven by emerging workloads that demand distributed infrastructure: AI/ML model training distributed across GPU clusters, real-time edge analytics processing IoT data streams, and global enterprise operations that require seamless connectivity across on-premises workloads, data centers, cloud providers, and edge locations.

Today businesses are increasingly struggling to ensure secure, reliable and high-performance global connectivity while maintaining visibility across this distributed infrastructure. How do you maintain consistent end-to-end policies when applications traverse multiple network boundaries? How do you optimize performance for latency-sensitive critical applications when they could be running anywhere? And how do you gain clear visibility into application communication across this complex, multi-cluster, multi-cloud landscape? 

This is where a modern, integrated approach to networking becomes essential, one that understands both the intricacies of Kubernetes and the demands of wide-area connectivity. Let’s explore a proposal for seamlessly bridging your Kubernetes clusters, regardless of location, while intelligently managing the underlying network paths. Such an integrated approach solves several critical business needs:

  • Unified security posture: Consistent policy enforcement from the wide-area network down to individual microservices.
  • Optimized performance: Intelligent traffic routing that adapts to real-time conditions and application requirements.
  • Global visibility: End-to-end observability across all layers of the network stack.

In this post we discuss how to interconnect Cilium with a Software-Defined Wide Area Network (SD-WAN) fabric to extend Kubernetes-native traffic management and security policies into the underlying network interconnect. Learn how such integration simplifies operations while delivering the performance and security modern distributed workloads demand.

Towards an intelligent network fabric 

Imagine a globally distributed service deployed across dozens of locations worldwide. Latency-critical microservices are deployed at the edge, critical workloads run on-premises for data protection, while elastic services leverage public cloud scalability. These components must constantly communicate across cluster boundaries: IoT streams flow to central management, customer data replicates across regions for sovereignty compliance, and real-time analytics span multiple sites.

Bridging Kubernetes and SD-WAN with Cilium

Enter Cilium, a universal networking layer connecting Kubernetes workloads, VMs, and physical servers across clouds, data centers, and edge locations. Simply mark a service as “global” and Cilium ensures its availability throughout your distributed multi-cluster infrastructure (Figure 1). But even single-cluster Kubernetes deployments may benefit from an intelligent WAN interconnect, when different nodes and physical servers of the same cluster may run at multiple geographically diverse locations (Figure 2). No matter at which location a service is running, Cilium intelligently routes and balances load across the entire deployment.

Yet a critical gap remains: controlling how traffic traverses the underlying network interconnect. A modern SD-WAN implementation (such as Cisco Catalyst SD-WAN) could deliver the intelligent interconnect these services need by providing performance guarantees and traffic differentiation for the SD-WAN tunnels between sites. Unfortunately, leveraging these capabilities in a Kubernetes-native way remains a challenge.

Figure 1: Multi-cluster scenario: An SD-WAN connects multiple Kubernetes clusters.

Figure 2: Single-cluster scenario: An SD-WAN fabric interconnects geographically distributed nodes of a single Kubernetes cluster.

We suggest leveraging the concept of a Kubernetes operator to bridge this divide. Continuously monitoring the Kubernetes API, the operator could translate service requirements into SD-WAN policies, automatically mapping inter-cluster traffic to appropriate network paths based on your declared intent. Simply annotate your Kubernetes service manifests, and the operator handles the rest. For the purposes of this guide we will use a minimalist SD-WAN operator. Other SD-WAN operators (such as AWI Catalyst SD-WAN Operator) offer varying degrees of Kubernetes integration.

The role of a Kubernetes operator

Need guaranteed performance for business-critical global services? One service annotation will route traffic through a dedicated SD-WAN tunnel, bypassing congestion and bottlenecks. Require encryption for sensitive data flows? Another annotation ensures tamper-resistant paths between clusters. In general, such an intelligent cloud SD-WAN interconnect would provide the following features:

  • Map services to specific SD-WAN tunnels for optimized routing (see below).
  • Provide end-to-end Service Level Objectives (SLOs) across sites and nodes.
  • Implement comprehensive monitoring to track service health and performance across the entire network.
  • Enable selective traffic inspection or blocking on the interconnect for enhanced security and compliance.
  • Isolate tenants’ inter-cluster traffic in distributed multi-tenant workloads.  

A Kubernetes operator can bring these capabilities, and many more, into the Kubernetes ecosystem, maintaining the declarative, GitOps-friendly approach cloud-native teams expect.

Enforcing traffic policies with Cilium and Cisco Catalyst SD-WAN

In this guide, we demonstrate how an operator can enforce granular traffic policies for Kubernetes services using Cilium and Cisco Catalyst SD-WAN. The setup ensures secure, prioritized routing for business-critical services while allowing best-effort traffic to use default paths. 

We will assume that SD-WAN connectivity is established between the clusters/nodes so that the SD-WAN interconnects all Kubernetes deployment sites (nodes/clusters) and routes pod-to-pod traffic seamlessly across the WAN. We further assume Cilium is configured in Native Routing Mode so that we have full visibility into the traffic that travels between the clusters/nodes in the SD-WAN. 

Once installed, the SD-WAN operator will automatically generate SD-WAN policies based on your Kubernetes service configurations. This seamless integration ensures that your network policies adapt dynamically as your Kubernetes environment evolves.

To illustrate, let’s look at a demo environment (see Figure 3) featuring:

  • A Kubernetes cluster with two nodes deployed in the “single-cluster scenario” (Figure 2).
  • Nodes interconnected via Cilium, running over two distinct SD-WAN tunnels.

In this setup, as you define or update Kubernetes services, the operator will automatically program the underlying SD-WAN fabric to enforce the appropriate connectivity and security policies for your workloads.

Figure 3: A simplified demo setup with a single-cluster Kubernetes with two nodes interconnected with Cilium and a modern SD-WAN implementation (such as Cisco Catalyst SD-WAN). Note: This pattern extends seamlessly to multi-cluster deployments.

End-to-end policy enforcement example

Within the cluster, we will deploy two services, each with specific connectivity and security requirements:

  • Best-effort service: Designed for non-sensitive workloads, this service leverages standard network connectivity. It is ideal for applications where best-effort delivery is sufficient and there are no stringent security or performance requirements.
  • Business service: This service is responsible for handling business-critical traffic that requires reliable performance. To maintain stringent service levels, all traffic for the Business Service must be routed exclusively through the dedicated Business WAN (bizinternet) SD-WAN tunnel. This approach provides optimized network performance and strong isolation from general-purpose traffic, so that critical communications remain secure and uninterrupted.

By tailoring network policies to the unique needs of each service, we achieve both operational efficiency for routine workloads and robust protection for sensitive business data.

By default, all traffic crossing the cluster boundary uses the default tunnel. In order to ensure that the Business Service uses the Business WAN we just need to add a Kubernetes annotation to the corresponding Kubernetes Service:

code

How does this work? The SD-WAN operator watches Service objects, extracts endpoint IPs/ports from the Business Service pods, and dynamically programs the SD-WAN to enforce the business tunnel policy (see Figure 4).
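The translation step described above can be sketched in Python. This is a minimal illustration, not the operator's actual code: the annotation key `sdwan.example.com/tunnel`, the `TunnelRule` type, the endpoint dictionary layout, and the IP addresses are all assumptions made for this example. Only the `bizinternet` tunnel name comes from the text.

```python
from dataclasses import dataclass

# Hypothetical annotation key; the real operator's key is not shown in this post.
TUNNEL_ANNOTATION = "sdwan.example.com/tunnel"


@dataclass(frozen=True)
class TunnelRule:
    """One match rule the operator would push to the SD-WAN controller."""
    dest_ip: str
    dest_port: int
    protocol: str
    tunnel: str


def rules_for_service(service: dict, endpoints: list[dict]) -> list[TunnelRule]:
    """Translate an annotated Service and its pod endpoints into tunnel rules.

    Services without the annotation yield no rules, so their traffic
    stays on the default tunnel.
    """
    annotations = service.get("metadata", {}).get("annotations", {})
    tunnel = annotations.get(TUNNEL_ANNOTATION)
    if tunnel is None:
        return []
    return [
        TunnelRule(ep["ip"], ep["port"], ep.get("protocol", "TCP"), tunnel)
        for ep in endpoints
    ]


# Example input: an annotated Service and the endpoints of its pods.
business_service = {
    "metadata": {
        "name": "business",
        "annotations": {TUNNEL_ANNOTATION: "bizinternet"},
    }
}
business_endpoints = [
    {"ip": "10.0.1.12", "port": 8080},
    {"ip": "10.0.2.7", "port": 8080},
]

for rule in rules_for_service(business_service, business_endpoints):
    print(rule)
```

A real operator would run this translation inside a watch loop and re-run it whenever the Service or its endpoints change; that part is omitted here.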

Figure 4: The Kubernetes objects read by the SD-WAN operator to instantiate the SD-WAN rules for the business service.

Future directions: Observability and SLO awareness

Meanwhile, Figure 5 illustrates the SD-WAN configuration generated by the SD-WAN operator. The configuration highlights two key aspects:

  • Business WAN tunnel enforcement: All traffic destined for the pods of the Business Service is strictly routed through the dedicated bizinternet SD-WAN tunnel. This ensures that business-critical traffic receives optimized performance as it traverses the network.
  • Traffic identification: The SD-WAN dynamically identifies Business Service traffic by inspecting the source and destination IP addresses and ports of the service endpoints. This granular detection enables precise policy enforcement, ensuring that only the intended traffic is routed through the secure business tunnel.

Together, these mechanisms provide robust security and fine-grained control over service-specific traffic flows within and across your Kubernetes clusters.
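The traffic identification step can be pictured as a lookup on the address and port of each side of a flow. The Python sketch below is a simplified illustration only: the endpoint table, the `default` tunnel name, and the `select_tunnel` function are assumptions, and a real SD-WAN performs this match in its data plane rather than in Python.

```python
# Endpoint (IP, port) pairs of the Business Service pods mapped to their tunnel.
# The addresses and the "default" tunnel name are made up for illustration.
BUSINESS_ENDPOINTS = {
    ("10.0.1.12", 8080): "bizinternet",
    ("10.0.2.7", 8080): "bizinternet",
}
DEFAULT_TUNNEL = "default"


def select_tunnel(src: tuple[str, int], dst: tuple[str, int]) -> str:
    """Pick the tunnel for a flow given (ip, port) source and destination.

    A flow is treated as Business Service traffic when either side is a
    known service endpoint, so requests and replies take the same tunnel.
    """
    for side in (dst, src):
        tunnel = BUSINESS_ENDPOINTS.get(side)
        if tunnel is not None:
            return tunnel
    return DEFAULT_TUNNEL


# A client request to a business pod, its reply, and unrelated traffic.
print(select_tunnel(("10.0.3.4", 51000), ("10.0.1.12", 8080)))
print(select_tunnel(("10.0.1.12", 8080), ("10.0.3.4", 51000)))
print(select_tunnel(("10.0.3.4", 51000), ("10.0.9.9", 80)))
```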

Figure 5: The SD-WAN configuration issued by the operator

Conclusion

By using a Kubernetes operator, it is possible to integrate Cilium and a modern SD-WAN implementation (such as Cisco Catalyst SD-WAN) into a single end-to-end framework that intelligently connects distributed workloads with controlled security and performance. Key takeaways:

  • Annotation-driven end-to-end policies: Kubernetes service annotations simplify policy definition, enabling developers to declare intent without needing SD-WAN expertise.
  • Automated SD-WAN programming: An SD-WAN operator bridges Kubernetes and the SD-WAN, translating service configurations into real-time network policies.
  • Secure multi-tenancy: Critical services are isolated in dedicated tunnels. At the same time, best-effort traffic shares the default tunnel, optimizing security and cost.

This demo operator, however, is only a first step: it provides just the basic intelligent connectivity features. Future work includes exploring end-to-end observability, monitoring, and tracing tools. Today, Hubble provides an observability layer for Cilium that shows flows from a Kubernetes perspective, while Cisco Catalyst SD-WAN Manager and Cisco Catalyst SD-WAN Analytics provide extended network observability and visibility; what is still missing is a single pane of glass. Further work might also consider exposing the SD-WAN SLOs to Kubernetes for automatic service mapping and extending the framework to new use cases.

Learn more

Feel free to reach out to the authors at the contact details below. Visit cilium.io to learn more about Cilium. More details on Cisco Catalyst SD-WAN can be found on:
https://www.cisco.com/site/us/en/solutions/networking/sdwan/catalyst/index.html

The creators of Cilium and Cisco Catalyst SD-WAN are also hiring! Check out https://jobs.cisco.com/jobs/SearchJobs/sdwan or https://jobs.cisco.com/jobs/SearchJobs/isovalent for their listings.

Authors:

Gábor Rétvári

Tamás Lévai

Categories: CNCF Projects

Follow Up - Preventing Upgrade Failures from etcd v3.5 to v3.6

etcd Blog - Mon, 10/20/2025 - 20:00

We have identified and fixed an additional scenario that may cause upgrade failures when moving from etcd v3.5 to v3.6. This post contains details, the fix, and additional workarounds. Please refer to issue 20793 to get detailed technical information.

Issue

In a previous post — How to Prevent a Common Failure when Upgrading etcd v3.5 to v3.6 — we described an upgrade issue affecting etcd versions in v3.5.1-v3.5.19. That issue was addressed in v3.5.20. However, a follow-up investigation revealed that the original fix did not cover all scenarios.

Categories: CNCF Projects

7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)

Kubernetes Blog - Mon, 10/20/2025 - 11:30

It’s no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes, enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I’ve encountered (or seen others run into) and share some tips on how to avoid them. Whether you’re just kicking the tires on Kubernetes or already managing production clusters, I hope these insights help you steer clear of a little extra stress.

1. Skipping resource requests and limits

The pitfall: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them—making the omission easy to overlook in early configurations or during rapid deployment cycles.

Context: In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods. When resource requests and limits are not set:

  1. Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.
  2. Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.

How to avoid it:

  • Start with modest requests (for example 100m CPU, 128Mi memory) and see how your app behaves.
  • Monitor real-world usage and refine your values; the HorizontalPodAutoscaler can help automate scaling based on metrics.
  • Keep an eye on kubectl top pods or your logging/monitoring tool to confirm you’re not over- or under-provisioning.
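One way to catch this pitfall before deployment is a small check over the Pod spec. The Python sketch below is an illustration under stated assumptions: `missing_resources` is a hypothetical helper, the image names and the 256Mi limit are made up, and the spec is a plain dictionary that mirrors the Kubernetes field names (`resources`, `requests`, `limits`). The 100m CPU and 128Mi memory requests come from the tip above.

```python
def missing_resources(pod_spec: dict) -> list[str]:
    """Return one message per CPU/memory request or limit that is not set."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(section, {}):
                    # section[:-1] turns "requests" into "request", "limits" into "limit"
                    problems.append(
                        f"{container['name']}: no {resource} {section[:-1]}"
                    )
    return problems


pod_spec = {
    "containers": [
        {
            "name": "web",
            "image": "registry.example.com/web:1.4.2",
            "resources": {
                "requests": {"cpu": "100m", "memory": "128Mi"},
                "limits": {"memory": "256Mi"},
            },
        },
        # This container sets nothing, so every field is reported.
        {"name": "sidecar", "image": "registry.example.com/sidecar:0.9.0"},
    ]
}

for problem in missing_resources(pod_spec):
    print(problem)
```

A check like this could run in CI against rendered manifests; whether a missing CPU limit should count as a problem is a policy choice for your team.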

My reality check: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got OOMKilled left and right. Lesson learned. For detailed instructions on configuring resource requests and limits for your containers, please refer to Assign Memory Resources to Containers and Pods (part of the official Kubernetes documentation).

2. Underestimating liveness and readiness probes

The pitfall: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container “running” as long as the process inside hasn’t exited. Without additional signals, Kubernetes assumes the workload is functioning—even if the application inside is unresponsive, initializing, or stuck.

Context:
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.

  • Liveness probes determine if the application is still alive. If a liveness check fails, the container is restarted.
  • Readiness probes control whether a container is ready to serve traffic. While the readiness probe is failing, the Pod is left out of Service endpoints.
  • Startup probes help distinguish between long startup times and actual failures.

How to avoid it:

  • Add a simple HTTP livenessProbe to check a health endpoint (for example /healthz) so Kubernetes can restart a hung container.
  • Use a readinessProbe to ensure traffic doesn’t reach your app until it’s warmed up.
  • Keep probes simple. Overly complex checks can create false alarms and unnecessary restarts.
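To show how small such probes can be, here is a sketch that builds a container definition in Python and prints it as JSON, which kubectl accepts alongside YAML. The field names (`livenessProbe`, `readinessProbe`, `httpGet`, `periodSeconds`) are the Kubernetes ones; the `http_probe` helper, the `/ready` path, port 8080, the image name, and the timing values are assumptions for illustration.

```python
import json


def http_probe(path: str, port: int, period_seconds: int = 10) -> dict:
    """Build an HTTP GET probe using Kubernetes' probe field names."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
    }


container = {
    "name": "web",
    "image": "registry.example.com/web:1.4.2",
    "ports": [{"containerPort": 8080}],
    # Restart the container if the health endpoint stops answering.
    "livenessProbe": http_probe("/healthz", 8080),
    # Keep the Pod out of Service endpoints until the app reports ready.
    "readinessProbe": http_probe("/ready", 8080, period_seconds=5),
}

print(json.dumps(container, indent=2))
```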

My reality check: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.

For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to Configure Liveness, Readiness and Startup Probes in the official Kubernetes documentation.

3. “We’ll just look at container logs” (famous last words)

The pitfall: Relying solely on container logs retrieved via kubectl logs. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, kubectl logs only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node’s local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.

How to avoid it:

  • Centralize logs using CNCF tools like Fluentd or Fluent Bit to aggregate output from all Pods.
  • Adopt OpenTelemetry for a unified view of logs, metrics, and (if needed) traces. This lets you spot correlations between infrastructure events and app-level behavior.
  • Pair logs with Prometheus metrics to track cluster-level data alongside application logs. If you need distributed tracing, consider CNCF projects like Jaeger.

My reality check: The first time I lost Pod logs to a quick restart, I realized how flimsy “kubectl logs” can be on its own. Since then, I’ve set up a proper pipeline for every cluster to avoid missing vital clues.

4. Treating dev and prod exactly the same

The pitfall: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors—such as traffic patterns, resource availability, scaling needs, or access control—can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.

How to avoid it:

  • Use environment overlays or kustomize to maintain a shared base while customizing resource requests, replicas, or config for each environment.
  • Extract environment-specific configuration into ConfigMaps and / or Secrets. You can use a specialized tool such as Sealed Secrets to manage confidential data.
  • Plan for scale in production. Your dev cluster can probably get away with minimal CPU/memory, but prod might need significantly more.
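The idea behind overlays, a shared base plus small per-environment changes, can be sketched with a recursive dictionary merge in Python. This only illustrates the concept and is not how Kustomize is implemented; the `apply_overlay` helper, the environment names, and all values are assumptions.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a copy of base with overlay values merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested settings instead of replacing the whole section.
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared settings that every environment starts from.
base = {
    "replicas": 2,
    "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
}
# Each environment lists only what differs from the base.
overlays = {
    "dev": {},
    "prod": {"replicas": 10, "resources": {"requests": {"cpu": "500m"}}},
}

for env, overlay in overlays.items():
    print(env, apply_overlay(base, overlay))
```

Because each overlay lists only the differences, a change to the base reaches every environment while production keeps its own replica count and CPU request.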

My reality check: One time, I scaled up replicaCount from 2 to 10 in a tiny dev environment just to “test.” I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.

5. Leaving old stuff floating around

The pitfall: Leaving unused or outdated resources—such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims—running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.

How to avoid it:

  • Label everything with a purpose or owner label. That way, you can easily query resources you no longer need.
  • Regularly audit your cluster: run kubectl get all -n <namespace> to see what’s actually running, and confirm it’s all legit.
  • Adopt Kubernetes’ Garbage Collection: K8s docs show how to remove dependent objects automatically.
  • Leverage policy automation: Tools like Kyverno can automatically delete or block stale resources after a certain period, or enforce lifecycle policies so you don’t have to remember every single cleanup step.
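A periodic audit along these lines can be sketched in Python. The example assumes the resources have already been exported as dictionaries (for instance from JSON output of kubectl); the `owner` label key, the 14-day threshold, the `cleanup_candidates` helper, and the sample objects are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def cleanup_candidates(resources, now, max_age=timedelta(days=14)):
    """Return names of resources with no owner label or older than max_age."""
    flagged = []
    for res in resources:
        meta = res["metadata"]
        # Kubernetes timestamps end in "Z"; rewrite it so older Python
        # versions of datetime.fromisoformat can parse the value.
        created = datetime.fromisoformat(
            meta["creationTimestamp"].replace("Z", "+00:00")
        )
        unowned = "owner" not in meta.get("labels", {})
        if unowned or now - created > max_age:
            flagged.append(meta["name"])
    return flagged


now = datetime(2025, 10, 20, tzinfo=timezone.utc)
resources = [
    {"metadata": {"name": "web", "labels": {"owner": "team-a"},
                  "creationTimestamp": "2025-10-15T09:00:00Z"}},
    {"metadata": {"name": "test-svc",
                  "creationTimestamp": "2025-09-29T18:30:00Z"}},
    {"metadata": {"name": "old-job", "labels": {"owner": "team-b"},
                  "creationTimestamp": "2025-08-01T00:00:00Z"}},
]

print(cleanup_candidates(resources, now))
```

The output is a list to review by hand, not a list to delete automatically; age alone does not prove a resource is unused.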

My reality check: After a hackathon, I forgot to tear down a “test-svc” pinned to an external load balancer. Three weeks later, I realized I’d been paying for that load balancer the entire time. Facepalm.

6. Diving too deep into networking too soon

The pitfall: Introducing advanced networking solutions—such as service meshes, custom CNI plugins, or multi-cluster communication—before fully understanding Kubernetes' native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works: including Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.

How to avoid it:

  • Start small: a Deployment, a Service, and a basic ingress controller such as one based on NGINX (e.g., Ingress-NGINX).
  • Make sure you understand how traffic flows within the cluster, how service discovery works, and how DNS is configured.
  • Only move to a full-blown mesh or advanced CNI features when you actually need them; complex networking adds overhead.

My reality check: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.

7. Going too light on security and RBAC

The pitfall: Deploying workloads with insecure configurations, such as running containers as the root user, using the latest image tag, disabling security contexts, or assigning overly broad RBAC roles like cluster-admin. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security policies in place, clusters can remain exposed to risks like container escape, unauthorized privilege escalation, or accidental production changes due to unpinned images.

How to avoid it:

  • Use RBAC to define roles and permissions within Kubernetes. While RBAC is the default and most widely supported authorization mechanism, Kubernetes also allows the use of alternative authorizers. For more advanced or external policy needs, consider solutions like OPA Gatekeeper (based on Rego), Kyverno, or custom webhooks using policy languages such as CEL or Cedar.
  • Pin images to specific versions (no more :latest!). This helps you know what’s actually deployed.
  • Look into Pod Security Admission (or other solutions like Kyverno) to enforce non-root containers, read-only filesystems, etc.
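Some of these checks are simple enough to sketch as a pre-deployment lint in Python. This is an illustration, not a replacement for Pod Security Admission or Kyverno: `security_findings` is a hypothetical helper and the sample spec is made up. The field names `securityContext`, `runAsNonRoot`, and `readOnlyRootFilesystem` are the Kubernetes ones.

```python
def security_findings(pod_spec: dict) -> list[str]:
    """Flag unpinned images, possible root users, and writable root filesystems."""
    findings = []
    pod_ctx = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container["name"]
        image = container["image"]
        # Look only after the last "/" so a registry port is not read as a tag.
        last_segment = image.rsplit("/", 1)[-1]
        has_digest = "@" in image
        has_tag = ":" in last_segment and not last_segment.endswith(":latest")
        if not (has_digest or has_tag):
            findings.append(f"{name}: image is not pinned to a version")
        ctx = container.get("securityContext", {})
        # The container-level setting overrides the Pod-level one.
        if not ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not set")
        if not ctx.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: root filesystem is writable")
    return findings


pod_spec = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [
        {
            "name": "web",
            "image": "registry.example.com:5000/web:1.4.2",
            "securityContext": {"readOnlyRootFilesystem": True},
        },
        {"name": "debug", "image": "busybox:latest"},
    ],
}

for finding in security_findings(pod_spec):
    print(finding)
```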

My reality check: I never had a huge security breach, but I’ve heard plenty of cautionary tales. If you don’t tighten things up, it’s only a matter of time before something goes wrong.

Final thoughts

Kubernetes is amazing, but it’s not psychic; it won’t magically do the right thing if you don’t tell it what you need. By keeping these pitfalls in mind, you’ll avoid a lot of headaches and wasted time. Mistakes happen (trust me, I’ve made my share), but each one is a chance to learn more about how Kubernetes truly works under the hood. If you’re curious to dive deeper, the official docs and the community Slack are excellent next steps. And of course, feel free to share your own horror stories or success tips, because at the end of the day, we’re all in this cloud native adventure together.

Happy Shipping!

Categories: CNCF Projects, Kubernetes

Hands off Linkerd certificate rotation

Linkerd Blog - Sun, 10/19/2025 - 20:00

This blog post was originally published on Matthew McLane’s Medium blog.

I’ll start by saying that I think Linkerd is a great tool. We use it at work to provide TLS between our pods, which frees us from having to build that functionality directly into our containers. When it works, it’s fantastic! It’s simple to get up and running and just does the job without a lot of extra fuss. For the most part, it’s been a very hands-off experience, which is exactly what we need.

Categories: CNCF Projects
