
Feed aggregator

Five ways of looking at Jco, Part 1

Bytecode Alliance - Wed, 03/18/2026 - 20:00
Jco (@bytecodealliance/jco on NPM) is a “multi-tool for the JS WebAssembly ecosystem.” At the 2026 Bytecode Alliance Plumbers Summit, Technical Steering Committee member Bailey Hayes put it another way: Jco is “like five projects in one.” It’s certainly a project with many facets—five big ones, arguably! Recognizing what those facets are, and how they fit together, is the key to understanding why Jco matters beyond the JavaScript ecosystem. In this blog series, we’ll draw on Victor Adossi’s Plumbers Summit presentation to take an in-depth look at Jco from five different perspectives, in order to better grasp how you can use (and contribute to!) Jco today. There’s a lot to unpack here, so in this first post, we’ll try to get to grips with Jco as a layered architecture that brings together many pieces of the Wasm and JS ecosystem.
Categories: Web Assembly

Securing Production Debugging in Kubernetes

Kubernetes Blog - Wed, 03/18/2026 - 14:00

During production debugging, the fastest route is often broad access such as cluster-admin (a ClusterRole that grants administrator-level access), shared bastions/jump boxes, or long-lived SSH keys. It works in the moment, but it comes with two common problems: auditing becomes difficult, and temporary exceptions have a way of becoming routine.

This post offers my recommendations for good practices applicable to existing Kubernetes environments with minimal tooling changes:

  • Least privilege with RBAC
  • Short-lived, identity-bound credentials
  • An SSH-style handshake model for cloud native debugging

A good architecture for securing production debugging workflows is to use a just-in-time secure shell gateway (often deployed as an on-demand pod in the cluster). It acts as an SSH-style “front door” that makes temporary access actually temporary. You authenticate with short-lived, identity-bound credentials and establish a session to the gateway, and the gateway uses the Kubernetes API and RBAC to control what you can do, such as pods/log, pods/exec, and pods/portforward. Sessions expire automatically, and both the gateway logs and Kubernetes audit logs capture who accessed what and when, without shared bastion accounts or long-lived keys.

1) Using an access broker on top of Kubernetes RBAC

RBAC controls who can do what in Kubernetes. Many Kubernetes environments rely primarily on RBAC for authorization, although Kubernetes also supports other authorization modes such as Webhook authorization. You can enforce access directly with Kubernetes RBAC, or put an access broker in front of the cluster that still relies on Kubernetes permissions under the hood. In either model, Kubernetes RBAC remains the source of truth for what the Kubernetes API allows and at what scope.

An access broker adds controls that RBAC does not cover well. For example, it can decide whether a request is auto-approved or requires manual approval, whether a user can run a command, and which commands are allowed in a session. It can also manage group membership so that you grant permissions to groups instead of individual users. Kubernetes RBAC can allow actions such as pods/exec, but it cannot restrict which commands run inside an exec session.

With that model, Kubernetes RBAC defines the allowed actions for a user or group (for example, an on-call team in a single namespace). I recommend defining access rules that grant rights only to groups or to ServiceAccounts, never to individual users. The broker or identity provider then adds or removes users from that group as needed.

The broker can also enforce extra policy on top, like which commands are permitted in an interactive session and which requests can be auto-approved versus require manual approval. That policy can live in a JSON or XML file and be maintained through code review, so updates go through a formal pull request and are reviewed like any other production change.
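As a sketch, such a broker policy file might look like the following (the schema here is hypothetical — real brokers each define their own format — but the shape illustrates what gets code-reviewed):

```json
{
  "rules": [
    {
      "group": "oncall-payments",
      "clusters": ["prod-eu"],
      "namespaces": ["payments"],
      "autoApprove": ["pods/log", "pods/portforward"],
      "requireApproval": ["pods/exec"],
      "allowedExecCommands": ["cat /var/log/app.log", "ps aux", "jstack 1"],
      "maxSessionTTL": "30m"
    }
  ]
}
```

Because the file is plain data, a pull request that widens `allowedExecCommands` or flips an action from `requireApproval` to `autoApprove` is visible at a glance in review.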

Example: a namespaced on-call debug Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: oncall-debug
  namespace: <namespace>
rules:
  # Discover what’s running
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]

  # Read logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]

  # Interactive debugging actions
  - apiGroups: [""]
    resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]

  # Understand rollout/controller state
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]

  # Optional: allow kubectl debug ephemeral containers
  - apiGroups: [""]
    resources: ["pods/ephemeralcontainers"]
    verbs: ["update"]

Bind the Role to a group (rather than individual users) so membership can be managed through your identity provider:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: oncall-debug
  namespace: <namespace>
subjects:
  - kind: Group
    name: oncall-<team-name>
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: oncall-debug
  apiGroup: rbac.authorization.k8s.io

2) Short-lived, identity-bound credentials

The goal is to use short-lived, identity-bound credentials that clearly tie a session to a real person and expire quickly. These credentials can include the user’s identity and the scope of what they’re allowed to do. They’re typically bound to a private key that stays with the engineer, such as a hardware-backed key (for example, a YubiKey), so they cannot be used without access to that key.

You can implement this with Kubernetes-native authentication (for example, client certificates or an OIDC-based flow), or have the access broker from the previous section issue short-lived credentials on the user’s behalf. In many setups, Kubernetes still uses RBAC to enforce permissions based on the authenticated identity and groups/claims. If you use an access broker, it can also encode additional scope constraints in the credential and enforce them during the session, such as which cluster or namespace the session applies to and which actions (or approved commands) are allowed against pods or nodes. In either case, the credentials should be signed by a certificate authority (CA), and that CA should be rotated on a regular schedule (for example, quarterly) to limit long-term risk.

Option A: short-lived OIDC tokens

A lot of managed Kubernetes clusters already give you short-lived tokens. The main thing is to make sure your kubeconfig refreshes them automatically instead of copying a long-lived token into the file.

For example:

users:
- name: oncall
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1
      command: cred-helper
      args: ["--cluster=prod", "--ttl=30m"]
      interactiveMode: Never  # required field in the v1 exec credential API

Option B: Short-lived client certificates (X.509)

If your API server (or your access broker from the previous section) is set up to trust a client CA, you can use short-lived client certificates for debugging access. The idea is:

  • The private key is created and kept on the engineer’s machine (ideally hardware-backed, like a non-exportable key in a YubiKey/PIV token).
  • A short-lived certificate is issued (often via the CertificateSigningRequest API, or your access broker from the previous section, with a TTL).
  • RBAC maps the authenticated identity to a minimal Role.

This is straightforward to operationalize with the Kubernetes CertificateSigningRequest API.

Generate a key and CSR locally:

# Generate a private key.
# This could instead be generated within a hardware token;
# OpenSSL and several similar tools include support for that.
openssl genpkey -algorithm Ed25519 -out oncall.key

openssl req -new -key oncall.key -out oncall.csr \
 -subj "/CN=user/O=oncall-payments"

Create a CertificateSigningRequest with a short expiration:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: oncall-<user>-20260218
spec:
  request: <base64-encoded oncall.csr>
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 1800  # 30 minutes
  usages:
    - client auth

After the CSR is approved and signed, you extract the issued certificate and use it together with the private key to authenticate, for example via kubectl.
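Assuming the CSR above, the approval and retrieval steps look roughly like this (approval typically happens on the broker or admin side; the CSR and file names follow the earlier examples):

```
# Approve the CSR (requires permission on certificatesigningrequests/approval).
kubectl certificate approve oncall-<user>-20260218

# The signed certificate appears in .status.certificate, base64-encoded.
kubectl get csr oncall-<user>-20260218 \
  -o jsonpath='{.status.certificate}' | base64 -d > oncall.crt

# Wire the certificate and private key into a kubeconfig credential.
kubectl config set-credentials oncall \
  --client-certificate=oncall.crt \
  --client-key=oncall.key \
  --embed-certs=true
```

Because `expirationSeconds` was set to 1800, the resulting credential stops working on its own half an hour later — no revocation step required.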

3) Use a just-in-time access gateway to run debugging commands

Once you have short-lived credentials, you can use them to open a secure shell session to a just-in-time access gateway, often exposed over SSH and created on demand. If the gateway is exposed over SSH, a common pattern is to issue the engineer a short-lived OpenSSH user certificate for the session. The gateway trusts your SSH user CA, authenticates the engineer at connection time, and then applies the approved session policy before making Kubernetes API calls on the user’s behalf. OpenSSH certificates are separate from Kubernetes X.509 client certificates, so these are usually treated as distinct layers.

The resulting session should also be scoped so it cannot be reused outside of what was approved. For example, the gateway or broker can limit it to a specific cluster and namespace, and optionally to a narrower target such as a pod or node. That way, even if someone tries to reuse the access, it will not work outside the intended scope. After the session is established, the gateway executes only the allowed actions and records what happened for auditing.

Example: Namespace-scoped role bindings

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jit-debug
  namespace: <namespace>
  annotations:
    kubernetes.io/description: >
      Colleagues performing semi-privileged debugging, with access provided
      just in time and on demand.
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jit-debug
  namespace: <namespace>
subjects:
  - kind: Group
    name: jit:oncall:<namespace>  # mapped from the short-lived credential (cert/OIDC)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: jit-debug
  apiGroup: rbac.authorization.k8s.io

These RBAC objects, and the rules they define, allow debugging only within the specified namespace; attempts to access other namespaces are not allowed.

Example: Cluster-scoped role binding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: jit-cluster-read
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: jit-cluster-read
subjects:
  - kind: Group
    name: jit:oncall:cluster
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: jit-cluster-read
  apiGroup: rbac.authorization.k8s.io

These RBAC rules grant cluster-wide read access (for example, to nodes and namespaces) and should be used only for workflows that truly require cluster-scoped resources.

Finer-grained restrictions like “only this pod/node” or “only these commands” are typically enforced by the access gateway/broker during the session, but Kubernetes also offers other options, such as ValidatingAdmissionPolicy for restricting writes and webhook authorization for custom authorization across verbs.
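For instance, a ValidatingAdmissionPolicy could constrain ephemeral-container debugging to images from an approved registry. This is a sketch; the policy name and registry prefix are placeholders:

```
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: debug-image-registry
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["pods/ephemeralcontainers"]
  validations:
    - expression: >-
        object.spec.ephemeralContainers.all(c,
        c.image.startsWith('registry.example.com/debug/'))
      message: "Debug containers must use images from the approved registry."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: debug-image-registry
spec:
  policyName: debug-image-registry
  validationActions: ["Deny"]
```

Note that admission policies only see API writes; they cannot inspect the commands run inside an established exec session, which is why that control stays with the gateway or broker.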

In environments with stricter access controls, you can add an extra, short-lived session mediation layer to separate session establishment from privileged actions. Both layers are ephemeral, use identity-bound expiring credentials, and produce independent audit trails. The mediation layer handles session setup/forwarding, while the execution layer performs only RBAC-authorized Kubernetes actions. This separation can reduce exposure by narrowing responsibilities, scoping credentials per step, and enforcing end-to-end session expiry.


Disclaimer: The views expressed in this post are solely those of the author and do not reflect the views of the author’s employer or any other organization.

Categories: CNCF Projects, Kubernetes

Running Rook at Petabyte Scale Across Multiple Regions

Rook Blog - Wed, 03/18/2026 - 10:50

This post describes how SAP’s cloud infrastructure team uses Rook to manage a multi-region Ceph fleet — from bare metal provisioning to rolling upgrades — as part of building a digitally sovereign storage backbone for Europe.

The 120 Petabyte Challenge

When you are responsible for a target of 120 Petabytes of storage across 30 Regions, manual operations don’t scale.

For years, SAP Cloud Infrastructure relied on a mix of proprietary appliances and legacy OpenStack Swift. But as we architected our next-generation cloud stack (internally part of the Apeiro project), we faced a non-negotiable constraint: Digital Sovereignty. Our stack had to be completely free of hyperscaler lock-in, running on our own hardware in our own data centers.

This created a concrete engineering challenge: build a storage layer that is API-first, fully open-source, and capable of self-management at a global scale. We chose Ceph for the storage engine — and Rook for the automation layer that makes it manageable.

Why Rook

Managing Ceph at this scale without an operator would mean building and maintaining custom tooling for OSD lifecycle, daemon placement, upgrade orchestration, and failure recovery across every region. Rook gives us all of this as a declarative Kubernetes-native interface, which means our existing GitOps and CI/CD workflows extend naturally to storage. Instead of writing region-specific runbooks, we write Helm values.

Architecture: The Separation of Metal and Software

Our platform, CobaltCore, is built on top of Gardener and Metal-API, both part of the ApeiroRA reference architecture. In this stack, storage isn’t a static resource — it’s a programmable Kubernetes object. We run storage on dedicated nodes, separate from application workloads. At our density (16 NVMe drives per node), co-locating workloads would create unacceptable I/O interference, so storage nodes do one thing: serve data.

The Metal Layer

Metal-API and Gardener manage the physical lifecycle of bare-metal servers: inventory, provisioning, firmware, and OS deployment. This allows Rook to focus purely on the software layer without worrying about the underlying physical state.

The Declarative Storage Layer (Rook)

Once nodes are handed over, Rook takes control. We use a strict GitOps workflow to ensure consistency across the fleet:

  • Base Blueprint: A central Helm chart defines global best practices and standard Ceph configurations.
  • Region Overlay: Region-specific resources (CephBlockPools, RGW placement rules) are injected via localized values.yaml files.
  • Automation: Rook handles the rest: bootstrapping daemons, configuring CRUSH failure domains, and provisioning RGW endpoints.
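A region overlay in this model might look like the following values.yaml fragment. This is illustrative only — the actual chart keys are internal to SAP’s Helm charts — but it shows how little a region needs to specify on top of the base blueprint:

```
# values-<region>.yaml — region-specific overlay on the base blueprint
cephClusterSpec:
  storage:
    nodeCount: 28        # region sizing
  crushFailureDomain: rack
blockPools:
  - name: premium-nvme
    replicated:
      size: 3
objectStores:
  - name: rgw-<region>
    gateway:
      instances: 8
```

Everything not set here — daemon placement, mon counts, standard Ceph tuning — comes from the central chart, so a new region is mostly a small values file plus CI.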

Standard Storage Node Spec:

  • Server: Dell PowerEdge R7615
  • CPU: AMD EPYC 9554P (64 cores)
  • RAM: 384 GB
  • Storage: 16x 14 TB NVMe
  • Network: 100 GbE (redundant)

Validation: Establishing the Performance Envelope

Before committing to production capacity planning, we needed to establish the performance envelope of our RGW tier. We ran a breakpoint test on a typical Ceph Squid cluster (28 nodes, 362 OSDs) to find the stable operating range, saturation threshold, and hard ceiling.

Test Setup

  • Workload: 2M objects (4 KB each), 20 k6 clients, single “premium” NVMe bucket.
  • Method: Ramping load over 30 minutes until p90 latency exceeded 500 ms.

Results

Request rate ramp: successful requests (pink) peak at 171K ops/sec before the test exits. Failed requests (blue) spike briefly near saturation.
  • Saturation Point: The cluster entered saturation around 90K GET/s — latency percentiles begin diverging and request queues start building.
  • Breaking Point: Peak of 171K GET/s (measured on RGW) before the runners hit the latency exit condition.

Note: isolated 503s appeared as early as ~33K GET/s on a single RGW instance, likely caused by uneven load distribution rather than cluster-wide saturation.

Reading the charts

Client-side latency (k6): flat near zero through moderate load, stepping up as the cluster approaches saturation, and reaching 1.5s+ at the breaking point.

The client-side latency chart tells the story most directly. Average request duration stays flat near zero well into the ramp — then steps up sharply as the cluster enters saturation and eventually hits its ceiling.

RGW-side GET latency: all percentiles stay flat and sub-ms through moderate load. Around 90K GET/s, p99 begins climbing while median remains low — a classic saturation signal.

Comparing the two charts reveals where the system saturates. At peak load, RGW reports p99 latency of ~210ms — but clients observe 1.5 seconds. The gap is connection queueing: requests waiting to be picked up by RGW Beast frontend threads. RGW’s internal metrics only measure processing time after a request is accepted, not time spent in the queue.

The RGW latency chart also shows that RADOS operation latency climbs under load, which means RGW threads stay occupied longer, contributing to the queue buildup. At the breaking point, request queues filled and RGWs began returning 503s across all instances.

This is a read-focused baseline — our primary workload is read-heavy. The saturation point of 90K GET/s gives us a conservative operating ceiling for per-region capacity planning.

Operational Reality: Making Day 2 Uneventful

The true test of any storage system is what happens when things break or need upgrading. At our scale, the goal is to make operations boring.

Zero-Downtime Upgrades

Rook has reduced storage maintenance from a coordinated event to a background task. Since the first cluster went live in May 2024, we have maintained a continuous upgrade cadence with zero customer-facing downtime and zero data loss:

  • GardenLinux: Monthly rolling updates across all regions.
  • Kubernetes: Quarterly version upgrades.
  • Rook: Quarterly upgrades (v1.14 through v1.18), with additional upgrades when a needed feature ships in a new release.
  • Ceph: Major version migration from Reef v18 to Squid v19. A rolling upgrade of the largest cluster (~816 OSDs) completes in approximately 2 days.

Drive Failures

With ~2,800 OSDs in the fleet, drive failures are a routine event. When a drive fails, Ceph (RADOS) automatically handles data recovery and rebalancing across the remaining OSDs — no operator action is needed to protect data. On the Kubernetes side, Rook detects the failed OSD pod and manages its lifecycle. The full drive replacement cycle (removing the failed OSD, clearing the device, provisioning a new OSD on the replacement drive) still involves operational steps on our side, but Ceph’s self-healing ensures data durability is never at risk while the replacement is carried out.

Current Status and What’s Next

As of early 2026, the fleet spans 10 live regions (with an 11th newly provisioned):

  • Storage Nodes: 251
  • Total OSDs: ~2,800
  • Raw Capacity: ~37 PiB

Region sizes range from 13-node / 96-OSD deployments to 59-node / 816-OSD clusters — the same Rook-based GitOps workflow handles both.

The next phase is bringing high-performance Block Storage (RBD) into this declarative model to fully retire our remaining proprietary SANs.

  • Target: 30 Regions.
  • Target Capacity: 120 PB.

We are active contributors to the Rook project and continue collaborating with the maintainers as we scale toward these targets. The Rook Slack community has been a valuable resource throughout this journey.

This work is part of ApeiroRA — an open initiative developing a reference blueprint for sovereign cloud-edge infrastructure. All components use enterprise-friendly open-source licenses under neutral community governance. ApeiroRA welcomes participants — whether you want to adopt the blueprints, contribute components, or shape the architecture. Get started at the documentation portal.

Authors: SAP Engineering Team, CLYSO Engineering Team.

Running Rook at Petabyte Scale Across Multiple Regions was originally published in the Rook Blog on Medium.

Categories: CNCF Projects

Meta’s AI Glasses and Privacy

Schneier on Security - Wed, 03/18/2026 - 07:07

Surprising no one, Meta’s new AI glasses are a privacy disaster.

I’m not sure what can be done here. This is a technology that will exist, whether we like it or not.

Meanwhile, there is a new Android app that detects when there are smart glasses nearby.

Categories: Software Security

South Korean Police Accidentally Post Cryptocurrency Wallet Password

Schneier on Security - Tue, 03/17/2026 - 06:01

An expensive mistake:

Someone jumped at the opportunity to steal $4.4 million in crypto assets after South Korea’s National Tax Service exposed publicly the mnemonic recovery phrase of a seized cryptocurrency wallet.

The funds were stored in a Ledger cold wallet seized in law enforcement raids at 124 high-value tax evaders that resulted in confiscating digital assets worth 8.1 billion won (currently approximately $5.6 million).

When announcing the success of the operation, the agency released photos of a Ledger device, a popular hardware wallet for crypto storage and management...

Categories: Software Security

Introducing OpenShift Service Mesh 3.3 with post-quantum cryptography

Red Hat Security - Mon, 03/16/2026 - 20:00
Red Hat OpenShift Service Mesh 3.3 is now generally available with Red Hat OpenShift Container Platform and Red Hat OpenShift Platform Plus. Based on the Istio, Envoy, and Kiali projects, this release updates the version of Istio to 1.28 and Kiali to 2.22, and is supported on OpenShift Container Platform 4.18 and above. While this release includes many updates, it also sets the stage for the next generation of service mesh features, including post-quantum cryptographic (PQC) encryption, AI enablement, and support for the inclusion of external virtual machines (VMs) with service mesh.
Categories: Software Security

What's New in the Fastly Extension for Raycast

Fastly Blog (Security) - Mon, 03/16/2026 - 20:00
Manage Fastly Compute data stores (KV, Config, Secret), ACLs, and view audit logs with 5 powerful new commands in the Fastly Raycast extension.
Categories: Software Security

Under Attack? How Fastly Can Help

Fastly Blog (Security) - Mon, 03/16/2026 - 20:00
Under attack? Fastly's CSOC provides human-led security with a median 1-minute response time for critical DDoS and security threats.
Categories: Software Security

The Invisible Rewrite: Modernizing the Kubernetes Image Promoter

Kubernetes Blog - Mon, 03/16/2026 - 20:00

Every container image you pull from registry.k8s.io got there through kpromo, the Kubernetes image promoter. It copies images from staging registries to production, signs them with cosign, replicates signatures across more than 20 regional mirrors, and generates SLSA provenance attestations. If this tool breaks, no Kubernetes release ships. Over the past few weeks, we rewrote its core from scratch, deleted 20% of the codebase, made it dramatically faster, and nobody noticed. That was the whole point.

A bit of history

The image promoter started in late 2018 as an internal Google project by Linus Arver. The goal was simple: replace the manual, Googler-gated process of copying container images into k8s.gcr.io with a community-owned, GitOps-based workflow. Push to a staging registry, open a PR with a YAML manifest, get it reviewed and merged, and automation handles the rest. KEP-1734 formalized this proposal.

In early 2019, the code moved to kubernetes-sigs/k8s-container-image-promoter and grew quickly. Over the next few years, Stephen Augustus consolidated multiple tools (cip, gh2gcs, krel promote-images, promobot-files) into a single CLI called kpromo. The repository was renamed to promo-tools. Adolfo Garcia Veytia (Puerco) added cosign signing and SBOM support. Tyler Ferrara built vulnerability scanning. Carlos Panato kept the project in a healthy and releasable state. 42 contributors made about 3,500 commits across more than 60 releases.

It worked. But by 2025 the codebase carried the weight of seven years of incremental additions from multiple SIGs and subprojects. The README said it plainly: you will see duplicated code, multiple techniques for accomplishing the same thing, and several TODOs.

The problems we needed to solve

Production promotion jobs for Kubernetes core images regularly took over 30 minutes and frequently failed with rate limit errors. The core promotion logic had grown into a monolith that was hard to extend and difficult to test, making new features like provenance or vulnerability scanning painful to add.

On the SIG Release roadmap, two work items had been sitting for a while: "Rewrite artifact promoter" and "Make artifact validation more robust". We had discussed these at SIG Release meetings and KubeCons, and the open research spikes on project board #171 captured eight questions that needed answers before we could move forward.

One issue to answer them all

In February 2026, we opened issue #1701 ("Rewrite artifact promoter pipeline") and answered all eight spikes in a single tracking issue. The rewrite was deliberately phased so that each step could be reviewed, merged, and validated independently. Here is what we did:

Phase 1: Rate Limiting (#1702). Rewrote rate limiting to properly throttle all registry operations with adaptive backoff.

Phase 2: Interfaces (#1704). Put registry and auth operations behind clean interfaces so they can be swapped out and tested independently.

Phase 3: Pipeline Engine (#1705). Built a pipeline engine that runs promotion as a sequence of distinct phases instead of one large function.

Phase 4: Provenance (#1706). Added SLSA provenance verification for staging images.

Phase 5: Scanner and SBOMs (#1709). Added vulnerability scanning and SBOM support. Flipped the default to the new pipeline engine. At this point we cut v4.2.0 and let it soak in production before continuing.

Phase 6: Split Signing from Replication (#1713). Separated image signing from signature replication into their own pipeline phases, eliminating the rate limit contention that caused most production failures.

Phase 7: Remove Legacy Pipeline (#1712). Deleted the old code path entirely.

Phase 8: Remove Legacy Dependencies (#1716). Deleted the audit subsystem, deprecated tools, and e2e test infrastructure.

Phase 9: Delete the Monolith (#1718). Removed the old monolithic core and its supporting packages. Thousands of lines deleted across phases 7 through 9.

Each phase shipped independently. v4.3.0 followed the next day with the legacy code fully removed.

With the new architecture in place, a series of follow-up improvements landed: parallelized registry reads (#1736), retry logic for all network operations (#1742), per-request timeouts to prevent pipeline hangs (#1763), HTTP connection reuse (#1759), local registry integration tests (#1746), the removal of deprecated credential file support (#1758), a rework of attestation handling to use cosign's OCI APIs and the removal of deprecated SBOM support (#1764), and a dedicated promotion record predicate type registered with the in-toto attestation framework (#1767). These would have been much harder to land without the clean separation the rewrite provided. v4.4.0 shipped all of these improvements and enabled provenance generation and verification by default.

The new pipeline

The promotion pipeline now has seven clearly separated phases:

Setup → Plan → Provenance → Validate → Promote → Sign → Attest

  • Setup: Validate options, prewarm TUF cache.
  • Plan: Parse manifests, read registries, compute which images need promotion.
  • Provenance: Verify SLSA attestations on staging images.
  • Validate: Check cosign signatures, exit here for dry runs.
  • Promote: Copy images server-side, preserving digests.
  • Sign: Sign promoted images with keyless cosign.
  • Attest: Generate promotion provenance attestations using a dedicated in-toto predicate type.

Phases run sequentially, so each one gets exclusive access to the full rate limit budget. No more contention. Signature replication to mirror registries is no longer part of this pipeline and runs as a dedicated periodic Prow job instead.

Making it fast

With the architecture in place, we turned to performance.

Parallel registry reads (#1736): The plan phase reads 1,350 registries. We parallelized this and the plan phase dropped from about 20 minutes to about 2 minutes.

Two-phase tag listing (#1761): Instead of checking all 46,000 image groups across more than 20 mirrors, we first check only the source repositories. About 57% of images have no signatures at all because they were promoted before signing was enabled. We skip those entirely, cutting API calls roughly in half.

Source check before replication (#1727): Before iterating all mirrors for a given image, we check if the signature exists on the primary registry first. In steady state where most signatures are already replicated, this reduced the work from about 17 hours to about 15 minutes.

Per-request timeouts (#1763): We observed intermittent hangs where a stalled connection blocked the pipeline for over 9 hours. Every network operation now has its own timeout and transient failures are retried automatically.

Connection reuse (#1759): We started reusing HTTP connections and auth state across operations, eliminating redundant token negotiations. This closed a long-standing request from 2023.

By the numbers

Here is what the rewrite looks like in aggregate.

  • Over 40 PRs merged, 3 releases shipped (v4.2.0, v4.3.0, v4.4.0)
  • Over 10,000 lines added and over 16,000 lines deleted, a net reduction of about 5,000 lines (20% smaller codebase)
  • Performance drastically improved across the board
  • Robustness improved with retry logic, per-request timeouts, and adaptive rate limiting
  • 19 long-standing issues closed

The codebase shrank by a fifth while gaining provenance attestations, a pipeline engine, vulnerability scanning integration, parallelized operations, retry logic, integration tests against local registries, and a standalone signature replication mode.

No user-facing changes

This was a hard requirement. The kpromo cip command accepts the same flags and reads the same YAML manifests. The post-k8sio-image-promo Prow job continued working throughout. The promotion manifests in kubernetes/k8s.io did not change. Nobody had to update their workflows or configuration.

We caught two regressions early in production. One (#1731) caused a registry key mismatch that made every image appear as "lost" so that nothing was promoted. Another (#1733) set the default thread count to zero, blocking all goroutines. Both were fixed within hours. The phased release strategy (v4.2.0 with the new engine, v4.3.0 with legacy code removed) gave us a clear rollback path that we fortunately never needed.

What comes next

Signature replication across all mirror registries remains the most expensive part of the promotion cycle. Issue #1762 proposes eliminating it entirely by having archeio (the registry.k8s.io redirect service) route signature tag requests to a single canonical upstream instead of per-region backends. Another option would be to move signing closer to the registry infrastructure itself. Both approaches need further discussion with the SIG Release and infrastructure teams, but either one would remove thousands of API calls per promotion cycle and simplify the codebase even further.

Thank you

This project has been a community effort spanning seven years. Thank you to Linus, Stephen, Adolfo, Carlos, Ben, Marko, Lauri, Tyler, Arnaud, and many others who contributed code, reviews, and planning over the years. The SIG Release and Release Engineering communities provided the context, the discussions, and the patience for a rewrite of infrastructure that every Kubernetes release depends on.

If you want to get involved, join us in #release-management on the Kubernetes Slack or check out the repository.

Categories: CNCF Projects, Kubernetes

Possible New Result in Quantum Factorization

Schneier on Security - Mon, 03/16/2026 - 05:46

I’m skeptical about—and not qualified to review—this new result in factorization with a quantum computer, but if it’s true it’s a theoretical improvement in the speed of factoring large numbers with a quantum computer.

Categories: Software Security

Project Harbor at KubeCon + CloudNativeCon Europe 2026 in Amsterdam

Harbor Blog - Sun, 03/15/2026 - 06:00
Project Harbor at KubeCon + CloudNativeCon Europe 2026 in Amsterdam The cloud-native community is once again gathering for one of the most anticipated events of the year — KubeCon + CloudNativeCon Europe 2026, taking place in the beautiful and dynamic city of Amsterdam, Netherlands. As always, Project Harbor will be proudly represented, bringing the latest innovations in container registry technology, security, and software supply chain management to the global Kubernetes community.
Categories: CNCF Projects

Upcoming Speaking Engagements

Schneier on Security - Sat, 03/14/2026 - 12:02

This is a current list of where and when I am scheduled to speak:

Categories: Software Security

Friday Squid Blogging: Increased Squid Population in the Falklands

Schneier on Security - Fri, 03/13/2026 - 17:05

Some good news: squid stocks seem to be recovering in the waters off the Falkland Islands.

As usual, you can also use this squid post to talk about the security stories in the news that I haven’t covered.

Blog moderation policy.

Categories: Software Security

Academia and the “AI Brain Drain”

Schneier on Security - Fri, 03/13/2026 - 07:04

In 2025, Google, Amazon, Microsoft and Meta collectively spent US$380 billion on building artificial-intelligence tools. That number is expected to surge still higher this year, to $650 billion, to fund the building of physical infrastructure, such as data centers (see go.nature.com/3lzf79q). Moreover, these firms are spending lavishly on one particular segment: top technical talent.

Meta reportedly offered a single AI researcher, who had cofounded a start-up firm focused on training AI agents to use computers, a compensation package of $250 million over four years (see ...

Categories: Software Security

iPhones and iPads Approved for NATO Classified Data

Schneier on Security - Thu, 03/12/2026 - 15:59

Apple announcement:

…iPhone and iPad are the first and only consumer devices in compliance with the information assurance requirements of NATO nations. This enables iPhone and iPad to be used with classified information up to the NATO restricted level without requiring special software or settings—a level of government certification no other consumer mobile device has met.

This is out of the box, no modifications required.

Boing Boing post.

Categories: Software Security

Ingress NGINX Controller for Kubernetes Retires – Where to Go From Here

Fastly Blog (Security) - Wed, 03/11/2026 - 20:00
The Ingress NGINX Controller for Kubernetes is retiring in March 2026. Learn your migration options to the Gateway API or HAProxy to maintain security and long-term architecture.
Categories: Software Security

Iran-Backed Hackers Claim Wiper Attack on Medtech Firm Stryker

Krebs on Security - Wed, 03/11/2026 - 12:20

A hacktivist group with links to Iran’s intelligence agencies is claiming responsibility for a data-wiping attack against Stryker, a global medical technology company based in Michigan. News reports out of Ireland, Stryker’s largest hub outside of the United States, said the company sent home more than 5,000 workers there today. Meanwhile, a voicemail message at Stryker’s main U.S. headquarters says the company is currently experiencing a building emergency.

Based in Kalamazoo, Michigan, Stryker [NYSE:SYK] is a medical and surgical equipment maker that reported $25 billion in global sales last year. In a lengthy statement posted to Telegram, an Iranian hacktivist group known as Handala (a.k.a. Handala Hack Team) claimed that Stryker’s offices in 79 countries have been forced to shut down after the group erased data from more than 200,000 systems, servers and mobile devices.

A manifesto posted by the Iran-backed hacktivist group Handala, claiming a mass data-wiping attack against medical technology maker Stryker.


“All the acquired data is now in the hands of the free people of the world, ready to be used for the true advancement of humanity and the exposure of injustice and corruption,” a portion of the Handala statement reads.

The group said the wiper attack was in retaliation for a Feb. 28 missile strike that hit an Iranian school and killed at least 175 people, most of them children. The New York Times reports today that an ongoing military investigation has determined the United States is responsible for the deadly Tomahawk missile strike.

Handala was one of several Iran-linked hacker groups recently profiled by Palo Alto Networks, which links it to Iran’s Ministry of Intelligence and Security (MOIS). Palo Alto says Handala surfaced in late 2023 and is assessed as one of several online personas maintained by Void Manticore, a MOIS-affiliated actor.

Stryker’s website says the company has 56,000 employees in 61 countries. A phone call placed Wednesday morning to the media line at Stryker’s Michigan headquarters sent this author to a voicemail message that stated, “We are currently experiencing a building emergency. Please try your call again later.”

A report Wednesday morning from the Irish Examiner said Stryker staff are now communicating via WhatsApp for any updates on when they can return to work. The story quoted an unnamed employee saying anything connected to the network is down, and that “anyone with Microsoft Outlook on their personal phones had their devices wiped.”

“Multiple sources have said that systems in the Cork headquarters have been ‘shut down’ and that Stryker devices held by employees have been wiped out,” the Examiner reported. “The login pages coming up on these devices have been defaced with the Handala logo.”

Wiper attacks usually involve malicious software designed to overwrite any existing data on infected devices. But a trusted source with knowledge of the attack who spoke on condition of anonymity told KrebsOnSecurity the perpetrators in this case appear to have used a Microsoft service called Microsoft Intune to issue a ‘remote wipe’ command against all connected devices.

Intune is a cloud-based solution built for IT teams to enforce security and data compliance policies, and it provides a single, web-based administrative console to monitor and control devices regardless of location. The Intune connection is supported by this Reddit discussion on the Stryker outage, where several users who claimed to be Stryker employees said they were told to uninstall Intune urgently.

Palo Alto says Handala’s hack-and-leak activity is primarily focused on Israel, with occasional targeting outside that scope when it serves a specific agenda. The security firm said Handala also has taken credit for recent attacks against fuel systems in Jordan and an Israeli energy exploration company.

“Recent observed activities are opportunistic and ‘quick and dirty,’ with a noticeable focus on supply-chain footholds (e.g., IT/service providers) to reach downstream victims, followed by ‘proof’ posts to amplify credibility and intimidate targets,” Palo Alto researchers wrote.

The Handala manifesto posted to Telegram referred to Stryker as a “Zionist-rooted corporation,” which may be a reference to the company’s 2019 acquisition of the Israeli company OrthoSpace.

Stryker is a major supplier of medical devices, and the ongoing attack is already affecting healthcare providers. One healthcare professional at a major university medical system in the United States told KrebsOnSecurity they are currently unable to order surgical supplies that they normally source through Stryker.

“This is a real-world supply chain attack,” said the expert, who asked to remain anonymous because they were not authorized to speak to the press. “Pretty much every hospital in the U.S. that performs surgeries uses their supplies.”

John Riggi, national advisor for the American Hospital Association (AHA), said the AHA is not aware of any supply-chain disruptions as of yet.

“We are aware of reports of the cyber attack against Stryker and are actively exchanging information with the hospital field and the federal government to understand the nature of the threat and assess any impact to hospital operations,” Riggi said in an email. “As of this time, we are not aware of any direct impacts or disruptions to U.S. hospitals as a result of this attack. That may change as hospitals evaluate services, technology and supply chain related to Stryker and if the duration of the attack extends.”

This is a developing story. Updates will be noted with a timestamp.

Update, 2:54 p.m. ET: Added comment from Riggi and perspectives on this attack’s potential to turn into a supply-chain problem for the healthcare system.

Categories: Software Security

Canada Needs Nationalized, Public AI

Schneier on Security - Wed, 03/11/2026 - 07:04

Canada has a choice to make about its artificial intelligence future. The Carney administration is investing $2-billion over five years in its Sovereign AI Compute Strategy. Will any value generated by “sovereign AI” be captured in Canada, making a difference in the lives of Canadians, or is this just a passthrough to investment in American Big Tech?

Forcing the question is OpenAI, the company behind ChatGPT, which has been pushing an “OpenAI for Countries” initiative. It is not the only one eyeing its share of the $2-billion, but it appears to be the most aggressive. OpenAI’s top lobbyist in the region has met with Ottawa officials, including Artificial Intelligence Minister Evan Solomon...

Categories: Software Security

Cloud-Native AI Model Management and Distribution for Inference Workloads

Harbor Blog - Wed, 03/11/2026 - 04:00
Authors: Wenbo Qi (Gaius), Dragonfly/ModelPack Maintainer; Chenyu Zhang (Chlins), Harbor/ModelPack Maintainer; Feynman Zhou, ORAS Maintainer, CNCF Ambassador. Reviewers: Sascha Grunert, CRI-O Maintainer; Wei Fu, containerd Maintainer. The weight of AI models: why infrastructure always arrives slowly. As AI adoption accelerates across industries, organizations face a critical bottleneck that is often overlooked until it becomes a serious obstacle: reliably managing and distributing large model weight files at scale. A model’s weights serve as the central artifact that bridges both training and inference pipelines — yet the infrastructure surrounding this artifact is frequently an afterthought.
Categories: CNCF Projects

Microsoft Patch Tuesday, March 2026 Edition

Krebs on Security - Tue, 03/10/2026 - 20:32

Microsoft Corp. today pushed security updates to fix at least 77 vulnerabilities in its Windows operating systems and other software. There are no pressing “zero-day” flaws this month (compared to February’s five zero-day threats), but as usual some patches may deserve more rapid attention from organizations using Windows. Here are a few highlights from this month’s Patch Tuesday.

Image: Shutterstock, @nwz.

Two of the bugs Microsoft patched today were publicly disclosed previously. CVE-2026-21262 is a weakness that allows an attacker to elevate their privileges on SQL Server 2016 and later editions.

“This isn’t just any elevation of privilege vulnerability, either; the advisory notes that an authorized attacker can elevate privileges to sysadmin over a network,” Rapid7’s Adam Barnett said. “The CVSS v3 base score of 8.8 is just below the threshold for critical severity, since low-level privileges are required. It would be a courageous defender who shrugged and deferred the patches for this one.”

The other publicly disclosed flaw is CVE-2026-26127, a vulnerability in applications running on .NET. Barnett said the immediate impact of exploitation is likely limited to denial of service by triggering a crash, with the potential for other types of attacks during a service reboot.

It would hardly be a proper Patch Tuesday without at least one critical Microsoft Office exploit, and this month doesn’t disappoint. CVE-2026-26113 and CVE-2026-26110 are both remote code execution flaws that can be triggered just by viewing a booby-trapped message in the Preview Pane.

Satnam Narang at Tenable notes that just over half (55%) of all Patch Tuesday CVEs this month are privilege escalation bugs, and of those, a half dozen were rated “exploitation more likely” — across Windows Graphics Component, Windows Accessibility Infrastructure, Windows Kernel, Windows SMB Server and Winlogon. These include:

CVE-2026-24291: Incorrect permission assignments within the Windows Accessibility Infrastructure to reach SYSTEM (CVSS 7.8)
CVE-2026-24294: Improper authentication in the core SMB component (CVSS 7.8)
CVE-2026-24289: High-severity memory corruption and race condition flaw (CVSS 7.8)
CVE-2026-25187: Winlogon process weakness discovered by Google Project Zero (CVSS 7.8).

Ben McCarthy, lead cyber security engineer at Immersive, called attention to CVE-2026-21536, a critical remote code execution bug in a component called the Microsoft Devices Pricing Program. Microsoft has already resolved the issue on their end, and fixing it requires no action on the part of Windows users. But McCarthy says it’s notable as one of the first vulnerabilities identified by an AI agent and officially recognized with a CVE attributed to the Windows operating system. It was discovered by XBOW, a fully autonomous AI penetration testing agent.

XBOW has consistently ranked at or near the top of the HackerOne bug bounty leaderboard for the past year. McCarthy said CVE-2026-21536 demonstrates how AI agents can identify critical 9.8-rated vulnerabilities without access to source code.

“Although Microsoft has already patched and mitigated the vulnerability, it highlights a shift toward AI-driven discovery of complex vulnerabilities at increasing speed,” McCarthy said. “This development suggests AI-assisted vulnerability research will play a growing role in the security landscape.”

Microsoft earlier provided patches to address nine browser vulnerabilities, which are not included in the Patch Tuesday count above. In addition, Microsoft issued a crucial out-of-band (emergency) update on March 2 for Windows Server 2022 to address a certificate renewal issue with passwordless authentication technology Windows Hello for Business.

Separately, Adobe shipped updates to fix 80 vulnerabilities — some of them critical in severity — in a variety of products, including Acrobat and Adobe Commerce. Mozilla Firefox v. 148.0.2 resolves three high severity CVEs.

For a complete breakdown of all the patches Microsoft released today, check out the SANS Internet Storm Center’s Patch Tuesday post. For Windows enterprise admins who wish to stay abreast of any news about problematic updates, AskWoody.com is always worth a visit. Please feel free to drop a comment below if you experience any issues applying this month’s patches.

Categories: Software Security
