Rook Blog


Running Rook at Petabyte Scale Across Multiple Regions

Wed, 03/18/2026 - 10:50

This post describes how SAP’s cloud infrastructure team uses Rook to manage a multi-region Ceph fleet — from bare metal provisioning to rolling upgrades — as part of building a digitally sovereign storage backbone for Europe.

The 120 Petabyte Challenge

When you are responsible for a target of 120 petabytes of storage across 30 regions, manual operations don't scale.

For years, SAP Cloud Infrastructure relied on a mix of proprietary appliances and legacy OpenStack Swift. But as we architected our next-generation cloud stack (internally part of the Apeiro project), we faced a non-negotiable constraint: Digital Sovereignty. Our stack had to be completely free of hyperscaler lock-in, running on our own hardware in our own data centers.

This created a concrete engineering challenge: build a storage layer that is API-first, fully open-source, and capable of self-management at a global scale. We chose Ceph for the storage engine — and Rook for the automation layer that makes it manageable.

Why Rook

Managing Ceph at this scale without an operator would mean building and maintaining custom tooling for OSD lifecycle, daemon placement, upgrade orchestration, and failure recovery across every region. Rook gives us all of this as a declarative Kubernetes-native interface, which means our existing GitOps and CI/CD workflows extend naturally to storage. Instead of writing region-specific runbooks, we write Helm values.

Architecture: The Separation of Metal and Software

Our platform, CobaltCore, is built on top of Gardener and Metal-API, both part of the ApeiroRA reference architecture. In this stack, storage isn't a static resource — it's a programmable Kubernetes object. We run storage on dedicated nodes, separate from application workloads. At our density (16 NVMe drives per node), co-locating workloads would create unacceptable I/O interference, so storage nodes do one thing: serve data.
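As a sketch of how this node dedication can be expressed declaratively, the CephCluster CRD supports placement rules. The `node-role/storage` label and taint key below are hypothetical, not our actual configuration:

```yaml
# Sketch: pin all Ceph daemons to dedicated storage nodes.
# The "node-role/storage" label/taint key is hypothetical.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-role/storage
                  operator: Exists
      # Tolerate the taint that keeps application workloads off these nodes
      tolerations:
        - key: node-role/storage
          operator: Exists
```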

The Metal Layer

Metal-API and Gardener manage the physical lifecycle of bare-metal servers: inventory, provisioning, firmware, and OS deployment. This allows Rook to focus purely on the software layer without worrying about the underlying physical state.

The Declarative Storage Layer (Rook)

Once nodes are handed over, Rook takes control. We use a strict GitOps workflow to ensure consistency across the fleet:

  • Base Blueprint: A central Helm chart defines global best practices and standard Ceph configurations.
  • Region Overlay: Region-specific resources (CephBlockPools, RGW placement rules) are injected via localized values.yaml files.
  • Automation: Rook handles the rest: bootstrapping daemons, configuring CRUSH failure domains, and provisioning RGW endpoints.
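To make the layering concrete, a region overlay for the rook-ceph-cluster Helm chart might look like the following sketch. Resource names, the failure domain, and gateway counts are illustrative, not SAP's actual settings:

```yaml
# Sketch: a region-specific values.yaml overlay layered on the base
# blueprint chart. All names and numbers here are illustrative.
cephBlockPools:
  - name: region-a-pool
    spec:
      failureDomain: rack
      replicated:
        size: 3
cephObjectStores:
  - name: region-a-objectstore
    spec:
      gateway:
        instances: 4
        port: 8080
```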

Standard Storage Node Spec:

  • Server: Dell PowerEdge R7615
  • CPU: AMD EPYC 9554P (64 cores)
  • RAM: 384 GB
  • Storage: 16x 14 TB NVMe
  • Network: 100 GbE (redundant)

Validation: Establishing the Performance Envelope

Before committing to production capacity planning, we needed to establish the performance envelope of our RGW tier. We ran a breakpoint test on a typical Ceph Squid cluster (28 nodes, 362 OSDs) to find the stable operating range, saturation threshold, and hard ceiling.

Test Setup

  • Workload: 2M objects (4 KB each), 20 k6 clients, single “premium” NVMe bucket.
  • Method: Ramping load over 30 minutes until p90 latency exceeded 500 ms.

Results

Request rate ramp: successful requests (pink) peak at 171K ops/sec before the test exits. Failed requests (blue) spike briefly near saturation.
  • Saturation Point: The cluster entered saturation around 90K GET/s — latency percentiles began diverging and request queues started building.
  • Breaking Point: Peak of 171K GET/s (measured on RGW) before the runners hit the latency exit condition.

Note: isolated 503s appeared as early as ~33K GET/s on a single RGW instance, likely caused by uneven load distribution rather than cluster-wide saturation.

Reading the charts

Client-side latency (k6): flat near zero through moderate load, stepping up as the cluster approaches saturation, and reaching 1.5s+ at the breaking point.

The client-side latency chart tells the story most directly. Average request duration stays flat near zero well into the ramp — then steps up sharply as the cluster enters saturation and eventually hits its ceiling.

RGW-side GET latency: all percentiles stay flat and sub-ms through moderate load. Around 90K GET/s, p99 begins climbing while median remains low — a classic saturation signal.

Comparing the two charts reveals where the system saturates. At peak load, RGW reports p99 latency of ~210ms — but clients observe 1.5 seconds. The gap is connection queueing: requests waiting to be picked up by RGW Beast frontend threads. RGW’s internal metrics only measure processing time after a request is accepted, not time spent in the queue.

The RGW latency chart also shows that RADOS operation latency climbs under load, which means RGW threads stay occupied longer, contributing to the queue buildup. At the breaking point, request queues filled and RGWs began returning 503s across all instances.

This is a read-focused baseline — our primary workload is read-heavy. The saturation point of 90K GET/s gives us a conservative operating ceiling for per-region capacity planning.

Operational Reality: Making Day 2 Uneventful

The true test of any storage system is what happens when things break or need upgrading. At our scale, the goal is to make operations boring.

Zero-Downtime Upgrades

Rook has reduced storage maintenance from a coordinated event to a background task. Since the first cluster went live in May 2024, we have maintained a continuous upgrade cadence with zero customer-facing downtime and zero data loss:

  • GardenLinux: Monthly rolling updates across all regions.
  • Kubernetes: Quarterly version upgrades.
  • Rook: Quarterly upgrades (v1.14 through v1.18), with additional upgrades when a needed feature ships in a new release.
  • Ceph: Major version migration from Reef v18 to Squid v19. A rolling upgrade of the largest cluster (~816 OSDs) completes in approximately 2 days.
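The Ceph migration itself is declarative: bumping the image in the CephCluster spec triggers Rook's rolling upgrade of all daemons. A sketch, with illustrative image tags:

```yaml
# Sketch: upgrading Ceph is a one-line change to the CephCluster spec.
# Rook detects the new image and rolls the daemons in a safe order.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # previously: quay.io/ceph/ceph:v18.2.4 (Reef)
    image: quay.io/ceph/ceph:v19.2.1  # Squid (tag illustrative)
```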

Drive Failures

With ~2,800 OSDs in the fleet, drive failures are a routine event. When a drive fails, Ceph (RADOS) automatically handles data recovery and rebalancing across the remaining OSDs — no operator action is needed to protect data. On the Kubernetes side, Rook detects the failed OSD pod and manages its lifecycle. The full drive replacement cycle (removing the failed OSD, clearing the device, provisioning a new OSD on the replacement drive) still involves operational steps on our side, but Ceph’s self-healing ensures data durability is never at risk while the replacement is carried out.

Current Status and What’s Next

As of early 2026, the fleet spans 10 live regions (with an 11th newly provisioned):

  • Storage Nodes: 251
  • Total OSDs: ~2,800
  • Raw Capacity: ~37 PiB

Region sizes range from 13-node / 96-OSD deployments to 59-node / 816-OSD clusters — the same Rook-based GitOps workflow handles both.

The next phase is bringing high-performance Block Storage (RBD) into this declarative model to fully retire our remaining proprietary SANs.

  • Target: 30 Regions.
  • Target Capacity: 120 PB.

We are active contributors to the Rook project and continue collaborating with the maintainers as we scale toward these targets. The Rook Slack community has been a valuable resource throughout this journey.

This work is part of ApeiroRA — an open initiative developing a reference blueprint for sovereign cloud-edge infrastructure. All components use enterprise-friendly open-source licenses under neutral community governance. ApeiroRA welcomes participants — whether you want to adopt the blueprints, contribute components, or shape the architecture. Get started at the documentation portal.

Authors: SAP Engineering Team, CLYSO Engineering Team.

Running Rook at Petabyte Scale Across Multiple Regions was originally published in Rook Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Categories: CNCF Projects

Rook v1.19 Storage Enhancements

Tue, 01/20/2026 - 14:51

The Rook v1.19 release is out! v1.19 is another feature-filled release to improve storage for Kubernetes. Thanks again to the community for all the great support in this journey to deploy storage in production.

The statistics continue to show Rook is widely used in the community, with over 13.3K GitHub stars, and Slack members and X followers constantly increasing.

If your organization deploys Rook in production, we would love to hear about it. Please see the Adopters page to add your submission. As an upstream project, we don’t track our users, but we appreciate the transparency of those who are deploying Rook!

We have a lot of new features for the Ceph storage provider that we hope you’ll be excited about with the v1.19 release!

NVMe-oF Gateway

NVMe over Fabrics allows RBD volumes to be exposed and accessed via the NVMe/TCP protocol. This enables both Kubernetes pods within the cluster and external clients outside the cluster to connect to Ceph block storage using standard NVMe-oF initiators, providing high-performance block storage access over the network.

NVMe-oF is supported by Ceph starting with the recent Ceph Tentacle release. The initial integration with Rook is now complete and ready for testing in experimental mode, meaning it is not yet production-ready. As a large new feature, it will take some time before we declare it stable. Please test out the feature and let us know your feedback!

See the NVMe-oF Configuration Guide to get started.

Ceph CSI 3.16

The v3.16 release of Ceph CSI has a range of features and improvements for the RBD, CephFS, and NFS drivers. As with v1.18, this release is supported both by the Ceph CSI operator and by Rook's direct mode of configuration. The Ceph CSI operator is still configured automatically by Rook. We will target v1.20 to fully document the Ceph CSI operator configuration.

In this release, new Ceph CSI features include:

  • NVMe-oF CSI driver for provisioning and mounting volumes over the NVMe over Fabrics protocol
  • Improved fencing for RBD and CephFS volumes during node failure
  • Block volume usage statistics
  • Configurable block encryption cipher

Concurrent Cluster Reconciles

Prior to this release, when multiple Ceph clusters were configured in the same Kubernetes cluster, Rook reconciled them serially. If one cluster was having health issues, it would block all subsequent clusters from being reconciled.

To improve the reconcile of multiple clusters, Rook now enables clusters to be reconciled concurrently. Concurrency is enabled by increasing the operator setting ROOK_RECONCILE_CONCURRENT_CLUSTERS (in operator.yaml, or the Helm setting reconcileConcurrentClusters) to a value greater than 1. If resource requests and limits are set on the operator, they may need to be increased to accommodate the concurrent reconciles.
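For example, a minimal sketch of the setting (the value 3 is illustrative):

```yaml
# Sketch: in operator.yaml, on the rook-ceph-operator container's env
# (a value greater than 1 enables concurrent reconciles):
env:
  - name: ROOK_RECONCILE_CONCURRENT_CLUSTERS
    value: "3"
# Or equivalently, in the rook-ceph Helm chart values:
# reconcileConcurrentClusters: 3
```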

While this is a relatively small change, concurrency is difficult to test exhaustively, so we have conservatively marked this feature experimental. Please let us know if the concurrency works smoothly for you, or report any issues!

When clusters are reconciled concurrently, the Rook operator log will contain intermingled entries from all the clusters in progress. To improve troubleshooting, we have updated many of the log entries to include the namespace and/or cluster name.

Breaking changes

There are a few minor changes to be aware of during upgrades.

CephFS

  • The behavior of the activeStandby property in the CephFilesystem CRD has changed. When set to false, the standby MDS daemon deployment will be scaled down and removed, rather than only disabling the standby cache while the daemon remains running.
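For reference, a minimal sketch of a filesystem where the changed behavior applies (the filesystem name and counts are illustrative):

```yaml
# Sketch: with v1.19, activeStandby: false scales down and removes the
# standby MDS deployment instead of leaving the daemon running with
# only its standby cache disabled.
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataServer:
    activeCount: 1
    activeStandby: false
```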

Helm

  • The rook-ceph-cluster chart has changed where the Ceph image is defined, allowing separate settings for the repository and tag. See the example values.yaml for the new repository and tag settings. If you were previously specifying the Ceph image in the cephClusterSpec, remove it when upgrading and specify the new properties instead.
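As a sketch of the change, the old and new layouts might look like the following. The new key names below are assumptions for illustration; take the exact names from the chart's example values.yaml:

```yaml
# Old (rook-ceph-cluster chart values): image set inside cephClusterSpec
# cephClusterSpec:
#   cephVersion:
#     image: quay.io/ceph/ceph:v19.2.1
#
# New: separate repository and tag settings (key names assumed;
# verify against the chart's example values.yaml before use):
cephImage:
  repository: quay.io/ceph/ceph
  tag: v19.2.1
```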

External Clusters

  • In external mode, if you specify a Ceph admin keyring (not the default recommendation), Rook will no longer create CSI Ceph clients automatically. The CSI client keyrings will only be created by the external Python script. This removes the duplication where both the Python script and the operator created the same users.

Versions

Supported Ceph Versions

Rook v1.19 has removed support for Ceph Reef v18, since it has reached end of life. If you are still running Reef, upgrade to at least Ceph Squid v19 before upgrading to Rook v1.19.

Ceph Squid and Ceph Tentacle are the supported versions with Rook v1.19.

Kubernetes v1.30 — v1.35

Kubernetes v1.30 is now the minimum version supported by Rook, through the latest K8s release v1.35. Rook CI runs tests against these versions to ensure there are no issues as Kubernetes is updated. If you still require running an older K8s version, we haven't done anything to prevent Rook from running; we simply do not have test validation on older versions.

What’s Next?

As we continue the journey to develop reliable storage operators for Kubernetes, we look forward to your ongoing feedback. Only with the community is it possible to continue this fantastic momentum.

There are many different ways to get involved in the Rook project, whether as a user or developer. Please join us in helping the project continue to grow on its way beyond the v1.19 milestone!

Rook v1.19 Storage Enhancements was originally published in Rook Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Categories: CNCF Projects