Benchmarking the Ceph Object Gateway: A Deep Dive into Secure, Scalable Object Storage Performance. Part 1

Daniel Alexander Parkes, Anthony D'Atri

Introduction: Why Performance Matters for Secure Object Storage

Organizations today are challenged not only with managing massive volumes of data — often in the tens of petabytes — but also with the responsibility of securing that data across hybrid and multicloud environments. Object storage systems (especially Ceph) provide the scalability and flexibility required to meet these challenges, offering S3-compatible access, native redundancy, and a growing set of enterprise features.

As encryption in transit and at rest are configured and layered into Ceph Object Gateway (RGW) deployments, it becomes essential to understand their impact on latency, throughput, and resource utilization.

This blog series presents the results of a comprehensive performance benchmarking effort conducted by the IBM Storage Ceph Performance and Interoperability team. Special thanks to Jay Rajput for leading the execution of test cases and data collection. Our evaluation focuses on how real-world workloads interact with different configurations of encryption, data protection, and horizontal scaling, offering practical insights for architects, administrators, and developers alike.

Hardware and Software Setup

We tested on a production-grade Ceph cluster, deployed with collocated RGW, OSD, Monitor, Manager, and Ingress services.

Hardware Specifications

Component | Role(s) | Quantity | CPU | RAM | Storage
Dell R760 | Monitor, Manager, OSD, RGW, Ingress | 12 | 2× Intel Xeon Gold 6438N (64 threads) | 512 GB | 24 × 3.84 TB NVMe
Dell R660 | Benchmarking Clients, Monitoring | 13 | 2× Intel Xeon Gold 5418Y (48 threads) | 384 GB | 2 × 3.84 TB NVMe

Each test cluster configuration (4-node, 8-node, and 12-node) maintained consistent OSD density (24 per node) and 4 RGW daemons per node, with dedicated VIPs for Ingress-based load balancing.

Ceph Cluster Setup | Details
Cluster Size | 4 node, 8 node, 12 node
Total OSD Count | 96, 192, 288
OSDs | 24 per node
RGWs | 4 per node
Ingress | 1 VIP per node
RGW data pool | Replica 3; EC 2+2, 4+2, 8+3
PG Replica Count per OSD | ~400
Collocated Ceph Daemons | Monitor, Manager, OSD, RGW, Ingress
Object Count per bucket shard | Default: 100K
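For readers who want to approximate this layout, a cephadm service spec along the following lines would place four RGW daemons per labeled host and front them with an ingress service. This is a minimal sketch: the service IDs, host label, ports, and VIP are illustrative assumptions, not our exact lab spec.

```python
import subprocess
import tempfile

# Hypothetical cephadm service specs approximating the lab layout:
# 4 RGW daemons per node, fronted by an ingress service.
# Service IDs, the "rgw" host label, ports, and the VIP are placeholders.
RGW_AND_INGRESS_SPECS = """\
service_type: rgw
service_id: benchmark
placement:
  label: rgw          # assumed host label on the storage nodes
  count_per_host: 4   # 4 RGW daemons per node, as in the table above
---
service_type: ingress
service_id: rgw.benchmark
placement:
  label: rgw
spec:
  backend_service: rgw.benchmark
  virtual_ip: 192.0.2.10/24   # placeholder VIP (use virtual_ips_list for one VIP per node)
  frontend_port: 8080
  monitor_port: 1967
"""

def apply_specs() -> None:
    """Write the specs to a file and apply them with `ceph orch apply -i`."""
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(RGW_AND_INGRESS_SPECS)
        path = f.name
    subprocess.run(["ceph", "orch", "apply", "-i", path], check=True)

if __name__ == "__main__":
    apply_specs()
```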

Software version matrix

Component | Version / Notes
Ceph | 19.2.0-52
Elbencho | 3.0-26 (benchmarking tool)
HashiCorp Vault | 1.19.1 (for SSE key management)
Prometheus + Grafana | Monitoring stack
RHEL | 9.5, with BIOS profile set to “performance”

Ceph Cluster Configuration

Ceph Cluster Config | Value
Scrubs/Deep-scrubs | Disabled
Ceph Balancer | Disabled
Progress Module | Disabled
PG Autoscaler | Disabled
OSDMAP_FLAGS | Muted
Dynamic Bucket Re-sharding | Disabled
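For reference, the sketch below shows one way to apply this quiescing from a cephadm shell via the standard CLI. The commands themselves are the usual ones; the client.rgw scope for the resharding option is a common convention rather than a detail from our lab notes.

```python
import subprocess

# Commands that quiesce Ceph background activity for benchmarking runs.
QUIESCE_COMMANDS = [
    ["ceph", "osd", "set", "noscrub"],                 # stop scrubs
    ["ceph", "osd", "set", "nodeep-scrub"],            # stop deep-scrubs
    ["ceph", "balancer", "off"],                       # disable the balancer
    ["ceph", "mgr", "module", "disable", "progress"],  # disable the progress module
    ["ceph", "health", "mute", "OSDMAP_FLAGS"],        # mute the resulting health warning
    # Disable PG autoscaling for new pools (existing pools additionally need
    # `ceph osd pool set <pool> pg_autoscale_mode off`).
    ["ceph", "config", "set", "global",
     "osd_pool_default_pg_autoscale_mode", "off"],
    # Disable dynamic bucket resharding in RGW (scope assumed).
    ["ceph", "config", "set", "client.rgw",
     "rgw_dynamic_resharding", "false"],
]

for cmd in QUIESCE_COMMANDS:
    subprocess.run(cmd, check=True)
```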

PG counts

Cluster Size | OSD Count | Target PG replicas per OSD | Pool Type | PG Count (Index / Data pool)
4 Node | 96 | 400 | EC 2+2 / Replicated | 512 / 8192
8 Node | 192 | 400 | EC 2+2 | 1024 / 32768
8 Node | 192 | 400 | EC 4+2 | 1024 / 16384
8 Node | 192 | 400 | EC 8+3 | 1024 / 8192
12 Node | 288 | 400 | EC 2+2 | 1024 / 32768
12 Node | 288 | 400 | EC 4+2 | 1024 / 16384
12 Node | 288 | 400 | EC 8+3 | 1024 / 8192
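To make one row concrete, here is a hedged sketch of creating the 8-node EC 4+2 data and index pools with explicit PG counts (possible because the autoscaler is disabled). The profile name and the default-zone pool names are assumptions, not our lab's exact naming.

```python
import subprocess

# Illustrative commands for the 8-node EC 4+2 row above. The profile name
# ("ec42") and default-zone pool names are assumptions.
commands = [
    ["ceph", "osd", "erasure-code-profile", "set", "ec42",
     "k=4", "m=2", "crush-failure-domain=host"],
    # Data pool: 16384 PGs, erasure-coded with the profile above.
    ["ceph", "osd", "pool", "create", "default.rgw.buckets.data",
     "16384", "16384", "erasure", "ec42"],
    # Index pool: 1024 PGs, replicated (size 3 by default).
    ["ceph", "osd", "pool", "create", "default.rgw.buckets.index",
     "1024", "1024", "replicated"],
    # Tag both pools for RGW use.
    ["ceph", "osd", "pool", "application", "enable",
     "default.rgw.buckets.data", "rgw"],
    ["ceph", "osd", "pool", "application", "enable",
     "default.rgw.buckets.index", "rgw"],
]
for cmd in commands:
    subprocess.run(cmd, check=True)
```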

Network Architecture & Hardware Connectivity

To complement the compute/storage setup, our network underpins the cluster's high-throughput performance:

  • Leaf–Spine topology: We’re running a 100 GE leaf–spine network with one spine (QFX5120) and three leaf switches (QFX5120), enabling a scalable, low-latency design. This offers port density now and a future upgrade path (e.g., adding a true spine and repurposing the current one) without impacting performance.

  • Dual 100 Gbps uplinks per server via LACP: Each Ceph node utilizes two 100 GE ports on a single NIC, bonded using LACP, to connect to both leaf switches for redundancy and link aggregation.

  • Per-node limit: Each Ceph storage node is equipped with Intel NICs that support a maximum aggregate throughput of 100 Gbps, even though two ports are available and bonded via LACP. This means that per-node throughput is capped at ~12.5 GB/s in optimal conditions.

  • Cluster-wide switching capacity: Our leaf–spine topology, built with one QFX5120 spine and three QFX5120 leaf switches, provides full line-rate connectivity across all twelve storage nodes. Each leaf connects to four nodes and uplinks to the spine at 100 Gbps. This yields a theoretical cluster-wide switching capacity of ~150 GB/s (see the arithmetic sketch after this list). In our large-object benchmarks, the system achieved an aggregate throughput of ~111 GB/s, demonstrating that we were approaching the physical network ceiling, particularly for read-intensive large-object workloads.
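The back-of-the-envelope arithmetic behind these ceilings, as promised above:

```python
# Back-of-the-envelope network ceilings for the 12-node cluster.
GBPS_PER_NODE = 100               # effective per-node NIC limit (Gbps)
NODES = 12

per_node_gbs = GBPS_PER_NODE / 8  # 100 Gbps -> 12.5 GB/s
cluster_gbs = per_node_gbs * NODES  # 12 x 12.5 GB/s -> 150 GB/s ceiling

print(f"Per-node ceiling: {per_node_gbs:.1f} GB/s")
print(f"Cluster ceiling:  {cluster_gbs:.0f} GB/s")
print(f"Observed peak:    ~111 GB/s -> {111 / cluster_gbs:.0%} of ceiling")
```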

Test Methodology

We designed our performance evaluation to answer foundational questions regarding how to deploy the Ceph Object Gateway (RGW) for both performance and security:

  • What’s the impact of TLS (SSL) on RGW throughput and latency?

  • How much overhead does server-side encryption (SSE-S3/KMS) introduce?

  • Does securing internal daemon communication (msgr v2) affect CPU utilization? (The sketch after this list shows the relevant knobs.)

  • How do EC profiles (2+2, 4+2, 8+3) compare to 3x replication?

  • What are the performance implications of using HAProxy-based ingress vs direct access?

  • How does performance scale with node count and concurrency?
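The detailed answers follow in later posts, but it helps to see what “securing internal daemon communication” means operationally. A minimal sketch of switching msgr v2 from its default CRC mode to full encryption; the cluster-wide (global) scope is our assumption rather than a stated lab detail:

```python
import subprocess

# Switch Ceph's internal msgr v2 transport from CRC-only to encrypted ("secure")
# mode for cluster, service, and client traffic. Scope assumed to be global.
for opt in ("ms_cluster_mode", "ms_service_mode", "ms_client_mode"):
    subprocess.run(["ceph", "config", "set", "global", opt, "secure"],
                   check=True)
```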

Each test case was repeated across PUT and GET workloads with object sizes ranging from 64 KiB to 1 GiB. Elbencho was used in client-server mode with a thread count of 128 (except for the SSE tests, which used 64 threads), running up to eight concurrent clients. Each Elbencho client used an individual bucket. Buckets were created in advance with the default count of eleven shards per bucket. Multipart upload was used for objects larger than 1 GiB.
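Elbencho drove the actual tests; purely to illustrate the shape of one PUT/GET pass, a minimal boto3 sketch might look like the following. The endpoint, credentials, and bucket name are placeholders, and this is not the benchmarking code we ran.

```python
import io
import os

import boto3
from boto3.s3.transfer import TransferConfig

# Placeholder endpoint and credentials; each benchmark client used its own bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com:8080",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

BUCKET = "bench-client-0"   # pre-created, default 11 index shards
OBJECT_SIZE = 64 * 1024     # one of the tested sizes (64 KiB ... 1 GiB)

# Multipart upload kicks in above 1 GiB, mirroring the methodology above.
xfer = TransferConfig(multipart_threshold=1024 ** 3)

payload = os.urandom(OBJECT_SIZE)
for i in range(1000):
    key = f"obj-{i}"
    s3.upload_fileobj(io.BytesIO(payload), BUCKET, key, Config=xfer)  # PUT
    s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()              # GET
```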

Payload Size | Workload Category | Representative of
≤ 64 KB | Small object | Thumbnails, telemetry, and small metadata files
1 MB | Medium object | Documents, emails, attachments
≥ 32 MB | Large object | Backups, HD media, ML datasets

Executive Summary

IBM Storage Ceph demonstrates exceptional performance and flexibility when deployed on cutting-edge, all-flash infrastructure with 100 GE networking, such as the IBM Storage Ceph Ready Nodes. As enterprises scale to billions of objects and multi-petabyte workloads, Ceph's ability to handle diverse data patterns, from high IOPS, low-latency, metadata-heavy workloads to high-throughput, bandwidth-intensive workloads, becomes critical.

Large Object Workloads (Throughput Focus)

For objects exceeding 32 MiB, the cluster achieved near-linear scaling up to twelve storage nodes, peaking at an aggregate PUT throughput of 65 GiB/s and an aggregate GET throughput of ~115 GiB/s. Beyond that point, 100 GE NIC saturation on individual nodes became the primary bottleneck. This suggests that future benchmark testing will benefit from higher-bandwidth NICs, as large-object workloads still have headroom in the current nodes' other resources. Note the distinction between the sum of a NIC's nominal port speeds and the bandwidth it can actually handle when more than one port is in use; on multiport NICs, the latter may be smaller than the former.

For large objects, fully in-transit secured configurations (TLS + msgr v2) maintained high throughput with reasonable overhead, demonstrating that Ceph Object Gateway (RGW) is well-suited for secure data pipelines at scale. There is room for performance improvement when also enabling server-side encryption (SSE) to provide object encryption at rest.
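For context, enabling SSE is a client-side header change against RGW's S3 API. The sketch below assumes an RGW endpoint already wired to Vault for SSE-KMS and uses placeholder endpoint, bucket, and key names throughout.

```python
import os

import boto3

# Placeholder endpoint and credentials; adjust for your RGW deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com:8080",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# SSE-S3: RGW manages the encryption keys itself.
s3.put_object(
    Bucket="bench-client-0",
    Key="sse-s3-object",
    Body=b"payload",
    ServerSideEncryption="AES256",
)

# SSE-KMS: keys come from the configured KMS (HashiCorp Vault in our lab);
# the key name below is a placeholder.
s3.put_object(
    Bucket="bench-client-0",
    Key="sse-kms-object",
    Body=b"payload",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="bench-key",
)
```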

Small Object Workloads (IOPS & Latency Focus)

Small object tests (64 KiB) demonstrated Ceph Object Gateway (RGW)'s ability to scale IOPS efficiently with increasing concurrency and cluster size. The system achieved up to 391K GET IOPS and 86K PUT IOPS on a twelve-node cluster using erasure coding.

To unlock optimal performance for small-object workloads, especially under high concurrency, it's essential to deploy on infrastructure with robust CPU capacity and generous RGW threading, enabling the Ceph Object Gateway to leverage its parallel processing capabilities fully.
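As a rough pointer, "generous RGW threading" usually maps to a couple of well-known options; the values below are illustrative starting points under that assumption, not tuned recommendations from our tests.

```python
import subprocess

# Illustrative RGW concurrency knobs (example values, not recommendations).
subprocess.run(["ceph", "config", "set", "client.rgw",
                "rgw_thread_pool_size", "512"], check=True)
subprocess.run(["ceph", "config", "set", "client.rgw",
                "rgw_max_concurrent_requests", "2048"], check=True)
```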

What’s Next

This post introduces the testbed and methodology and highlights key results. In the upcoming posts of this series, we'll delve into each performance axis: the impact of TLS and SSE on RGW throughput, scaling behavior with erasure coding versus replication, how concurrency and daemon density affect latency, and more. You'll see detailed graphs and architectural guidance drawn directly from production-grade testing. Whether you're building secure object storage for AI pipelines, backups, or multi-tenant cloud services, stay tuned; there's much more to uncover.

Read Part 2 here.

The authors would like to thank IBM for supporting the community with our time to create these posts.