<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Ceph Blog</title>
  <link href="https://ceph.io/en/news/blog/feed.xml" rel="self" />
  <link href="https://ceph.io/en/news/blog/" />
  <updated>2026-04-06T00:00:00Z</updated>
  <id>https://ceph.io/en/news/blog/</id>
  <entry>
    <title>v20.2.1 Tentacle released</title>
    <link href="https://ceph.io/en/news/blog/2026/v20-2-1-tentacle-released/" />
    <updated>2026-04-06T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/v20-2-1-tentacle-released/</id>
    <author>
      <name>Yuri Weinstein</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="release" />
      <category term="reef" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/v20-2-1-tentacle-released/">&lt;p&gt;This is the first minor release in the Tentacle series.
We recommend that all users update to this release.&lt;/p&gt;
&lt;h2 id=&quot;release-date&quot;&gt;Release Date &lt;a class=&quot;link-anchor&quot; href=&quot;#release-date&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;April 06, 2026&lt;/p&gt;
&lt;h2 id=&quot;notable-changes&quot;&gt;Notable Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#notable-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h2 id=&quot;osd-%2F-bluestore&quot;&gt;OSD / BlueStore &lt;a class=&quot;link-anchor&quot; href=&quot;#osd-%2F-bluestore&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;EC Recovery: Fixed a length calculation bug in &lt;code&gt;erase_after_ro_offset()&lt;/code&gt; that caused empty shards to retain data, leading to &lt;code&gt;shard_size &amp;gt;= tobj_size&lt;/code&gt; assertion failures when recovering small objects in EC pools.&lt;/li&gt;
&lt;li&gt;BlueFS Volume Selector: Updated the BlueFS volume selector to properly account for file size changes when recovering the WAL in envelope mode.&lt;/li&gt;
&lt;li&gt;BlueFS: Fixed a bug where stat() missed the actual file size update after indexing WAL envelope files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;monitor-(mon)&quot;&gt;Monitor (mon) &lt;a class=&quot;link-anchor&quot; href=&quot;#monitor-(mon)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Fast EC Restrictions: Denied the ability to enable EC optimizations (&amp;quot;fast EC&amp;quot;) for non-4K-aligned chunk sizes. Unaligned chunk sizes handled by fast EC perform poorly and suffer from bugs, so attempts to force this configuration are now rejected.&lt;/li&gt;
&lt;li&gt;Peering: Ensured ceph pg repeer proposes a correctly sized pg temp, as optimized EC cannot cope with mismatched sizes.&lt;/li&gt;
&lt;li&gt;NVMeoF Gateway: Added a new &lt;code&gt;nvme-gw listeners&lt;/code&gt; command to display all existing listeners (including auto-listeners) inside a pool/group.&lt;/li&gt;
&lt;li&gt;NVMeoF Failover: Overhauled the NVMeoF Gateway fast-failover logic. Beacon timeouts are now evaluated within prepare_beacon to support shorter intervals, and the mechanism for detecting monitor slowness was improved.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;librbd-%26-rbd-mirror&quot;&gt;librbd &amp;amp; rbd-mirror &lt;a class=&quot;link-anchor&quot; href=&quot;#librbd-%26-rbd-mirror&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;RBD: Introduced a new &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT&lt;/code&gt; policy for &lt;code&gt;rbd_lock_acquire()&lt;/code&gt;. This is a low-level interface intended to let a peer grab the exclusive lock manually for short periods of time, with other peers pausing their activity and waiting for the lock to be released rather than instantly aborting I/O and returning an error. It is possible to switch between the &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE&lt;/code&gt; and &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT&lt;/code&gt; policies even if the lock is already held (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
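&lt;p&gt;A minimal usage sketch of the transient policy follows. It is illustrative only: the notes above name the C-level &lt;code&gt;rbd_lock_acquire()&lt;/code&gt; entry point, so exposing the policy as an &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT&lt;/code&gt; constant in the &lt;code&gt;rbd&lt;/code&gt; Python bindings is an assumption, and the pool and image names are placeholders.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hedged sketch: take the exclusive lock transiently, do a short burst of
# work, then release it so waiting peers can resume I/O. The pool name, image
# name and the Python-level constant are assumptions, not release facts.
import rados
import rbd

with rados.Rados(conffile=&#39;/etc/ceph/ceph.conf&#39;) as cluster:
    with cluster.open_ioctx(&#39;rbd&#39;) as ioctx:           # placeholder pool
        with rbd.Image(ioctx, &#39;test-image&#39;) as image:  # placeholder image
            # Assumed Python-level binding of the new policy; other peers are
            # expected to pause their I/O and wait rather than erroring out.
            image.lock_acquire(rbd.RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT)
            try:
                pass  # short-lived exclusive work goes here
            finally:
                image.lock_release()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the mode is meant for short-lived exclusive access, the sketch releases the lock as soon as the work completes so that waiting peers can continue.&lt;/p&gt;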
&lt;h2 id=&quot;ceph-object-gateway-(rgw)&quot;&gt;Ceph Object Gateway (RGW) &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-object-gateway-(rgw)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Conditional Requests: Fixed conditional request validation in the Put, Delete, MultiDelete, and MultiWrite workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;mgr%2Fdashboard&quot;&gt;mgr/dashboard &lt;a class=&quot;link-anchor&quot; href=&quot;#mgr%2Fdashboard&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;UI Navigation: Redesigned the main landing page; the &amp;quot;Dashboard&amp;quot; navigation item was renamed to &amp;quot;Overview&amp;quot;, and the page now uses the Carbon Design System&#39;s productive card layout.&lt;/li&gt;
&lt;li&gt;NVMeoF Management: Added the &lt;code&gt;nvmeof get_subsystems&lt;/code&gt; CLI command, fixed JSON output indentation for NVMeoF CLI commands, and reverted the &lt;code&gt;server_addr&lt;/code&gt; API parameter back to &lt;code&gt;traddr&lt;/code&gt; for consistency.&lt;/li&gt;
&lt;li&gt;Hosts View: Fixed a bug causing the IP addresses of hosts to be hidden on the Hosts page due to an issue with fact merging.&lt;/li&gt;
&lt;li&gt;Forms &amp;amp; Modals: Standardized forms onto the Carbon Design System, including the pools form, service form, multi-site realm token export modal, delete zone modal, and password change forms.&lt;/li&gt;
&lt;li&gt;Form Validation: Generalized form error handling and validations using a new &lt;code&gt;cdValidate&lt;/code&gt; directive.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;mgr%2Fcephadm&quot;&gt;mgr/cephadm &lt;a class=&quot;link-anchor&quot; href=&quot;#mgr%2Fcephadm&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Monitoring Stack: Bumped the default container image versions for the monitoring stack: Prometheus to v3.6.0, Node-exporter to v1.9.1, Alertmanager to v0.28.1, and Grafana to v12.2.0.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;security-changes&quot;&gt;Security Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#security-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Monitoring Stack Images: Updated Prometheus, Alertmanager, and Grafana container image versions, picking up upstream security and stability fixes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;configuration-changes&quot;&gt;Configuration Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#configuration-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;bluefs_check_volume_selector_on_mount&lt;/code&gt;: The previous &lt;code&gt;bluefs_check_volume_selector_on_umount&lt;/code&gt; debug setting was renamed and repurposed. It now checks for volume selector inconsistencies on both mount and unmount phases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;mon_nvmeofgw_beacon_grace&lt;/code&gt;: The default grace period before marking a gateway as failed has been reduced from 10 seconds to 7 seconds for faster failover.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;nvmeof_mon_client_tick_period&lt;/code&gt;: The default beacon tick interval has been lowered from 2 seconds to 1 second (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
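&lt;p&gt;Operators who prefer the previous timings can override these defaults. Below is a minimal sketch, using the librados Python bindings&#39; &lt;code&gt;mon_command()&lt;/code&gt;, that restores the pre-20.2.1 values; the &lt;code&gt;ceph.conf&lt;/code&gt; path and the &lt;code&gt;global&lt;/code&gt; config section are assumptions about a typical deployment, and &lt;code&gt;ceph config set&lt;/code&gt; from the CLI achieves the same result.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hedged sketch: put the NVMeoF failover timings back to their pre-20.2.1
# defaults via the monitor &#39;config set&#39; command. The conffile path and the
# &#39;global&#39; section are assumptions about a typical deployment.
import json
import rados

with rados.Rados(conffile=&#39;/etc/ceph/ceph.conf&#39;) as cluster:
    for name, value in ((&#39;mon_nvmeofgw_beacon_grace&#39;, &#39;10&#39;),
                        (&#39;nvmeof_mon_client_tick_period&#39;, &#39;2&#39;)):
        cmd = json.dumps({&#39;prefix&#39;: &#39;config set&#39;, &#39;who&#39;: &#39;global&#39;,
                          &#39;name&#39;: name, &#39;value&#39;: value})
        ret, outbuf, outs = cluster.mon_command(cmd, b&#39;&#39;)
        print(name, ret, outs)
&lt;/code&gt;&lt;/pre&gt;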
&lt;h2 id=&quot;changelog&quot;&gt;Changelog &lt;a class=&quot;link-anchor&quot; href=&quot;#changelog&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;[rgw][tentacle] backport of cloud-restore related PRs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65830&quot;&gt;pr#65830&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add normalization and casesensitive options to the subvolume group creation command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65564&quot;&gt;pr#65564&lt;/a&gt;, Venky Shankar, Xavi Hernandez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;auth: msgr2 can return incorrect allowed_modes through AuthBadMethodFrame (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65336&quot;&gt;pr#65336&lt;/a&gt;, Miki Patel)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;backports variants improvements and Dockerfile&lt;span&gt;&lt;/span&gt;.build changes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66010&quot;&gt;pr#66010&lt;/a&gt;, John Mulligan, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Beacon diff (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66958&quot;&gt;pr#66958&lt;/a&gt;, Leonid Chernin, Samuel Just)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kernel: bring &amp;quot;bdev_async_discard&amp;quot; config parameter back (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65609&quot;&gt;pr#65609&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kernel: improve DiscardThread life cycle (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65213&quot;&gt;pr#65213&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;bluestore/BlueFS: fix bytes_written_slow counter with aio_write (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66355&quot;&gt;pr#66355&lt;/a&gt;, chungfengz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: add argument groups to organize options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65628&quot;&gt;pr#65628&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: build image variants (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65946&quot;&gt;pr#65946&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-mixin: Update monitoring mixin (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65692&quot;&gt;pr#65692&lt;/a&gt;, Aashish Sharma, SuperQ, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: fix UdevData initialisation from empty /run/udev/data/* file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65923&quot;&gt;pr#65923&lt;/a&gt;, Matteo Paramatti)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: lvm&lt;span&gt;&lt;/span&gt;.Lvm&lt;span&gt;&lt;/span&gt;.setup_metadata_devices refactor (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65925&quot;&gt;pr#65925&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: support additional dmcrypt params (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65544&quot;&gt;pr#65544&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: use udev data instead of LVM subprocess in get_devices() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65921&quot;&gt;pr#65921&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph_release, doc/dev: update tentacle as stable release (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65988&quot;&gt;pr#65988&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm, debian/rules: Use system packages for cephadm bundled dependencies (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66256&quot;&gt;pr#66256&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm: fix building rpm-sourced cephadm zipapp on el10 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65292&quot;&gt;pr#65292&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm: set default image for tentacle release (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65719&quot;&gt;pr#65719&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm: support custom distros by falling back to ID_LIKE (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65696&quot;&gt;pr#65696&lt;/a&gt;, bachmanity1)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-journal-tool: Journal trimming issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65601&quot;&gt;pr#65601&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix async/sync I/O stalling due to buffer list exceeding INT_MAX (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65256&quot;&gt;pr#65256&lt;/a&gt;, Dhairya Parmar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix dump_mds_requests to valid json format (&lt;a href=&quot;http://tracker.ceph.com/issues/73639&quot;&gt;issue#73639&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/66156&quot;&gt;pr#66156&lt;/a&gt;, haoyixing)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix unmount hang after lookups (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65254&quot;&gt;pr#65254&lt;/a&gt;, Dhairya Parmar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: use path supplied in statfs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65132&quot;&gt;pr#65132&lt;/a&gt;, Christopher Hoffman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/frag: properly convert frag_t to net/store endianness (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66540&quot;&gt;pr#66540&lt;/a&gt;, Patrick Donnelly, Max Kellermann)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common: Allow PerfCounters to return a provided service ID (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65587&quot;&gt;pr#65587&lt;/a&gt;, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: add iproute2 to build dependencies (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66737&quot;&gt;pr#66737&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: Add libxsimd-dev build dependency for vendored Arrow (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66248&quot;&gt;pr#66248&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: record python3-packaging dependency for ceph-volume (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66590&quot;&gt;pr#66590&lt;/a&gt;, Thomas Lamprecht, Max R. Carrara)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: fix docs for pause_purging and pause_cloning (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66452&quot;&gt;pr#66452&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/smb: document the &#39;provider&#39; option for smb share (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65617&quot;&gt;pr#65617&lt;/a&gt;, Sachin Prabhu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: change all intra-docs links to use ref (1 of 6) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67043&quot;&gt;pr#67043&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: change all intra-docs links to use ref (2 of 6) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67084&quot;&gt;pr#67084&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic improvements and ref links in account&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67064&quot;&gt;pr#67064&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rbd/rbd-config-ref: add clone settings section (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66175&quot;&gt;pr#66175&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: add Tentacle to os recommendations (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66464&quot;&gt;pr#66464&lt;/a&gt;, Casey Bodley, Joseph Mundackal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: fetch releases from main branch (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67002&quot;&gt;pr#67002&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Pin pip to &amp;lt;25&lt;span&gt;&lt;/span&gt;.3 for RTD as a workaround for pybind in admin/doc-read-the-docs&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66106&quot;&gt;pr#66106&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Remove sphinxcontrib-seqdiag Python package from RTD builds (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67296&quot;&gt;pr#67296&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Update dashboard pending release notes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65984&quot;&gt;pr#65984&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;encode: Fix bad use of DENC_DUMP_PRE (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66565&quot;&gt;pr#66565&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fast failover (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67150&quot;&gt;pr#67150&lt;/a&gt;, leonidc, Leonid Chernin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fix multifs auth caps check (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65358&quot;&gt;pr#65358&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Form retains old data when switching from edit to create (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65654&quot;&gt;pr#65654&lt;/a&gt;, pujashahu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generalize error handling for angular forms (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66904&quot;&gt;pr#66904&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;github: pin GH Actions to SHA-1 commit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65761&quot;&gt;pr#65761&lt;/a&gt;, Ernesto Puerta)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps&lt;span&gt;&lt;/span&gt;.sh: install proper compiler version on Debian/Ubuntu (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66015&quot;&gt;pr#66015&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps: Replace apt-mirror (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66672&quot;&gt;pr#66672&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;libcephfs: New feature - add ceph_setlk and ceph_getlk functions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65258&quot;&gt;pr#65258&lt;/a&gt;, Giorgos Kappes)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: fix ExclusiveLock::accept_request() when !is_state_locked() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66628&quot;&gt;pr#66628&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: introduce RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67279&quot;&gt;pr#67279&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds/FSMap: fix join_fscid being incorrectly reset for active MDS during filesystem removal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65777&quot;&gt;pr#65777&lt;/a&gt;, ethanwu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds/MDSDaemon: unlock &lt;code&gt;mds_lock&lt;/code&gt; while shutting down Beacon and others (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64885&quot;&gt;pr#64885&lt;/a&gt;, Max Kellermann)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: dump export_ephemeral_random_pin as double (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65163&quot;&gt;pr#65163&lt;/a&gt;, Enrico Bocchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: fix rank 0 marked damaged if stopping fails after Elid flush (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65778&quot;&gt;pr#65778&lt;/a&gt;, ethanwu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix readdir when osd is full (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65346&quot;&gt;pr#65346&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: fix snapdiff result fragmentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65362&quot;&gt;pr#65362&lt;/a&gt;, Igor Fedotov, Md Mahamudur Rahaman Sajib)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: include auth credential in session dump (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65255&quot;&gt;pr#65255&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Return ceph&lt;span&gt;&lt;/span&gt;.dir&lt;span&gt;&lt;/span&gt;.subvolume vxattr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65779&quot;&gt;pr#65779&lt;/a&gt;, Edwin Rodriguez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: skip charmap handler check for MDS requests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64953&quot;&gt;pr#64953&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: wrong snap check for directory with parent snaps (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65259&quot;&gt;pr#65259&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/alerts: enforce ssl context to SMTP_SSL (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66140&quot;&gt;pr#66140&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Add some new fields to the cephadm NVMEoF spec file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66987&quot;&gt;pr#66987&lt;/a&gt;, Gil Bregman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: bump monitoring stack versions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65895&quot;&gt;pr#65895&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Change the default of max hosts per namespace in NVMEoF to 16 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66819&quot;&gt;pr#66819&lt;/a&gt;, Gil Bregman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: don&#39;t mark nvmeof daemons without pool and group in name as stray (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65594&quot;&gt;pr#65594&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: update grafana conf for disconnected environment (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66209&quot;&gt;pr#66209&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Use a persistent volume to store Loki DB (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66023&quot;&gt;pr#66023&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/DaemonServer: fixed mistype for mgr_osd_messages (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63345&quot;&gt;pr#63345&lt;/a&gt;, Konstantin Shalygin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/DaemonState: Minimise time we hold the DaemonStateIndex lock (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65464&quot;&gt;pr#65464&lt;/a&gt;, Brad Hubbard)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Carbonize pools form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66789&quot;&gt;pr#66789&lt;/a&gt;, Abhishek Desai, Ankit Kumar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard :  Fixed labels issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66603&quot;&gt;pr#66603&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Carbonize -&amp;gt; Report an issue modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66048&quot;&gt;pr#66048&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : fix - about modal tooltip issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66276&quot;&gt;pr#66276&lt;/a&gt;, Devika Babrekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : fix - CephFS Authorize Modal Update issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66419&quot;&gt;pr#66419&lt;/a&gt;, Devika Babrekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : fix css for carbon input fields (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65490&quot;&gt;pr#65490&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fix secure-monitoring-stack creds issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65943&quot;&gt;pr#65943&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fixed mirrored image usage info bar (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65491&quot;&gt;pr#65491&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fixed usage bar for secondary site in rbd mirroring (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65927&quot;&gt;pr#65927&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fixed warning icon colour issue with carbon colour (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66271&quot;&gt;pr#66271&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Hide suppressed  alert on landing page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65737&quot;&gt;pr#65737&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Remove subalerts details for multiple subalerts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66295&quot;&gt;pr#66295&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Skip calls until secure_monitoring_stack is enabled (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65673&quot;&gt;pr#65673&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: --no-group-append default value to False, aligned with old cli (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65678&quot;&gt;pr#65678&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add Archive zone configuration to the Dashboard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67131&quot;&gt;pr#67131&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add customizations to table-actions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65956&quot;&gt;pr#65956&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add full page tearsheet component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66892&quot;&gt;pr#66892&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add generic wizard component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66893&quot;&gt;pr#66893&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add get_subsystem nvme command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66941&quot;&gt;pr#66941&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add indentation to the json output of nvmeof cli commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66940&quot;&gt;pr#66940&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add multiple ceph users deletion (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65658&quot;&gt;pr#65658&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add nsid param to ns add command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65677&quot;&gt;pr#65677&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add nsid param to ns list command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65749&quot;&gt;pr#65749&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add overview page and change &#39;Dashboard&#39; to &#39;Overview&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67118&quot;&gt;pr#67118&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add productive card component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67147&quot;&gt;pr#67147&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add text-label-list component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66312&quot;&gt;pr#66312&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Adding QAT Compression dropdown on RGW Service form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66642&quot;&gt;pr#66642&lt;/a&gt;, Devika Babrekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: allow deletion of non-default zone and zonegroup (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66211&quot;&gt;pr#66211&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Allow FQDN in Connect Cluster form -&amp;gt; Cluster API URL (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65622&quot;&gt;pr#65622&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Blank entry for Storage Capacity in dashboard under Cluster &amp;gt; Expand Cluster &amp;gt; Review (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65705&quot;&gt;pr#65705&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: bump validator package to address vulnerability (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66227&quot;&gt;pr#66227&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize - Multisite Zone (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67117&quot;&gt;pr#67117&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize Administration module &amp;gt; Create Realm/Zone group/zone (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66986&quot;&gt;pr#66986&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize multisite sync policy forms (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66302&quot;&gt;pr#66302&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonize service form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66978&quot;&gt;pr#66978&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize the Change Password Form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66401&quot;&gt;pr#66401&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonize-delete-zone-modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67100&quot;&gt;pr#67100&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonize-delete-zonegroup-modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67014&quot;&gt;pr#67014&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonized-multisite-export-realm-token-modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66649&quot;&gt;pr#66649&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: change the default max namespace from 4096 to None in subsystem add command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65951&quot;&gt;pr#65951&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Edit user via UI throwing multiple server errors (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66081&quot;&gt;pr#66081&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: empty-data-message (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66902&quot;&gt;pr#66902&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fetch all namespaces in a gateway group (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67140&quot;&gt;pr#67140&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix command alias help message (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65750&quot;&gt;pr#65750&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix dashboard freeze on missing smb permissions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65873&quot;&gt;pr#65873&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix data mismatch in Advance section in Tiering (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65672&quot;&gt;pr#65672&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix display of IP address in host page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67146&quot;&gt;pr#67146&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix icon alignment in navigation header (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66091&quot;&gt;pr#66091&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix misaligned text links on login page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66052&quot;&gt;pr#66052&lt;/a&gt;, prik73, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix missing schedule interval in rbd API (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65560&quot;&gt;pr#65560&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix multi-cluster route reload logic (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66504&quot;&gt;pr#66504&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix multisite wizard realm configuration mode (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66017&quot;&gt;pr#66017&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix None force param handling in ns add_host so it won&#39;t raise exceptions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65679&quot;&gt;pr#65679&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix ns add and resize commands help (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66939&quot;&gt;pr#66939&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix oauth2-service creation UI error (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66139&quot;&gt;pr#66139&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix prometheus API error when not configured (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65856&quot;&gt;pr#65856&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix rbd form mirroring toggle (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65874&quot;&gt;pr#65874&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix RBD mirror schedule inheritance in pool and image APIs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67107&quot;&gt;pr#67107&lt;/a&gt;, Imran Imtiaz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix smb button and table column (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65657&quot;&gt;pr#65657&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix table width expansion on manager module dropdown selection #74089 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66647&quot;&gt;pr#66647&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix the separation between CLI and API only commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65781&quot;&gt;pr#65781&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix timestamps in APIs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66029&quot;&gt;pr#66029&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix total capacity value in dashboard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65647&quot;&gt;pr#65647&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix typo in error when gw does not exist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66956&quot;&gt;pr#66956&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix zone update API forcing STANDARD storage class (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65619&quot;&gt;pr#65619&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fixes for quick-bootstrap script (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67040&quot;&gt;pr#67040&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: FS - Attach Command showing undefined for MountData (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65675&quot;&gt;pr#65675&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Group similar alerts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65493&quot;&gt;pr#65493&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Handle pool creation in tiering local storage class creation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65680&quot;&gt;pr#65680&lt;/a&gt;, Dnyaneshwari, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Maintain sentence case consistency in side nav bar titles (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66050&quot;&gt;pr#66050&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: ns list now support not passing nqn param (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65897&quot;&gt;pr#65897&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: raise exception if both size and rbd_image_size are being passed in ns add (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65816&quot;&gt;pr#65816&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: rbd consistency group and snapshot APIs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66935&quot;&gt;pr#66935&lt;/a&gt;, Imran Imtiaz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Remove illegible texts from the dashboard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66306&quot;&gt;pr#66306&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: remove not needed &#39;cli_version&#39; field from gw info com… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66942&quot;&gt;pr#66942&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Remove the time dropdown from grafana iframe (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65853&quot;&gt;pr#65853&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: removes nx folder (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67003&quot;&gt;pr#67003&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: rename &#39;Zone Group&#39; labels to &#39;Zonegroup&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66790&quot;&gt;pr#66790&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Rename Alerts tab to All Alerts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66532&quot;&gt;pr#66532&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Rename side-nav panel items (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65846&quot;&gt;pr#65846&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: replace bootstrap badges with carbon tags (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66350&quot;&gt;pr#66350&lt;/a&gt;, pujaoshahu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: replace usage or progress bar with carbon meter chart (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66934&quot;&gt;pr#66934&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: rgw accounts form group mode disable option is not working (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66351&quot;&gt;pr#66351&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: server side table rendering improvements (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65828&quot;&gt;pr#65828&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: service creation fails if service name is same as service type (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66481&quot;&gt;pr#66481&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Set max subsystem count to 512 rather than 4096 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66284&quot;&gt;pr#66284&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: support gw get_stats and listener info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65896&quot;&gt;pr#65896&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Tiering form - Placement Target in Advanced Section (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65653&quot;&gt;pr#65653&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: update teuth_ref hash in api test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66706&quot;&gt;pr#66706&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard:[NFS] add Subvolume Groups and Subvolumes in &amp;quot;Edit NFS Export form&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65650&quot;&gt;pr#65650&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/prometheus: Handle empty/invalid JSON from orch get-security-config (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65906&quot;&gt;pr#65906&lt;/a&gt;, Sunnatillo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/telemetry: add &#39;ec_optimizations&#39; flag to &#39;basic_pool_flags&#39; collection (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65969&quot;&gt;pr#65969&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: handling the failed non-atomic operation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65728&quot;&gt;pr#65728&lt;/a&gt;, Neeraj Pratap Singh)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: keep and show clone source info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64650&quot;&gt;pr#64650&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65262&quot;&gt;pr#65262&lt;/a&gt;, Enrico Bocchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: remove unnecessary log error lines from earmark handling (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66991&quot;&gt;pr#66991&lt;/a&gt;, Avan Thakkar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: avoid explicit dropping of ref (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65005&quot;&gt;pr#65005&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr:python: avoid pyo3 errors by running certain cryptographic functions in a child process (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66794&quot;&gt;pr#66794&lt;/a&gt;, Nizamudeen A, John Mulligan, Paulo E. Castro)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/FSCommands: avoid unreachable code triggering compiler warning (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65261&quot;&gt;pr#65261&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/MgrMonitor: add a space before &amp;quot;is already disabled&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64687&quot;&gt;pr#64687&lt;/a&gt;, Zehua Qi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/OSDMonitor&lt;span&gt;&lt;/span&gt;.cc: optionally display availability status in json (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65794&quot;&gt;pr#65794&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: Add command &amp;quot;nvme-gw listeners&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66584&quot;&gt;pr#66584&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: ceph pg repeer should propose a correctly sized pg temp (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66324&quot;&gt;pr#66324&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: Deny EC optimizations (fast EC) for non-4k-aligned chunk-sizes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67319&quot;&gt;pr#67319&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monc: synchronize tick() of MonClient with shutdown() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66916&quot;&gt;pr#66916&lt;/a&gt;, Radoslaw Zarzynski)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix &amp;quot;In&amp;quot; OSDs in Cluster-Advanced grafana panel&lt;span&gt;&lt;/span&gt;. Also change units from decbytes to bytes wherever used in the panel (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65670&quot;&gt;pr#65670&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix &amp;quot;Total gateway&amp;quot; and &amp;quot;Ceph Health NVMeoF WARNING&amp;quot; grafana graphs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66225&quot;&gt;pr#66225&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix CephPgImbalance alert rule expression (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66828&quot;&gt;pr#66828&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: Fix Filesystem grafana dashboard units (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66018&quot;&gt;pr#66018&lt;/a&gt;, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix MTU Mismatch alert rule and expr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65708&quot;&gt;pr#65708&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix rgw_servers filtering in rgw sync overview grafana (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66989&quot;&gt;pr#66989&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: Fixes for smb overview (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66019&quot;&gt;pr#66019&lt;/a&gt;, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: make cluster matcher backward compatible for pre-reef metrics (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66984&quot;&gt;pr#66984&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: update NVMeoFTooManyNamespaces to 4096 ns (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67039&quot;&gt;pr#67039&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: upgrade grafana version to 12&lt;span&gt;&lt;/span&gt;.3&lt;span&gt;&lt;/span&gt;.1 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66963&quot;&gt;pr#66963&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;nvmeof: refactor beacon timer for exact frequency timing with drift correction (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66536&quot;&gt;pr#66536&lt;/a&gt;, Alexander Indenbaum)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Objecter: respect higher epoch subscription in tick (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66972&quot;&gt;pr#66972&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: cumulative patch to fix extent map resharding and around (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65964&quot;&gt;pr#65964&lt;/a&gt;, Igor Fedotov, Adam Kupczyk, Jaya Prakash)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: fix vselector update after enveloped WAL recovery (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67333&quot;&gt;pr#67333&lt;/a&gt;, Igor Fedotov, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: introduce device type specific allocation policy (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66839&quot;&gt;pr#66839&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/ECUtil: Fix erase_after_ro_offset length calculation and add tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66825&quot;&gt;pr#66825&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/PeeringState: re-evaluate full OSDs while waiting for recovery re… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65701&quot;&gt;pr#65701&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: do not reduce min chunk on preemption (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66214&quot;&gt;pr#66214&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: fix blocked scrub accounting (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66220&quot;&gt;pr#66220&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: new/modified perf counters for scrub preemption (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66234&quot;&gt;pr#66234&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Do not remove objects with divergent logs if only partial writes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66725&quot;&gt;pr#66725&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix fast EC truncate to whole stripe (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66543&quot;&gt;pr#66543&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix for num_bytes mismatch occurring from snapshot workloads with partial writes in fast_ec (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67137&quot;&gt;pr#67137&lt;/a&gt;, Jon Bailey)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix memory leak of ECDummyOp (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66977&quot;&gt;pr#66977&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix stats mismatch cluster error seen during scrubbing occasionally (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65793&quot;&gt;pr#65793&lt;/a&gt;, Jon Bailey)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Relax missing entry assert for partial writes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65860&quot;&gt;pr#65860&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: stop scrub_purged_snaps() from ignoring osd_beacon_report_interval (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65478&quot;&gt;pr#65478&lt;/a&gt;, Radoslaw Zarzynski)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pickup object corpus 20&lt;span&gt;&lt;/span&gt;.2&lt;span&gt;&lt;/span&gt;.0 380 gdbcbbd3f281 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66592&quot;&gt;pr#66592&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;prometheus: Add Cephadm orch ps output metric to prometheus (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66760&quot;&gt;pr#66760&lt;/a&gt;, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr/dashboard: dashboard/requirements-lint&lt;span&gt;&lt;/span&gt;.txt: re-pin rsscheck (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66877&quot;&gt;pr#66877&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr/pg_autoscaler: Introduce dynamic threshold to improve scal… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66871&quot;&gt;pr#66871&lt;/a&gt;, Prashant D)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: pin cheroot version in requirements-required&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65635&quot;&gt;pr#65635&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/rados: Add list_lockers() and break_lock() to Rados Python interface (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65098&quot;&gt;pr#65098&lt;/a&gt;, Gil Bregman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/multisite: switch to boto3 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67318&quot;&gt;pr#67318&lt;/a&gt;, Shilpa Jagannath, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: bucket notifications use pynose (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67449&quot;&gt;pr#67449&lt;/a&gt;, Casey Bodley, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/standalone/availability&lt;span&gt;&lt;/span&gt;.sh: retry after feature is turned on (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67226&quot;&gt;pr#67226&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/nvmeof: add upgrade sub-suite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65583&quot;&gt;pr#65583&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/rados/thrash-old-clients: Add OSD warnings to ignore list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65369&quot;&gt;pr#65369&lt;/a&gt;, Naveen Naidu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/rbd/valgrind: don&#39;t hardcode os_type in memcheck&lt;span&gt;&lt;/span&gt;.yaml (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66196&quot;&gt;pr#66196&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/upgrade: add &amp;quot;Replacing daemon mds&amp;quot; to ignorelist (&lt;a href=&quot;http://tracker.ceph.com/issues/71615&quot;&gt;issue#71615&lt;/a&gt;, &lt;a href=&quot;http://tracker.ceph.com/issues/50279&quot;&gt;issue#50279&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/64888&quot;&gt;pr#64888&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites: wait longer before stopping OSDs with valgrind (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63716&quot;&gt;pr#63716&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/ceph_manager: population must be a sequence (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64746&quot;&gt;pr#64746&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/qemu: rocky 10 enablement (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67283&quot;&gt;pr#67283&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/rbd_mirror_thrash: don&#39;t use random&lt;span&gt;&lt;/span&gt;.randrange() on floats (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67163&quot;&gt;pr#67163&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/workunit: fix no module named &#39;pipes&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66250&quot;&gt;pr#66250&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: added initial draft for tentacle-p2p (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67765&quot;&gt;pr#67765&lt;/a&gt;, Patrick Donnelly, Yuri Weinstein)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: added messages to the whitelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65645&quot;&gt;pr#65645&lt;/a&gt;, Laura Flores, Yuri Weinstein)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: wait for module to be available for connection (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67196&quot;&gt;pr#67196&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/valgrind&lt;span&gt;&lt;/span&gt;.supp: make gcm_cipher_internal suppression more resilient (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67281&quot;&gt;pr#67281&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/nvmeof/basic_tests: use nvme-cli 2&lt;span&gt;&lt;/span&gt;.13 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67285&quot;&gt;pr#67285&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rados: remove cache tier test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65540&quot;&gt;pr#65540&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rbd: adapt rbd_mirror&lt;span&gt;&lt;/span&gt;.sh for trial nodes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67152&quot;&gt;pr#67152&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rbd: reduce randomized sleeps in live import tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67154&quot;&gt;pr#67154&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rbd: use the same qemu-iotests version throughout (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67282&quot;&gt;pr#67282&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rgw: drop netstat usage (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67184&quot;&gt;pr#67184&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits: add Rocky Linux support to librados tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67091&quot;&gt;pr#67091&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Disable OSD benchmark from running for tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67068&quot;&gt;pr#67068&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: don&#39;t assume that /dev/sda or /dev/vda is present in unmap&lt;span&gt;&lt;/span&gt;.t (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67077&quot;&gt;pr#67077&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Fix test_with_health_warn_with_2_active_MDSs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65260&quot;&gt;pr#65260&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: ignore cluster warning (evicting unresponsive &lt;span&gt;&lt;/span&gt;.&lt;span&gt;&lt;/span&gt;.&lt;span&gt;&lt;/span&gt;.) with tasks/mgr-osd-full (&lt;a href=&quot;http://tracker.ceph.com/issues/73278&quot;&gt;issue#73278&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/66125&quot;&gt;pr#66125&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Improve scalability test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66224&quot;&gt;pr#66224&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: krbd_blkroset&lt;span&gt;&lt;/span&gt;.t: eliminate a race in the open_count test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67075&quot;&gt;pr#67075&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Run RADOS suites with ec optimizations on and off (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65471&quot;&gt;pr#65471&lt;/a&gt;, Jamie Pryde)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: suppress OpenSSL valgrind leaks (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65660&quot;&gt;pr#65660&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: add cluster fsid to remote meta cache key (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66297&quot;&gt;pr#66297&lt;/a&gt;, Mykola Golub)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: allow incomplete demote snapshot to sync after rbd-mirror daemon restart (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66164&quot;&gt;pr#66164&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Relax scrub of shard sizes for upgraded EC pools (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66021&quot;&gt;pr#66021&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Revert &amp;quot;Merge pull request #66958 from Hezko/wip-74413-tentacle&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67750&quot;&gt;pr#67750&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Revert &amp;quot;PrimeryLogPG: don&#39;t accept ops with mixed balance_reads and rwordered flags&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66611&quot;&gt;pr#66611&lt;/a&gt;, Radoslaw Zarzynski)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;RGW | fix conditional Delete, MultiDelete and Put (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65949&quot;&gt;pr#65949&lt;/a&gt;, Ali Masarwa)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;RGW | fix conditional MultiWrite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67425&quot;&gt;pr#67425&lt;/a&gt;, Ali Masarwa)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/account: bucket acls are not completely migrated once the user is migrated to an account (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65666&quot;&gt;pr#65666&lt;/a&gt;, kchheda3)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/admin: Add max-entries and marker to bucket list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65485&quot;&gt;pr#65485&lt;/a&gt;, Tobias Urdin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/lc: LCOpAction_CurrentExpiration checks mtime for delete markers (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65965&quot;&gt;pr#65965&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/tentacle: clean up &lt;span&gt;&lt;/span&gt;.rgw_op&lt;span&gt;&lt;/span&gt;.cc&lt;span&gt;&lt;/span&gt;.swn file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66161&quot;&gt;pr#66161&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: add metric when send message with kafka and ampq (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65904&quot;&gt;pr#65904&lt;/a&gt;, Hoai-Thu Vuong)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix &#39;bucket rm --bypass-gc&#39; for copied objects (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66004&quot;&gt;pr#66004&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix &lt;code&gt;radosgw-admin object unlink ...&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66151&quot;&gt;pr#66151&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;RGW: multi object delete op; skip olh update for all deletes but the last one (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65488&quot;&gt;pr#65488&lt;/a&gt;, Oguzhan Ozmen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: update keystone repo stable branch to 2024&lt;span&gt;&lt;/span&gt;.2 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66241&quot;&gt;pr#66241&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rpm: default to gcc-toolset-13, not just for crimson (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65752&quot;&gt;pr#65752&lt;/a&gt;, John Mulligan, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;scripts/build/ceph&lt;span&gt;&lt;/span&gt;.spec&lt;span&gt;&lt;/span&gt;.in: fix rhel version checks (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66865&quot;&gt;pr#66865&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/ceph_osd, osd: Implement running benchmark during OSD creation - Phase 1 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65522&quot;&gt;pr#65522&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src: Move the decision to build the ISA plugin to the top level make file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67894&quot;&gt;pr#67894&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;sync build-with-container patches from main (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65843&quot;&gt;pr#65843&lt;/a&gt;, John Mulligan, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;systemd services: fix installing ceph-volume@ (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66861&quot;&gt;pr#66861&lt;/a&gt;, Thomas Lamprecht)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tasks/cbt_performance: Tolerate exceptions during performance data up… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66102&quot;&gt;pr#66102&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/ceph_assert&lt;span&gt;&lt;/span&gt;.cc: Disable core files (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66334&quot;&gt;pr#66334&lt;/a&gt;, Bob Ham)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/neorados: Catch timeouts in Poll test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65605&quot;&gt;pr#65605&lt;/a&gt;, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test: disable known flaky tests in run-rbd-unit-tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67559&quot;&gt;pr#67559&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tools: handle get-attr as read-only ops in ceph_objectstore_tool (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66537&quot;&gt;pr#66537&lt;/a&gt;, Jaya Prakash)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Q1 2026 Newsletter</title>
    <link href="https://ceph.io/en/news/blog/2026/Q1-community-newsletter/" />
    <updated>2026-03-31T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/Q1-community-newsletter/</id>
    <author>
      <name>Anthony Middleton</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="community" />
      <category term="governance" />
      <category term="ceph events" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/Q1-community-newsletter/">&lt;p&gt;During this quarter, the Ceph Foundation focused on strengthening its structure by establishing new governance charters, event strategies, and financial plans. To enhance transparency as our community evolves, this newsletter offers a look behind the scenes at these foundational details. Our aim is to update Ceph Community members on the latest developments within the Foundation and to clarify how they can contribute to our ongoing growth. There are several ways to get involved with the foundation. If you have a concept for a Ceph-related project, we encourage you to take the first step toward bringing your idea to the next level and &lt;a href=&quot;https://form.asana.com/?k=7aCHVRhp0x1Ga1nOCXlckQ&amp;amp;d=9283783873717&quot;&gt;submit a funding request&lt;/a&gt; whenever you are ready. Feedback and suggestions will be offered along your journey with Ceph.&lt;/p&gt;
&lt;h2 id=&quot;in-this-issue&quot;&gt;In this issue &lt;a class=&quot;link-anchor&quot; href=&quot;#in-this-issue&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;CSC and Ceph Foundation Board Meeting&lt;/li&gt;
&lt;li&gt;Ceph Foundation Charters Approved&lt;/li&gt;
&lt;li&gt;Ceph Governing Board Hiring a Technical Writer&lt;/li&gt;
&lt;li&gt;OVHcloud Spending Update&lt;/li&gt;
&lt;li&gt;Ceph Tech Talks Are Back (Monthly Schedule)&lt;/li&gt;
&lt;li&gt;Upcoming Ceph Days Events&lt;/li&gt;
&lt;li&gt;Ceph Community Slack Upgraded to Pro&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;csc-and-ceph-foundation-board-meeting&quot;&gt;CSC and Ceph Foundation Board Meeting &lt;a class=&quot;link-anchor&quot; href=&quot;#csc-and-ceph-foundation-board-meeting&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Foundation Board recently hosted the Ceph Steering Committee (CSC) for a collaborative discussion on the current state of the Ceph project and how both sides can work together to support the Ceph community.&lt;/p&gt;
&lt;p&gt;These quarterly meetings are designed to foster communication between the two committees working to build Ceph for the benefit of its users and contributors. The Ceph Board&#39;s goal is to help provide greater context to the CSC as they make decisions and to support their missions, thereby bridging the communication gap. The meeting&#39;s agenda is available &lt;a href=&quot;https://pad.ceph.com/p/foundation-board-csc-meeting&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;key-discussion-areas&quot;&gt;Key discussion areas &lt;a class=&quot;link-anchor&quot; href=&quot;#key-discussion-areas&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Operational complexity&lt;/li&gt;
&lt;li&gt;Friction in contributing and getting reviews&lt;/li&gt;
&lt;li&gt;Fragmented communication&lt;/li&gt;
&lt;li&gt;Unclear strategy in some areas&lt;/li&gt;
&lt;li&gt;Unclear ownership across parts of the ecosystem&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;major-takeaways&quot;&gt;Major takeaways &lt;a class=&quot;link-anchor&quot; href=&quot;#major-takeaways&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The Board will continue to work on closing the gap between developer experience and real-world operator needs, with help from the CSC around failure handling and upgrades.&lt;/li&gt;
&lt;li&gt;Both committees agreed that contributor experience remains a key challenge, with CI complexity and limited reviewer bandwidth identified as the biggest sources of friction, not the contribution process itself.&lt;/li&gt;
&lt;li&gt;The Board and the CSC recognize that perception matters. Having heard feedback that Ceph can feel “unwelcoming,” they are addressing this concern, and it will guide improvements in onboarding and engagement.&lt;/li&gt;
&lt;li&gt;The Ceph Foundation will keep focusing on areas like performance and efficiency. There&#39;s an opportunity to improve how project-wide priorities are communicated, helping to align contributors and ecosystem partners.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;action-items&quot;&gt;Action Items &lt;a class=&quot;link-anchor&quot; href=&quot;#action-items&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Ceph Community Manager: Document best practices for large contributions requiring early community engagement&lt;/li&gt;
&lt;li&gt;Ceph Community Manager: Improve discoverability of contributor guidelines and ambassador resources&lt;/li&gt;
&lt;li&gt;CSC: Prepare Q2 response on project priorities, pain points, and foundation delegation opportunities&lt;/li&gt;
&lt;li&gt;Ceph Community Manager: Share contributor feedback data with CSC for deeper analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;road-to-fully-onboarding-into-the-linux-foundation%3A-charters-approved&quot;&gt;Road to Fully Onboarding into the Linux Foundation: Charters Approved &lt;a class=&quot;link-anchor&quot; href=&quot;#road-to-fully-onboarding-into-the-linux-foundation%3A-charters-approved&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph now has a clearly defined governance model that separates technical decision-making from funding and community growth, making it easier to understand how decisions are made and how to get involved. This task began in September 2025, when the Ceph Foundation initiated its transition into The Linux Foundation, marking a significant step toward strengthening the long-term sustainability and neutrality of the Ceph project.&lt;/p&gt;
&lt;p&gt;The outcome of this work was the formal approval of two foundational governance frameworks. The Ceph Foundation Charter was established to define how the Foundation raises, allocates, and manages resources in support of the project, while also creating a clear and transparent structure for decision-making, community outreach, and ecosystem development. At its core, the Charter exists to ensure that Ceph operates as a vendor-neutral, community-driven project with sustainable funding and broad industry participation.&lt;/p&gt;
&lt;p&gt;In parallel, the Ceph Technical Charter was approved by the Ceph Steering Committee (CSC), reinforcing the independence of the project’s technical governance and clarifying how technical decisions are made in alignment with the needs of the community.&lt;/p&gt;
&lt;p&gt;Together, these milestones establish a balanced governance model: the Foundation focuses on funding, outreach, and ecosystem growth, while the technical community retains authority over the project’s technical direction and innovation.&lt;/p&gt;
&lt;p&gt;The third and final step of this process, currently in the works, is transferring the Ceph trademark to the Linux Foundation.&lt;/p&gt;
&lt;p&gt;Review the Foundation Charter: &lt;a href=&quot;https://cdn.platform.linuxfoundation.org/agreements/cephfoundation.pdf&quot;&gt;https://cdn.platform.linuxfoundation.org/agreements/cephfoundation.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Review the Ceph Technical Charter:&lt;br&gt;
&lt;a href=&quot;https://github.com/ceph/ceph/blob/main/doc/technical-charter.rst&quot;&gt;https://github.com/ceph/ceph/blob/main/doc/technical-charter.rst&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;opportunity%3A-technical-writer-role-open&quot;&gt;Opportunity: Technical Writer Role Open &lt;a class=&quot;link-anchor&quot; href=&quot;#opportunity%3A-technical-writer-role-open&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph is investing directly in better documentation, making it easier for new users to adopt Ceph, for experienced operators to scale, and for prospective contributors to get started with development. The Ceph Technical Committee is actively interviewing candidates for a &lt;strong&gt;Technical Writer&lt;/strong&gt; role to help improve the quality and accessibility of Ceph documentation.&lt;/p&gt;
&lt;h3 id=&quot;about-the-role&quot;&gt;About the role &lt;a class=&quot;link-anchor&quot; href=&quot;#about-the-role&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Focus on clarity, consistency, and usability of technical content&lt;/li&gt;
&lt;li&gt;Collaborate with developers and contributors across the community&lt;/li&gt;
&lt;li&gt;Help lower the barrier to entry for new users and operators&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-to-get-involved&quot;&gt;How to get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-get-involved&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Review docs and flag gaps&lt;/li&gt;
&lt;li&gt;Suggest onboarding pain points&lt;/li&gt;
&lt;li&gt;Participate in doc sprints&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;ovhcloud-spending-update&quot;&gt;OVHcloud Spending Update &lt;a class=&quot;link-anchor&quot; href=&quot;#ovhcloud-spending-update&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Governing Board, with Mark Nelson (Clyso) and Patrick Donnelly (IBM), is working to reduce infrastructure costs. Reports from Mark and Joachim Kraftmayer (Clyso) on OVHcloud show potential for lower monthly hosting expenses. Negotiations with OVHcloud could yield savings, which would be reinvested to boost the Foundation&#39;s support for community programs, events, and infrastructure.&lt;/p&gt;
&lt;h3 id=&quot;current-overview&quot;&gt;Current overview &lt;a class=&quot;link-anchor&quot; href=&quot;#current-overview&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Monthly costs over the past year have ranged between &lt;strong&gt;$5K–$10K USD&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Spending peaked around August and has since been reduced to approximately &lt;strong&gt;$7K/month&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Costs remain higher than earlier in the year, prompting further review&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;key-cost-drivers&quot;&gt;Key cost drivers &lt;a class=&quot;link-anchor&quot; href=&quot;#key-cost-drivers&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A significant portion of expenses comes from storage&lt;/li&gt;
&lt;li&gt;Four 10TB volumes (supporting Chacra nodes), along with associated snapshots, accounted for &lt;strong&gt;over half of total monthly costs (~$4K USD)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;recent-actions&quot;&gt;Recent actions &lt;a class=&quot;link-anchor&quot; href=&quot;#recent-actions&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Unused snapshots and resources have been removed by David Galloway (IBM)&lt;/li&gt;
&lt;li&gt;Ongoing collaboration with OVHcloud to identify additional optimization opportunities&lt;/li&gt;
&lt;li&gt;Further cost savings are expected as usage is refined&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;what%E2%80%99s-next&quot;&gt;What’s next &lt;a class=&quot;link-anchor&quot; href=&quot;#what%E2%80%99s-next&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Continued monitoring and cost optimization efforts&lt;/li&gt;
&lt;li&gt;Improved visibility into infrastructure usage and spending&lt;/li&gt;
&lt;li&gt;Updates will be shared in the next newsletter and monthly reports as new efficiencies are realized&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;ceph-tech-talks-are-back&quot;&gt;Ceph Tech Talks Are Back &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-tech-talks-are-back&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://ceph.io/en/community/tech-talks/&quot;&gt;Ceph Tech Talks&lt;/a&gt; have officially returned, giving the community a consistent way to learn directly from contributors and share real-world experience. So far, we have covered &lt;a href=&quot;https://youtu.be/6ovdJ79AqbM?si=an65PKQI9RyryzTf&quot;&gt;Running Teuthology Outside of the Sepia lab&lt;/a&gt; and &lt;a href=&quot;https://youtu.be/PqniV0qzq68?si=MTxdnp3xq35-QtVv&quot;&gt;How to Get Involved with the Ceph Ambassador Program&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Attend the Tech Talk on April 22, 2026, at 12 pm EDT / 9 am PDT. Our topic will be &lt;strong&gt;MAAS as a Backend: Provisioning Infrastructure for Teuthology Suites&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Follow the &lt;a href=&quot;https://ceph.io/en/community/meetups/&quot;&gt;Ceph Community Calendar&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h3 id=&quot;what-to-expect&quot;&gt;What to expect &lt;a class=&quot;link-anchor&quot; href=&quot;#what-to-expect&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Deep dives into real-world Ceph use cases and features&lt;/li&gt;
&lt;li&gt;Presentations led by community members and contributors&lt;/li&gt;
&lt;li&gt;A wide variety of topics for users and developers&lt;/li&gt;
&lt;li&gt;Interactive sessions with opportunities for Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;get-involved&quot;&gt;Get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Attend upcoming sessions to stay current&lt;/li&gt;
&lt;li&gt;Submit a proposal to present your work &lt;a href=&quot;https://airtable.com/apphc2dbSP8GuCdor/pagKnGCFWqHvgdCrm/form&quot;&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;upcoming-ceph-events-and-the-state-of-cephalocon&quot;&gt;Upcoming Ceph Events and the State of Cephalocon &lt;a class=&quot;link-anchor&quot; href=&quot;#upcoming-ceph-events-and-the-state-of-cephalocon&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Foundation is evolving its approach to events in order to better serve the global community.&lt;/p&gt;
&lt;p&gt;Instead of focusing all of our resources on a single large event like Cephalocon, the Foundation is now emphasizing &lt;strong&gt;local and regional engagement&lt;/strong&gt; to expand Ceph’s reach into nearby communities and projects. This shift was prompted by new guidelines regarding spending within the Foundation. Under these fiduciary guidelines, large events still require &lt;strong&gt;broader community support and sponsorship participation&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&quot;what%E2%80%99s-changing&quot;&gt;What’s changing &lt;a class=&quot;link-anchor&quot; href=&quot;#what%E2%80%99s-changing&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Increased focus on &lt;a href=&quot;https://ceph.io/en/community/events/&quot;&gt;Ceph Days&lt;/a&gt;, meetups, &lt;a href=&quot;https://events.linuxfoundation.org/&quot;&gt;data-driven conferences&lt;/a&gt;, and community-led events&lt;/li&gt;
&lt;li&gt;Strategic investment in opportunities that introduce Ceph to new audiences&lt;/li&gt;
&lt;li&gt;A shift toward distributed, community-driven engagement&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;important-note-on-future-large-events&quot;&gt;Important note on future large events &lt;a class=&quot;link-anchor&quot; href=&quot;#important-note-on-future-large-events&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The Foundation remains open to hosting large-scale events like Cephalocon when broader community sponsorship and participation can support them&lt;/li&gt;
&lt;li&gt;This approach keeps such events aligned with the Foundation’s financial responsibilities while enabling sustainable growth&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-to-get-involved-1&quot;&gt;How to get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-get-involved-1&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Help sponsor large-scale events: as part of the Foundation’s financial stewardship under the Linux Foundation, these require broader community sponsorship and participation&lt;/li&gt;
&lt;li&gt;Organize or support a Ceph event in your region&lt;/li&gt;
&lt;li&gt;Partner with related open source or data infrastructure communities&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;need-support%3F&quot;&gt;Need support? &lt;a class=&quot;link-anchor&quot; href=&quot;#need-support%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Community members can submit &lt;a href=&quot;https://form.asana.com/?k=7aCHVRhp0x1Ga1nOCXlckQ&amp;amp;d=9283783873717&quot;&gt;funding requests&lt;/a&gt; to attend or organize events, as well as to support Ceph-related activities&lt;/li&gt;
&lt;li&gt;Requests are reviewed by the Board (approval is not guaranteed)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;upcoming-ceph-days-%E2%80%93-march-2026&quot;&gt;Upcoming Ceph Days – March 2026 &lt;a class=&quot;link-anchor&quot; href=&quot;#upcoming-ceph-days-%E2%80%93-march-2026&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph Days continues to grow as the primary way the community connects locally. Two community-driven events took place this March:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ceph Days India&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sponsors: IBM and Clyso&lt;/li&gt;
&lt;li&gt;Attendees: 192&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Ceph Days Raleigh&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sponsor: IBM&lt;/li&gt;
&lt;li&gt;Attendees: 78&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;why-attend&quot;&gt;Why attend &lt;a class=&quot;link-anchor&quot; href=&quot;#why-attend&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Connect with other Ceph users and contributors&lt;/li&gt;
&lt;li&gt;Learn from real-world deployments and technical sessions&lt;/li&gt;
&lt;li&gt;Grow your local Ceph network&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;get-involved-1&quot;&gt;Get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved-1&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Share your ideas for a Ceph Days: &lt;a href=&quot;https://pad.ceph.com/p/ceph-days-2026&quot;&gt;https://pad.ceph.com/p/ceph-days-2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Help organize or promote events in your region&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;ceph-community-slack-upgraded-to-pro&quot;&gt;Ceph Community Slack Upgraded to Pro &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-community-slack-upgraded-to-pro&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The community can now access Slack messages beyond the 90-day limit of the free plan! The &lt;a href=&quot;https://join.slack.com/t/ceph-storage/shared_invite/zt-3jlvf8f6e-45tyKGpqkkfcC9feAUpgfQ&quot;&gt;Ceph Community Slack&lt;/a&gt; workspace has been upgraded to Slack Pro, courtesy of the Linux Foundation, as part of negotiations requested by the Ceph Board.&lt;/p&gt;
&lt;h3 id=&quot;what-this-means&quot;&gt;What this means &lt;a class=&quot;link-anchor&quot; href=&quot;#what-this-means&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No more losing conversations!&lt;/li&gt;
&lt;li&gt;There is no additional cost to the Ceph community&lt;/li&gt;
&lt;li&gt;A more robust platform for collaboration and communication&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;why-it-matters&quot;&gt;Why it matters &lt;a class=&quot;link-anchor&quot; href=&quot;#why-it-matters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Supports better knowledge sharing across contributors and users&lt;/li&gt;
&lt;li&gt;Improves accessibility of past discussions and technical insights&lt;/li&gt;
&lt;li&gt;Strengthens real-time collaboration within the community&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;get-involved-2&quot;&gt;Get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved-2&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Become a moderator for the Ceph Slack workspace&lt;/li&gt;
&lt;li&gt;Email &lt;a href=&quot;mailto:amiddleton@linuxfoundation.org&quot;&gt;amiddleton@linuxfoundation.org&lt;/a&gt; for more information&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>v18.2.8 Reef released</title>
    <link href="https://ceph.io/en/news/blog/2026/v18-2-8-reef-released/" />
    <updated>2026-03-20T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/v18-2-8-reef-released/</id>
    <author>
      <name>Yuri Weinstein</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="release" />
      <category term="reef" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/v18-2-8-reef-released/">&lt;p&gt;This is the eighth, and expected to be last, backport release in the Reef series. We recommend that all users update to this release.&lt;/p&gt;
&lt;h2 id=&quot;release-date&quot;&gt;Release Date &lt;a class=&quot;link-anchor&quot; href=&quot;#release-date&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;March 20, 2026&lt;/p&gt;
&lt;h2 id=&quot;known-issues&quot;&gt;Known Issues &lt;a class=&quot;link-anchor&quot; href=&quot;#known-issues&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;During QA for v18.2.8, a bug was found affecting upgrades from
Pacific to Reef. Pacific OSDs (and other Ceph daemons) were still using a
deprecated connection feature bit that was adopted to indicate a Reef OSD.
This can cause an OSD_UPGRADE_FINISHED warning before all OSDs are actually
upgraded to Reef. There are no known issues associated with Pacific and Reef
OSDs interoperating where Pacific OSDs are &amp;quot;advertising&amp;quot; Reef compatibility;
however, out of an abundance of caution, we no longer recommend upgrading
from Pacific to Reef directly. A quick way to confirm the versions your OSDs
are actually running is sketched after this list.&lt;/li&gt;
&lt;/ul&gt;
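&lt;p&gt;For operators who hit this warning mid-upgrade, one way to confirm whether every OSD is actually running Reef is to inspect the versions the daemons report, rather than relying on OSD_UPGRADE_FINISHED alone. A minimal sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Summarize the running version of each daemon type; every OSD should report
# an 18.2.x (Reef) version before the upgrade is treated as complete.
ceph versions

# Break OSDs down by their reported ceph_version metadata for a per-version count.
ceph osd count-metadata ceph_version
&lt;/code&gt;&lt;/pre&gt;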
&lt;h2 id=&quot;security-fixes&quot;&gt;Security Fixes &lt;a class=&quot;link-anchor&quot; href=&quot;#security-fixes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;CephFS Client: A fix was merged to prohibit unprivileged users from modifying
the sgid or suid bits on a file. Previously, unprivileged users were
inadvertently permitted to set these bits if they were the sole bits being
modified.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Mgr Alerts: The SMTP SSL context was enforced in the mgr/alerts module to
resolve a security vulnerability (GHSA-xj9f-7g59-m4jx).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;notable-changes&quot;&gt;Notable Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#notable-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;RGW (RADOS Gateway):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fixed an issue where bucket rm --bypass-gc was mistakenly removing head objects instead of tail objects, potentially causing data inconsistencies.&lt;/li&gt;
&lt;li&gt;Fixed rgw-restore-bucket-index to handle objects with leading hyphens and to process versioned buckets correctly.&lt;/li&gt;
&lt;li&gt;Addressed an issue in the msg/async protocol that caused memory locks and hangs during connection shutdown.&lt;/li&gt;
&lt;li&gt;RGW STS: Made JWKS URL verification configurable for AWS compliance via the rgw_enable_jwks_url_verification configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CephFS / MDS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prevented the MDS from stalling (up to 5 seconds) during rename/stat workloads by forcing the log to nudge for unstable locks after early replies.&lt;/li&gt;
&lt;li&gt;Fixed cephfs-journal-tool so it no longer incorrectly resets the journal trim position during disaster recovery, which was causing stale journal objects to linger forever in the metadata pool.&lt;/li&gt;
&lt;li&gt;Fixed a bug where ll_walk incorrectly processed absolute paths as relative paths.&lt;/li&gt;
&lt;li&gt;Prevented the ceph fs volume create command from accidentally deleting user-created pools if the command aborted during cleanup.&lt;/li&gt;
&lt;li&gt;MDS Batched Operations: Added a new mds_allow_batched_ops configuration option (default: true) to control whether the MDS can batch lookup or getattr RPCs.&lt;/li&gt;
&lt;li&gt;CephFS Subvolumes: Added the ceph fs subvolume snapshot getpath command to allow users to retrieve the absolute path of a subvolume snapshot (a brief CLI sketch of this command and the new MDS option follows this list).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;BlueStore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fixed a bug where the bytes_written_slow performance counter incorrectly reported 0 when using aio_write.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
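&lt;p&gt;The new CephFS items above can be exercised from the CLI. The following is a minimal sketch; the argument order for the snapshot getpath command mirrors the existing ceph fs subvolume getpath command and is an assumption here, so consult the built-in help on your cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Retrieve the absolute path of a subvolume snapshot (argument order assumed;
# check: ceph fs subvolume snapshot getpath -h).
ceph fs subvolume snapshot getpath myvol sub0 snap0 --group_name mygroup

# Turn off MDS batching of lookup/getattr RPCs (it defaults to true).
ceph config set mds mds_allow_batched_ops false
&lt;/code&gt;&lt;/pre&gt;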
&lt;h2 id=&quot;changelog&quot;&gt;Changelog &lt;a class=&quot;link-anchor&quot; href=&quot;#changelog&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;span&gt;&lt;/span&gt;.github: Fix RTD build retrigger (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63616&quot;&gt;pr#63616&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;rgw&gt; Ensure the ETag format is consistent with AWS S3 API (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62608&quot;&gt;pr#62608&lt;/a&gt;, Casey Bodley, liubingrun)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[reef] os/bluestore: fix _extend_log seq advance (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61653&quot;&gt;pr#61653&lt;/a&gt;, Pere Diaz Bou)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[reef] RGW backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63031&quot;&gt;pr#63031&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[reef] rgw/dbstore: Update bucket attrs as part of put_info() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64488&quot;&gt;pr#64488&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;auth: msgr2 can return incorrect allowed_modes through AuthBadMethodFrame (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65334&quot;&gt;pr#65334&lt;/a&gt;, Miki Patel)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;backport build-with-container patches from main (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65188&quot;&gt;pr#65188&lt;/a&gt;, John Mulligan, Dan Mick, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Backport the hybrid_btree2 allocator and prereqs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62539&quot;&gt;pr#62539&lt;/a&gt;, Igor Fedotov, Jrchyang Yu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;backports variants improvements and Dockerfile&lt;span&gt;&lt;/span&gt;.build changes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66012&quot;&gt;pr#66012&lt;/a&gt;, John Mulligan, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kernel: improve DiscardThread life cycle (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65216&quot;&gt;pr#65216&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/KernelDevice: Introduce a cap on the number of pending discards (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62220&quot;&gt;pr#62220&lt;/a&gt;, Joshua Baergen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kerneldevice: notify_all only required when discard_drain wait for condition (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62152&quot;&gt;pr#62152&lt;/a&gt;, Yite Gu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kerneldevice: some fix for device discard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62481&quot;&gt;pr#62481&lt;/a&gt;, Igor Fedotov, Yite Gu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;bluestore/BlueFS: fix bytes_written_slow counter with aio_write (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66353&quot;&gt;pr#66353&lt;/a&gt;, chungfengz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65066&quot;&gt;pr#65066&lt;/a&gt;, John Mulligan, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: add argument groups to organize options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65630&quot;&gt;pr#65630&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: build image variants (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65944&quot;&gt;pr#65944&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: two small fixes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62339&quot;&gt;pr#62339&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-fuse: Improve fuse mount usage message (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61275&quot;&gt;pr#61275&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: allow zapping partitions on multipath devices (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62178&quot;&gt;pr#62178&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: do not convert LVs&#39;s symlink to real path (&lt;a href=&quot;https://github.com/ceph/ceph/pull/59989&quot;&gt;pr#59989&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: fix regex usage in &lt;code&gt;set_dmcrypt_no_workqueue&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62791&quot;&gt;pr#62791&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph&lt;span&gt;&lt;/span&gt;.spec&lt;span&gt;&lt;/span&gt;.in: add man/rgw-gap-list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63999&quot;&gt;pr#63999&lt;/a&gt;, Matan Breizman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph&lt;span&gt;&lt;/span&gt;.spec&lt;span&gt;&lt;/span&gt;.in: Remove rgw-restore-bucket-index&lt;span&gt;&lt;/span&gt;.8* from packaging (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64130&quot;&gt;pr#64130&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs,mon: fs rename must require FS to be offline and refuse_client_session to be set (&lt;a href=&quot;http://tracker.ceph.com/issues/66088&quot;&gt;issue#66088&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61410&quot;&gt;pr#61410&lt;/a&gt;, Rishabh Dave, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-journal-tool: fix segfault during &#39;journal import&#39; from invalid dump file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62114&quot;&gt;pr#62114&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-journal-tool: Journal trimming issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65603&quot;&gt;pr#65603&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-shell: add option to remove xattr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62409&quot;&gt;pr#62409&lt;/a&gt;, Neeraj Pratap Singh)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-top, qa: Remove unnecessary global statements in tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62606&quot;&gt;pr#62606&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-top: exception when terminal size greater than PAD_WIDTH (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61773&quot;&gt;pr#61773&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs: session tracker accounts for killing sessions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65253&quot;&gt;pr#65253&lt;/a&gt;, Abhishek Lekshmanan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix d_reclen for readdir (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61519&quot;&gt;pr#61519&lt;/a&gt;, Xavi Hernandez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fixed a bug that read operation hung (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60695&quot;&gt;pr#60695&lt;/a&gt;, Tod Chen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: Handle empty pathnames for &lt;code&gt;ceph_chownat()&lt;/code&gt; and &lt;code&gt;ceph_statxat()&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61165&quot;&gt;pr#61165&lt;/a&gt;, Anoop C S)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: ll_walk will process absolute paths as relative (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62500&quot;&gt;pr#62500&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: prohibit unprivileged users from setting sgid/suid bits (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66040&quot;&gt;pr#66040&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: return EOPNOTSUPP for fallocate with mode 0 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60657&quot;&gt;pr#60657&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cls/rbd: write image mirror status if state is CREATING (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63236&quot;&gt;pr#63236&lt;/a&gt;, N Balachandran)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cls/rgw: non-versioned listings skip past version suffix (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62591&quot;&gt;pr#62591&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/options: fix the description of osd_max_scrubs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62378&quot;&gt;pr#62378&lt;/a&gt;, Satoru Takeuchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/options: fix typo in description (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64218&quot;&gt;pr#64218&lt;/a&gt;, Lorenz Bausch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/pick_address: Add IPv6 support to is_addr_in_subnet (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62814&quot;&gt;pr#62814&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;container: small container image improvements (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62345&quot;&gt;pr#62345&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;crush: use std::vector instead of variable length arrays (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62014&quot;&gt;pr#62014&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: add iproute2 to build dependencies (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66738&quot;&gt;pr#66738&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian: package mgr/rgw in ceph-mgr-modules-core (&lt;a href=&quot;https://github.com/ceph/ceph/pull/57874&quot;&gt;pr#57874&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/architecture: remove sentence (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61615&quot;&gt;pr#61615&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: Add mention of --zap for OSD removal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62444&quot;&gt;pr#62444&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: Correct indentation in osd&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62428&quot;&gt;pr#62428&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: Fix formatting in osd&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62811&quot;&gt;pr#62811&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: improve rgw&lt;span&gt;&lt;/span&gt;.rst and snmp-gateway&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62695&quot;&gt;pr#62695&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: Add admonition re restarting an OSD service (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62797&quot;&gt;pr#62797&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: Add PG autoscaler advice to upgrade&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62380&quot;&gt;pr#62380&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: clarify &amp;quot;Monitoring OSD State&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61665&quot;&gt;pr#61665&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: Correct formatting in upgrade&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63148&quot;&gt;pr#63148&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: correct markup in rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63074&quot;&gt;pr#63074&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: improve &amp;quot;Maintenance Mode&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63496&quot;&gt;pr#63496&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: s/confg/config/ (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62645&quot;&gt;pr#62645&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: add a note about estimated replay completion time (&lt;a href=&quot;http://tracker.ceph.com/issues/71629&quot;&gt;issue#71629&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/65058&quot;&gt;pr#65058&lt;/a&gt;, Venky Shankar, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: correct ill-formatted command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63502&quot;&gt;pr#63502&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: correct reference structure in fs-volumes&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63545&quot;&gt;pr#63545&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Cosmetic changes and small fixes in cephfs-mirroring&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63468&quot;&gt;pr#63468&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: document first-damage&lt;span&gt;&lt;/span&gt;.py (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63978&quot;&gt;pr#63978&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit ceph-dokan&lt;span&gt;&lt;/span&gt;.rst (1 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64736&quot;&gt;pr#64736&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit ceph-dokan&lt;span&gt;&lt;/span&gt;.rst (2 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64760&quot;&gt;pr#64760&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit ceph-dokan&lt;span&gt;&lt;/span&gt;.rst (3 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64786&quot;&gt;pr#64786&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit disaster-recovery&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64645&quot;&gt;pr#64645&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit disaster-recovery&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64609&quot;&gt;pr#64609&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65380&quot;&gt;pr#65380&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65094&quot;&gt;pr#65094&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65091&quot;&gt;pr#65091&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65126&quot;&gt;pr#65126&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65123&quot;&gt;pr#65123&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65097&quot;&gt;pr#65097&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65078&quot;&gt;pr#65078&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65088&quot;&gt;pr#65088&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65047&quot;&gt;pr#65047&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65044&quot;&gt;pr#65044&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65041&quot;&gt;pr#65041&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65037&quot;&gt;pr#65037&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65026&quot;&gt;pr#65026&lt;/a&gt;, Zac Dover, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64904&quot;&gt;pr#64904&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64901&quot;&gt;pr#64901&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64879&quot;&gt;pr#64879&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64872&quot;&gt;pr#64872&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64853&quot;&gt;pr#64853&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (Slow MDS) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65201&quot;&gt;pr#65201&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Improve mount-using-fuse&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64473&quot;&gt;pr#64473&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: link section for pausing async threads in section for (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62875&quot;&gt;pr#62875&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Update deprecation notice in experimental-features&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63949&quot;&gt;pr#63949&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Update quota&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65083&quot;&gt;pr#65083&lt;/a&gt;, Jannis Speer, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 1 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63299&quot;&gt;pr#63299&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 2 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63274&quot;&gt;pr#63274&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 3 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63548&quot;&gt;pr#63548&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 4 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63661&quot;&gt;pr#63661&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/config: Document how to use :confval: directive for config op… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64167&quot;&gt;pr#64167&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/release-process&lt;span&gt;&lt;/span&gt;.rst: document new Jenkins job for containers (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62613&quot;&gt;pr#62613&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/release-process&lt;span&gt;&lt;/span&gt;.rst: release builds cannot build containers (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61818&quot;&gt;pr#61818&lt;/a&gt;, Dan Mick, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev: Debuggging with gdb (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63994&quot;&gt;pr#63994&lt;/a&gt;, Matan Breizman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev: update link to backporter manual (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63991&quot;&gt;pr#63991&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev:update blkin&lt;span&gt;&lt;/span&gt;.rst doc for lttng trace (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65212&quot;&gt;pr#65212&lt;/a&gt;, lizhipeng)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/glossary: s/OMAP/omap/ (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63738&quot;&gt;pr#63738&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/man/8: Improve mount&lt;span&gt;&lt;/span&gt;.ceph&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65184&quot;&gt;pr#65184&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/ceph_api: edit index&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63198&quot;&gt;pr#63198&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/crash&lt;span&gt;&lt;/span&gt;.rst: remove outdated module enabling instructions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64285&quot;&gt;pr#64285&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/dashboard_plugins: edit feature_toggles&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63705&quot;&gt;pr#63705&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit administrator&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63208&quot;&gt;pr#63208&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit alerts&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63201&quot;&gt;pr#63201&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit cli_api (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63744&quot;&gt;pr#63744&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit cli_api&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63690&quot;&gt;pr#63690&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit crash&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63539&quot;&gt;pr#63539&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit dashboard&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63316&quot;&gt;pr#63316&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit debug&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63394&quot;&gt;pr#63394&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit diskpredictor&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63424&quot;&gt;pr#63424&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit feature_toggles&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63397&quot;&gt;pr#63397&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit hello&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63508&quot;&gt;pr#63508&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit influx&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63455&quot;&gt;pr#63455&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit insights&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63511&quot;&gt;pr#63511&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit iostat&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63681&quot;&gt;pr#63681&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit iostat&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63514&quot;&gt;pr#63514&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit localpool&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63670&quot;&gt;pr#63670&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit localpool&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63551&quot;&gt;pr#63551&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit mds_autoscaler&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63493&quot;&gt;pr#63493&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit modules&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63667&quot;&gt;pr#63667&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit modules&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63578&quot;&gt;pr#63578&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit motd&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63403&quot;&gt;pr#63403&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit nfs&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63664&quot;&gt;pr#63664&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit nfs&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63581&quot;&gt;pr#63581&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit orchestrator&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63584&quot;&gt;pr#63584&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit progress&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63658&quot;&gt;pr#63658&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit progress&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63587&quot;&gt;pr#63587&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit prometheus&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63590&quot;&gt;pr#63590&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63593&quot;&gt;pr#63593&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telegraf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63612&quot;&gt;pr#63612&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (1 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63769&quot;&gt;pr#63769&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (2 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63772&quot;&gt;pr#63772&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (3 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63775&quot;&gt;pr#63775&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (4 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63778&quot;&gt;pr#63778&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64344&quot;&gt;pr#64344&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63810&quot;&gt;pr#63810&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63906&quot;&gt;pr#63906&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63865&quot;&gt;pr#63865&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63693&quot;&gt;pr#63693&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (lines 300-400) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63868&quot;&gt;pr#63868&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: Improve prometheus&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62931&quot;&gt;pr#62931&lt;/a&gt;, Zac Dover, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: Small improvements in rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63626&quot;&gt;pr#63626&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/monitoring: correct list formatting (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63542&quot;&gt;pr#63542&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration/bluestore-config-ref: Fix lowcase typo (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62261&quot;&gt;pr#62261&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration/bluestore-config-ref: Fix lowercase typos (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62291&quot;&gt;pr#62291&lt;/a&gt;, Dan van der Ster)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Correct admonition in ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62621&quot;&gt;pr#62621&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Improve ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63943&quot;&gt;pr#63943&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Mention show-with-defaults and ceph-conf (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65207&quot;&gt;pr#65207&lt;/a&gt;, Niklas Hambüchen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Small improvements in ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64288&quot;&gt;pr#64288&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations/stretch-mode: Improve doc (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61654&quot;&gt;pr#61654&lt;/a&gt;, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Actually mention &lt;code&gt;upmap_max_deviation&lt;/code&gt; setting … (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64119&quot;&gt;pr#64119&lt;/a&gt;, Niklas Hambüchen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: add kernel client procedure to read balancer documentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65440&quot;&gt;pr#65440&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Add settings advice to balancer&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63536&quot;&gt;pr#63536&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Additional improvements to placement-groups&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63650&quot;&gt;pr#63650&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Address suggestions for stretch-mode&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63850&quot;&gt;pr#63850&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: edit cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63696&quot;&gt;pr#63696&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve erasure-code&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62574&quot;&gt;pr#62574&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve health-checks&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65239&quot;&gt;pr#65239&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve placement-groups&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63647&quot;&gt;pr#63647&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve stretch-mode&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63816&quot;&gt;pr#63816&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/ops: add caps restore command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64322&quot;&gt;pr#64322&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/ops: edit cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64497&quot;&gt;pr#64497&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/ops: edit cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63831&quot;&gt;pr#63831&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: document section absent in release &amp;lt; T (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64868&quot;&gt;pr#64868&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: edit balancer&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63684&quot;&gt;pr#63684&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: edit ops/user-management&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63893&quot;&gt;pr#63893&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: enhance &amp;quot;pools&lt;span&gt;&lt;/span&gt;.rst&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63862&quot;&gt;pr#63862&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: improve markup in cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63505&quot;&gt;pr#63505&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: remove clonedata command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64394&quot;&gt;pr#64394&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: repair short underline (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65138&quot;&gt;pr#65138&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: s/enpty/empty/ in pgcalc doc (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63499&quot;&gt;pr#63499&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: Update mClock doc on steps to override OSD IOPS capacity config (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63072&quot;&gt;pr#63072&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/notifications: fix topic details (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62405&quot;&gt;pr#62405&lt;/a&gt;, Laimis Juzeliunas)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/admin&lt;span&gt;&lt;/span&gt;.rst: explain bucket and uid flags for bucket quota (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64022&quot;&gt;pr#64022&lt;/a&gt;, Hyun Jin Kim)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/cloud-transition: fix details (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62835&quot;&gt;pr#62835&lt;/a&gt;, Laimis Juzeliunas)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/s3: Document delete-if-unmodified-since (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64316&quot;&gt;pr#64316&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: add &amp;quot;persistent_topic_size&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64140&quot;&gt;pr#64140&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: add rgw_enable_lc_threads &amp;amp; rgw_enable_gc_threads (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64339&quot;&gt;pr#64339&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic and formatting improvements in vault&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63230&quot;&gt;pr#63230&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic improvements in cloud-transition&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63449&quot;&gt;pr#63449&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic improvements in dynamicresharding&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64059&quot;&gt;pr#64059&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit &amp;quot;Lifecycle Settings&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64548&quot;&gt;pr#64548&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit cloud-transition (1 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64025&quot;&gt;pr#64025&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit config-ref&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64648&quot;&gt;pr#64648&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit metrics&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63813&quot;&gt;pr#63813&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit sentence in metrics&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63701&quot;&gt;pr#63701&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Fix RST syntax rendered as text in oidc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62990&quot;&gt;pr#62990&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: improve &amp;quot;pubsub_push_pending&amp;quot; info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64114&quot;&gt;pr#64114&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve and more consistent formatting (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62910&quot;&gt;pr#62910&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve cloud-restore and cloud-transition (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62667&quot;&gt;pr#62667&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve formatting in layout&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63000&quot;&gt;pr#63000&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve layout&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62450&quot;&gt;pr#62450&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve rgw-cache&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64476&quot;&gt;pr#64476&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Promptify CLI commands and fix formatting in layout&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63916&quot;&gt;pr#63916&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Promptify CLI, cosmetic fixes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62857&quot;&gt;pr#62857&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: remove &amp;quot;pubsub_event_lost&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64127&quot;&gt;pr#64127&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: remove &amp;quot;pubsub_event_triggered&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64156&quot;&gt;pr#64156&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: remove cloud-restore from reef (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65638&quot;&gt;pr#65638&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: update aws specification link (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64096&quot;&gt;pr#64096&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Use ref for hyperlinking to multisite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63312&quot;&gt;pr#63312&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rbd/rbd-config-ref: add clone settings section (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66173&quot;&gt;pr#66173&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rbd: add mirroring troubleshooting info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63847&quot;&gt;pr#63847&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: add man documentation for the rgw-gap-list tool (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63997&quot;&gt;pr#63997&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: clarify path-style vs virtual-hosted-style access (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61987&quot;&gt;pr#61987&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: document Admin and System Users (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62882&quot;&gt;pr#62882&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: remove metrics&lt;span&gt;&lt;/span&gt;.rst which did not apply to reef (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66320&quot;&gt;pr#66320&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: use &#39;confval&#39; directive to render sts config options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63442&quot;&gt;pr#63442&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/src/common/options: mgr&lt;span&gt;&lt;/span&gt;.yaml&lt;span&gt;&lt;/span&gt;.in edit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63765&quot;&gt;pr#63765&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/src: edit osd&lt;span&gt;&lt;/span&gt;.yaml&lt;span&gt;&lt;/span&gt;.in (osd_deep_scrub_interval_cv) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63956&quot;&gt;pr#63956&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/start: edit documenting-ceph&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63653&quot;&gt;pr#63653&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/start: edit documenting-ceph&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63708&quot;&gt;pr#63708&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: add note admonitions in two files (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64493&quot;&gt;pr#64493&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Clarify the status of MS Windows client support (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64482&quot;&gt;pr#64482&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: do not depend on typed-ast (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64400&quot;&gt;pr#64400&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Document ceph-mgr module configuration options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64397&quot;&gt;pr#64397&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: fix formatting in cephfs_mirror dev doc (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63251&quot;&gt;pr#63251&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fix links to mClock config reference (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64798&quot;&gt;pr#64798&lt;/a&gt;, Pierre Riteau)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fix missing blank line Sphinx warnings (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63338&quot;&gt;pr#63338&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fix unterminated inline literal in ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64171&quot;&gt;pr#64171&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fixed a spelling error (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64148&quot;&gt;pr#64148&lt;/a&gt;, &lt;a href=&quot;http://Instelligence.io&quot;&gt;Instelligence.io&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fixes a typo in balancer operations (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65740&quot;&gt;pr#65740&lt;/a&gt;, Tyler Brekke)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: mgr/dashboard: add OAuth2 SSO documentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64034&quot;&gt;pr#64034&lt;/a&gt;, Pedro Gonzalez Gomez, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Pin pip to &amp;lt;25&lt;span&gt;&lt;/span&gt;.3 for RTD as a workaround for pybind in admin/doc-read-the-docs&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66118&quot;&gt;pr#66118&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Remove sphinxcontrib-seqdiag Python package from RTD builds (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67528&quot;&gt;pr#67528&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Revert &amp;quot;doc/radosgw: add &amp;quot;persistent_topic_size&amp;quot;&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64179&quot;&gt;pr#64179&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Revert &amp;quot;doc: mgr/dashboard: add OAuth2 SSO documentation&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66796&quot;&gt;pr#66796&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Revert doc/cephadm: correct markup in rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66971&quot;&gt;pr#66971&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: src/pybind/mgr/dashboard: edit HACKING&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63697&quot;&gt;pr#63697&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: update cephfs-journal-tool docs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63109&quot;&gt;pr#63109&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: update mgr modules notify_types (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64531&quot;&gt;pr#64531&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;fix: the RGW crash caused by special characters (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64052&quot;&gt;pr#64052&lt;/a&gt;, mertsunacoglu, Emin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;github: pin GH Actions to SHA-1 commit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65759&quot;&gt;pr#65759&lt;/a&gt;, Ernesto Puerta)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Handle failures in metric parsing (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65595&quot;&gt;pr#65595&lt;/a&gt;, Anmol Babu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps&lt;span&gt;&lt;/span&gt;.sh: install proper compiler version on Debian/Ubuntu (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66014&quot;&gt;pr#66014&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps: Replace apt-mirror (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66669&quot;&gt;pr#66669&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd/cache/pwl: fix memory leak in SyncPoint persist context cleanup (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64093&quot;&gt;pr#64093&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd/migration/QCOWFormat: don&#39;t complete read_clusters() inline (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64195&quot;&gt;pr#64195&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: disallow &amp;quot;rbd trash mv&amp;quot; if image is in a group (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62967&quot;&gt;pr#62967&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: images aren&#39;t closed in group_snap_*_by_record() on error (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64620&quot;&gt;pr#64620&lt;/a&gt;, Miki Patel)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: respect rbd_default_snapshot_quiesce_mode in group_snap_create() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62962&quot;&gt;pr#62962&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LogMonitor: set no_reply for forward MLog commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62212&quot;&gt;pr#62212&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds/Beacon: wake up the thread in shutdown() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61513&quot;&gt;pr#61513&lt;/a&gt;, Max Kellermann)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: add an asok command to dump export states (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61512&quot;&gt;pr#61512&lt;/a&gt;, Zhansong Gao)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: add more debug logs and log events (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61518&quot;&gt;pr#61518&lt;/a&gt;, Xiubo Li)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: do not process client metrics message with fast dispatch (&lt;a href=&quot;http://tracker.ceph.com/issues/68865&quot;&gt;issue#68865&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61339&quot;&gt;pr#61339&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: drop client metrics during recovery (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61299&quot;&gt;pr#61299&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: dump next_snap when checking dentry corruption (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61978&quot;&gt;pr#61978&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix invalid access of mdr-&amp;gt;dn[0]&lt;span&gt;&lt;/span&gt;.back() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61516&quot;&gt;pr#61516&lt;/a&gt;, Anoop C S)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix invalid access of mdr-&amp;gt;dn[0]&lt;span&gt;&lt;/span&gt;.back() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61450&quot;&gt;pr#61450&lt;/a&gt;, Anoop C S)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix readdir when osd is full (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65348&quot;&gt;pr#65348&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: fix snapdiff result fragmentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65364&quot;&gt;pr#65364&lt;/a&gt;, Igor Fedotov, Md Mahamudur Rahaman Sajib)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: nudge log for unstable locks after early reply (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64540&quot;&gt;pr#64540&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: prevent duplicate wrlock acquisition for a single request (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61839&quot;&gt;pr#61839&lt;/a&gt;, Xiubo Li, Sunnatillo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: session in the importing state cannot be cleared if an export subtree task is interrupted while the state of importer is acking (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61514&quot;&gt;pr#61514&lt;/a&gt;, Zhansong Gao)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: use SimpleLock::WAIT_ALL for wait mask (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67495&quot;&gt;pr#67495&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;memory lock issues causing hangs during connection shutdown (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65786&quot;&gt;pr#65786&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/alerts: enforce ssl context to SMTP_SSL (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66142&quot;&gt;pr#66142&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Fix unfound progress events (&lt;a href=&quot;https://github.com/ceph/ceph/pull/58450&quot;&gt;pr#58450&lt;/a&gt;, Prashant D)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/DaemonState: Minimise time we hold the DaemonStateIndex lock (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65463&quot;&gt;pr#65463&lt;/a&gt;, Brad Hubbard)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: adapt service creation form to support nvmeof creation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63304&quot;&gt;pr#63304&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add &lt;span&gt;&lt;/span&gt;.nvmrc so ci can pick the node version (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64666&quot;&gt;pr#64666&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add ceph_daemon filter to rgw overview grafana panel queries (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62268&quot;&gt;pr#62268&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add prometheus read permission to cluster_mgr role (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62651&quot;&gt;pr#62651&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Dashboard not showing Object/Overview correctly (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62664&quot;&gt;pr#62664&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix access control permissions for roles (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62455&quot;&gt;pr#62455&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix empty ceph version in GET api/hosts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62730&quot;&gt;pr#62730&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix inline markup warning in API documentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64270&quot;&gt;pr#64270&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix make check tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63186&quot;&gt;pr#63186&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix zone update API forcing STANDARD storage class (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65621&quot;&gt;pr#65621&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: show non default realm sync status in rgw overview page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65002&quot;&gt;pr#65002&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: use system packages when running tox (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64612&quot;&gt;pr#64612&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/nfs: validate path when modifying cephfs export (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62278&quot;&gt;pr#62278&lt;/a&gt;, Dhairya Parmar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/rbd_support: always parse interval and start_time in Schedules::remove() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62964&quot;&gt;pr#62964&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/snap_schedule: fix typo in error message during retention add (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65295&quot;&gt;pr#65295&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/snap_schedule: handle volume delete (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61187&quot;&gt;pr#61187&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: add command to get snapshot path (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62917&quot;&gt;pr#62917&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: don&#39;t delete user-created pool in &amp;quot;volume create&amp;quot; command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63069&quot;&gt;pr#63069&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: print proper message when subvolume metadata filename is too long (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62050&quot;&gt;pr#62050&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: allow disabling async job threads (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62436&quot;&gt;pr#62436&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: fix dangling symlink in clone index (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62109&quot;&gt;pr#62109&lt;/a&gt;, Neeraj Pratap Singh)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65297&quot;&gt;pr#65297&lt;/a&gt;, Enrico Bocchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: periodically check for async work (&lt;a href=&quot;http://tracker.ceph.com/issues/61867&quot;&gt;issue#61867&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61230&quot;&gt;pr#61230&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: add status command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62505&quot;&gt;pr#62505&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: allow disabling always-on modules (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60563&quot;&gt;pr#60563&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: process map before notifying clients (&lt;a href=&quot;https://github.com/ceph/ceph/pull/57065&quot;&gt;pr#57065&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon [stretch mode]: support disable_stretch_mode &amp;amp; qa/workunits/mon: ensure election strategy is &amp;quot;connectivity&amp;quot; for stretch mode (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60630&quot;&gt;pr#60630&lt;/a&gt;, Laura Flores, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/AuthMonitor: provide command to rotate the key for a user credential (&lt;a href=&quot;https://github.com/ceph/ceph/pull/58236&quot;&gt;pr#58236&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/test_mon_osdmap_prune: Use first_pinned instead of first_committed (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63343&quot;&gt;pr#63343&lt;/a&gt;, Aishwarya Mathuria)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: Track and process pending pings after election (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62925&quot;&gt;pr#62925&lt;/a&gt;, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitor: Enhance historic ops command output and error handling (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64843&quot;&gt;pr#64843&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: add user-agent headers to the urllib (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65473&quot;&gt;pr#65473&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix MTU Mismatch alert rule and expr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65710&quot;&gt;pr#65710&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;objclass: deprecate cls_cxx_gather (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60195&quot;&gt;pr#60195&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: Disable invoking unittest_deferred (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66359&quot;&gt;pr#66359&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: locally cache compressor engines that have been used (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62145&quot;&gt;pr#62145&lt;/a&gt;, Igor Fedotov, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: fix bdev expansion and more (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62216&quot;&gt;pr#62216&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: Fix ExtentDecoderPartial::_consume_new_blob (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62054&quot;&gt;pr#62054&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: Fix race in BlueFS truncate / remove (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62840&quot;&gt;pr#62840&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: In BlueFS::truncate accept weird alloc_unit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66056&quot;&gt;pr#66056&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: make BlueFS an exclusive selector for volume reserved (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62721&quot;&gt;pr#62721&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scheduler/OpSchedulerItem: Fix calculation of recovery latency counters (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62801&quot;&gt;pr#62801&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: allow longer waits for replicas to respond (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63940&quot;&gt;pr#63940&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: discard repair_oinfo_oid() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62569&quot;&gt;pr#62569&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: add clear_shards_repaired command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60566&quot;&gt;pr#60566&lt;/a&gt;, Daniel Radjenovic)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: don&#39;t send stale hb msgr&#39;s addresses in MOSDBoot (&lt;a href=&quot;https://github.com/ceph/ceph/pull/56520&quot;&gt;pr#56520&lt;/a&gt;, Radosław Zarzyński)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: fix osd mclock queue item leak (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62364&quot;&gt;pr#62364&lt;/a&gt;, Samuel Just)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OSD: Split osd_recovery_sleep into settings applied to degraded or clean PGs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62399&quot;&gt;pr#62399&lt;/a&gt;, Md Mahamudur Rahaman Sajib)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd_types: Restore new_object marking for delete missing entries (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63152&quot;&gt;pr#63152&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OSDMonitor: exclude destroyed OSDs from &amp;quot;ceph node ls&amp;quot; output (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62326&quot;&gt;pr#62326&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OSDMonitor: Make sure pcm is initialised (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63805&quot;&gt;pr#63805&lt;/a&gt;, Brad Hubbard)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;PendingReleaseNotes; doc/rados/operations: document &amp;quot;rm-pg-upmap-primary-{all}&amp;quot; commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62468&quot;&gt;pr#62468&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;PGMap: remove pool max_avail scale factor (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61320&quot;&gt;pr#61320&lt;/a&gt;, Michael J. Kidd)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr/dashboard: Use teuthology&#39;s actual requirements (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65418&quot;&gt;pr#65418&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: attempt to fix mypy importing from python-common (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63313&quot;&gt;pr#63313&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: Fix missing empty lines in mgr_module&lt;span&gt;&lt;/span&gt;.py (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64267&quot;&gt;pr#64267&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: pin cheroot version in requirements-required&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65637&quot;&gt;pr#65637&lt;/a&gt;, Nizamudeen A, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: ignore warning that pg is stuck peering for upgrade jobs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65448&quot;&gt;pr#65448&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: randomize configs in &lt;code&gt;fs:thrash:workloads&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61341&quot;&gt;pr#61341&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: switch to ubuntu 22&lt;span&gt;&lt;/span&gt;.04 for stock kernel testing (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62492&quot;&gt;pr#62492&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: update ignorelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61383&quot;&gt;pr#61383&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/multisite: add extra checkpoints in datalog_autotrim testcase (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61508&quot;&gt;pr#61508&lt;/a&gt;, Shilpa Jagannath)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rbd/iscsi: ignore MON_DOWN warning in logs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64596&quot;&gt;pr#64596&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: bump maven version in hadoop task to resolve 404 Not Found (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63927&quot;&gt;pr#63927&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: fix perl tests missing Amazon::S3 module (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64281&quot;&gt;pr#64281&lt;/a&gt;, Mark Kogan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: remove hadoop-s3a subsuite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64669&quot;&gt;pr#64669&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: run verify tests with garbage collection disabled (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62953&quot;&gt;pr#62953&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/krbd: use a standard fixed-1 cluster in unmap subsuite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64918&quot;&gt;pr#64918&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/orch/cephadm: add PG_DEGRADED to ignorelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63055&quot;&gt;pr#63055&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites: wait longer before stopping OSDs with valgrind (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63717&quot;&gt;pr#63717&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/ceph_manager: population must be a sequence (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64748&quot;&gt;pr#64748&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/cephfs/mount: use &#39;ip route&#39; instead of &#39;route&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63129&quot;&gt;pr#63129&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/workunit: fix no module named &#39;pipes&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66252&quot;&gt;pr#66252&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: added initial test for &lt;code&gt;client-upgrade-reef-tentacle&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64761&quot;&gt;pr#64761&lt;/a&gt;, Yuri Weinstein)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/fs/misc: remove data pool cleanup (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63017&quot;&gt;pr#63017&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: add missing &lt;span&gt;&lt;/span&gt;.qa links (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67529&quot;&gt;pr#67529&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Disable OSD benchmark from running for tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67067&quot;&gt;pr#67067&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: enable debug mds/client for fs/nfs suite (&lt;a href=&quot;http://tracker.ceph.com/issues/63482&quot;&gt;issue#63482&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/65251&quot;&gt;pr#65251&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: fix multi-fs tests in test_mds_metrics&lt;span&gt;&lt;/span&gt;.py (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64340&quot;&gt;pr#64340&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: fix test_cephfs_mirror_stats failure (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62116&quot;&gt;pr#62116&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: ignore pg availability/degraded warnings (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61297&quot;&gt;pr#61297&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: ignore variant of down fs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62092&quot;&gt;pr#62092&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: increase the http&lt;span&gt;&lt;/span&gt;.maxRequestBuffer to 100MB and enable the git debug logs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61279&quot;&gt;pr#61279&lt;/a&gt;, Xiubo Li)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: suppress OpenSSL valgrind leaks (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65663&quot;&gt;pr#65663&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: use a larger timeout for kernel_untar_build workunit (&lt;a href=&quot;http://tracker.ceph.com/issues/68855&quot;&gt;issue#68855&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61340&quot;&gt;pr#61340&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rados/test_crash&lt;span&gt;&lt;/span&gt;.sh: add PG_DEGRADED to ignorelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62396&quot;&gt;pr#62396&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: add cluster fsid to remote meta cache key (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66272&quot;&gt;pr#66272&lt;/a&gt;, Mykola Golub)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: allow incomplete demote snapshot to sync after rbd-mirror daemon restart (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66163&quot;&gt;pr#66163&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: prevent image deletion if remote image is not primary (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64738&quot;&gt;pr#64738&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: release lock before calling m_async_op_tracker&lt;span&gt;&lt;/span&gt;.finish_op() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64091&quot;&gt;pr#64091&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd: display mirror state creating (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62939&quot;&gt;pr#62939&lt;/a&gt;, N Balachandran)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Recent pipeline backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65250&quot;&gt;pr#65250&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;resolve pacific/quincy upgrade failures (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67657&quot;&gt;pr#67657&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/iam: add policy evaluation for Arn-based Conditions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62434&quot;&gt;pr#62434&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/rados: enable object deletion at rados pool quota (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62094&quot;&gt;pr#62094&lt;/a&gt;, Casey Bodley, Samuel Just)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/sts: Implementation of validating JWT using modulus and exponent (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63053&quot;&gt;pr#63053&lt;/a&gt;, Pritha Srivastava)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: Try to handle unwatch errors sensibly (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62403&quot;&gt;pr#62403&lt;/a&gt;, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: add force option to &lt;code&gt;radosgw-admin object rm ...&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64311&quot;&gt;pr#64311&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: add missing last_modified field to swift API (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61553&quot;&gt;pr#61553&lt;/a&gt;, Andrei Ivashchenko)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: allow bucket notification send message to kafka with multiple br… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61825&quot;&gt;pr#61825&lt;/a&gt;, Hoai-Thu Vuong)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: bring rgw-restore-bucket-index up to current version (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64514&quot;&gt;pr#64514&lt;/a&gt;, J. Eric Ivancich, Michael J. Kidd)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: Changed discard buffer size (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63711&quot;&gt;pr#63711&lt;/a&gt;, Artem Vasilev)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: check all JWKS for STS (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64937&quot;&gt;pr#64937&lt;/a&gt;, Alex Wojno)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: correctly set worker thread names (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63095&quot;&gt;pr#63095&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: don&#39;t use merge_and_store_attrs() when recreating a bucket (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64411&quot;&gt;pr#64411&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix &#39;bucket rm --bypass-gc&#39; for copied objects (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66002&quot;&gt;pr#66002&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix bug with rgw-gap-list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62723&quot;&gt;pr#62723&lt;/a&gt;, J. Eric Ivancich, Michael J. Kidd)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix empty storage class on display of multipart uploads (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64312&quot;&gt;pr#64312&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix to correctly store updated attrs in backend store after erasing an attr/attrs for delete ops on a bucket (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61996&quot;&gt;pr#61996&lt;/a&gt;, Pritha Srivastava, Wei Wang)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: Head/GetObject support partNumber (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62544&quot;&gt;pr#62544&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: keep the tails when copying object to itself (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62656&quot;&gt;pr#62656&lt;/a&gt;, Jane Zhu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: make incomplete multipart upload part of bucket check efficient (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64464&quot;&gt;pr#64464&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: make keystone work without admin token (service ac requirement) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64200&quot;&gt;pr#64200&lt;/a&gt;, Deepika Upadhyay)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: make rgw-restore-bucket-index more robust (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64622&quot;&gt;pr#64622&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: optimize bucket listing to skip past regions of namespaced entries (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62234&quot;&gt;pr#62234&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: prevent crash in &lt;code&gt;radosgw-admin bucket object shard ...&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62885&quot;&gt;pr#62885&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: PutObjectLockConfiguration can enable object lock on existing buckets (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62063&quot;&gt;pr#62063&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: radoslist improvements primarily to better support gap list tool (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62418&quot;&gt;pr#62418&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: trigger resharding of versioned buckets sooner (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63598&quot;&gt;pr#63598&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: update keystone repo stable branch to 2024&lt;span&gt;&lt;/span&gt;.2 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66243&quot;&gt;pr#66243&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rocky 9/10 support backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64658&quot;&gt;pr#64658&lt;/a&gt;, Zack Cerza, John Mulligan, David Galloway, Alexander Indenbaum)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;run-make-check&lt;span&gt;&lt;/span&gt;.sh backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65837&quot;&gt;pr#65837&lt;/a&gt;, John Mulligan, luo rixin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;run-make&lt;span&gt;&lt;/span&gt;.sh: Typo in argument addition (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66690&quot;&gt;pr#66690&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;scrub: use a generic interface for scheduling timer based events (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63558&quot;&gt;pr#63558&lt;/a&gt;, Samuel Just, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/common/options: Clarify scope of scrub intervals in osd&lt;span&gt;&lt;/span&gt;.yaml&lt;span&gt;&lt;/span&gt;.in (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63490&quot;&gt;pr#63490&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/common: add guidance for deep-scrubbing ratio warning (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62503&quot;&gt;pr#62503&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/common: add guidance for mon_warn_pg_not_scrubbed (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62552&quot;&gt;pr#62552&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/mon/OSDMonitor&lt;span&gt;&lt;/span&gt;.cc: [Stretch Mode] WRN non-existent CRUSH location assigned to MON (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62040&quot;&gt;pr#62040&lt;/a&gt;, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src: modernize sample&lt;span&gt;&lt;/span&gt;.ceph&lt;span&gt;&lt;/span&gt;.conf (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61642&quot;&gt;pr#61642&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;suites/rados: cache tier deprecated, no need to keep the tests for it (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62210&quot;&gt;pr#62210&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;sync build-with-container patches from main (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65845&quot;&gt;pr#65845&lt;/a&gt;, John Mulligan, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tasks/cephfs/mount: use 192&lt;span&gt;&lt;/span&gt;.168&lt;span&gt;&lt;/span&gt;.144&lt;span&gt;&lt;/span&gt;.0&lt;span&gt;&lt;/span&gt;.0/20 for brxnet (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63134&quot;&gt;pr#63134&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/common: unittest_fault_injector omits unit-main target (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63979&quot;&gt;pr#63979&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/librbd/test_notify&lt;span&gt;&lt;/span&gt;.py: conditionally ignore some errors (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62688&quot;&gt;pr#62688&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/librbd/test_notify&lt;span&gt;&lt;/span&gt;.py: force line-buffered output (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62751&quot;&gt;pr#62751&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/rbd: remove unit tests about cache tiering (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64588&quot;&gt;pr#64588&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;TEST_backfill_grow fails after finding &amp;quot;num_bytes mismatch&amp;quot; in osd log (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60901&quot;&gt;pr#60901&lt;/a&gt;, Mohit Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tools/ceph-objectstore-tool: tricks to tolerate disk errors for &amp;quot;pg export&amp;quot; command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62122&quot;&gt;pr#62122&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wip trackers 50371 67352 67489 69639 reef (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62473&quot;&gt;pr#62473&lt;/a&gt;, Brad Hubbard, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>Assessing the performance of the CLAY Erasure Code Plugin</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part4/" />
    <updated>2026-02-11T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part4/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part4/">&lt;p&gt;CBT Performance Benchmarking - Part 4. What can we say about CLAY?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;&lt;a id=&quot;outline&quot;&gt;&lt;/a&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 4&lt;/strong&gt; - Assessing the performance of the CLAY erasure code plugin&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#client&quot;&gt;Client IO results for CLAY&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#down&quot;&gt;Client IO with an OSD down&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#good&quot;&gt;What is CLAY good at?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#probs&quot;&gt;Problems with using CLAY&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read&quot;&gt;How does CLAY read data from the drive?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#broke&quot;&gt;CLAY is broken in tentacle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;client-io-results-for-clay&quot;&gt;&lt;a id=&quot;client&quot;&gt;&lt;/a&gt;Client IO results for CLAY &lt;a class=&quot;link-anchor&quot; href=&quot;#client-io-results-for-clay&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As a refresher, let&#39;s quickly look back at the &lt;strong&gt;client IO&lt;/strong&gt; results of &lt;strong&gt;CLAY&lt;/strong&gt; compared to &lt;strong&gt;JErasure&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;If we look back to &lt;strong&gt;Step 3&lt;/strong&gt; in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; of the blog &lt;code&gt;(Generating a comparison report)&lt;/code&gt;, we saw that &lt;strong&gt;reads&lt;/strong&gt; had practically identical curves between CLAY &amp;amp; JErasure for both &lt;strong&gt;4K random reads&lt;/strong&gt; and &lt;strong&gt;1024K sequential reads&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;However, when we compared &lt;strong&gt;writes&lt;/strong&gt; we saw that the performance hit for CLAY was substantially larger, particularly at higher bandwidths. The &lt;strong&gt;1024K Sequential Writes&lt;/strong&gt; diagram illustrates this:&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Click to see Part 3 diagrams&lt;/summary&gt;
&lt;p&gt;&lt;img src=&quot;images/part_3_diag.jpg&quot; alt=&quot;alt text&quot; title=&quot;part 3 reference&quot;&gt;&lt;/p&gt;
&lt;/details&gt;
&lt;p&gt;&lt;strong&gt;So why was this?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is because CLAY&#39;s &lt;strong&gt;encoding process&lt;/strong&gt; is significantly more complex. While JErasure performs a single encoding pass, CLAY uses three phases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;50% of data is encoded using &lt;strong&gt;PRT&lt;/strong&gt; (Product Recovery Transform), 50% of the data is copied to form an intermediate set of buffers&lt;/li&gt;
&lt;li&gt;All the intermediate data is encoded using &lt;strong&gt;RS&lt;/strong&gt; (Reed-Solomon) to form a second set of intermediate buffers&lt;/li&gt;
&lt;li&gt;50% of the result is encoded using &lt;strong&gt;PFT&lt;/strong&gt; (Parity Fractional Transform), 50% of the data is copied to form the output buffers&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Essentially, CLAY performs &lt;strong&gt;2x&lt;/strong&gt; the encoding plus an &lt;strong&gt;additional&lt;/strong&gt; memcpy (memory copy) compared to JErasure&#39;s 1x encoding. This overhead therefore directly translates to &lt;strong&gt;lower write throughput&lt;/strong&gt; for CLAY, as shown by the diagrams above. The performance impact increases for larger IO sizes because more data is being encoded.&lt;/p&gt;
&lt;p&gt;See &lt;a href=&quot;https://people.iith.ac.in/mynav/pdfs/talks/Clay_Fast18.pdf&quot;&gt;&#39;Clay Codes: Moulding MDS Codes to Yield an MSR Code&#39;&lt;/a&gt; for more information on CLAY&#39;s encoding process.&lt;/p&gt;
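&lt;p&gt;To get a feel for the raw encode-cost difference on your own hardware, the &lt;code&gt;ceph_erasure_code_benchmark&lt;/code&gt; utility that ships with Ceph&#39;s test binaries can encode buffers with each plugin in isolation. The invocation below is only a sketch: the flag names are assumptions based on the tool&#39;s built-in help and may differ between releases, so check them before relying on the numbers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Baseline: JErasure 4+2, encoding 1 MiB buffers (illustrative invocation)
ceph_erasure_code_benchmark --plugin jerasure --workload encode &#92;
  --size 1048576 --iterations 1000 &#92;
  --parameter k=4 --parameter m=2 --parameter technique=reed_sol_van

# Same workload with CLAY 4+2 (d=5); the extra transform passes and memcpy
# should show up as a lower encode rate than the JErasure run above
ceph_erasure_code_benchmark --plugin clay --workload encode &#92;
  --size 1048576 --iterations 1000 &#92;
  --parameter k=4 --parameter m=2 --parameter d=5
&lt;/code&gt;&lt;/pre&gt;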
&lt;hr&gt;
&lt;h2 id=&quot;client-io-with-an-osd-down&quot;&gt;&lt;a id=&quot;down&quot;&gt;&lt;/a&gt;Client IO with an OSD down &lt;a class=&quot;link-anchor&quot; href=&quot;#client-io-with-an-osd-down&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;details&gt;
&lt;summary&gt;Click to see Part 3 diagram&lt;/summary&gt;
&lt;p&gt;&lt;img src=&quot;images/part_3_down_ref.png&quot; alt=&quot;alt text&quot; title=&quot;part 3 reference with OSD down&quot;&gt;&lt;/p&gt;
&lt;/details&gt;
&lt;p&gt;We then moved on to &lt;strong&gt;Step 4&lt;/strong&gt; in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; of the blog &lt;code&gt;(Running a test with an OSD down)&lt;/code&gt;, and we saw that CLAY&#39;s performance degraded further. The read curves are no longer near-identical (as shown by the above diagram). CLAY is clearly performing worse in this scenario, which we did not initially expect.&lt;/p&gt;
&lt;p&gt;This latency increase is due to the specific implementation of CLAY within Ceph. For &lt;strong&gt;degraded&lt;/strong&gt; read IOs (when a client requests data from a missing shard), the system is configured to read and decode all the data to reconstruct the missing information. Just as the &lt;strong&gt;encode&lt;/strong&gt; process (for write IOs) has higher overheads when using CLAY, the &lt;strong&gt;decode&lt;/strong&gt; process (for degraded read IOs) has similarly higher overheads. This is an implementation choice - when recovering objects (see next section) CLAY uses a more efficient method for recovering the data. This method could also have been used for degraded reads.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;what-is-clay-good-at%3F&quot;&gt;&lt;a id=&quot;good&quot;&gt;&lt;/a&gt;What is CLAY good at? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-is-clay-good-at%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now you may be thinking: if CLAY is slower for writes and degraded reads, why use it? The answer is &lt;strong&gt;network bandwidth optimisation&lt;/strong&gt; during &lt;strong&gt;backfill&lt;/strong&gt; and &lt;strong&gt;recovery&lt;/strong&gt;, the processes that use the erasure code to reconstruct and repair the missing parts of objects.&lt;/p&gt;
&lt;p&gt;While JErasure must read &lt;strong&gt;k&lt;/strong&gt; data shards&#39; worth of data to reconstruct &lt;strong&gt;one&lt;/strong&gt; missing shard, CLAY uses coupled layers to reconstruct the same shard from a significantly smaller amount of data read from the remaining shards. In a standard 4+2 setup, JErasure would need to pull 100% of the data from the other 4 shards to rebuild the 5th.&lt;/p&gt;
&lt;p&gt;This is what it would look like if we were to use JErasure and simulate a recovery of data when shard 0 is missing:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/jerasure1a.jpg&quot; alt=&quot;alt text&quot; title=&quot;jerasure eg&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now we will compare this to how CLAY would recover data if shard 0 was missing. &lt;code&gt;CLAY reduces this traffic by approximately 50%&lt;/code&gt; as you can see in the below example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/clay1b.jpg&quot; alt=&quot;alt text&quot; title=&quot;clay eg&quot;&gt;&lt;/p&gt;
&lt;p&gt;It&#39;s important to note that the above configuration has a chosen non-default stripe unit of &lt;strong&gt;32K&lt;/strong&gt; which, with a 4+2 CLAY code, results in a sub-chunk size of &lt;strong&gt;4K&lt;/strong&gt; and matches both the &lt;strong&gt;NVMe block size&lt;/strong&gt; and the &lt;strong&gt;Bluestore allocation unit&lt;/strong&gt;. See the Ceph documentation &lt;a href=&quot;https://docs.ceph.com/en/latest/rados/operations/erasure-code-clay/&quot;&gt;here&lt;/a&gt; for how to calculate the sub-chunk size for your configuration.&lt;/p&gt;
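&lt;p&gt;As a concrete illustration, such a profile could be created roughly as follows. This is only a sketch: the profile and pool names are hypothetical, and you should confirm the parameters against the CLAY documentation linked above before using them on a real cluster.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Illustrative 4+2 CLAY profile (d=5) with a 32K stripe unit
ceph osd erasure-code-profile set clay_4_2_32k plugin=clay k=4 m=2 d=5 &#92;
  stripe_unit=32K crush-failure-domain=host

# Sub-chunk arithmetic per the CLAY docs: q = d-k+1 = 2,
# sub-chunks per chunk = q^ceil((k+m)/q) = 2^3 = 8,
# so a 32K stripe unit gives 32K / 8 = 4K sub-chunks
# (and a 4K stripe unit would give 512-byte sub-chunks)

# Create an erasure-coded pool that uses the profile
ceph osd pool create clay_test_pool 64 64 erasure clay_4_2_32k
&lt;/code&gt;&lt;/pre&gt;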
&lt;p&gt;In the CLAY example above &lt;strong&gt;more&lt;/strong&gt; data shards are read, but overall &lt;strong&gt;less&lt;/strong&gt; data is read. For our configuration, CLAY is therefore more efficient at recovering data when a shard is missing.&lt;/p&gt;
&lt;p&gt;With this erasure code profile CLAY will always read &lt;strong&gt;50%&lt;/strong&gt; of each other shard to recover a missing shard, however the subchunks that are read will vary depending on which shard is missing. The next diagram shows which subchunks will be read for each missing shard:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/clay2.jpg&quot; alt=&quot;alt text&quot; title=&quot;clay 2eg&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: While the diagram shows a 50% saving in network traffic, this comes at the &lt;code&gt;cost of IOPS&lt;/code&gt;. Shards 4 and 5 must perform &lt;strong&gt;four&lt;/strong&gt; individual reads per stripe to gather those specific sub-chunks, so the IO cost depends on which shard is missing.&lt;/p&gt;
&lt;p&gt;In summary, CLAY reads much less data than JErasure during recovery/backfill, saving approximately &lt;strong&gt;50%&lt;/strong&gt; of the network bandwidth, which should improve recovery performance on systems that are limited by the network.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;problems-with-using-clay&quot;&gt;&lt;a id=&quot;probs&quot;&gt;&lt;/a&gt;Problems with using CLAY &lt;a class=&quot;link-anchor&quot; href=&quot;#problems-with-using-clay&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Choosing your stripe unit is critical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If stripe unit is 4K:&lt;/strong&gt; Sub-chunks become tiny (512 bytes) and reads of less than 4K are &lt;code&gt;rounded up&lt;/code&gt; to 4K.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This leads to &lt;strong&gt;extra&lt;/strong&gt; data reads because the NVMe block size is 4K. Recovery therefore reads &lt;strong&gt;1x to 4x&lt;/strong&gt; the amount of data from the drives while transmitting &lt;strong&gt;50% less&lt;/strong&gt; data across the network, and it still costs many more &lt;strong&gt;IOPS&lt;/strong&gt; and much more &lt;strong&gt;CPU&lt;/strong&gt; in this scenario.&lt;/p&gt;
&lt;p&gt;Let&#39;s break this down a step further using examples of shards 0, 1 and 2 missing. The blue shows the amount of data that we want to read, while the orange shows the amount of data that is actually read due to these alignment issues.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/helper.jpg&quot; alt=&quot;alt text&quot; title=&quot;helper eg&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If stripe unit is 32K:&lt;/strong&gt; This fixes the fragmentation issue that we see above (sub-chunks align better with 4K drive blocks), but introduces some problems for both classic and fast EC:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a classic EC pool, any overwrite requires reading the &lt;strong&gt;entire&lt;/strong&gt; stripe, even if you only changed &lt;strong&gt;one&lt;/strong&gt; byte. At 32K, small writes become incredibly expensive because of the &lt;code&gt;Read-Modify-Write&lt;/code&gt; overhead. In classic EC, objects are padded to a multiple of the stripe width, so a larger stripe unit &lt;strong&gt;increases&lt;/strong&gt; wasted capacity. In fast EC, objects are not padded, but a larger stripe unit still results in &lt;strong&gt;more&lt;/strong&gt; coding parity data and &lt;strong&gt;less&lt;/strong&gt; storage efficiency. So there are still negatives to bear in mind if you pick a stripe unit of 32K.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;how-does-clay-read-data-from-the-drive%3F&quot;&gt;&lt;a id=&quot;read&quot;&gt;&lt;/a&gt;How does CLAY read data from the drive? &lt;a class=&quot;link-anchor&quot; href=&quot;#how-does-clay-read-data-from-the-drive%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;fragmented-reads&quot;&gt;Fragmented Reads &lt;a class=&quot;link-anchor&quot; href=&quot;#fragmented-reads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As shown above, CLAY issues &lt;strong&gt;fragmented reads&lt;/strong&gt;. If the stripe unit gets smaller, for example &lt;strong&gt;4K&lt;/strong&gt;, the sub-chunk size drops to &lt;strong&gt;512 bytes&lt;/strong&gt;. Because NVMe and HDD drives have a minimum block size of &lt;strong&gt;4K&lt;/strong&gt;, any 512 byte read is &lt;strong&gt;rounded up&lt;/strong&gt; to this 4K minimum. This can result in CLAY reading the same 4K block multiple times to extract different 512 byte sub-chunks, discarding the rest of the data. This wastes &lt;strong&gt;CPU&lt;/strong&gt; and &lt;strong&gt;drive IOPs&lt;/strong&gt;, so if either of these is your performance bottleneck this is not a good scenario.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Squid&lt;/strong&gt; recovery also always tries to read &lt;strong&gt;2MB&lt;/strong&gt; from each stripe and expects the read to be truncated if the object is smaller than &lt;code&gt;2MB * number of stripes&lt;/code&gt;. With CLAY this results in a lot of small reads being issued beyond the end of the object. While these quickly fail and do not stop CLAY from recovering the data, they do waste &lt;strong&gt;additional CPU resources&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Referring to the same &lt;a href=&quot;https://people.iith.ac.in/mynav/pdfs/talks/Clay_Fast18.pdf&quot;&gt;paper&lt;/a&gt; as before: its results show that encoding data can take up to &lt;strong&gt;70%&lt;/strong&gt; longer in terms of CPU usage; if your cluster &lt;strong&gt;isn&#39;t&lt;/strong&gt; CPU limited you won&#39;t notice this. The same results also showed dramatic savings in &lt;strong&gt;backfill&lt;/strong&gt; and &lt;strong&gt;recovery&lt;/strong&gt; time - but they were obtained on a system that was network limited and used much wider erasure codes (a 26 node cluster) than most people would typically deploy.&lt;/p&gt;
&lt;p&gt;There is scope to improve the implementation of CLAY - currently the reads are issued &lt;strong&gt;serially&lt;/strong&gt;, which adds a lot of latency to the recovery. A more efficient approach would be to issue the reads in &lt;strong&gt;parallel&lt;/strong&gt; using &lt;code&gt;readv&lt;/code&gt;, or to read the entire stripe into memory once and then transmit only the required data over the network. The latter would be the better method: it would trade &lt;strong&gt;drive bandwidth&lt;/strong&gt; for a considerable saving in &lt;strong&gt;CPU utilisation&lt;/strong&gt; and &lt;strong&gt;drive IOPs&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&quot;more-in-depth%3A&quot;&gt;More in depth: &lt;a class=&quot;link-anchor&quot; href=&quot;#more-in-depth%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We went over the 3 phases of how CLAY encodes data earlier. Decoding is also done in 3 phases, but on half the quantity of data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;25% of the data is decoded using &lt;strong&gt;PRT&lt;/strong&gt;, 25% of the data is copied to form an intermediate set of buffers&lt;/li&gt;
&lt;li&gt;All (50%) of the intermediate data is decoded using &lt;strong&gt;RS&lt;/strong&gt; to form a 2nd set of intermediate buffers&lt;/li&gt;
&lt;li&gt;25% of the data is decoded using &lt;strong&gt;PFT&lt;/strong&gt;, 25% of the data is copied to form the output data&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore, CLAY has an additional &lt;strong&gt;0.5x memcpy&lt;/strong&gt; of the data and the &lt;strong&gt;same&lt;/strong&gt; decoding cost as JErasure. Hence there is slightly more overhead for CLAY (memcpys plus slight inefficiencies from performing several smaller decodes rather than one large decode). However, CLAY requires less data to perform the recovery, so we save on &lt;strong&gt;network bandwidth&lt;/strong&gt; (and, if implemented correctly, &lt;strong&gt;drive bandwidth&lt;/strong&gt;).&lt;/p&gt;
&lt;p&gt;To round off:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CLAY has &lt;strong&gt;higher&lt;/strong&gt; encoding costs and the &lt;strong&gt;same&lt;/strong&gt; decoding cost&lt;/li&gt;
&lt;li&gt;CLAY has some memcpy&#39;s that JErasure does not have&lt;/li&gt;
&lt;li&gt;CLAY has multiple encode/decode steps and there will be some small overheads/inefficiencies - for example, encoding 12K of data in 3 batches of 4K (CLAY) versus encoding 12K of data in 1 batch (JErasure)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;clay-is-broken-in-tentacle&quot;&gt;&lt;a id=&quot;broke&quot;&gt;&lt;/a&gt;CLAY is broken in Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#clay-is-broken-in-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When performing benchmarking on the Tentacle release, a significant issue was discovered: The recovery benefit was &lt;strong&gt;non-existent&lt;/strong&gt; for Tentacle.&lt;/p&gt;
&lt;p&gt;In the tests, recovery in Tentacle transmitted the &lt;strong&gt;full&lt;/strong&gt; amount of data, behaving like standard JErasure but with the &lt;strong&gt;higher&lt;/strong&gt; CPU overhead of CLAY. Squid is not affected, which is why it was used for the updated performance benchmarking shown throughout this blog.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;summary&quot;&gt;&lt;a id=&quot;summary&quot;&gt;&lt;/a&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;CLAY is a fascinating project and definitely has potential, but for the average user it remains niche.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I&#39;d recommend CLAY if:&lt;/strong&gt; Your cluster is strictly &lt;strong&gt;Network Bottlenecked&lt;/strong&gt; and you use wide erasure codes (e.g. 20+ nodes), where the 50% saving is very considerable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I&#39;d recommend you avoid CLAY if:&lt;/strong&gt; You are &lt;strong&gt;CPU&lt;/strong&gt; or &lt;strong&gt;IOPs&lt;/strong&gt; limited, or if you primarily use HDDs, as the fragmented serial reads will cripple recovery performance.&lt;/p&gt;
&lt;p&gt;For most production environments, I believe the simplicity and predictable performance of JErasure remain the better choice.&lt;/p&gt;
&lt;p&gt;Please note that there is a plan to end support for CLAY from the V release. Please see &lt;a href=&quot;https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/&quot;&gt;here&lt;/a&gt; for more details.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://ceph.io/en/community/connect/&quot;&gt;Link to connect with Ceph on slack&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Contact us in the &lt;strong&gt;#cbt&lt;/strong&gt; channel in the Ceph on slack workspace above!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;#outline&quot;&gt;Link to previous parts of the blog series&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>RGW Bucket Resharding Without Pausing</title>
    <link href="https://ceph.io/en/news/blog/2026/rgw-improved-resharding/" />
    <updated>2026-02-01T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/rgw-improved-resharding/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/rgw-improved-resharding/">&lt;h2 id=&quot;introduction%3A-the-foundation-of-scalable-object-storage&quot;&gt;Introduction: The Foundation of Scalable Object Storage &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-the-foundation-of-scalable-object-storage&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the modern data landscape, object storage has evolved from a simple file
repository into the foundational layer for AI/ML pipelines, data lakehouses,
real-time analytics, and massive-scale archival systems. At the heart of this
evolution is a deceptively simple
question: &lt;strong&gt;How do you efficiently locate and access billions of objects stored in a single bucket?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The answer lies in one of Ceph&#39;s most critical performance
mechanisms: &lt;strong&gt;bucket index sharding&lt;/strong&gt;. This architectural
pattern divides a bucket&#39;s index into multiple parallel structures, enabling
concurrent operations across thousands of objects while maintaining the
consistency and reliability that enterprise workloads demand.&lt;/p&gt;
&lt;p&gt;But there&#39;s always been a catch. As workloads grow and evolve, buckets need
to be resharded. Historically, when the buckets to be resharded had a vast
number of objects, this operation came with a painful trade-off: blocking
client writes from seconds to minutes, with a chance of causing application
disruptions, 504 Gateway errors, and operational headaches.&lt;/p&gt;
&lt;p&gt;With Ceph Tentacle, we&#39;re eliminating this trade-off. The new near-zero
impact bucket resharding architecture transforms what was once a
maintenance window event into a seamless background operation
that your applications will never notice.&lt;/p&gt;
&lt;p&gt;Note: As of 2026/02/05, the functionality described in this article is expected
in an upcoming Tentacle update.&lt;/p&gt;
&lt;h3 id=&quot;executive-summary&quot;&gt;Executive Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#executive-summary&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The Challenge&lt;/strong&gt;: In Ceph Squid, resharding a 20-million-object bucket blocked
writes for 4+ minutes, returning 504 errors. Even larger buckets (500M objects)
required 94 minutes of complete write unavailability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Ceph Tentacle&#39;s two-phase architecture moves the heavy lifting to a non-blocking background phase, eliminating the impact on client I/O.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Results&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/24e09a3f-0c15-4b5d-8ec4-c37508617d7a.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;(note: in this graphic 8.1 refers to Squid and 9.0 to Tentacle)&lt;/p&gt;
&lt;p&gt;In this deep dive, we&#39;ll explore:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Why bucket sharding is essential for modern workloads&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The challenges of resharding in Ceph Squid and earlier versions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The enhanced two-phase architecture in Ceph Tentacle&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Before/after performance comparison from production testing&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The future of bucket indexing with in-order sharding&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;the-scalability-enabler%3A-understanding-bucket-index-sharding&quot;&gt;The Scalability Enabler: Understanding Bucket Index Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#the-scalability-enabler%3A-understanding-bucket-index-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;the-bucket-index-and-omaps&quot;&gt;The Bucket Index and omaps &lt;a class=&quot;link-anchor&quot; href=&quot;#the-bucket-index-and-omaps&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Ceph&#39;s Object Gateway (RGW), the ability to list bucket contents is fundamental
to object storage operations. The Object Gateway implements this using a
dedicated structure called the bucket index, which maintains an inventory of
all objects in a bucket. This index is stored using a special RADOS feature
called the Object Map (omap) - essentially a key-value store associated with a
RADOS object, physically residing in the RocksDB database on each OSD&#39;s DB partition.&lt;/p&gt;
&lt;p&gt;Without sharding, a bucket&#39;s entire index is stored in a single RADOS object.
While elegant in its simplicity, this creates a fundamental performance problem:&lt;/p&gt;
&lt;p&gt;The Single-Index Bottleneck: Since only one operation can modify this index at
a time, you&#39;re looking at complete serialization. Write operations must queue
and wait their turn to update the index. As your bucket grows to millions of
objects with thousands of concurrent write operations, this serialization
becomes a severe bottleneck.&lt;/p&gt;
&lt;p&gt;Think of it like a busy airport with only one runway. No matter how many planes
are waiting to land, only one can touch down at a time.&lt;/p&gt;
&lt;h3 id=&quot;sharding%3A-parallelism-through-distribution&quot;&gt;Sharding: Parallelism Through Distribution &lt;a class=&quot;link-anchor&quot; href=&quot;#sharding%3A-parallelism-through-distribution&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Bucket index sharding&lt;/strong&gt; solves this bottleneck by dividing the index into
multiple parts (shards), with each shard stored as a separate RADOS object
within the &lt;code&gt;.rgw.buckets.index&lt;/code&gt; pool. When an object is written, the Ceph
Object Gateway (RGW) calculates a hash of the object&#39;s name to determine
which shard should receive the index update. This enables multiple operations
to run concurrently across multiple Placement Groups (PGs), distributing
requests among the OSDs that host the index pool.&lt;/p&gt;
&lt;p&gt;Returning to our airport analogy: you now have multiple runways, each handling
different aircraft simultaneously. The more runways (shards) you have, the more
parallel operations you can support.&lt;/p&gt;
&lt;p&gt;The sharding mechanism uses the &lt;code&gt;rgw_max_objs_per_shard&lt;/code&gt; tunable (default:
100,000 objects per shard) to determine optimal distribution.&lt;/p&gt;
&lt;p&gt;We recommend maintaining no more than 102,400 objects per shard for optimal performance.&lt;/p&gt;
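&lt;p&gt;To see how this plays out on an existing bucket, the shard count and the tunable can be inspected with standard tooling. The commands below are a minimal sketch; the bucket name is hypothetical, and output field names can vary slightly between releases:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Current shard count and object counts for a bucket (name is illustrative)
radosgw-admin bucket stats --bucket=analytics-lakehouse | grep -E &amp;quot;num_shards|num_objects&amp;quot;

# Report buckets whose shards are approaching or exceeding the fill threshold
radosgw-admin bucket limit check

# The per-shard object threshold that dynamic resharding works from
ceph config get client.rgw rgw_max_objs_per_shard
&lt;/code&gt;&lt;/pre&gt;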
&lt;h3 id=&quot;why-single-bucket-scale-is-mission-critical-in-modern-object-workloads&quot;&gt;Why Single-Bucket Scale is Mission-Critical in Modern Object Workloads &lt;a class=&quot;link-anchor&quot; href=&quot;#why-single-bucket-scale-is-mission-critical-in-modern-object-workloads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Here&#39;s where bucket sharding becomes even more critical: modern analytics
architectures are converging on single-bucket designs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Data Lakehouse Pattern&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apache Iceberg, Apache Hudi, and Delta Lake (the table formats revolutionizing
data architecture) organize petabytes of data within a single bucket using
hierarchical prefixes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;s3://analytics-lakehouse/
├── warehouse/
│   ├── sales_db/
│   │   └── transactions/
│   │       ├── data/
│   │       │   ├── year=2025/month=11/
│   │       │   │   ├── 00045-23-a1b2c3d4.parquet
│   │       │   │   └── 00046-24-e5f6g7h8.parquet
│   │       │   └── year=2025/month=10/
│   │       └── metadata/
│   │           ├── v1.metadata.json
│   │           ├── v2.metadata.json
│   │           └── snap-1234567890.avro
│   └── customer_db/
│       └── profiles/
├── staging/
└── archive/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The implication?&lt;/strong&gt; Modern data platforms need buckets that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Scale to billions of objects distributed across thousands of prefixes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Handle mixed workloads: batch ETL, interactive queries, real-time streaming&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adapt dynamically to growth and contraction without downtime&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Maintain sub-second listing performance across massive object counts&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is precisely where seamless resharding becomes absolutely critical.&lt;/p&gt;
&lt;h2 id=&quot;the-challenge%3A-resharding-in-ceph-squid-and-earlier&quot;&gt;The Challenge: Resharding in Ceph Squid and Earlier &lt;a class=&quot;link-anchor&quot; href=&quot;#the-challenge%3A-resharding-in-ceph-squid-and-earlier&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Understanding the improvements in Ceph Tentacle requires understanding the challenges of the previous approach.&lt;/p&gt;
&lt;h3 id=&quot;the-blocking-resharding-process&quot;&gt;The Blocking Resharding Process &lt;a class=&quot;link-anchor&quot; href=&quot;#the-blocking-resharding-process&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Ceph Squid and earlier versions, bucket resharding followed this process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Resharding operation initiates (manually or via dynamic resharding)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All client write operations are blocked to the bucket&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Index entries are copied from source shards to destination shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Applications receive 504 Gateway Timeout errors&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Operations teams monitor progress&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In buckets with small object counts, resharding was almost unnoticeable, but as
object counts grew, the write pause could last from minutes to hours, depending
on bucket size. Read operations continued, but the write unavailability required
careful planning for production workloads.&lt;/p&gt;
&lt;h3 id=&quot;the-operational-impact&quot;&gt;The Operational Impact &lt;a class=&quot;link-anchor&quot; href=&quot;#the-operational-impact&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This blocking behavior created several operational constraints:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maintenance Windows&lt;/strong&gt;: Resharding typically requires scheduling during off-peak
hours with advance notification to application teams.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Capacity Planning Tradeoffs&lt;/strong&gt;: Teams set high initial shard counts based on
pre-sharding usage estimates for the bucket, but these are hard to calculate up front.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Resharding Concerns&lt;/strong&gt;: Automatic reshards could trigger during peak
business hours, potentially causing disruptions. Some organizations disabled dynamic resharding entirely and managed sharding manually.&lt;/p&gt;
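&lt;p&gt;For operators who took that route, the relevant knobs look roughly like this. This is a sketch using the standard RGW option and admin commands; confirm the defaults and behaviour for your release before changing anything:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Is dynamic resharding currently enabled for the RGW daemons?
ceph config get client.rgw rgw_dynamic_resharding

# Buckets currently queued for automatic resharding
radosgw-admin reshard list

# Organizations that wanted full control disabled it and resharded manually
ceph config set client.rgw rgw_dynamic_resharding false
&lt;/code&gt;&lt;/pre&gt;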
&lt;p&gt;Ceph Tentacle addresses these challenges with a fundamentally different approach.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1768993012078/46db01ac-dc0c-46e0-a36c-d1f99096f874.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;the-solution%3A-non-pausing-resharding-in-ceph-tentacle&quot;&gt;The Solution: Non-Pausing Resharding in Ceph Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-non-pausing-resharding-in-ceph-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now let&#39;s explore what changes in Ceph Tentacle - and why it&#39;s transformational.&lt;/p&gt;
&lt;h3 id=&quot;reshard-two-phase-architecture&quot;&gt;Reshard Two-Phase Architecture &lt;a class=&quot;link-anchor&quot; href=&quot;#reshard-two-phase-architecture&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;f0a0b81b-5779-474e-8920-8ede4da5d1a4.jpeg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) engineering team fundamentally redesigned the resharding
process from the ground up. Instead of blocking all writes while copying index entries,
Ceph Tentacle introduces an intelligent two-phase incremental approach that keeps your
applications running:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1: Log Record Phase (Non-Blocking)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;During this phase, which comprises the bulk of the resharding operation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Client writes continue normally&lt;/strong&gt; - no blocking whatsoever&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Index operations are logged&lt;/strong&gt; to source shards alongside regular write operations&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Background migration begins&lt;/strong&gt; - existing index entries start copying to destination shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Change tracking&lt;/strong&gt; - a sophisticated logging mechanism captures all modifications&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Phase 2: Progress Phase (Minimal Pause, zero client impact)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Only after the bulk of entries have been migrated does Phase 2 begin:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Brief write pause&lt;/strong&gt; - milliseconds to low seconds, with negligible client impact&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Log synchronization&lt;/strong&gt; - recent changes recorded during Phase 1 are applied to destination shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Conflict resolution&lt;/strong&gt; - entries modified during migration are reconciled&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bucket stats recalculation&lt;/strong&gt; - metadata is updated to reflect the new shard layout&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cutover&lt;/strong&gt; - bucket switches to the new index layout&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Normal operations resume&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The Key Innovation:&lt;/strong&gt; By recording changes as lightweight logs during Phase 1,
the system only needs to synchronize recent modifications during the brief Phase
2 pause. The bulk of the work - migrating millions of existing entries - happens
entirely in the background while your applications continue writing uninterrupted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backward Compatibility:&lt;/strong&gt; Ceph Tentacle&#39;s resharding maintains compatibility
as a superset of the previous implementation. If some OSD nodes haven&#39;t yet
upgraded, resharding safely fails rather than risking data loss, and the
system checks version compatibility before proceeding.&lt;/p&gt;
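&lt;p&gt;Operationally, the commands are the ones operators already know; what changes in Tentacle is the client-side impact while they run. A minimal sketch, assuming a hypothetical bucket named &lt;code&gt;analytics-lakehouse&lt;/code&gt; and a target of 10,001 shards:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Trigger a manual upshard; in Tentacle, Phase 1 runs in the background
# while client writes continue
radosgw-admin bucket reshard --bucket=analytics-lakehouse --num-shards=10001

# Check the reshard state/progress for the bucket
radosgw-admin reshard status --bucket=analytics-lakehouse

# Queue-driven (dynamic or scheduled) resharding is handled the same way as before
radosgw-admin reshard list
radosgw-admin reshard process
&lt;/code&gt;&lt;/pre&gt;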
&lt;h3 id=&quot;what-this-means-for-your-operations&quot;&gt;What This Means For Your Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#what-this-means-for-your-operations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The practical implications extend far beyond eliminating 504 errors:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Eliminate Maintenance Windows.&lt;/strong&gt; No more scheduling resharding operations for 2 AM on Sunday. Trigger reshards during peak business hours if needed - your applications won&#39;t notice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Enable True Dynamic Scaling&lt;/strong&gt;&lt;br&gt;
Dynamic bucket resharding can now be fully trusted. The automation you&#39;ve wanted - automatic scaling up and down with minimal client interruption.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Production Confidence.&lt;/strong&gt; Deploy resharding changes without coordination, without warning application teams, without anxiety. It just works.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Faster Response to Demand.&lt;/strong&gt; Workload explodes? Trigger an immediate upshard. No more waiting for a maintenance window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Simplified Operations.&lt;/strong&gt; One less thing requiring complex runbooks, escalation procedures, and off-hours coordination. Focus on value-add activities instead.&lt;/p&gt;
&lt;h2 id=&quot;performance-comparison%3A-before-and-after&quot;&gt;Performance Comparison: Before and After &lt;a class=&quot;link-anchor&quot; href=&quot;#performance-comparison%3A-before-and-after&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To validate the architectural improvements, we conducted extensive testing
comparing Ceph Squid and Tentacle under identical conditions. The results
demonstrate the transformational impact of near-zero-impact resharding.&lt;/p&gt;
&lt;h3 id=&quot;test-scenario%3A-small-scale-bucket-with-20-million-objects&quot;&gt;Test Scenario: Small-Scale Bucket with 20 Million Objects &lt;a class=&quot;link-anchor&quot; href=&quot;#test-scenario%3A-small-scale-bucket-with-20-million-objects&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;: Single-site deployment using &lt;code&gt;s3cmd&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bucket size&lt;/strong&gt;: ~20 million objects&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resharding operation&lt;/strong&gt;: Manual upshard (401 → 10,001 shards for 8.1, 307 → 10,001 for 9.0)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test action&lt;/strong&gt;: Upload a 300MB object during active reshard&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;9093c5b6-e337-45ce-bc3c-027b9490d5a6.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Impact:&lt;/strong&gt; Uploads that previously required 4+ minutes due to complete blocking
now complete in 17 seconds for 300MB objects, with zero errors. That&#39;s a 93% reduction
in client-perceived latency - or more accurately, the elimination of the problem entirely.&lt;/p&gt;
&lt;p&gt;From an application perspective, resharding is now completely transparent. Your
applications continue serving requests without any indication that a major
infrastructure operation is happening beneath them.&lt;/p&gt;
&lt;h3 id=&quot;test-scenario%3A-medium-scale-bucket-with-500-million-objects&quot;&gt;Test Scenario: Medium-Scale Bucket with 500 Million Objects &lt;a class=&quot;link-anchor&quot; href=&quot;#test-scenario%3A-medium-scale-bucket-with-500-million-objects&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For larger buckets, the improvements are even more dramatic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test Methodology Note&lt;/strong&gt;: This test was deliberately conducted as a stress
scenario to evaluate behavior under extreme conditions. The cluster was pushed
to near-saturation with concurrent large-object uploads during resharding
operations. This aggressive test configuration amplifies resharding times
significantly beyond typical production scenarios, allowing us to validate
the improvements under worst-case conditions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;: Single-site deployment using s3cmd&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt;: Upload 300MB and 1GB objects during downshard operation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resharding operation&lt;/strong&gt;: Downshard from 10,001 → 1,999 shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;: Concurrent large uploads pushing cluster toward capacity limits&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The Results:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/3460f4e3-7fd6-4b8b-9300-aad79fea10da.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Impact:&lt;/strong&gt; While typical production resharding in Ceph Squid would
complete faster than the 94 minutes shown here, this stress test reveals
critical behavior differences. Under load, Ceph Squid&#39;s blocking architecture
creates cascading issues - the longer the reshard takes, the longer
applications are blocked, potentially triggering timeouts and retry
storms. Ceph Tentacle&#39;s non-blocking architecture eliminates this
entire failure mode. Whether resharding takes 10 minutes or 90 minutes,
applications continue operating normally.&lt;/p&gt;
&lt;h3 id=&quot;at-a-glance%3A-the-transformation&quot;&gt;At a Glance: The Transformation &lt;a class=&quot;link-anchor&quot; href=&quot;#at-a-glance%3A-the-transformation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Ceph Squid&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Ceph Tentacle&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Client Impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complete write blocking&lt;/td&gt;
&lt;td&gt;Zero write blocking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;504 Gateway errors&lt;/td&gt;
&lt;td&gt;No errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;20M Object Upshard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4m23s blocked&lt;/td&gt;
&lt;td&gt;17s upload (no pause)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500M Object Downshard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;94 minutes blocked&lt;/td&gt;
&lt;td&gt;5-17s uploads (no pause)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance Window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;td&gt;Not required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic Resharding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Often disabled&lt;/td&gt;
&lt;td&gt;Enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;looking-forward%3A-the-future-of-bucket-indexing&quot;&gt;Looking Forward: The Future of Bucket Indexing &lt;a class=&quot;link-anchor&quot; href=&quot;#looking-forward%3A-the-future-of-bucket-indexing&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The near-zero-impact bucket resharding feature in Ceph Tentacle is
transformational, but it&#39;s part of a broader evolution in how Ceph
handles bucket indexing at scale.&lt;/p&gt;
&lt;h3 id=&quot;in-order-sharding%3A-the-next-frontier&quot;&gt;In-Order Sharding: The Next Frontier &lt;a class=&quot;link-anchor&quot; href=&quot;#in-order-sharding%3A-the-next-frontier&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Currently, RGW&#39;s hashed sharding optimizes for write distribution but presents
challenges for alphabetical listing operations. To fulfill a paginated list
request, RGW must perform a &amp;quot;scatter-gather&amp;quot; operation: querying every shard
and sorting the combined results. For buckets with thousands of shards, this
becomes a bottleneck.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In-order sharding&lt;/strong&gt; (ordered bucket listing) is in active development and will revolutionize listing performance:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Change&lt;/strong&gt;: Instead of using a hash function, objects will be placed into shards based on lexicographical name ordering.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Impact&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;List requests can target specific shard ranges instead of querying all shards.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Paginated listing becomes dramatically faster (query 1-2 shards instead of thousands).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Prefix-based queries (critical for data lakehouses) become highly efficient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Iterating through object keys becomes significantly more performant.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Why This Matters for Data Lakehouses:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apache Iceberg, Hudi, and Delta Lake all rely heavily on prefix-based object discovery:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;s3://lakehouse/warehouse/sales_db/transactions/data/year=2025/month=11/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With in-order sharding, a query for this prefix would hit only the specific
shards containing objects in that lexicographical range - not all 10,000 shards in the bucket.&lt;/p&gt;
&lt;p&gt;Combined with non-pausing resharding, Ceph is building toward virtually unlimited,
performant scalability within a single bucket - exactly what modern data platforms demand.&lt;/p&gt;
&lt;p&gt;For a detailed slide deck on the topic, check out
&lt;a href=&quot;https://cephalocon2025.sched.com/speaker/ivancich&quot;&gt;Eric&lt;/a&gt; Ivancich&#39;s excellent Cephalocon talk:&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=H-CRhw3XLGw&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://static.sched.com/hosted_files/cephalocon2025/80/Cephalocon%202025%20Ivancich.pdf?_gl=1*153a8oy*_gcl_au*MTgwNzY1MDQ2MC4xNzU4MjA2ODAy*FPAU*MTgwNzY1MDQ2MC4xNzU4MjA2ODAy&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-a-new-era-of-operational-excellence&quot;&gt;Conclusion: A New Era of Operational Excellence &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-a-new-era-of-operational-excellence&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph Tentacle&#39;s near-zero impact bucket resharding represents a
fundamental shift in production object storage operations,
eliminating one of the most significant pain points in large-scale deployments.&lt;/p&gt;
&lt;p&gt;As Ceph continues evolving with features like in-order sharding, the vision
becomes clear: &lt;strong&gt;single-bucket architectures that scale infinitely without operational complexity&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For data lakehouse architects building on Apache Iceberg, for AI/ML engineers
managing billions of training artifacts, and for enterprise architects demanding
the highest availability without operational friction, Ceph Tentacle
delivers the operational maturity that production workloads require.&lt;/p&gt;
&lt;p&gt;*All test configurations were performed on HDD production-equivalent hardware.
Results may vary based on hardware specifications, network topology, and workload
characteristics. Consult the official documentation for detailed configuration guidance and best practices.&lt;/p&gt;
&lt;p&gt;We would like to thank IBM for the time to author these articles.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Mastering IAM in Ceph: Multi-Tenancy, Access Control, and Why ACLs Must Die</title>
    <link href="https://ceph.io/en/news/blog/2026/mastering-iam/" />
    <updated>2026-01-24T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/mastering-iam/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/mastering-iam/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1767549864482/5e07de10-5b83-4de3-a013-fd9c3f77427a.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;introduction%3A-when-security-theater-becomes-a-real-disaster&quot;&gt;Introduction: When Security Theater Becomes a Real Disaster &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-when-security-theater-becomes-a-real-disaster&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In March 2017, a misconfigured S3 bucket at Verizon exposed the personal
information of 14 million customers. The root cause wasn&#39;t a sophisticated
attack; it was a simple oversight in access permissions. The bucket was
set to be publicly accessible due to S3 permission misconfiguration, and no one
noticed because ACLs were managed separately from the company&#39;s centralized IAM
policies. The security team had implemented careful, identity-based access controls,
but a resource-level ACL silently bypassed them by granting access to &amp;quot;All Users.&amp;quot;&lt;/p&gt;
&lt;p&gt;This scenario repeats constantly across the industry: ACLs creating invisible
access paths that security teams don&#39;t know exist, buckets accidentally exposed
to the public internet, and contractors uploading data that the bucket owner
cannot reliably read or administer, while still consuming capacity.&lt;/p&gt;
&lt;p&gt;Between 2017 and 2019, major companies exposed hundreds of millions of records
via misconfigured S3 permissions (ACLs and/or bucket policies):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verizon (2017)&lt;/strong&gt;: &lt;a href=&quot;https://www.techtarget.com/searchsecurity/news/450422709/Misconfigured-AWS-S3-bucket-exposes-millions-of-Verizon-customers-data&quot;&gt;14 million customers&lt;/a&gt; - An AWS S3 bucket configured for public access exposed names, addresses, account PINs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Facebook (2019)&lt;/strong&gt;: &lt;a href=&quot;https://www.upguard.com/breaches/facebook-user-data-leak&quot;&gt;540 million records&lt;/a&gt; - Third-party apps stored user data in publicly accessible S3 buckets&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Instagram (2019)&lt;/strong&gt;: &lt;a href=&quot;https://www.cpomagazine.com/cyber-security/instagram-breach-exposes-personal-data-of-49-million-users/&quot;&gt;49 million records&lt;/a&gt; - Marketing firm left influencer database unprotected in AWS S3&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The AWS response was clear: &lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2023/04/amazon-s3-security-best-practices-buckets-default/&quot;&gt;since April 2023&lt;/a&gt;,
&lt;strong&gt;all new S3 buckets default to &amp;quot;ACLs disabled&amp;quot;&lt;/strong&gt; (BucketOwnerEnforced) and &lt;strong&gt;Block Public Access enabled&lt;/strong&gt;.
AWS strongly recommends disabling ACLs on existing buckets and migrating to a pure
policy-based model with IAM Accounts architecture.&lt;/p&gt;
&lt;p&gt;If you&#39;re running the Ceph Object Gateway (RGW), you have access to the same IAM
Accounts model introduced in Ceph Squid 19.2.0. This post explains why ACLs must
be disabled immediately and how to implement modern, secure access control with IAM policies.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do This First (Quick Security Wins)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Before reading further, take these two actions on all production buckets:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enable Block Public Access&lt;/strong&gt; - Prevents public exposure via ACLs or bucket policies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deny ACL operations&lt;/strong&gt; - Add explicit deny for &lt;code&gt;s3:PutObjectAcl&lt;/code&gt; and &lt;code&gt;s3:PutBucketAcl&lt;/code&gt; as defense-in-depth&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These changes prevent the attack patterns described in this post; a minimal example of both follows this note. Continue reading to understand why and how.&lt;/p&gt;
&lt;/blockquote&gt;
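&lt;p&gt;A minimal sketch of those two steps against an RGW endpoint. The bucket name, profiles, and endpoint are hypothetical, and the first command assumes your Ceph release supports the S3 PublicAccessBlock API; verify against your environment before applying it broadly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ export RGW_ENDPOINT=&amp;quot;https://rgw.example.com&amp;quot;

# 1. Block public access on the bucket (stops public ACLs and public policies)
$ aws --profile account-root --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-public-access-block &#92;
  --bucket bucketacl &#92;
  --public-access-block-configuration &#92;
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# 2. Defense-in-depth: explicitly deny ACL modifications via a bucket policy
$ cat &amp;gt; deny-acl.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Sid&amp;quot;: &amp;quot;DenyACLWrites&amp;quot;,
    &amp;quot;Effect&amp;quot;: &amp;quot;Deny&amp;quot;,
    &amp;quot;Principal&amp;quot;: &amp;quot;*&amp;quot;,
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:PutObjectAcl&amp;quot;, &amp;quot;s3:PutBucketAcl&amp;quot;],
    &amp;quot;Resource&amp;quot;: [&amp;quot;arn:aws:s3:::bucketacl&amp;quot;, &amp;quot;arn:aws:s3:::bucketacl/*&amp;quot;]
  }]
}
EOF

$ aws --profile account-root --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-policy &#92;
  --bucket bucketacl --policy file://deny-acl.json
&lt;/code&gt;&lt;/pre&gt;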
&lt;hr&gt;
&lt;h2 id=&quot;why-acls-failed%3F&quot;&gt;Why ACLs Failed? &lt;a class=&quot;link-anchor&quot; href=&quot;#why-acls-failed%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Access Control Lists (ACLs) were S3&#39;s original permission system. They failed
for several critical reasons that made them fundamentally unsafe for production use.&lt;/p&gt;
&lt;h3 id=&quot;public-access-disasters&quot;&gt;Public Access Disasters &lt;a class=&quot;link-anchor&quot; href=&quot;#public-access-disasters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The most dangerous ACL failure was a silent public exposure. A single misconfigured
ACL could grant the entire internet access to your data, and your security team would
never know because ACLs weren&#39;t visible in centralized IAM policies.&lt;/p&gt;
&lt;p&gt;How it happened:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ export RGW_ENDPOINT=&amp;quot;https://rgw.example.com&amp;quot;

# Developer accidentally makes object public during testing
$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-object-acl &#92;
  --bucket bucketacl &#92;
  --key hosts &#92;
  --grant-read uri=http://acs.amazonaws.com/groups/global/AllUsers
$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket bucketacl &#92; 
  --key hosts
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;Type&amp;quot;: &amp;quot;Group&amp;quot;,
                &amp;quot;URI&amp;quot;: &amp;quot;http://acs.amazonaws.com/groups/global/AllUsers&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;READ&amp;quot;
        }
    ]
}

# Security team checks IAM policies - looks fine (against the same RGW endpoint)

$ aws --profile account-root --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; iam get-user-policy &#92;
  --user-name developer &#92;
  --policy-name S3Access

# ✓ Least privilege, no issues detected

# Meanwhile, the object is public to anyone who can reach the RGW endpoint:

$ curl &amp;quot;$RGW_ENDPOINT/bucketacl/hosts&amp;quot; 
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.2XX.0.X   ceph01 

# Full access, no authentication required
# The same risk exists at bucket scope; a public bucket ACL enables unauthenticated listing
# which can leak keys and metadata

$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
--bucket bucketacl --acl public-read

# Unauthenticated Access to list bucket contents

$ curl -s &amp;quot;$RGW_ENDPOINT/bucketacl&amp;quot; | xmllint --format -
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;
&amp;lt;ListBucketResult xmlns=&amp;quot;http://s3.amazonaws.com/doc/2006-03-01/&amp;quot;&amp;gt;
  &amp;lt;Name&amp;gt;bucketacl&amp;lt;/Name&amp;gt;
 ...
  &amp;lt;Contents&amp;gt;
    &amp;lt;Key&amp;gt;hosts&amp;lt;/Key&amp;gt;
    &amp;lt;LastModified&amp;gt;2025-12-31T08:58:21.346Z&amp;lt;/LastModified&amp;gt;
    &amp;lt;ETag&amp;gt;&amp;quot;71ae31ad9b6e7fda9cb5a8628b2e152a&amp;quot;&amp;lt;/ETag&amp;gt;
    &amp;lt;Size&amp;gt;415&amp;lt;/Size&amp;gt;
    &amp;lt;StorageClass&amp;gt;STANDARD&amp;lt;/StorageClass&amp;gt;
    &amp;lt;Owner&amp;gt;
      &amp;lt;ID&amp;gt;developer&amp;lt;/ID&amp;gt;
      &amp;lt;DisplayName&amp;gt;developer&amp;lt;/DisplayName&amp;gt;
 ...
&amp;lt;/ListBucketResult&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Why was it catastrophic&lt;/strong&gt;?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Decentralized control&lt;/strong&gt;: ACLs could be set per-bucket and per-object, creating millions of potential exposure points&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No visibility&lt;/strong&gt;: ACLs didn&#39;t appear in the IAM console - security teams had no way to audit them centrally&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silent bypasses&lt;/strong&gt;: Even perfect IAM policies couldn&#39;t prevent an ACL from granting public access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object-level chaos&lt;/strong&gt;: With millions of objects, each having its own ACL, comprehensive auditing was impossible&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Real-world impact&lt;/strong&gt;: The three breaches in our introduction (Verizon, Facebook, Instagram) all involved publicly
accessible S3 data caused by permission misconfiguration (ACLs, bucket policies, or both), combined with weak central
visibility and auditing; exactly the problems that policy-based access control solves.&lt;/p&gt;
&lt;h3 id=&quot;the-object-ownership-problem&quot;&gt;The Object Ownership Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-object-ownership-problem&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Beyond public access, ACLs created an ownership nightmare. When external accounts uploaded objects to your bucket, they owned those objects, not you.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Contractor uploads data to your bucket
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp sensitive.pdf s3://company-bucket/contractor-data/ --profile contractor
upload: ./sensitive.pdf to s3://company-bucket/contractor-data/sensitive.pdf

# Who owns this object?
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket company-bucket &#92;
  --key contractor-data/sensitive.pdf &#92;
  --profile contractor
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;Contractor Account&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;contractor&amp;quot;  ← Contractor owns it, not you!
    }
}

# You (bucket owner) can&#39;t READ the object
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp &#92;
  s3://company-bucket/contractor-data/sensitive.pdf &#92;
  ./test.pdf --profile company-admin
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

# You can&#39;t even GET the ACL to see permissions
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket company-bucket --key contractor-data/sensitive.pdf &#92;
  --profile company-admin
fatal error: An error occurred (AccessDenied) when calling the GetObjectAcl operation: Access Denied


# You can&#39;t MODIFY the ACL
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-object-acl &#92;
  --bucket company-bucket --key contractor-data/sensitive.pdf &#92;
  --acl private --profile company-admin
fatal error: An error occurred (AccessDenied) when calling the PutObjectAcl operation: Access Denied
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For on-premises Ceph deployments, while there&#39;s no per-GB billing surprise,
the &lt;strong&gt;operational and compliance problems are identical&lt;/strong&gt;: you can&#39;t read,
audit, or manage data in your own infrastructure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In Ceph RGW, bucket owners CAN delete objects they don&#39;t own. However, they
still can&#39;t read, view ACLs, or manage those objects, creating operational
blind spots and compliance risks.&lt;/p&gt;
&lt;/blockquote&gt;
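&lt;p&gt;As an illustration of that asymmetry (a minimal sketch, reusing the &lt;code&gt;company-admin&lt;/code&gt; profile from the example above), the bucket owner can delete the contractor&#39;s object even though reading it is denied:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Bucket owner cannot read the object (shown above), but CAN delete it
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api delete-object &#92;
  --bucket company-bucket &#92;
  --key contractor-data/sensitive.pdf &#92;
  --profile company-admin

# Deleting blindly is the only control left; it is no substitute for being
# able to read, audit, or re-permission the data
&lt;/code&gt;&lt;/pre&gt;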
&lt;h3 id=&quot;the-authenticated-read-trap-(over-sharing-inside-the-cluster)&quot;&gt;The authenticated-read trap (over-sharing inside the cluster) &lt;a class=&quot;link-anchor&quot; href=&quot;#the-authenticated-read-trap-(over-sharing-inside-the-cluster)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ACLs include grants that &lt;em&gt;appear&lt;/em&gt; safer than &amp;quot;public&amp;quot; but remain dangerously broad.
In S3, &lt;code&gt;authenticated-read&lt;/code&gt; grants read access to the &lt;code&gt;AuthenticatedUsers&lt;/code&gt; group;
in Ceph RGW terms, that can translate to &amp;quot;any identity that can authenticate to
this RGW endpoint/cluster,&amp;quot; not &amp;quot;only my team.&amp;quot; On a shared on-premises
platform (multiple accounts, tenants, service accounts, CI users, integrations),
this can lead to accidental cross-team or cross-tenant data exposure.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Finance team uploads &amp;quot;internal&amp;quot; data with authenticated-read
# (thinking it&#39;s safer than public)
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp finance-report.pdf &#92;
  s3://company-bucket/finance-report.pdf &#92;
  --acl authenticated-read --profile finance-team
 upload: ./finance-report.pdf to s3://company-bucket/finance-report.pdf

# Check the ACL - looks reasonable?
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket company-bucket &#92;
  --key finance-report.pdf --profile finance-team
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;Finance Team&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;finance-team&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;Type&amp;quot;: &amp;quot;Group&amp;quot;,
                &amp;quot;URI&amp;quot;: &amp;quot;http://acs.amazonaws.com/groups/global/AuthenticatedUsers&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;READ&amp;quot;  ← ANY authenticated user on the cluster!
        }
    ]
}

# DevOps team (completely different department) can read it!
$ aws --profile devops --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp &#92;
  s3://company-bucket/finance-report.pdf ./leaked.pdf
download: s3://company-bucket/finance-report.pdf to ./leaked.pdf

# Contractor user (or any other authenticated user) can also access it
$ aws --profile contractor --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp &#92;
  s3://company-bucket/finance-report.pdf ./contractor-copy.pdf
download: s3://company-bucket/finance-report.pdf to ./contractor-copy.pdf

# Anonymous users are still blocked
$ aws s3 cp s3://company-bucket/finance-report.pdf ./anon.pdf &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;public-write-is-an-integrity-disaster%2C-not-just-a-leak&quot;&gt;Public write is an integrity disaster, not just a leak &lt;a class=&quot;link-anchor&quot; href=&quot;#public-write-is-an-integrity-disaster%2C-not-just-a-leak&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ACL errors are not solely about &amp;quot;read&amp;quot; exposure. With bucket ACLs, &lt;code&gt;public-read-write&lt;/code&gt;
(or broad write grants) can enable untrusted PUT requests to a bucket. That turns into
an integrity incident: poisoned datasets, overwritten &amp;quot;golden&amp;quot; artifacts, malware
hosting, or backup tampering. Even an on-prem, &amp;quot;internal-only&amp;quot; deployment does not save
you; it merely changes the attacker&#39;s vector.&lt;/p&gt;
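&lt;p&gt;As a hedged illustration (reusing the &lt;code&gt;bucketacl&lt;/code&gt; bucket and &lt;code&gt;$RGW_ENDPOINT&lt;/code&gt; from earlier; the output shown is the expected behavior, not a captured session), a &lt;code&gt;public-read-write&lt;/code&gt; bucket ACL lets a completely unauthenticated client overwrite your data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Bucket owner (mis)configures a public-read-write ACL
$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
  --bucket bucketacl --acl public-read-write

# Anyone who can reach the endpoint can now overwrite the existing object
$ aws s3 cp ./tampered-hosts s3://bucketacl/hosts &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
upload: ./tampered-hosts to s3://bucketacl/hosts
&lt;/code&gt;&lt;/pre&gt;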
&lt;h3 id=&quot;write_acp-is-the-%22permission-to-rewrite-permissions.%22&quot;&gt;WRITE_ACP is the &amp;quot;permission to rewrite permissions.&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#write_acp-is-the-%22permission-to-rewrite-permissions.%22&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ACLs don’t just control data-plane actions; they can delegate control-plane authority
over the ACL itself. In Ceph RGW S3 semantics, &lt;code&gt;WRITE_ACP&lt;/code&gt; is the permission that allows
changing a bucket&#39;s ACL (&lt;code&gt;PUT Bucket ACL&lt;/code&gt; requires &lt;code&gt;WRITE_ACP&lt;/code&gt;). If the wrong
principal has it, they can escalate later by granting broader access (including
public exposure), and this delegation is distributed across buckets and objects.
This is a governance anti-pattern because the system contains a hidden &amp;quot;permission to change permissions.&amp;quot;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Step 1: Bucket owner grants contractor WRITE + WRITE_ACP
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
  --bucket company-bucket &#92;
  --grant-write id=contractor &#92;
  --grant-write-acp id=contractor &#92;
  --profile developer

# Verify the ACL
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-bucket-acl &#92;
  --bucket company-bucket --profile developer
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;Contractor Account&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;contractor&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;WRITE&amp;quot;
        },
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;Contractor Account&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;contractor&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;WRITE_ACP&amp;quot;  ← Contractor can modify ACLs!
        },
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;FULL_CONTROL&amp;quot;
        }
    ]
}

# Step 2: Contractor abuses WRITE_ACP to make bucket PUBLIC
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
  --bucket company-bucket &#92;
  --acl public-read --profile contractor
# Success! Contractor just made the bucket public

# Step 3: Verify the escalation
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-bucket-acl &#92;
  --bucket company-bucket --profile developer
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;Type&amp;quot;: &amp;quot;Group&amp;quot;,
                &amp;quot;URI&amp;quot;: &amp;quot;http://acs.amazonaws.com/groups/global/AllUsers&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;READ&amp;quot;  ← NOW PUBLIC! Anyone can list contents
        },
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;FULL_CONTROL&amp;quot;
        }
    ]
}

# Step 4: Anonymous users can now list the bucket
$ aws s3 ls s3://company-bucket/ &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
2025-12-31 05:00:00         27 finance-report.pdf
# Public exposure complete
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;the-solution%3A-stop-using-acls-immediately&quot;&gt;The Solution: Stop using ACLs immediately &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-stop-using-acls-immediately&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;AWS and the Ceph Object Gateway (RGW) provide controls to disable ACLs
entirely. This should be your first action on any production bucket.&lt;/p&gt;
&lt;h3 id=&quot;step-1%3A-block-public-access&quot;&gt;Step 1: Block Public Access &lt;a class=&quot;link-anchor&quot; href=&quot;#step-1%3A-block-public-access&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Enforce public access blocks to prevent bucket ACLs from granting public access.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ceph AWS CLI Configuration Note&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;All &lt;code&gt;aws&lt;/code&gt; CLI commands in this guide assume your AWS CLI profile is configured: See the &lt;a href=&quot;https://docs.ceph.com/en/latest/radosgw/s3/commons/#aws-cli-setup&quot;&gt;Ceph documentation on AWS CLI configuration&lt;/a&gt; and &lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html&quot;&gt;AWS CLI endpoint configuration&lt;/a&gt; for details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Bucket-level (Granularity per individual bucket):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Anon access is enabled on bucket from previous example

$ aws s3 ls s3://company-bucket/ &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
                           PRE contractor-data/
2025-12-31 07:13:55         26 finance-report.pdf

# We use public-access-block on our bucket

$ aws s3api put-public-access-block &#92;
  --bucket company-bucket &#92;
  --public-access-block-configuration &#92;
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  --profile developer

# Public access has been removed from the bucket,
# a non-authorized request fails after the put-public-access-block

$ aws s3 ls s3://company-bucket/ --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
fatal error: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
# Some AWS CLI versions surface certain error responses
# poorly; if you see a Python exception, re-run with
# --debug to confirm the underlying HTTP 403/AccessDenied.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What each setting does:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicAcls&lt;/strong&gt;: Prevents new public ACLs from being applied (redundant if BucketOwnerEnforced, but adds defense in depth)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IgnorePublicAcls&lt;/strong&gt;: Ignores existing public ACLs (treats them as private)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicPolicy&lt;/strong&gt;: Prevents bucket policies that grant public access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RestrictPublicBuckets&lt;/strong&gt;: Blocks public access to buckets even if policies exist&lt;/li&gt;
&lt;/ul&gt;
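&lt;p&gt;You can read the configuration back to confirm it stuck (a quick check against the same bucket and profile used above; the output shape follows the standard S3 API):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3api get-public-access-block &#92;
  --bucket company-bucket &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  --profile developer
{
    &amp;quot;PublicAccessBlockConfiguration&amp;quot;: {
        &amp;quot;BlockPublicAcls&amp;quot;: true,
        &amp;quot;IgnorePublicAcls&amp;quot;: true,
        &amp;quot;BlockPublicPolicy&amp;quot;: true,
        &amp;quot;RestrictPublicBuckets&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;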
&lt;h3 id=&quot;step-2%3A-deny-acl-operations-via-iam-policy&quot;&gt;Step 2: Deny ACL Operations via IAM Policy &lt;a class=&quot;link-anchor&quot; href=&quot;#step-2%3A-deny-acl-operations-via-iam-policy&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As the root account administrator, you should establish a security baseline that
prevents ACL usage &lt;strong&gt;by default&lt;/strong&gt; for all users and groups. This way, even if a
developer tries to use ACLs in the future, they&#39;ll get an immediate &lt;code&gt;Access Denied&lt;/code&gt;
error, preventing accidents before they happen.&lt;/p&gt;
&lt;p&gt;The governance pattern creates a standard &amp;quot;DenyACLs&amp;quot; policy that you attach to every
new user or group you create. This establishes ACL blocking as your organization&#39;s security baseline.&lt;/p&gt;
&lt;p&gt;Create the standard policy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt; deny-acl-operations.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;DenyACLOperations&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Deny&amp;quot;,
      &amp;quot;Action&amp;quot;: [
        &amp;quot;s3:PutObjectAcl&amp;quot;,
        &amp;quot;s3:PutObjectVersionAcl&amp;quot;,
        &amp;quot;s3:PutBucketAcl&amp;quot;
      ],
      &amp;quot;Resource&amp;quot;: [
        &amp;quot;arn:aws:s3:::*&amp;quot;,
        &amp;quot;arn:aws:s3:::*/*&amp;quot;
      ]
    }
  ]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is an example of how to apply the policy to new users as you create them:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a new developer
$ aws iam create-user --user-name alice
{
    &amp;quot;User&amp;quot;: {
        &amp;quot;Path&amp;quot;: &amp;quot;/&amp;quot;,
        &amp;quot;UserName&amp;quot;: &amp;quot;alice&amp;quot;,
        &amp;quot;UserId&amp;quot;: &amp;quot;4abb3a59-7991-4644-8863-347b02adc48f&amp;quot;,
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153XXX:user/alice&amp;quot;,
        &amp;quot;CreateDate&amp;quot;: &amp;quot;2025-01-03T15:44:06.920034Z&amp;quot;
    }
}
$ aws iam create-access-key --user-name alice

# Immediately apply the ACL deny policy (before giving any other permissions)
$ aws iam put-user-policy &#92;
  --user-name alice &#92;
  --policy-name DenyACLs &#92;
  --policy-document file://deny-acl-operations.json

# Now grant the user their actual S3 permissions
$ aws iam attach-user-policy --user-name alice --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If Alice later tries to configure ACLs on any bucket, she will get &lt;code&gt;Access Denied&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a bucket as Alice, upload an Object and try to apply a public ACL on the Object
$ aws --profile alice --endpoint-url=&amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3 mb s3://alicebucket
$ aws --profile alice --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3 cp finance-report.pdf s3://alicebucket
$ aws --profile alice --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3api put-object-acl --bucket alicebucket --key &#92;
  finance-report.pdf --acl public-read
#  Error: Access Denied
fatal error: An error occurred (AccessDenied) when calling the PutObjectAcl operation: Access Denied
# Some AWS CLI versions surface certain error responses poorly; if you see a Python exception, re-run with --debug to confirm the underlying HTTP 403/AccessDenied.
&lt;/code&gt;&lt;/pre&gt;
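&lt;p&gt;Rather than attaching the policy user by user, the same baseline can be applied at group level (a sketch, assuming your RGW release supports IAM group policies; the &lt;code&gt;all-developers&lt;/code&gt; group name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a group that every regular user joins by default
$ aws iam create-group --group-name all-developers

# Attach the DenyACLs baseline to the group once
$ aws iam put-group-policy &#92;
  --group-name all-developers &#92;
  --policy-name DenyACLs &#92;
  --policy-document file://deny-acl-operations.json

# New users inherit the baseline simply by being added to the group
$ aws iam add-user-to-group --user-name alice --group-name all-developers
&lt;/code&gt;&lt;/pre&gt;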
&lt;h2 id=&quot;%22wait%2C-how-do-i-share-data-now%3F%22&quot;&gt;&amp;quot;Wait, How Do I Share Data Now?&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#%22wait%2C-how-do-i-share-data-now%3F%22&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With ACLs disabled, you might be wondering: How do I grant cross-account access
to share my datasets?&lt;/p&gt;
&lt;p&gt;Previously, you might have used ACLs to grant a contractor account read access
to specific objects or allowed a partner account to upload files. With ACLs gone,
how do you securely share data between accounts?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Two modern approaches exist&lt;/strong&gt;:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bucket policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource owner adds bucket policy; requesting account adds identity policy&lt;/td&gt;
&lt;td&gt;Direct, always-on access&lt;/td&gt;
&lt;td&gt;Static, permanent sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IAM Role assumption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource owner creates an assumable role; requesting account assumes it&lt;/td&gt;
&lt;td&gt;Temporary session (1-12h)&lt;/td&gt;
&lt;td&gt;Dynamic, auditable access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
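&lt;p&gt;For completeness, the bucket-policy approach from the table looks roughly like this (a hedged sketch; the bucket name, account ID, and user below are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Resource owner attaches a bucket policy naming a principal from another account
$ cat &amp;gt; share-bucket-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {&amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW00000000000000000:user/partner-analyst&amp;quot;},
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:GetObject&amp;quot;, &amp;quot;s3:ListBucket&amp;quot;],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::shared-datasets&amp;quot;,
      &amp;quot;arn:aws:s3:::shared-datasets/*&amp;quot;
    ]
  }]
}
EOF

$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-policy &#92;
  --bucket shared-datasets &#92;
  --policy file://share-bucket-policy.json
# The requesting account must also allow these actions in its own identity policy
&lt;/code&gt;&lt;/pre&gt;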
&lt;p&gt;&lt;strong&gt;We&#39;ll focus on IAM role assumption&lt;/strong&gt; because it provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Temporary credentials&lt;/strong&gt; that auto-expire (vs. permanent keys)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Detailed audit trails&lt;/strong&gt; showing who assumed what role and when (vs. static access logs)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Instant revocation&lt;/strong&gt; by deleting the role (vs. updating multiple policies)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Least privilege&lt;/strong&gt; with time-bound access (vs. always-on permissions)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is also AWS&#39;s recommended pattern and follows zero-trust principles. Let&#39;s see how.&lt;/p&gt;
&lt;h2 id=&quot;iam-accounts%3A-the-modern-solution&quot;&gt;IAM Accounts: The Modern Solution &lt;a class=&quot;link-anchor&quot; href=&quot;#iam-accounts%3A-the-modern-solution&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) implements AWS-compatible IAM Accounts, introduced in
Squid/19.2.0. This provides proper multi-tenancy with policy-based access control instead of ACLs.&lt;/p&gt;
&lt;h3 id=&quot;what-is-an-iam-account%3F&quot;&gt;What is an IAM Account? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-is-an-iam-account%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;An &lt;strong&gt;IAM Account&lt;/strong&gt; provides isolation for identities and access control:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ini&quot;&gt;Account: finance-team (ID: RGW12345678901234567)
├── Users &amp;amp; Groups (isolated per account)
├── Roles (isolated per account)  
├── Policies (fine-grained permissions)
└── S3 Buckets (owned by account)
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;S3 bucket &lt;strong&gt;names&lt;/strong&gt; are globally unique across ALL accounts in a flat namespace
(just like AWS S3). If Finance creates a bucket called &lt;code&gt;financial-reports&lt;/code&gt;, no
other account can use that name. However, bucket ownership and access control
are account-specific: only Finance can manage their &lt;code&gt;financial-reports&lt;/code&gt; bucket.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Ceph accounts can optionally belong to a tenant for namespace isolation. Within
a tenant, bucket names are unique to that tenant; they are not globally unique across all tenants.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Key distinction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Account Root User&lt;/strong&gt;: Emergency admin access only, created with &lt;code&gt;--account-root&lt;/code&gt; flag (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM Users&lt;/strong&gt;: Day-to-day access, follows the least privilege principle&lt;/li&gt;
&lt;/ul&gt;
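&lt;p&gt;For context, creating an account and its account root user typically looks like the following sketch (names are illustrative; the full procedure is in the post linked below):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create the account itself
$ radosgw-admin account create --account-name=finance-team --account-id=RGW00893359550361292

# Create the account root user (emergency admin access only)
$ radosgw-admin user create --uid=finance-root --display-name=&amp;quot;Finance Root&amp;quot; &#92;
  --account-id=RGW00893359550361292 --account-root --gen-access-key --gen-secret
&lt;/code&gt;&lt;/pre&gt;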
&lt;p&gt;For this post, we&#39;ll assume you have two accounts already set up:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Finance Account&lt;/strong&gt; (ID: &lt;code&gt;RGW00893359550361292&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DevOps Account&lt;/strong&gt; (ID: &lt;code&gt;RGW89761398048153888&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;For a complete guide on creating IAM Accounts, users, and basic configuration,
see our previous post: &lt;a href=&quot;https://ceph.io/en/news/blog/2025/enhancing-ceph-multitenancy-with-iam-accounts/&quot;&gt;Enhancing Ceph Multitenancy with IAM Accounts&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&quot;cross-account-sharing%3A-the-modern-way&quot;&gt;Cross-Account Sharing: The Modern Way &lt;a class=&quot;link-anchor&quot; href=&quot;#cross-account-sharing%3A-the-modern-way&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Finance needs to give DevOps read-only access to backup data for
disaster recovery testing. Previously, this might have been done with ACLs. Now, we use cross-account role assumption.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DevOps can read backups, but cannot modify or delete them&lt;/li&gt;
&lt;li&gt;Access uses temporary credentials (not long-term keys)&lt;/li&gt;
&lt;li&gt;Finance can revoke access instantly&lt;/li&gt;
&lt;li&gt;Fully auditable (who accessed what, when)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-it-works&quot;&gt;How It Works &lt;a class=&quot;link-anchor&quot; href=&quot;#how-it-works&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The key insight: Create a role in the Finance account (same account as the bucket).
When DevOps assumes this role, they temporarily &amp;quot;become&amp;quot; a Finance account principal with Finance credentials.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765907670163/8b060b2d-6f10-4761-82b7-75717606b121.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is the same STS pattern we covered in our &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/&quot;&gt;previous post on temporary credentials&lt;/a&gt;,
but now applied to cross-account access.&lt;/p&gt;
&lt;h3 id=&quot;implementation&quot;&gt;Implementation &lt;a class=&quot;link-anchor&quot; href=&quot;#implementation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;1.-finance-creates-a-cross-account-role-for-the-devops-team&quot;&gt;1. Finance Creates a Cross-Account Role for the Devops Team &lt;a class=&quot;link-anchor&quot; href=&quot;#1.-finance-creates-a-cross-account-role-for-the-devops-team&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Finance creates a role called &lt;code&gt;devops-backup-reader&lt;/code&gt; in their account with two policies:&lt;/p&gt;
&lt;p&gt;The Trust Policy (who can assume this role):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt; trust-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {
      &amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:user/dave-backup-ops&amp;quot;
    },
    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This says: &lt;em&gt;&lt;strong&gt;&amp;quot;The DevOps account user &#39;dave-backup-ops&#39; can assume this role.&amp;quot;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the trust policy you can also use the &lt;code&gt;RGWXXXX:root&lt;/code&gt; format for the Principal.
That allows any user in the DevOps account to assume the role; the DevOps account can
then use its own IAM policies to restrict assumption of the Finance
&lt;code&gt;devops-backup-reader&lt;/code&gt; role to a specific IAM group.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And the Permission Policy (what the role can do):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt; role-permissions.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:GetObject&amp;quot;, &amp;quot;s3:ListBucket&amp;quot;],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::finance-backups&amp;quot;,
      &amp;quot;arn:aws:s3:::finance-backups/*&amp;quot;
    ]
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This says: &lt;strong&gt;&amp;quot;&lt;em&gt;This role can list &amp;amp; read the finance-backups bucket.&lt;/em&gt;&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Once we have the policy files created, we can go ahead and create the IAM role &lt;code&gt;devops-backup-reader&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --profile finance-admin s3 mb s3://finance-backups 
$ aws iam create-role &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader &#92;
  --assume-role-policy-document file://trust-policy.json

$ aws iam put-role-policy &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader &#92;
  --policy-name ReadBackups &#92;
  --policy-document file://role-permissions.json
&lt;/code&gt;&lt;/pre&gt;
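&lt;p&gt;Finance can verify the role and its attached policy before handing the ARN to DevOps (a simple read-back; output trimmed):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam get-role &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader
{
    &amp;quot;Role&amp;quot;: {
        &amp;quot;RoleName&amp;quot;: &amp;quot;devops-backup-reader&amp;quot;,
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:iam::RGW00893359550361292:role/devops-backup-reader&amp;quot;,
        ...
    }
}

$ aws iam list-role-policies --profile finance-admin --role-name devops-backup-reader
{
    &amp;quot;PolicyNames&amp;quot;: [
        &amp;quot;ReadBackups&amp;quot;
    ]
}
&lt;/code&gt;&lt;/pre&gt;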
&lt;h4 id=&quot;2.-devops-user-accesses-the-finance-account-dataset&quot;&gt;2. DevOps User Accesses the Finance Account Dataset &lt;a class=&quot;link-anchor&quot; href=&quot;#2.-devops-user-accesses-the-finance-account-dataset&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Dave from the DevOps team assumes the role and gets temporary Finance credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Assume Finance role
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; sts assume-role &#92;
  --profile dave-backup-ops &#92;
  --role-arn &amp;quot;arn:aws:iam::RGW00893359550361292:role/devops-backup-reader&amp;quot; &#92;
  --role-session-name david-devops-backup-finance &#92;
  --region default

{
    &amp;quot;Credentials&amp;quot;: {
        &amp;quot;AccessKeyId&amp;quot;: &amp;quot;ASIA****************&amp;quot;,
        &amp;quot;SecretAccessKey&amp;quot;: &amp;quot;REDACTED&amp;quot;,
        &amp;quot;SessionToken&amp;quot;: &amp;quot;REDACTED&amp;quot;,
        &amp;quot;Expiration&amp;quot;: &amp;quot;2025-0X-15TXX:00:00Z&amp;quot;
    }
}

# Use temporary credentials
$ export AWS_ACCESS_KEY_ID=ASIA****************
$ export AWS_SECRET_ACCESS_KEY=REDACTED
$ export AWS_SESSION_TOKEN=REDACTED

# Access Finance backups (using Finance account credentials!)
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 ls s3://finance-backups/
2025-01-14 02:00:00  daily-backup-2025-01-14.tar.gz

$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp s3://finance-backups/daily-backup-2025-01-14.tar.gz .
download: s3://finance-backups/daily-backup-2025-01-14.tar.gz to ./daily-backup-2025-01-14.tar.gz
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;why-this-works-(and-why-no-bucket-policy-is-needed)&quot;&gt;Why This Works (And Why No Bucket Policy Is Needed) &lt;a class=&quot;link-anchor&quot; href=&quot;#why-this-works-(and-why-no-bucket-policy-is-needed)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The role &lt;code&gt;devops-backup-reader&lt;/code&gt; is in the Finance account (same account as the
bucket). When Dave assumes this role, he receives temporary Finance account
credentials. From the bucket&#39;s perspective, this is same-account access:
only the role&#39;s policy is required; no bucket policy is needed.&lt;/p&gt;
&lt;p&gt;The only cross-account step is the AssumeRole call itself; the actual bucket
access is same-account, because the role and the bucket both live in Finance.&lt;/p&gt;
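&lt;p&gt;If Dave does this regularly, the AWS CLI can assume the role for him and refresh the temporary credentials automatically via a profile such as the following (a sketch; the profile name is illustrative, and the implicit STS call must also be routed to RGW, which depends on your CLI version and endpoint configuration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt;&amp;gt; ~/.aws/config &amp;lt;&amp;lt;&#39;EOF&#39;
[profile finance-backups-ro]
role_arn = arn:aws:iam::RGW00893359550361292:role/devops-backup-reader
source_profile = dave-backup-ops
role_session_name = david-devops-backup-finance
EOF

# Point the implicit STS call at RGW (supported by recent AWS CLI releases)
$ export AWS_ENDPOINT_URL_STS=&amp;quot;$RGW_ENDPOINT&amp;quot;

# The CLI now calls AssumeRole transparently and refreshes credentials as they expire
$ aws --profile finance-backups-ro --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3 ls s3://finance-backups/
&lt;/code&gt;&lt;/pre&gt;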
&lt;h3 id=&quot;security-benefits-of-this-approach&quot;&gt;Security Benefits of This Approach &lt;a class=&quot;link-anchor&quot; href=&quot;#security-benefits-of-this-approach&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Temporary credentials&lt;/strong&gt;: Expire after 1 hour (configurable up to 12 hours)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No shared secrets&lt;/strong&gt;: DevOps never sees Finance&#39;s long-term keys&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instant revocation&lt;/strong&gt;: Finance deletes the role → all access stops immediately (see the example after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;: Logs show role name, session name, and requesting account&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Least privilege&lt;/strong&gt;: Role has only read permissions, nothing more&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better than ACLs&lt;/strong&gt;: Centralized control, no object-level chaos&lt;/li&gt;
&lt;/ul&gt;
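&lt;p&gt;Revocation is a two-command operation on the Finance side (inline policies must be removed before the role itself can be deleted):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Finance revokes DevOps access by removing the role
$ aws iam delete-role-policy &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader &#92;
  --policy-name ReadBackups

$ aws iam delete-role &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader

# New AssumeRole calls fail immediately; in-flight sessions lose their permissions
&lt;/code&gt;&lt;/pre&gt;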
&lt;h3 id=&quot;what-the-audit-logs-show&quot;&gt;What the Audit Logs Show &lt;a class=&quot;link-anchor&quot; href=&quot;#what-the-audit-logs-show&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) audit logs capture the
complete cross-account access pattern. Here&#39;s what you will see:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Ensure RGW audit logging is enabled. See the &lt;a href=&quot;https://docs.ceph.com/en/latest/radosgw/config-ref/#bucket-and-object-audit-logging&quot;&gt;Ceph documentation on bucket and object audit logging&lt;/a&gt; (OPS logs) for configuration details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Example audit log extract when DevOps assumes the Finance role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;$ tail -f /var/log/ceph/ops-log-ceph-client.rgw.default.ceph02.fvqogr.log | jq .
{
...
  &amp;quot;time&amp;quot;: &amp;quot;2025-01-04T17:34:07.711570Z&amp;quot;,
  &amp;quot;time_local&amp;quot;: &amp;quot;2025-01-04T17:34:07.711570+0000&amp;quot;,
  &amp;quot;remote_addr&amp;quot;: &amp;quot;10.251.0.21&amp;quot;,
  &amp;quot;user&amp;quot;: &amp;quot;98b5e284-bd74-4a54-922e-cf1ee1d460c2&amp;quot;,
  &amp;quot;operation&amp;quot;: &amp;quot;assume_role&amp;quot;,
  &amp;quot;uri&amp;quot;: &amp;quot;POST / HTTP/1.1&amp;quot;,
  &amp;quot;http_status&amp;quot;: &amp;quot;200&amp;quot;,
  &amp;quot;bytes_sent&amp;quot;: 999,
  &amp;quot;user_agent&amp;quot;: &amp;quot;aws-cli/1.38.34 md/Botocore#1.37.34 ua/2.1 os/linux#5.14.0-496.el9.x86_64 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython m/N cfg/retry-mode#legacy botocore/1.37.34&amp;quot;,
  &amp;quot;referrer&amp;quot;: &amp;quot;&amp;quot;,
  &amp;quot;trans_id&amp;quot;: &amp;quot;tx000001bb92497c13eba06-00695aa48f-494246-default&amp;quot;,
  &amp;quot;access_key_id&amp;quot;: &amp;quot;MPUWRVKZFH9XXXXXXX&amp;quot;,
  &amp;quot;temp_url&amp;quot;: false
}

# We can then get any specific details on this user
$ radosgw-admin user info --access-key=MPUWRVKZFH9XXXXXXX
{
    &amp;quot;user_id&amp;quot;: &amp;quot;98b5e284-bd74-4a54-922e-cf1ee1d460c2&amp;quot;,
    &amp;quot;display_name&amp;quot;: &amp;quot;dave-backup-ops&amp;quot;,
    &amp;quot;email&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;suspended&amp;quot;: 0,
    &amp;quot;max_buckets&amp;quot;: 1000,
    ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example audit log extract when Dave from the DevOps Account accesses the Finance bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;bucket&amp;quot;: &amp;quot;finance-backups&amp;quot;,
  &amp;quot;object&amp;quot;: &amp;quot;daily-backup-2025-01-14.tar.gz&amp;quot;,
  &amp;quot;time&amp;quot;: &amp;quot;2026-01-04T17:42:35.956711Z&amp;quot;,
  &amp;quot;time_local&amp;quot;: &amp;quot;2026-01-04T17:42:35.956711+0000&amp;quot;,
  &amp;quot;remote_addr&amp;quot;: &amp;quot;10.251.0.21&amp;quot;,
  &amp;quot;object_owner&amp;quot;: &amp;quot;RGW00893359550361292&amp;quot;,
  &amp;quot;user&amp;quot;: &amp;quot;98b5e284-bd74-4a54-922e-cf1ee1d460c2&amp;quot;,
  &amp;quot;operation&amp;quot;: &amp;quot;get_obj&amp;quot;,
  &amp;quot;uri&amp;quot;: &amp;quot;GET /finance-backups/daily-backup-2025-01-14.tar.gz HTTP/1.1&amp;quot;,
  &amp;quot;http_status&amp;quot;: &amp;quot;200&amp;quot;,
  &amp;quot;bytes_sent&amp;quot;: 26,
  &amp;quot;bytes_received&amp;quot;: 0,
  &amp;quot;object_size&amp;quot;: 26,
  &amp;quot;total_time&amp;quot;: 3,
  &amp;quot;user_agent&amp;quot;: &amp;quot;aws-cli/1.38.34 md/Botocore#1.37.34 ua/2.1 os/linux#5.14.0-496.el9.x86_64 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython m/N cfg/retry-mode#legacy botocore/1.37.34&amp;quot;,
  &amp;quot;trans_id&amp;quot;: &amp;quot;tx00000a13eeac4ce551ce2-00695aa68b-494246-default&amp;quot;,
  &amp;quot;authentication_type&amp;quot;: &amp;quot;STS&amp;quot;,
  &amp;quot;sts_info&amp;quot;: {
    &amp;quot;role_name&amp;quot;: &amp;quot;$devops-backup-reader&amp;quot;,
    &amp;quot;role_session&amp;quot;: &amp;quot;david-devops-backup-finance&amp;quot;
  },
  &amp;quot;temp_url&amp;quot;: false
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What this tells you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Who&lt;/strong&gt;: Dave from DevOps (identified by role session name and the user uid)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When&lt;/strong&gt;: &lt;code&gt;2026-01-04T17:42:35.956711Z&lt;/code&gt; (exact UTC timestamp)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What&lt;/strong&gt;: Downloaded &lt;code&gt;daily-backup-2025-01-14.tar.gz&lt;/code&gt; from &lt;code&gt;finance-backups&lt;/code&gt; bucket&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How&lt;/strong&gt;: Via STS temporary credentials (&lt;code&gt;authentication_type: &amp;quot;STS&amp;quot;&lt;/code&gt;)
&lt;ul&gt;
&lt;li&gt;Assumed role: &lt;code&gt;devops-backup-reader&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Session: &lt;code&gt;david-devops-backup-finance&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;From where&lt;/strong&gt;: IP address &lt;code&gt;10.251.0.21&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bucket owner&lt;/strong&gt;: Finance account &lt;code&gt;RGW00893359550361292&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Status&lt;/strong&gt;: Success (&lt;code&gt;http_status: 200&lt;/code&gt;, 26 bytes transferred)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key security insights from this log:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Authentication type is explicitly marked as &amp;quot;STS&amp;quot;&lt;/strong&gt; - You can easily filter all temporary credential access (see the filter example after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User who assumed the role is identified&lt;/strong&gt; - (&lt;code&gt;98b5e284-bd74-4a54-922e-cf1ee1d460c2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Role name is captured&lt;/strong&gt; - You know which role was used (&lt;code&gt;devops-backup-reader&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session name is captured&lt;/strong&gt; - You can trace back to who initiated the session (Dave via &lt;code&gt;david-devops-backup-finance&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object owner is logged&lt;/strong&gt; - Confirms the bucket belongs to the Finance account, not the accessor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full HTTP details&lt;/strong&gt; - User agent shows it was AWS CLI, complete with version&lt;/li&gt;
&lt;/ol&gt;
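&lt;p&gt;Because &lt;code&gt;authentication_type&lt;/code&gt; is a structured field, filtering the ops log for every request made with temporary credentials is a one-liner (a small example using the same log file and &lt;code&gt;jq&lt;/code&gt; as above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Show who did what with STS credentials, and under which role/session
$ jq -c &#39;select(.authentication_type == &amp;quot;STS&amp;quot;)
    | {time, user, operation, bucket, object,
       role: .sts_info.role_name, session: .sts_info.role_session}&#39; &#92;
  /var/log/ceph/ops-log-ceph-client.rgw.default.ceph02.fvqogr.log
&lt;/code&gt;&lt;/pre&gt;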
&lt;blockquote&gt;
&lt;p&gt;Compared to ACLs: With ACLs, you had no audit trail showing who from which
account accessed what. The logs only showed &amp;quot;someone accessed the object&amp;quot;
with no attribution to the originating account or session context.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Comparison of IAM Roles Versus ACLs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ACLs: Decentralized, object-level, permanent, no audit trail of cross-account access&lt;/li&gt;
&lt;li&gt;IAM Roles: Centralized, temporary, revocable, full audit trail with account attribution&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;understanding-policy-evaluation&quot;&gt;Understanding Policy Evaluation &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-policy-evaluation&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To use IAM effectively, you need to understand how permissions are evaluated.&lt;/p&gt;
&lt;h3 id=&quot;the-basic-rule&quot;&gt;The Basic Rule &lt;a class=&quot;link-anchor&quot; href=&quot;#the-basic-rule&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When a user requests access to an S3 resource, the request is evaluated with the
workflow below, keeping in mind that any &lt;code&gt;DENY&lt;/code&gt; always wins over &lt;code&gt;ALLOW&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765908590713/83e2782a-fffe-4b8b-a03c-8017778ba232.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Explicit &lt;code&gt;DENY&lt;/code&gt; always wins, even if there are multiple &lt;code&gt;ALLOW&lt;/code&gt; statements.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;same-account-vs-cross-account&quot;&gt;Same-Account vs Cross-Account &lt;a class=&quot;link-anchor&quot; href=&quot;#same-account-vs-cross-account&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Same-Account Access&lt;/strong&gt; (user and bucket in the same account):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Permission needed in either the bucket policy or the identity policy&lt;/li&gt;
&lt;li&gt;One &lt;code&gt;ALLOW&lt;/code&gt; is sufficient&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cross-Account Access&lt;/strong&gt; (using role assumption):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Permission needed for AssumeRole (on both sides - trust policy + identity policy)&lt;/li&gt;
&lt;li&gt;Role&#39;s identity policy grants bucket access (same-account from bucket&#39;s perspective)&lt;/li&gt;
&lt;li&gt;No bucket policy needed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-security-roadmap%3A-enterprise-s3-security-coming-to-ceph&quot;&gt;The Security Roadmap: Enterprise S3 Security Coming to Ceph &lt;a class=&quot;link-anchor&quot; href=&quot;#the-security-roadmap%3A-enterprise-s3-security-coming-to-ceph&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph community is making a significant investment in enterprise S3 security.
Several critical features are under active development to bring Ceph RGW to full
feature parity with AWS S3&#39;s modern security model. Here&#39;s what&#39;s coming and why it matters.&lt;/p&gt;
&lt;h3 id=&quot;bucketownerenforced%3A-disabling-acls-(coming-in-a-tentacle-update)&quot;&gt;BucketOwnerEnforced: Disabling ACLs (Coming in a Tentacle update) &lt;a class=&quot;link-anchor&quot; href=&quot;#bucketownerenforced%3A-disabling-acls-(coming-in-a-tentacle-update)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Status: Merged into Ceph v20.3.0 (Tentacle) (&lt;a href=&quot;https://tracker.ceph.com/issues/63323&quot;&gt;Issue #63323)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;What it does: The &lt;code&gt;PutBucketOwnershipControls&lt;/code&gt; API with &lt;code&gt;BucketOwnerEnforced&lt;/code&gt;
setting disables ACLs entirely and forces all objects to be owned by the bucket
owner regardless of who uploaded them.&lt;/p&gt;
&lt;p&gt;The problem it solves:&lt;/p&gt;
&lt;p&gt;Before (with ACLs):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Contractor uploads → contractor owns object → you, as the owner of the bucket, can&#39;t read or manage it&lt;/li&gt;
&lt;li&gt;Developer sets ACL to public → bucket exposed to the internet&lt;/li&gt;
&lt;li&gt;Objects disappear from inventory (owned by other accounts)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After (BucketOwnerEnforced):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Anyone uploads → you own the object → you control it completely&lt;/li&gt;
&lt;li&gt;ACLs are ignored → impossible to make the bucket public accidentally via ACLs&lt;/li&gt;
&lt;li&gt;All objects visible in your inventory reports&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;How it will work:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Enable BucketOwnerEnforced on a bucket
$ aws s3api put-bucket-ownership-controls &#92;
  --bucket company-data &#92;
  --ownership-controls &#39;Rules=[{ObjectOwnership=BucketOwnerEnforced}]&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Once enabled, any request that includes ACL headers (e.g., &lt;code&gt;--acl public-read&lt;/code&gt;) will
fail. Audit your applications before enabling this feature on their buckets: any
workflow that &lt;strong&gt;still relies on ACLs&lt;/strong&gt; will start failing as soon as it sends
an ACL header.&lt;/p&gt;
&lt;/blockquote&gt;
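&lt;p&gt;Once the feature lands, the corresponding read-back should follow the standard S3 API shape (illustrative; verify the exact behavior against your Ceph release):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Confirm ACLs are disabled for the bucket
$ aws s3api get-bucket-ownership-controls --bucket company-data
{
    &amp;quot;OwnershipControls&amp;quot;: {
        &amp;quot;Rules&amp;quot;: [
            {
                &amp;quot;ObjectOwnership&amp;quot;: &amp;quot;BucketOwnerEnforced&amp;quot;
            }
        ]
    }
}
&lt;/code&gt;&lt;/pre&gt;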
&lt;h3 id=&quot;s3control-api-block-public-access-(coming-soon)&quot;&gt;S3Control API Block Public Access (Coming Soon) &lt;a class=&quot;link-anchor&quot; href=&quot;#s3control-api-block-public-access-(coming-soon)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Status: Active development, &lt;a href=&quot;https://github.com/ceph/ceph/pull/64293&quot;&gt;PR #64293&lt;/a&gt; under review&lt;/p&gt;
&lt;p&gt;You&#39;ve disabled ACLs in your Finance account. You&#39;ve enabled Block Public Access.
Your security team is confident the Finance buckets are locked down. Then someone
in the Marketing account creates a new IAM user, spins up a bucket, and accidentally
makes it public during a website deployment test. Your Finance settings didn&#39;t apply
to Marketing&#39;s account because each account manages its own configuration independently.&lt;/p&gt;
&lt;p&gt;This is where account-level controls become critical. While individual buckets can
have their own Block Public Access settings, managing hundreds or thousands of
buckets individually is error-prone. The S3Control API allows you to set
account-level defaults that apply automatically to all buckets in that
account, both existing buckets and any new ones created in the future.&lt;/p&gt;
&lt;p&gt;Account-level enforcement prevents all public access:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Block all public access for entire account
$ aws s3control put-public-access-block &#92;
  --account-id RGW11111111111111111 &#92;
  --public-access-block-configuration &#92;
    BlockPublicAcls=true,&#92;
    IgnorePublicAcls=true,&#92;
    BlockPublicPolicy=true,&#92;
    RestrictPublicBuckets=true
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Once the account administrator sets this policy using S3Control, regular account
users cannot override it. If a user later tries to disable Block Public Access
on a specific bucket, make a bucket public via ACL, or add a public bucket policy,
all those attempts will fail with &amp;quot;Access Denied.&amp;quot; The account-level setting takes
precedence and cannot be bypassed by bucket-level operations. This creates a
secure-by-default environment in which enabling public access using ACLs at
the bucket level is impossible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What each setting will do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicAcls&lt;/strong&gt;: Prevents new public ACLs from being applied to buckets/objects&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IgnorePublicAcls&lt;/strong&gt;: Ignores existing public ACLs (treats them as private)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicPolicy&lt;/strong&gt;: Prevents bucket policies that grant public access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RestrictPublicBuckets&lt;/strong&gt;: Blocks public access even if policies exist&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Account-level Block Public Access is enforced by the account administrator on
regular users within that account, but the account administrator themselves
can still modify or disable it. For enforcement from a &lt;strong&gt;higher authority&lt;/strong&gt;,
you need organization-level controls. See the next section on Organizational
Units and SCPs, which enable Ceph/RGW cluster administrators to enforce
immutable policies across all accounts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;organizational-units-and-service-control-policies-(future)&quot;&gt;Organizational Units and Service Control Policies (Future) &lt;a class=&quot;link-anchor&quot; href=&quot;#organizational-units-and-service-control-policies-(future)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Status&lt;/strong&gt;: Roadmap item for future Ceph releases&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What it will do&lt;/strong&gt;: Enable cluster administrators to enforce immutable security
policies across multiple accounts—policies that even account administrators cannot disable or modify.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves&lt;/strong&gt;: Account-level controls rely on administrator discipline.
A determined (or compromised) account administrator can disable Block Public Access
or re-enable ACLs. Organization-level controls provide actual enforcement from a
higher authority that cannot be bypassed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example use cases&lt;/strong&gt; (when available):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Immutable Block Public Access&lt;/strong&gt;: Cluster admin sets organization-wide &amp;quot;no public buckets&amp;quot;
policy: account admins cannot disable it&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Required encryption&lt;/strong&gt;: Force all objects to use encryption → accounts cannot opt out&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-account access policies&lt;/strong&gt;: Restrict which accounts can share data with external accounts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit requirements&lt;/strong&gt;: Enforce logging and monitoring so that individual accounts cannot disable them&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This will provide enterprise multi-tenant governance that scales to thousands
of accounts with immutable top-down policy enforcement.&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-ceph&#39;s-enterprise-security-transformation&quot;&gt;Conclusion: Ceph&#39;s Enterprise Security Transformation &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-ceph&#39;s-enterprise-security-transformation&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1767550127997/3f2532cb-2bac-45d2-b99c-b69ff3f7fec6.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The migration from ACLs to IAM represents a fundamental shift in S3 security
philosophy: from decentralized, object-level chaos to centralized, policy-based control.&lt;/p&gt;
&lt;p&gt;Available today in Ceph Squid and later:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;IAM Accounts: Multi-tenant isolation with proper account boundaries&lt;/li&gt;
&lt;li&gt;Cross-account role assumption: Secure data sharing with temporary credentials&lt;/li&gt;
&lt;li&gt;Comprehensive audit logging: Full visibility into who accessed what, when, and how&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Coming soon (active development):&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;BucketOwnerEnforced (Upcoming Tentacle update): Disable ACLs, fix ownership chaos&lt;/li&gt;
&lt;li&gt;S3Control Block Public Access (Tentacle/Umbrella): Account-level public access prevention&lt;/li&gt;
&lt;li&gt;Organizational Units &amp;amp; SCPs (future): Immutable cluster-wide security policies&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Ceph community is making a substantial investment to bring Ceph Object
Gateway (RGW) to full feature parity with AWS S3&#39;s modern security model.
The roadmap is clear, and the commitment is real.&lt;/p&gt;
&lt;p&gt;The modern S3 security model is simpler, safer, and more auditable than ACLs ever
were. ACLs created invisible access paths that security teams couldn&#39;t see. IAM
policies are explicit, centralized, and visible in one place.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Disable ACLs today&lt;/strong&gt;. Your future self will thank you.&lt;/p&gt;
&lt;p&gt;Daniel would like to thank IBM for supporting the community with his time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Breaking the Static Key Habit: Modernizing Ceph RGW S3 Security with STS</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/" />
    <updated>2025-12-18T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/">&lt;h2 id=&quot;introduction%3A-the-usd-148-million-lesson&quot;&gt;Introduction: The USD 148 Million Lesson &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-the-usd-148-million-lesson&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In late 2016, &lt;a href=&quot;https://www.uber.com/en-CH/newsroom/2016-data-incident&quot;&gt;Uber&lt;/a&gt;
learned that intruders had accessed a trove of personal data stored in an
Amazon S3 bucket. The entry point was painfully mundane: attackers accessed
Uber&#39;s source code on GitHub using stolen credentials, found an AWS credential,
and used it to access Uber’s data. That single, long-lived credential exposed
data on roughly 57 million users and 600,000 drivers.&lt;/p&gt;
&lt;p&gt;The breach was bad; the duration risk was worse. Static access keys do not expire.
Once leaked, they remain active until someone notices, locates every instance in
use, and rotates them. That makes credential theft uniquely dangerous in cloud
and S3-style storage, because an attacker can repeatedly return, automate access,
and quietly expand their footprint.&lt;/p&gt;
&lt;p&gt;Uber ultimately agreed to a $148 million multistate settlement related to how
the incident was handled and disclosed. The exact dollar figure is not the
main lesson, though. The lesson is this: a single static key can turn a small
mistake into a durable breach.&lt;/p&gt;
&lt;p&gt;If you are running the Ceph Object Gateway (RGW), you face the same dynamic:
S3 credentials in an application configuration file &lt;code&gt;config.yaml&lt;/code&gt;, embedded
in scripts, or stored in CI/CD variables. Each one is a long-lived credential
that, once copied, can be used from anywhere the S3 endpoint is reachable.&lt;/p&gt;
&lt;p&gt;This post shows you how to eliminate static credentials using Security Token
Service (STS) with temporary credentials that expire automatically. By the end,
you&#39;ll understand how to implement the same security model that prevented these
breaches from being even worse, and how to adapt it for Ceph RGW.&lt;/p&gt;
&lt;h2 id=&quot;the-static-credential-problem&quot;&gt;The Static Credential Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-static-credential-problem&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Let&#39;s take a look at some examples of how most applications access S3 storage today:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# app-config.yaml (application config file)
s3:
  endpoint: https://s3.example.com
  access_key: AKIA1234567890ABCDEF
  secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  bucket: production-data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or with the credentials embedded directly in code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# backup.py
import boto3

s3 = boto3.client(
    &#39;s3&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,
    aws_access_key_id=&#39;AKIA1234567890ABCDEF&#39;,
    aws_secret_access_key=&#39;wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY&#39;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or in environment variables (slightly better, but not by much):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;export AWS_ACCESS_KEY_ID=AKIA1234567890ABCDEF
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;why-this-is-dangerous%3F&quot;&gt;Why Is This Dangerous? &lt;a class=&quot;link-anchor&quot; href=&quot;#why-this-is-dangerous%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;the-permanence-problem&quot;&gt;The Permanence Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-permanence-problem&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The fundamental issue with static credentials is that they never expire. Once
created, these keys authenticate requests indefinitely, working the same on
day one as they do five years later. This creates a dangerous organizational
memory gap. Keys made in 2020 still work in 2025, but no one remembers which
application uses them, what permissions they have, or whether they&#39;re even
still needed. When rotation finally becomes necessary, it requires coordinated
updates across all applications simultaneously, often in the middle of an
incident when coordination is most difficult.&lt;/p&gt;
&lt;h3 id=&quot;key-proliferation&quot;&gt;Key Proliferation &lt;a class=&quot;link-anchor&quot; href=&quot;#key-proliferation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Static credentials spread like a virus through an organization&#39;s infrastructure.
They start in a configuration file for a single application, then get copied
into container images where they&#39;re baked into immutable layers. They&#39;re added
to CI/CD pipelines where they&#39;re shared across multiple projects. Developers
copy them to their laptops for testing, where they sync to cloud backup services.
They end up in documentation and internal wikis, pasted as &amp;quot;helpful examples&amp;quot; for
other teams. Each copy represents another attack vector, another place where the
credentials might leak.&lt;/p&gt;
&lt;h3 id=&quot;the-revocation-nightmare&quot;&gt;The Revocation Nightmare &lt;a class=&quot;link-anchor&quot; href=&quot;#the-revocation-nightmare&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When credentials are eventually stolen (and with this level of exposure, it&#39;s
&lt;em&gt;when&lt;/em&gt;, not &lt;em&gt;if&lt;/em&gt;), the response options are replete with shortcomings. The
credentials work from anywhere where the S3 endpoint is accessible, so there&#39;s
no easy way to distinguish legitimate requests from attacker activity. Revoking
them immediately breaks every application that depends on those keys, forcing
an emergency deployment across potentially dozens of services. The alternative
is to leave them active while attackers maintain access, then race to update
applications before further damage occurs. Organizations need to coordinate
emergency updates during an active security incident, precisely when
coordination is hardest.&lt;/p&gt;
&lt;h3 id=&quot;the-permission-accumulation-problem&quot;&gt;The Permission Accumulation Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-permission-accumulation-problem&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Static keys tend to accumulate permissions over time. They start with minimal
access, but as requirements evolve, it&#39;s easier to grant permissions than to
carefully audit what&#39;s truly necessary. &lt;em&gt;This key needs to read and write,
just to be safe.&lt;/em&gt; &lt;em&gt;Let&#39;s give it access to all buckets; we might expand to
new ones later.&lt;/em&gt; No one wants to risk disrupting production by restricting
access, mainly when credentials are spread across so many systems that tracking
down every usage point seems impossible.&lt;/p&gt;
&lt;h3 id=&quot;the-real-cost&quot;&gt;The Real Cost &lt;a class=&quot;link-anchor&quot; href=&quot;#the-real-cost&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Uber incident shows the real cost of a leaked static key. A single exposed
AWS access key pair leaked sensitive data on roughly 57 million users and
600,000 drivers, and Uber later agreed to a USD 148 million multistate settlement
related to the incident and its handling.&lt;/p&gt;
&lt;p&gt;The uncomfortable truth is that static keys turn small mistakes into persistent
breaches because credentials do not naturally &amp;quot;die&amp;quot;. Without expiration,
containment depends entirely on detection and coordinated rotation across
every place that the key has spread.&lt;/p&gt;
&lt;h2 id=&quot;the-solution%3A-temporary-credentials-via-sts&quot;&gt;The Solution: Temporary Credentials via STS &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-temporary-credentials-via-sts&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Security Token Service (STS) fundamentally reimagines how applications
authenticate with S3. Instead of using permanent credentials that live
forever, applications request temporary credentials that expire automatically
after a defined window, typically between fifteen minutes and twelve hours.
This simple shift transforms the entire security model.&lt;/p&gt;
&lt;p&gt;The mechanics work like this: Applications maintain a minimal service account
that is authorized to assume a role. When the application needs to access S3,
it calls the STS service using those service account credentials to request
temporary credentials for a specific role. STS validates that the service
account is authorized to assume that role, then issues time-limited credentials.
The application uses these temporary credentials for actual S3 operations. When
they expire, the application requests fresh credentials. The entire process is
transparent to the application&#39;s business logic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/sequence.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
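&lt;p&gt;As a minimal sketch of this flow with Boto3 (the endpoint URL, key values, and role ARN below are
placeholders, not values from a real deployment), the application side looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3

# Service account credentials: only allowed to call AssumeRole (placeholder values)
sts = boto3.client(
    &#39;sts&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,        # RGW endpoint (placeholder)
    region_name=&#39;default&#39;,
    aws_access_key_id=&#39;SERVICE_ACCESS_KEY&#39;,
    aws_secret_access_key=&#39;SERVICE_SECRET_KEY&#39;,
)

# Ask STS for time-limited credentials tied to a specific role
resp = sts.assume_role(
    RoleArn=&#39;arn:aws:iam::123456:role/backup-reader&#39;,   # placeholder role ARN
    RoleSessionName=&#39;backup-job-demo&#39;,
    DurationSeconds=3600,                               # one hour, the default
)
creds = resp[&#39;Credentials&#39;]   # AccessKeyId, SecretAccessKey, SessionToken, Expiration

# Use the temporary credentials for the actual S3 operations
s3 = boto3.client(
    &#39;s3&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,
    region_name=&#39;default&#39;,
    aws_access_key_id=creds[&#39;AccessKeyId&#39;],
    aws_secret_access_key=creds[&#39;SecretAccessKey&#39;],
    aws_session_token=creds[&#39;SessionToken&#39;],
)
print(s3.list_objects_v2(Bucket=&#39;backups&#39;).get(&#39;KeyCount&#39;, 0))
&lt;/code&gt;&lt;/pre&gt;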
&lt;h3 id=&quot;the-security-transformation&quot;&gt;The Security Transformation &lt;a class=&quot;link-anchor&quot; href=&quot;#the-security-transformation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With static keys, credentials remain valid indefinitely, and once stolen they keep
working until rotated. STS eliminates this problem through automatic expiration. When an
application calls &lt;code&gt;AssumeRole&lt;/code&gt;, it specifies a &lt;code&gt;DurationSeconds&lt;/code&gt; parameter that
defaults to 3600 seconds (one hour). The temporary credentials returned include
an expiration timestamp that cannot be modified or extended. If an attacker steals
temporary credentials from a compromised server or intercepts them in transit, those
credentials become worthless the moment they expire.&lt;/p&gt;
&lt;p&gt;The audit trail improves dramatically as well. Instead of seeing generic access
key IDs that could be used by any application anywhere, the RGW logs now show
which specific role was assumed (&lt;code&gt;role_name&lt;/code&gt;) and the session name provided when
the role was assumed (&lt;code&gt;role_session_name&lt;/code&gt;). When applications use descriptive
session names that include the application name and a timestamp, security teams
can immediately identify which application and which specific execution generated
each request. This attribution becomes critical during incident response, when
distinguishing legitimate traffic from attacker activity can mean the difference
between containing a breach and suffering a complete data exfiltration.&lt;/p&gt;
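&lt;p&gt;For example, a session name that embeds the application name and a timestamp (a convention,
not a requirement) can be built like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import time

# Descriptive session name: application name plus a timestamp, e.g. &#39;backup-job-1765579042&#39;
role_session_name = f&#39;backup-job-{int(time.time())}&#39;
# RGW records this value alongside the assumed role for every request made with the session
&lt;/code&gt;&lt;/pre&gt;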
&lt;p&gt;Consider the compromise scenario: An attacker gains access to a production server
and dumps memory, capturing the application&#39;s current S3 credentials. With static
keys, this can represent full, ongoing access to your data, potentially for months
before detection. With STS, the attacker has at most one hour before those credentials
expire and become useless. STS is not a silver bullet: it will not stop an attacker
already on the host. It does put every stolen credential on a timer, which sharply
limits persistence and reduces the “evergreen access key” problem. The application
continues to operate normally, automatically refreshing its credentials; incident
response can focus on evicting the attacker and preventing further refreshes rather
than racing to replace long-lived keys everywhere.&lt;/p&gt;
&lt;h4 id=&quot;%22wait%2C-aren&#39;t-we-still-using-a-static-key-to-assume-the-role%3F%22&quot;&gt;&amp;quot;Wait, aren&#39;t we still using a static key to assume the role?&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#%22wait%2C-aren&#39;t-we-still-using-a-static-key-to-assume-the-role%3F%22&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Yes, but with a critical difference. The service account (e.g., &lt;code&gt;backup-service&lt;/code&gt;)
possesses static Access and Secret Keys, but this user has zero permissions to
access S3 data. It cannot list buckets, read objects, or delete data.&lt;/p&gt;
&lt;p&gt;Its only capability is to call the STS API to assume a specific Role. If these
credentials are leaked, an attacker cannot directly steal data. They would have
to know which Role to assume and how to use it, which would add significant
friction. Furthermore, you have traces in the audit logs, and you can rotate
these service keys without disrupting the application&#39;s active S3 sessions.&lt;/p&gt;
&lt;h2 id=&quot;quick-primer%3A-understanding-roles-(just-what-you-need-for-sts)&quot;&gt;Quick Primer: Understanding Roles (Just What You Need for STS) &lt;a class=&quot;link-anchor&quot; href=&quot;#quick-primer%3A-understanding-roles-(just-what-you-need-for-sts)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Roles are part of the IAM (Identity and Access Management) API, which the Ceph
Object Gateway (RGW) implements to provide AWS-compatible identity management.
In this post, we focus on how roles enable STS-based authentication. We&#39;ll dive
deeper into the full IAM capabilities, including users, groups, policies, and
account-level governance, in a specific IAM security post coming soon.&lt;/p&gt;
&lt;h3 id=&quot;the-role-structure&quot;&gt;The Role Structure &lt;a class=&quot;link-anchor&quot; href=&quot;#the-role-structure&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Every role has two policies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Trust Policy - Defines who can assume the role&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Permission Policy - Defines what the role can do once assumed&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here&#39;s the complete flow: Your application holds a minimal service account that
is authorized to assume a role (via the role trust policy, an identity policy,
or both). When it needs to work (e.g., access S3 resources), it calls STS to
assume a role (e.g., &lt;code&gt;backup-reader&lt;/code&gt;). STS checks the role&#39;s trust policy,
validates the request, and issues temporary credentials (access key, secret
key, session token) that inherit the role&#39;s permissions. Those credentials
expire after one hour. The application uses them for S3 operations and
automatically requests new credentials as needed.&lt;/p&gt;
&lt;p&gt;Here is an example Trust Policy (who can assume the role) allowing the
user &lt;code&gt;backup-service&lt;/code&gt; to assume the role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {&amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::123456:user/backup-service&amp;quot;},
    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
  }]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is an example Permission Policy (what the role can do),
allowing read-only access to the bucket &lt;code&gt;backups&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:GetObject&amp;quot;, &amp;quot;s3:ListBucket&amp;quot;],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::backups&amp;quot;,
      &amp;quot;arn:aws:s3:::backups/*&amp;quot;
    ]
  }]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this post, we&#39;ll use inline policies (policies embedded directly in the role).
There are other canned policy types available in the IAM API, which we&#39;ll cover
in a future IAM post.&lt;/p&gt;
&lt;h3 id=&quot;beyond-service-accounts%3A-single-sign-on-authentication-with-oidc-integration&quot;&gt;Beyond Service Accounts: Single Sign-on Authentication with OIDC Integration &lt;a class=&quot;link-anchor&quot; href=&quot;#beyond-service-accounts%3A-single-sign-on-authentication-with-oidc-integration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The pattern we&#39;ll implement uses a service account with static credentials to
assume roles. However, RGW also supports &lt;code&gt;AssumeRoleWithWebIdentity&lt;/code&gt;, which
allows applications to assume roles using tokens from an enterprise identity
provider (such as RHSSO (Keycloak), IBM Security Verify, etc.) via OpenID
Connect (OIDC). This eliminates the need for static credentials: applications
authenticate via your existing SSO system to obtain a JWT, which they then use
to request a temporary credential directly from the STS API. This is the most
secure option for organizations with mature identity infrastructure, though it
requires additional OIDC provider configuration in RGW. We&#39;ll cover this
advanced pattern in a future post on identity federation.&lt;/p&gt;
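&lt;p&gt;As a rough sketch only (the endpoint, role ARN, and token below are placeholders, and the OIDC
provider setup is out of scope here), the call differs from &lt;code&gt;AssumeRole&lt;/code&gt; mainly in passing
the JWT instead of signing with static keys:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3

# Sketch only: the JWT comes from your OIDC provider (e.g. Keycloak) after SSO authentication
sts = boto3.client(&#39;sts&#39;, endpoint_url=&#39;https://s3.example.com&#39;, region_name=&#39;default&#39;)

creds = sts.assume_role_with_web_identity(
    RoleArn=&#39;arn:aws:iam::123456:role/backup-reader&#39;,   # placeholder role ARN
    RoleSessionName=&#39;backup-job-oidc&#39;,
    WebIdentityToken=&#39;eyJhbGciOi...&#39;,                    # JWT obtained from the identity provider
)[&#39;Credentials&#39;]
# The request is authenticated by the JWT itself rather than by static keys
&lt;/code&gt;&lt;/pre&gt;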
&lt;h2 id=&quot;implementing-sts-in-ceph-rgw%3A-step-by-step&quot;&gt;Implementing STS in Ceph RGW: Step by Step &lt;a class=&quot;link-anchor&quot; href=&quot;#implementing-sts-in-ceph-rgw%3A-step-by-step&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This implementation builds on the IAM foundation covered
in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/enhancing-ceph-multitenancy-with-iam-accounts&quot;&gt;Enhancing Ceph Multitenancy with IAM Accounts.&lt;/a&gt;
If you&#39;re new to Ceph IAM accounts, that post covers account creation, user
management, and policy basics. Here, we focus specifically on enabling STS
and using roles for temporary credentials.&lt;/p&gt;
&lt;p&gt;Let&#39;s build on an example use case. We&#39;ll create a role for a backup service
that needs read-only access to a specific bucket.&lt;/p&gt;
&lt;p&gt;To follow this guide, you will need:&lt;/p&gt;
&lt;p&gt;Admin access to the Ceph cluster: SSH access to a node where you can
run &lt;code&gt;ceph&lt;/code&gt; and &lt;code&gt;radosgw-admin&lt;/code&gt; commands.&lt;/p&gt;
&lt;p&gt;AWS CLI: Installed on your workstation to interact with the RGW S3 endpoint.&lt;/p&gt;
&lt;p&gt;Python 3 and Boto3: For running the automation scripts (&lt;code&gt;pip install boto3&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Ceph Squid or later: While basic STS works on older versions, the IAM Accounts
feature used in this guide requires Ceph Squid (19.2.0) or newer.&lt;/p&gt;
&lt;h3 id=&quot;step-1%3A-enable-sts-in-rgw-configuration&quot;&gt;Step 1: Enable STS in RGW Configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#step-1%3A-enable-sts-in-rgw-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;STS must be explicitly enabled in your RGW configuration. The configuration
uses the Ceph config database and requires two settings.&lt;/p&gt;
&lt;p&gt;Generate a secure STS key (must be exactly 16 characters):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Generate a 16-character random key
$ openssl rand -hex 8
# Example output: 0a1b2c3d4e5f6789
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure RGW to use STS:&lt;/p&gt;
&lt;p&gt;Most deployments use &lt;code&gt;client.rgw.default&lt;/code&gt; as the RGW client identifier. If your
deployment uses a custom service name, replace &lt;code&gt;default&lt;/code&gt; with your service name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Set the STS encryption key (MUST be exactly 16 characters)
$ ceph config set client.rgw.default rgw_sts_key 0a1b2c3d4e5f6789

# Enable STS authentication
$ ceph config set client.rgw.default rgw_s3_auth_use_sts true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Ceph-Specific Configuration Note&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Unlike AWS, where STS is a global service enabled by default, Ceph requires you
to explicitly configure the encryption key used to sign the session tokens.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Critical Requirement&lt;/em&gt;: The &lt;code&gt;rgw_sts_key&lt;/code&gt; must be exactly 16 characters long.
If it is 15 or 17 characters, the STS handshake will fail silently or with
opaque 500 errors.&lt;/p&gt;
&lt;p&gt;Restart all RGW instances to apply changes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# For default service
$ ceph orch restart client.rgw
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config get client.rgw.default rgw_s3_auth_use_sts
$ ceph config get client.rgw.default rgw_sts_key
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;step-2%3A-create-iam-account%2C-root-user%2C-and-service-user&quot;&gt;Step 2: Create IAM Account, Root User, and Service User &lt;a class=&quot;link-anchor&quot; href=&quot;#step-2%3A-create-iam-account%2C-root-user%2C-and-service-user&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;IAM accounts provide multi-tenancy and resource organization. We&#39;ll create an
account, a root user for administrative tasks, and a restricted service user
for applications.&lt;/p&gt;
&lt;h4 id=&quot;create-the-iam-account%3A&quot;&gt;Create the IAM account: &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-iam-account%3A&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin account create  --account-name=backup-team 
{
    &amp;quot;id&amp;quot;: &amp;quot;RGW89761398048153888&amp;quot;,
    &amp;quot;tenant&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;backup-team&amp;quot;,
    &amp;quot;email&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;quota&amp;quot;: {
        &amp;quot;enabled&amp;quot;: false,
        &amp;quot;check_on_raw&amp;quot;: false,
        &amp;quot;max_size&amp;quot;: -1,
        &amp;quot;max_size_kb&amp;quot;: 0,
        &amp;quot;max_objects&amp;quot;: -1
    },
    &amp;quot;bucket_quota&amp;quot;: {
        &amp;quot;enabled&amp;quot;: false,
        &amp;quot;check_on_raw&amp;quot;: false,
        &amp;quot;max_size&amp;quot;: -1,
        &amp;quot;max_size_kb&amp;quot;: 0,
        &amp;quot;max_objects&amp;quot;: -1
    },
    &amp;quot;max_users&amp;quot;: 1000,
    &amp;quot;max_roles&amp;quot;: 1000,
    &amp;quot;max_groups&amp;quot;: 1000,
    &amp;quot;max_buckets&amp;quot;: 1000,
    &amp;quot;max_access_keys&amp;quot;: 4
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;create-the-account-root-user-(for-administrative-tasks)&quot;&gt;Create the Account Root User (for administrative tasks) &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-account-root-user-(for-administrative-tasks)&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The account root user has full permissions on all resources within the account
by default, including the ability to use the IAM API to create roles and manage
policies. This is built into the account system; no additional capabilities are
needed.&lt;/p&gt;
&lt;h4 id=&quot;create-the-root-user-for-the-account%3A&quot;&gt;Create the root user for the account: &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-root-user-for-the-account%3A&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin user create --account-id=RGW89761398048153888 &#92;
  --uid=backup-admin --display-name=&amp;quot;Backup-Team-Admin&amp;quot; &#92;
  --account-root --gen-access-key --gen-secret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--account-root&lt;/code&gt; flag is critical: it designates this user as the account&#39;s
root user, granting full administrative permissions within the account&#39;s scope.&lt;/p&gt;
&lt;p&gt;The Ceph documentation states that: &lt;em&gt;Account owners are encouraged to use this
account root user for management only, and create users and roles with fine-grained
permissions for specific applications.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For this tutorial, we&#39;ll use the root user for setup tasks to keep things simple.
In production, you would typically use the root user to set up IAM users with
specific permissions, then remove or restrict the root user&#39;s credentials.&lt;/p&gt;
&lt;h3 id=&quot;create-the-backup-service-user-(for-applications)&quot;&gt;Create the Backup Service User (for applications) &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-backup-service-user-(for-applications)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This user will have minimal permissions, only the ability to assume roles.
No direct access to S3 resources.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin user create &#92;
  --account-id=RGW89761398048153888 &#92;
  --uid=backup-service &#92;
  --display-name=&amp;quot;backup-service&amp;quot; &#92;
  --gen-access-key &#92;
  --gen-secret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;The service account has no S3 permissions and no IAM capabilities. It can only
assume roles that explicitly trust it.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&quot;configure-aws-cli-profiles&quot;&gt;Configure AWS CLI Profiles &lt;a class=&quot;link-anchor&quot; href=&quot;#configure-aws-cli-profiles&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Configure two AWS CLI profiles, one for each user. Each profile contains the user&#39;s
credentials and the RGW/STS endpoint URL, so we don’t need to specify the endpoint
on each &lt;code&gt;AWS CLI&lt;/code&gt; command. See
the &lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html&quot;&gt;AWS CLI configuration documentation&lt;/a&gt;
for details.&lt;/p&gt;
&lt;h3 id=&quot;aws-profile-summary-for-this-setup%3A&quot;&gt;AWS Profile summary for this setup: &lt;a class=&quot;link-anchor&quot; href=&quot;#aws-profile-summary-for-this-setup%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;backup-admin&lt;/code&gt; profile: Uses root user credentials, S3/IAM/STS endpoint &lt;a href=&quot;https://s3.cephlabs.com&quot;&gt;https://s3.cephlabs.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;backup-service&lt;/code&gt; profile: Uses service account credentials, S3/IAM/STS endpoint &lt;a href=&quot;https://s3.cephlabs.com&quot;&gt;https://s3.cephlabs.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is an example &lt;code&gt;.aws/config&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ini&quot;&gt;[profile backup-admin]
region = default
output = json
services = ceph-rgw

[profile backup-service]
region = default
output = json
services = ceph-rgw

[services ceph-rgw]
s3 =
  endpoint_url = https://s3.cephlabs.com
s3api =
  endpoint_url = https://s3.cephlabs.com
iam =
  endpoint_url = https://s3.cephlabs.com
sts =
  endpoint_url = https://s3.cephlabs.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify both profiles:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Test root user (should work - has full permissions)
$ aws s3 ls --profile backup-admin

# Test service user (should fail - has no S3 permissions yet)
$ aws s3 ls --profile backup-service
# Expected to fail: the RGW logs show AccessDenied; the CLI may print an error such as:
argument of type &#39;NoneType&#39; is not iterable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;identity-summary&quot;&gt;Identity Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#identity-summary&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;At this point, you have two users in the IAM account:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;User&lt;/th&gt;&lt;th&gt;Type&lt;/th&gt;&lt;th&gt;Permissions&lt;/th&gt;&lt;th&gt;Used For&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;backup-admin&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Account root user (&lt;code&gt;--account-root&lt;/code&gt;)&lt;/td&gt;&lt;td&gt;Full permissions on all account resources + IAM API access&lt;/td&gt;&lt;td&gt;Creating buckets, creating/managing roles via AWS CLI&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;backup-service&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Regular user&lt;/td&gt;&lt;td&gt;None (can only assume roles)&lt;/td&gt;&lt;td&gt;Running backup applications with temporary credentials&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&quot;step-3%3A-create-the-backup-bucket&quot;&gt;Step 3: Create the Backup Bucket &lt;a class=&quot;link-anchor&quot; href=&quot;#step-3%3A-create-the-backup-bucket&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Run this as the backup admin user (who has S3 permissions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3 mb s3://backups --profile backup-admin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why the admin user? The service account (&lt;code&gt;backup-service&lt;/code&gt;) has no S3 permissions
yet; it can only assume roles. The admin user creates the infrastructure (buckets),
then creates roles that grant specific permissions to those buckets.&lt;/p&gt;
&lt;p&gt;Verify the bucket exists:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3 ls --profile backup-admin
2025-12-12 17:09:25 backups
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;step-4%3A-create-the-role&quot;&gt;Step 4: Create the Role &lt;a class=&quot;link-anchor&quot; href=&quot;#step-4%3A-create-the-role&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Run these commands as the account root user (&lt;code&gt;backup-admin&lt;/code&gt;),
who has full IAM API permissions.&lt;/p&gt;
&lt;p&gt;Create a role trust policy (who can assume this role):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat &amp;gt; trust-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {&amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:user/backup-service&amp;quot;},
    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;ARNs in IAM Accounts (Ceph Object Gateway): In the IAM Accounts model, the user
ARN is built from the account ID plus the user name; in Ceph this “name” corresponds
to the user’s display-name (not the &lt;code&gt;--uid&lt;/code&gt;). If your &lt;code&gt;--uid&lt;/code&gt; and &lt;code&gt;--display-name&lt;/code&gt;
differ, ensure that your trust policy &lt;code&gt;Principal&lt;/code&gt; ARN uses the display-name value,
or the AssumeRole request will not match.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Authorization to assume a role can be granted in two ways. In this tutorial we grant
it via the role trust policy by naming the service user as the &lt;code&gt;Principal&lt;/code&gt;. In
same-account setups, this is sufficient; no user policy is required. If you instead
trust the whole account or you are doing cross-account access, attach an identity
policy to the user or group allowing &lt;code&gt;sts:AssumeRole&lt;/code&gt; on the specific role ARN.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Create the role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam create-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --assume-role-policy-document file://trust-policy.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create permission policy (what the role can do):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat &amp;gt; permissions-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Action&amp;quot;: [
      &amp;quot;s3:GetObject&amp;quot;,
      &amp;quot;s3:ListBucket&amp;quot;
    ],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::backups&amp;quot;,
      &amp;quot;arn:aws:s3:::backups/*&amp;quot;
    ]
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Attach permissions to role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam put-role-policy &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --policy-name backup-read-policy &#92;
  --policy-document file://permissions-policy.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify that the role was created:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam get-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader
{
    &amp;quot;Role&amp;quot;: {
        &amp;quot;Path&amp;quot;: &amp;quot;/&amp;quot;,
        &amp;quot;RoleName&amp;quot;: &amp;quot;backup-reader&amp;quot;,
        &amp;quot;RoleId&amp;quot;: &amp;quot;8c8eec8c-c647-42bb-8a53-36c6d2fc747a&amp;quot;,
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:role/backup-reader&amp;quot;,
        &amp;quot;CreateDate&amp;quot;: &amp;quot;2025-12-12T22:10:18.644Z&amp;quot;,
        &amp;quot;AssumeRolePolicyDocument&amp;quot;: {
            &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
            &amp;quot;Statement&amp;quot;: [
                {
                    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
                    &amp;quot;Principal&amp;quot;: {
                        &amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:user/backup-service&amp;quot;
                    },
                    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
                }
            ]
        },
        &amp;quot;Description&amp;quot;: &amp;quot;&amp;quot;,
        &amp;quot;MaxSessionDuration&amp;quot;: 3600
    }
}

$ aws --profile backup-service sts assume-role --role-arn &amp;quot;arn:aws:iam::RGW89761398048153888:role/backup-reader&amp;quot; --output json --role-session-name testbr
{
    &amp;quot;Credentials&amp;quot;: {
        &amp;quot;AccessKeyId&amp;quot;: &amp;quot;reUwxxxxxxn&amp;quot;,
        &amp;quot;SecretAccessKey&amp;quot;: &amp;quot;CQGxxxxxxx&amp;quot;,
        &amp;quot;SessionToken&amp;quot;: &amp;quot;nADwRdQ5xxxx90qMZlDPl4ozBjcQKF1tceytgNVGD5D4h2FpoMvjybl31cXI9uh/nUrQePW+Ob3TmpMa4QXdXfml/gQYSYeQLJEzNncQPUQB9+QUl5TShDy4RYYziRulTMWrkYokL6kI0uN0LksQ56/qOyd59A1qbWtsBNYBdvxUUi7r3lhrifn4MNWQbErJKCVNdVOBSzN1L34JDMvjEqN2QyKWLQI16D+XhCq8V05OnQFMHsf128BealrX+KkWS6+74G960WzoHzWDwHF1uO08VlFYCdHO0A==&amp;quot;,
        &amp;quot;Expiration&amp;quot;: &amp;quot;2025-12-12T23:34:00.247844317Z&amp;quot;
    },
    &amp;quot;AssumedRoleUser&amp;quot;: {
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:sts::RGW89761398048153888:assumed-role/backup-reader/testbr&amp;quot;
    },
    &amp;quot;PackedPolicySize&amp;quot;: 0
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;step-5%3A-write-application-code-(python-example)&quot;&gt;Step 5: Write Application Code (Python Example) &lt;a class=&quot;link-anchor&quot; href=&quot;#step-5%3A-write-application-code-(python-example)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This code runs with the service account credentials (&lt;code&gt;backup-service&lt;/code&gt;), which have
no direct S3 access. The application calls STS to assume the &lt;code&gt;backup-reader&lt;/code&gt;
role and receives temporary credentials for S3 operations.&lt;/p&gt;
&lt;p&gt;Identity flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The application starts with &lt;code&gt;backup-service&lt;/code&gt; credentials (long-term, minimal permissions)&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;AssumeRole&lt;/code&gt; using those credentials to request the &lt;code&gt;backup-reader&lt;/code&gt; role&lt;/li&gt;
&lt;li&gt;Receives temporary credentials (access key + secret + session token)&lt;/li&gt;
&lt;li&gt;Uses temporary credentials for all S3 operations&lt;/li&gt;
&lt;li&gt;Temporary credentials expire after 1 hour (or configured duration)&lt;/li&gt;
&lt;li&gt;Application manually checks expiration before each operation and refreshes if needed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Upload test file (as admin user who has write permissions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ echo &amp;quot;test backup data&amp;quot; &amp;gt; test-backup.txt
$ aws s3 cp test-backup.txt s3://backups/ --profile backup-admin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To run the script, download it from the GitHub Gist and export the
&lt;code&gt;backup-service&lt;/code&gt; user&#39;s credentials and the endpoint as environment variables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Download script
$ wget -O backup_service.py https://gist.githubusercontent.com/likid0/f7c40c4851bf32c595c7a5e63cf21f35/raw/137bfeea46c20d46d37fa026e29f1b5193c3e281/gistfile1.txt
# Make it executable
$ chmod +x backup_service.py
# Export Vars
$ export AWS_ACCESS_KEY_ID=&#39;AKIAIOSFODNN7EXAMPLE&#39;
$ export AWS_SECRET_ACCESS_KEY=&#39;wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY&#39;
$ export S3_ENDPOINT_URL=&#39;https://s3.example.com&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the example &lt;code&gt;backup_service.py&lt;/code&gt; script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ python backup_service.py

================================================================================
 Backup Service - STS Temporary Credentials Demo
================================================================================

Configuration:
  Endpoint: https://s3.cephlabs.com
  User:     backup-service
  Key:      VQQPNR4XOW...

Note: These are the service account&#39;s PERMANENT credentials
      They have NO direct S3 permissions (can only assume roles)

================================================================================
Calling AssumeRole API to get temporary credentials...
================================================================================

AssumeRole Parameters:
  RoleArn:         arn:aws:iam::RGW89761398048153888:role/backup-reader
  RoleSessionName: backup-job-1765579042
  DurationSeconds: 3600 (1 hour)

Authentication:  Using service account credentials (backup-service)
                 AccessKey: VQQPNR4XOW...

Calling STS endpoint: https://s3.cephlabs.com

SUCCESS! Received temporary credentials:
  AccessKeyId:     YAhacPIIT4BcUWiyPC0M...
  SecretAccessKey: S4VSWFTM2U... (redacted)
  SessionToken:    Yn3A4Mt4VGQoIvloer2ByH3aecQAeP... (redacted)
  Expiration:      2025-12-12 23:37:22.878813+00:00
================================================================================


 Listing backups in bucket &#39;backups&#39;...
   Found 1 object(s):

    test-backup.txt
      Size: 0.00 MB (17 bytes)
      Modified: 2025-12-12 22:11:23.474000+00:00


================================================================================
Demo completed successfully!
================================================================================
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example script uses manual credential checking: the &lt;code&gt;_check_credentials()&lt;/code&gt;
method checks expiration time before each operation and calls &lt;code&gt;_refresh_credentials()&lt;/code&gt;
when needed. This is simple and works well for most use cases.&lt;/p&gt;
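&lt;p&gt;A condensed sketch of that manual pattern (not the full Gist; the endpoint and the
five-minute refresh margin are illustrative) looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import datetime, timedelta, timezone

import boto3


class BackupService:
    # Minimal sketch of manual credential checking; details are illustrative

    def __init__(self, sts_client, role_arn):
        self.sts = sts_client
        self.role_arn = role_arn
        self.creds = None
        self.expiration = None

    def _refresh_credentials(self):
        resp = self.sts.assume_role(
            RoleArn=self.role_arn,
            RoleSessionName=f&#39;backup-job-{int(datetime.now(timezone.utc).timestamp())}&#39;,
            DurationSeconds=3600,
        )
        self.creds = resp[&#39;Credentials&#39;]
        self.expiration = self.creds[&#39;Expiration&#39;]   # timezone-aware datetime

    def _check_credentials(self):
        # Refresh if missing, or within five minutes of expiry, to avoid failing mid-request
        now = datetime.now(timezone.utc)
        if self.creds is None or now &amp;gt; self.expiration - timedelta(minutes=5):
            self._refresh_credentials()

    def list_backups(self, bucket):
        self._check_credentials()
        s3 = boto3.client(
            &#39;s3&#39;,
            endpoint_url=&#39;https://s3.example.com&#39;,   # placeholder RGW endpoint
            region_name=&#39;default&#39;,
            aws_access_key_id=self.creds[&#39;AccessKeyId&#39;],
            aws_secret_access_key=self.creds[&#39;SecretAccessKey&#39;],
            aws_session_token=self.creds[&#39;SessionToken&#39;],
        )
        resp = s3.list_objects_v2(Bucket=bucket)
        return [obj[&#39;Key&#39;] for obj in resp.get(&#39;Contents&#39;, [])]
&lt;/code&gt;&lt;/pre&gt;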
&lt;p&gt;For long-running jobs (hours or days), see the &amp;quot;Handling Long-Running Jobs:
Credential Refresh Strategies&amp;quot; section later in this post, which covers
automatic credential refresh using Boto3&#39;s &lt;code&gt;RefreshableCredentials&lt;/code&gt;. With
automatic refresh, Boto3 handles the timing and renewal for you so you never
have to think about expiration.&lt;/p&gt;
&lt;h2 id=&quot;handling-long-running-jobs%3A-credential-refresh-strategies&quot;&gt;Handling Long-Running Jobs: Credential Refresh Strategies &lt;a class=&quot;link-anchor&quot; href=&quot;#handling-long-running-jobs%3A-credential-refresh-strategies&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A critical consideration for production deployments is handling jobs that run
longer than the credential lifetime.&lt;/p&gt;
&lt;h3 id=&quot;the-challenge&quot;&gt;The Challenge &lt;a class=&quot;link-anchor&quot; href=&quot;#the-challenge&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;DurationSeconds&lt;/code&gt; parameter controls how long the temporary credentials remain valid:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Minimum: 900 seconds (15 minutes), configurable via &lt;code&gt;rgw_sts_min_session_duration&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Default: 3600 seconds (1 hour)&lt;/li&gt;
&lt;li&gt;Maximum: limited by the role&#39;s &lt;code&gt;max_session_duration&lt;/code&gt; attribute (defaults to 3600)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a role is created, it has a &lt;code&gt;max_session_duration&lt;/code&gt; of 3600 seconds by default.
This means even if you request &lt;code&gt;DurationSeconds=7200&lt;/code&gt; (2 hours), the request
will be limited to the role&#39;s maximum. To allow longer sessions, you would
need to modify the role&#39;s &lt;code&gt;max_session_duration&lt;/code&gt; when creating it (though for
security, shorter durations are recommended).&lt;/p&gt;
&lt;p&gt;Here, we share three example strategies for handling this.&lt;/p&gt;
&lt;h3 id=&quot;strategy-1%3A-increase-token-duration-(up-to-12-hours)&quot;&gt;Strategy 1: Increase Token Duration (Up to 12 Hours) &lt;a class=&quot;link-anchor&quot; href=&quot;#strategy-1%3A-increase-token-duration-(up-to-12-hours)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The most straightforward approach is to request longer-lived credentials and
configure the role to allow them.&lt;/p&gt;
&lt;h4 id=&quot;configure-maximum-session-duration-on-the-role%3A&quot;&gt;Configure maximum session duration on the role: &lt;a class=&quot;link-anchor&quot; href=&quot;#configure-maximum-session-duration-on-the-role%3A&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When creating the role, you can set a maximum session duration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam create-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --assume-role-policy-document file://trust-policy.json &#92;
  --max-session-duration 43200  # 12 hours
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or modify an existing role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam update-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --max-session-duration 43200  # 12 hours
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the setting:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam get-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --query &#39;Role.MaxSessionDuration&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;RGW Configuration: the following config option controls the global maximum:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ceph config set client.rgw.default rgw_sts_max_session_duration 43200
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maximum duration in Ceph RGW: 12 hours (43,200 seconds)&lt;/li&gt;
&lt;li&gt;Not an ideal solution, as it extends the duration of the tokens to twelve hours&lt;/li&gt;
&lt;li&gt;Suitable for jobs that can be completed within 12 hours&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;strategy-2%3A-automatic-credential-refresh-with-refreshablecredentials&quot;&gt;Strategy 2: Automatic Credential Refresh with RefreshableCredentials &lt;a class=&quot;link-anchor&quot; href=&quot;#strategy-2%3A-automatic-credential-refresh-with-refreshablecredentials&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For jobs longer than 12 hours, or to avoid managing token duration,
implement automatic refresh using botocore&#39;s &lt;code&gt;RefreshableCredentials&lt;/code&gt;.
This pattern continuously calls &lt;code&gt;AssumeRole&lt;/code&gt; to get fresh credentials
before expiration.&lt;/p&gt;
&lt;p&gt;An enhanced &lt;code&gt;BackupService&lt;/code&gt; example script with STS token Auto-Refresh
is available &lt;a href=&quot;https://gist.github.com/likid0/25519b2f46b63de89f7fe0d2dc9ff283&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;How it works (a minimal code sketch follows this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;RefreshableCredentials&lt;/code&gt; wraps your credential fetching logic&lt;/li&gt;
&lt;li&gt;Before each AWS API call, Boto3 checks if credentials are expired or expiring soon&lt;/li&gt;
&lt;li&gt;If needed, &lt;code&gt;boto3&lt;/code&gt; automatically calls &lt;code&gt;_refresh_credentials()&lt;/code&gt; to get fresh credentials&lt;/li&gt;
&lt;li&gt;Your application never sees authentication errors due to expiration&lt;/li&gt;
&lt;li&gt;Each refresh calls &lt;code&gt;AssumeRole&lt;/code&gt; using the original service account credentials&lt;/li&gt;
&lt;/ul&gt;
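&lt;p&gt;A rough sketch of this wiring (the linked Gist is more complete; the endpoint, keys, and
role ARN below are placeholders, and assigning the session&#39;s private &lt;code&gt;_credentials&lt;/code&gt;
attribute is a common pattern rather than a stable API) might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session

STS_ENDPOINT = &#39;https://s3.example.com&#39;               # placeholder RGW/STS endpoint
ROLE_ARN = &#39;arn:aws:iam::123456:role/backup-reader&#39;   # placeholder role ARN


def fetch_sts_credentials():
    # Called by botocore whenever the cached credentials are about to expire
    sts = boto3.client(
        &#39;sts&#39;,
        endpoint_url=STS_ENDPOINT,
        region_name=&#39;default&#39;,
        aws_access_key_id=&#39;SERVICE_ACCESS_KEY&#39;,       # backup-service long-term key (placeholder)
        aws_secret_access_key=&#39;SERVICE_SECRET_KEY&#39;,
    )
    creds = sts.assume_role(
        RoleArn=ROLE_ARN,
        RoleSessionName=&#39;backup-job-auto-refresh&#39;,
    )[&#39;Credentials&#39;]
    return {
        &#39;access_key&#39;: creds[&#39;AccessKeyId&#39;],
        &#39;secret_key&#39;: creds[&#39;SecretAccessKey&#39;],
        &#39;token&#39;: creds[&#39;SessionToken&#39;],
        &#39;expiry_time&#39;: creds[&#39;Expiration&#39;].isoformat(),
    }


# Wrap the fetcher so botocore refreshes transparently before each API call if needed
refreshable = RefreshableCredentials.create_from_metadata(
    metadata=fetch_sts_credentials(),
    refresh_using=fetch_sts_credentials,
    method=&#39;sts-assume-role&#39;,
)
botocore_session = get_session()
botocore_session._credentials = refreshable   # private attribute; common pattern, not a stable API
session = boto3.Session(botocore_session=botocore_session, region_name=&#39;default&#39;)

s3 = session.client(&#39;s3&#39;, endpoint_url=STS_ENDPOINT)
s3.list_objects_v2(Bucket=&#39;backups&#39;)   # credentials refresh automatically as needed
&lt;/code&gt;&lt;/pre&gt;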
&lt;p&gt;Key Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works for jobs of any length (days, weeks)&lt;/li&gt;
&lt;li&gt;No manual credential management needed&lt;/li&gt;
&lt;li&gt;Boto3 handles refresh timing automatically&lt;/li&gt;
&lt;li&gt;Original service account credentials remain secure (never exposed to S3 operations)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Important Notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The service account&#39;s long-term credentials must remain valid for the entire job&lt;/li&gt;
&lt;li&gt;Each refresh makes a new &lt;code&gt;AssumeRole&lt;/code&gt; call to STS&lt;/li&gt;
&lt;li&gt;Credentials are cached in memory only (not written to disk)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;strategy-3%3A-use-third-party-libraries&quot;&gt;Strategy 3: Use Third-Party Libraries &lt;a class=&quot;link-anchor&quot; href=&quot;#strategy-3%3A-use-third-party-libraries&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you prefer not to work with botocore internals, use a well-maintained library:&lt;/p&gt;
&lt;p&gt;Install the library &lt;code&gt;aws-assume-role-lib&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ pip install aws-assume-role-lib
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference the library in code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3
import aws_assume_role_lib

# Create session with automatic refresh
parent_session = boto3.Session(
    aws_access_key_id=&#39;BACKUP_SERVICE_KEY&#39;,
    aws_secret_access_key=&#39;your-secret-key&#39;
)

# This session automatically refreshes expired credentials
assumed_role_session = aws_assume_role_lib.assume_role(
    parent_session, 
    &#39;arn:aws:iam::RGW12345678901234567:role/backup-reader&#39;
)

# Use it like any boto3 session
s3 = assumed_role_session.client(&#39;s3&#39;, endpoint_url=&#39;https://s3.example.com&#39;)
s3.list_buckets()  # Credentials auto-refresh as needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;static-key-rotation%3A-completing-the-security-model&quot;&gt;Static Key Rotation: Completing the Security Model &lt;a class=&quot;link-anchor&quot; href=&quot;#static-key-rotation%3A-completing-the-security-model&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You&#39;ve now implemented STS for temporary credentials, but there&#39;s one
final layer to complete the security architecture: rotating the service
account&#39;s static keys.&lt;/p&gt;
&lt;h3 id=&quot;background%3A-the-create_date-field&quot;&gt;Background: The create_date Field &lt;a class=&quot;link-anchor&quot; href=&quot;#background%3A-the-create_date-field&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Starting with Tentacle, Ceph RGW now includes a &lt;code&gt;create_date&lt;/code&gt; timestamp for
each access key in the user metadata. This addition enables programmatic key
age tracking and automated rotation: a critical capability for eliminating
static credential risk.&lt;/p&gt;
&lt;p&gt;Example output from &lt;code&gt;radosgw-admin user info&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;user_id&amp;quot;: &amp;quot;backup-service&amp;quot;,
    &amp;quot;keys&amp;quot;: [{
        &amp;quot;user&amp;quot;: &amp;quot;backup-service&amp;quot;,
        &amp;quot;access_key&amp;quot;: &amp;quot;XXXXXXXX&amp;quot;,
        &amp;quot;secret_key&amp;quot;: &amp;quot;XtDhTWsb6vkNOsAnWBXSIhDhqdRBYXXXXXXX&amp;quot;,
        &amp;quot;active&amp;quot;: true,
        &amp;quot;create_date&amp;quot;: &amp;quot;2025-12-12T22:02:16.628205Z&amp;quot;  ← Key creation timestamp
    }]
}
&lt;/code&gt;&lt;/pre&gt;
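&lt;p&gt;One simple way to act on this field is to parse the &lt;code&gt;radosgw-admin user info&lt;/code&gt; output
and flag keys older than your rotation threshold. A minimal sketch (the threshold and user ID
are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
import subprocess
from datetime import datetime, timezone

MAX_KEY_AGE_DAYS = 30   # illustrative rotation policy

# Requires admin access to a node where radosgw-admin is available
out = subprocess.run(
    [&#39;radosgw-admin&#39;, &#39;user&#39;, &#39;info&#39;, &#39;--uid=backup-service&#39;],
    capture_output=True, text=True, check=True,
).stdout
user = json.loads(out)

for key in user.get(&#39;keys&#39;, []):
    created = datetime.fromisoformat(key[&#39;create_date&#39;].replace(&#39;Z&#39;, &#39;+00:00&#39;))
    age_days = (datetime.now(timezone.utc) - created).days
    if age_days &amp;gt; MAX_KEY_AGE_DAYS:
        print(f&#39;Key {key[&amp;quot;access_key&amp;quot;]} is {age_days} days old: rotate it&#39;)
&lt;/code&gt;&lt;/pre&gt;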
&lt;h3 id=&quot;recommended-approach%3A-use-a-secrets-manager&quot;&gt;Recommended Approach: Use a Secrets Manager &lt;a class=&quot;link-anchor&quot; href=&quot;#recommended-approach%3A-use-a-secrets-manager&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The best way to implement key rotation is with a secrets manager such as
HashiCorp Vault, IBM GKLM, AWS Secrets Manager, Google Secret Manager,
or Azure Key Vault. This approach enables zero-downtime rotation
without code changes.&lt;/p&gt;
&lt;p&gt;How it works (a code sketch of the read path follows this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Application queries secrets manager (no hardcoded credentials):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Application starts up and queries Vault/secrets manager for credentials&lt;/li&gt;
&lt;li&gt;Gets current &lt;code&gt;access_key&lt;/code&gt; and &lt;code&gt;secret_key&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Uses these to call &lt;code&gt;AssumeRole&lt;/code&gt; and get temporary STS credentials&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When keys rotate (automated monthly rotation using the &lt;code&gt;create_date&lt;/code&gt; field time stamp):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate a new Ceph access key with &lt;code&gt;radosgw-admin key create&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Update the secret in Vault/secrets manager with new credentials&lt;/li&gt;
&lt;li&gt;Keep both old and new keys active in Ceph for a 7-day transition period&lt;/li&gt;
&lt;li&gt;After 7 days, remove the old key from Ceph&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Application automatically gets new keys:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The next time the application restarts or refreshes credentials, it queries the secrets manager&lt;/li&gt;
&lt;li&gt;Gets the new credentials automatically&lt;/li&gt;
&lt;li&gt;No code changes required: the application doesn&#39;t know rotation happened&lt;/li&gt;
&lt;li&gt;No downtime: the old key still works during the transition&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
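&lt;p&gt;As an illustration of the read path only, here is a minimal sketch that assumes the
service keys are stored in a HashiCorp Vault KV v2 secret (the Vault URL, path, field names,
endpoint, and role ARN are assumptions) and fetched with the &lt;code&gt;hvac&lt;/code&gt; client:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3
import hvac   # pip install hvac

# 1. Fetch the current service-account keys from Vault (path and field names are assumptions)
vault = hvac.Client(url=&#39;https://vault.example.com&#39;, token=&#39;VAULT_TOKEN&#39;)
secret = vault.secrets.kv.v2.read_secret_version(path=&#39;ceph/backup-service&#39;)
keys = secret[&#39;data&#39;][&#39;data&#39;]   # e.g. {&#39;access_key&#39;: &#39;...&#39;, &#39;secret_key&#39;: &#39;...&#39;}

# 2. Use those keys only to call AssumeRole; after a rotation, the next fetch returns the new pair
sts = boto3.client(
    &#39;sts&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,   # placeholder RGW/STS endpoint
    region_name=&#39;default&#39;,
    aws_access_key_id=keys[&#39;access_key&#39;],
    aws_secret_access_key=keys[&#39;secret_key&#39;],
)
creds = sts.assume_role(
    RoleArn=&#39;arn:aws:iam::123456:role/backup-reader&#39;,   # placeholder role ARN
    RoleSessionName=&#39;backup-job-vault&#39;,
)[&#39;Credentials&#39;]
&lt;/code&gt;&lt;/pre&gt;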
&lt;h2 id=&quot;migration-strategy%3A-from-static-to-temporary&quot;&gt;Migration Strategy: From Static to Temporary &lt;a class=&quot;link-anchor&quot; href=&quot;#migration-strategy%3A-from-static-to-temporary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You can&#39;t flip a switch and convert all applications overnight. The transition
requires methodical planning, careful testing, and phased rollout. Organizations
that rush this process end up with broken applications, emergency rollbacks, and
frustrated teams. The ones that succeed treat it as a deliberate migration project
with clear phases and success criteria.&lt;/p&gt;
&lt;p&gt;The challenge isn&#39;t technical; the STS implementation is straightforward once
you understand roles. The challenge is organizational: identifying where
static credentials exist, understanding what each application actually needs,
and coordinating updates across teams that may not even realize they&#39;re using
S3. This is why the first phase isn&#39;t about changing anything; it&#39;s about
understanding what you have.&lt;/p&gt;
&lt;h2 id=&quot;coming-in-the-next-post&quot;&gt;Coming in the Next Post &lt;a class=&quot;link-anchor&quot; href=&quot;#coming-in-the-next-post&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You now have STS working in your Ceph environment. Your applications use temporary
credentials that expire automatically, dramatically reducing the blast radius of
credential theft. The permanent credentials your applications hold can&#39;t access
S3 directly; they can only assume specific roles with limited permissions. Each
role follows least privilege. Every access is logged with full attribution.&lt;/p&gt;
&lt;p&gt;We kept the IAM explanation minimal in this post, just enough to implement STS.
In the next post, we&#39;ll dive into IAM architecture and access control patterns.
We&#39;ll cover the new IAM Accounts model introduced in Ceph Squid, how it creates
proper multi-tenancy, and why the distinction between root account and IAM users
matters for security. We&#39;ll explore advanced least privilege patterns, trust policy
design for cross-account access, and how to test policies before deployment. We&#39;ll
also examine organizational mandates, such as blocking ACLs entirely and using the
new S3Control API for account-level governance.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>RocksDB Compression in Ceph: Space Savings with No Performance Cost</title>
    <link href="https://ceph.io/en/news/blog/2025/rocksdb-compression-ftw/" />
    <updated>2025-12-17T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rocksdb-compression-ftw/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rados" />
      <category term="rocksdb" />
      <category term="osd" />
      <category term="mon" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rocksdb-compression-ftw/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the world of data storage, engineers and architects constantly face a
fundamental dilemma: the trade-off between performance and efficiency.
It’s a balancing act. When you want to save space, you typically enable
features like compression, but the common assumption is that this will
cost you performance, a CPU cycle tax that slows throughput.&lt;/p&gt;
&lt;p&gt;But what if you could significantly reduce your metadata storage footprint
without slowing things down?&lt;/p&gt;
&lt;p&gt;The search for an answer to this question started with research work
from Mark Nelson, who published
a &lt;a href=&quot;https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive&quot;&gt;blog post&lt;/a&gt; on &lt;a href=&quot;http://ceph.io&quot;&gt;ceph.io&lt;/a&gt;
that covers RocksDB tuning in depth, exploring RocksDB compression with
positive results. These promising results sparked a conversation on the
upstream GitHub about enabling compression by default; a link to the PR
is available &lt;a href=&quot;https://github.com/ceph/ceph/pull/53343&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To build on the previous investigation, the Ceph performance team ran tests on a
robust hardware configuration running IBM Storage Ceph 7.1 (Reef). The cluster
used the BlueStore OSDs for an erasure-coded (EC 4+2) pool, with a hybrid OSD
storage setup: HDDs for object data and fast NVMe drives for the BlueStore WAL+DB.&lt;/p&gt;
&lt;p&gt;To understand the test, it&#39;s helpful to know what the WAL+DB is. In modern Ceph,
the BlueStore storage engine manages all data on the OSDs (physical devices).
To do this, it must maintain a vast catalog of internal metadata: think of it
as a high-speed index that quickly locates every piece of data.&lt;/p&gt;
&lt;p&gt;RocksDB, a high-performance key-value database, manages this critical index. In
our hybrid cluster, the RocksDB database runs on the fast NVMe devices, while
the actual object data resides on the slower HDDs.&lt;/p&gt;
&lt;p&gt;Because this metadata can grow very large, RocksDB&#39;s efficiency, how much space
it consumes on those expensive NVMe drives, is a critical factor in the cluster&#39;s
overall cost and performance. Our test, therefore, focuses on a simple,
high-stakes question:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Can we compress this metadata to save space &lt;em&gt;without&lt;/em&gt; paying a performance penalty?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;executive-overview&quot;&gt;Executive Overview &lt;a class=&quot;link-anchor&quot; href=&quot;#executive-overview&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The results were not just positive; they were counterintuitive, revealing a powerful
opportunity for optimization that comes with virtually no downside.&lt;/p&gt;
&lt;p&gt;The results confirm that using RocksDB compression has no detrimental effect on
either throughput or resource consumption in Ceph, while providing significant
savings in DB space (compression ratio), especially for smaller objects. As a
result of the tests, RocksDB compression is now enabled by default beginning with
the Squid release.&lt;/p&gt;
&lt;h2 id=&quot;test-environment-and-details&quot;&gt;Test Environment and Details &lt;a class=&quot;link-anchor&quot; href=&quot;#test-environment-and-details&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;All tests were run against the Ceph Gateway (RGW) to simulate a typical
Object Storage workload.&lt;/p&gt;
&lt;p&gt;Two different sets of object sizes were used in testing. Each workload leveraged
five clients and a range of fixed sizes (one object size per bucket, repeated
across the total bucket count), as listed below.&lt;/p&gt;
&lt;h3 id=&quot;testing-configuration&quot;&gt;Testing Configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#testing-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Smaller&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 KiB, 4 KiB, 8 KiB, 64 KiB, 256 KiB&lt;/li&gt;
&lt;li&gt;100K objects, 300 buckets, five clients (150M total objects)&lt;/li&gt;
&lt;li&gt;Fill Workload (~8%) - 3hr&lt;/li&gt;
&lt;li&gt;Hybrid workload (45% reads, 35% writes, 15% stats, 5% deletes)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Larger&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 MiB, 4 MiB, 8 MiB, 64 MiB, 256 MiB&lt;/li&gt;
&lt;li&gt;300 objects, 100 buckets, five clients (150K total objects)&lt;/li&gt;
&lt;li&gt;Fill Workload (~7%) - 40m&lt;/li&gt;
&lt;li&gt;Hybrid workload (45% reads, 35% writes, 15% stats, 5% deletes)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;hardware-used&quot;&gt;Hardware Used &lt;a class=&quot;link-anchor&quot; href=&quot;#hardware-used&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Two identical clusters, each with&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;3x Monitor / Manager nodes&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dell R630&lt;/li&gt;
&lt;li&gt;2x E5-2683 v3 (28 total cores, 56 threads)&lt;/li&gt;
&lt;li&gt;128 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;8x OSD / RGW nodes&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Supermicro 6048R&lt;/li&gt;
&lt;li&gt;2x Intel E5-2660 v4 (28 total cores, 56 threads)&lt;/li&gt;
&lt;li&gt;256 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;192x OSDs (BlueStore): 24x 2TB HDDs and 2x 800GB NVMe SSDs for WAL/DB per node&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pool: &lt;code&gt;site{1,2}.rgw.buckets.data&lt;/code&gt; EC 4+2, &lt;code&gt;pg_num=4096&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-%22free-lunch%22-is-real%3A-significant-space-savings-at-zero-performance-cost&quot;&gt;The &amp;quot;Free Lunch&amp;quot; is Real: Significant Space Savings at Zero Performance Cost &lt;a class=&quot;link-anchor&quot; href=&quot;#the-%22free-lunch%22-is-real%3A-significant-space-savings-at-zero-performance-cost&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The primary and most surprising finding from our tests is that enabling RocksDB
compression had no negative impact on performance. The specific algorithm used
was LZ4, a lightweight solution known for its high speed. Our analysis suggests
that modern CPUs are so efficient at processing algorithms like LZ4 that the
overhead is negligible, particularly when compression operations occur on the
high-speed NVMe devices where the RocksDB database resides.&lt;/p&gt;
&lt;p&gt;Across a variety of hybrid workloads (45% reads, 35% writes, 15% stats,
and 5% deletes), we observed no detrimental effect on throughput or CPU resource
consumption compared to running the same workloads without compression. This
effectively eliminates the traditional trade-off.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 1. CPU Consumption for Small Objects&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/7a80e345-6972-4b18-9c05-860c5a64ab05.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 2. CPU Consumption for Large Objects&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/d26761a5-1bd4-4115-8ac8-5c5976046026.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 3. Throughput for Small Object Writes&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/a588ee07-7f0c-4903-a8ab-3b11f0f543d8.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 4. Throughput for Small Object Reads&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/3c36de86-7470-4204-a50e-f644e166567d.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The results confirm that using RocksDB compression has no detrimental
effect on either throughput or resource consumption in Ceph, while
providing significant savings in DB space (compression ratio),
especially for smaller objects. This allows a smaller WAL+DB offload
partition for each OSD, or conversely helps avoid spillover of
RocksDB level data onto the BlueStore &lt;em&gt;slow&lt;/em&gt; device.&lt;/p&gt;
&lt;h2 id=&quot;small-objects%2C-massive-gains%3A-a-game-changer-for-object-storage-workloads.&quot;&gt;Small Objects, Massive Gains: A Game-Changer for Object Storage Workloads. &lt;a class=&quot;link-anchor&quot; href=&quot;#small-objects%2C-massive-gains%3A-a-game-changer-for-object-storage-workloads.&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While compression proved beneficial across the board, its impact was most
dramatic on small-object workloads. Our tests, which used object sizes
ranging from 1 KiB to 256 KiB, showed a remarkable reduction in the
storage required for RocksDB metadata. In a BlueStore configuration,
Ceph&#39;s internal metadata is managed by a RocksDB database running on
top of the BlueFS file system on a fast storage device, in
our case, an NVMe SSD.&lt;/p&gt;
&lt;p&gt;The single most impactful data point we recorded was this: with compression
enabled, &lt;code&gt;bluefs db_used_bytes&lt;/code&gt; for the small-object workload was 2.68
times lower during the cluster fill. This is a massive efficiency gain.
For any organization whose workload involves storing millions or even billions
of tiny objects, the metadata overhead can become a significant storage burden.
This feature directly and powerfully addresses that specific pain point by
compressing the metadata database on the fast offload device, not object data
on HDDs.&lt;/p&gt;
&lt;p&gt;This is particularly critical for Object Storage (RGW) workloads. When using
the Ceph Object Gateway (RGW), all rich metadata associated with an object,
such as its name, size, ACLs, and custom user tags, is stored in RocksDB instances
spread across the OSDs that comprise the index pool.
Furthermore, the bucket index, which lists all objects within a bucket, is
maintained as omap entries in this same database.&lt;/p&gt;
&lt;p&gt;For clusters with millions or billions of small objects, this metadata and
index data can swell to consume terabytes of space, often becoming the primary
capacity bottleneck on the expensive, high-speed NVMe drives. Compressing
RocksDB directly compresses this RGW metadata, providing massive and immediate
relief on that fast tier.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/1953bdb1-9871-44db-90b6-d1ce81fde788.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;it&#39;s-not-just-for-small-objects%3A-large-objects-also-see-a-clear-benefit&quot;&gt;It&#39;s Not Just for Small Objects: Large Objects Also See a Clear Benefit &lt;a class=&quot;link-anchor&quot; href=&quot;#it&#39;s-not-just-for-small-objects%3A-large-objects-also-see-a-clear-benefit&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The positive effects were not limited to small objects. Our tests on large-object
workloads, ranging from 1 MiB to 256 MiB, also showed clear benefits. While the
source report highlights the most dramatic space savings for small objects, it
explicitly notes that the positive effect across both sets of object sizes is
evident, making compression a clear win for large-object workloads as well.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/94f71417-270f-4009-8e87-37c762dcf998.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Furthermore, our test plan included stressful OSD failure scenarios to measure
behavior under duress. The overall conclusion of &amp;quot;no detrimental effect&amp;quot; on
performance or resource consumption held even during these fault and recovery
operations. This implies that RocksDB compression is not just efficient but
also a stable and robust feature under pressure.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 5. Throughput for Small Object Reads During Failure&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/23fe5bea-9c23-440a-9e84-b9bddd035c9e.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 6. Throughput for Small Object Writes During Failure&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/34f1ad7f-cb59-4e47-8be8-e1e0659e9d5b.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-a-feature-that-should-be-enabled-by-default&quot;&gt;Conclusion: A Feature That Should Be Enabled By Default &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-a-feature-that-should-be-enabled-by-default&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Based on this comprehensive testing, RocksDB compression in a Ceph environment
is a low-risk, high-reward feature. It breaks the old rule that says efficiency
must come at the expense of performance. The evidence points to a clear win:
substantial storage savings on the metadata layer, with no measurable
trade-off in throughput or CPU usage.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/1de07550-5ebe-49cd-9e74-c58aaed0dea4.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This led to a simple conclusion: given the potential for substantial space
savings with no performance downside, the decision was to enable RocksDB LZ4
compression by default in the Squid release.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# ceph version
ceph version 19.2.1-222.el9cp (f2cd71cc2f7b46709c2351134ac89ea3e9f609b6) squid (stable)

# ceph config get osd bluestore_rocksdb_options
compression=kLZ4Compression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0
&lt;/code&gt;&lt;/pre&gt;
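&lt;p&gt;If you are running an older release where LZ4 is not yet the default and want to
experiment, a minimal sketch of one possible approach (assuming
the &lt;code&gt;bluestore_rocksdb_options_annex&lt;/code&gt; option, which appends to the default option
string rather than replacing it) is shown below. Note that the option is read at OSD
startup, and existing SST files are only compressed as compaction rewrites them; verify
the procedure against the documentation for your release before applying it broadly.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Sketch only: append LZ4 compression to the default RocksDB options for all OSDs
$ ceph config set osd bluestore_rocksdb_options_annex &amp;quot;compression=kLZ4Compression&amp;quot;

# Restart OSDs with your usual tooling so the new options take effect, then optionally
# trigger a compaction so existing metadata is rewritten (and compressed) sooner
$ ceph tell osd.&#92;* compact
&lt;/code&gt;&lt;/pre&gt;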
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph RGW Rate Limiting</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-rate-limiting/" />
    <updated>2025-12-16T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-rate-limiting/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-rate-limiting/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Tentacle release introduces significant enhancements to Object Gateway (RGW)
rate limiting, addressing a critical gap that has long challenged administrators
managing multi-tenant object storage environments. With the addition of rate
limiting for &lt;code&gt;LIST&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; operations, along with improved STS integration,
administrators now have more granular control over resource consumption across
their storage infrastructure.&lt;/p&gt;
&lt;h2 id=&quot;understanding-rate-limiting-in-the-ceph-object-gateway-(rgw)&quot;&gt;Understanding Rate Limiting in the Ceph Object Gateway (RGW) &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-rate-limiting-in-the-ceph-object-gateway-(rgw)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Rate limiting in the Ceph Object Gateway has been a powerful tool for controlling
resource consumption and preventing individual users or applications from
monopolizing cluster resources. Before this enhancement, RGW supported rate
limiting for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read operations (&lt;code&gt;max-read-ops&lt;/code&gt;): Controlling GET request frequency&lt;/li&gt;
&lt;li&gt;Write operations (&lt;code&gt;max-write-ops&lt;/code&gt;): Limiting PUT request rates&lt;/li&gt;
&lt;li&gt;Read bandwidth (&lt;code&gt;max-read-bytes&lt;/code&gt;): Throttling data egress&lt;/li&gt;
&lt;li&gt;Write bandwidth (&lt;code&gt;max-write-bytes&lt;/code&gt;): Controlling data ingress&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These limits operate within configurable time windows (controlled by
the &lt;code&gt;rgw_ratelimit_interval&lt;/code&gt; option), traditionally defaulting to 60
seconds. The system uses a token bucket algorithm to track resource
consumption, and when limits are exceeded, RGW returns &lt;code&gt;HTTP 503&lt;/code&gt;
responses to throttle clients.&lt;/p&gt;
&lt;h3 id=&quot;the-scope-of-rate-limiting&quot;&gt;The Scope of Rate Limiting &lt;a class=&quot;link-anchor&quot; href=&quot;#the-scope-of-rate-limiting&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Rate limits can be applied at multiple scopes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User scope: Limits apply to a specific user across all buckets&lt;/li&gt;
&lt;li&gt;Bucket scope: Limits apply to operations in a particular bucket&lt;/li&gt;
&lt;li&gt;Global scope: Limits apply cluster-wide across all users and buckets&lt;/li&gt;
&lt;li&gt;Anonymous scope: Limits for unauthenticated requests&lt;/li&gt;
&lt;/ul&gt;
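&lt;p&gt;Per-user and per-bucket limits are shown in the examples later in this post; for the
global and anonymous scopes, the existing &lt;code&gt;radosgw-admin global ratelimit&lt;/code&gt; commands
apply. A brief sketch (the values are placeholders; check &lt;code&gt;radosgw-admin help&lt;/code&gt; on
your release for the exact flags):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Show the current global rate limit configuration
$ radosgw-admin global ratelimit get

# Set and enable a cluster-wide default limit for all buckets
$ radosgw-admin global ratelimit set --ratelimit-scope=bucket --max-read-ops=1024
$ radosgw-admin global ratelimit enable --ratelimit-scope=bucket

# An anonymous scope is also supported for unauthenticated requests
$ radosgw-admin global ratelimit set --ratelimit-scope=anonymous --max-read-ops=64
$ radosgw-admin global ratelimit enable --ratelimit-scope=anonymous
&lt;/code&gt;&lt;/pre&gt;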
&lt;h3 id=&quot;important-architectural-considerations&quot;&gt;Important Architectural Considerations &lt;a class=&quot;link-anchor&quot; href=&quot;#important-architectural-considerations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Ceph Object Gateway&#39;s rate-limiting feature is not a complete QoS system. Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Per-RGW enforcement: Limits are enforced per RGW instance, not cluster-wide.
With 2 RGWs and a desired 10 ops/minute limit, configure each RGW for 5 ops/minute.
If the client request load isn&#39;t evenly distributed across the endpoints, the effective
aggregate limit may be lower than expected.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Limit intersection: Both user-level AND bucket-level limits must be satisfied.
Requests are rejected if either limit is exceeded.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No traffic shaping: Throttled requests are immediately rejected (503) rather than queued.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No mid-request throttling: Bandwidth is counted after a request completes, not
during. Users who exceed limits go into &amp;quot;debt&amp;quot; (max: 2x the limit) and are
blocked from new requests until the next interval(s) repay the debt.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-problem%3A-missing-control-for-list-and-delete-operations&quot;&gt;The Problem: Missing Control for List and Delete Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#the-problem%3A-missing-control-for-list-and-delete-operations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While read and write operation limits provided good coverage for data transfer
operations, two critical operation types remained uncontrolled:&lt;/p&gt;
&lt;h3 id=&quot;list-operations&quot;&gt;List Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#list-operations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Bucket listing operations, particularly against buckets with millions of objects,
can place a significant load on the cluster. These operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scan bucket indexes extensively&lt;/li&gt;
&lt;li&gt;Consume RADOS read IOPS on index pools&lt;/li&gt;
&lt;li&gt;Can impact overall cluster performance when executed at high frequency&lt;/li&gt;
&lt;li&gt;Are costly when using prefixes and delimiters that require filtering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Previous limitation: &lt;code&gt;LIST&lt;/code&gt; operations (which use &lt;code&gt;GET&lt;/code&gt;/&lt;code&gt;HEAD&lt;/code&gt; HTTP methods) were counted
as read operations under the &lt;code&gt;max-read-ops&lt;/code&gt; limit, making it impossible to control
listing separately from regular &lt;code&gt;GET&lt;/code&gt; operations. This meant administrators couldn&#39;t
prevent list-heavy workloads from consuming the entire read operation budget while
still allowing standard data retrieval.&lt;/p&gt;
&lt;p&gt;Consider a workload performing checkpoint validation by repeatedly listing with prefixes like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ aws s3api list-objects-v2 --bucket data --prefix checkpoint-flag --max-items 1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Even though this returns minimal data, each request triggers index scanning
operations that consume cluster resources.&lt;/p&gt;
&lt;p&gt;As an example, Apache Iceberg tables in data lakehouse environments have been
particularly challenging: Iceberg&#39;s &lt;code&gt;deleteOrphanFiles&lt;/code&gt; maintenance procedure,
which cleans up unreferenced data files, requires complete table listings that can
overwhelm object storage systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/f06a38ed-1eb5-4457-a13e-45a0eba48684.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;delete-operations&quot;&gt;Delete Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#delete-operations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Single-object and multi-object delete operations were also uncontrolled, creating challenges for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Preventing abuse during bulk deletion scenarios&lt;/li&gt;
&lt;li&gt;Managing garbage collection workload&lt;/li&gt;
&lt;li&gt;Controlling the rate at which storage capacity is reclaimed&lt;/li&gt;
&lt;li&gt;Protecting against accidental or malicious mass deletion events&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Previous limitation: DELETE operations were classified as write
operations (non-GET/HEAD HTTP methods) and counted against &lt;code&gt;max-write-ops&lt;/code&gt;,
making it impossible to limit deletion rates from PUT operations separately.
Workloads that combined uploads and deletions had to balance their write-ops
budget across both operation types.&lt;/p&gt;
&lt;p&gt;Without dedicated controls for these operations, administrators had limited
options for managing workloads that mixed listing, reading, writing, and
deleting operations in different proportions.&lt;/p&gt;
&lt;h2 id=&quot;the-solution%3A-enhanced-rate-limiting-in-tentacle&quot;&gt;The Solution: Enhanced Rate Limiting in Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-enhanced-rate-limiting-in-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;images/a0ea6524-e006-496b-b2bf-855834372d56.jpeg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Tentacle introduces two new rate-limiting parameters that address these gaps.&lt;/p&gt;
&lt;h3 id=&quot;new-configuration-options&quot;&gt;New Configuration Options &lt;a class=&quot;link-anchor&quot; href=&quot;#new-configuration-options&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Max-list-ops: Specifies the maximum number of bucket listing requests per
accumulation interval. A value of 0 (default) disables this limit, maintaining
backward compatibility.&lt;/li&gt;
&lt;li&gt;Max-delete-ops: Specifies the maximum number of delete operations per accumulation
interval. A value of 0 (default) disables this limit.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;critical%3A-backward-compatibility-behavior&quot;&gt;Critical: Backward Compatibility Behavior &lt;a class=&quot;link-anchor&quot; href=&quot;#critical%3A-backward-compatibility-behavior&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Important: The new limits work &lt;em&gt;in conjunction with&lt;/em&gt; existing read/write operation limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;LIST&lt;/code&gt; operations: Count against both &lt;code&gt;max-read-ops&lt;/code&gt; AND &lt;code&gt;max-list-ops&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DELETE&lt;/code&gt; operations: Count against both &lt;code&gt;max-write-ops&lt;/code&gt; AND &lt;code&gt;max-delete-ops&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both limits must be satisfied for a request to proceed. Administrators upgrading
from earlier versions will see no behavior change unless they explicitly configure
the new parameters.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/36e1a637-6316-40de-90d6-df76a8fcb97f.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;configurable-time-windows&quot;&gt;Configurable Time Windows &lt;a class=&quot;link-anchor&quot; href=&quot;#configurable-time-windows&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;rgw_ratelimit_interval&lt;/code&gt; configuration option allows administrators to adjust
the interval for rate limit accumulation. This is particularly important for
workloads that exhibit bursty behavior:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config set client.rgw.rgw.1 rgw_ratelimit_interval 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The default 60-second interval may not be optimal for all workloads. Bursty
workloads, such as Apache Iceberg&#39;s metadata maintenance operations (snapshot
expiration, orphan file cleanup), can exhaust their LIST operation budget in
the first few seconds of a time window. Since Iceberg&#39;s &lt;code&gt;deleteOrphanFiles&lt;/code&gt;
procedure performs complete table listings across potentially thousands of
partitions in rapid succession, the accumulated operations can quickly exceed
the rate limit, resulting in extended throttling periods during which subsequent
maintenance tasks are blocked. Shorter intervals (1-10 seconds) can provide more
consistent behavior by allowing the operation budget to replenish more frequently,
preventing long stalls in critical table maintenance workflows.&lt;/p&gt;
&lt;h3 id=&quot;sts-integration&quot;&gt;STS Integration &lt;a class=&quot;link-anchor&quot; href=&quot;#sts-integration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A new enhancement to the STS/IAM feature ensures that rate limits now apply
correctly when users authenticate with temporary credentials obtained via the
Security Token Service (STS):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User rate limits configured on an account continue to be enforced when
that user assumes an IAM role and operates with temporary credentials.&lt;/li&gt;
&lt;li&gt;Bucket rate limits are enforced adequately for operations performed using
STS credentials, regardless of how the user authenticated.&lt;/li&gt;
&lt;li&gt;Global rate limits now work seamlessly with federated authentication flows,
such as AssumeRoleWithWebIdentity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This closes a previous gap where rate limiting enforcement may not have worked
correctly with STS sessions, ensuring consistent rate limit policies across all
authentication methods.&lt;/p&gt;
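&lt;p&gt;As a rough illustration of what that means in practice, here is a hedged sketch of an
STS flow against RGW; the role ARN, endpoint, and session name are placeholders. Requests
made with the temporary credentials are expected to count against the same user and bucket
limits as requests made with the user&#39;s permanent keys:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Obtain temporary credentials by assuming a role (placeholder ARN)
$ aws --endpoint-url http://rgw.example.com sts assume-role &#92;
    --role-arn &amp;quot;arn:aws:iam:::role/S3Access&amp;quot; &#92;
    --role-session-name ratelimit-test

# Export the returned AccessKeyId, SecretAccessKey and SessionToken, then issue requests;
# the configured user/bucket rate limits still apply to these calls
$ aws --endpoint-url http://rgw.example.com s3api list-objects-v2 --bucket test-bucket
&lt;/code&gt;&lt;/pre&gt;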
&lt;h2 id=&quot;rate-limiting-configuration-examples&quot;&gt;Rate Limiting Configuration Examples &lt;a class=&quot;link-anchor&quot; href=&quot;#rate-limiting-configuration-examples&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;example-1%3A-configuring-list-operation-rate-limits&quot;&gt;Example 1: Configuring LIST Operation Rate Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#example-1%3A-configuring-list-operation-rate-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Set up a user with list operation limits to control the frequency of bucket listings.&lt;/p&gt;
&lt;p&gt;Create a test user:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin user create --uid=testuser --display-name=&amp;quot;Test User&amp;quot;
{
    &amp;quot;user_id&amp;quot;: &amp;quot;testuser&amp;quot;,
    &amp;quot;display_name&amp;quot;: &amp;quot;Test User&amp;quot;,
    &amp;quot;email&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;suspended&amp;quot;: 0,
    &amp;quot;max_buckets&amp;quot;: 1000,
    &amp;quot;subusers&amp;quot;: [],
    &amp;quot;keys&amp;quot;: [
        {
            &amp;quot;user&amp;quot;: &amp;quot;testuser&amp;quot;,
            &amp;quot;access_key&amp;quot;: &amp;quot;TESTUSER_ACCESS_KEY&amp;quot;,
            &amp;quot;secret_key&amp;quot;: &amp;quot;TESTUSER_SECRET_KEY&amp;quot;
        }
    ],
    &amp;quot;caps&amp;quot;: [],
    &amp;quot;op_mask&amp;quot;: &amp;quot;read, write, delete&amp;quot;,
    &amp;quot;type&amp;quot;: &amp;quot;rgw&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set rate limits for list operations. We have two RGW services deployed in our
cluster, so if we want an effective limit of 10 list operations per interval, we need
to divide that limit by the number of RGWs and configure 5 per RGW:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit set --ratelimit-scope=user --uid=testuser &#92;
    --max-list-ops=5 &#92;
    --max-read-ops=100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Enable rate limiting for this user:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit enable --ratelimit-scope=user --uid=testuser
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit get --ratelimit-scope=user --uid=testuser
{
    &amp;quot;user_ratelimit&amp;quot;: {
        &amp;quot;max_read_ops&amp;quot;: 100,
        &amp;quot;max_write_ops&amp;quot;: 0,
        &amp;quot;max_list_ops&amp;quot;: 5,
        &amp;quot;max_delete_ops&amp;quot;: 0,
        &amp;quot;max_read_bytes&amp;quot;: 0,
        &amp;quot;max_write_bytes&amp;quot;: 0,
        &amp;quot;enabled&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;example-2%3A-configuring-delete-operation-rate-limits&quot;&gt;Example 2: Configuring DELETE Operation Rate Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#example-2%3A-configuring-delete-operation-rate-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Set up delete operation limits to control the rate of deletions.
Set rate limits for delete operations on the same user:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit set --ratelimit-scope=user --uid=testuser &#92;
    --max-delete-ops=10 &#92;
    --max-write-ops=100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the updated configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit get --ratelimit-scope=user --uid=testuser
{
    &amp;quot;user_ratelimit&amp;quot;: {
        &amp;quot;max_read_ops&amp;quot;: 100,
        &amp;quot;max_write_ops&amp;quot;: 100,
        &amp;quot;max_list_ops&amp;quot;: 5,
        &amp;quot;max_delete_ops&amp;quot;: 10,
        &amp;quot;max_read_bytes&amp;quot;: 0,
        &amp;quot;max_write_bytes&amp;quot;: 0,
        &amp;quot;enabled&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;observing-rate-limiting-in-action&quot;&gt;Observing Rate Limiting in Action &lt;a class=&quot;link-anchor&quot; href=&quot;#observing-rate-limiting-in-action&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Let&#39;s see what happens when a user exceeds their configured limits.&lt;/p&gt;
&lt;h3 id=&quot;test-scenario%3A-exceeding-list-operation-limits&quot;&gt;Test Scenario: Exceeding List Operation Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#test-scenario%3A-exceeding-list-operation-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With the configuration from Example 1 (5 list ops per RGW, 10 list
ops total per minute), configure AWS CLI with test credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws configure set aws_access_key_id TESTUSER_ACCESS_KEY
$ aws configure set aws_secret_access_key TESTUSER_SECRET_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a test bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint-url http://rgw.example.com s3 mb s3://test-bucket
make_bucket: test-bucket
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Populate the bucket with test objects:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ for i in {1..100}; do
    echo &amp;quot;Test object $i&amp;quot; | aws --endpoint-url http://rgw.example.com s3 cp - s3://test-bucket/object-$i
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rapidly execute list operations to exceed the limit. I will use a script that
uses &lt;code&gt;curl&lt;/code&gt; to list the contents of the bucket &lt;code&gt;test-bucket&lt;/code&gt; repeatedly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;bash script.sh
Testing Rate Limit with list-objects-v2...
------------------------------------------------
Attempt 1: ✅ SUCCESS (200)
Attempt 2: ✅ SUCCESS (200)
Attempt 3: ✅ SUCCESS (200)
...
Attempt 10: ✅ SUCCESS (200)
Attempt 11: 🛑 BLOCKED (503) - Limit Reached
Attempt 12: 🛑 BLOCKED (503) - Limit Reached
Attempt 13: 🛑 BLOCKED (503) - Limit Reached
&lt;/code&gt;&lt;/pre&gt;
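&lt;p&gt;The exact script isn&#39;t reproduced here, but a minimal sketch of an equivalent test
loop using the AWS CLI (rather than signed &lt;code&gt;curl&lt;/code&gt; requests) might look like the
following; the endpoint and bucket are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;#!/usr/bin/env bash
# Issue repeated ListObjectsV2 calls and report whether each attempt was throttled.
export AWS_MAX_ATTEMPTS=1   # avoid the CLI&#39;s automatic retries masking the 503 responses
ENDPOINT=http://rgw.example.com
BUCKET=test-bucket
for i in $(seq 1 13); do
  if aws --endpoint-url &amp;quot;$ENDPOINT&amp;quot; s3api list-objects-v2 &#92;
        --bucket &amp;quot;$BUCKET&amp;quot; --max-items 1 &amp;gt; /dev/null 2&amp;gt;&amp;amp;1; then
    echo &amp;quot;Attempt $i: SUCCESS (200)&amp;quot;
  else
    echo &amp;quot;Attempt $i: BLOCKED (likely 503 - limit reached)&amp;quot;
  fi
done
&lt;/code&gt;&lt;/pre&gt;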
&lt;h3 id=&quot;testing-delete-rate-limits&quot;&gt;Testing Delete Rate Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#testing-delete-rate-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Attempt to delete objects beyond the configured limit (10 per RGW, 20 deletes per minute in total) with the AWS CLI client:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ for i in {1..25}; do
    echo &amp;quot;Delete attempt $i&amp;quot;
    aws --endpoint-url http://rgw.example.com s3 rm s3://test-bucket/object-$i 2&amp;gt;&amp;amp;1 | grep -E &amp;quot;delete:|error&amp;quot;
done
Delete attempt 1
delete: s3://test-bucket/object-1
Delete attempt 2
delete: s3://test-bucket/object-2
Delete attempt 3
delete: s3://test-bucket/object-3
...
Delete attempt 19
delete: s3://test-bucket/object-19
Delete attempt 20
delete: s3://test-bucket/object-20
Delete attempt 21
delete failed: s3://limits-bucket/object-21 argument of type &#39;NoneType&#39; is not iterable
Delete attempt 22
delete failed: s3://limits-bucket/object-22 argument of type &#39;NoneType&#39; is not iterable
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;known-limitations-and-future-enhancements&quot;&gt;Known Limitations and Future Enhancements &lt;a class=&quot;link-anchor&quot; href=&quot;#known-limitations-and-future-enhancements&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;current-limitations&quot;&gt;Current Limitations &lt;a class=&quot;link-anchor&quot; href=&quot;#current-limitations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Backward Compatibility Constraint&lt;/em&gt;: LIST operations still count against
max-read-ops, and DELETE operations count against max-write-ops. The
new &lt;code&gt;max-list-ops&lt;/code&gt; and &lt;code&gt;max-delete-ops&lt;/code&gt; limits provide additional
constraints but do not replace the legacy limits. Both limits must be
satisfied for a request to proceed. This design choice maintains backward
compatibility but means you cannot completely isolate LIST/DELETE operations
from general read/write operation budgets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multi-Object Delete&lt;/em&gt;: The S3 DeleteObjects API (bulk delete) is not
currently rate-limited but is tracked for future
enhancement: &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2393080&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;IAM Account Limitation&lt;/em&gt;: Rate limits on IAM accounts (as opposed to
users) do not currently work. This is tracked as an RFE for a future
release. &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2394369&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multipart Upload Accounting&lt;/em&gt;: During multipart uploads with limited
write ops, the &lt;code&gt;CreateMultipartUpload&lt;/code&gt;, &lt;code&gt;UploadPart&lt;/code&gt;,
and &lt;code&gt;CompleteMultipartUpload operations&lt;/code&gt; each count against the write-ops
limit. For large files split into many parts, this can quickly consume the
operation budget. &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2396664&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Improved Logging Output&lt;/em&gt;: Currently, when hitting a rate limit, we see only
the following opaque errors in the RGW log, which don’t specify which rate
limit we have reached. &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2396664&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;2025-11-20T16:39:40.030+0000 7f9e6423a640  2 req 15365199512736087891 0.001000024s s3:delete_obj check rate limiting
2025-11-20T16:39:40.030+0000 7f9e6423a640 20 req 15365199512736087891 0.001000024s op-&amp;gt;ERRORHANDLER: err_no=-2218 new_err_no=-2218
2025-11-20T16:39:40.030+0000 7f9e6423a640  2 req 15365199512736087891 0.001000024s s3:delete_obj http status=503
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The addition of &lt;code&gt;LIST&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; operation rate limiting in Tentacle represents
a significant maturity improvement for the Object Gateway. Combined with the new
STS integration and configurable time intervals, administrators now have
comprehensive tools for managing multi-tenant object storage workloads.&lt;/p&gt;
&lt;p&gt;These enhancements are particularly valuable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Enterprises&lt;/em&gt; implementing department-level chargebacks and resource governance&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Cloud-native applications&lt;/em&gt; using federated identity with OIDC&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Data analytics platforms&lt;/em&gt; with mixed read-heavy and metadata-intensive operations&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While some limitations remain (particularly around multi-object delete and IAM
accounts), the current implementation provides production-ready capabilities
that have been extensively tested with workloads ranging from small-object writes
to multi-million object listings.&lt;/p&gt;
&lt;h2 id=&quot;get-involved&quot;&gt;Get Involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We encourage you to test these new capabilities in your environment and share
your experiences with &lt;a href=&quot;https://docs.ceph.com/en/latest/start/get-involved&quot;&gt;the community&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
&lt;p&gt;Special thanks to the Ceph community and the IBM Storage Ceph QE team for their
extensive testing and validation of these features, covering functional, scale,
and regression scenarios with millions of objects and hundreds of gigabytes of
test data.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ending Support for some Erasure Code Plugins</title>
    <link href="https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/" />
    <updated>2025-12-16T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/</id>
    <author>
      <name>Jamie Pryde (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="erasure-encoding" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/">&lt;p&gt;A plan to end support for some erasure code plugins and techniques
in the Ceph V release.&lt;/p&gt;
&lt;h2 id=&quot;the-erasure-code-plugin-interface&quot;&gt;The Erasure Code Plugin Interface &lt;a class=&quot;link-anchor&quot; href=&quot;#the-erasure-code-plugin-interface&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph uses a plugin interface for erasure coded pools. These plugins are
external code libraries that are used to do the encoding and decoding of data.
Ceph passes chunks of data to the plugin. The plugin uses an encoding algorithm
to produce additional chunks called parity (or coding) chunks.&lt;/p&gt;
&lt;p&gt;When an erasure coded pool is created, an erasure code profile must be selected.
Among other things, the profile includes the plugin and the technique that will be
used for the pool. The technique defines the algorithm that the plugin will use for
encoding and decoding, and some plugins support multiple different techniques.
Because the parity chunks generated are different for each combination of plugin
and technique, there is no way to change the plugin and technique after the pool has been
created.&lt;/p&gt;
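&lt;p&gt;For reference, both choices are made when the profile and pool are created. A quick
sketch (the profile name, k/m values, and failure domain here are examples only):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a profile that pins the plugin and technique for any pool that uses it
$ ceph osd erasure-code-profile set myprofile &#92;
    plugin=isa technique=reed_sol_van k=4 m=2 crush-failure-domain=host

# Create an erasure coded pool using that profile; the plugin/technique cannot be changed later
$ ceph osd pool create ecpool erasure myprofile
&lt;/code&gt;&lt;/pre&gt;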
&lt;p&gt;Ceph currently supports five erasure code plugins, some of which support multiple
techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jerasure
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;li&gt;reed_sol_r6_op&lt;/li&gt;
&lt;li&gt;cauchy_orig&lt;/li&gt;
&lt;li&gt;cauchy_good&lt;/li&gt;
&lt;li&gt;liberation&lt;/li&gt;
&lt;li&gt;blaum_roth&lt;/li&gt;
&lt;li&gt;liber8tion&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ISA-L (Intel Intelligent Storage Acceleration Library)
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;li&gt;cauchy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;SHEC (Shingled Erasure Code)&lt;/li&gt;
&lt;li&gt;CLAY (Coupled Layer)&lt;/li&gt;
&lt;li&gt;LRC (Locally Repairable Erasure Code)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Why are there so many options?&lt;/p&gt;
&lt;p&gt;In the distant past, before CPUs supported SIMD instructions
(Single Instruction, Multiple Data) and could encode and decode lots of data in parallel,
Jerasure&#39;s XOR-optimized techniques (cauchy, liberation, liber8tion and blaum_roth) offered a
performance improvement when encoding and decoding data. Now, with SSE, AVX (and other) instructions,
the need for techniques that rely on XOR operations has been greatly reduced, and reed_sol_van is very close to,
or in some cases better than, the XOR-optimized techniques. See the comparison charts later
in this post for the data!&lt;/p&gt;
&lt;p&gt;SHEC and CLAY both focus on trying to improve the recovery efficiency (by optimizing network and disk usage
when decoding data) when an OSD or server fails and data must be rebuilt. Both of these plugins build
on top of Jerasure, with additional logic that aims to speed up recovery.&lt;/p&gt;
&lt;p&gt;LRC also builds on top of Jerasure and intends to improve recovery efficiency by using locally available
data (e.g. data in the same data centre or same rack) to minimise transfers between racks or sites.&lt;/p&gt;
&lt;h2 id=&quot;ending-support-for-plugins&quot;&gt;Ending Support For Plugins &lt;a class=&quot;link-anchor&quot; href=&quot;#ending-support-for-plugins&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the Ceph Tentacle release we introduced a new version of erasure coding
that has become known as Fast EC. Fast EC offers significant performance
improvements and some capacity savings when using erasure coding, particularly for block
and file workloads. There are even more improvements to Fast EC coming in future releases. See Lee Sanders&#39; blog post
&lt;a href=&quot;https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/&quot;&gt;https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/&lt;/a&gt; for more details about Fast EC.&lt;/p&gt;
&lt;p&gt;Fast EC changes the interface between Ceph and the erasure code plugins. In Tentacle, only ISA-L
(using reed_sol_van or cauchy) and Jerasure (using reed_sol_van) support Fast EC. The old EC code has been kept
as a separate code path in Ceph, and the other plugins (and other Jerasure techniques) continue to use old EC.&lt;/p&gt;
&lt;p&gt;Our proposal is that we should end support for the least used (and least useful) plugins
and techniques in the V release. Ceph clusters using these plugins and techniques will not be
able to upgrade to the V release unless data is first migrated to a pool that uses
a supported plugin and technique.&lt;/p&gt;
&lt;p&gt;Why not continue to support all of these plugins using the old EC code path?&lt;/p&gt;
&lt;p&gt;The Fast EC work exposed the amount of development effort required to continue to support
such a big list of plugins and techniques. Even though only the most important plugins and techniques
support Fast EC, code changes were still required to ensure that the other plugins continue working
correctly. We now have two separate erasure code paths to maintain. Along with extra development work, supporting
a big list of plugins also means lots more testing needs to be done to ensure nothing gets broken.&lt;/p&gt;
&lt;p&gt;We don&#39;t think this effort is justified given the small number of users using some plugins, and
the lack of benefits that these plugins and techniques provide according to performance benchmarks. Developer
focus would be better spent on improving other parts of Ceph.&lt;/p&gt;
&lt;p&gt;The proposed list of plugins and techniques that we will support in the V release are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jerasure
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ISA-L
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;li&gt;cauchy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;LRC (Although LRC doesn&#39;t currently support Fast EC and we wouldn&#39;t recommend using it yet, we think
we will be able to use LRC in future to improve support for erasure coded pools in stretched clusters.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;telemetry-data&quot;&gt;Telemetry Data &lt;a class=&quot;link-anchor&quot; href=&quot;#telemetry-data&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;So let&#39;s look at some data. How many people are actually using each plugin? Not every Ceph cluster has opted in
to upload usage data to Telemetry, but enough have to give us a good idea about the plugins and techniques that
people are using. Here is a recent snapshot of the clusters using erasure coded pools:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/ec_plugin_telemetry.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;performance-data&quot;&gt;Performance Data &lt;a class=&quot;link-anchor&quot; href=&quot;#performance-data&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;My talk at Cephalocon 2024 (&lt;a href=&quot;https://www.youtube.com/watch?v=aM8sJgDD-x4&quot;&gt;https://www.youtube.com/watch?v=aM8sJgDD-x4&lt;/a&gt;) discussed why we&#39;ve made ISA-L the
default plugin for new EC pools. The talk included performance data captured using Ceph&#39;s EC benchmarking
tool, and I&#39;ve included that here. These charts demonstrate how advancements in SIMD instructions have
brought the performance of the reed_sol_van technique to a point where reed_sol_van is almost as good as
or better than other techniques. Note that the ISA-L vandermonde and cauchy lines are overlapping in the encode
graph:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/encode_perf.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/decode_perf.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;As mentioned earlier, the goal of both SHEC and CLAY is to improve recovery efficiency when an OSD is down.
A recent blog post written by Jake Squelch uses the Ceph Benchmarking Tool (CBT) to compare performance of the
Jerasure and CLAY plugins. His results show that there is a trade-off when using CLAY.
Although CLAY can reduce network bandwidth usage during recovery by around 50%, there is a performance penalty
for client I/O during normal operation, and when an OSD is down and the cluster needs to recover data. Data is being
read in a very inefficient way, particularly when using the default stripe_unit value.
See &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&lt;/a&gt; for more detail.&lt;/p&gt;
&lt;h2 id=&quot;pool-migration&quot;&gt;Pool Migration &lt;a class=&quot;link-anchor&quot; href=&quot;#pool-migration&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;What if you&#39;re the owner of the single cluster using Jerasure&#39;s blaum_roth technique in the telemetry data?
As we end support for the above list of plugins and techniques, we will need a way to move
data from those pools into new pools that use supported plugins. In the Umbrella release
we plan to add such a pool migration feature. This new feature will provide a way to non-disruptively move data
from one pool to another. The migration will run as a background task, similar to backfill and recovery,
with no downtime where data is inaccessible. This will allow you to migrate all the objects from a pool that uses
an unsupported plugin to a new pool that uses a supported plugin and technique, and then upgrade the cluster to the
V release.
See &lt;a href=&quot;https://github.com/ceph/ceph/pull/65703&quot;&gt;https://github.com/ceph/ceph/pull/65703&lt;/a&gt; for the pool migration design document.&lt;/p&gt;
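&lt;p&gt;In the meantime, it is easy to check whether any of your pools would be affected.
A short sketch using existing commands (profile names will differ per cluster):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# List pools and note the erasure code profile each erasure coded pool uses
$ ceph osd pool ls detail | grep erasure

# List profiles, then inspect the plugin and technique configured in each one
$ ceph osd erasure-code-profile ls
$ ceph osd erasure-code-profile get default
&lt;/code&gt;&lt;/pre&gt;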
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As we&#39;ve developed Fast EC, it&#39;s become clear that continuing to support such a big list of plugins and techniques
is too much effort for the value that some of the plugins and techniques provide.&lt;/p&gt;
&lt;p&gt;In the Umbrella release we will deprecate the plugins and techniques not included in the supported list
mentioned above. In the V release we will end support for those plugins and techniques. You will not be
able to upgrade to the V release if your cluster has any pools that use those plugins and techniques.
You will be able to use the new pool migration feature in Umbrella to migrate data from a pool to a new pool
that uses one of the supported plugins and techniques.&lt;/p&gt;
&lt;p&gt;Reducing our list of supported plugins and techniques will allow us to focus our development efforts
and continue to improve Fast EC, without the risk of breaking lesser-used plugins.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Object Storage Deep Dive Series Part 3: Version and Object Lock</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-deep-dive-3/" />
    <updated>2025-12-11T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-deep-dive-3/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-deep-dive-3/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-1&quot;&gt;first&lt;/a&gt;
and &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;second&lt;/a&gt; parts of
this deep dive series, we dissected the core foundations of Ceph RGW: stateless
frontends, specialized RADOS pools, bucket index mechanics, and the head/tail
data layout. We explored how the Ceph Object Gateway (RGW) achieves massive
scalability through dynamic bucket sharding and how background processes, including
Garbage Collection and Lifecycle Management, automate data governance.&lt;/p&gt;
&lt;p&gt;We now turn to two critical features for enterprise data
protection: &lt;em&gt;S3 Object Versioning&lt;/em&gt; and &lt;em&gt;S3 Object Lock&lt;/em&gt;.
These features transform the Ceph Object Gateway (RGW) from a simple object store
into a robust data preservation platform capable of meeting regulatory compliance
requirements, protecting against accidental deletions, and supporting immutable
storage patterns.&lt;/p&gt;
&lt;p&gt;In this third deep dive, we will first explore the concepts behind versioning
and object lock from the S3 API perspective. Then, we&#39;ll peel back the layers
to reveal how RGW implements these features internally, focusing on a crucial
architectural component: the &lt;em&gt;Object Logical Head (OLH)&lt;/em&gt;. Understanding this
mechanism is key to understanding how RGW efficiently maintains version history
while preserving the performance characteristics we expect.&lt;/p&gt;
&lt;h2 id=&quot;s3-object-versioning%3A-concepts-and-rationale&quot;&gt;S3 Object Versioning: Concepts and Rationale &lt;a class=&quot;link-anchor&quot; href=&quot;#s3-object-versioning%3A-concepts-and-rationale&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Object versioning is a mechanism that allows users to preserve, retrieve, and
restore every version of every object stored in a bucket. When versioning is
enabled, each object modification (PUT) or deletion creates a new, immutable
record rather than overwriting or removing existing data.&lt;/p&gt;
&lt;h3 id=&quot;why-versioning-matters&quot;&gt;Why Versioning Matters &lt;a class=&quot;link-anchor&quot; href=&quot;#why-versioning-matters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Without versioning, object storage follows a &amp;quot;last write wins&amp;quot; model. Uploading
an object with the same key as an existing object silently replaces it. A DELETE
operation permanently removes the object. While simple, this model offers no
protection against:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Accidental overwrites&lt;/em&gt;: A user uploads a corrupted file over a critical dataset&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Accidental deletions&lt;/em&gt;: A script with a bug issues DELETE commands against production data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Malicious actions&lt;/em&gt;: A compromised credential is used to destroy data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Audit requirements&lt;/em&gt;: Regulations requiring historical record preservation&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Versioning addresses all of these concerns by maintaining a complete history of every object.
When combined with an RGW &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-multisite-replication_part7&quot;&gt;Archive Zone&lt;/a&gt;,
versioned objects enable all of the above while keeping production buckets lean and mean.&lt;/p&gt;
&lt;h3 id=&quot;versioning-states&quot;&gt;Versioning States &lt;a class=&quot;link-anchor&quot; href=&quot;#versioning-states&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Each bucket has one of three versioning states:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Unversioned&lt;/em&gt; (Default)&lt;/td&gt;
&lt;td&gt;Objects have a &lt;code&gt;null&lt;/code&gt; version ID. Overwrites replace data; deletes remove data permanently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Enabled&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Every PUT creates a new version with a unique version ID.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Suspended&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;New writes get &lt;code&gt;null&lt;/code&gt; version ID, but existing versions are preserved.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Once versioning is enabled on a bucket, it can never be fully disabled, only suspended.
This is a deliberate design choice to prevent accidental or malicious destruction of version history.&lt;/p&gt;
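&lt;p&gt;Checking and changing the state uses the standard S3 API, for example (bucket name is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Check the current versioning state of a bucket
$ aws s3api get-bucket-versioning --bucket my-bucket

# Suspend versioning: existing versions are preserved, new writes get a null version ID
$ aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Suspended
&lt;/code&gt;&lt;/pre&gt;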
&lt;h3 id=&quot;version-ids-and-the-current-version&quot;&gt;Version IDs and the Current Version &lt;a class=&quot;link-anchor&quot; href=&quot;#version-ids-and-the-current-version&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When versioning is enabled, every write to an object generates a unique,
system-assigned &lt;em&gt;Version ID&lt;/em&gt;. This ID is an opaque string that uniquely
identifies the object&#39;s version. When a client issues a GET request without
specifying a version ID, RGW returns the &lt;em&gt;current version&lt;/em&gt;: the most
recently written version of that object.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create bucket
$ aws s3api create-bucket --bucket my-bucket
# Enable versioning on the bucket
$ aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
# Upload creates a new version
$ aws s3api put-object --bucket my-bucket --key report.pdf --body report.pdf
{
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;ChecksumCRC64NVME&amp;quot;: &amp;quot;viq2x5cBzls=&amp;quot;,
    &amp;quot;ChecksumType&amp;quot;: &amp;quot;FULL_OBJECT&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;
}

# List all versions, currently there is only one
$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf | jq .Versions
[
  {
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;Size&amp;quot;: 1012,
    &amp;quot;StorageClass&amp;quot;: &amp;quot;STANDARD&amp;quot;,
    &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;,
    &amp;quot;IsLatest&amp;quot;: true,
    &amp;quot;LastModified&amp;quot;: &amp;quot;2025-12-05T11:37:52.802000+00:00&amp;quot;,
    &amp;quot;Owner&amp;quot;: {
      &amp;quot;DisplayName&amp;quot;: &amp;quot;user&amp;quot;,
      &amp;quot;ID&amp;quot;: &amp;quot;RGW42603947660038067&amp;quot;
    }
  }
]
# We do another PUT to the same Object/key
$ aws s3api put-object --bucket my-bucket --key report.pdf --body report.pdf
# We now have 2 versions of the same Object/Key
$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf | jq .Versions
[
  {
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;Size&amp;quot;: 1012,
    &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;QhSnbf7bYMGHMshc0S-fyF3.SPMjIju&amp;quot;,
    &amp;quot;IsLatest&amp;quot;: true,
    &amp;quot;LastModified&amp;quot;: &amp;quot;2025-12-05T11:39:56.974000+00:00&amp;quot;
  },
  {
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;Size&amp;quot;: 1012,
    &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;,
    &amp;quot;IsLatest&amp;quot;: false,
    &amp;quot;LastModified&amp;quot;: &amp;quot;2025-12-05T11:37:52.802000+00:00&amp;quot;
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;delete-markers%3A-soft-deletes&quot;&gt;Delete Markers: Soft Deletes &lt;a class=&quot;link-anchor&quot; href=&quot;#delete-markers%3A-soft-deletes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When you delete an object in a versioned bucket (without specifying a version ID),
RGW does not remove any data. Instead, it creates a special zero-byte object called
a &lt;em&gt;Delete Marker&lt;/em&gt;. This marker becomes the current version, causing subsequent GET
requests to return a &lt;code&gt;404 Not Found&lt;/code&gt; error, even though all previous versions remain intact.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Delete creates a marker, not actual deletion
$ aws s3api delete-object --bucket my-bucket --key report.pdf
{
    &amp;quot;DeleteMarker&amp;quot;: true,
    &amp;quot;VersionId&amp;quot;: &amp;quot;77d9Np158AOrYrDod98ev7EhONah2G.&amp;quot;
}

# GET now returns 404 because the DeleteMarker&#39;s IsLatest is set to true
$ aws s3api get-object --bucket my-bucket --key report.pdf output.pdf
An error occurred (NoSuchKey) when calling the GetObject operation: Unknown

# But all versions still exist.
$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf
{
    &amp;quot;DeleteMarkers&amp;quot;: [
        {
            &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;VersionId&amp;quot;: &amp;quot;77d9Np158AOrYrDod98ev7EhONah2G.&amp;quot;,
            &amp;quot;IsLatest&amp;quot;: true
        }
    ],
    &amp;quot;Versions&amp;quot;: [
        {
            &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;,
            &amp;quot;IsLatest&amp;quot;: false,
            &amp;quot;Size&amp;quot;: 1012
        },
        ...
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;recovering-deleted-objects&quot;&gt;Recovering Deleted Objects &lt;a class=&quot;link-anchor&quot; href=&quot;#recovering-deleted-objects&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recovery is straightforward: either delete the Delete Marker or copy a specific version
back to the current position:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Method 1: Remove the Delete Marker
$ aws s3api delete-object --bucket my-bucket --key report.pdf &#92;
    --version-id &amp;quot;77d9Np158AOrYrDod98ev7EhONah2G.&amp;quot;

# Method 2: Copy a specific version to restore it as current
$ aws s3api copy-object &#92;
    --copy-source &amp;quot;my-bucket/report.pdf?versionId=5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot; &#92;
    --bucket my-bucket --key report.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;permanent-deletion&quot;&gt;Permanent Deletion &lt;a class=&quot;link-anchor&quot; href=&quot;#permanent-deletion&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To permanently remove data from a versioned bucket, you must explicitly delete each
version by specifying its version ID:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Permanent deletion of the object requires the version ID of each version to get deleted
$ aws s3api delete-object --bucket my-bucket --key report.pdf &#92;
    --version-id &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;common-misconception%3A-%22delete-markers%22&quot;&gt;Common Misconception: &amp;quot;Delete Markers&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#common-misconception%3A-%22delete-markers%22&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Question: &amp;quot;If I delete all versions of an object, will the delete markers be
automatically removed by the garbage collection process?&amp;quot;&lt;/p&gt;
&lt;p&gt;No! Delete markers are permanent metadata that preserve deletion history.
They persist indefinitely unless explicitly removed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Lifecycle policy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;{
  &amp;quot;Rules&amp;quot;: [
    {
      &amp;quot;ID&amp;quot;: &amp;quot;remove-expired-delete-markers&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: {},
      &amp;quot;Expiration&amp;quot;: {
        &amp;quot;ExpiredObjectDeleteMarker&amp;quot;: true
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Manual deletion: &lt;code&gt;$ aws s3api delete-object --version-id &amp;lt;delete-marker-id&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Why this matters: With high-churn workloads (frequent PUT/DELETE cycles), delete
markers accumulate silently, causing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Bucket index bloat (millions of entries with no data)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Severe ListObjects performance degradation&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The fix: Configure lifecycle policies for versioned buckets to periodically
clean up expired delete markers.&lt;/p&gt;
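&lt;p&gt;For completeness, the rule shown above could be applied and verified along these lines
(a sketch; the bucket name and file path are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Apply the lifecycle rule (saved locally as lifecycle.json) to the versioned bucket
$ aws s3api put-bucket-lifecycle-configuration --bucket my-bucket &#92;
    --lifecycle-configuration file://lifecycle.json

# Confirm the rule is in place
$ aws s3api get-bucket-lifecycle-configuration --bucket my-bucket
&lt;/code&gt;&lt;/pre&gt;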
&lt;h3 id=&quot;critical-consideration%3A-every-version-is-a-full-copy&quot;&gt;Critical Consideration: Every Version is a Full Copy &lt;a class=&quot;link-anchor&quot; href=&quot;#critical-consideration%3A-every-version-is-a-full-copy&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A crucial detail that catches many users off guard: &lt;em&gt;each version is a
complete, independent copy of the object&lt;/em&gt;. Unlike filesystem snapshots or
incremental backups, S3 versioning does not store deltas or differences
between versions. When you upload a 1 GB file and then modify a single
byte, you now have two 1 GB objects stored in your cluster. Tiering, however, can
be employed to shift older revisions to more cost-effective storage.&lt;/p&gt;
&lt;p&gt;This design has significant implications for specific workloads:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Pattern&lt;/th&gt;
&lt;th&gt;Impact with Versioning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Large files with frequent minor updates&lt;/td&gt;
&lt;td&gt;Storage multiplies rapidly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log files with append operations&lt;/td&gt;
&lt;td&gt;Each append creates a complete copy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database dumps overwritten daily&lt;/td&gt;
&lt;td&gt;N days = N complete copies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration files updated often&lt;/td&gt;
&lt;td&gt;Manageable (small files)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Example: The Log Append Anti-Pattern&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Consider an application that appends log entries to an S3 object throughout the day:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Hour 1: Create 10 MB log file
$ aws s3 cp app.log s3://versioned-bucket/logs/app.log  # 10 MB stored

# Hour 2: Append 1 MB, re-upload 
$ aws s3 cp app.log s3://versioned-bucket/logs/app.log  # Now 11 MB + 10 MB = 21 MB total

# Hour 3: Append 1 MB, re-upload
$ aws s3 cp app.log s3://versioned-bucket/logs/app.log  # Now 12 MB + 11 MB + 10 MB = 33 MB total

# After 24 hourly appends...
# Actual log data: ~34 MB
# Storage consumed: ~528 MB (sum of all versions)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For workloads that involve frequent modifications to large objects, consider these alternatives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Use unique keys&lt;/em&gt;: Write &lt;code&gt;app-2025-01-15-10.log&lt;/code&gt;, &lt;code&gt;app-2025-01-15-11.log&lt;/code&gt; instead of overwriting&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Disable versioning selectively&lt;/em&gt;: Use separate buckets for append-heavy vs. versioning-critical data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Aggressive Lifecycle policies&lt;/em&gt;: Use &lt;code&gt;NoncurrentVersionExpiration&lt;/code&gt; with short retention periods&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Best Practice&lt;/em&gt;: Before enabling versioning on a bucket, analyze your workload
patterns. Versioning is ideal for objects that change infrequently but need
protection (documents, images, backups). It can be costly for objects that
change constantly (logs, metrics, temporary files).&lt;/p&gt;
&lt;h3 id=&quot;operational-consideration%3A-bucket-index-sharding-and-many-versions&quot;&gt;Operational Consideration: Bucket Index Sharding and Many Versions &lt;a class=&quot;link-anchor&quot; href=&quot;#operational-consideration%3A-bucket-index-sharding-and-many-versions&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Another consideration for versioned buckets concerns how RGW manages the bucket index.
As discussed in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;
of this series, RGW distributes bucket index entries across multiple shards to maintain
performance. However, versioning introduces a constraint: entries for all versions of a single object
must reside on the same bucket index shard.&lt;/p&gt;
&lt;p&gt;This design has several implications:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Uneven Shard Distribution&lt;/em&gt;: Even with hashed sharding, a single object with
thousands of versions can create &amp;quot;hot spots&amp;quot; where one shard holds significantly
more entries than others. This undermines the even distribution that sharding
is designed to provide.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Large omap Warnings&lt;/em&gt;: Each version of an object requires multiple index
entries: approximately 2 + 2N entries for an object with N
versions. Since all these entries must reside on the same shard, a single
heavily-versioned object can push a shard past the RADOS &amp;quot;large omap object&amp;quot; warning threshold:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Versions per Object&lt;/th&gt;
&lt;th&gt;Approximate Index Entries&lt;/th&gt;
&lt;th&gt;Threshold Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;~2,002&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;~100,002&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;~200,002&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Threshold exceeded&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When a shard exceeds 200,000 entries (the default &lt;code&gt;osd_deep_scrub_large_omap_object_key_threshold&lt;/code&gt;),
Ceph raises a &lt;code&gt;LARGE_OMAP_OBJECTS&lt;/code&gt; health warning.&lt;/p&gt;
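&lt;p&gt;If you suspect a heavily versioned object is behind such a warning, two
read-only checks narrow it down. This is a rough sketch: the bucket name is
illustrative, and the exact field names in the stats output can vary between releases.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Cluster-wide: has the warning fired, and on which pool/PG?
$ ceph health detail | grep -A 3 LARGE_OMAP_OBJECTS

# Per-bucket: object count vs. shard count gives a feel for entries per shard
$ radosgw-admin bucket stats --bucket versioned-bucket &#92;
    | jq &#39;{objects: .usage.&amp;quot;rgw.main&amp;quot;.num_objects, shards: .num_shards}&#39;
&lt;/code&gt;&lt;/pre&gt;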
&lt;p&gt;&lt;em&gt;Future Improvements&lt;/em&gt;: The RGW development team is actively working on
enhancements to ordered bucket indexes that will allow version entries
for a single object to span multiple index shards. This architectural
change will  effectively eliminate the current practical limit on the number of
versions per object (currently constrained by omap size limits to roughly 100,000
in the worst case). This work is part of the broader ordered bucket index initiative
discussed in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;
of our blog series.&lt;/p&gt;
&lt;h2 id=&quot;s3-object-lock%3A-immutable-storage&quot;&gt;S3 Object Lock: Immutable Storage &lt;a class=&quot;link-anchor&quot; href=&quot;#s3-object-lock%3A-immutable-storage&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While versioning protects against accidental changes, it doesn&#39;t prevent a
privileged user from deliberately deleting all versions. &lt;em&gt;S3 Object Lock&lt;/em&gt;
provides an additional layer of protection by implementing &lt;em&gt;Write-Once-Read-Many
(WORM)&lt;/em&gt; semantics. Once an object is locked, it cannot be deleted or overwritten
through the S3 endpoint, not even by an RGW admin account, until the lock expires.&lt;/p&gt;
&lt;h3 id=&quot;object-lock-prerequisites&quot;&gt;Object Lock Prerequisites &lt;a class=&quot;link-anchor&quot; href=&quot;#object-lock-prerequisites&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Object Lock has a critical prerequisite: &lt;em&gt;versioning must be enabled&lt;/em&gt;. This
tight coupling exists because Object Lock protects specific &lt;em&gt;object versions&lt;/em&gt;
rather than just object keys.&lt;/p&gt;
&lt;h3 id=&quot;historical-limitation-(pre-tentacle)&quot;&gt;Historical Limitation (Pre-Tentacle) &lt;a class=&quot;link-anchor&quot; href=&quot;#historical-limitation-(pre-tentacle)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before Ceph Tentacle, Object Lock could &lt;em&gt;only&lt;/em&gt;
be enabled at bucket creation time. This was a significant operational constraint:
if you created a bucket without Object Lock and later needed WORM protection, your
only option was to create a new bucket and migrate all data.&lt;/p&gt;
&lt;h3 id=&quot;new-in-ceph-tentacle&quot;&gt;New in Ceph Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#new-in-ceph-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Starting with Ceph Tentacle, you can now enable
Object Lock on existing versioned buckets (ceph/ceph#62063). This removes a major
operational pain point, allowing you to add compliance protection to production buckets without data migration.&lt;/p&gt;
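&lt;p&gt;A quick sketch of what this looks like in practice, assuming the standard
&lt;code&gt;put-object-lock-configuration&lt;/code&gt; call is used against a bucket that already
has versioning enabled (the bucket name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Versioning must already be enabled on the bucket
$ aws s3api get-bucket-versioning --bucket existing-bucket
{
    &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;
}

# Enable Object Lock on the existing bucket (Tentacle and later)
$ aws s3api put-object-lock-configuration --bucket existing-bucket &#92;
    --object-lock-configuration &#39;{&amp;quot;ObjectLockEnabled&amp;quot;: &amp;quot;Enabled&amp;quot;}&#39;
&lt;/code&gt;&lt;/pre&gt;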
&lt;h3 id=&quot;retention-modes%3A-governance-vs.-compliance&quot;&gt;Retention Modes: Governance vs. Compliance &lt;a class=&quot;link-anchor&quot; href=&quot;#retention-modes%3A-governance-vs.-compliance&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Object Lock supports two retention modes, each with different enforcement characteristics:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Governance&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Regular users cannot delete protected objects. However, users with the &lt;code&gt;s3:BypassGovernanceRetention&lt;/code&gt; permission can override the lock. Useful for internal policies that may require exceptions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Compliance&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Absolutely immutable. No user, including an RGW administrator, can delete the object or shorten the retention period through the S3 endpoint. Even the bucket owner cannot override. Required for regulatory compliance (SEC 17a-4, FINRA, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
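&lt;p&gt;The practical difference shows up when someone tries to remove a protected
version. A hedged sketch using the standard AWS CLI flags (the key and version
ID are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# GOVERNANCE mode: a plain delete of a protected version is refused...
$ aws s3api delete-object --bucket worm-bucket --key policy-draft.pdf --version-id v1abc

# ...but a user granted s3:BypassGovernanceRetention can override the lock
$ aws s3api delete-object --bucket worm-bucket --key policy-draft.pdf --version-id v1abc &#92;
    --bypass-governance-retention

# COMPLIANCE mode: the same delete fails for every user until the retention date passes
&lt;/code&gt;&lt;/pre&gt;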
&lt;h3 id=&quot;retention-periods&quot;&gt;Retention Periods &lt;a class=&quot;link-anchor&quot; href=&quot;#retention-periods&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A retention period specifies &lt;em&gt;how long&lt;/em&gt; the lock remains in effect. Once set to
Compliance mode, this period cannot be shortened; it can only be extended.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a bucket with Object Lock Enabled
$ aws s3api create-bucket --bucket worm-bucket --object-lock-enabled-for-bucket

# Set retention on upload
$ aws s3api put-object --bucket worm-bucket --key financial-record.pdf &#92;
    --body financial-record.pdf &#92;
    --object-lock-mode COMPLIANCE &#92;
    --object-lock-retain-until-date &amp;quot;2032-12-31T23:59:59Z&amp;quot;

# Or set default retention for all objects in the bucket
$ aws s3api put-object-lock-configuration --bucket worm-bucket &#92;
    --object-lock-configuration &#39;{
        &amp;quot;ObjectLockEnabled&amp;quot;: &amp;quot;Enabled&amp;quot;,
        &amp;quot;Rule&amp;quot;: {
            &amp;quot;DefaultRetention&amp;quot;: {
                &amp;quot;Mode&amp;quot;: &amp;quot;COMPLIANCE&amp;quot;,
                &amp;quot;Years&amp;quot;: 7
            }
        }
    }&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;legal-hold%3A-indefinite-protection&quot;&gt;Legal Hold: Indefinite Protection &lt;a class=&quot;link-anchor&quot; href=&quot;#legal-hold%3A-indefinite-protection&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In addition to time-based retention, Object Lock supports &lt;em&gt;Legal Hold&lt;/em&gt;, a flag
that prevents deletion regardless of retention settings. Legal Hold acts as a
binary switch (On/Off) and does &lt;em&gt;not&lt;/em&gt; require a retention period; it remains in
effect until explicitly removed. This is designed, for example, for litigation
scenarios where data must be preserved indefinitely until legal proceedings conclude.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Apply Legal Hold to a specific version (example version ID)
$ aws s3api put-object-legal-hold --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot; &#92;
    --legal-hold Status=ON

# Remove Legal Hold (requires s3:PutObjectLegalHold permission)
$ aws s3api put-object-legal-hold --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot; &#92;
    --legal-hold Status=OFF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Important&lt;/em&gt;: An object can have both a retention period AND a Legal Hold.
The object remains protected until BOTH conditions are cleared.&lt;/p&gt;
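&lt;p&gt;To see which protection is keeping a particular version alive, you can query
both independently (key and version ID illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Time-based retention on a specific version
$ aws s3api get-object-retention --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot;

# Legal Hold status on the same version
$ aws s3api get-object-legal-hold --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot;
&lt;/code&gt;&lt;/pre&gt;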
&lt;h3 id=&quot;regulatory-compliance%3A-third-party-validation&quot;&gt;Regulatory Compliance: Third-Party Validation &lt;a class=&quot;link-anchor&quot; href=&quot;#regulatory-compliance%3A-third-party-validation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For organizations in regulated industries, Ceph&#39;s Object Lock
implementation has been independently assessed by Cohasset Associates, a
consulting firm specializing in records management and information
governance. &lt;a href=&quot;https://www.ibm.com/downloads/cas/PJZN8VE3&quot;&gt;Their October 2023 compliance assessment&lt;/a&gt;
confirms that Ceph with Object Lock meets the electronic recordkeeping
requirements of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;SEC Rules 17a-4(f) and 18a-6(e)&lt;/em&gt;: Non-rewriteable, non-erasable record format (WORM) requirements for broker-dealers and security-based swap entities&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;FINRA Rule 4511(c)&lt;/em&gt;: Which defers to SEC Rule 17a-4 for format and media requirements&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;CFTC Rule 1.31(c)-(d)&lt;/em&gt;: Principles-based requirements for commodity futures trading firms&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;understanding-object-lock-protection-boundaries&quot;&gt;Understanding Object Lock Protection Boundaries &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-object-lock-protection-boundaries&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It&#39;s essential to understand what Object Lock protects against and what it does
not. &lt;em&gt;Object Lock enforcement occurs at the S3 API layer&lt;/em&gt;. When a DELETE request
arrives at the Object Gateway (RGW) endpoint, the gateway checks the lock status
and denies the operation if the object is protected. This means:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What Object Lock Protects Against:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Accidental deletion via S3 clients (aws cli, SDKs, applications)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Malicious deletion by compromised S3 credentials&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deletion by any user, including the bucket owner and RGW admin account (in Compliance mode)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Programmatic bulk deletions from rogue scripts or ransomware targeting S3 APIs&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;What Object Lock Does NOT Protect Against:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Direct RADOS-level operations (&lt;code&gt;rados rm&lt;/code&gt;, &lt;code&gt;radosgw-admin bucket rm --purge-objects&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Physical destruction of storage media&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Cluster-level administrative actions by users with Ceph admin credentials&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# This is blocked by Object Lock
$ aws s3api delete-object --bucket compliance-bucket --key locked-file.pdf --version-id abc123
An error occurred (AccessDenied) when calling the DeleteObject operation: forbidden by object lock

# But someone with RADOS admin access could still do this (DON&#39;T DO THIS!)
$ rados -p default.rgw.buckets.data rm &amp;lt;bucket_marker&amp;gt;_locked-file.pdf
# This bypasses Object Lock entirely - the data is gone
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is not a limitation unique to Ceph; it&#39;s inherent to any software-enforced
protection. Object Lock protects your data at the application layer (S3 API),
but someone with root access to the underlying storage infrastructure operates
at a different trust boundary entirely.&lt;/p&gt;
&lt;p&gt;Object Lock provides strong protection against S3-layer threats and satisfies
regulatory requirements (SEC 17a-4, etc.) when combined with appropriate access
controls at the infrastructure layer. The RADOS bypass scenario requires
privileged cluster access that should be tightly controlled and audited through
separate mechanisms.&lt;/p&gt;
&lt;h2 id=&quot;rgw-internals%3A-the-object-logical-head-(olh)&quot;&gt;RGW Internals: The Object Logical Head (OLH) &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-internals%3A-the-object-logical-head-(olh)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we understand the API semantics, let&#39;s examine how RGW implements
versioning under the hood. This is where the &lt;em&gt;Object Logical Head (OLH)&lt;/em&gt; becomes essential.&lt;/p&gt;
&lt;h3 id=&quot;the-problem%3A-resolving-ambiguity&quot;&gt;The Problem: Resolving Ambiguity &lt;a class=&quot;link-anchor&quot; href=&quot;#the-problem%3A-resolving-ambiguity&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Consider a simple GET request: &lt;code&gt;GET /bucket/photo.jpg&lt;/code&gt;. In an unversioned bucket,
this is unambiguous: there&#39;s exactly one object with that key. But with versioning
enabled, &amp;quot;photo.jpg&amp;quot; could have dozens of versions. Which one should RGW return?&lt;/p&gt;
&lt;p&gt;The naive solution is to scan all versions and select the one with the most recent
timestamp. But this approach has profound performance implications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Every GET would require a range scan of the bucket index&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The cost would grow linearly with the number of versions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Concurrent writes could create race conditions&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RGW solves this with a layer of indirection: the &lt;em&gt;Object Logical Head&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id=&quot;what-is-the-olh%3F&quot;&gt;What is the OLH? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-is-the-olh%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The OLH is a mechanism that tracks which version instance is the &amp;quot;current&amp;quot;
version of an object. When you access &lt;code&gt;photo.jpg&lt;/code&gt; without a version ID, RGW
uses the OLH to determine which version instance to return.&lt;/p&gt;
&lt;p&gt;The Ceph source code defines distinct entry types in the bucket index (&lt;code&gt;cls_rgw_types.h&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;enum class BIIndexType : uint8_t {
  Invalid        = 0,
  Plain          = 1,   // Non-versioned object entries
  Instance       = 2,   // Individual version instances
  OLH            = 3,   // Object Logical Head
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When versioning is enabled:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Each object version is stored as an &lt;em&gt;Instance&lt;/em&gt; entry with a unique version ID&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;em&gt;OLH&lt;/em&gt; entry tracks which instance is current&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Non-versioned objects use &lt;em&gt;Plain&lt;/em&gt; entries&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;olh-epochs%3A-ordering-versions&quot;&gt;OLH Epochs: Ordering Versions &lt;a class=&quot;link-anchor&quot; href=&quot;#olh-epochs%3A-ordering-versions&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;RGW uses an &lt;code&gt;olh_epoch&lt;/code&gt; counter to establish version ordering. As described in
the Ceph GitHub repo:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: &amp;quot;The existing algorithm uses an OLH epoch, incremented with each new version of
a name, that is used to sort its versions from newest to oldest.&amp;quot; — &lt;a href=&quot;https://github.com/ceph/ceph/pull/31325&quot;&gt;ceph/ceph PR #31325&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When a new version is written:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;olh_epoch&lt;/code&gt; is incremented&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new Instance entry is created in the bucket index&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The OLH is updated to reflect the new current version&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This epoch-based approach ensures consistent ordering even in concurrent write
scenarios and is critical for multi-site replication where versions may arrive
out of order.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/olh.jpg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;the-olh-log&quot;&gt;The OLH Log &lt;a class=&quot;link-anchor&quot; href=&quot;#the-olh-log&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The OLH mechanism includes an &lt;code&gt;olh_log&lt;/code&gt; that records modifications to the
version history. Rather than updating the OLH pointer directly, changes
are logged and then applied. This log-based approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Enables safe concurrent modifications&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Supports multi-site synchronization (each zone maintains its own &lt;code&gt;olh_log&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Allows recovery from partial failures&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;olh_log&lt;/code&gt; is processed by functions like &lt;code&gt;apply_olh_log()&lt;/code&gt; in the RGW
codebase, which evaluates pending changes and updates the current version
pointer accordingly.&lt;/p&gt;
&lt;h3 id=&quot;delete-markers-and-the-olh&quot;&gt;Delete Markers and the OLH &lt;a class=&quot;link-anchor&quot; href=&quot;#delete-markers-and-the-olh&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When deleting an object in a versioned bucket, RGW creates a Delete Marker
using a dedicated operation (&lt;code&gt;CLS_RGW_OP_LINK_OLH_DM&lt;/code&gt;). This operation:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Creates a special zero-byte Instance entry marked as a delete marker&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Updates the OLH to point to this delete marker as the current version&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Subsequent GET requests (without a version ID) will resolve to the delete
marker and return &lt;code&gt;404&lt;/code&gt;, while direct version access still works for all
previous versions.&lt;/p&gt;
&lt;h3 id=&quot;examining-the-bucket-index&quot;&gt;Examining the Bucket Index &lt;a class=&quot;link-anchor&quot; href=&quot;#examining-the-bucket-index&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You can examine the bucket index entries using &lt;code&gt;radosgw-admin&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bi list --bucket my-bucket
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;The OLH Entry&lt;/em&gt; tracks the current version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;olh&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801001_report.pdf&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;key&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;instance&amp;quot;: &amp;quot;sTsGobhZm2cGravZvOmc9IbpXgIEM8R&amp;quot;
        },
        &amp;quot;delete_marker&amp;quot;: false,
        &amp;quot;epoch&amp;quot;: 6,
        &amp;quot;pending_log&amp;quot;: [],
        &amp;quot;exists&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The OLH tells us: the current version of &lt;code&gt;report.pdf&lt;/code&gt; is instance &lt;code&gt;sTsGobhZm2cGravZvOmc9IbpXgIEM8R&lt;/code&gt;.
The current epoch is &lt;code&gt;6&lt;/code&gt;, and it&#39;s not a delete marker.&lt;/p&gt;
&lt;p&gt;Instance Type Entries exist for each version of the object:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;instance&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801000_report.pdf&#92;u0000isTsGobhZm2cGravZvOmc9IbpXgIEM8R&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
        &amp;quot;instance&amp;quot;: &amp;quot;sTsGobhZm2cGravZvOmc9IbpXgIEM8R&amp;quot;,
        &amp;quot;exists&amp;quot;: true,
        &amp;quot;meta&amp;quot;: {
            &amp;quot;size&amp;quot;: 1012,
            &amp;quot;mtime&amp;quot;: &amp;quot;2025-12-05T11:47:54.163133Z&amp;quot;,
            &amp;quot;etag&amp;quot;: &amp;quot;959f45520adcbe51b3d7b24e1379d3c0&amp;quot;
        },
        &amp;quot;versioned_epoch&amp;quot;: 6
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice how &lt;code&gt;versioned_epoch&lt;/code&gt; establishes ordering. Our three versions have
epochs &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;3&lt;/code&gt;, and &lt;code&gt;6&lt;/code&gt;; the OLH points to epoch &lt;code&gt;6&lt;/code&gt;, confirming it&#39;s the current version.&lt;/p&gt;
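&lt;p&gt;The same ordering is visible from the S3 side: &lt;code&gt;list-object-versions&lt;/code&gt;
reports each instance ID together with an &lt;code&gt;IsLatest&lt;/code&gt; flag, which should
agree with the instance the OLH entry points at:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf &#92;
    --query &#39;Versions[].{VersionId: VersionId, IsLatest: IsLatest, LastModified: LastModified}&#39;
&lt;/code&gt;&lt;/pre&gt;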
&lt;p&gt;When we delete &lt;code&gt;report.pdf&lt;/code&gt; without specifying a version ID, a Delete Marker is created:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3api delete-object --bucket my-bucket --key report.pdf
{
    &amp;quot;DeleteMarker&amp;quot;: true,
    &amp;quot;VersionId&amp;quot;: &amp;quot;NtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the OLH has changed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;olh&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801001_report.pdf&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;key&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;instance&amp;quot;: &amp;quot;NtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;
        },
        &amp;quot;delete_marker&amp;quot;: true,
        &amp;quot;epoch&amp;quot;: 7,
        &amp;quot;pending_log&amp;quot;: [],
        &amp;quot;exists&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The OLH now points to a new instance with &lt;code&gt;&amp;quot;delete_marker&amp;quot;: true&lt;/code&gt; and epoch &lt;code&gt;7&lt;/code&gt;.
The delete marker&#39;s instance entry confirms it&#39;s a zero-byte marker:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;instance&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801000_report.pdf&#92;u0000iNtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
        &amp;quot;instance&amp;quot;: &amp;quot;NtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;,
        &amp;quot;exists&amp;quot;: false,
        &amp;quot;meta&amp;quot;: {
            &amp;quot;size&amp;quot;: 0,
            &amp;quot;mtime&amp;quot;: &amp;quot;2025-12-05T13:35:37.561880Z&amp;quot;
        },
        &amp;quot;tag&amp;quot;: &amp;quot;delete-marker&amp;quot;,
        &amp;quot;versioned_epoch&amp;quot;: 7
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin object stat&lt;/code&gt; command can be useful, providing a
higher-level view down to a specific object version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin object stat --bucket my-bucket --object report.pdf
$ radosgw-admin object stat --bucket my-bucket --object report.pdf --object-version sTsGobhZm2cGravZvOmc9IbpXgIEM8R
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This shows the object&#39;s metadata, manifest, and version information from RGW&#39;s perspective.&lt;/p&gt;
&lt;h3 id=&quot;key-takeaways&quot;&gt;Key Takeaways &lt;a class=&quot;link-anchor&quot; href=&quot;#key-takeaways&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The OLH mechanism provides several essential properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Efficient lookups&lt;/em&gt;: GET requests without a version ID can quickly resolve to the current version without scanning all versions&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Consistent ordering&lt;/em&gt;: The epoch-based system ensures deterministic version ordering&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Multi-site compatibility&lt;/em&gt;: The &lt;code&gt;olh_log&lt;/code&gt; design supports replication scenarios where versions may be created concurrently in different zones&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Safe concurrent access&lt;/em&gt;: The log-and-apply model handles race conditions between concurrent writers&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;lifecycle-management-with-versioning&quot;&gt;Lifecycle Management with Versioning &lt;a class=&quot;link-anchor&quot; href=&quot;#lifecycle-management-with-versioning&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As we discussed in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;,
Lifecycle management automates data governance through policy-based rules. With
versioning enabled, lifecycle policies gain additional capabilities for managing
version history.&lt;/p&gt;
&lt;h3 id=&quot;expiration-actions-for-versioned-buckets&quot;&gt;Expiration Actions for Versioned Buckets &lt;a class=&quot;link-anchor&quot; href=&quot;#expiration-actions-for-versioned-buckets&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Expiration&lt;/code&gt; (Days/Date)&lt;/td&gt;
&lt;td&gt;Adds a Delete Marker to current versions (does not delete data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NoncurrentVersionExpiration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Permanently deletes noncurrent versions after specified days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ExpiredObjectDeleteMarker&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes Delete Markers when they&#39;re the only remaining version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NewerNoncurrentVersions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Limits how many noncurrent versions to retain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&quot;example%3A-version-retention-policy&quot;&gt;Example: Version Retention Policy &lt;a class=&quot;link-anchor&quot; href=&quot;#example%3A-version-retention-policy&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This policy keeps the current version indefinitely, retains the last three
noncurrent versions, and permanently deletes older noncurrent versions after 90 days:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;
{
  &amp;quot;Rules&amp;quot;: [
    {
      &amp;quot;ID&amp;quot;: &amp;quot;Version Retention Policy&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: {
        &amp;quot;Prefix&amp;quot;: &amp;quot;&amp;quot;
      },
      &amp;quot;NoncurrentVersionExpiration&amp;quot;: {
        &amp;quot;NoncurrentDays&amp;quot;: 90,
        &amp;quot;NewerNoncurrentVersions&amp;quot;: 3
      },
      &amp;quot;Expiration&amp;quot;: {
        &amp;quot;ExpiredObjectDeleteMarker&amp;quot;: true
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply the policy using the AWS CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://rgw:80 s3api put-bucket-lifecycle-configuration --bucket versioned-bucket --lifecycle-configuration file://lifecycle-policy.json
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;understanding-noncurrentversionexpiration-parameters&quot;&gt;Understanding NoncurrentVersionExpiration Parameters &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-noncurrentversionexpiration-parameters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;NoncurrentVersionExpiration&lt;/code&gt; rule takes two parameters that work together
to control version retention:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;&amp;quot;NoncurrentVersionExpiration&amp;quot;: {
  &amp;quot;NoncurrentDays&amp;quot;: 90,
  &amp;quot;NewerNoncurrentVersions&amp;quot;: 3
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;How it works:&lt;/em&gt; Both conditions must be &lt;code&gt;true&lt;/code&gt; for a version to be deleted:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Version must be noncurrent for at least &lt;code&gt;NoncurrentDays&lt;/code&gt; (90 days), &lt;em&gt;AND&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;There must be at least &lt;code&gt;NewerNoncurrentVersions&lt;/code&gt; (3) newer noncurrent versions, i.e. the version is not among the three newest noncurrent versions&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let&#39;s say you have an object &lt;code&gt;report.pdf&lt;/code&gt; with this version history:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;Current version (latest):
└─ v10 - 2025-12-11 (current version, not affected by this rule)

Noncurrent versions (older versions):
├─ v9  - 2025-12-10 (1 day noncurrent)   ← Newer noncurrent #1
├─ v8  - 2025-12-08 (3 days noncurrent)  ← Newer noncurrent #2
├─ v7  - 2025-12-05 (6 days noncurrent)  ← Newer noncurrent #3
├─ v6  - 2025-09-01 (102 days noncurrent) ✅ DELETE (&amp;gt;90 days AND &amp;gt;3 newer versions)
├─ v5  - 2025-08-15 (118 days noncurrent) ✅ DELETE
├─ v4  - 2025-08-01 (132 days noncurrent) ✅ DELETE
├─ v3  - 2025-07-15 (149 days noncurrent) ✅ DELETE
├─ v2  - 2025-07-01 (163 days noncurrent) ✅ DELETE
└─ v1  - 2025-06-15 (179 days noncurrent) ✅ DELETE
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;lifecycle-and-object-lock-interaction&quot;&gt;Lifecycle and Object Lock Interaction &lt;a class=&quot;link-anchor&quot; href=&quot;#lifecycle-and-object-lock-interaction&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When both Lifecycle policies and Object Lock are active, Object Lock takes precedence.
If a Lifecycle rule attempts to delete a locked object version, the deletion is blocked:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/ocol.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This ensures that compliance requirements always take precedence over automated cleanup policies.&lt;/p&gt;
&lt;h3 id=&quot;cloud-transition-and-object-lock&quot;&gt;Cloud Transition and Object Lock &lt;a class=&quot;link-anchor&quot; href=&quot;#cloud-transition-and-object-lock&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;RGW&#39;s policy-based cloud transition feature allows you to tier data to external
S3-compatible endpoints (public cloud, tape gateways, etc.) using Lifecycle
policies. When Object Lock is active, &lt;em&gt;locked objects are automatically skipped
during cloud transitions&lt;/em&gt; to preserve the WORM contract.&lt;/p&gt;
&lt;p&gt;This behavior is intentional: cloud transition is a &lt;em&gt;destructive&lt;/em&gt; operation
from Ceph&#39;s perspective: after transition, the local copy is typically removed
and replaced with a stub. Allowing this for locked objects would violate the
immutability guarantee.&lt;/p&gt;
&lt;p&gt;From the RGW lifecycle code (&lt;code&gt;rgw_lc.cc&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;if (!oc.o.is_current() &amp;amp;&amp;amp;
    !pass_object_lock_check(oc.driver, oc.obj.get(), oc.dpp)) {
  /* Skip objects which has object lock enabled. */
  ldpp_dout(oc.dpp, 10) &amp;lt;&amp;lt; &amp;quot;Object(key:&amp;quot; &amp;lt;&amp;lt; oc.o.key 
                        &amp;lt;&amp;lt; &amp;quot;) is locked. Skipping transition to cloud-s3 tier&amp;quot;
                        &amp;lt;&amp;lt; dendl;
  return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that compliance data remains on your Ceph cluster until the
retention period expires, regardless of any cloud tiering policies that
might otherwise apply.&lt;/p&gt;
&lt;h2 id=&quot;operational-considerations&quot;&gt;Operational Considerations &lt;a class=&quot;link-anchor&quot; href=&quot;#operational-considerations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;storage-capacity-planning&quot;&gt;Storage Capacity Planning &lt;a class=&quot;link-anchor&quot; href=&quot;#storage-capacity-planning&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With versioning enabled, storage consumption can grow rapidly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Every modification creates a new complete object version&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Delete operations don&#39;t free space (they add Delete Markers)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Space is only reclaimed when versions are permanently deleted&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Monitoring recommendation&lt;/em&gt;: Track both logical (S3-reported) and physical (RADOS-reported) usage:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# S3-level bucket statistics
$ radosgw-admin bucket stats --bucket versioned-bucket | jq &#39;.usage&#39;

# RADOS pool usage
$ ceph df detail
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You may find it useful to add this or a similar exporter, such as &lt;a href=&quot;https://github.com/pcuzner/rgw-exporter&quot;&gt;rgw-exporter&lt;/a&gt;,
to your Prometheus stack.&lt;/p&gt;
&lt;h3 id=&quot;index-shard-sizing-for-versioned-buckets&quot;&gt;Index Shard Sizing for Versioned Buckets &lt;a class=&quot;link-anchor&quot; href=&quot;#index-shard-sizing-for-versioned-buckets&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Each object version creates additional entries in the bucket index. A bucket with one million objects
and an average of ten versions per object has 10 million index entries. Plan shard counts accordingly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Check current shard count
$ radosgw-admin bucket stats --bucket versioned-bucket | jq &#39;.num_shards&#39;

# Consider pre-sharding for expected growth
$ radosgw-admin bucket reshard --bucket versioned-bucket --num-shards XXX
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that as of December 2025 there is work underway to enhance dynamic resharding to account for
versioned objects. Clusters running earlier releases should factor versioning into their manual
shard count or dynamic resharding threshold.&lt;/p&gt;
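&lt;p&gt;In the meantime, it is worth knowing where the relevant knobs live. A short
sketch of how to inspect them (option names as of recent releases; verify against
your version&#39;s documentation before changing anything):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Is dynamic resharding enabled at all?
$ ceph config get client.rgw rgw_dynamic_resharding

# Objects-per-shard target used to trigger a reshard (default 100000)
$ ceph config get client.rgw rgw_max_objs_per_shard
&lt;/code&gt;&lt;/pre&gt;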
&lt;h3 id=&quot;mfa-delete-for-additional-security&quot;&gt;MFA Delete for Additional Security &lt;a class=&quot;link-anchor&quot; href=&quot;#mfa-delete-for-additional-security&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;RGW supports MFA Delete, which requires multi-factor authentication to permanently
delete object versions or change a bucket&#39;s versioning state.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Generate current TOTP code
$ oathtool -d6 --totp b4902c641a1363541b32abc2a26817
293651

# Enable MFA Delete (note: serial + space + code)
$ aws --endpoint=http://rgw:80 s3api put-bucket-versioning &#92;
    --bucket secure-bucket &#92;
    --versioning-configuration MFADelete=Enabled,Status=Enabled &#92;
    --mfa &amp;quot;my-mfa-device 293651&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once enabled, any attempt to permanently delete a version without providing a
valid MFA code will fail with &lt;code&gt;AccessDenied&lt;/code&gt;.&lt;/p&gt;
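&lt;p&gt;With MFA Delete active, a permanent version delete has to carry the device
serial and a current code, for example (serial, code, and version ID are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://rgw:80 s3api delete-object &#92;
    --bucket secure-bucket --key report.pdf &#92;
    --version-id &amp;quot;abc123&amp;quot; &#92;
    --mfa &amp;quot;my-mfa-device 293651&amp;quot;
&lt;/code&gt;&lt;/pre&gt;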
&lt;p&gt;For complete MFA setup instructions, including creating TOTP tokens
with &lt;code&gt;radosgw-admin mfa create&lt;/code&gt;, see the &lt;a href=&quot;https://docs.ceph.com/en/latest/radosgw/mfa&quot;&gt;Ceph Object Gateway Multi-Factor Authentication documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-the-complete-data-protection-stack&quot;&gt;Conclusion: The Complete Data Protection Stack &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-the-complete-data-protection-stack&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;images/lock2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Across this third deep dive, we&#39;ve explored how Ceph RGW implements two
cornerstone features for enterprise data protection. &lt;em&gt;Versioning&lt;/em&gt;
provides a complete history of every object, enabling recovery from
accidental modifications and deletions. &lt;em&gt;Object Lock&lt;/em&gt; adds WORM
semantics for regulatory compliance and ransomware protection.&lt;/p&gt;
&lt;p&gt;At the heart of these features is the &lt;em&gt;Object Logical Head (OLH)&lt;/em&gt;,
an elegant architectural solution that maintains version history efficiently through a layer of indirection.&lt;/p&gt;
&lt;p&gt;Combined with the Lifecycle Management capabilities from &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;,
you now have a complete picture of RGW&#39;s data governance stack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Versioning + OLH&lt;/em&gt;: Preserves history and enables point-in-time recovery&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Object Lock&lt;/em&gt;: Enforces immutability for compliance&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Lifecycle Management&lt;/em&gt;: Automates version cleanup within policy constraints&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Garbage Collection&lt;/em&gt;: Reclaims space from permanently deleted versions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In upcoming articles, we&#39;ll continue our exploration with topics including multi-site replication and STS/IAM integration. Stay tuned!&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>KV Caching with vLLM, LMCache, and Ceph</title>
    <link href="https://ceph.io/en/news/blog/2025/vllm-kv-caching/" />
    <updated>2025-12-10T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/vllm-kv-caching/</id>
    <author>
      <name>Kyle Bader, Tushar Gohad</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/vllm-kv-caching/">&lt;p&gt;Inference accounts for &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S2210537923000124&quot;&gt;90% of machine learning
costs&lt;/a&gt; for deployed AI
systems, and it is no surprise that inference optimization is a burgeoning topic
in the research community. &lt;a href=&quot;https://info.idc.com/futurescape-generative-ai-2025-predictions.html&quot;&gt;IDC
estimates&lt;/a&gt; that global enterprises will invest
$307 billion USD on AI solutions in 2025, and that number is expected to grow
aggressively year-over-year.&lt;/p&gt;
&lt;h2 id=&quot;understanding-the-workload&quot;&gt;Understanding the workload &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-the-workload&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Unlike training, inference for autoregressive language models only involves the
forward pass, which itself is broken up into two distinct phases: prefill and
decode. Each phase has a unique workload profile – prefill tends to be
computation bound, consuming every ounce of floating-point arithmetic capability
the system can garner, followed by decode, which is principally limited by
memory bandwidth.&lt;/p&gt;
&lt;p&gt;The computational complexity of both prefill and decode phases grows
quadratically with each additional token. Prefill is easily parallelized across
GPUs - all prompt tokens are known up front when a request arrives at the model
API. The decode phase brings in the transformer multi-headed attention mechanism
and must compute the attention states across all previous tokens - including any
prompt(s) and generated responses. This complicates the deployment of inference
services where context lengths are growing rapidly to accommodate larger code
bases, longer documents, and retrieval augmented generation. KV caching is where
the computed key and value tensors that correspond with token sequences in a
prompt are saved for later, and then retrieved when they are used in a
subsequent prompt to avoid the cost of computation (GPU hours) and to reduce
the time between when the prompt was submitted as a request and the first
response token (time-to-first-token, or TTFT).&lt;/p&gt;
&lt;h2 id=&quot;cache-blocks-in-vllm-and-lmcache&quot;&gt;Cache blocks in vLLM and LMCache &lt;a class=&quot;link-anchor&quot; href=&quot;#cache-blocks-in-vllm-and-lmcache&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;vLLM takes a hierarchical approach to KV caching. First it checks for the
existence of cache blocks in GPU memory; if there is a cache miss it will
progress to CPU memory, and if there is again a cache miss it will try to
retrieve cache blocks over any configured KV connectors. LMCache works with vLLM
over this KV connector interface - vLLM sends or requests cache blocks and
LMCache works to diligently store or stream cache blocks it locates. vLLM also
introduced the technique of &lt;a href=&quot;https://arxiv.org/pdf/2309.06180&quot;&gt;Paged Attention&lt;/a&gt;, which breaks up prompts into fixed
sized token sequences referred to as a block, 16 tokens by default. LMCache uses
a larger 256 token block by default, presumably to reduce the overhead of
managing references to many blocks and to better amortize the per-block transfer
overhead. Storage folks, being unfamiliar with a token as a unit of measurement
for space and IO, might naturally wonder what this translates to in terms of
block sizes expressed in bytes. The bytes-per-token is model dependent, because
it’s a product of the model’s hidden size, number of key-value heads, number of
hidden layers, head dimension, and data type size. For a model like Qwen3-32B
this works out to be approximately 62.5 MiB per 256-token cache block. There is a convenient &lt;a href=&quot;https://docs.lmcache.ai/getting_started/kv_cache_calculator.html&quot;&gt;KV Cache
calculator&lt;/a&gt; available on the documentation page for LMCache if you want to see
how much KV space would be required for any given model or number of tokens.&lt;/p&gt;
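&lt;p&gt;As a back-of-the-envelope sketch of that arithmetic: the KV footprint per token
is 2 (keys and values) × layers × KV heads × head dimension × bytes per element.
The values below are placeholder assumptions for a Qwen3-32B-class configuration;
read the real ones from your model&#39;s config, and prefer the calculator above for
anything load-bearing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Placeholder model parameters - replace with your model&#39;s actual config
LAYERS=64 KV_HEADS=8 HEAD_DIM=128 BYTES_PER_ELEM=2 BLOCK_TOKENS=256

# The leading 2 accounts for storing both the key and the value per token
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM ))
echo &amp;quot;KV bytes per token: ${PER_TOKEN}&amp;quot;
echo &amp;quot;KV bytes per ${BLOCK_TOKENS}-token block: $(( PER_TOKEN * BLOCK_TOKENS ))&amp;quot;
&lt;/code&gt;&lt;/pre&gt;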
&lt;h2 id=&quot;content-addressable-kv-storage&quot;&gt;Content addressable KV storage &lt;a class=&quot;link-anchor&quot; href=&quot;#content-addressable-kv-storage&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;vLLM and LMCache both calculate a hash of the token sequence that represents a
block and use that as a cache block identifier. This means that vLLM will pass
over the kv-connector interface the hashes of cache blocks that it is interested
in, and LMCache will return a bitmask indicating which cache blocks it can
provide. Under the covers the LMCache S3 connector will make GetObjectAttributes
calls with each block identifier (hash of the token sequence) and for each block
that exists it will flip the corresponding bit in the mask. The elegance of this
approach is that there is no cache block map that needs to be persisted, and no
coordination necessary when there are multiple instances of vLLM+LMCache running
across different hosts. In fact, there is no requirement that the &lt;a href=&quot;https://docs.lmcache.ai/kv_cache_management/index.html&quot;&gt;LMCache
controller&lt;/a&gt; be configured at all. This design also
permits flexible eviction: a storage system could implement time-based
expiration via Lifecycle configurations, and any deleted block simply registers
as a miss. In the end you get fully elastic content addressable storage for KV
cache blocks with flexible eviction. Anyone familiar with Ceph will truly
appreciate the notion of computing the location of data over performing a
lookup.&lt;/p&gt;
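&lt;p&gt;To make that concrete, the existence probe is conceptually just a metadata call
against a key derived from the block hash. A hand-run equivalent with the AWS CLI
might look like the following; the bucket, prefix, and hash placeholder are
illustrative and not LMCache&#39;s exact key format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A present block returns its attributes; a missing one returns NoSuchKey,
# which LMCache records as a miss in the bitmask it hands back to vLLM.
aws --endpoint-url http://s3.cephlab.com s3api get-object-attributes &#92;
    --bucket lmcache --key &amp;quot;test/&amp;lt;token-sequence-hash&amp;gt;&amp;quot; &#92;
    --object-attributes ObjectSize
&lt;/code&gt;&lt;/pre&gt;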
&lt;h2 id=&quot;retrieving-cache-blocks&quot;&gt;Retrieving cache blocks &lt;a class=&quot;link-anchor&quot; href=&quot;#retrieving-cache-blocks&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We began exploring LMCache by testing its native S3 connector with Ceph, as it
provides an accessible entry point for most existing environments. The other
appeal of the native S3 connector in LMCache is that it leverages an AWS common
runtime library (CRT), which means that the connections in the client’s
connection pool will be multiplexed across endpoints that are returned in the
DNS response for the object store’s FQDN. The downside is that the bindings in
the AWS common runtime library for Python only support &lt;code&gt;recv_filepath&lt;/code&gt; and
&lt;code&gt;send_filepath&lt;/code&gt;, which limits the ability of LMCache to stream the response body
of a GetObject call directly to page-locked memory buffers allocated by the
LocalCPUBackend. To work around this limitation the connector pre-allocates and
mmaps files on a tmpfs mounted at /dev/shm (one per concurrent request); in this
way the CRT client can pass the file descriptors of memory-mapped files and then
memcpy from their corresponding buffers to page-locked LocalCPUBackend buffers
that are used for DMA transfers to the GPU. This is a clever way of working
around most of the limitations of aws-crt-python, but to get true zero-copy it
will require changes to the bindings.&lt;/p&gt;
&lt;p&gt;After some preliminary testing with the native S3 connector &lt;a href=&quot;https://github.com/LMCache/LMCache/pull/1939&quot;&gt;LMCache
PR#1939&lt;/a&gt;
caught our eye because it leveraged the NVIDIA Inference Xfer Library (NIXL). This
PR introduces the ability to directly read S3 data into page-locked NIXL
buffers, bypassing files on /dev/shm and the associated memory copy. It also
introduced a presence cache to eliminate redundant GetObjectInfo requests that
are used to determine if a cache block exists for a given sequence. We had
experimented with the NIXL obj plugin already and ran some rudimentary nixlbench
tests. What we found was that the NIXL obj plugin alone wanted a pre-allocated
pool of object keys, and that it required either the LMCache coordinator or
Dynamo KVBM to maintain device ID, offset, and length information for each cache
block. Unlike other NIXL plugins, the obj plugin could only write a single cache
block to each device ID (1:1 mapping with object key), because object APIs like
S3 do not support writes to arbitrary offsets. This is all addressed by PR1939,
because instead of using a pool of object keys and tracking cache block
metadata, it preserves the content addressable approach of LMCache’s native S3
connector. The only remaining downside with NIXL is that it used S3Client
instead of S3CrtClient, the latter of which supports multipathing across S3
endpoints.&lt;/p&gt;
&lt;h2 id=&quot;hyperscale-ai-deployments&quot;&gt;Hyperscale AI deployments &lt;a class=&quot;link-anchor&quot; href=&quot;#hyperscale-ai-deployments&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Drawing from over a decade of experience selecting hardware for Ceph storage
systems we had an idea of what sort of system we would want to build to
maximize throughput, while also drawing inspiration from choices made by major
AI practitioners like Meta and OpenAI. Enter Meta’s contribution to the Open
Compute project – the &lt;a href=&quot;https://www.opencompute.org/documents/yosemite-v3-5-platform-design-specification-v1-2-pdf&quot;&gt;Yosemite
V3.5&lt;/a&gt; Sierra Point server platform. The YV3.5
cubby occupies 3 OU and can be populated with 6x Sierra Point blades. Unlike
conventional enterprise blade systems the YV3.5 platform does not have an
integrated ethernet switch; instead, each Sierra Point blade has an OCP 3.0 slot
for direct to host network connectivity. We wanted a system that was a spiritual
successor to YV3.5 and Sierra Point, that reaped the advantages of cutting-edge
processor designs and lithography. While surveying the server landscape across a
whole host of OEMs there was one system that caught our attention, the
Supermicro X14 2U 4-node GrandTwin Rear IO.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-x14-grandtwin.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/system/datasheet/sys-212gt-hnr&quot;&gt;Supermicro X14 2U 4-node GrandTwin Rear
IO&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Each node:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1x Intel Xeon 6 6740E 96C/96T, 205W&lt;/li&gt;
&lt;li&gt;16x16GB DDR5-6400&lt;/li&gt;
&lt;li&gt;1x Broadcom 57608 2x200GbE&lt;/li&gt;
&lt;li&gt;6x 2.5” Kioxia CM6-R, 7.68TB Gen4 NVMe SSD&lt;/li&gt;
&lt;li&gt;RAID1 2x 480GB NVMe (boot)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This system is utilized to provide high-bandwidth all-flash object storage for
the AI solution using IBM Storage Ceph 8.1.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-gaudi3.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/system/datasheet/sys-822ga-ngr3&quot;&gt;Supermicro Gaudi 3 AI Server
SYS-822GA-NGR3&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;2x Intel Xeon 6 6960P 72C/144T&lt;/li&gt;
&lt;li&gt;24x 64GB DDR5-6400&lt;/li&gt;
&lt;li&gt;8x Gaudi 3 HL-325L accelerators&lt;/li&gt;
&lt;li&gt;Up to 8x 2.5&amp;quot; Gen5 NVMe SSD&lt;/li&gt;
&lt;li&gt;Scale-up networking: 21x 200GbE Gaudi NICs&lt;/li&gt;
&lt;li&gt;2x Broadcom 57608 1x400GbE&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This system is utilized to run inference workloads with the combination of vLLM
and LMCache, leveraging Gaudi 3 accelerators from Intel.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-gpu-aplus.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/system/datasheet/as-8125gs-tnmr2&quot;&gt;Supermicro GPU A+ Server AS
-8125GS-TNMR2&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1x AMD EPYC 9654 96C/192T&lt;/li&gt;
&lt;li&gt;24x 96GB DDR5-4800&lt;/li&gt;
&lt;li&gt;8x AMD MI300X accelerators&lt;/li&gt;
&lt;li&gt;Up to 8x 2.5&amp;quot; Gen5 NVMe SSD&lt;/li&gt;
&lt;li&gt;Scale-up networking: 4x400GbE&lt;/li&gt;
&lt;li&gt;Storage and GPU scale-out networking: 4x NVIDIA MT28908 ConnectX-6 200GbE&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This system is utilized to run inference workloads with the combination of vLLM
and LMCache, leveraging MI300X accelerators from AMD.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-sw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/accessories/Networking/SSE-T7132SR.php&quot;&gt;SSE-T7132S - 400Gb Ethernet
Switch&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;32x QSFP-DD 400GbE, or 64x QSFP56 / 128x QSFP28 with breakout cables&lt;/li&gt;
&lt;li&gt;25.6Tb/s switching capacity&lt;/li&gt;
&lt;li&gt;SONiC OS&lt;/li&gt;
&lt;li&gt;RoCEv2/RDMA support with PFC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For simplicity we used a single fixed-port 400Gb switch for both GPU-to-GPU and
the storage fabric.&lt;/p&gt;
&lt;h2 id=&quot;host-configuration&quot;&gt;Host configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#host-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Performance profile set in BIOS&lt;/li&gt;
&lt;li&gt;Set the tuned profile to network-latency&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;tuned-adm profile network-latency
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;All hosts were configured with bonded NICs in 802.3ad (LACP) mode with xmit_hash_policy=layer3+4 (a configuration sketch follows below)&lt;/li&gt;
&lt;/ul&gt;
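&lt;p&gt;A sketch of that bonding setup with NetworkManager; the interface names are
illustrative and should be adapted to your NIC naming:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create the bond with LACP and layer3+4 transmit hashing
nmcli con add type bond ifname bond0 con-name bond0 &#92;
  bond.options &amp;quot;mode=802.3ad,xmit_hash_policy=layer3+4,miimon=100&amp;quot;

# Enslave the two 200GbE ports, then bring the bond up
nmcli con add type ethernet ifname ens1f0np0 master bond0
nmcli con add type ethernet ifname ens1f1np1 master bond0
nmcli con up bond0
&lt;/code&gt;&lt;/pre&gt;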
&lt;h2 id=&quot;ceph-configuration&quot;&gt;Ceph configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;osd-service&quot;&gt;OSD service &lt;a class=&quot;link-anchor&quot; href=&quot;#osd-service&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;---
service_type: osd
service_id: nvme
placement:
  hosts:
    - ceph-osd01
    - ceph-osd02
    - ceph-osd03
data_devices:
  paths:
    - /dev/disk/by-path/pci-0000:63:00.5-pci-10001:81:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:63:00.5-pci-10001:82:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:01:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:02:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:03:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:04:00.0-nvme-1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;pool-configuration&quot;&gt;Pool configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#pool-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We decided to pre-create metadata and data pools for RGW before initializing the
RGW service.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ceph osd pool set noautoscale
ceph osd pool create default.rgw.buckets.data 2048 2048 replicated
ceph osd pool create default.rgw.buckets.index 64 64 replicated
ceph osd pool create default.rgw.buckets.non-ec 64 64 replicated
ceph osd pool set default.rgw.buckets.data size 2
ceph osd pool set default.rgw.buckets.data min_size 1
ceph osd pool application enable default.rgw.buckets.data rgw
ceph osd pool application enable default.rgw.buckets.index rgw
ceph osd pool application enable default.rgw.buckets.non-ec rgw
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;rgw-service&quot;&gt;RGW service &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-service&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This RGW service configuration will create 4x RGW instances on each of the 4
hosts, with a concentrator bound to the host IP address at port 80.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
service_type: rgw
service_id: standard
service_name: rgw.standard
placement:
  count_per_host: 4
  label: rgw
networks:
  - 10.67.67.0/24
spec:
  rgw_exit_timeout_secs: 120
  rgw_frontend_port: 8080
  concentrator: haproxy
  concentrator_frontend_port: 80
  concentrator_monitor_port: 1967
  concentrator_monitor_user: admin
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;traffic-management&quot;&gt;Traffic management &lt;a class=&quot;link-anchor&quot; href=&quot;#traffic-management&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Like many applications, LMCache expects a single S3 endpoint. To maximize
bandwidth to the storage cluster we decided to leverage &lt;a href=&quot;https://ceph.io/en/news/blog/2025/consul-lb1/&quot;&gt;Hashicorp Consul and CoreDNS&lt;/a&gt;
to return multiple DNS records in response to queries for our chosen object
FQDN. As stated earlier, this works perfectly with AWS CRT libraries like those
utilized by LMCache’s native S3 connector.&lt;/p&gt;
&lt;h4 id=&quot;consul&quot;&gt;Consul &lt;a class=&quot;link-anchor&quot; href=&quot;#consul&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;/etc/consul.d/consul.hcl&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datacenter = &amp;quot;smci&amp;quot;
data_dir = &amp;quot;/opt/consul&amp;quot;
bind_addr = &amp;quot;172.19.65.41&amp;quot;
client_addr = &amp;quot;0.0.0.0&amp;quot;
retry_join = [
  &amp;quot;172.19.65.41&amp;quot;,
  &amp;quot;172.19.65.42&amp;quot;,
  &amp;quot;172.19.65.43&amp;quot;,
  &amp;quot;172.19.65.44&amp;quot;
]
server = true
bootstrap_expect = 3

# Register one &amp;quot;s3&amp;quot; service per local RGW frontend; service ids must be unique per agent
services = [
  {
    id   = &amp;quot;s3-8080&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8080
    check = {
      id       = &amp;quot;s3-8080-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8080&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  },
  {
    id   = &amp;quot;s3-8081&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8081
    check = {
      id       = &amp;quot;s3-8081-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8081&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  },
  {
    id   = &amp;quot;s3-8082&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8082
    check = {
      id       = &amp;quot;s3-8082-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8082&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  },
  {
    id   = &amp;quot;s3-8083&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8083
    check = {
      id       = &amp;quot;s3-8083-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8083&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;coredns&quot;&gt;CoreDNS &lt;a class=&quot;link-anchor&quot; href=&quot;#coredns&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;/etc/coredns/Corefile&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.:53 {
    log
    errors
    forward . 8.8.8.8
}

cephlab.com {
    file /etc/coredns/cephlab.com
    prometheus
    errors
    log
    debug
}

consul {
  forward . 172.19.65.41:8600 172.19.65.42:8600 172.19.65.43:8600 172.19.65.44:8600
  log
  errors
}

s3.cephlab.com {
    rewrite stop {
        name exact s3.cephlab.com s3.service.consul.
        answer name s3.service.consul. s3.cephlab.com.
    }
    rewrite stop {
        name regex (.*)&#92;.s3&#92;.cephlab&#92;.com s3.service.consul.
        answer auto
    }
    forward . 172.19.65.41:8600 172.19.65.42:8600 172.19.65.43:8600 172.19.65.44:8600
    log
    errors
    debug
}

example.hosts s3.ecmp.cephlab.com {
    hosts {
        10.67.67.67 s3.ecmp.cephlab.com
        10.67.67.67 nixl.s3.ecmp.cephlab.com
        fallthrough
    }
    whoami
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;testing-dns-balancing&quot;&gt;Testing DNS balancing &lt;a class=&quot;link-anchor&quot; href=&quot;#testing-dns-balancing&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;To validate that the Hashicorp Consul and CoreDNS based approach is functioning
properly, we can test DNS resolution of our object endpoint&#39;s FQDN.
Note that we’re seeing 4 records returned, which is exactly what we want.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[cephuser@ceph-osd01 ~]$ dig s3.cephlab.com

; &amp;lt;&amp;lt;&amp;gt;&amp;gt; DiG 9.16.23-RH &amp;lt;&amp;lt;&amp;gt;&amp;gt; s3.cephlab.com
;; global options: +cmd
;; Got answer:
;; -&amp;gt;&amp;gt;HEADER&amp;lt;&amp;lt;- opcode: QUERY, status: NOERROR, id: 12051
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;s3.cephlab.com.                        IN      A

;; ANSWER SECTION:
s3.cephlab.com.         0       IN      A       172.19.65.41
s3.cephlab.com.         0       IN      A       172.19.65.42
s3.cephlab.com.         0       IN      A       172.19.65.43
s3.cephlab.com.         0       IN      A       172.19.65.44

;; Query time: 1 msec
;; SERVER: 172.19.65.41#53(172.19.65.41)
;; WHEN: Tue Nov 04 12:33:03 PST 2025
;; MSG SIZE  rcvd: 163
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;baseline-performance&quot;&gt;Baseline performance &lt;a class=&quot;link-anchor&quot; href=&quot;#baseline-performance&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To establish the baseline performance of the storage cluster before introducing
vLLM and LMCache, we assessed the performance using
&lt;a href=&quot;https://github.com/breuner/elbencho&quot;&gt;elbencho&lt;/a&gt; to generate load
from the Gaudi3 GPU host and direct it towards the Ceph S3 endpoints. We used a
62MB block size to match the expected size of KV cache blocks being persisted by
LMCache. The results show that we’re able to multiplex connections across the
concentrator endpoints on each host and drive a considerable amount of S3
traffic from even a single host, topping out at nearly 60 GB/s.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/elbencho.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;vllm&quot;&gt;vLLM &lt;a class=&quot;link-anchor&quot; href=&quot;#vllm&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;At the time of our testing the vLLM production stack did not support our
end-to-end workflows, so we created customized vLLM container images that
incorporated an LMCache development release, including one that also pulled in the
latest &lt;a href=&quot;https://github.com/vllm-project/vllm-gaudi&quot;&gt;vllm-gaudi&lt;/a&gt; development work for our testing.&lt;/p&gt;
&lt;p&gt;AMD Container&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;vLLM:&lt;/li&gt;
&lt;li&gt;LMCache:&lt;/li&gt;
&lt;li&gt;NIXL:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Gaudi Container&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;vLLM:&lt;/li&gt;
&lt;li&gt;LMCache:&lt;/li&gt;
&lt;li&gt;NIXL:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below you will find the configuration files and command line arguments we used
to run vLLM and LMCache together.&lt;/p&gt;
&lt;p&gt;.aws/credentials&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[lmcache]
region = default
endpoint_url = http://s3.cephlab.com:80
aws_access_key_id = xxx
aws_secret_access_key = yyy
response_checksum_validation = when_required
preferred_transfer_client = crt
&lt;/code&gt;&lt;/pre&gt;
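&lt;p&gt;With the credentials profile in place, a quick way to confirm that the gateways are reachable and that the bucket LMCache writes into exists is to use the AWS CLI against the same endpoint (the bucket name is taken from the LMCache configuration below):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create the bucket used by LMCache (if it does not already exist) and list buckets
aws --profile lmcache --endpoint-url http://s3.cephlab.com s3 mb s3://lmcache
aws --profile lmcache --endpoint-url http://s3.cephlab.com s3 ls
&lt;/code&gt;&lt;/pre&gt;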
&lt;p&gt;lmcache-ceph.yaml&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chunk_size: 256
local_cpu: False
max_local_cpu_size: 100
remote_url: &amp;quot;s3://lmcache.s3.cephlab.com&amp;quot;
save_unfull_chunk: False
enable_async_loading: True
remote_serde: &amp;quot;naive&amp;quot;
blocking_timeout_secs: 100
extra_config:
  s3_max_io_concurrency: 1024
  s3_max_inflight_reqs: 1024
  s3_prefer_http2: False
  s3_region: &amp;quot;default&amp;quot;
  s3_enable_s3express: False
  save_chunk_meta: False
  s3_file_prefix: &amp;quot;test&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;lmcache-nixl-ceph.yaml&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chunk_size: 512
local_cpu: false
max_local_cpu_size: 50
remote_serde: &amp;quot;naive&amp;quot;
nixl_buffer_size: 1073741824
nixl_buffer_device: cpu
extra_config:
  enable_nixl_storage: true
  nixl_backend: OBJ
  nixl_pool_size: 512
  nixl_backend_params:
    endpoint_override: http://s3.cephlab.com
    access_key: CR98FOT054QZJ60NR7E3
    secret_key: 15CTFkiAdwPkkiSh4gOlQ5zF14KZ0uCnZloYVo3w
    scheme: http
    region: default
    req_checksum: required
    bucket: lmcache
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;lmcache-dram.yaml&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chunk_size: 256
local_cpu: True
max_local_cpu_size: 50
save_unfull_chunk: False
enable_async_loading: True
remote_serde: &amp;quot;naive&amp;quot;
blocking_timeout_secs: 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Starting vLLM&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export LMCACHE_CONFIG_FILE=&amp;quot;/root/lmcache-nixl-s3.yaml&amp;quot;
export LMCACHE_USE_EXPERIMENTAL=True
export PYTHONHASHSEED=67
export AWS_PROFILE=&#39;lmcache&#39;
vllm serve Qwen/Qwen3-32B  &#92;
       --gpu-memory-utilization 0.55 &#92;
       --rope-scaling &#39;{&amp;quot;rope_type&amp;quot;:&amp;quot;yarn&amp;quot;,&amp;quot;factor&amp;quot;:4.0,&amp;quot;original_max_position_embeddings&amp;quot;:32768}&#39; &#92;
       --max-model-len 131072 &#92;
       --kv-transfer-config &#39;{&amp;quot;kv_connector&amp;quot;:&amp;quot;LMCacheConnectorV1&amp;quot;,&amp;quot;kv_role&amp;quot;:&amp;quot;kv_both&amp;quot;,&amp;quot;kv_parallel_size&amp;quot;:&amp;quot;16&amp;quot;}&#39; &#92;
       --tensor-parallel-size 2
&lt;/code&gt;&lt;/pre&gt;
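&lt;p&gt;Once vLLM is up it exposes the usual OpenAI-compatible API on port 8000 (the port the benchmark below targets), so a quick smoke test of the endpoint before running any benchmarks can look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List the models served by this vLLM instance
curl -s http://localhost:8000/v1/models

# Issue a small completion request against the served model
curl -s http://localhost:8000/v1/completions &#92;
     -H &#39;Content-Type: application/json&#39; &#92;
     -d &#39;{&amp;quot;model&amp;quot;: &amp;quot;Qwen/Qwen3-32B&amp;quot;, &amp;quot;prompt&amp;quot;: &amp;quot;Hello&amp;quot;, &amp;quot;max_tokens&amp;quot;: 16}&#39;
&lt;/code&gt;&lt;/pre&gt;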
&lt;p&gt;For the Gaudi3 accelerator testing we set the following additional environmental
variables:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PT_HPU_GPU_MIGRATION=1
VLLM_USE_V1=1
VLLM_SKIP_WARMUP=True
VLLM_EXPONENTIAL_BUCKETING=False
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;benchmark&quot;&gt;Benchmark &lt;a class=&quot;link-anchor&quot; href=&quot;#benchmark&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We wanted to characterize the reduction in time-to-first-token for a 100% cache
hit rate from remote storage with Ceph across various context lengths, and chart
it relative to computational prefill. For this we selected the LMCache
&lt;a href=&quot;https://github.com/LMCache/LMCache/blob/dev/benchmarks/long_doc_qa/long_doc_qa.py&quot;&gt;long_doc_qa.py&lt;/a&gt; benchmark. We developed the following methodology for TTFT data collection:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start vLLM&lt;/li&gt;
&lt;li&gt;Run long_doc_qa.py and record TTFT for the warm-up round (computational
prefill result)&lt;/li&gt;
&lt;li&gt;Restart vLLM&lt;/li&gt;
&lt;li&gt;Run long_doc_qa.py and record TTFT for the warm-up round (KV cache hit from
remote storage result)&lt;/li&gt;
&lt;li&gt;Stop vLLM&lt;/li&gt;
&lt;li&gt;Remove cache blocks from remote storage&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By restarting vLLM in step 3 we ensure that the results are not skewed by KV
caching in GPU HBM or CPU memory, and by stopping vLLM and removing cache blocks
from remote storage we ensure that each subsequent context length is not
benefitting from remote storage KV caching from the previous context length.
With this methodology all KV caches are cold at the beginning of each test,
except for remote storage KV caching which we want to measure the benefit of in
step 4.&lt;/p&gt;
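&lt;p&gt;Step 6 can be as simple as emptying the bucket that LMCache persists its KV blocks into; a sketch, assuming the lmcache bucket and the credentials profile shown earlier:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Remove all persisted KV cache blocks between context-length sweeps
aws --profile lmcache --endpoint-url http://s3.cephlab.com s3 rm s3://lmcache --recursive
&lt;/code&gt;&lt;/pre&gt;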
&lt;p&gt;long_doc_qa.py example command line&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 ~/LMCache/benchmarks/long_doc_qa/long_doc_qa.py &#92;
      --model Qwen/Qwen3-32B &#92;
      --port 8000 &#92;
      --num-documents 1 &#92;
      --document-length ${len} &#92;
      --output-len 100 &#92;
      --repeat-count 1 &#92;
      --repeat-mode interleave &#92;
      --max-inflight-requests 1 &#92;
      --output results/ttft_${len}.out
&lt;/code&gt;&lt;/pre&gt;
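&lt;p&gt;The &lt;code&gt;${len}&lt;/code&gt; variable above comes from the sweep over context lengths. A minimal wrapper, with hypothetical document lengths that should be substituted with the context lengths under test:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical sweep values; substitute the context lengths being measured
for len in 8000 16000 32000 64000 128000; do
    python3 ~/LMCache/benchmarks/long_doc_qa/long_doc_qa.py &#92;
          --model Qwen/Qwen3-32B --port 8000 &#92;
          --num-documents 1 --document-length ${len} &#92;
          --output-len 100 --repeat-count 1 --repeat-mode interleave &#92;
          --max-inflight-requests 1 --output results/ttft_${len}.out
done
&lt;/code&gt;&lt;/pre&gt;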
&lt;h2 id=&quot;results&quot;&gt;Results &lt;a class=&quot;link-anchor&quot; href=&quot;#results&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;intel-gaudi-3-results&quot;&gt;Intel Gaudi 3 Results &lt;a class=&quot;link-anchor&quot; href=&quot;#intel-gaudi-3-results&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/gaudi3-tp2-sweep-qwen.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/gaudi3-tp2-sweep-llama.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/gaudi3-tp-charts.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;amd-mi300x-results&quot;&gt;AMD MI300X Results &lt;a class=&quot;link-anchor&quot; href=&quot;#amd-mi300x-results&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/amd-tp1-sweep-qwen.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/amd-tp-charts.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We measured a considerable reduction in TTFT with both Intel Gaudi3 and AMD MI300X
accelerators, with the largest measured speed-up being a 23x reduction. This testing
also illustrates how KV caching can reduce TTFT more than using tensor parallelism
to spread prefill across multiple GPUs in a system, and that combining these
techniques can deliver the lowest TTFT. It’s also worth pointing out that in
addition to reducing TTFT, prefix caching derives additional value by conserving
GPU cycles for decode, potentially reducing time-per-output-token (TPOT).&lt;/p&gt;
&lt;h2 id=&quot;what&#39;s-next%3F&quot;&gt;What&#39;s next? &lt;a class=&quot;link-anchor&quot; href=&quot;#what&#39;s-next%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We shared our results with the llm-d team at Red Hat and have started to work
with them to commodify KV caching by establishing KV caching with Ceph as a
&lt;a href=&quot;https://www.redhat.com/en/topics/ai/what-is-llm-d#what-are-well-lit-paths&quot;&gt;well-lit
path&lt;/a&gt;. We believe that our approach is perhaps the most accessible
because it uses standard object protocols like S3 and standard TCP/IP networking,
works with a variety of accelerators from different vendors, and because Ceph
object storage is ubiquitously deployed in OpenShift clusters through OpenShift Data
Foundation and IBM Fusion. Our next phase of testing will utilize llm-d, with
the GPU hosts serving as worker nodes, and will explore more sophisticated
scenarios like PD disaggregation and cache blending.&lt;/p&gt;
&lt;p&gt;Finally, we&#39;d like to thank Supermicro for providing the environment for these
testing efforts. If you have any questions about data or AI workloads for Ceph,
please &lt;a href=&quot;mailto:kbader@ibm.com&quot;&gt;reach out&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Benchmarking Performance with CBT: Running and Analysing a Performance Test. Part Three</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/" />
    <updated>2025-12-08T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/">&lt;p&gt;CBT Performance Benchmarking - Part 3. How do we run and analyse a performance test?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;&lt;a id=&quot;outline&quot;&gt;&lt;/a&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 3&lt;/strong&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#intro&quot;&gt;Introduction: Running a performance benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read&quot;&gt;How to read response time curves&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#values&quot;&gt;What values to read from a response curve?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;introduction&quot;&gt;&lt;a id=&quot;intro&quot;&gt;&lt;/a&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we have created our erasure coded (EC) cluster (from &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt;) and defined our YAML file and workloads (from &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt;), we can now start a CBT performance benchmark test.&lt;/p&gt;
&lt;p&gt;This part will cover:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Running a performance test&lt;/li&gt;
&lt;li&gt;Generating a performance report&lt;/li&gt;
&lt;li&gt;How to read response time curves&lt;/li&gt;
&lt;li&gt;Comparing performance benchmarks&lt;/li&gt;
&lt;li&gt;Running a performance test with an OSD down&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 1: Run the performance test&lt;/summary&gt;
&lt;p&gt;First, clone the &lt;a href=&quot;https://github.com/ceph/cbt&quot;&gt;CBT GitHub repository&lt;/a&gt; into a directory of your choice on the machine you are using and &lt;code&gt;cd&lt;/code&gt; into it.&lt;/p&gt;
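&lt;p&gt;For example (&lt;code&gt;/cbt&lt;/code&gt; is simply the checkout location used throughout this post):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/ceph/cbt.git /cbt
cd /cbt
&lt;/code&gt;&lt;/pre&gt;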
&lt;p&gt;This is an example of the command to run a CBT performance test:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;  python /cbt/cbt.py -a /tmp/cbt -c /example/ceph.conf /example/&amp;lt;yaml_file&amp;gt; 2&amp;gt;&amp;amp;1 | tee /tmp/cbt.out
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You specify the location of the CBT script (&lt;code&gt;cbt.py&lt;/code&gt;), an archive directory where your results will be generated (&lt;code&gt;/tmp/cbt&lt;/code&gt;), and the Ceph configuration file (&lt;code&gt;/example/ceph.conf&lt;/code&gt;) that allows CBT to connect to the cluster. Finally, we specify our (&lt;code&gt;yaml_file&lt;/code&gt;), which outlines what tests/workloads will be run.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 2: Generate a performance report&lt;/summary&gt;
&lt;p&gt;Once you have run the performance test by following &lt;strong&gt;Step 1&lt;/strong&gt;, your result files will be output at the location you specified in &lt;strong&gt;Step 1&lt;/strong&gt; after the archive argument (&lt;code&gt;-a&lt;/code&gt;). For me, the previous command referenced &lt;code&gt;/tmp/cbt&lt;/code&gt;, so my results are there.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can now copy these result files to a new directory if you wish. In this case I want them within &lt;code&gt;/perftests/my_test&lt;/code&gt;: I like to keep a directory of all my CBT test results, and because I delete &lt;code&gt;/tmp/cbt&lt;/code&gt; before each performance test, that is not a suitable place to keep them stored. So I would do this, for example:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cp -r /tmp/cbt/* /perftests/my_test
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Next, it is a case of generating the performance report, which can be done by the following command for myself in this example:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;PYTHONPATH=/cbt/ /cbt/tools/generate_performance_report.py --archive /perftests/my_test --output_directory /perftests/my_test_results --create_pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the command above, &lt;code&gt;PYTHONPATH&lt;/code&gt; points at the CBT checkout again, and we then invoke the script that generates the performance report (&lt;code&gt;generate_performance_report.py&lt;/code&gt;). I pass the directory that holds the results from the performance run (&lt;code&gt;/perftests/my_test&lt;/code&gt; in this case) as the archive, and a desired &lt;code&gt;output_directory&lt;/code&gt;, which is where the files for the performance report will be written.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Side note:&lt;/strong&gt; you do not need to have already created the specified &lt;code&gt;output_directory&lt;/code&gt; you see in the command above; it will be created automatically for you if need be. After these steps, you should now have the report files inside your new &lt;code&gt;output_directory&lt;/code&gt;, the &lt;code&gt;my_test_results&lt;/code&gt; folder in my case. You have now successfully generated your &lt;strong&gt;performance report&lt;/strong&gt;! I normally upload these result files to a GitHub repository to store and view the reports.&lt;/p&gt;
&lt;p&gt;The next section will go over the performance report generated, and how to understand your own one.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h2 id=&quot;how-to-read-response-time-curves&quot;&gt;&lt;a id=&quot;read&quot;&gt;&lt;/a&gt;How to read response time curves &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-read-response-time-curves&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that you have generated the performance report for your test, you may be looking at the PDF or MD file and be slightly confused by the graphs shown. This section will cover how we read the response time curves and reach conclusions based on the data points.&lt;/p&gt;
&lt;p&gt;So let&#39;s go back to our example CBT test run and the question we started with: &lt;strong&gt;&amp;quot;Does using the CLAY erasure code plugin give better performance than using the default Jerasure plugin?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I generated a performance report for a &lt;strong&gt;Jerasure&lt;/strong&gt; plugin EC pool, the results can be found &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/24th_Sep_Jerasure_4%2B2_results/performance_report_250924_094912.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I then did the same for the &lt;strong&gt;CLAY&lt;/strong&gt; plugin, &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/13th_Oct_Clay_4%2B2%2B5_results/performance_report_251013_094658.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Within the generated reports above you will see hockey stick curves plotted to show the performance of each configuration.&lt;/p&gt;
&lt;h3 id=&quot;so-how-do-we-read-the-curves-generated%3F&quot;&gt;So how do we read the curves generated? &lt;a class=&quot;link-anchor&quot; href=&quot;#so-how-do-we-read-the-curves-generated%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Here is an example of a curve generated within a performance report:
&lt;img src=&quot;images/example_curve.png&quot; alt=&quot;alt text&quot; title=&quot;How to read graphs&quot;&gt;
Each point on the curve corresponds to one of the &lt;code&gt;total_iodepth&lt;/code&gt; values specified for the test. We can find these by checking the YAML file we used for the test, and they are also listed within the performance report under the “Configuration yaml” section. For the above example they are:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;total_iodepth: [ 2, 4, 8, 12, 16, 24, 32, 64, 96, 128, 192, 288, 384 ] 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The vertical red lines (error bars) show the amount of standard deviation/variance in the performance for that specific point on the curve. If the standard deviations are small, it shows that performance is stable with that workload. As the response curve starts to bend upwards, performance becomes more variable and the standard deviation increases.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For an FIO workload, CBT will start 1 instance of FIO per volume.&lt;/li&gt;
&lt;li&gt;It&#39;s also worth noting that the graphs produced in the reports do not include results from the &amp;quot;ramp&amp;quot; period.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The post-processing tools sum the IOPS across all the volumes to generate a total IOPS figure and calculate an average latency over all the volumes. This IOPS vs latency pair is then plotted as the point on the response curve for that specific iodepth.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;what-values-to-read-from-a-response-curve%3F&quot;&gt;&lt;a id=&quot;values&quot;&gt;&lt;/a&gt;What values to read from a response curve? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-values-to-read-from-a-response-curve%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;If you know how much I/O your application is generating then you can use the response curve to work out what latency you should expect&lt;/li&gt;
&lt;li&gt;If you want to see the maximum amount of I/O that the storage controller can process, look for the rightmost point on the curve and find the value on the X axis.&lt;/li&gt;
&lt;li&gt;If you have a latency requirement such as all I/O must complete in under 2ms then you can find out the maximum I/Os the storage controller can do by finding the point on the curve at this latency.&lt;/li&gt;
&lt;li&gt;Most of the time you don&#39;t know exactly how much I/O an application is going to generate, and want to ensure that if there are any peaks or bursts in the amount of I/O that this doesn&#39;t cause a big change in latency. Where the response curve is flat there will be little change in latency if the amount of I/O varies, where the response curve is bending upwards a fairly small variation in amount of I/O can have a big impact on latency. Choosing a point on the response curve just before it starts increasing too rapidly gives a good indication of the maximum amount of I/O you can do with stable performance.&lt;/li&gt;
&lt;li&gt;Most users do not want to operate above around 70% of maximum throughput, as staying below this provides some headroom for expansion and allows sudden bursts in a workload to be absorbed without a large latency penalty.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As mentioned in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; of the blog, the perfect response curve would be a flat horizontal line showing constant latency as the quantity of I/O increases until we reach the saturation point where the system can handle no more I/O. This is because it highlights that performance is consistent with less variance.&lt;/p&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 3: Generating a comparison report&lt;/summary&gt;
&lt;p&gt;With CBT, as well as performance reports, we can also generate &lt;strong&gt;comparison reports&lt;/strong&gt; quickly. Now that I have run the tests for &lt;strong&gt;CLAY&lt;/strong&gt; and &lt;strong&gt;Jerasure&lt;/strong&gt;, we can generate a comparison report between them. I will use the following command to do so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;PYTHONPATH=/cbt/ /cbt/tools/generate_comparison_performance_report.py --baseline /perftests/jerasure_test/ --archives /perftests/clay_test/ --output_directory /perftests/clay_vs_jerasure_comparison --create_pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the above command we have to specify what our baseline is: we use the &lt;strong&gt;Jerasure&lt;/strong&gt; test folder as the &lt;strong&gt;baseline curve&lt;/strong&gt;, as shown above. Our &lt;strong&gt;archive curve&lt;/strong&gt; will be our &lt;strong&gt;CLAY&lt;/strong&gt; test folder. It is important that in the above command you pass the &lt;strong&gt;test&lt;/strong&gt; folders for Jerasure and CLAY, &lt;strong&gt;NOT&lt;/strong&gt; the &lt;strong&gt;results&lt;/strong&gt; folders that were generated in the previous steps. The above command will generate a comparison report in our specified &lt;code&gt;output_directory&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You have now successfully generated your &lt;strong&gt;comparison report&lt;/strong&gt;! Mine can be found &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/Jerasure_Vs_Clay_comparison/comparitive_performance_report_251015_142011.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;basic-analysis-of-the-comparison-report%3A&quot;&gt;Basic analysis of the comparison report: &lt;a class=&quot;link-anchor&quot; href=&quot;#basic-analysis-of-the-comparison-report%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let&#39;s first give a bit of background on our two erasure coding plugins: &lt;strong&gt;Jerasure&lt;/strong&gt; is a generic Reed-Solomon erasure coding library; it is matrix-based and not CPU-optimised, and it is fairly balanced between read and write. &lt;strong&gt;CLAY&lt;/strong&gt; is designed for faster recovery at the cost of a more complicated write path. So we are expecting to see potentially &lt;strong&gt;better&lt;/strong&gt; performance from CLAY when it comes to &lt;strong&gt;smaller&lt;/strong&gt; IO sizes, but as the writes get &lt;strong&gt;larger&lt;/strong&gt; we may see a decline in performance from CLAY, leading to better Jerasure results. Furthermore, in terms of reads we expect fairly similar results across the board as the two are implemented very similarly; the main difference is when it comes to the writes.&lt;/p&gt;
&lt;p&gt;So let&#39;s now take a look at our comparison report, starting with the smaller workloads. First, &lt;strong&gt;4K Random Reads&lt;/strong&gt;; this is the corresponding graph:
&lt;img src=&quot;images/4k_rand_read.png&quot; alt=&quot;alt text&quot; title=&quot;4k random read curve&quot;&gt;
As shown by the diagram, the orange curve is our CLAY EC pool, and the blue curve is our Jerasure EC pool. We can see that for 4K random reads there is very little difference in performance, as we expected. Both curves have almost identical latencies and IOPS.&lt;/p&gt;
&lt;p&gt;We can also take a look at the &lt;strong&gt;4K Random Writes&lt;/strong&gt;:
&lt;img src=&quot;images/4k_rand_write.png&quot; alt=&quot;alt text&quot; title=&quot;4k random write curve&quot;&gt;
The performance is similar until we get to the saturation point around &lt;strong&gt;14,000&lt;/strong&gt; IOPS, where we can see latency skyrocket for both Jerasure and CLAY. The IOPS for &lt;strong&gt;Jerasure&lt;/strong&gt; are marginally better than CLAY at this point, but nothing substantial.&lt;/p&gt;
&lt;p&gt;So overall, we can see at small workloads there is very similar performance between &lt;strong&gt;Jerasure&lt;/strong&gt; and &lt;strong&gt;CLAY&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Let&#39;s now move on to larger workloads, starting with &lt;strong&gt;1024K Sequential Read&lt;/strong&gt;:
&lt;img src=&quot;images/1024k_seq_read.png&quot; alt=&quot;alt text&quot; title=&quot;1024k Sequential Read curve&quot;&gt;
Once again the two curves barely differ and they follow very similar paths, which was expected. This is because for a normal read, Ceph only needs to fetch data chunks (not parity chunks). Both Jerasure and CLAY are practically just returning the stored object; there is no real difference unless a failure occurs.&lt;/p&gt;
&lt;p&gt;Now let&#39;s look at the &lt;strong&gt;1024K Sequential Write&lt;/strong&gt;:
&lt;img src=&quot;images/1024k_seq_write.png&quot; alt=&quot;alt text&quot; title=&quot;1024k Sequential Write curve&quot;&gt;
When we look at the writes we see that &lt;strong&gt;CLAY&lt;/strong&gt; has 20-60% higher latency, with throughput dropping compared to &lt;strong&gt;Jerasure&lt;/strong&gt;. This is likely due to extra CPU and network demands in CLAY. Larger writes mean bigger encoding matrices/layers, and CLAY has more complexity per write than Jerasure, likely leading to the higher latency shown.&lt;/p&gt;
&lt;p&gt;Our sequential write benchmarks show that Jerasure delivers more consistent write performance across all the block sizes, while CLAY is more volatile, performing better at some smaller sizes but much worse at large sequential writes. This reflects CLAY’s design priorities: it is optimised for reduced recovery bandwidth rather than raw write performance.&lt;/p&gt;
&lt;p&gt;This means that if your I/O workload is mainly large sequential reads, for example a &lt;strong&gt;data lake&lt;/strong&gt; for AI training, then switching to CLAY isn&#39;t going to affect performance. However, if your I/O workload is mainly heavy sequential writes, for example &lt;strong&gt;storage archives or backups&lt;/strong&gt;, then switching to CLAY will have a substantial negative performance impact, as shown by the diagrams.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 4: Running a test with OSD down&lt;/summary&gt;
&lt;p&gt;So far we have compared a CLAY and a Jerasure EC pool with one another. The results supported our hypothesis that Jerasure would likely perform better for normal I/O, because CLAY uses more complex computations in order to make data recovery cheaper. Now we will do an additional run and deliberately kill an OSD prior to running the CBT test, to simulate a real-world failure and to see how the performance of the two plugins differs when it comes to OSD recovery.&lt;/p&gt;
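&lt;p&gt;How you take an OSD down depends on how the cluster is deployed. As a rough sketch, on a cluster with systemd-managed OSD daemons (not containerised), something like the following stops one OSD and keeps the cluster degraded for the duration of the test rather than backfilling:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Optionally prevent the down OSD from being marked out and backfilled during the test
ceph osd set noout

# Stop one OSD daemon (osd.0 here is just an example id)
sudo systemctl stop ceph-osd@0

# Confirm the OSD is reported down before starting the CBT run
ceph osd tree | grep down
&lt;/code&gt;&lt;/pre&gt;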
&lt;p&gt;The following comparison report shows CLAY and Jerasure curves where both of the plugins have 1 OSD down. The report can be found &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/Jerasure_Vs_Clay_down_comparison/comparitive_performance_report_251015_154505.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We will now take a look at &lt;strong&gt;1024K Sequential Read&lt;/strong&gt; from the above comparison report:
&lt;img src=&quot;images/down_1024_seq_read.png&quot; alt=&quot;alt text&quot; title=&quot;1024k sequential read&quot;&gt;
We expect CLAY to have better performance here due to its supposedly more efficient data recovery. However, this is not the case, as shown by the diagram above.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h2 id=&quot;summary&quot;&gt;&lt;a id=&quot;summary&quot;&gt;&lt;/a&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Within this part we have used CBT to successfully compare Jerasure and CLAY for a variety of different workloads. We have generated results that are repeatable and show that for both good-path I/O and I/O when there is an OSD down (hence data needs to be reconstructed using erasure coding), there is no benefit to using CLAY. In fact, there are extra overheads which mean that performance may be worse when using CLAY.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a id=&quot;conclusion&quot;&gt;&lt;/a&gt;Conclusion &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In conclusion, this blog has demonstrated how you can take a CBT performance benchmark run from start to finish, generating performance reports along the way and enabling analysis and comparison of performance. We used &lt;strong&gt;CLAY&lt;/strong&gt; and &lt;strong&gt;Jerasure&lt;/strong&gt; as an example of how to easily do a performance benchmark, but sometimes the results can be unexpected and raise more questions than they answer. This can lead to further experiments to deep-dive into why certain results occurred, and this is what I&#39;ll be doing in &lt;strong&gt;Part 4&lt;/strong&gt; of the blog, which will be coming in the near future. &lt;strong&gt;Part 4&lt;/strong&gt; will provide more detailed analysis and an IO breakdown for CLAY and Jerasure to give more clarity on why CLAY performance was worse!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;#outline&quot;&gt;Links to previous parts of the blog series&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Benchmarking Performance with CBT: Defining YAML Contents. Part Two</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/" />
    <updated>2025-12-04T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/">&lt;p&gt;CBT Performance Benchmarking - Part 2. What is a YAML file and how do we use them within CBT?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 2&lt;/strong&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#intro&quot;&gt;Introduction: What goes into the YAML file?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#key&quot;&gt;Key sections of the YAML file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#express&quot;&gt;Expressing queue depth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#diff&quot;&gt;Why do we have lots of different IO values in the yaml?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;introduction%3A-what-goes-into-the-yaml-file%3F&quot;&gt;&lt;a id=&quot;intro&quot;&gt;&lt;/a&gt;Introduction: What goes into the YAML file? &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-what-goes-into-the-yaml-file%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Once you have finished &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; you should have an erasure coded Ceph cluster set up, and you&#39;re nearly ready to run a CBT test on it! However, before we can do that, we need to understand what &lt;strong&gt;YAML contents&lt;/strong&gt; we want.&lt;/p&gt;
&lt;p&gt;The YAML file defines what tests we will run on the cluster.&lt;/p&gt;
&lt;p&gt;We could briefly describe the YAML file as having 3 main sections to it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;cluster&lt;/code&gt; section: Where the YAML describes how CBT communicates with the cluster, e.g. user ID, clients, OSDs, Ceph binary paths, etc.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;monitoring_profiles&lt;/code&gt; section: Where the YAML describes the monitoring tools used (collectl in our case) to collect statistics.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;benchmarks&lt;/code&gt; section: Where the benchmarking technique is specified (librbdfio in our case), and also where the workloads are placed.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id=&quot;key-sections-of-the-yaml-file%3A&quot;&gt;&lt;a id=&quot;key&quot;&gt;&lt;/a&gt;Key sections of the YAML file: &lt;a class=&quot;link-anchor&quot; href=&quot;#key-sections-of-the-yaml-file%3A&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;details&gt;
&lt;summary&gt;Cluster&lt;/summary&gt; 
&lt;p&gt;Here you will be describing your Ceph cluster configuration.&lt;/p&gt;
&lt;p&gt;The reason the &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;clients&lt;/code&gt;, &lt;code&gt;osds&lt;/code&gt;, &lt;code&gt;mons&lt;/code&gt;, etc. fields are required is that CBT uses a parallel distributed shell (&lt;strong&gt;pdsh&lt;/strong&gt;) with SSH to log in to the various entities of the cluster that have been defined in the cluster section. This enables &amp;quot;ceph&amp;quot; commands and also the ability to start up the benchmark tool (such as &lt;strong&gt;FIO&lt;/strong&gt;) on the client endpoints (which are defined in the &amp;quot;&lt;strong&gt;clients&lt;/strong&gt;&amp;quot; field).&lt;/p&gt;
&lt;p&gt;A typical use case of Ceph is that there is a &lt;strong&gt;separately attached&lt;/strong&gt; host server dedicated for reading and writing data to the storage. Therefore it is possible to run CBT on a completely separate server from the cluster itself, and the performance data can be collected on the attached server. So the separately attached server is orchestrating the starting and stopping of the benchmark tools on the Ceph cluster.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important side note:&lt;/strong&gt; A requirement of CBT is that passwordless SSH has to be &lt;code&gt;enabled&lt;/code&gt; from the server running CBT to the Ceph nodes defined in the &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;clients&lt;/code&gt; and &lt;code&gt;osds&lt;/code&gt; fields.&lt;/p&gt;
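&lt;p&gt;If passwordless SSH is not already configured, a minimal way to set it up from the server running CBT is shown below (the user and host names are the same placeholders used in the example that follows):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Generate a key on the server running CBT (skip if one already exists)
ssh-keygen -t ed25519 -N &#39;&#39; -f ~/.ssh/id_ed25519

# Copy the public key to every node listed under head, clients and osds
ssh-copy-id exampleUser@exampleHostAddress

# Verify that a non-interactive login works
ssh exampleUser@exampleHostAddress true
&lt;/code&gt;&lt;/pre&gt;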
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;cluster:
  user: &#39;exampleUser&#39; # the SSH user ID that is going to be used for accessing the ceph cluster
  head: &amp;quot;exampleHostAddress&amp;quot; # node where general ceph commands are run
  clients: [&amp;quot;exampleHostAddress&amp;quot;] # nodes that will run benchmarks or other client tools
  osds: [&amp;quot;exampleHostAddress&amp;quot;] # nodes where OSDs will live
  mons: # nodes where mons will live
    exampleHostAddress:
      a: &amp;quot;exampleIPAddress&amp;quot;
  mgrs:
    exampleHostAddress:
      a: ~
  osds_per_node: 8
  conf_file: &#39;/etc/ceph/ceph.conf&#39;
  clusterid: &amp;quot;ceph&amp;quot;
  tmp_dir: &amp;quot;/tmp/cbt&amp;quot;
  ceph-osd_cmd: &amp;quot;/usr/bin/ceph-osd&amp;quot;
  ceph-mon_cmd: &amp;quot;/usr/bin/ceph-mon&amp;quot;
  ceph-run_cmd: &amp;quot;/usr/bin/ceph-run&amp;quot;
  rados_cmd: &amp;quot;/usr/bin/rados&amp;quot;
  ceph_cmd: &amp;quot;/usr/bin/ceph&amp;quot;
  rbd_cmd: &amp;quot;/usr/bin/rbd&amp;quot;
  ceph-mgr_cmd: &amp;quot;/usr/bin/ceph-mgr&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Monitoring Profiles&lt;/summary&gt; 
&lt;p&gt;In our example, we will be using &lt;strong&gt;collectl&lt;/strong&gt; to collect statistics.&lt;/p&gt;
&lt;p&gt;In more detail, the benchmark IO exerciser (&lt;strong&gt;FIO&lt;/strong&gt;) starts up. When the &lt;code&gt;ramp&lt;/code&gt; period expires, the monitoring tool (&lt;strong&gt;collectl&lt;/strong&gt;) is started to begin statistics collection, so that no data is collected during the warmup/ramp period. Once the &lt;code&gt;time&lt;/code&gt; period of the IO exerciser has expired, CBT stops the monitoring tool.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;monitoring_profiles:
  collectl:
     args: &#39;-c 18 -sCD -i 10 -P -oz -F0 --rawtoo --sep &amp;quot;;&amp;quot; -f {collectl_dir}&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Benchmark module&lt;/summary&gt; 
&lt;p&gt;In our example, we will be using &lt;strong&gt;librbdfio&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;benchmarks:
  librbdfio:
    rbdname: &amp;quot;test-image&amp;quot;
    poolname: &amp;quot;rbd_replicated&amp;quot;
    cmd_path: &#39;/usr/local/bin/fio&#39;
    &amp;lt;insert details here&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now within the &lt;strong&gt;librbdfio&lt;/strong&gt; section you will have to specify some details, including the &lt;strong&gt;volume name&lt;/strong&gt; and &lt;strong&gt;pool name&lt;/strong&gt; you created in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; in Step 5. CBT will append &lt;code&gt;&#39;hostname -f&#39;&lt;/code&gt; followed by a volume ID &lt;code&gt;&#39;-X&#39;&lt;/code&gt; onto the end of your &lt;code&gt;rbdname&lt;/code&gt; stated above, where &lt;code&gt;X&lt;/code&gt; is the volume ID, running from 0 up to the number specified in your &lt;code&gt;volumes_per_client&lt;/code&gt; field (see the &lt;code&gt;Number of volumes&lt;/code&gt; section).&lt;/p&gt;
&lt;p&gt;For example:
&lt;code&gt;rbdname=&amp;quot;test-image&amp;quot;&lt;/code&gt; will use:
&lt;code&gt;--rbdname=test-image-mycephhost1.com-1&lt;/code&gt;, if:
&lt;code&gt;hostname -f&lt;/code&gt; returned: &lt;a href=&quot;http://mycephhost1.com&quot;&gt;mycephhost1.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&#39;s important to have the &lt;code&gt;rbdname&lt;/code&gt; reflect your &lt;strong&gt;volume name&lt;/strong&gt; and the &lt;code&gt;poolname&lt;/code&gt; to reflect your &lt;strong&gt;pool name&lt;/strong&gt; that you used to create the volume. So the example YAML above, follows on from what we did in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt;, here:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;rbd create --pool rbd_replicated --data-pool rbd_erasure --size 10G test-image
&lt;/code&gt;&lt;/pre&gt;
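&lt;p&gt;If you set &lt;code&gt;use_existing_volumes: True&lt;/code&gt; (as in the full example YAML later in this post), the volumes presumably need to already exist under the naming scheme described above. A hypothetical sketch of pre-creating 8 such volumes, reusing the pools from &lt;strong&gt;Part 1&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Hypothetical helper: pre-create volumes matching the &amp;lt;rbdname&amp;gt;-&amp;lt;hostname -f&amp;gt;-&amp;lt;id&amp;gt; naming
host=$(hostname -f)
for i in $(seq 0 7); do
    rbd create --pool rbd_replicated --data-pool rbd_erasure --size 10G &amp;quot;test-image-${host}-${i}&amp;quot;
done
&lt;/code&gt;&lt;/pre&gt;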
&lt;p&gt;Also, the &lt;code&gt;cmd_path&lt;/code&gt; attribute shown above is important; this has to be the path where FIO is located on the client driving the IO.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h3 id=&quot;other-important-sections-of-the-yaml-file%3A&quot;&gt;Other important sections of the YAML file: &lt;a class=&quot;link-anchor&quot; href=&quot;#other-important-sections-of-the-yaml-file%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;details&gt;
&lt;summary&gt;Length of the benchmark&lt;/summary&gt; 
&lt;p&gt;We configure a &lt;strong&gt;ramp&lt;/strong&gt; and a &lt;strong&gt;time&lt;/strong&gt; for each test:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ramp&lt;/strong&gt; → warmup period where no data is collected.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time&lt;/strong&gt; → duration for which each test will run and collect results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;ramp&lt;/code&gt; time ensures that the I/O test gets into a steady state before the I/O measurement starts. It is quite common for &lt;strong&gt;write&lt;/strong&gt; caches to give unrealistically high performance at the start of the test while the cache fills up, and for &lt;strong&gt;read&lt;/strong&gt; caches to give slightly lower performance at the start of the test while they are being populated. Caches may be implemented in the drives or in the software.&lt;/p&gt;
&lt;p&gt;A very short &lt;code&gt;duration&lt;/code&gt; test will get performance measurements quicker but might not reflect the performance you will see in real use. Reasons for this include background processes that periodically perform work to clean up and issues such as fragmentation that typically become worse the longer the test is run for.
If doing a performance run multiple times gives different results then it is possible that the test duration is too short.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It&#39;s important to note that the specified amount of time and ramp within librbdfio will apply to all workloads elsewhere specified in the YAML.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;However&lt;/strong&gt;, these can be overridden by specifying a time or ramp within a specific workload. You will see an example of this within the precondition section, where time is overridden to 600 (10 minutes).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    time: 90 #in seconds
    ramp: 30 #in seconds
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Volume size&lt;/summary&gt;
&lt;p&gt;Storage systems may give different performance depending on how full they are: where there are fixed-size caches, the cache hit ratio will be higher when testing a smaller quantity of storage, and dealing with fragmentation and garbage collection takes more time when there is less free capacity.
Ideally, configure the performance test to use over 50% of the physical storage to get measurements representative of real-world use. We went over how to calculate the RBD volume size in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt;, so it&#39;s important that your calculation there matches the &lt;code&gt;vol_size&lt;/code&gt; attribute within your YAML file.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ideally, this should match the volume size created in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; when setting up the EC profile.&lt;/li&gt;
&lt;li&gt;If this value is lower than the RBD image size, then only that amount of data specified will be written.&lt;/li&gt;
&lt;li&gt;If the value is greater, then only the amount of data equivalent to the RBD image size will be written.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    vol_size: 52500 #in megabytes
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Number of volumes&lt;/summary&gt;
&lt;p&gt;This is the same number of volumes you defined in &lt;strong&gt;Part 1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    volumes_per_client: [8]
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Prefill &amp; Precondition &lt;/summary&gt; 
&lt;p&gt;These are discussed more in depth in &lt;strong&gt;part 1&lt;/strong&gt; so please refer to that section if you need a recap.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prefill&lt;/strong&gt; → filling all volumes with sequential writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Precondition&lt;/strong&gt; → adding random writes to simulate real-world workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    prefill:
      blocksize: &#39;64k&#39;
      numjobs: 1

    workloads:
      precondition:
        jobname: &#39;precond1rw&#39;
        mode: &#39;randwrite&#39;
        time: 600
        op_size: 65536
        numjobs: [ 1 ]
        total_iodepth: [ 16 ]
        monitor: False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above issues random 64K writes at a total_iodepth of 16 (across all volumes), so with an 8-volume configuration each volume will be using a queue depth of 2.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Note: The time here is overriding the time specified in the librbdfio (global) section of the YAML. Not specifying a time will use the default value specified in the outer (librbdfio) section.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;  
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Workloads&lt;/summary&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;librbdfio:
  workloads:
    Seq32kwrite:
      jobname: &#39;seqwrite&#39;
      mode: &#39;write&#39;
      op_size: 32768
      numjobs: [ 1 ]
      total_iodepth: [ 2, 4, 8, 16, 32, 64, 128, 256, 512, 768 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above is an example of a 32K sequential write workload where we configure different levels of total_iodepth. The test starts with a total_iodepth of 2, with a ramp of 30 seconds and then 90 seconds of IO with stats collected; the same then occurs for total_iodepth 4, and so on through the increasing total_iodepth values. Each of these total_iodepth points is one of the points represented on the curve diagram.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;p&gt;An example of workloads from a YAML file:
&lt;img src=&quot;images/yaml-contents.png&quot; alt=&quot;alt text&quot; title=&quot;Example of YAML workload&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;expressing-queue-depth&quot;&gt;&lt;a id=&quot;express&quot;&gt;&lt;/a&gt;Expressing queue depth &lt;a class=&quot;link-anchor&quot; href=&quot;#expressing-queue-depth&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Firstly, what is &lt;strong&gt;queue depth&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;Queue depth can be defined as the number of concurrent commands that are outstanding.&lt;/p&gt;
&lt;p&gt;There are two ways of expressing the queue depth per volume in CBT:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Using the &lt;code&gt;iodepth&lt;/code&gt; attribute&lt;/li&gt;
&lt;li&gt;Using the &lt;code&gt;total_iodepth&lt;/code&gt; attribute&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;iodepth&lt;/code&gt; &lt;code&gt;n&lt;/code&gt; will use the same queue depth of &lt;code&gt;n&lt;/code&gt; for each &lt;strong&gt;volume&lt;/strong&gt;. For example, if the number of configured &lt;strong&gt;volumes&lt;/strong&gt; is 8 then a setting of &lt;code&gt;iodepth&lt;/code&gt; 2 will generate a &lt;code&gt;total_iodepth&lt;/code&gt; of 16 with each &lt;strong&gt;volume&lt;/strong&gt; having a queue of 2 I/Os. As the queue depth is increased the total amount of queued I/O will increase in &lt;strong&gt;multiples&lt;/strong&gt; of the number of &lt;strong&gt;volumes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;total_iodepth&lt;/code&gt; &lt;code&gt;n&lt;/code&gt; will try to spread &lt;code&gt;n&lt;/code&gt; I/O requests across the set of volumes. For example, if &lt;code&gt;total_iodepth&lt;/code&gt; is 16 and the number of configured &lt;strong&gt;volumes&lt;/strong&gt; is 8, then the queue depth per &lt;strong&gt;volume&lt;/strong&gt; will be 2 (16/8). &lt;code&gt;Total_iodepth&lt;/code&gt; does not need to be exactly divisible by the number of volumes; in these cases some volumes will have a queue depth 1 higher than others (for example, a total_iodepth of 20 across 8 volumes gives four volumes a queue depth of 3 and the other four a queue depth of 2).&lt;/p&gt;
&lt;h3 id=&quot;the-main-drawback-of-iodepth-over-total_iodepth%3A&quot;&gt;The main drawback of iodepth over total_iodepth: &lt;a class=&quot;link-anchor&quot; href=&quot;#the-main-drawback-of-iodepth-over-total_iodepth%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Example: If you have a large number of volumes eg. 32. If you specified:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  iodepth: [1, 2, 4, 8]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All 32 volumes will be exercised, and therefore this is equivalent to writing a YAML that does:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;total_iodepth: [32, 64, 128, 256]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, your control over the queue depth scales according to the number of volumes you have configured in the YAML.&lt;/p&gt;
&lt;p&gt;Now with &lt;code&gt;total_iodepth&lt;/code&gt;, you can go finer grain than this, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;total_iodepth: [1, 2, 4, 8, 16, 32]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;CBT will only use a subset of the volumes if the &lt;code&gt;total_iodepth&lt;/code&gt; configured is less than the number of volumes in the YAML. Where the number of volumes configured does not divide into &lt;code&gt;total_iodepth&lt;/code&gt; evenly, some volumes will have a different &lt;code&gt;queue depth&lt;/code&gt; than others, but CBT will try to start FIO with an iodepth that is as even as possible over the volumes.&lt;/p&gt;
&lt;p&gt;A good way to look at the relationship between these terms if you&#39;re struggling, is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;total_iodepth = volumes x queue depth&lt;/code&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;why-do-we-have-lots-of-different-io-values-in-the-yaml%3F&quot;&gt;&lt;a id=&quot;diff&quot;&gt;&lt;/a&gt;Why do we have lots of different IO values in the yaml? &lt;a class=&quot;link-anchor&quot; href=&quot;#why-do-we-have-lots-of-different-io-values-in-the-yaml%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We have lots of different IO levels for our writes and reads within the YAML because we want to get test results for all the different scenarios that happen in the real world, and also to test the different bottlenecks that could be holding back the Ceph cluster.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In terms of bottlenecks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Short IOs&lt;/strong&gt; will usually have a CPU bottleneck (this is why the x axis is IOPs for small IOs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Larger IOs&lt;/strong&gt; are more likely to suffer from network and device storage bottlenecks (this is why the x axis turns to Bandwidth for the larger IO sizes)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In terms of real world scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A database, or more generally &lt;strong&gt;OLTP&lt;/strong&gt; (Online Transaction Processing) running on block or file storage generally issues small &lt;strong&gt;random read&lt;/strong&gt; and &lt;strong&gt;write&lt;/strong&gt; I/Os. Often there is a higher percentage of read I/Os to write I/Os so this might be represented by a 70% read, 30% overwrite 4K I/O workload.&lt;/li&gt;
&lt;li&gt;An application creating a backup is likely to make larger &lt;strong&gt;read&lt;/strong&gt; and &lt;strong&gt;write&lt;/strong&gt; I/Os and these are likely to be fairly sequential. If the backup is being written to other storage then the I/O workload will be 100% sequential reads, if the backup is being read from elsewhere and written to the storage the I/O workload will be 100% sequential writes.&lt;/li&gt;
&lt;li&gt;A traditional S3 object store contains large objects that are &lt;strong&gt;read&lt;/strong&gt; and &lt;strong&gt;written sequentially&lt;/strong&gt;. S3 objects are not overwritten so the I/O workload would be a mixture of large sequential reads and writes. While the S3 object may be GB in size, RGW will typically split the S3 object into 4MB chunks.&lt;/li&gt;
&lt;li&gt;S3 object stores can be used to store small objects as well, and some applications store indexes and tables within objects and make &lt;strong&gt;short random&lt;/strong&gt; accesses to data within the object. These applications may generate I/O workloads where the reads are more similar to OLTP workloads.&lt;/li&gt;
&lt;li&gt;A storage cluster is likely to be used by more than one application, each with its own I/O workload. The I/O workload to the cluster can consequently become quite complicated.
Measuring the performance for I/O workloads with just one type of I/O is a good way of characterising the performance. This data can then be used to predict the performance of more complex I/O workloads with a mixture of I/O types in different ratios by calculating a harmonic mean.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Here is an example of a full YAML file, containing the components mentioned above:&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Example YAML file&lt;/summary&gt; 
&lt;p&gt;Here is an example of a YAML file. You can have a lot more workloads than this, of course; I have just included a few for simplicity.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;cluster:

  user: #specify user here 
  head: #specify head here
  clients: #specify clients here
  osds: #specify OSDs here
  mons:
    #specify mons here
  mgrs:
    #specify mgrs here
  osds_per_node: 8
  fs: &#39;xfs&#39;
  mkfs_opts: &#39;-f -i size=2048&#39;
  mount_opts: &#39;-o inode64,noatime,logbsize=256k&#39;
  conf_file: &#39;/cbt/ceph.conf.4x1x1.fs&#39;
  iterations: 1
  use_existing: True
  clusterid: &amp;quot;ceph&amp;quot;
  tmp_dir: &amp;quot;/tmp/cbt&amp;quot;
  ceph-osd_cmd: &amp;quot;/usr/bin/ceph-osd&amp;quot;
  ceph-mon_cmd: &amp;quot;/usr/bin/ceph-mon&amp;quot;
  ceph-run_cmd: &amp;quot;/usr/bin/ceph-run&amp;quot;
  rados_cmd: &amp;quot;/usr/bin/rados&amp;quot;
  ceph_cmd: &amp;quot;/usr/bin/ceph&amp;quot;
  rbd_cmd: &amp;quot;/usr/bin/rbd&amp;quot;
  ceph-mgr_cmd: &amp;quot;/usr/bin/ceph-mgr&amp;quot;
  pdsh_ssh_args: &amp;quot;-a -x -l%u %h&amp;quot;

monitoring_profiles:
  collectl:
     args: &#39;-c 18 -sCD -i 10 -P -oz -F0 --rawtoo --sep &amp;quot;;&amp;quot; -f {collectl_dir}&#39;

benchmarks:
  librbdfio:
    time: 90
    ramp: 30
    time_based: True
    norandommap: True
    vol_size: 52500
    use_existing_volumes: True
    procs_per_volume: [1]
    volumes_per_client: [16]
    osd_ra: [4096]
    cmd_path: &#39;/usr/local/bin/fio&#39;
    create_report: True
    wait_pgautoscaler_timeout: 20
    log_iops: True
    log_bw:  True
    log_lat: True
    fio_out_format: &#39;json&#39;
    log_avg_msec: 100
    rbdname: &amp;quot;test-image&amp;quot;
    poolname: &amp;quot;rbd_replicated&amp;quot;
    prefill:
      blocksize: &#39;64k&#39;
      numjobs: 1

    workloads:
      precondition:
        jobname: &#39;precond1rw&#39;
        mode: &#39;randwrite&#39;
        time: 600
        op_size: 65536
        numjobs: [ 1 ]
        total_iodepth: [ 16 ]
        monitor: False 

      seq32kwrite:
        jobname: &#39;seqwrite&#39;
        mode: &#39;write&#39;
        op_size: 32768
        numjobs: [ 1 ]
        total_iodepth: [ 2, 4, 8, 16, 32, 64, 128, 256, 512, 768 ]
      4krandomread:
        jobname: &#39;randread&#39;
        mode: &#39;randread&#39;
        op_size: 4096
        numjobs: [ 1 ]
        total_iodepth: [ 4, 8, 12, 16, 32, 48, 64, 128, 256, 384, 588, 768 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In part 2 you have learnt about YAML files, workloads, and how they are incorporated within CBT performance benchmarking. We will now move onto &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; of the blog, which will discuss factors to consider and how to start your first CBT performance benchmark!&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Benchmarking Performance with CBT: A guide to setup a Ceph cluster. Part One</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/" />
    <updated>2025-12-03T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/">&lt;p&gt;CBT Performance Benchmarking - Part 1. What is CBT and how can we use it?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Part 1&lt;/strong&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#intro&quot;&gt;Introduction: What is CBT (Ceph Benchmarking Tool)?  &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#consider&quot;&gt;What do you have to consider when you are benchmarking storage systems?  &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#achieve&quot;&gt;What are you looking to achieve from the performance benchmark?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#start&quot;&gt;Starting up a ceph cluster for a performance benchmark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;introduction%3A-what-is-cbt-(ceph-benchmarking-tool)%3F&quot;&gt;&lt;a id=&quot;intro&quot;&gt;&lt;/a&gt;Introduction: What is CBT (Ceph Benchmarking Tool)? &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-what-is-cbt-(ceph-benchmarking-tool)%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ceph/cbt&quot;&gt;CBT&lt;/a&gt; can be used to standardise the performance evaluation process by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simplifying the cluster creation process and having CBT do it&lt;/li&gt;
&lt;li&gt;Running a deterministic suite of tests with response curves (throughput vs latency) with a wide variety of workloads&lt;/li&gt;
&lt;li&gt;Providing tooling to automatically post-process data from a performance benchmark and generate performance and comparison reports, with the ability to compare two or more (up to 6) response curve runs and identify differences in performance within the response curves&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is an example of what a CBT comparison report would look like: (this will all be explained in more detail later, in &lt;strong&gt;part 3&lt;/strong&gt;)&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/cbt_example_results.png&quot; alt=&quot;alt text&quot; title=&quot;Example CBT comparison report&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now, I understand that the above example curves could be a totally new concept for a lot of people, so I will go over the fundamentals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The perfect response curve would be a flat horizontal line showing constant latency as the quantity of I/O increases, until we reach the saturation point. This is where we hit a bottleneck in the system, such as CPU, network, drive utilisation or some other resource limitation (which could also be in the software). At this point we would expect the curve to become a vertical line, showing that attempting to do more I/O than the system can handle just results in I/Os being queued and hence the latency increasing.&lt;/li&gt;
&lt;li&gt;In practice response curves are never perfect; a good response curve will have a fairly horizontal line with the latency increasing gradually as the I/O load increases, curving upwards towards a vertical line where we reach the saturation point.&lt;/li&gt;
&lt;li&gt;Our comparison curves will be explained in more detail in &lt;strong&gt;part 3&lt;/strong&gt; of the blog, so a basic understanding is more than fine for now.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The objective of this blog is to demonstrate how CBT (Ceph Benchmarking Tool) can be used to run tests for Ceph in a deterministic manner. It&#39;s also to show how to set up a Ceph cluster for use with CBT to make your life simpler by automating a lot of the manual effort that is required to set up a performance test.&lt;/p&gt;
&lt;p&gt;For a real-life example, this blog will try to answer the question &amp;quot;Does using the CLAY erasure code plugin give better performance than using the default JErasure plugin?&amp;quot;, showing how CBT can be used to conduct a set of experiments and produce reports to answer it.&lt;/p&gt;
&lt;p&gt;I hope you find this tutorial simple to understand and you will get to learn the benefits of using CBT to make your performance benchmarking a whole lot easier.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;what-do-you-have-to-consider-when-you-are-benchmarking-storage-systems%3F&quot;&gt;&lt;a id=&quot;consider&quot;&gt;&lt;/a&gt;What do you have to consider when you are benchmarking storage systems? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-do-you-have-to-consider-when-you-are-benchmarking-storage-systems%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are several aspects to consider when evaluating performance. The main one is the goal of measuring performance, which may be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regression testing a fix to see if performance has degraded or improved&lt;/li&gt;
&lt;li&gt;Regression testing a build to see if other contributors have degraded performance&lt;/li&gt;
&lt;li&gt;Comparing a feature&lt;/li&gt;
&lt;li&gt;Comparing the effect of scale-up (adding more OSDs to a node) or scale-out (adding more nodes)&lt;/li&gt;
&lt;li&gt;Comparing the performance of one pool type over another&lt;/li&gt;
&lt;li&gt;The effect of additional network bandwidth&lt;/li&gt;
&lt;li&gt;The effect of upgrading CPU in a Ceph Node&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Therefore you need to consider:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The results generated must be compared against a like-for-like system with the test repeated in the same way as the original results.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This includes the &lt;strong&gt;same&lt;/strong&gt; CPU, number of OSDs, drive type, number of RBD volumes, Ceph nodes, and ethernet port/type.&lt;/li&gt;
&lt;li&gt;Even client attach is important.&lt;/li&gt;
&lt;li&gt;Two seemingly like-for-like systems could produce varying performance results because one drive could have a different generation of Flash memory within it.&lt;/li&gt;
&lt;li&gt;So ideally, to get like-for-like comparisons, tests need to be run on the same system.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The system must be prefilled (if applicable, perhaps not so important for Object/RGW evaluation) and preconditioned in the same way.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pre-filling involves filling the volume or pool with sequential writes prior to any performance benchmarking. Filling to 100% of the physical capacity is not needed; most production systems will have sufficient capacity available to allow for expansion. For benchmarking, therefore, filling to around 50% of the physical capacity is sufficient to represent real-world storage.&lt;/li&gt;
&lt;li&gt;Pre-conditioning is adding random overwrites after prefilling the system to simulate a real-world application and introduce some garbage collection/fragmentation, since most production systems will have been running for many months or years and will therefore have generated many overwrites and updates to the data written.&lt;/li&gt;
&lt;li&gt;A storage system that is almost empty will perform very differently from one that has a lot of data on it due to metadata access, garbage collection, fragmentation etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The same workload I/O sizes, e.g. 1M, 4k, 8k, 64k etc., and with the same sequential/random access pattern.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is always going to be some element of variance in the results, even if everything is done like for like.&lt;br&gt;
This could be down to something as small as workload ordering, which can affect the performance of later workloads. For example, if you sequentially write then read, that will have significantly better performance than if you were to randomly write then sequentially read.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;So if the test produces a pass/fail result, you need to allow for variance; typically 10% is acceptable if you are just looking at the average performance over the duration of the run.&lt;/li&gt;
&lt;li&gt;The shorter the run time, the greater the degree of variance.&lt;/li&gt;
&lt;li&gt;Also, to help minimise variance it’s important to pick an appropriate run time for each test; 5 minutes is usually a good amount.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Turning off the balancer, scrub, deep scrub and autoscaler will help generate more repeatable results, as the benchmark will then just be measuring client performance and not any of the background processes in Ceph that can affect performance, such as backfill, PG splitting/merging, and scrubbing (see the example commands after this list). Leaving these features enabled will generate real-world results, but will likely introduce more variance and a few % difference in performance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
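&lt;p&gt;For reference, here is a minimal sketch of disabling those background processes before a run (the pool name is an example; substitute your own, and re-enable everything afterwards):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ceph balancer off                # stop the PG balancer
ceph osd set noscrub             # pause scrubbing
ceph osd set nodeep-scrub        # pause deep scrubbing
ceph osd pool set rbd_replicated pg_autoscale_mode off   # per-pool autoscaler
&lt;/code&gt;&lt;/pre&gt;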
&lt;hr&gt;
&lt;h2 id=&quot;what-are-you-looking-to-achieve-from-the-performance-benchmark%3F&quot;&gt;&lt;a id=&quot;achieve&quot;&gt;&lt;/a&gt;What are you looking to achieve from the performance benchmark? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-are-you-looking-to-achieve-from-the-performance-benchmark%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;If the same performance test is repeated on the same system you want to be able to measure the same results (or with as little variance between runs as possible). This predictability is important if you are going to try and compare different configurations to see which is better.&lt;/p&gt;
&lt;p&gt;Ideally you also want to be able to come back and run the same test 6 months later, on the same system, and get the same results. This is harder because things can change over time. Likewise, if someone configures an equivalent system to the one the performance test was run on, you would like to get the same results.
If done correctly, the amount of manual effort needed to regression test performance will be significantly reduced.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;starting-up-a-ceph-cluster-for-a-performance-benchmark&quot;&gt;&lt;a id=&quot;start&quot;&gt;&lt;/a&gt;Starting up a ceph cluster for a performance benchmark &lt;a class=&quot;link-anchor&quot; href=&quot;#starting-up-a-ceph-cluster-for-a-performance-benchmark&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For these blogs we will be focusing on using &lt;code&gt;Cephadm&lt;/code&gt; to start our ceph clusters, though &lt;code&gt;vstart&lt;/code&gt; or by hand are also feasible options. It&#39;s also important to note that I am using &lt;code&gt;RBD volumes&lt;/code&gt; as the storage type with &lt;code&gt;FIO&lt;/code&gt; as the IO exerciser interface. The same rules for capacity filling etc apply equally to other storage types, except the maths for calculating the pool size will differ.&lt;/p&gt;
&lt;p&gt;This section will describe the basic steps to get a ceph cluster up and running, ready to start a performance benchmark.&lt;/p&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 1: Setup&lt;/summary&gt;
&lt;p&gt;You will want to SSH into the machine that you will be using.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;My system has the following setup:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 SATA SSDs, 210GB each&lt;/li&gt;
&lt;li&gt;ceph version &lt;code&gt;20.3.0-2198-gb0ae68b0 (b0ae68b0ccceed5a913d81c5a8cb0b4e9c5a5f6b)&lt;/code&gt; tentacle (dev)&lt;/li&gt;
&lt;li&gt;OS: Red Hat Enterprise Linux 9.6 (Plow)&lt;br&gt;
Note: This is a single node system and you are running the IO client on the same system as Ceph. However, there is nothing stopping you from running CBT on a multi-node server. The YAML format allows it to SSH into Ceph nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 2: Clean up&lt;/summary&gt;
&lt;p&gt;When you create a cluster using cephadm and run a CBT test, log files will be created in specified locations.&lt;/p&gt;
&lt;p&gt;So if you have done a test before and know there will be old log files at a specific location, begin by deleting them. If you have never done a CBT run before, you can move on to &lt;code&gt;Step 3: Building a container&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now I will remove a previous cluster that I had running, so that I am starting from a clean slate.&lt;/p&gt;
&lt;p&gt;There are 2 areas you will have to delete to complete this step:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Wherever the &lt;code&gt;tmp_dir&lt;/code&gt; line within your yaml file points to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;tmp_dir: &amp;quot;/tmp/cbt&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This directory contains the temporary log files from the IO exerciser, e.g. the FIO JSON files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The -a argument when you run a performance benchmark:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;-a /tmp/cbt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This argument specifies the directory that will contain the results of the performance benchmark.
As you can see, both my YAML and this argument point to the same directory, so before a CBT run I will always make sure to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;rm -rf /tmp/cbt/*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We delete these files because, if you don&#39;t, CBT assumes there is already a run ongoing and will attempt to protect the previous data by skipping tests throughout the YAML.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 3: Building a container&lt;/summary&gt;
&lt;p&gt;Next you will have to get a build container that you are going to use to construct your ceph cluster. You can obtain this container id from &lt;a href=&quot;https://shaman.ceph.com/builds/ceph&quot;&gt;Builds ceph&lt;/a&gt;. Click on your desired build and then copy the &lt;strong&gt;sha1&lt;/strong&gt;; this is also known as the &lt;strong&gt;container id&lt;/strong&gt;. The build I’m using can be seen in the &lt;code&gt;Step 1: Setup&lt;/code&gt; section above.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You will now pull down the desired build container using podman&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
&lt;summary&gt;Click to see details for upstream&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;podman pull quay.ceph.io/ceph-ci/ceph:&amp;lt;sha1&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure to paste your specific &lt;strong&gt;sha1&lt;/strong&gt; into the above command!&lt;/p&gt;
&lt;/details&gt;
&lt;p&gt;Note: The above is using the development build containers. You can also pull released build containers (for Squid/Reef etc), from &lt;a href=&quot;https://quay.io/repository/ceph/ceph?tab=tags&amp;amp;tag=latest&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Click to see details for downstream&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;podman pull quay.ceph.io/ceph/ceph:v19.2.3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above is an example for the latest Squid build.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;podman pull quay.ceph.io/ceph/ceph:v20.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above is an example for the Tentacle release candidate.&lt;/p&gt;
&lt;/details&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 4: Creating a cluster&lt;/summary&gt;
&lt;p&gt;Firstly, run command &lt;code&gt;lsblk&lt;/code&gt; to see if there are any ceph partitions on the block devices you are going to use for ceph. If so, you will need to run the &lt;code&gt;removevgs&lt;/code&gt; script below, to remove the volume groups:&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Click here to see removevgs script&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Remove any leftover Ceph LVM volumes so the devices can be reused
for i in /dev/ceph*
do
    lvremove -y $i
done
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;Next, use cephadm with the container &lt;code&gt;id&lt;/code&gt; you previously pulled down to create your ceph cluster, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cephadm --image quay.ceph.io/ceph-ci/ceph:&amp;lt;sha1&amp;gt; bootstrap --single-host-defaults --log-to-file --mon-ip &amp;lt;ip_of_node&amp;gt; --allow-mismatched-release
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course replace &lt;code&gt;sha1&lt;/code&gt; and &lt;code&gt;ip_of_node&lt;/code&gt; with your corresponding values. You are specifying the container image and using &lt;code&gt;bootstrap&lt;/code&gt; to initialise a new Ceph cluster. &lt;code&gt;--single-host-defaults&lt;/code&gt; optimises the bootstrap for a single node; note that if you are creating a multi-node Ceph cluster, this option is not needed. &lt;code&gt;--log-to-file&lt;/code&gt; makes Ceph daemons log to files on disk. &lt;code&gt;--mon-ip&lt;/code&gt; tells cephadm which IP address to bind the first monitor to. &lt;code&gt;--allow-mismatched-release&lt;/code&gt; lets you bootstrap with an image that does not match the cephadm version of the host.&lt;/p&gt;
&lt;p&gt;It is also common in performance benchmarking to reset the system into a known state prior to starting any benchmarks because factors such as fragmentation of stored data can affect results. Therefore it is advisable to delete and recreate the cluster between every run.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 5: Configure cluster&lt;/summary&gt;
Now that you have a basic cluster set up, you can inspect it to make sure it is up and running:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ceph orch device ls&lt;/code&gt; to check all the OSDs you need are available&lt;/li&gt;
&lt;li&gt;If not available, you have to use &lt;code&gt;ceph orch device zap &amp;lt;host&amp;gt; &amp;lt;device&amp;gt;&lt;/code&gt; to make them available. A script like this will solve the OSD unavailability problem:&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
&lt;summary&gt;Click to see zap OSD script&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;#! /bin/bash
# Temporary files for the device list and the generated zap commands
file=/tmp/$$.out
out=/tmp/$$b.out

# List all SSD devices known to the orchestrator
cephadm shell ceph orch device ls 2&amp;gt;&amp;amp;1 | grep ssd &amp;gt;$file

# Build a zap command for each host/device pair
cat $file | while read -a line_array; do

host=${line_array[0]}
device=${line_array[1]}

echo ceph orch device zap ${host} ${device} --force &amp;gt;&amp;gt;$out
done

echo exit &amp;gt;&amp;gt;$out

# Run all the zap commands inside a cephadm shell
cephadm shell &amp;lt;$out

# Clean up the temporary files
rm -f $file
rm -f $out
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;ul&gt;
&lt;li&gt;Next, you will create your Erasure Coding (EC) setup. This script can be customised however you’d like your EC setup to be; I will provide a simple example version of mine here:&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
&lt;summary&gt;Click to see details&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ceph osd erasure-code-profile set reedsol plugin=isa k=4 m=2 technique=reed_sol_van stripe_unit=4K crush-failure-domain=osd
ceph osd pool create rbd_erasure 64 64 erasure reedsol
ceph osd pool create rbd_replicated 64 64 replicated
ceph osd pool set rbd_erasure allow_ec_overwrites true
ceph osd pool set rbd_erasure allow_ec_optimizations true
rbd pool init rbd_erasure
rbd pool init rbd_replicated
rbd create --pool rbd_replicated --data-pool rbd_erasure --size 10G test-image
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;It&#39;s very important to take a note of the &lt;strong&gt;volume name&lt;/strong&gt; and &lt;strong&gt;pool name&lt;/strong&gt; you create, in my example above this is &lt;code&gt;test-image&lt;/code&gt; and &lt;code&gt;rbd_replicated&lt;/code&gt; respectively. As we are creating an erasure coded profile set up, we use the replicated pool name. (In &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;part 2&lt;/strong&gt;&lt;/a&gt; within the &lt;code&gt;Benchmark Module&lt;/code&gt; section you will need to refer to these names)&lt;/p&gt;
&lt;p&gt;So the above is an example of a similar script to what I run. It defines a 4 + 2 EC profile named &lt;strong&gt;reedsol&lt;/strong&gt;. An EC profile is essentially a template that defines how Ceph should encode and store data using EC. You create two pools (&lt;strong&gt;rbd_erasure&lt;/strong&gt; &amp;amp; &lt;strong&gt;rbd_replicated&lt;/strong&gt;), enable EC overwrites and EC optimisations, then initialise pools and create an RBD image backed by the EC pool.&lt;/p&gt;
&lt;p&gt;Within creating the EC setup you will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Defining the amount of data OSDs (k) and parity OSDs (m)&lt;/li&gt;
&lt;li&gt;Defining the size of your drives&lt;/li&gt;
&lt;li&gt;Defining the percentage of prefill&lt;/li&gt;
&lt;li&gt;Defining the number of volumes&lt;/li&gt;
&lt;li&gt;Defining the volume size&lt;/li&gt;
&lt;li&gt;Defining the EC profile, specifying the plugin, technique, stripe width etc&lt;/li&gt;
&lt;li&gt;Creating your EC pool&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My EC (Erasure Coding) setup is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4 + 2 setup (k=4, m=2)&lt;/li&gt;
&lt;li&gt;210GB drive size&lt;/li&gt;
&lt;li&gt;50% prefill&lt;/li&gt;
&lt;li&gt;8 volumes&lt;/li&gt;
&lt;li&gt;52.5GB volume size&lt;/li&gt;
&lt;li&gt;Single EC pool&lt;/li&gt;
&lt;li&gt;Chunk size = 4K&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now you have set up and configured an erasure coded ceph cluster!&lt;/p&gt;
&lt;p&gt;I will go a bit more in depth here regarding prefilling; as mentioned above, you are aiming to prefill 50% of the physical capacity. Choosing a &lt;strong&gt;working set&lt;/strong&gt; (the amount of logical capacity to use over the course of the benchmark) is very important, so that all the IO doesn&#39;t just go straight into cache on systems with large amounts of memory. Therefore, the total working set needs to be significantly larger than the RAM in the system.&lt;/p&gt;
&lt;p&gt;In this example you are using RBD volumes with erasure coding. This is the calculation you would do to find out how much you need to write to fill the physical capacity to 50% (represented by the 0.5); the result is known as the RBD Volume size:
&lt;code&gt;(Physical drive size * K * 0.5) / No. of volumes&lt;/code&gt;
For our example above, you would get:
&lt;code&gt;(210000 * 4 * 0.5) / 8&lt;/code&gt;, therefore the &lt;strong&gt;RBD Volume size&lt;/strong&gt; = 52500 (52.5GB)&lt;/p&gt;
&lt;p&gt;You can then calculate the &lt;strong&gt;total working set&lt;/strong&gt; by doing:
&lt;code&gt;RBD Volume size * No. of volumes&lt;/code&gt;
Which, for our example, would be:
&lt;code&gt;52500 * 8&lt;/code&gt;, therefore the &lt;strong&gt;working set&lt;/strong&gt; = 420000 (420GB)&lt;/p&gt;
&lt;p&gt;You can see that for our example the working set is 420GB and the RAM is 210GB, therefore this is satisfactory.&lt;/p&gt;
&lt;p&gt;If you are not using RBD volumes with EC and are using replicated pools instead, the maths to get the &lt;strong&gt;RBD Volume size&lt;/strong&gt; would look like this:
&lt;code&gt;(Physical drive size * Number of OSDs / Number of copies * 0.5) / Number of Volumes&lt;/code&gt;&lt;/p&gt;
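&lt;p&gt;As a quick sanity check, here is a minimal sketch of both calculations. The EC numbers are the ones from this example; the replicated-pool numbers are purely hypothetical placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# EC example from above: 210GB (210000MB) drives, k=4, 50% fill, 8 volumes
drive_size_mb=210000
k=4
volumes=8
vol_size=$(( drive_size_mb * k / 2 / volumes ))    # &amp;quot;* 0.5&amp;quot; written as &amp;quot;/ 2&amp;quot;
echo &amp;quot;EC RBD volume size: ${vol_size} MB&amp;quot;                    # 52500
echo &amp;quot;EC working set:     $(( vol_size * volumes )) MB&amp;quot;      # 420000

# Replicated-pool equivalent (hypothetical 3-copy pool across 6 OSDs)
osds=6
copies=3
echo &amp;quot;Replica RBD volume size: $(( drive_size_mb * osds / copies / 2 / volumes )) MB&amp;quot;
&lt;/code&gt;&lt;/pre&gt;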
&lt;/details&gt;
&lt;hr&gt;
&lt;p&gt;Now move onto &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; of the blog if you so wish, where you can take a look at defining a YAML file that will outline the workloads (tests) that you will be running on your ceph cluster!&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Fast Erasure Coding for Tentacle Performance Updates</title>
    <link href="https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/" />
    <updated>2025-11-20T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/</id>
    <author>
      <name>Lee Sanders (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="performance" />
      <category term="erasure-encoding" />
      <category term="tentacle" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/">&lt;p&gt;A deep-dive into the benefits of the FastEC improvements in Tentacle.
This blog discusses in detail how we have improved Erasure Coding to be a viable alternative to replica and reduce the TCO of your Ceph clusters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contents:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#erasure-coding-basics&quot;&gt;Erasure Coding Basics&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#choosing-an-erasure-code-profile&quot;&gt;Choosing an Erasure Code Profile&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-optimizations-partial-reads&quot;&gt;Read Optimizations (Partial Reads)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#space-efficiency-improvements---small-objects-padding&quot;&gt;Space efficiency improvements - small objects padding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#increasing-stripe_unit-size-to-16k&quot;&gt;Increasing stripe_unit size to 16k&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-optimizations&quot;&gt;Write optimizations&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#partial-writes&quot;&gt;Partial Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#parity-delta-writes&quot;&gt;Parity Delta Writes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#performance-results&quot;&gt;Performance Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#how-to-read-a-response-curve&quot;&gt;How to read a response curve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-results---small-writes&quot;&gt;Write Results - Small Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-results---large-writes&quot;&gt;Write Results - Large Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-results---small-reads&quot;&gt;Read Results - Small Reads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-results---large-reads&quot;&gt;Read Results - Large Reads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-append-results&quot;&gt;Write Append Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#mixed-readwrite-workloads&quot;&gt;Mixed Read/Write Workloads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;introduction&quot;&gt;&lt;a id=&quot;introduction&quot;&gt;&lt;/a&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Users of Ceph within the community have been getting very excited about the Fast EC feature within the Tentacle release of Ceph. This blog discusses the performance benefits of enabling Fast EC in Tentacle compared to Squid.&lt;/p&gt;
&lt;p&gt;The optimizations are primarily intended to benefit Block and File workloads; there may be benefits for S3 object workloads with small objects or random-access reads.&lt;/p&gt;
&lt;p&gt;Enabling Fast EC in Tentacle is on a per-pool basis with:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ceph osd pool set &amp;lt;mypool&amp;gt; allow_ec_optimizations true&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It is important to note that once &lt;code&gt;allow_ec_optimizations&lt;/code&gt; is enabled, it cannot be disabled.&lt;/p&gt;
&lt;p&gt;The Fast Erasure coding improvements are summarised as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read optimizations - partial reads&lt;/li&gt;
&lt;li&gt;Space efficiency improvements - small objects padding&lt;/li&gt;
&lt;li&gt;Write optimizations – partial writes, parity delta writes&lt;/li&gt;
&lt;li&gt;Recommending users increase the &lt;code&gt;stripe_unit&lt;/code&gt; size to 16k for pools with &lt;code&gt;allow_ec_optimizations&lt;/code&gt; enabled.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;erasure-coding-basics&quot;&gt;&lt;a id=&quot;erasure-coding-basics&quot;&gt;&lt;/a&gt;Erasure Coding Basics &lt;a class=&quot;link-anchor&quot; href=&quot;#erasure-coding-basics&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Before we jump into discussing the optimizations, let us briefly talk about the basics of Erasure Coding and RAID.&lt;/p&gt;
&lt;p&gt;Ceph erasure coding works by splitting an object into &lt;strong&gt;K&lt;/strong&gt; data chunks and &lt;strong&gt;M&lt;/strong&gt; parity coding chunks, which are then stored across different Object Storage Daemons (OSDs). If one or more OSDs fail, the missing data can be reconstructed by using the remaining data and parity coding chunks. This method is more storage-efficient than traditional replication because it doesn&#39;t store full copies of data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/blog-erasure-basics.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Data splitting: An object is divided into &lt;strong&gt;K&lt;/strong&gt; data chunks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Parity generation: An erasure code algorithm, such as &lt;a href=&quot;https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction&quot;&gt;Reed-Solomon&lt;/a&gt;, computes &lt;strong&gt;M&lt;/strong&gt; parity coding chunks based on the data chunks. The number of parity chunks &lt;strong&gt;M&lt;/strong&gt; determines how many OSDs can fail without data loss. The user can configure the erasure code algorithm with the different plug-ins available. The choice of plug-in is outside the scope of this blog.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chunk distribution: The &lt;strong&gt;K&lt;/strong&gt; data chunks and &lt;strong&gt;M&lt;/strong&gt; parity chunks are distributed and stored on separate OSDs according to a CRUSH rule.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The user decides what the size of a chunk is, this is called a &lt;code&gt;stripe_unit&lt;/code&gt; and this can be specified when the &lt;code&gt;erasure-code-profile&lt;/code&gt; is created. There is a section later that discusses the choice of &lt;code&gt;stripe_unit&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;stripe_unit&lt;/code&gt; size is the amount of data that is written to a data chunk before the next part of an object is written to the next chunk on the next OSD. A stripe is the collection of strips across the &lt;strong&gt;K&lt;/strong&gt; data chunks, together with the coding parities that protect the data in the event of an OSD loss.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Within the community, &lt;code&gt;stripe_unit&lt;/code&gt; is commonly referred to as a &lt;strong&gt;chunk&lt;/strong&gt;. For the purpose of this blog, &lt;code&gt;stripe_unit&lt;/code&gt; is synonymous to chunk size.&lt;/p&gt;
&lt;h3 id=&quot;choosing-an-erasure-code-profile&quot;&gt;&lt;a id=&quot;choosing-an-erasure-code-profile&quot;&gt;&lt;/a&gt;Choosing an Erasure Code Profile &lt;a class=&quot;link-anchor&quot; href=&quot;#choosing-an-erasure-code-profile&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Users have been mainly using replica-3 pools for block and file workloads. A replica-3 pool stores 3 copies of the data on different OSDs so can survive two OSD failures without loss of data. The most common double failure is a drive failure plus a medium error on another drive. Replica-3 pools have a 300% storage overhead - for every 3GB of raw capacity you can store 1GB of application data.&lt;/p&gt;
&lt;p&gt;With erasure coding pools you create an erasure code profile choosing values for &lt;strong&gt;K+M&lt;/strong&gt;. The minimum number of OSDs required for an erasure code pool is &lt;strong&gt;K+M&lt;/strong&gt;, and just like replica-3 pools it is recommended that these OSDs are in different servers for fault tolerance. The choice of &lt;strong&gt;M&lt;/strong&gt; defines how much redundancy you have, &lt;strong&gt;M=2&lt;/strong&gt; means you can survive two OSD failures - the same as a replica-3 pool. The storage overhead for an erasure coded pool is &lt;strong&gt;(K+M) / K&lt;/strong&gt;, so a 4+2 pool has a 150% storage overhead.&lt;/p&gt;
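&lt;p&gt;As a quick illustration of that overhead formula (raw capacity required as a percentage of the data stored):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Storage overhead = raw capacity / usable capacity
awk &#39;BEGIN {
    printf &amp;quot;replica-3: %d%%\n&amp;quot;, 3 * 100;            # three full copies
    printf &amp;quot;EC 4+2:    %d%%\n&amp;quot;, (4 + 2) / 4 * 100;   # (K+M)/K
    printf &amp;quot;EC 6+2:    %d%%\n&amp;quot;, (6 + 2) / 6 * 100;
}&#39;
&lt;/code&gt;&lt;/pre&gt;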
&lt;p&gt;This blog focuses on Erasure code performance with &lt;strong&gt;M=2&lt;/strong&gt; as this gives the same level of protection as a replica-3 pool.&lt;/p&gt;
&lt;h2 id=&quot;read-optimizations-(partial-reads)&quot;&gt;&lt;a id=&quot;read-optimizations-partial-reads&quot;&gt;&lt;/a&gt;Read Optimizations (Partial Reads) &lt;a class=&quot;link-anchor&quot; href=&quot;#read-optimizations-(partial-reads)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In Squid, reads to an individual strip in a stripe read the whole stripe, extract the data needed by the client request from the stripe data and then discard the rest. For small reads, the greater the &lt;strong&gt;K&lt;/strong&gt; value (data strips) in the erasure code profile, the greater the amount of wasted IOs to the OSDs.&lt;/p&gt;
&lt;p&gt;In Tentacle, Partial Reads is an improvement that reads only the minimal data needed to honour the client request. There are two benefits to this improvement: firstly, read performance is unaffected by increasing &lt;strong&gt;K&lt;/strong&gt; and your drive media gets better utilization through fewer wasted IOs; secondly, with a larger &lt;code&gt;stripe_unit&lt;/code&gt;, client reads will only need to read part of a strip and there will be less wasted bandwidth from the other OSDs.&lt;/p&gt;
&lt;p&gt;This means that in Tentacle, with fast EC, you can now choose to use a higher value of &lt;strong&gt;K&lt;/strong&gt; so that you get better capacity utilization without the performance penalties that we see in Squid.&lt;/p&gt;
&lt;h2 id=&quot;space-efficiency-improvements---small-objects-padding&quot;&gt;&lt;a id=&quot;space-efficiency-improvements&quot;&gt;&lt;/a&gt;Space efficiency improvements - small objects padding &lt;a class=&quot;link-anchor&quot; href=&quot;#space-efficiency-improvements---small-objects-padding&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In Squid, small objects are padded to a whole stripe, which results in wasted space as well as a write performance loss due to needlessly writing to multiple OSDs. Fast EC does not pad small objects to a whole stripe; instead it writes the object to just the strips that it needs to, resulting in a performance improvement as well as a capacity saving.&lt;/p&gt;
&lt;h2 id=&quot;increasing-stripe_unit-size-to-16k&quot;&gt;&lt;a id=&quot;increasing-stripe-unit-size-to-16k&quot;&gt;&lt;/a&gt;Increasing stripe_unit size to 16k &lt;a class=&quot;link-anchor&quot; href=&quot;#increasing-stripe_unit-size-to-16k&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Having a small &lt;code&gt;stripe_unit&lt;/code&gt; increases the probability that client I/Os get split up into multiple requests for different OSDs. For large I/Os (e.g. 1MB reads) there is a performance advantage in splitting the I/O into smaller requests to separate OSDs that can be processed in parallel. For smaller I/Os splitting the I/O just increases the work for the drives, CPU and network and reduces performance.&lt;/p&gt;
&lt;p&gt;Increasing the &lt;code&gt;stripe_unit&lt;/code&gt; reduces the overheads for processing small I/Os whilst still splitting and getting a performance advantage for large I/Os.&lt;/p&gt;
&lt;p&gt;In squid and earlier, there are two reasons why the &lt;code&gt;stripe_unit&lt;/code&gt; was small:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Lack of partial read support essentially was a blocker to allowing the increase of the &lt;code&gt;stripe_unit&lt;/code&gt; size, as greater values of &lt;strong&gt;K&lt;/strong&gt; with a larger &lt;code&gt;stripe_unit&lt;/code&gt; meant reads of 4-16k would have resulted in even greater IO wastage to the OSDs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;EC used to pad all objects to be a multiple of the stripe size. A bigger &lt;code&gt;stripe_unit&lt;/code&gt; means more padding which wasted storage capacity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There is still a compromise between performance and capacity usage. Increasing the &lt;code&gt;stripe_unit&lt;/code&gt; above 16K, perhaps as high as 256K, would improve performance more, but for small files or objects it will still waste storage capacity. The choice of 16K for the &lt;code&gt;stripe_unit&lt;/code&gt; is a good compromise – it gives very similar capacity utilization to the old EC but better performance.&lt;/p&gt;
&lt;p&gt;The default &lt;code&gt;stripe_unit&lt;/code&gt; is still 4K in Tentacle, but we recommend that you specify a 16K &lt;code&gt;stripe_unit&lt;/code&gt; when you create a new fast EC pool for a bigger performance gain.&lt;/p&gt;
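&lt;p&gt;For example, a new fast EC profile and pool along these lines (the names are illustrative; adjust the plugin, k, m and failure domain to suit your cluster):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# 6+2 profile with a 16K stripe_unit, then a pool with EC optimizations enabled
ceph osd erasure-code-profile set fastec62 plugin=isa k=6 m=2 stripe_unit=16K crush-failure-domain=host
ceph osd pool create ec_fast 64 64 erasure fastec62
ceph osd pool set ec_fast allow_ec_overwrites true
ceph osd pool set ec_fast allow_ec_optimizations true
&lt;/code&gt;&lt;/pre&gt;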
&lt;p&gt;For existing pools it is not possible to change the &lt;code&gt;stripe_unit&lt;/code&gt;; fast EC can still be enabled for these pools, but the performance improvement will be slightly smaller.&lt;/p&gt;
&lt;h2 id=&quot;write-optimizations&quot;&gt;&lt;a id=&quot;write-optimizations&quot;&gt;&lt;/a&gt;Write optimizations &lt;a class=&quot;link-anchor&quot; href=&quot;#write-optimizations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;partial-writes&quot;&gt;&lt;a id=&quot;partial-writes&quot;&gt;&lt;/a&gt;Partial Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#partial-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Squid, all sub-stripe writes are handled by reading the whole stripe, merging in the new data from the client, encoding the new parities and writing the whole stripe back, data along with the coding parities.&lt;/p&gt;
&lt;p&gt;This meant that EC was more optimised for large block and large object workloads, but it is not optimal for small object or small write workloads such as CephFS or transactional workloads, since greater values of &lt;strong&gt;K&lt;/strong&gt; with small writes mean that IO operations are amplified.&lt;/p&gt;
&lt;p&gt;Partial Writes reads only the data strips that are not being written, encodes the new parities and writes back only the modified data and parity strips.&lt;/p&gt;
&lt;p&gt;This optimisation means for small writes and large values of &lt;strong&gt;K&lt;/strong&gt;, Fast EC saves on drive operations for reading and writing unchanged data within the stripe.&lt;/p&gt;
&lt;h3 id=&quot;parity-delta-writes&quot;&gt;&lt;a id=&quot;parity-delta-writes&quot;&gt;&lt;/a&gt;Parity Delta Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#parity-delta-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Parity delta writes (PDW) builds on the partial write improvement within Fast EC.&lt;/p&gt;
&lt;p&gt;A common technique used by block storage controllers implementing RAID-5 and RAID-6 is the parity delta write. When a small part of the stripe is being overwritten it is possible to perform the update by reading the old data, XORing this with the new data to create a delta, and then reading each coding parity, applying the delta and writing the new parity. The advantage of this technique is that it can involve a lot less I/O, especially for &lt;strong&gt;K+M&lt;/strong&gt; encodings with larger values of &lt;strong&gt;K&lt;/strong&gt;. The technique is not specific to &lt;strong&gt;M=1&lt;/strong&gt; and &lt;strong&gt;M=2&lt;/strong&gt;; it can be applied with any number of coding parities. For &lt;strong&gt;M=2&lt;/strong&gt;, this technique involves doing 3 reads and 3 writes per strip within the client request, and then updates to the parity are coalesced via a cache to minimize the number of parity updates within the stripe. (For &lt;strong&gt;M=1&lt;/strong&gt;, 2 reads and 2 writes are needed for each write.)&lt;/p&gt;
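&lt;p&gt;A toy illustration of the delta idea, using a single XOR parity (RAID-5 style) for simplicity; the numbers are arbitrary, and the real implementation applies the same principle to each Reed-Solomon coding parity:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Parity over four data strips: P = D1 ^ D2 ^ D3 ^ D4
d1=0x11; d2=0x22; d3=0x33; d4=0x44
p=$(( d1 ^ d2 ^ d3 ^ d4 ))

# Overwrite D2 only: read old D2 and old P, write new D2 and new P
new_d2=0x99
delta=$(( d2 ^ new_d2 ))          # delta between old and new data
new_p=$(( p ^ delta ))            # apply the delta to the parity

full=$(( d1 ^ new_d2 ^ d3 ^ d4 )) # what a full re-encode would give
printf &#39;delta-updated parity 0x%x matches full re-encode 0x%x\n&#39; &amp;quot;$new_p&amp;quot; &amp;quot;$full&amp;quot;
&lt;/code&gt;&lt;/pre&gt;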
&lt;p&gt;In some scenarios depending on the value of &lt;strong&gt;K&lt;/strong&gt; and the size of the write operation, it may be more beneficial to not use PDW.&lt;/p&gt;
&lt;p&gt;The implementation of PDW within Fast EC dynamically adjusts the write technique for each IO for optimal write performance.&lt;/p&gt;
&lt;p&gt;Here is an example table of profile vs the write size just for illustration:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Erasure Code&lt;/th&gt;
&lt;th&gt;stripe_unit&lt;/th&gt;
&lt;th&gt;Write size&lt;/th&gt;
&lt;th&gt;PDW Write&lt;/th&gt;
&lt;th&gt;PDW Off (Partial Write)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;3 reads+3 writes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1 read+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3 reads+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3 reads+3 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3 reads+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5 reads+3 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;32k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4 reads+4 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 reads+4 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3 reads+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7 reads+3 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;32k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4 reads+4 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6 reads+4 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Figure 1: Table to explain write overhead using PDW and Partial Write techniques&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The highlighted &lt;code&gt;text&lt;/code&gt; indicates the more efficient method for the scenario.&lt;/p&gt;
&lt;p&gt;In scenarios where the total number of I/O operations is the same between PDW on and off (i.e. using the Partial Write methodology), FastEC will favour using PDW, because reading and writing the same OSD is more efficient than reading and writing different OSDs since BlueStore caches metadata.&lt;/p&gt;
&lt;h2 id=&quot;performance-results&quot;&gt;&lt;a id=&quot;performance-results&quot;&gt;&lt;/a&gt;Performance Results &lt;a class=&quot;link-anchor&quot; href=&quot;#performance-results&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For the purpose of this blog, we ran the performance tests with a single node. Running with a single node means that there are no network bottlenecks and we can focus on CPU and drive bottlenecks. The absolute performance measurements won’t be great, but we can still compare relative performance, as the optimizations will demonstrate that we have extracted more performance in workloads that are limited by CPU or drives.&lt;/p&gt;
&lt;p&gt;The configuration of the system is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single Node – 8 OSDs - NVME Flash&lt;/li&gt;
&lt;li&gt;2 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz – 28 cores per socket&lt;/li&gt;
&lt;li&gt;LibRBD FIO client – 16 volumes – 1 client per RBD volume&lt;/li&gt;
&lt;li&gt;ISAL plugin&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-to-read-a-response-curve&quot;&gt;&lt;a id=&quot;how-to-read-a-response-curve&quot;&gt;&lt;/a&gt;How to read a response curve &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-read-a-response-curve&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For the next sections, I need to explain a response curve (also known as a hockey stick curve).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-read-responsecurve.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 2: How to read a response curve.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A response curve plots I/Os per second (IOPs) against latency. Starting from the bottom left of the chart, the expectation is that as the queue depth to the storage system increases, the IOPs increase also. At a certain point in the curve, otherwise known as the knee, we reach the saturation point where throughput no longer increases and adding extra work onto the storage system’s queue will just increase latency. This is what generates the “hockey stick” shape of the curve.&lt;/p&gt;
&lt;p&gt;For each point of the curve, the I/O workload is run at a specified queue depth for several minutes (typically 3 to 5, with a warm-up period) and then an average IOPS and latency is calculated.&lt;/p&gt;
&lt;p&gt;The saturation point is system specific and there may be many reasons why this limit is hit depending on the workload: the CPU, a single CPU core, a drive, a network interface, or some resource limit in the software are just a few possible reasons, and there may be others.&lt;/p&gt;
&lt;p&gt;Typically, response curves are used during client system sizing estimates to understand the limits of the system being sold and to evaluate how much headroom remains on the system. Typically, clients don’t go beyond around 70% of the maximum throughput, to allow sufficient headroom for expansion.
A flat line at the beginning of the curve through to the knee is an indication that latency is consistent, with low variance in the throughput and latency.&lt;/p&gt;
&lt;p&gt;The topic of how a response curve is created or factors that can affect the response curve is subject to performance best practices. This is outside the scope of this blog and will be discussed in a series of blogs on CBT (Ceph Benchmarking Tool) which will be available soon on &lt;a href=&quot;https://ceph.io&quot;&gt;ceph.io&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When comparing response curves, it is inevitable that there is some variance, typically around 5 to 10%.&lt;/p&gt;
&lt;p&gt;For now, let us move on to discussing the Fast EC performance improvements in Tentacle.
To explain the legend of the charts in the next section (and all other charts in this blog), take the following example:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;squid-ec-6+2-4K&lt;/strong&gt; means we are running the squid build, using erasure coding with a 6+2 profile and a 4K stripe unit. Therefore, these graphs are comparing a Squid build with a 4K &lt;code&gt;stripe_unit&lt;/code&gt; 6+2 erasure code to a Tentacle build with FastEC enabled with a 16K &lt;code&gt;stripe_unit&lt;/code&gt; and a 6+2 profile. There are other charts that use a different erasure code profile.&lt;/p&gt;
&lt;h3 id=&quot;write-results---small-writes&quot;&gt;&lt;a id=&quot;write-results---small-writes&quot;&gt;&lt;/a&gt;Write Results - Small Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#write-results---small-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallwrites-part1.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-smallwrites-part2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 3: Small Writes - Squid 4k 6+2 EC vs Tentacle 16k Fast EC 6+2&lt;/strong&gt;
Small writes are common for Ceph FS, RBD and small object workloads.&lt;/p&gt;
&lt;p&gt;In Figure 3, we start by comparing a Squid 4k &lt;code&gt;stripe_unit&lt;/code&gt; 6+2 erasure code to Tentacle with FastEC enabled in a 16k &lt;code&gt;stripe_unit&lt;/code&gt; 6+2 configuration. This is a small, single-node system with 8 OSDs. By all means, 20K IOPS isn’t a particularly great throughput for a storage system, however it isn’t the absolute numbers we are interested in here. We are interested in the relative performance of the two pieces of software; this highlights that we can at least double the throughput of the drives with Fast EC at the same latency, or in some cases improve the latency. If you want more performance, you can add more drives and nodes to the configuration.&lt;/p&gt;
&lt;p&gt;The improvement in performance in the 6+2 configuration between Squid 4K and Tentacle 16k is largely due to the Parity Delta Writes feature of FastEC, as explained in Figure 1, which compares the number of read/write operations depending on the value of &lt;strong&gt;K&lt;/strong&gt; and the size of the write IO request.&lt;/p&gt;
&lt;p&gt;Your choice of &lt;strong&gt;K&lt;/strong&gt; can affect the performance you get from the system. Here is a set of charts that perform the same 4/8/16k random writes test, comparing 2+2, 4+2 and 6+2 EC configurations:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallwrites-part3.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-smallwrites-part4.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 4: Small Writes - Squid – EC 2+2, 4+2 and 6+2 – 4k compared to tentacle – FastEC in 2+2,4+2, 6+2 profiles&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Previously in Squid, write performance reduced as &lt;strong&gt;K&lt;/strong&gt; increased. The reason is that the whole stripe is always being read and written, which means that for wider erasure codes (e.g. 4+2 and 6+2) the overheads get higher and performance reduces. Increasing &lt;strong&gt;K&lt;/strong&gt; above 6 would lead to further drops in performance.&lt;/p&gt;
&lt;p&gt;For Tentacle with Fast EC, the parity delta write optimization means that performance improves for wider erasure codes as &lt;strong&gt;K&lt;/strong&gt; increases. Performance is not expected to improve beyond 6+2.&lt;/p&gt;
&lt;p&gt;We’ll discuss later in this blog how we are recommending choosing greater values of &lt;strong&gt;K&lt;/strong&gt; as this improves storage efficiency with less capacity overhead.&lt;/p&gt;
&lt;h3 id=&quot;write-results---large-writes&quot;&gt;&lt;a id=&quot;write-results---large-writes&quot;&gt;&lt;/a&gt;Write Results - Large Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#write-results---large-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-largewrites.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 5: Large Writes - Comparing Squid 4k, Tentacle 16k to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For large writes and large S3 objects there is a small increase in throughput and lower latency compared to Squid. You can expect to see the same performance with FastEC enabled for larger 1MByte objects; performance is near 3-way replica.&lt;/p&gt;
&lt;h3 id=&quot;read-results---small-reads&quot;&gt;&lt;a id=&quot;read-results---small-reads&quot;&gt;&lt;/a&gt;Read Results - Small Reads &lt;a class=&quot;link-anchor&quot; href=&quot;#read-results---small-reads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallreads-part1.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 6: Small reads comparing 4k stripe_unit Squid 6+2 EC to 16k Tentacle 6+2 EC&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Small reads yield a 2-3x improvement due to the Partial Read feature added in Fast EC. This is good for RBD, Ceph FS and Small object workloads.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallreads-part2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 7: Small Reads - Comparing 2+2,4+2, 6+2 Squid to Tentacle to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Comparing the different erasure code profiles between Squid and Tentacle.&lt;/p&gt;
&lt;p&gt;These results highlight the following observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For Squid, as &lt;strong&gt;K&lt;/strong&gt; increases from 2+2 -&amp;gt; 4+2 -&amp;gt; 6+2, maximum throughput degrades because, as explained earlier in this blog, Squid does not have partial reads. As &lt;strong&gt;K&lt;/strong&gt; increases, more data is thrown away for small read operations, therefore increasing OSD and CPU utilization.&lt;/li&gt;
&lt;li&gt;For Tentacle, as &lt;strong&gt;K&lt;/strong&gt; increases, maximum throughput scales to the point where we can achieve nearly the same read performance as 3-way replica.&lt;/li&gt;
&lt;li&gt;The latency gap between Tentacle and 3-way replica is because reads to the non-primary OSDs are being redirected to the primary OSD.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Direct Reads, a feature coming in a future release of Ceph, will remove the hop to the primary OSD, improving latency to be equivalent to 3-way replica performance.&lt;/p&gt;
&lt;p&gt;It is currently targeted at the Umbrella timeframe; there will be a blog on this feature at a future date.&lt;/p&gt;
&lt;h3 id=&quot;read-results---large-reads&quot;&gt;&lt;a id=&quot;read-results---large-reads&quot;&gt;&lt;/a&gt;Read Results - Large Reads &lt;a class=&quot;link-anchor&quot; href=&quot;#read-results---large-reads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-largereads-part1.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-largereads-part2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 8: Large Reads - Comparing 6+2 Squid 4K to Tentacle 6+2 16k to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For large reads (backup, large S3 object and streaming workloads), Fast EC offers slightly lower latency and around a 1.2x increase in throughput over Squid.&lt;/p&gt;
&lt;p&gt;Direct Reads is expected to significantly improve EC throughput further to be much closer to 3-way replica whilst also reducing latency due to dividing up of the large requests into chunks and issuing the IOs in parallel to all the OSDs in the stripe.&lt;/p&gt;
&lt;h3 id=&quot;write-append-results&quot;&gt;&lt;a id=&quot;write-append-results&quot;&gt;&lt;/a&gt;Write Append Results &lt;a class=&quot;link-anchor&quot; href=&quot;#write-append-results&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Write appends are where new data is being appended to the end of an existing object. This is common in sequential write, backup, AI or RGW PUT workloads.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-writeappends-part1.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-writeappends-part2.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-writeappends-part3.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 9: Write Appends – Squid 4k to Tentacle 16k &lt;code&gt;stripe_unit&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;These results highlight the following benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A significant latency reduction for all writes up to 512k and a modest improvement at 1MByte.&lt;/li&gt;
&lt;li&gt;For small block writes up to 16k, there is a significant increase in IOPs throughput available.&lt;/li&gt;
&lt;li&gt;For writes 16k to 64k there is a modest increase in throughput available also.&lt;/li&gt;
&lt;li&gt;No degradation in performance for 512k and 1Mbyte writes whilst improving latency significantly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is interesting to note that increasing &lt;strong&gt;K&lt;/strong&gt; (e.g. going from 2+2 to 4+2/6+2) increases the latency. The reason for this is that in a 2+2 configuration, 50% of your I/O is writing to the primary OSD of the PG, whereas in a 4+2 configuration, 25% of your I/O is writing to the primary OSD of the PG. Writing to a non-primary OSD requires forwarding the request to the primary OSD, resulting in an extra messenger hop operation.&lt;/p&gt;
&lt;h3 id=&quot;mixed-read%2Fwrite-workloads&quot;&gt;&lt;a id=&quot;mixed-readwrite-workloads&quot;&gt;&lt;/a&gt;Mixed Read/Write Workloads &lt;a class=&quot;link-anchor&quot; href=&quot;#mixed-read%2Fwrite-workloads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-mixed.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 10: Mixed 16k 70/30 - Squid 4k to Tentacle 16k to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Transactional and File workloads contain a mixture of reads and writes, typically small block with 70% reads and 30% writes of around 16k in size. This chart contains a typical 70/30 16k mix workload.&lt;/p&gt;
&lt;p&gt;Compared with Squid, there is at least a doubling in throughput with FastEC. Three-way replica is still faster; a 6+2 16k &lt;code&gt;stripe_unit&lt;/code&gt; erasure pool with Fast EC achieves around 50% of its performance. However, you need to consider that a 6+2 erasure code has only 33% capacity overhead, compared to needing 3x physical capacity for a three-way configuration. Three-way replica is a significantly more expensive option compared to using EC. Therefore, EC in 6+2 form has a much better cost vs performance ratio than 3-way replica.&lt;/p&gt;
&lt;p&gt;On the same storage system, write-dominated workloads with EC (due to the 3 reads/3 writes per update) are never going to perform as well as replica, simply because the EC algorithms need to do more I/Os than replica does. However, you can offset this cost with the reduced physical capacity requirement and restructure your storage accordingly.&lt;/p&gt;
&lt;p&gt;It is important to note that traditional storage controllers often offer a choice between RAID-1 (mirroring) and RAID-6 (erasure coding, K+2), and they present a similar cost vs performance trade-off.&lt;/p&gt;
&lt;p&gt;Using a wider erasure code such as 6+2 requires 9 nodes, so you may need to add more nodes to your Ceph cluster. However, the cost of a storage solution is typically dominated by the cost of the drives you install to store the data, especially if you are using flash. With erasure code you get roughly half the performance at less than half the cost, giving you the opportunity to scale out and build the same level of performance as replica.&lt;/p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The objective of the EC performance enhancements is to make performance good enough to make it viable to use EC for block and file storage, especially when you consider the cost performance ratio benefits of using EC over 3-way replica.&lt;/p&gt;
&lt;p&gt;For the most part, users should not be choosing the value of &lt;strong&gt;K&lt;/strong&gt; based on performance. Users should use higher values of &lt;strong&gt;K&lt;/strong&gt; (such as 6+2) for better storage efficiency whilst maintaining the same redundancy as replica.&lt;/p&gt;
&lt;p&gt;Using Fast EC in a 6+2 configuration, you could use this capacity saving to increase the number of nodes, redistribute your drives across them, and achieve the same performance as three-way replica while still saving money.&lt;/p&gt;
&lt;p&gt;The Fast EC feature in Tentacle reduces the total cost of ownership of your Ceph cluster by making Erasure Coding a viable, more space-efficient way of storing your data, with a significantly better cost vs performance ratio than Replica pools.&lt;/p&gt;
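&lt;p&gt;As a rough illustration of what this looks like in practice, the sketch below creates a 6+2 profile with a 16k &lt;code&gt;stripe_unit&lt;/code&gt;, creates an erasure-coded pool from it, and enables the EC optimizations flag on that pool. The pool and profile names are made up and the exact flag syntax may vary by release, so treat this as a starting point rather than a recipe:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical example: a 6+2 Fast EC pool with a 16k stripe unit
$ ceph osd erasure-code-profile set fastec-6-2 k=6 m=2 plugin=isa stripe_unit=16K
$ ceph osd pool create ecpool erasure fastec-6-2
$ ceph osd pool set ecpool allow_ec_optimizations true
&lt;/code&gt;&lt;/pre&gt;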
&lt;p&gt;I hope this blog has helped you appreciate the performance benefits of Fast EC. The team are working on many more improvements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Direct Reads – This feature will significantly improve reads, offering performance on par with Replica pools.&lt;/li&gt;
&lt;li&gt;Object packing – This feature benefits users wanting to increase the &lt;code&gt;stripe_unit&lt;/code&gt; beyond 16k without degrading space utilization, which in turn brings further performance improvements for reads and writes larger than 16k. This will be a useful improvement for larger (4MB) objects.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Direct Reads is targeted for the Umbrella release. Object packing will be in a future release. More performance data on these features will be available nearer the time.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>v20.2.0 Tentacle released</title>
    <link href="https://ceph.io/en/news/blog/2025/v20-2-0-tentacle-released/" />
    <updated>2025-11-18T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/v20-2-0-tentacle-released/</id>
    <author>
      <name>Laura Flores</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="release" />
      <category term="tentacle" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/v20-2-0-tentacle-released/">&lt;p&gt;Tentacle is the 20th stable release of Ceph.&lt;/p&gt;
&lt;p&gt;This is the first stable release of Ceph Tentacle.&lt;/p&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#changes&quot;&gt;Major Changes from Squid&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#upgrade&quot;&gt;Upgrading from Reef or Squid&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#upgrade-from-older-release&quot;&gt;Upgrading from pre-Reef releases (like Quincy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#contributors&quot;&gt;Thank You to Our Contributors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;major-changes-from-squid&quot;&gt;&lt;a id=&quot;changes&quot;&gt;&lt;/a&gt;Major Changes from Squid &lt;a class=&quot;link-anchor&quot; href=&quot;#major-changes-from-squid&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;highlights&quot;&gt;Highlights &lt;a class=&quot;link-anchor&quot; href=&quot;#highlights&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;See the sections below for more details on these items.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;CephFS&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Directories may now be configured with case-insensitive or normalized
directory entry names.&lt;/li&gt;
&lt;li&gt;Modifying the FS setting variable &lt;code&gt;max_mds&lt;/code&gt; when a cluster is unhealthy
now requires users to pass the confirmation flag (&lt;code&gt;--yes-i-really-mean-it&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EOPNOTSUPP&lt;/code&gt; (Operation not supported) is now returned by the CephFS FUSE
client for &lt;code&gt;fallocate&lt;/code&gt; for the default case (i.e. &lt;code&gt;mode == 0&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Crimson&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SeaStore Tech Preview: SeaStore object store is now deployable
alongside Crimson-OSD, mainly for early testing and experimentation.
Community feedback is encouraged to help with future improvements.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dashboard&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support has been added for NVMe/TCP gateway groups and multiple
namespaces, multi-cluster management, OAuth 2.0 integration, and enhanced
RGW/SMB features including multi-site automation, tiering, policies,
lifecycles, notifications, and granular replication.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Integrated SMB support&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ceph clusters now offer an SMB Manager module that works like the existing
NFS subsystem. The new SMB support allows the Ceph cluster to automatically
create Samba-backed SMB file shares connected to CephFS. The &lt;code&gt;smb&lt;/code&gt; module
can configure both basic Active Directory domain or standalone user
authentication. The Ceph cluster can host one or more virtual SMB clusters
which can be truly clustered using Samba&#39;s CTDB technology. The &lt;code&gt;smb&lt;/code&gt;
module requires a cephadm-enabled Ceph cluster and deploys container images
provided by the &lt;code&gt;samba-container&lt;/code&gt; project. The Ceph dashboard can be used
to configure SMB clusters and shares. A new &lt;code&gt;cephfs-proxy&lt;/code&gt; daemon is
automatically deployed to improve scalability and memory usage when connecting
Samba to CephFS.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MGR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Users now have the ability to force-disable always-on modules.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;restful&lt;/code&gt; and &lt;code&gt;zabbix&lt;/code&gt; modules (deprecated since 2020) have been
officially removed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RADOS&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;FastEC: Long-anticipated performance and space amplification
optimizations are added for erasure-coded pools.&lt;/li&gt;
&lt;li&gt;BlueStore: Improved compression and a new, faster WAL (write-ahead-log).&lt;/li&gt;
&lt;li&gt;Data Availability Score: Users can now track a data availability score
for each pool in their cluster.&lt;/li&gt;
&lt;li&gt;OMAP: All components have been switched to the faster OMAP iteration
interface, which improves RGW bucket listing and scrub operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RBD&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New live migration features: RBD images can now be instantly imported
from another Ceph cluster (native format) or from a wide variety of
external sources/formats.&lt;/li&gt;
&lt;li&gt;There is now support for RBD namespace remapping while mirroring between
Ceph clusters.&lt;/li&gt;
&lt;li&gt;Several commands related to group and group snap info were added or
improved, and &lt;code&gt;rbd device map&lt;/code&gt; command now defaults to &lt;code&gt;msgr2&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RGW&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Added support for S3 &lt;code&gt;GetObjectAttributes&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For compatibility with AWS S3, &lt;code&gt;LastModified&lt;/code&gt; timestamps are now truncated
to the second. Note that during upgrade, users may observe these timestamps
moving backwards as a result.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bucket resharding now does most of its processing before it starts to block
write operations. This should significantly reduce the client-visible impact
of resharding on large buckets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The User Account feature introduced in Squid provides first-class support for
IAM APIs and policy. Our preliminary STS support was based on tenants, and
exposed some IAM APIs to admins only. This tenant-level IAM functionality is now
deprecated in favor of accounts. While we&#39;ll continue to support the tenant feature
itself for namespace isolation, the following features will be removed no sooner
than the V release:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tenant-level IAM APIs including CreateRole, PutRolePolicy and PutUserPolicy,&lt;/li&gt;
&lt;li&gt;Use of tenant names instead of accounts in IAM policy documents,&lt;/li&gt;
&lt;li&gt;Interpretation of IAM policy without cross-account policy evaluation,&lt;/li&gt;
&lt;li&gt;S3 API support for cross-tenant names such as &lt;code&gt;Bucket=&#39;tenant:bucketname&#39;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;STS Lite and &lt;code&gt;sts:GetSessionToken&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;cephadm&quot;&gt;Cephadm &lt;a class=&quot;link-anchor&quot; href=&quot;#cephadm&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A new cephadm-managed &lt;code&gt;mgmt-gateway&lt;/code&gt; service provides a single, TLS-terminated
entry point for Ceph management endpoints such as the Dashboard and the monitoring
stack. The gateway is implemented as an nginx-based reverse proxy that fronts Prometheus,
Grafana, and Alertmanager, so users no longer need to connect to those daemons directly or
know which hosts they run on. When combined with the new &lt;code&gt;oauth2-proxy&lt;/code&gt; service, which
integrates with external identity providers using the OpenID Connect (OIDC) / OAuth 2.0
protocols, the gateway can enforce centralized authentication and single sign-on (SSO) for
both the Ceph Dashboard and the rest of the monitoring stack.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;High availability for the Ceph Dashboard and the Prometheus-based monitoring stack is now
provided via the cephadm-managed &lt;code&gt;mgmt-gateway&lt;/code&gt;. nginx high-availability mechanisms allow
the mgmt-gateway to detect healthy instances of the Dashboard, Prometheus, Grafana, and Alertmanager,
route traffic accordingly, and handle manager failover transparently. When deployed with a virtual
IP and multiple &lt;code&gt;mgmt-gateway&lt;/code&gt; instances, this architecture keeps management access available
even during daemon or host failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new &lt;code&gt;certmgr&lt;/code&gt; cephadm subsystem centralizes certificate lifecycle management for cephadm-managed
services. certmgr acts as a cluster-internal root CA for cephadm-signed certificates, it can also
consume user-provided certificates, and tracks how each certificate was provisioned. It standardizes
HTTPS configuration for services such as RGW and the mgmt-gateway, automates renewal and rotation of
cephadm-signed certificates, and raises health warnings when certificates are invalid, expiring or misconfigured.
With certmgr, cephadm-signed certificates are available across all cephadm-managed services, providing
secure defaults out of the box.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;cephfs&quot;&gt;CephFS &lt;a class=&quot;link-anchor&quot; href=&quot;#cephfs&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Directories may now be configured with case-insensitive or
normalized directory entry names. This is an inheritable configuration,
making it apply to an entire directory tree.&lt;/p&gt;
&lt;p&gt;For more information, see &lt;a href=&quot;https://docs.ceph.com/en/tentacle/cephfs/charmap/&quot;&gt;https://docs.ceph.com/en/tentacle/cephfs/charmap/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is now possible to pause the threads that asynchronously purge
deleted subvolumes by using the config option
&lt;code&gt;mgr/volumes/pause_purging&lt;/code&gt;.&lt;/p&gt;
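&lt;p&gt;A minimal sketch of toggling this option, assuming it is set like any other mgr module option via &lt;code&gt;ceph config&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph config set mgr mgr/volumes/pause_purging true   # pause async purging
$ ceph config set mgr mgr/volumes/pause_purging false  # resume
&lt;/code&gt;&lt;/pre&gt;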
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is now possible to pause the threads that asynchronously clone
subvolume snapshots by using the config option
&lt;code&gt;mgr/volumes/pause_cloning&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modifying the setting &lt;code&gt;max_mds&lt;/code&gt; when a cluster is
unhealthy now requires users to pass the confirmation flag
(&lt;code&gt;--yes-i-really-mean-it&lt;/code&gt;). This has been added as a precaution to inform
users that modifying &lt;code&gt;max_mds&lt;/code&gt; may not help with troubleshooting or recovery
efforts. Instead, it might further destabilize the cluster.&lt;/p&gt;
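&lt;p&gt;For example (hypothetical file system name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On an unhealthy cluster this now requires the confirmation flag
$ ceph fs set cephfs max_mds 2 --yes-i-really-mean-it
&lt;/code&gt;&lt;/pre&gt;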
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;EOPNOTSUPP&lt;/code&gt; (Operation not supported) is now returned by the CephFS
FUSE client for &lt;code&gt;fallocate&lt;/code&gt; in the default case (i.e., &lt;code&gt;mode == 0&lt;/code&gt;) since
CephFS does not support disk space reservation. The only flags supported are
&lt;code&gt;FALLOC_FL_KEEP_SIZE&lt;/code&gt; and &lt;code&gt;FALLOC_FL_PUNCH_HOLE&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;ceph fs subvolume snapshot getpath&lt;/code&gt; command now allows users
to get the path of a snapshot of a subvolume. If the snapshot is not present,
&lt;code&gt;ENOENT&lt;/code&gt; is returned.&lt;/p&gt;
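&lt;p&gt;A sketch of the expected usage, assuming the command follows the same argument pattern as the other &lt;code&gt;subvolume snapshot&lt;/code&gt; commands (volume, subvolume, then snapshot name, with an optional group):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# argument order assumed; check the command help on your cluster
$ ceph fs subvolume snapshot getpath myvol mysubvol mysnap --group_name mygroup
&lt;/code&gt;&lt;/pre&gt;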
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;ceph fs volume create&lt;/code&gt; command now allows users to pass
metadata and data pool names to be used for creating the volume. If either
is not passed, or if either is a non-empty pool, the command will abort.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The format of the pool namespace name for CephFS volumes has been changed
from &lt;code&gt;fsvolumens__&amp;lt;subvol-name&amp;gt;&lt;/code&gt; to
&lt;code&gt;fsvolumens__&amp;lt;subvol-grp-name&amp;gt;_&amp;lt;subvol-name&amp;gt;&lt;/code&gt; to avoid namespace collisions
when two subvolumes located in different subvolume groups have the same name.
Even with namespace collisions, there were no security issues, since the MDS
auth cap is restricted to the subvolume path. Now, with this change, the
namespaces are completely isolated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the subvolume name passed to the command &lt;code&gt;ceph fs subvolume info&lt;/code&gt;
is a clone, the output will now also contain a &amp;quot;source&amp;quot; field that tells the
user the name of the source snapshot along with the name of the volume,
subvolume group, and subvolume in which the source snapshot is located.
For clones created with Tentacle or an earlier release, the value of this
field will be &lt;code&gt;N/A&lt;/code&gt;. Regular subvolumes do not have a source subvolume and
therefore the output for them will not contain a &amp;quot;source&amp;quot; field regardless of
the release.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;crimson-%2F-seastore&quot;&gt;Crimson / SeaStore &lt;a class=&quot;link-anchor&quot; href=&quot;#crimson-%2F-seastore&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Crimson project continues to progress, with the Squid release marking the
first technical preview available for Crimson.
The Tentacle release introduces a host of improvements and new functionalities
that enhance the robustness, performance, and usability
of both Crimson-OSD and the SeaStore object store.
In this release, SeaStore can now be deployed alongside the Crimson-OSD!
Early testing and experimentation are highly encouraged and we’d greatly
appreciate any initial feedback rounds from the community to help guide future
improvements.
Check out the Crimson project updates blog post for Tentacle
where we highlight some of the work included in the latest release, moving us
closer to fully replacing the existing Classical OSD in the future:
&lt;a href=&quot;https://ceph.io/en/news/blog/2025/crimson-T-release/&quot;&gt;https://ceph.io/en/news/blog/2025/crimson-T-release/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you&#39;re new to the Crimson project, please visit the project
page for more information and resources: &lt;a href=&quot;https://ceph.io/en/news/crimson&quot;&gt;https://ceph.io/en/news/crimson&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&quot;dashboard&quot;&gt;Dashboard &lt;a class=&quot;link-anchor&quot; href=&quot;#dashboard&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;There is now added support for NVMe/TCP gateway groups and multiple
namespaces, multi-cluster management, OAuth 2.0 integration, and enhanced
RGW/SMB features including multi-site automation, tiering, policies,
lifecycles, notifications, and granular replication.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;mgr&quot;&gt;MGR &lt;a class=&quot;link-anchor&quot; href=&quot;#mgr&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The Ceph Manager&#39;s always-on modules/plugins can now be force-disabled.
This can be necessary in cases where we wish to prevent the manager from being
flooded by module commands when Ceph services are down or degraded.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;mgr/restful&lt;/code&gt;, &lt;code&gt;mgr/zabbix&lt;/code&gt;: both modules, deprecated since 2020, have
finally been removed. They have not been actively maintained in recent years
and have started suffering from vulnerabilities in their dependency chain (e.g.
CVE-2023-46136). An alternative for the &lt;code&gt;restful&lt;/code&gt; module is the &lt;code&gt;dashboard&lt;/code&gt; module,
which provides a richer and better maintained RESTful API. Regarding the &lt;code&gt;zabbix&lt;/code&gt; module,
there are alternative monitoring solutions, like &lt;code&gt;prometheus&lt;/code&gt;, which is the most
widely adopted among the Ceph user community.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;rados&quot;&gt;RADOS &lt;a class=&quot;link-anchor&quot; href=&quot;#rados&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Long-anticipated performance and space amplification optimizations (FastEC)
are added for erasure-coded pools, including partial reads and partial writes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new implementation of the Erasure Coding I/O code provides substantial
performance improvements and some capacity improvements. The new code is
designed to optimize performance when using Erasure Coding with block storage
(RBD) and file storage (CephFS) but will have benefits for object storage
(RGW), in particular when using smaller sized objects. A new flag
&lt;code&gt;allow_ec_optimizations&lt;/code&gt; must be set on each pool to switch to using the
new code. Existing pools can be upgraded once the OSD and Monitor daemons
have been updated. There is no need to update the clients.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The default plugin for erasure coded pools has been changed from Jerasure to
ISA-L. Clusters created on Tentacle or later releases will use ISA-L as the
default plugin when creating a new pool. Clusters that upgrade to the T release
will continue to use their existing default values. The default values can be
overridden by creating a new erasure code profile and selecting it when
creating a new pool. ISA-L is recommended for new pools because the Jerasure
library is no longer maintained.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;BlueStore now has better compression and a new, faster WAL (write-ahead-log).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All components have been switched to the faster OMAP iteration interface, which
improves RGW bucket listing and scrub operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is now possible to bypass &lt;code&gt;ceph_assert()&lt;/code&gt; in extreme cases to help with
disaster recovery.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Testing improvements for dencoding verification were added.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new command, &lt;code&gt;ceph osd pool availability-status&lt;/code&gt;, has been added that
allows users to view the availability score for each pool in a cluster. A pool
is considered unavailable if any PG in the pool is not &lt;code&gt;active&lt;/code&gt; or if
there are unfound objects. Otherwise the pool is considered available. The
score is updated every second by default. This interval can be changed
using the new config option &lt;code&gt;pool_availability_update_interval&lt;/code&gt;. The feature
is off by default. A new config option &lt;code&gt;enable_availability_tracking&lt;/code&gt; can be
used to turn on the feature if required. Another command is added to clear the
availability status for a specific pool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd pool clear-availability-status &amp;lt;pool-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This feature is in tech preview.&lt;/p&gt;
&lt;p&gt;Related links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Feature ticket: &lt;a href=&quot;https://tracker.ceph.com/issues/67777&quot;&gt;https://tracker.ceph.com/issues/67777&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href=&quot;https://docs.ceph.com/en/tentacle/rados/operations/monitoring/&quot;&gt;https://docs.ceph.com/en/tentacle/rados/operations/monitoring/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
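&lt;p&gt;A minimal sketch of turning the feature on and querying it, assuming the option is applied as a monitor-level setting via &lt;code&gt;ceph config&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# assumed to be a monitor-level option
$ ceph config set mon enable_availability_tracking true
$ ceph osd pool availability-status
&lt;/code&gt;&lt;/pre&gt;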
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leader monitor and stretch mode status are now included in the &lt;code&gt;ceph status&lt;/code&gt;
output.&lt;/p&gt;
&lt;p&gt;Related tracker: &lt;a href=&quot;https://tracker.ceph.com/issues/70406&quot;&gt;https://tracker.ceph.com/issues/70406&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;ceph df&lt;/code&gt; command reports incorrect &lt;code&gt;MAX AVAIL&lt;/code&gt; for stretch mode pools
when CRUSH rules use multiple take steps for datacenters. &lt;code&gt;PGMap::get_rule_avail&lt;/code&gt;
incorrectly calculates available space from only one datacenter. As a workaround,
define CRUSH rules with &lt;code&gt;take default&lt;/code&gt; and &lt;code&gt;choose firstn 0 type datacenter&lt;/code&gt;.
See &lt;a href=&quot;https://tracker.ceph.com/issues/56650#note-6&quot;&gt;https://tracker.ceph.com/issues/56650#note-6&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;Upgrading a cluster configured with a CRUSH rule with multiple take steps can
lead to data shuffling, as the new CRUSH changes may necessitate data
redistribution. In contrast, a stretch rule with a single-take configuration
will not cause any data movement during the upgrade process.&lt;/p&gt;
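&lt;p&gt;A sketch of a single-take stretch rule along those lines (rule name, id, and replica counts are illustrative, not prescriptive):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# illustrative single-take stretch rule; adjust names and counts for your cluster
rule stretch_rule {
    id 2
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
&lt;/code&gt;&lt;/pre&gt;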
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added convenience function &lt;code&gt;librados::AioCompletion::cancel()&lt;/code&gt; with the same
behavior as &lt;code&gt;librados::IoCtx::aio_cancel()&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The configuration parameter &lt;code&gt;osd_repair_during_recovery&lt;/code&gt; has been removed.
That configuration flag used to control whether an operator-initiated &amp;quot;repair
scrub&amp;quot; would be allowed to start on an OSD that is performing a recovery. In
this Ceph version, operator-initiated scrubs and repair scrubs are never blocked
by a repair being performed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fixed issue of recovery/backfill hang due to improper handling of items in the
dmclock&#39;s background clean-up thread.&lt;/p&gt;
&lt;p&gt;Related tracker: &lt;a href=&quot;https://tracker.ceph.com/issues/61594&quot;&gt;https://tracker.ceph.com/issues/61594&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The OSD&#39;s IOPS capacity used by the mClock scheduler is now also checked to
determine if it&#39;s below a configured threshold value defined by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;osd_mclock_iops_capacity_low_threshold_hdd&lt;/code&gt; – set to 50 IOPS&lt;/li&gt;
&lt;li&gt;&lt;code&gt;osd_mclock_iops_capacity_low_threshold_ssd&lt;/code&gt; – set to 1000 IOPS&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The check is intended to handle cases where the measured IOPS is unrealistically
low. If such a case is detected, the IOPS capacity is either set to the last
valid value or the configured default to avoid affecting cluster performance
(slow or stalled ops).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Documentation has been updated with steps to override OSD IOPS capacity
configuration.&lt;/p&gt;
&lt;p&gt;Related links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tracker ticket: &lt;a href=&quot;https://tracker.ceph.com/issues/70774&quot;&gt;https://tracker.ceph.com/issues/70774&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href=&quot;https://docs.ceph.com/en/tentacle/rados/configuration/mclock-config-ref/&quot;&gt;https://docs.ceph.com/en/tentacle/rados/configuration/mclock-config-ref/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/rados: Fixed &lt;code&gt;WriteOp.zero()&lt;/code&gt;, which passed its &lt;code&gt;offset&lt;/code&gt; and
&lt;code&gt;length&lt;/code&gt; arguments in reverse order. Previously, the arguments pybind passed did
not match &lt;code&gt;rados_write_op_zero&lt;/code&gt;; offset and length were swapped, which
resulted in an unexpected response.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;rbd&quot;&gt;RBD &lt;a class=&quot;link-anchor&quot; href=&quot;#rbd&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;RBD images can now be instantly imported from another Ceph cluster. The
migration source spec for &lt;code&gt;native&lt;/code&gt; format has grown &lt;code&gt;cluster_name&lt;/code&gt; and
&lt;code&gt;client_name&lt;/code&gt; optional fields for connecting to the source cluster after
parsing the respective &lt;code&gt;ceph.conf&lt;/code&gt;-like configuration file.&lt;/p&gt;
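&lt;p&gt;A sketch of what a &lt;code&gt;native&lt;/code&gt; source spec using these new fields might look like (all names are placeholders; the &lt;code&gt;pool_name&lt;/code&gt; and &lt;code&gt;image_name&lt;/code&gt; fields are part of the existing native format):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cat source-spec.json
{
  &amp;quot;type&amp;quot;: &amp;quot;native&amp;quot;,
  &amp;quot;cluster_name&amp;quot;: &amp;quot;prod&amp;quot;,
  &amp;quot;client_name&amp;quot;: &amp;quot;client.migration&amp;quot;,
  &amp;quot;pool_name&amp;quot;: &amp;quot;rbd&amp;quot;,
  &amp;quot;image_name&amp;quot;: &amp;quot;legacy-image&amp;quot;
}
$ rbd migration prepare --import-only --source-spec-path source-spec.json rbd/imported-image
&lt;/code&gt;&lt;/pre&gt;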
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With the help of the new NBD stream (&lt;code&gt;&amp;quot;type&amp;quot;: &amp;quot;nbd&amp;quot;&lt;/code&gt;), RBD images can now
be instantly imported from a wide variety of external sources/formats. The
exact set of supported formats and their features depends on the capabilities
of the NBD server.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While mirroring between Ceph clusters, the local and remote RBD namespaces
don&#39;t need to be the same anymore (but the pool names still do). Using the
new &lt;code&gt;--remote-namespace&lt;/code&gt; option of &lt;code&gt;rbd mirror pool enable&lt;/code&gt; command, it&#39;s
now possible to pair a local namespace with an arbitrary remote namespace in
the respective pool, including mapping a default namespace to a non-default
namespace and vice versa, at the time mirroring is configured.&lt;/p&gt;
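&lt;p&gt;For example, pairing a local namespace with a differently named remote namespace (pool and namespace names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ rbd mirror pool enable rbd/ns-a image --remote-namespace ns-b
&lt;/code&gt;&lt;/pre&gt;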
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All Python APIs that produce timestamps now return &amp;quot;aware&amp;quot; &lt;code&gt;datetime&lt;/code&gt;
objects instead of &amp;quot;naive&amp;quot; ones (i.e., those including time zone information
instead of those not including it). All timestamps remain in UTC, but
including &lt;code&gt;timezone.utc&lt;/code&gt; makes it explicit and avoids the potential of the
returned timestamp getting misinterpreted. In Python 3, many &lt;code&gt;datetime&lt;/code&gt;
methods treat &amp;quot;naive&amp;quot; &lt;code&gt;datetime&lt;/code&gt; objects as local times.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rbd group info&lt;/code&gt; and &lt;code&gt;rbd group snap info&lt;/code&gt; commands are introduced to
show information about a group and a group snapshot respectively.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rbd group snap ls&lt;/code&gt; output now includes the group snapshot IDs. The header
of the column showing the state of a group snapshot in the unformatted CLI
output is changed from &lt;code&gt;STATUS&lt;/code&gt; to &lt;code&gt;STATE&lt;/code&gt;. The state of a group snapshot
that was shown as &lt;code&gt;ok&lt;/code&gt; is now shown as &lt;code&gt;complete&lt;/code&gt;, which is more
descriptive.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In &lt;code&gt;rbd mirror image status&lt;/code&gt; and &lt;code&gt;rbd mirror pool status --verbose&lt;/code&gt;
outputs, &lt;code&gt;mirror_uuids&lt;/code&gt; field has been renamed to &lt;code&gt;mirror_uuid&lt;/code&gt; to
highlight that the value is always a single UUID and never a list of any
kind.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Moving an image that is a member of a group to trash is no longer
allowed. The &lt;code&gt;rbd trash mv&lt;/code&gt; command now behaves the same way as &lt;code&gt;rbd rm&lt;/code&gt;
in this scenario.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rbd device map&lt;/code&gt; command now defaults to &lt;code&gt;msgr2&lt;/code&gt; for all device types.
&lt;code&gt;-o ms_mode=legacy&lt;/code&gt; can be passed to continue using &lt;code&gt;msgr1&lt;/code&gt; with krbd.&lt;/p&gt;
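&lt;p&gt;For example (hypothetical image name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ rbd device map rbd/myimage                    # msgr2 by default
$ rbd device map rbd/myimage -o ms_mode=legacy  # stick with msgr1 for krbd
&lt;/code&gt;&lt;/pre&gt;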
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The family of diff-iterate APIs has been extended to allow diffing from or
between non-user type snapshots which can only be referred to by their IDs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fetching the mirroring mode of an image is invalid if the image is
disabled for mirroring. The public APIs -- C++ &lt;code&gt;mirror_image_get_mode()&lt;/code&gt;,
C &lt;code&gt;rbd_mirror_image_get_mode()&lt;/code&gt;, and Python &lt;code&gt;Image.mirror_image_get_mode()&lt;/code&gt;
-- will return &lt;code&gt;EINVAL&lt;/code&gt; when mirroring is disabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Promoting an image is invalid if the image is not enabled for mirroring.
The public APIs -- C++ &lt;code&gt;mirror_image_promote()&lt;/code&gt;,
C &lt;code&gt;rbd_mirror_image_promote()&lt;/code&gt;, and Python &lt;code&gt;Image.mirror_image_promote()&lt;/code&gt;
-- will return EINVAL instead of ENOENT when mirroring is not enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Requesting a resync on an image is invalid if the image is not enabled
for mirroring. The public APIs -- C++ &lt;code&gt;mirror_image_resync()&lt;/code&gt;,
C &lt;code&gt;rbd_mirror_image_resync()&lt;/code&gt;, and Python &lt;code&gt;Image.mirror_image_resync()&lt;/code&gt;
-- will return EINVAL instead of ENOENT when mirroring is not enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;rgw&quot;&gt;RGW &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Multiple fixes: Lua scripts will no longer run uselessly against health checks,
and properly quoted &lt;code&gt;ETag&lt;/code&gt; values are now returned by S3 &lt;code&gt;CopyPart&lt;/code&gt;, &lt;code&gt;PostObject&lt;/code&gt;, and
&lt;code&gt;CompleteMultipartUpload&lt;/code&gt; responses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;IAM policy evaluation now supports conditions &lt;code&gt;ArnEquals&lt;/code&gt; and &lt;code&gt;ArnLike&lt;/code&gt;,
along with their &lt;code&gt;Not&lt;/code&gt; and &lt;code&gt;IfExists&lt;/code&gt; variants.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added BEAST frontend option &lt;code&gt;so_reuseport&lt;/code&gt; which facilitates running multiple
RGW instances on the same host by sharing a single TCP port.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Replication policies now validate permissions using
&lt;code&gt;s3:ReplicateObject&lt;/code&gt;, &lt;code&gt;s3:ReplicateDelete&lt;/code&gt;, and &lt;code&gt;s3:ReplicateTags&lt;/code&gt; for
destination buckets. For source buckets, both
&lt;code&gt;s3:GetObjectVersionForReplication&lt;/code&gt; and &lt;code&gt;s3:GetObject(Version)&lt;/code&gt; are
supported. Actions like &lt;code&gt;s3:GetObjectAcl&lt;/code&gt;, &lt;code&gt;s3:GetObjectLegalHold&lt;/code&gt;, and
&lt;code&gt;s3:GetObjectRetention&lt;/code&gt; are also considered when fetching the source object.
Replication of tags is controlled by the
&lt;code&gt;s3:GetObject(Version)Tagging&lt;/code&gt; permission.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added missing quotes to the &lt;code&gt;ETag&lt;/code&gt; values returned by S3 &lt;code&gt;CopyPart&lt;/code&gt;,
&lt;code&gt;PostObject&lt;/code&gt;, and &lt;code&gt;CompleteMultipartUpload&lt;/code&gt; responses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;PutObjectLockConfiguration&lt;/code&gt; can now be used to enable S3 Object Lock on an
existing versioning-enabled bucket that was not created with Object Lock enabled.&lt;/p&gt;
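&lt;p&gt;A sketch of what this looks like from an S3 client&#39;s perspective, here using the AWS CLI against an RGW endpoint (endpoint and bucket name are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ aws --endpoint-url https://rgw.example.com s3api put-object-lock-configuration \
      --bucket mybucket \
      --object-lock-configuration &#39;{&amp;quot;ObjectLockEnabled&amp;quot;: &amp;quot;Enabled&amp;quot;}&#39;
&lt;/code&gt;&lt;/pre&gt;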
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;x-amz-confirm-remove-self-bucket-access&lt;/code&gt; header is now supported by
&lt;code&gt;PutBucketPolicy&lt;/code&gt;. Additionally, the root user will always have access to
modify the bucket policy, even if the current policy explicitly denies access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added support for the &lt;code&gt;RestrictPublicBuckets&lt;/code&gt; property of the S3
&lt;code&gt;PublicAccessBlock&lt;/code&gt; configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The HeadBucket API now reports the &lt;code&gt;X-RGW-Bytes-Used&lt;/code&gt; and &lt;code&gt;X-RGW-Object-Count&lt;/code&gt;
headers only when the &lt;code&gt;read-stats&lt;/code&gt; querystring is explicitly included in the
API request.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;telemetry&quot;&gt;Telemetry &lt;a class=&quot;link-anchor&quot; href=&quot;#telemetry&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;basic&lt;/code&gt; channel in telemetry now captures the &lt;code&gt;ec_optimizations&lt;/code&gt;
flag, which will allow us to gauge feature adoption for the new
FastEC improvements.
To opt into telemetry, run &lt;code&gt;ceph telemetry on&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;upgrading-from-reef-or-squid&quot;&gt;&lt;a id=&quot;upgrade&quot;&gt;&lt;/a&gt;Upgrading from Reef or Squid &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-from-reef-or-squid&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Before starting, ensure that your cluster is stable and healthy with no
&lt;code&gt;down&lt;/code&gt;, &lt;code&gt;recovering&lt;/code&gt;, &lt;code&gt;incomplete&lt;/code&gt;, &lt;code&gt;undersized&lt;/code&gt; or &lt;code&gt;backfilling&lt;/code&gt; PGs.
You can temporarily disable the PG autoscaler for all pools during the upgrade
by running &lt;code&gt;ceph osd pool set noautoscale&lt;/code&gt; before beginning, and if the
autoscaler is desired after completion, running &lt;code&gt;ceph osd pool unset noautoscale&lt;/code&gt;
after upgrade success is confirmed.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You can monitor the progress of your upgrade at each stage with the
&lt;code&gt;ceph versions&lt;/code&gt; command, which will tell you what Ceph version(s) are running
for each type of daemon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;upgrading-cephadm-clusters&quot;&gt;Upgrading Cephadm Clusters &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-cephadm-clusters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If your cluster is deployed with cephadm (first introduced in Octopus), then the upgrade process is entirely automated. To initiate the upgrade,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade start --image quay.io/ceph/ceph:v20.2.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same process is used to upgrade to future minor releases.&lt;/p&gt;
&lt;p&gt;Upgrade progress can be monitored with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade status
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Upgrade progress can also be monitored with &lt;code&gt;ceph -s&lt;/code&gt; (which provides a simple progress bar) or more verbosely with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph -W cephadm
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The upgrade can be paused or resumed with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade pause  # to pause
$ ceph orch upgrade resume # to resume
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or canceled with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade stop
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that canceling the upgrade simply stops the process. There is no ability to downgrade back to Reef or Squid.&lt;/p&gt;
&lt;h3 id=&quot;upgrading-non-cephadm-clusters&quot;&gt;Upgrading Non-cephadm Clusters &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-non-cephadm-clusters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If your cluster is running Reef (18.2.x) or later, you might choose
to first convert it to use cephadm so that the upgrade to Tentacle is automated (see above).
For more information, see &lt;a href=&quot;https://docs.ceph.com/en/tentacle/cephadm/adoption/&quot;&gt;https://docs.ceph.com/en/tentacle/cephadm/adoption/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your cluster is running Reef (18.2.x) or later, systemd unit file
names have changed to include the cluster fsid. To find the correct
systemd unit file name for your cluster, run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl -l | grep &amp;lt;daemon type&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl -l | grep mon | grep active

ceph-6ce0347c-314a-11ee-9b52-000af7995d6c@mon.f28-h21-000-r630.service loaded active running Ceph mon.f28-h21-000-r630 for 6ce0347c-314a-11ee-9b52-000af7995d6c
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Set the &lt;code&gt;noout&lt;/code&gt; flag for the duration of the upgrade. (Optional, but recommended.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd set noout
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade Monitors by installing the new packages and restarting the Monitor daemons. For example, on each Monitor host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-mon.target
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once all Monitors are up, verify that the Monitor upgrade is complete by looking for the &lt;code&gt;tentacle&lt;/code&gt; string in the mon map. The command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph mon dump | grep min_mon_release
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;should report:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;min_mon_release 20 (tentacle)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If it does not, that implies that one or more Monitors haven&#39;t been upgraded and restarted and/or the quorum does not include all Monitors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade &lt;code&gt;ceph-mgr&lt;/code&gt; daemons by installing the new packages and restarting all Manager daemons. For example, on each Manager host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-mgr.target
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the &lt;code&gt;ceph-mgr&lt;/code&gt; daemons are running by checking &lt;code&gt;ceph -s&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph -s

...
  services:
   mon: 3 daemons, quorum foo,bar,baz
   mgr: foo(active), standbys: bar, baz
...
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade all OSDs by installing the new packages and restarting the &lt;code&gt;ceph-osd&lt;/code&gt; daemons on all OSD hosts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-osd.target
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade all CephFS MDS daemons. For each CephFS file system:&lt;/p&gt;
&lt;p&gt;5.1. Disable standby_replay:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph fs set &amp;lt;fs_name&amp;gt; allow_standby_replay false     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.2. Reduce the number of ranks to 1. (Make note of the original number of MDS daemons first if you plan to restore it later.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph fs set &amp;lt;fs_name&amp;gt; max_mds 1     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.3. Wait for the cluster to deactivate any non-zero ranks by periodically checking the status:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph status     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.4. Take all standby MDS daemons offline on the appropriate hosts with:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ systemctl stop ceph-mds@&amp;lt;daemon_name&amp;gt;     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.5. Confirm that only one MDS is online and is rank 0 for your FS:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph status     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.6. Upgrade the last remaining MDS daemon by installing the new packages and restarting the daemon:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ systemctl restart ceph-mds.target     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.7. Restart all standby MDS daemons that were taken offline:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ systemctl start ceph-mds.target     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.8. Restore the original value of &lt;code&gt;max_mds&lt;/code&gt; for the volume:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph fs set &amp;lt;fs_name&amp;gt; max_mds &amp;lt;original_max_mds&amp;gt;     &lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade all &lt;code&gt;radosgw&lt;/code&gt; daemons by upgrading packages and restarting daemons on all hosts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-radosgw.target
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Complete the upgrade by disallowing pre-Tentacle OSDs and enabling all new Tentacle-only functionality:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd require-osd-release tentacle
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you set &lt;code&gt;noout&lt;/code&gt; at the beginning, be sure to clear it with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd unset noout
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Consider transitioning your cluster to use the cephadm deployment and orchestration framework to simplify
cluster management and future upgrades. For more information on converting an existing cluster to cephadm,
see &lt;a href=&quot;https://docs.ceph.com/en/tentacle/cephadm/adoption/&quot;&gt;https://docs.ceph.com/en/tentacle/cephadm/adoption/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;post-upgrade&quot;&gt;Post-upgrade &lt;a class=&quot;link-anchor&quot; href=&quot;#post-upgrade&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Verify the cluster is healthy with &lt;code&gt;ceph health&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Consider enabling telemetry to send anonymized usage statistics
and crash information to Ceph upstream developers. To see what would
be reported without actually sending any information to anyone:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph telemetry preview-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are comfortable with the data that is reported, you can opt-in to automatically report high-level cluster metadata with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph telemetry on
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The public dashboard that aggregates Ceph telemetry can be found at &lt;a href=&quot;https://telemetry-public.ceph.com/&quot;&gt;https://telemetry-public.ceph.com/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;upgrading-from-pre-reef-releases-(like-quincy)&quot;&gt;&lt;a id=&quot;upgrade-from-older-release&quot;&gt;&lt;/a&gt;Upgrading from Pre-Reef Releases (like Quincy) &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-from-pre-reef-releases-(like-quincy)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You &lt;strong&gt;must&lt;/strong&gt; first upgrade to Reef (18.2.z) or Squid (19.2.z) before upgrading to Tentacle.&lt;/p&gt;
&lt;h2 id=&quot;thank-you-to-our-contributors&quot;&gt;&lt;a id=&quot;contributors&quot;&gt;&lt;/a&gt;Thank You to Our Contributors &lt;a class=&quot;link-anchor&quot; href=&quot;#thank-you-to-our-contributors&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We express our gratitude to all members of the Ceph community who contributed by proposing pull requests, testing this release,
providing feedback, and offering valuable suggestions.&lt;/p&gt;
&lt;p&gt;If you are interested in helping test the next release, Umbrella, please join us at the
&lt;a href=&quot;https://ceph-storage.slack.com/archives/C04Q3D7HV1T&quot;&gt;#ceph-at-scale&lt;/a&gt; Slack channel.&lt;/p&gt;
&lt;p&gt;The Tentacle release would not be possible without the contributions of the
community:&lt;/p&gt;
&lt;p&gt;Aashish Sharma ▪
Abhishek Desai ▪
Abhishek Kane ▪
Abhishek Lekshmanan ▪
Achint Kaur ▪
Achintk1491 ▪
Adam C. Emerson ▪
Adam King ▪
Adam Kupczyk ▪
Adam Lyon-Jones ▪
Adarsh Ashokan ▪
Afreen Misbah ▪
Aishwarya Mathuria ▪
Alex Ainscow ▪
Alex Kershaw ▪
Alex Wojno ▪
Alexander Indenbaum ▪
Alexey Odinokov ▪
Alexon Oliveira ▪
Ali Maredia ▪
Ali Masarwa ▪
Aliaksei Makarau ▪
Anatoly Scheglov ▪
Andrei Ivashchenko ▪
Ankit Kumar ▪
Ankush Behl ▪
Anmol Babu ▪
Anoop C S ▪
Anthony D Atri ▪
Anuradha Gadge ▪
Anushruti Sharma ▪
arm7star ▪
Artem Vasilev ▪
Avan Thakkar ▪
Aviv Caro ▪
Benedikt Heine ▪
Bernard Landon ▪
Bill Scales ▪
Brad Hubbard ▪
Brian P ▪
bugwz ▪
cailianchun ▪
Casey Bodley ▪
Chanyoung Park ▪
Chen Yuanrun ▪
Chengen Du ▪
Christian Rohmann ▪
Christopher Hoffman ▪
chungfengz ▪
Chunmei Liu ▪
Connor Fawcett ▪
Cory Snyder ▪
Cybertinus ▪
daijufang ▪
Dan Mick ▪
Dan van der Ster ▪
Daniel Gryniewicz ▪
Danny Al-Gaaf ▪
DanWritesCode ▪
David Galloway ▪
Deepika Upadhyay ▪
Dhairya Parmar ▪
Divyansh Kamboj ▪
Dnyaneshwari ▪
Dominique Leuenberger ▪
Dongdong Tao ▪
Doug Whitfield ▪
Drunkard Zhang ▪
Effi Ofer ▪
Emin ▪
Emin Mert Sunacoglu ▪
Enrico Bocchi ▪
Enrico De Fent ▪
er0k ▪
Erik Sjölund ▪
Ernesto Puerta ▪
Ethan Wu ▪
Feng, Hualong ▪
Florent Carli ▪
Gabriel BenHanokh ▪
Gal Salomon ▪
Garry Drankovich ▪
Gil Bregman ▪
Gilad Sid ▪
gitkenan ▪
Gregory O&#39;Neill ▪
Guillaume Abrioux ▪
gukaifeng ▪
Hannes Baum ▪
haoyixing ▪
hejindong ▪
Hezko ▪
Hoai-Thu Vuong ▪
Hualong Feng ▪
Hyun Jin Kim ▪
igomon ▪
Igor Fedotov ▪
Igor Golikov ▪
Ilya Dryomov ▪
imtzw ▪
Indira Sawant ▪
Ivo Almeida ▪
J. Eric Ivancich ▪
Jakob Haufe ▪
James Oakley ▪
Jamie Pryde ▪
Jane Zhu ▪
Janne Heß ▪
Jannis Speer ▪
Jared Yu ▪
Jaya Prakash ▪
Jayaprakash-ibm ▪
Jesse F. Williamson ▪
Jesse Williamson ▪
Jianwei Zhang ▪
Jianxin Li ▪
jiawd ▪
Jiffin Tony Thottan ▪
Joao Eduardo Luis ▪
Joel Davidow ▪
John Agombar ▪
John Mulligan ▪
Jon Bailey ▪
Jos Collin ▪
Jose J Palacios-Perez ▪
Joshua Baergen ▪
Joshua Blanch ▪
Juan Ferrer Toribio ▪
Juan Miguel Olmo Martínez ▪
julpark ▪
junxiang Mu ▪
Kalpesh Pandya ▪
Kamoltat Sirivadhna ▪
kchheda3 ▪
Kefu Chai ▪
Ken Dreyer ▪
Kevin Niederwanger ▪
Kevin Zhao ▪
Kotresh Hiremath Ravishankar ▪
Kritik Sachdeva ▪
Kushal Deb ▪
Kushal Jyoti Deb ▪
Kyrylo Shatskyy ▪
Laimis Juzeliūnas ▪
Laura Flores ▪
Lee Sanders ▪
Leo Mylonas ▪
Leonid Chernin ▪
Leonid Usov ▪
lightmelodies ▪
Linjing Li ▪
liubingrun ▪
lizhipeng ▪
Lorenz Bausch ▪
Luc Ritchie ▪
Lucian Petrut ▪
Luo Rixin ▪
Ma Jianpeng ▪
Marc Singer ▪
Marcel Lauhoff ▪
Mark Kogan ▪
Mark Nelson ▪
Martin Nowak ▪
Matan Breizman ▪
Matt Benjamin ▪
Matt Vandermeulen ▪
Matteo Paramatti ▪
Matthew Vernon ▪
Max Carrara ▪
Max Kellermann ▪
Md Mahamudur Rahaman Sajib ▪
Michael J. Kidd ▪
Michal Nasiadka ▪
Mike Perez ▪
Miki Patel ▪
Milind Changire ▪
Mindy Preston ▪
Mingyuan Liang ▪
Mohit Agrawal ▪
molpako ▪
mosayyebzadeh ▪
Mouratidis Theofilos ▪
Mykola Golub ▪
Myoungwon Oh ▪
N Balachandran ▪
Naman Munet ▪
Naveen Naidu ▪
nbalacha ▪
Neeraj Pratap Singh ▪
Neha Ojha ▪
Niklas Hambüchen ▪
Nithya Balachandran ▪
Nitzan Mordechai ▪
Nizamudeen A ▪
Oguzhan Ozmen ▪
Omid Yoosefi ▪
Omri Zeneva ▪
Or Ozeri ▪
Orit Wasserman ▪
Oshrey Avraham ▪
Patrick Donnelly ▪
Paul Cuzner ▪
Paul Stemmet ▪
Paulo E. Castro ▪
Pedro Gonzalez Gomez ▪
Pere Diaz Bou ▪
Peter Sabaini ▪
Pierre Riteau ▪
Piotr Parczewski ▪
Piyush Agarwal ▪
Ponnuvel Palaniyappan ▪
Prachi Goel ▪
Prashant D ▪
prik73 ▪
Pritha Srivastava ▪
Puja Shahu ▪
pujashahu ▪
qn2060 ▪
Radoslaw Zarzynski ▪
Raja Sharma ▪
Ramana Raja ▪
Redouane Kachach ▪
rhkelson ▪
Richard Poole ▪
Rishabh Dave ▪
Robin Geuze ▪
Ronen Friedman ▪
Rongqi Sun ▪
Rostyslav Khudov ▪
Roy Sahar ▪
Ryotaro Banno ▪
Sachin Prabhu ▪
Sachin Punadikar ▪
Sam Goyal ▪
Samarah Uriarte ▪
Samuel Just ▪
Satoru Takeuchi ▪
Seena Fallah ▪
Shachar Sharon ▪
Shasha Lu ▪
Shawn Edwards ▪
Shen Jiatong ▪
Shilpa Jagannath ▪
shimin ▪
Shinya Hayashi ▪
Shraddha Agrawal ▪
Shreya Sapale ▪
Shreyansh Sancheti ▪
Shrish0098 ▪
Shua Lv ▪
Shweta Bhosale ▪
Shweta Sodani ▪
Shwetha K Acharya ▪
Sidharth Anupkrishnan ▪
Silent ▪
Simon Jürgensmeyer ▪
Soumya Koduri ▪
Sridhar Seshasayee ▪
Srinivasa Bharath Kanta ▪
Stellios Williams ▪
Steven Chien ▪
Sun Lan ▪
Sungjoon Koh ▪
Sungmin Lee ▪
Sunil Angadi ▪
Sunnat Samadov ▪
Surya Kumari Jangala ▪
Suyash Dongre ▪
T K Chandra Hasan ▪
Taha Jahangir ▪
Tan Changzhi ▪
Teng Jie ▪
Teoman Onay ▪
Thomas Lamprecht ▪
Tobias Fischer ▪
Tobias Urdin ▪
Tod Chen ▪
Tomer Haskalovitch ▪
TomNewChao ▪
Toshikuni Fukaya ▪
Trang Tran ▪
TruongSinh Tran-Nguyen ▪
Tyler Brekke ▪
Tyler Stachecki ▪
Umesh Muthuvara ▪
Vallari Agrawal ▪
Venky Shankar ▪
Victoria Mackie ▪
Ville Ojamo ▪
Vinay Bhaskar Varada ▪
Wang Chao ▪
wanglinke ▪
Xavi Hernandez ▪
Xiubo Li ▪
Xuehan Xu ▪
XueYu Bai ▪
Yaarit Hatuka ▪
Yan, Zheng ▪
Yantao Xue ▪
Yao guotao ▪
Yehuda Sadeh ▪
Yingxin Cheng ▪
Yite Gu ▪
Yonatan Zaken ▪
Yuri Weinstein ▪
Yuval Lifshitz ▪
Zac Dover ▪
Zack Cerza ▪
Zaken ▪
Zhang Song ▪
zhangjianwei2 ▪
Zhansong Gao ▪
Zhipeng Li ▪
胡玮文&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Finding My Place in the Ceph Community: Reflections Ahead of Cephalocon 2025</title>
    <link href="https://ceph.io/en/news/blog/2025/PoweredbyPeopleBlog/" />
    <updated>2025-10-22T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/PoweredbyPeopleBlog/</id>
    <author>
      <name>Anthony Middleton</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="cephalocon" />
      <category term="community" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/PoweredbyPeopleBlog/">&lt;h2 id=&quot;finding-my-place-in-the-ceph-community%3A-reflections-ahead-of-cephalocon-2025&quot;&gt;Finding My Place in the Ceph Community: Reflections Ahead of Cephalocon 2025 &lt;a class=&quot;link-anchor&quot; href=&quot;#finding-my-place-in-the-ceph-community%3A-reflections-ahead-of-cephalocon-2025&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Six months ago, I stepped into an exciting chapter in my career by joining the Ceph Foundation as Community Manager. I saw it as an opportunity to grow a community, enhance my marketing skills, and harness my passion for organizing events to make a positive impact. What I didn&#39;t expect was how truly rewarding this journey would be. So far, I&#39;ve managed many campaigns, but the most significant was preparing for &lt;a href=&quot;https://events.linuxfoundation.org/cephalocon/&quot;&gt;Cephalocon&lt;/a&gt;, taking place in &lt;strong&gt;Vancouver, BC, on October 28–29&lt;/strong&gt;. They say what doesn&#39;t break you makes you stronger, and after six months with Ceph, I honestly feel like the Incredible Hulk: stronger, more resilient, and inspired by the power of open collaboration.&lt;/p&gt;
&lt;p&gt;Cephalocon is the annual gathering of the global Ceph community, where contributors, users, and developers share ideas, exchange knowledge, and celebrate open-source storage progress. It&#39;s a space for innovation and collaboration, often sparking the next breakthrough for the project. This year&#39;s event in Vancouver aims to increase user engagement and showcase Ceph&#39;s versatility with real-world use cases.&lt;/p&gt;
&lt;p&gt;As the Ceph Community Manager, I was invited to attend Cephalocon this year and deliver a presentation. This will be my first Cephalocon and my first visit to Vancouver, BC. I&#39;ve spent my time with Ceph connecting with community members around the world, and all of those interactions have been through screens. I look forward to meeting many community members with whom I have partnered, including the collection of developers, operators, and advocates who make Ceph what it is.&lt;/p&gt;
&lt;p&gt;Along with organizing the details of Cephalocon with the Ceph Events Team, I&#39;ve had the privilege of supporting incredible contributors who share their stories through blog posts, tech talks, and open discussions. I’ve collaborated with participants in programs including the Google Summer of Code and the Ceph Developer Summit, which have shown me just how passionate, innovative, and collaborative this community really is. I’ve also been fortunate to work alongside people like Gaurav Sitlani, whose insights helped shape the Ceph Ambassador Program into a growing network of talented advocates; Frédéric Nass, who has been an amazing collaborator on Ceph blogs and events; Anthony D’Atri, who has been instrumental in refining our communication tools; and Joseph Mundackal, who taught me the ropes of making GitHub pull requests for &lt;a href=&quot;http://ceph.io&quot;&gt;ceph.io&lt;/a&gt; updates.&lt;/p&gt;
&lt;p&gt;The Ceph community has quickly proven to be one of the most inspiring groups I’ve ever worked with. Every person I’ve met brings a story of solving challenges, scaling systems, and believing in open collaboration. My talk at Cephalocon 2025, &lt;em&gt;Powered by People: Growing the Ceph Community Through User Engagement&lt;/em&gt;, will explore where the Ceph community has been and where we&#39;re headed next. I&#39;ll share how user engagement, storytelling, and cross-community collaboration are shaping the next chapter of the Ceph Foundation&#39;s work, and how every contributor plays a role in building a stronger, more connected ecosystem.&lt;/p&gt;
&lt;p&gt;If you&#39;re attending Cephalocon this year, I&#39;d love for you to join my session. If you aren&#39;t, there&#39;s still time to register! You&#39;ll learn how to get more involved with the Ceph Foundation, how we’re building tools to recognize contributors across the ecosystem, and how you can make your voice heard. Ceph’s greatest strength has always been its people, and together, we’re building something extraordinary.&lt;/p&gt;
&lt;p&gt;Cephalocon 2025 will take place in Vancouver, BC, on October 28-29.&lt;/p&gt;
&lt;p&gt;&lt;a class=&quot;button&quot; href=&quot;https://events.linuxfoundation.org/cephalocon/register/&quot;&gt;Register Today!&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Object Storage Deep Dive Series. Part 1</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-deep-dive-1/" />
    <updated>2025-10-15T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-deep-dive-1/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-deep-dive-1/">&lt;h2 id=&quot;ceph-rgw-architecture%3A-a-deep-dive-into-its-core-foundations&quot;&gt;Ceph RGW Architecture: A Deep Dive into its Core Foundations &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-rgw-architecture%3A-a-deep-dive-into-its-core-foundations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;introduction%3A-the-stateless-powerhouse&quot;&gt;Introduction: The Stateless Powerhouse &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-the-stateless-powerhouse&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) is far more than just a proxy; it&#39;s a high-level
abstraction layer that seamlessly provides Amazon S3 and OpenStack Swift
RESTful APIs on top of the underlying Reliable Autonomic Distributed
Object Store (RADOS). For storage architects, the RGW is crucial because
it translates standard HTTP requests for object operations into native RADOS
operations executed directly against the cluster. This allows applications
built for popular cloud object storage ecosystems to leverage a Ceph cluster as
their storage backend without modification.&lt;/p&gt;
&lt;p&gt;A fundamental principle governing RGW&#39;s design is its stateless nature. This
critical architectural decision is the bedrock of its massive horizontal
scalability and high availability. Since RGW daemons maintain no persistent
state related to client sessions, you can achieve near-linear performance
scaling simply by deploying more RGW instances behind a standard load balancer.
The failure of any single RGW daemon is a non-critical event because the load
balancer can redirect client traffic to the remaining healthy instances, making
the outage transparent to end-users. All vital state, including user metadata,
bucket definitions, ACLs, and object data, is durably stored within the RADOS cluster in designated pools.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img1.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
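&lt;p&gt;As a hedged illustration of that elasticity (the service name and daemon count below are arbitrary, not taken from a real deployment), adding RGW instances with cephadm is a one-line operation; the load balancer in front simply picks up the new endpoints:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Scale the (hypothetical) rgw.myrealm.myzone service out to three daemons
$ ceph orch apply rgw myrealm.myzone --placement=3
&lt;/code&gt;&lt;/pre&gt;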
&lt;p&gt;In this first deep dive, we peel back the layers to examine the RGW frontend
components, the specialized RADOS pools that house its internal metadata, and
the critical mechanics of bucket indexing and sharding that enable high-performance
object operations.&lt;/p&gt;
&lt;h3 id=&quot;rgw-frontends&quot;&gt;RGW Frontends &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-frontends&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;An incoming client request to an RGW daemon traverses several internal layers,
beginning with the frontend web server that handles the initial HTTP connection.
RGW has historically supported two primary embedded frontends: Civetweb, the legacy
default, and Beast, the modern, high-performance default choice.&lt;/p&gt;
&lt;p&gt;Civetweb operates on a synchronous, thread-per-connection model. In contrast,
Beast is a modern frontend built upon the Boost.Asio C++ library, which facilitates
an asynchronous, event-driven I/O model. Instead of dedicating a thread to each
connection, Beast uses a small pool of worker threads to service thousands of
connections concurrently. This model is significantly more efficient in terms of
CPU and memory utilization, as threads are not blocked waiting for I/O, and the
per-connection memory overhead is drastically reduced. The architectural shift
from Civetweb to Beast was a direct response to the demands of modern
cloud-native applications, which often generate high-concurrency, high-IOPS workloads.&lt;/p&gt;
&lt;h4 id=&quot;frontend-configuration-in-action&quot;&gt;Frontend Configuration in Action &lt;a class=&quot;link-anchor&quot; href=&quot;#frontend-configuration-in-action&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When deploying or modifying RGW services using cephadm, the frontend type and its
settings can be specified directly within the service specification file. Beast is
the default and recommended option for the RGW frontend:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;service_type: rgw
service_id: myrealm.myzone
spec:
  rgw_realm: myrealm
  rgw_zone: myzone
  ssl: true
  rgw_frontend_port: 1234
  rgw_frontend_type: beast
  rgw_frontend_ssl_certificate: ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This YAML snippet illustrates how cephadm deploys an RGW service, specifying the
realm and zone, enabling SSL termination, and explicitly setting the &lt;code&gt;rgw_frontend_type&lt;/code&gt;
to &lt;code&gt;beast&lt;/code&gt; on TCP port 1234.&lt;/p&gt;
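&lt;p&gt;Assuming the specification above is saved to a file (the filename here is illustrative), it can be applied through the cephadm orchestrator:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph orch apply -i rgw-myrealm-myzone.yaml
&lt;/code&gt;&lt;/pre&gt;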
&lt;h3 id=&quot;understanding-rgw-rados-pools&quot;&gt;Understanding RGW RADOS Pools &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-rgw-rados-pools&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For RGW to operate as a truly stateless component, every piece of critical information
(user data, metadata, and logs) must be stored persistently within the RADOS layer. This
persistence is achieved through a set of specialized, dedicated RADOS pools.&lt;/p&gt;
&lt;p&gt;RGW&#39;s multi-pool architecture is a deliberate design choice that allows operators
to physically separate different classes of data onto different hardware tiers,
enabling a highly optimized balance of performance and cost. For example,
latency-sensitive metadata and logs can be placed on fast replicated pools backed
by SSD media, while capacity-heavy object payloads can reside on
erasure-coded pools supported by slower, more cost-effective HDDs or, increasingly,
QLC-class SSDs. NVMe SSDs are preferable to legacy SAS/SATA SSDs as they offer
future-proofing, better density, and better performance for the money; an NVMe
server can actually cost less than a SATA server.&lt;/p&gt;
&lt;h4 id=&quot;key-rgw-pools-and-their-purposes&quot;&gt;Key RGW Pools and Their Purposes &lt;a class=&quot;link-anchor&quot; href=&quot;#key-rgw-pools-and-their-purposes&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pool Name Suffix&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Typical Data Protection&lt;/th&gt;
&lt;th&gt;Recommended Media&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.root&lt;/td&gt;
&lt;td&gt;Stores global RGW configuration (realms, zonegroups, zones)&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.control&lt;/td&gt;
&lt;td&gt;Internal RGW daemon coordination&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.meta&lt;/td&gt;
&lt;td&gt;User and bucket metadata&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.log&lt;/td&gt;
&lt;td&gt;Operation and replication logs&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.buckets.index&lt;/td&gt;
&lt;td&gt;Bucket object listings (omaps). Critical for performance&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.buckets.data&lt;/td&gt;
&lt;td&gt;Main object data payload&lt;/td&gt;
&lt;td&gt;Erasure Coded&lt;/td&gt;
&lt;td&gt;TLC/QLC SSD, HDD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.buckets.non-ec&lt;/td&gt;
&lt;td&gt;Auxiliary pool for operations incompatible with EC&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD / HDD&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When the RGW service first tries to operate on a RADOS pool that does not exist, it
will create that pool with the values of the config options &lt;code&gt;osd_pool_default_pg_num&lt;/code&gt;
and &lt;code&gt;osd_pool_default_pgp_num&lt;/code&gt;. These defaults are sufficient for some pools, but others
(especially those listed in placement_pools for the bucket index and data) will require
additional tuning. Note that when the PG autoscaler is enabled it will adjust the placement
group values for these pools automatically, with an increased &lt;code&gt;BIAS&lt;/code&gt; for &lt;code&gt;.index&lt;/code&gt; pools
so that they are allocated more PGs than their stored data alone would suggest.
For the autoscaler to work best with the constellation of RGW pools, we suggest raising the
following values from their defaults:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# ceph config set global mon_target_pg_per_osd 300
# ceph config set global mon_max_pg_per_osd 600
&lt;/code&gt;&lt;/pre&gt;
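&lt;p&gt;As a quick sanity check, the autoscaler&#39;s view of every pool, including the elevated &lt;code&gt;BIAS&lt;/code&gt; applied to index pools and the resulting PG targets, can be inspected with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# ceph osd pool autoscale-status
# ceph osd pool get default.rgw.buckets.index pg_num
&lt;/code&gt;&lt;/pre&gt;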
&lt;p&gt;Pool names specific to an RGW zone follow the naming convention &lt;code&gt;zone-name.pool-name&lt;/code&gt;.
For example, a zone named &lt;code&gt;us-east&lt;/code&gt; will have the following pools:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.rgw.root
us-east.rgw.control
us-east.rgw.meta
us-east.rgw.log
us-east.rgw.buckets.index
us-east.rgw.buckets.data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The structure of these pools is vital for understanding RGW&#39;s operational mechanics.
Many logical pools are consolidated using RADOS namespaces within the main RADOS
pools (e.g., &lt;code&gt;default.rgw.log&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;We can list RADOS namespaces with a command of the following form. Here we can see
how the &lt;code&gt;rgw.meta&lt;/code&gt; pool contains three different RADOS namespaces:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# rados ls -p default.rgw.meta --all | awk &#39;{ print $1 }&#39; | sort -u
root
users.keys
users.uid
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pools with their namespaces are exposed when querying the RGW zone configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin zone get --rgw-zone default
{
    &amp;quot;id&amp;quot;: &amp;quot;d9c4f708-5598-4c44-9d36-849552a08c4d&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;default&amp;quot;,
    &amp;quot;domain_root&amp;quot;: &amp;quot;default.rgw.meta:root&amp;quot;,
    &amp;quot;control_pool&amp;quot;: &amp;quot;default.rgw.control&amp;quot;,
    &amp;quot;gc_pool&amp;quot;: &amp;quot;default.rgw.log:gc&amp;quot;,
    &amp;quot;lc_pool&amp;quot;: &amp;quot;default.rgw.log:lc&amp;quot;,
    &amp;quot;log_pool&amp;quot;: &amp;quot;default.rgw.log&amp;quot;,
    &amp;quot;intent_log_pool&amp;quot;: &amp;quot;default.rgw.log:intent&amp;quot;,
    &amp;quot;usage_log_pool&amp;quot;: &amp;quot;default.rgw.log:usage&amp;quot;,
    &amp;quot;roles_pool&amp;quot;: &amp;quot;default.rgw.meta:roles&amp;quot;,
    &amp;quot;reshard_pool&amp;quot;: &amp;quot;default.rgw.log:reshard&amp;quot;,
    &amp;quot;user_keys_pool&amp;quot;: &amp;quot;default.rgw.meta:users.keys&amp;quot;,
    &amp;quot;user_email_pool&amp;quot;: &amp;quot;default.rgw.meta:users.email&amp;quot;,
    &amp;quot;user_swift_pool&amp;quot;: &amp;quot;default.rgw.meta:users.swift&amp;quot;,
    &amp;quot;user_uid_pool&amp;quot;: &amp;quot;default.rgw.meta:users.uid&amp;quot;,
    &amp;quot;otp_pool&amp;quot;: &amp;quot;default.rgw.otp&amp;quot;,
   ...
    &amp;quot;placement_pools&amp;quot;: [
        {
            &amp;quot;key&amp;quot;: &amp;quot;default-placement&amp;quot;,
            &amp;quot;val&amp;quot;: {
                &amp;quot;index_pool&amp;quot;: &amp;quot;default.rgw.buckets.index&amp;quot;,
                &amp;quot;storage_classes&amp;quot;: {
                    &amp;quot;STANDARD&amp;quot;: {
                        &amp;quot;data_pool&amp;quot;: &amp;quot;default.rgw.buckets.data&amp;quot;
                    }
                },
                &amp;quot;data_extra_pool&amp;quot;: &amp;quot;default.rgw.buckets.non-ec&amp;quot;,
                &amp;quot;index_type&amp;quot;: 0
            }
        }
    ],
    &amp;quot;realm_id&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;notif_pool&amp;quot;: &amp;quot;default.rgw.log:notif&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This JSON output details the configuration for the default zone.
Notice how many different logical functions (GC, LC, usage logs)
are mapped to the RADOS pool &lt;code&gt;default.rgw.log&lt;/code&gt; but are separated using RADOS Namespaces (e.g., &lt;code&gt;default.rgw.log:gc&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id=&quot;a-detailed-overview-of-the-bucket-index-and-sharding&quot;&gt;A Detailed Overview of the Bucket Index and Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#a-detailed-overview-of-the-bucket-index-and-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The ability to list the contents of a bucket is fundamental to object storage.
RGW implements this using a dedicated structure called the Bucket Index,
which is responsible for listing bucket content, maintaining a journal
for versioned operations, storing quota metadata, and serving as a log
for multi-zone synchronization.&lt;/p&gt;
&lt;h4 id=&quot;the-bucket-index-and-omaps&quot;&gt;The Bucket Index and OMAPs &lt;a class=&quot;link-anchor&quot; href=&quot;#the-bucket-index-and-omaps&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The bucket index relies on a special feature of RADOS objects called the Object
Map (OMAP). An OMAP is a key-value store associated with a RADOS object, similar
in concept to Extended Attributes on a POSIX file. For each bucket, RGW creates
one or more dedicated index objects in the &lt;code&gt;.rgw.buckets.index&lt;/code&gt; pool. The listing
information for the objects within that bucket is stored within the OMAP of these index objects.&lt;/p&gt;
&lt;p&gt;Crucially, the performance of the bucket index relies entirely on the underlying
key-value database: OMAPs are physically stored within the RocksDB database residing
on the OSD&#39;s DB partition. This mandates that index pools like &lt;code&gt;default.rgw.buckets.index&lt;/code&gt;
must currently use a replicated data protection scheme, as OMAP operations are not
compatible with erasure-coded pools. Investing in fast flash devices (SSDs, ideally NVMe)
for the OSD&#39;s DB partition is paramount for bucket listing performance.  RGW index
pools may select a CRUSH rule that places them on pure SSD OSDs, or on hybrid OSDs
with the DB offloaded to SSDs.  Since omaps are purely in the DB portion of a given OSD,
either strategy suffices.&lt;/p&gt;
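&lt;p&gt;As a hedged sketch of the pure-SSD approach (the rule name is arbitrary and the &lt;code&gt;ssd&lt;/code&gt; device class is assumed to exist on the cluster), a replicated CRUSH rule restricted to flash OSDs can be created and assigned to the index pool:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a replicated rule that only selects OSDs carrying the ssd device class
$ ceph osd crush rule create-replicated rgw-index-ssd default host ssd
# Point the bucket index pool at the new rule
$ ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd
&lt;/code&gt;&lt;/pre&gt;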
&lt;h4 id=&quot;tuning-the-index-pool-for-performance&quot;&gt;Tuning the Index Pool for Performance &lt;a class=&quot;link-anchor&quot; href=&quot;#tuning-the-index-pool-for-performance&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;While fast storage for OSD DBs is critical, the distribution of the bucket index across
the cluster is equally essential. This is controlled by the Placement Group (PG) count
of the index pool. Poor PG tuning is a common cause of poor listing performance, especially
in large clusters.&lt;/p&gt;
&lt;h5 id=&quot;placement-group-(pg)-count-and-parallelism&quot;&gt;Placement Group (PG) Count and Parallelism &lt;a class=&quot;link-anchor&quot; href=&quot;#placement-group-(pg)-count-and-parallelism&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Each PG is mapped to a set of OSDs, with one acting as the primary. When RGW performs a bucket
listing, it sends parallel read requests to the OMAPs of many different bucket index shard
objects. A higher PG count for the index pool distributes these shards across a greater
number of primary OSDs. This increases the parallelism of the listing operation, as more
physical devices can concurrently service the I/O requests. A low PG count can create a
bottleneck where many requests are funneled to just a few OSDs, which then become saturated.&lt;/p&gt;
&lt;p&gt;We suggest that each index pool have at least one PG for every OSD on which it is placed.
When using the PG autoscaler, index pools should automatically have a BIAS value of 4 so
that they receive a higher number of PGs. See above for recommendations on central configuration
settings to allow the autoscaler to provision enough PGs to index pools.&lt;/p&gt;
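&lt;p&gt;If the autoscaler is not in use, the index pool&#39;s PG count can instead be raised manually; the value below is purely illustrative and should be sized to provide at least one PG per participating OSD, as suggested above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd pool set default.rgw.buckets.index pg_num 128
&lt;/code&gt;&lt;/pre&gt;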
&lt;h4 id=&quot;visualizing-the-bucket-index-log&quot;&gt;Visualizing the Bucket Index Log &lt;a class=&quot;link-anchor&quot; href=&quot;#visualizing-the-bucket-index-log&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;First, we confirm the existence and Pool ID of the bucket index pool:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd lspools | grep default.rgw.buckets.index
6 default.rgw.buckets.index
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we see that RADOS pool with ID &lt;code&gt;6&lt;/code&gt; is the dedicated index pool for the &lt;code&gt;default&lt;/code&gt; zone.&lt;/p&gt;
&lt;p&gt;Now, let’s get a bucket name to use as an example: &lt;code&gt;bucket1&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket list | grep bucket1
    &amp;quot;bucket1&amp;quot;,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can examine the index entries for a specific bucket from the &lt;code&gt;default&lt;/code&gt; zone, &lt;code&gt;bucket1&lt;/code&gt;, using &lt;code&gt;radosgw-admin&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bi list --bucket bucket1
[
    {
        &amp;quot;type&amp;quot;: &amp;quot;plain&amp;quot;,
        &amp;quot;idx&amp;quot;: &amp;quot;hosts5&amp;quot;,
        &amp;quot;entry&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;hosts5&amp;quot;,
            &amp;quot;instance&amp;quot;: &amp;quot;&amp;quot;,
            &amp;quot;ver&amp;quot;: {
                &amp;quot;pool&amp;quot;: 16,
                &amp;quot;epoch&amp;quot;: 3
            },
            &amp;quot;locator&amp;quot;: &amp;quot;&amp;quot;,
            &amp;quot;exists&amp;quot;: &amp;quot;true&amp;quot;,
            &amp;quot;meta&amp;quot;: {
                &amp;quot;category&amp;quot;: 1,
                &amp;quot;size&amp;quot;: 4066,
                &amp;quot;mtime&amp;quot;: &amp;quot;2022-12-14T16:27:02.562603Z&amp;quot;,
                &amp;quot;etag&amp;quot;: &amp;quot;71ad37de1d442f5ee2597a28fe07461e&amp;quot;,
                &amp;quot;storage_class&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;owner&amp;quot;: &amp;quot;test&amp;quot;,
                &amp;quot;owner_display_name&amp;quot;: &amp;quot;test&amp;quot;,
                &amp;quot;content_type&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;accounted_size&amp;quot;: 4066,
                &amp;quot;user_data&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;appendable&amp;quot;: &amp;quot;false&amp;quot;
            },
            &amp;quot;tag&amp;quot;: &amp;quot;_iDrB7rnO7jqyyQ2po8bwqE0vL_Al6ZH&amp;quot;,
            &amp;quot;flags&amp;quot;: 0,
            &amp;quot;pending_map&amp;quot;: [],
            &amp;quot;versioned_epoch&amp;quot;: 0
        }
    }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin bi list&lt;/code&gt; output displays the stored metadata for
an S3 object (&lt;code&gt;hosts5&lt;/code&gt;), including size, modification time (mtime), and ETag.&lt;/p&gt;
&lt;h4 id=&quot;the-scalability-enabler%3A-bucket-sharding&quot;&gt;The Scalability Enabler: Bucket Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#the-scalability-enabler%3A-bucket-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A significant performance problem arises when a bucket index grows very large.
If a bucket&#39;s index is stored in a single RADOS object, only one
operation can be performed at a time. This serialization limits parallelism and
can become a severe bottleneck for high-throughput write workloads.&lt;/p&gt;
&lt;p&gt;To circumvent this limitation, RGW employs Bucket Index Sharding. This mechanism
divides the bucket index into multiple parts, with each shard stored on a separate
RADOS object within the index pool. When an object is written, the update is
directed to a specific shard determined by a hash of the object&#39;s name. This
allows multiple operations to occur concurrently across different Placement
Groups (PGs) and OSDs, improving overall scalability. The number of shards
should be a prime number, and is configurable with the &lt;code&gt;bucket_index_max_shards&lt;/code&gt;
config option, which defaults to &lt;code&gt;11&lt;/code&gt;. We can retrieve relevant metadata about
a bucket and its objects, such as the shard count, bucket usage, quota,
versioning, object lock, and owner, using the
&lt;code&gt;radosgw-admin bucket stats&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket stats --bucket bucket1 | grep shards
    &amp;quot;num_shards&amp;quot;: 11,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bucket Index pool for the default zone:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd lspools | grep default.rgw.buckets.index
6 default.rgw.buckets.index
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can visually confirm the existence of these shards as discrete OMAP RADOS objects:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index ls
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.9
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.0
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.10
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.1
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.7
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.8
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.6
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.5
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.4
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each &lt;code&gt;.dir&lt;/code&gt; RADOS object listed here is a separate bucket index shard. In this
example, 11 shards are visible, matching the default number of shards per bucket.&lt;/p&gt;
&lt;p&gt;At bucket creation time, the initial number of shards is set
by the &lt;code&gt;bucket_index_max_shards&lt;/code&gt; option at the zonegroup level, and it is used
for all buckets. If a different number of shards is required for a specific bucket,
it is possible to change it.&lt;/p&gt;
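&lt;p&gt;For example (the target shard count below is illustrative; prime values are suggested above), an individual bucket can be resharded manually:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket reshard --bucket bucket1 --num-shards 23
&lt;/code&gt;&lt;/pre&gt;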
&lt;p&gt;Note: we recommend a maximum of 102,400 S3 objects per bucket index shard.&lt;/p&gt;
&lt;p&gt;We can get the marker for a bucket using the stats command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket stats --bucket bucket1 | grep marker
    &amp;quot;marker&amp;quot;: &amp;quot;7fb0a3df-9553-4a76-938d-d23711e67677.34162.1&amp;quot;,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we know that the &lt;em&gt;marker&lt;/em&gt; for &lt;code&gt;bucket1&lt;/code&gt; is &lt;code&gt;7fb0a3df-9553-4a76-938d-d23711e67677.34162.1&lt;/code&gt;.
Let’s upload an object named &lt;code&gt;file1&lt;/code&gt; to &lt;code&gt;bucket1&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://ceph-node02:8080 s3 cp /etc/hosts s3://bucket1/file1 --region default
upload: ../etc/hosts to s3://bucket1/file1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s investigate the bucket index for this bucket at the RADOS level. By
listing the omapkeys on the bucket index object, we can see a key called &lt;code&gt;file1&lt;/code&gt;,
which matches the uploaded object name. Here we run &lt;code&gt;listomapkeys&lt;/code&gt; on one of
the 11 available shard objects, in this case shard 2. As mentioned before, objects
will be spread among the different shards during creation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index listomapkeys .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
file1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we check the values, we can see that the key/value entry in the bucket index shard &lt;code&gt;2&lt;/code&gt;
omap object for &lt;code&gt;bucket1&lt;/code&gt; is 217 bytes in size. In the hex dump we see info including the object name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index listomapvals .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
file1
value (217 bytes) :
00000000  08 03 d3 00 00 00 05 00  00 00 66 69 6c 65 31 01  |..........file1.|
00000010  00 00 00 00 00 00 00 01  07 03 5a 00 00 00 01 32  |..........Z....2|
00000020  05 00 00 00 00 00 00 4b  ab a1 63 95 74 ba 04 20  |.......K..c.t.. |

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we add more S3 objects to our bucket, we see new key/value entries for each
added to the shards available for the bucket. In this example &lt;code&gt;file1&lt;/code&gt;, &lt;code&gt;file2&lt;/code&gt;, &lt;code&gt;file4&lt;/code&gt;, and &lt;code&gt;file10&lt;/code&gt;
landed in shard &lt;code&gt;2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index listomapkeys .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
file1
file2
file4
file10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can confirm the placement of a specific shard, shard &lt;code&gt;2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd map default.rgw.buckets.index .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
osdmap e90 pool &#39;default.rgw.buckets.index&#39; (9) object &#39;.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2&#39; -&amp;gt; pg 9.6fa75bc9 (9.9) -&amp;gt; up ([1,2], p5) acting ([1,2], p5)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This output shows that the index shard is replicated across the cluster and lives on specific OSDs. Distributing the index across multiple PGs (and therefore OSDs) enables parallelism.&lt;/p&gt;
&lt;h4 id=&quot;the-zero-byte-mystery%3A-why-the-index-pool-appears-empty&quot;&gt;The Zero-Byte Mystery: Why the Index Pool Appears Empty &lt;a class=&quot;link-anchor&quot; href=&quot;#the-zero-byte-mystery%3A-why-the-index-pool-appears-empty&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When you query the space usage of the bucket index pool, the result often surprises
engineers unfamiliar with Ceph&#39;s OMAP architecture:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados df -p default.rgw.buckets.index
POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS       RD  WR_OPS      WR  USED COMPR  UNDER COMPR
default.rgw.buckets.index   0 B       11       0      33                   0        0         0     208  207 KiB      41  20 KiB         0 B          0 B
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even inspecting a single shard object (shard 2) shows a size of zero:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index stat .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2 mtime 2022-12-20T07:32:11.000000-0500, size 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Despite containing 11 RADOS objects (shards), the pool reports 0 bytes used.
This is because bucket index listing data is stored entirely as OMAP entries
within the RocksDB database of each OSD, not as payload data in the RADOS
object itself. This confirms why leveraging fast flash media (SSDs) for at least
the OSD DB partition is essential for maximizing bucket index performance.&lt;/p&gt;
&lt;h4 id=&quot;managing-index-growth-with-dynamic-bucket-resharding&quot;&gt;Managing Index Growth with Dynamic Bucket Resharding &lt;a class=&quot;link-anchor&quot; href=&quot;#managing-index-growth-with-dynamic-bucket-resharding&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;As a bucket scales to hundreds of thousands or millions of S3 objects, its
index can become a performance bottleneck. By default, a single shard can
become &amp;quot;hot&amp;quot; as it accumulates too many entries. The threshold for the number of
S3 objects per shard is configurable, defaulting to 100,000. Very large numbers of
S3 objects per bucket reintroduce the serialization problem that sharding was
designed to solve. To combat this, RGW features an advanced, automated mechanism
known as Dynamic Bucket Resharding (DBR).&lt;/p&gt;
&lt;p&gt;DBR is a background process that continuously monitors the number of entries in
each bucket index shard. When a shard grows beyond its configured threshold, DBR
automatically and online triggers a resharding operation. This process creates a
new set of index objects with a greater number of shards and then safely migrates
the existing index entries from the old, smaller layout to the new, larger one.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
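&lt;p&gt;To observe DBR in action, the reshard queue and the per-bucket status can be inspected with standard &lt;code&gt;radosgw-admin&lt;/code&gt; subcommands (the bucket name is the example used throughout this post):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# List buckets currently queued for resharding
$ radosgw-admin reshard list
# Show the resharding status of a specific bucket
$ radosgw-admin reshard status --bucket bucket1
&lt;/code&gt;&lt;/pre&gt;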
&lt;h5 id=&quot;the-evolution-of-online-resharding%3A-minimizing-impact&quot;&gt;The Evolution of Online Resharding: Minimizing Impact &lt;a class=&quot;link-anchor&quot; href=&quot;#the-evolution-of-online-resharding%3A-minimizing-impact&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Historically, the resharding operation required temporarily pausing write I/O to the bucket.
While read operations remained unaffected, this write pause could be noticeable and
painful on very active workloads.&lt;/p&gt;
&lt;p&gt;However, a significant enhancement coming in a forthcoming Tentacle release drastically minimizes
this write freeze. The new implementation makes the resharding process nearly transparent,
allowing writes to proceed with minimal interruption. This improvement is a vital step forward,
making dynamic resharding a seamless, production-safe feature for even the most demanding environments.&lt;/p&gt;
&lt;h5 id=&quot;not-just-growing%2C-but-shrinking%3A-the-power-of-shard-merging&quot;&gt;Not Just Growing, but Shrinking: The Power of Shard Merging &lt;a class=&quot;link-anchor&quot; href=&quot;#not-just-growing%2C-but-shrinking%3A-the-power-of-shard-merging&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Dynamic resharding is not limited to just scaling up. Consider a scenario in which
a bucket that once held millions of objects has a massive number of them deleted.
The bucket now contains many sparsely populated or even empty index shards. This is
inefficient, as listing operations must still check every shard, adding unnecessary overhead.&lt;/p&gt;
&lt;p&gt;To address this, the DBR mechanism was enhanced to support shard merging as well.
As detailed in the Ceph documentation and development
trackers (e.g., &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2135354&quot;&gt;BZ#2135354&lt;/a&gt;),
if the object count in a bucket drops significantly, DBR can trigger a &amp;quot;downsizing&amp;quot;
resharding operation. It will migrate the entries from many sparse shards into a new,
smaller, and more densely packed set of index objects.&lt;/p&gt;
&lt;p&gt;While DBR is a powerful automated feature, for scenarios where you know a bucket will
be enormous from its inception, a standard best practice remains to pre-shard the
bucket at creation time. By setting an appropriate initial number of shards, you can
avoid the first dynamic resharding event altogether, ensuring optimal performance from
the very first object written.&lt;/p&gt;
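&lt;p&gt;One hedged way to pre-shard (the value and config section below are illustrative and depend on how your RGW daemons are named) is to raise the default shard count that newly created buckets inherit via the &lt;code&gt;rgw_override_bucket_index_max_shards&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config set client.rgw rgw_override_bucket_index_max_shards 127
&lt;/code&gt;&lt;/pre&gt;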
&lt;h5 id=&quot;the-future-is-ordered%3A-a-glimpse-into-in-order-sharding&quot;&gt;The Future is Ordered: A Glimpse into In-Order Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#the-future-is-ordered%3A-a-glimpse-into-in-order-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Currently, RGW&#39;s hashed sharding is optimized for write distribution, but it
presents a challenge for listing objects in alphabetical order. To fulfill a
paginated list request, RGW must perform a &amp;quot;scatter-gather&amp;quot; operation, querying
every single shard and sorting the combined results. This can become a bottleneck
for buckets with a very large number of shards.&lt;/p&gt;
&lt;p&gt;To solve this, a significant new feature known as in-order sharding (or ordered
bucket listing) is in development. This upcoming evolution will change the
sharding logic to place objects into shards based on their lexicographical
name rather than a hash.&lt;/p&gt;
&lt;p&gt;The impact of this change will be transformative. Instead of querying all shards,
a request to list objects will be directed to the specific shard(s) that contain
the requested alphabetical range. This will make paginated listing operations
dramatically faster and more efficient, particularly for workloads that rely
heavily on browsing or iterating through object keys.&lt;/p&gt;
&lt;p&gt;By combining the automated scaling of Dynamic Bucket Resharding with the listing
efficiency of in-order sharding, Ceph RGW is on a clear path to providing virtually
limitless and performant scalability within a single bucket, catering to the most
demanding data lake and AI/ML use cases of the future.&lt;/p&gt;
&lt;h3 id=&quot;conclusion%3A-the-engine-of-scalability&quot;&gt;Conclusion: The Engine of Scalability &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-the-engine-of-scalability&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;So far, we have journeyed through the high-performance path of a client request,
from the initial connection at the Beast frontend, through the specialized RADOS
pools, and deep into the intricate mechanics of the bucket index. You now
understand how OMAPs form the backbone of object listings and how Dynamic
Bucket Resharding acts as the engine of scalability, allowing a single bucket
to grow to billions of objects while maintaining performance. We&#39;ve uncovered
the core mechanisms that handle object discovery and listing at massive scale.&lt;/p&gt;
&lt;p&gt;However, our deep dive has so far focused on the index, which holds the pointers
to the data. But what about the data itself? And what about the crucial control
plane metadata that defines the users, accounts, and rules governing the entire system?&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt; of our series,
we will answer these questions. We&#39;ll shift our focus
to explore the elegant head/tail model of RGW&#39;s data layout, examine the system&#39;s
core metadata, and uncover the robust background processes that manage data
throughout its entire lifecycle.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Object Storage Deep Dive Series. Part 2</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-deep-dive-2/" />
    <updated>2025-10-14T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-deep-dive-2/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-deep-dive-2/">&lt;h2 id=&quot;a-deep-dive-into-ceph-rgw%3A-data-path%2C-sharding%2C-and-automated-management&quot;&gt;A Deep Dive into Ceph RGW: Data Path, Sharding, and Automated Management &lt;a class=&quot;link-anchor&quot; href=&quot;#a-deep-dive-into-ceph-rgw%3A-data-path%2C-sharding%2C-and-automated-management&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In the &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-1&quot;&gt;first part of this deep dive&lt;/a&gt;,
we dissected the high-performance request path within the Ceph RGW. We covered its
stateless frontends, foundational RADOS pools, and the critical bucket index,
revealing how dynamic sharding enables virtually limitless scalability for
object listings within a single bucket.&lt;/p&gt;
&lt;p&gt;We established how RGW efficiently locates and lists objects at scale. Now, we
shift our focus from the index to the objects themselves and the broader system
that manages them. In this second deep dive, we will explore the control plane by
examining the RGW metadata layout. We will then uncover how S3 objects are physically
stored using the head/tail data model and conclude with a look at the critical
background processes, Garbage Collection, and Lifecycle Management, that automate
data governance.&lt;/p&gt;
&lt;h3 id=&quot;rgw-metadata-layout%3A-the-control-plane&#39;s-blueprint&quot;&gt;RGW Metadata Layout: The Control Plane&#39;s Blueprint &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-metadata-layout%3A-the-control-plane&#39;s-blueprint&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Just as the data for a single S3 object is meticulously organized across RADOS,
the entire state of the RGW system, its users, buckets, and policies, is also
durably stored within dedicated RADOS pools. This design is fundamental to the
stateless nature of RGW daemons; all control plane information lives within the
cluster itself, not on the gateways. This metadata is primarily housed in
the &lt;code&gt;.rgw.meta&lt;/code&gt; pool, while operational logs for processes like garbage
collection and lifecycle management reside in the &lt;code&gt;.rgw.log&lt;/code&gt; pool.&lt;/p&gt;
&lt;p&gt;These metadata objects are stored in an internal binary format. For this reason,
it is critical to use the &lt;code&gt;radosgw-admin&lt;/code&gt; command-line tool for administration
and interaction. This utility reliably decodes the binary records into human-readable
JSON and ensures that any modifications are performed safely.&lt;/p&gt;
&lt;p&gt;Note: Never attempt to modify objects in the &lt;code&gt;.rgw.meta&lt;/code&gt; pool directly with the &lt;code&gt;rados&lt;/code&gt; tool.&lt;/p&gt;
&lt;h5 id=&quot;key-metadata-categories&quot;&gt;Key Metadata Categories &lt;a class=&quot;link-anchor&quot; href=&quot;#key-metadata-categories&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The &lt;code&gt;.rgw.meta&lt;/code&gt; pool uses RADOS namespaces to separate different types of
information logically. When you query the metadata, you will encounter several
top-level categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;user&lt;/code&gt;: Stores S3 user records, including access keys, capabilities, usage quotas, and contact information including email.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;bucket&lt;/code&gt;: The high-level named bucket record. This contains essential information including the bucket owner, its placement policy (which zone it belongs to), and various flags.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;bucket.instance&lt;/code&gt;: Represents the concrete, physical instance of a bucket. This record tracks the bucket&#39;s unique ID, shard count for the index, versioning status, and creation timestamps. A single bucket name can have multiple instances over its lifetime, such as when it is deleted and recreated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;roles&lt;/code&gt;: Contains STS (Security Token Service) and IAM role definitions used by the policy evaluation engine to grant temporary credentials.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;group&lt;/code&gt;: Defines user groups, which can be used for administrative operations or policy management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;topic&lt;/code&gt;: Stores configuration for S3 bucket event notifications.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;otp&lt;/code&gt;: Holds one-time password credentials for multi-factor authentication.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;account&lt;/code&gt;: Used for Swift account metadata if the Swift API is enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;inspecting-metadata-with-radosgw-admin&quot;&gt;Inspecting Metadata with radosgw-admin &lt;a class=&quot;link-anchor&quot; href=&quot;#inspecting-metadata-with-radosgw-admin&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin&lt;/code&gt; tool provides a safe and structured way to explore this
control plane data. First, you can list all available metadata categories:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin metadata list
[
    &amp;quot;user&amp;quot;,
    &amp;quot;bucket&amp;quot;,
    &amp;quot;bucket.instance&amp;quot;,
    &amp;quot;roles&amp;quot;,
    ...
]
$ radosgw-admin metadata list account
[
    &amp;quot;RGW42603947660038067&amp;quot;,
    &amp;quot;RGW46950437120753278&amp;quot;,
    &amp;quot;RGW40572530565246530&amp;quot;,
    &amp;quot;RGW66892093834478914&amp;quot;,
    &amp;quot;RGW63384910224424377&amp;quot;,
    &amp;quot;RGW94705908964376531&amp;quot;,
    &amp;quot;RGW25531238860968914&amp;quot;
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, list the specific keys within a category, such as &lt;code&gt;bucket&lt;/code&gt; or &lt;code&gt;bucket.instance&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# List all bucket names
$ radosgw-admin metadata list bucket | grep bucket1
   &amp;quot;bucket1&amp;quot;,

# List all concrete bucket instances
$ radosgw-admin metadata list bucket.instance | grep bucket1
&amp;quot;bucket1:7fb0a3df-9553-4a76-938d-d23711e67677.34162.1&amp;quot;,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, here is an example of retrieving and decoding a specific record using its key.
Piping the output to &lt;code&gt;jq&lt;/code&gt; formats the JSON output for readability:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Get bucket metadata by its name
$ radosgw-admin metadata get bucket:bucket1 | jq .

# Get a user record by their UID
$ radosgw-admin metadata get user:my-user-id | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&#39;s worth mentioning that &lt;code&gt;radosgw-admin&lt;/code&gt; also provides dedicated
subcommands for interacting with this metadata directly, for
example &lt;code&gt;radosgw-admin user&lt;/code&gt;, &lt;code&gt;radosgw-admin account&lt;/code&gt;, and &lt;code&gt;radosgw-admin bucket&lt;/code&gt;.&lt;/p&gt;
&lt;h5 id=&quot;linking-metadata-to-usage&quot;&gt;Linking Metadata to Usage &lt;a class=&quot;link-anchor&quot; href=&quot;#linking-metadata-to-usage&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;To bridge the gap between abstract metadata and real-world usage, &lt;code&gt;radosgw-admin&lt;/code&gt;
offers commands that aggregate this information:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Get detailed stats for a bucket, including its shard count, object count, and size
$ radosgw-admin bucket stats --bucket &amp;lt;BUCKET_NAME&amp;gt; | jq .

# Get the complete metadata for a single object as RGW sees it
$ radosgw-admin object stat --bucket &amp;lt;BUCKET_NAME&amp;gt; --object &amp;lt;OBJECT_KEY&amp;gt; | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This &lt;code&gt;object stat&lt;/code&gt; command is handy, as it shows you the manifest, placement
information, and all system attributes for a specific S3 object, providing a
complete view from the gateway&#39;s perspective.&lt;/p&gt;
&lt;h3 id=&quot;rgw-data-layout%3A-the-head%2Ftail-object-model&quot;&gt;RGW Data Layout: The Head/Tail Object Model &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-data-layout%3A-the-head%2Ftail-object-model&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A single logical S3 object often consists of several physical RADOS objects. RGW
employs a flexible head/tail object model that enables optimizations for various
file sizes and complex operations including MultiPart Upload (MPU).&lt;/p&gt;
&lt;p&gt;The primary RADOS object associated with any S3 object is the head object. Its
RADOS object name is typically formed by concatenating the bucket&#39;s internal
marker with the object&#39;s key, separated by an
underscore, for example &lt;code&gt;&amp;lt;bucket_marker&amp;gt;_&amp;lt;object_key&amp;gt;&lt;/code&gt;. The head object serves
two primary purposes. First, it is the authoritative store for all object-level
metadata, including ACLs, HTTP content type, ETag, and any user-defined metadata.
This information is stored efficiently as RADOS extended attributes (xattrs) on
the head object. Second, for small objects (by default, those up to the
configurable &lt;code&gt;rgw_max_chunk_size&lt;/code&gt;), the entire data payload of the S3 object
is stored directly within the data portion of the head object. This is a crucial
performance optimization, as it allows both the data and its associated metadata
to be written to the cluster in a single, atomic RADOS operation, minimizing I/O
amplification and latency for small-file workloads.&lt;/p&gt;
&lt;p&gt;For objects that exceed this inline data size, the head object&#39;s data payload is
used to store a manifest. This manifest is a metadata structure that describes
how the rest of the object&#39;s data is physically laid out across the cluster.
It contains an ordered list of the other RADOS objects, known as tail objects,
that hold the remaining data chunks. Each entry in the manifest specifies the
name of a tail object, its size, and its logical offset within the complete S3 object.&lt;/p&gt;
&lt;p&gt;If the object size exceeds the &lt;code&gt;rgw_max_chunk_size&lt;/code&gt; (default: 4MB), the data
is striped across multiple RADOS objects: a head object (containing only
metadata/manifest) and one or more tail objects (holding the bulk data).&lt;/p&gt;
&lt;p&gt;We can retrieve the default striping size, which governs when data splitting occurs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config get mon rgw_obj_stripe_size
4194304
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This output confirms the default RGW object stripe size is 4,194,304 bytes (4MB).&lt;/p&gt;
&lt;p&gt;The interaction between the client-defined part size and RGW&#39;s internal striping
size (&lt;code&gt;rgw_obj_stripe_size&lt;/code&gt;) can result in the creation of specifically named
tail objects. If a client uploads a part (e.g., 5 MiB) that is larger than the
RGW stripe size (e.g., 4 MiB), RGW will automatically stripe that part across
multiple RADOS objects. For instance, it might create a 4 MiB object named with
a &lt;code&gt;__multipart&lt;/code&gt; prefix if MPU is used, and a 1 MiB object named with
a &lt;code&gt;__shadow&lt;/code&gt; prefix to hold the remainder. These are simply tail objects whose
names follow a specific convention, and both will be referenced correctly in the final manifest.&lt;/p&gt;
&lt;p&gt;Here, we observe the head object for a large file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://ceph-node02:8080 s3 cp awscliv2.zip s3://bucket1/bigfile
$ aws --endpoint=http://ceph-node02:8080 s3 ls s3://bucket1/bigfile
2022-12-20 15:10:16   20971520 bigfile
$ rados -p default.rgw.buckets.data ls | grep bigfile$
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_bigfile
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the head object for &lt;code&gt;bigfile&lt;/code&gt;. It contains the object&#39;s xattrs metadata,
including the &lt;code&gt;user.rgw.manifest&lt;/code&gt;, which lists the locations of all tail objects.&lt;/p&gt;
&lt;p&gt;The head object stores its metadata efficiently as extended attributes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.data listxattr 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_bigfile
user.rgw.acl
user.rgw.content_type
user.rgw.etag
user.rgw.idtag
user.rgw.manifest
user.rgw.pg_ver
user.rgw.source_zone
user.rgw.tail_tag
user.rgw.x-amz-content-sha256
user.rgw.x-amz-date
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The listed extended attributes (xattr) confirm that the head object stores critical object
metadata, notably &lt;code&gt;user.rgw.manifest&lt;/code&gt;, which describes how the large object&#39;s data
payload is split into tail objects.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin object stat&lt;/code&gt; command can show the object’s manifest
striping/parts via RGW metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin object stat --bucket BUCKET --object OBJECT | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tail objects in our example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# rados -p default.rgw.buckets.data ls | grep shadow_bigfile
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_bigfile.2~E_PYNwiBq0la0EuZcCOY30KgmRrf1pV.1_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_bigfile.2~E_PYNwiBq0la0EuZcCOY30KgmRrf1pV.2_1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The tail objects typically hold 4MB chunks of data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_bigfile.2_E_PYNwiBq0la0EuZcCOY30KgmRrf1pV.1_1 mtime 2022-12-20T15:10:16.000000-0500, size 4194304
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;images/img1.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;s3-multipart-upload%3A-an-atomic-commit-operation&quot;&gt;S3 Multipart Upload: An Atomic Commit Operation &lt;a class=&quot;link-anchor&quot; href=&quot;#s3-multipart-upload%3A-an-atomic-commit-operation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The S3 Multipart Upload (MPU) feature is designed for efficiently uploading
large objects by dividing them into smaller parts that can be uploaded
independently and in parallel. RGW implements this elegantly as a metadata-only
commit operation.&lt;/p&gt;
&lt;p&gt;The workflow involves three key steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multipart Upload Initiation&lt;/em&gt;: A request is sent to get a unique Upload ID.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Parts Upload&lt;/em&gt;: Individual parts are uploaded using both the Upload ID and a unique Part ID. Each part is stored as a distinct, temporary RADOS object. If a part size exceeds the RGW stripe size (default 4MB), it is internally segmented.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multipart Upload Completion (Atomic Commit)&lt;/em&gt;: When all parts are uploaded, the client sends a completion request. RGW avoids costly data copying. Instead, it creates the final head object and populates its internal manifest with pointers to the temporary RADOS objects that constitute the parts. This results in near-instantaneous completion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design makes the completion of a large object upload nearly instantaneous
from the cluster&#39;s perspective. The head object itself contains no user data
in this case, which is why low-level tools will report its size as 0 bytes;
its payload is the manifest, not the object content.&lt;/p&gt;
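&lt;p&gt;A hedged sketch of the same three steps using the low-level &lt;code&gt;aws s3api&lt;/code&gt; interface (the bucket, key, upload ID, and file names are illustrative; the high-level &lt;code&gt;aws s3 cp&lt;/code&gt; command used below drives this flow automatically):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# 1. Initiate: returns an UploadId
$ aws --endpoint=http://ceph-node02:8080 s3api create-multipart-upload \
      --bucket bucket1 --key bigupload
# 2. Upload each part, quoting the UploadId; each call returns an ETag
$ aws --endpoint=http://ceph-node02:8080 s3api upload-part \
      --bucket bucket1 --key bigupload --part-number 1 \
      --upload-id UPLOADID --body part1.bin
# 3. Complete: RGW writes the manifest into the head object, no data is copied
$ aws --endpoint=http://ceph-node02:8080 s3api complete-multipart-upload \
      --bucket bucket1 --key bigupload --upload-id UPLOADID \
      --multipart-upload file://parts.json   # parts.json lists each PartNumber/ETag pair
&lt;/code&gt;&lt;/pre&gt;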
&lt;h4 id=&quot;mpu-structure-in-rados&quot;&gt;MPU Structure in RADOS &lt;a class=&quot;link-anchor&quot; href=&quot;#mpu-structure-in-rados&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a file is uploaded in chunks (e.g., 5MB chunks) and the RGW stripe width
is 4 MiB, RGW handles the internal splitting: it takes the first 4 MiB to create a
&amp;quot;multipart&amp;quot; RADOS object and the remaining 1 MiB to create a &amp;quot;shadow&amp;quot; tail RADOS object.&lt;/p&gt;
&lt;p&gt;Let’s check it out with an example. We will set the client chunk size to 5 MiB, and upload a 20 MiB file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws configure set default.s3.multipart_chunksize 5MB
$ aws --endpoint=http://ceph-node02:8080 s3 cp text.txt s3://bucket1/5chuncks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We send 5 MiB chunks to RGW, and RGW has a stripe width of 4 MiB, which means
RGW will take the first 4 MiB and create a &amp;quot;multipart&amp;quot; RADOS object and then
a 1 MiB &amp;quot;shadow&amp;quot; RADOS tail object.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.data ls | grep 5chuncks
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.3_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.4_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.4
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.1_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.3
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_5chuncks
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output shows creation of the various components, including the final head
object (...&lt;code&gt;_5chuncks&lt;/code&gt;), as well as multiple multipart and shadow objects
corresponding to the striped parts.&lt;/p&gt;
&lt;p&gt;The size verification of these objects demonstrates the RGW splitting logic: the
multipart RADOS object is 4 MiB, and the shadow (tail) RADOS object is 1 MiB.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Check the size of the main 4MB chunk
$ rados -p default.rgw.buckets.data stat 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2
default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2 mtime 2022-12-21T03:07:49.000000-0500, size 4194304

# Check the size of the remaining 1MB chunk
$ rados -p default.rgw.buckets.data stat 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2_1
default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2_1 mtime 2022-12-21T03:07:49.000000-0500, size 1048576
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These parts are not assembled or merged in RADOS: this is their final state.&lt;/p&gt;
&lt;p&gt;Finally, the completed S3 object&#39;s head RADOS object contains only the metadata
manifest, which is why it reports a size of zero bytes at the RADOS level:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.data stat 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_5chuncks
default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_5chuncks mtime 2022-12-21T03:07:49.000000-0500, size 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More information on Multipart Upload can be found at &lt;a href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html&quot;&gt;AWS Multipart Upload&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;the-asynchronous-garbage-collector-(gc)&quot;&gt;The Asynchronous Garbage Collector (GC) &lt;a class=&quot;link-anchor&quot; href=&quot;#the-asynchronous-garbage-collector-(gc)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When clients delete S3 objects or overwrite them, the underlying RADOS objects
are not immediately removed. The primary function of object deletion is to
update the bucket index (or place a delete marker, if versioning is active).
Once the S3 object is removed from the index, its underlying RADOS objects
are effectively &amp;quot;orphaned.&amp;quot;&lt;/p&gt;
&lt;p&gt;These orphaned RADOS objects are then inserted into the Garbage Collection (GC)
queue. The Garbage Collector is a critical background process in RGW responsible
for asynchronously reclaiming the storage space consumed by these deleted objects.
This design ensures that client &lt;code&gt;DELETE&lt;/code&gt; requests return quickly without waiting
for the slow process of physically purging data blocks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;For workloads with high object churn (many creations and deletions), the GC process
can lag behind, causing a build-up of reclaimable space. To combat this,
administrators can tune several key parameters to make GC more aggressive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_obj_min_wait&lt;/code&gt;: The minimum time before a deleted object becomes eligible
for collection. Reducing this (default is 2 hours) accelerates space reclamation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_max_concurrent_io&lt;/code&gt;: The number of parallel RADOS delete operations a GC
thread can issue. Increasing this from the default of &lt;code&gt;10&lt;/code&gt; allows GC to process
more objects simultaneously, at the cost of higher background I/O on the cluster.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_processor_period&lt;/code&gt;: The interval between GC processing cycles. A lower
value means the GC thread runs more frequently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_max_trim_chunk&lt;/code&gt;: The number of log entries to process in a single batch.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
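&lt;p&gt;A hedged example of making GC more aggressive using the options above (the values are illustrative starting points, not universal recommendations, and the config section may differ depending on how your RGW daemons are named):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config set client.rgw rgw_gc_obj_min_wait 600
$ ceph config set client.rgw rgw_gc_max_concurrent_io 20
$ ceph config set client.rgw rgw_gc_processor_period 1800
&lt;/code&gt;&lt;/pre&gt;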
&lt;p&gt;We can use the below commands to list all objects scheduled for removal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin gc list
$ radosgw-admin gc list --include-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, a deleted object must wait 2 hours (&lt;code&gt;rgw_gc_obj_min_wait&lt;/code&gt;) before
GC will reclaim its space. To manually run the GC deletion process, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin gc process --include-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command can be executed to force the Garbage Collector to process its backlog
manually, ensuring the quick reclamation of space without waiting for the next scheduled run.&lt;/p&gt;
&lt;p&gt;Note: The &lt;code&gt;rgw_gc_max_objs&lt;/code&gt; option should NEVER be modified from its default value
in a running cluster. This value should only be modified (if at all) before deploying RGWs.&lt;/p&gt;
&lt;p&gt;Note also: &lt;code&gt;radosgw-admin&lt;/code&gt; can accept the &lt;code&gt;--bypass-gc&lt;/code&gt; switch to delete underlying
storage immediately, but we strongly recommend &lt;strong&gt;not&lt;/strong&gt; passing this option.&lt;/p&gt;
&lt;p&gt;Deployments with heavy S3 object churn may also find value in deploying a dedicated
cohort of RGW daemons that only process GC, with GC processing disabled in the
client-facing cohort.&lt;/p&gt;
&lt;h3 id=&quot;lifecycle-(lc)-management&quot;&gt;Lifecycle (LC) Management &lt;a class=&quot;link-anchor&quot; href=&quot;#lifecycle-(lc)-management&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Lifecycle (LC) Management engine automates data management based on user-defined
policies applied to buckets. These policies consist of rules that trigger actions
based on an object&#39;s age or other criteria. Common actions include &lt;code&gt;Expiration&lt;/code&gt;,
which deletes an object, and &lt;code&gt;Transition&lt;/code&gt;, which moves an object to a different
storage class. Lifecycle Transition can be defined between arbitrary storage
classes (Tiers) within a cluster or to external S3-compatible endpoints, including
AWS, IBM Cloud, or S3 tape endpoints:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img3.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can refine S3 Lifecycle expiration in RGW with fine-grained filters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Current vs Noncurrent object versions&lt;/li&gt;
&lt;li&gt;Expire delete markers (&lt;code&gt;ExpiredObjectDeleteMarker&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Automatically abort incomplete multipart uploads (&lt;code&gt;AbortIncompleteMultipartUpload&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Cap retained older versions via &lt;code&gt;NewerNoncurrentVersions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Scope rules by object size using &lt;code&gt;ObjectSizeGreaterThan&lt;/code&gt; and &lt;code&gt;ObjectSizeLessThan&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These filters, along with the use of S3 Tags, can be mixed to control cleanup behavior
at scale with incredible granularity.&lt;/p&gt;
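&lt;p&gt;As an illustrative sketch of combining these filters, the following two-rule policy is
applied with the AWS CLI pointed at an RGW endpoint. The endpoint URL, bucket name, tag,
and thresholds are placeholders chosen for the example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat lifecycle.json
{
  &amp;quot;Rules&amp;quot;: [
    {
      &amp;quot;ID&amp;quot;: &amp;quot;expire-large-processed-objects&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: {
        &amp;quot;And&amp;quot;: {
          &amp;quot;ObjectSizeGreaterThan&amp;quot;: 1048576,
          &amp;quot;Tags&amp;quot;: [ { &amp;quot;Key&amp;quot;: &amp;quot;processed&amp;quot;, &amp;quot;Value&amp;quot;: &amp;quot;true&amp;quot; } ]
        }
      },
      &amp;quot;Expiration&amp;quot;: { &amp;quot;Days&amp;quot;: 30 }
    },
    {
      &amp;quot;ID&amp;quot;: &amp;quot;housekeeping&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: { &amp;quot;Prefix&amp;quot;: &amp;quot;&amp;quot; },
      &amp;quot;NoncurrentVersionExpiration&amp;quot;: { &amp;quot;NoncurrentDays&amp;quot;: 7, &amp;quot;NewerNoncurrentVersions&amp;quot;: 3 },
      &amp;quot;AbortIncompleteMultipartUpload&amp;quot;: { &amp;quot;DaysAfterInitiation&amp;quot;: 2 },
      &amp;quot;Expiration&amp;quot;: { &amp;quot;ExpiredObjectDeleteMarker&amp;quot;: true }
    }
  ]
}
$ aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
      --bucket mybucket --lifecycle-configuration file://lifecycle.json
&lt;/code&gt;&lt;/pre&gt;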
&lt;p&gt;The LC engine is implemented as a set of multi-threaded worker processes. These workers
periodically scan the bucket indexes across the cluster. For each object they encounter,
they evaluate its properties against the bucket&#39;s lifecycle policy. If a rule&#39;s
conditions are met, the corresponding action is executed. An &lt;code&gt;Expiration&lt;/code&gt;
action effectively triggers a standard delete, removing the object&#39;s index
entry and enqueuing its data for GC. A &lt;code&gt;Transition&lt;/code&gt; action involves copying the
object&#39;s data to the target storage pool (which could be on a different media tier
or even a remote cloud tier), and then updating the object&#39;s metadata to reflect its
new location. To scale across large clusters, the LC engine&#39;s parallelism is tunable
(see the sketch after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_lc_max_worker&lt;/code&gt;: This controls the number of main worker threads, which
process multiple bucket index shards in parallel. This should be increased for
clusters with a vast number of buckets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_lc_max_wp_worker&lt;/code&gt;: This defines the number of sub-threads within each
worker&#39;s pool, which process objects within a single shard in parallel.
This should be increased for clusters with a few buckets that each contain a very large
number of S3 objects.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
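&lt;p&gt;As a rough sketch (values are illustrative, not recommendations), both settings can be
raised with &lt;code&gt;ceph config set&lt;/code&gt;, and a lifecycle pass can be started by hand with
&lt;code&gt;radosgw-admin lc process&lt;/code&gt; to observe the effect:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Example only: widen LC parallelism.
$ ceph config set client.rgw rgw_lc_max_worker 5      # main worker threads across index shards
$ ceph config set client.rgw rgw_lc_max_wp_worker 5   # sub-threads within each worker pool

# Start a lifecycle pass now instead of waiting for the next scheduled window
$ radosgw-admin lc process
&lt;/code&gt;&lt;/pre&gt;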
&lt;p&gt;&lt;img src=&quot;images/img4.jpg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here is a &lt;code&gt;radosgw-admin&lt;/code&gt; command listing the configured LC jobs in the cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin lc list | jq .
[
  {
    &amp;quot;bucket&amp;quot;: &amp;quot;:ingest:fcabdf4a-86f2-452f-a13f-e0902685c655.47553.1&amp;quot;,
    &amp;quot;shard&amp;quot;: &amp;quot;lc.0&amp;quot;,
    &amp;quot;started&amp;quot;: &amp;quot;Sat, 11 Oct 2025 11:20:59 GMT&amp;quot;,
    &amp;quot;status&amp;quot;: &amp;quot;COMPLETE&amp;quot;
  },
  {
    &amp;quot;bucket&amp;quot;: &amp;quot;:tierbucket:fcabdf4a-86f2-452f-a13f-e0902685c655.323278.10&amp;quot;,
    &amp;quot;shard&amp;quot;: &amp;quot;lc.3&amp;quot;,
    &amp;quot;started&amp;quot;: &amp;quot;Sat, 11 Oct 2025 11:20:56 GMT&amp;quot;,
    &amp;quot;status&amp;quot;: &amp;quot;COMPLETE&amp;quot;
  },
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can get the information for a specific bucket using a command of the following
form. This rule uses the object tag key/value pair &lt;code&gt;processed&lt;/code&gt;:&lt;code&gt;true&lt;/code&gt; as a
filter to expire matching objects older than one day.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin lc get --bucket ingest
{
    &amp;quot;prefix_map&amp;quot;: {
        &amp;quot;&amp;quot;: {
            &amp;quot;status&amp;quot;: true,
            &amp;quot;dm_expiration&amp;quot;: false,
            &amp;quot;expiration&amp;quot;: 1,
            &amp;quot;noncur_expiration&amp;quot;: 0,
            &amp;quot;mp_expiration&amp;quot;: 0,
            &amp;quot;obj_tags&amp;quot;: {
                &amp;quot;tagset&amp;quot;: {
                    &amp;quot;processed&amp;quot;: &amp;quot;true&amp;quot;
                }
            },
            &amp;quot;transitions&amp;quot;: {},
            &amp;quot;noncur_transitions&amp;quot;: {}
        }
    },
    &amp;quot;rule_map&amp;quot;: [
        {
            &amp;quot;id&amp;quot;: &amp;quot;Delete objects that are older than 24 hours&amp;quot;,
            &amp;quot;rule&amp;quot;: {
                &amp;quot;id&amp;quot;: &amp;quot;Delete objects that are older than 24 hours&amp;quot;,
                &amp;quot;prefix&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;status&amp;quot;: &amp;quot;Enabled&amp;quot;,
                &amp;quot;expiration&amp;quot;: {
                    &amp;quot;days&amp;quot;: &amp;quot;1&amp;quot;,
                    &amp;quot;date&amp;quot;: &amp;quot;&amp;quot;
                },
                &amp;quot;noncur_expiration&amp;quot;: {
                    &amp;quot;days&amp;quot;: &amp;quot;&amp;quot;,
                    &amp;quot;date&amp;quot;: &amp;quot;&amp;quot;
                },
                &amp;quot;mp_expiration&amp;quot;: {
                    &amp;quot;days&amp;quot;: &amp;quot;&amp;quot;,
                    &amp;quot;date&amp;quot;: &amp;quot;&amp;quot;
                },
                &amp;quot;filter&amp;quot;: {
                    &amp;quot;prefix&amp;quot;: &amp;quot;&amp;quot;,
                    &amp;quot;obj_tags&amp;quot;: {
                        &amp;quot;tagset&amp;quot;: {
                            &amp;quot;processed&amp;quot;: &amp;quot;true&amp;quot;
                        }
                    }
                },
                &amp;quot;transitions&amp;quot;: {},
                &amp;quot;noncur_transitions&amp;quot;: {},
                &amp;quot;dm_expiration&amp;quot;: false
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;conclusion%3A-the-engine-room-revealed&quot;&gt;Conclusion: The Engine Room Revealed &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-the-engine-room-revealed&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Across this two-part deep dive, we&#39;ve journeyed through the core architectural
pillars of Ceph RGW. From the high-performance frontends and the intricate
mechanics of bucket index sharding to the elegant head/tail data layout and
the automated background processes, you now have a comprehensive, end-to-end
understanding of how RGW achieves its remarkable scalability and flexibility.&lt;/p&gt;
&lt;p&gt;Understanding the engine&#39;s anatomy is just the first step. To truly master Ceph
RGW, we must learn how to tune, secure, and operate it in complex, real-world environments.&lt;/p&gt;
&lt;p&gt;This architectural exploration is the foundation for our ongoing series on Ceph RGW mastery.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
</feed>
