<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Ceph Blog</title>
  <link href="https://ceph.io/en/news/blog/feed.xml" rel="self" />
  <link href="https://ceph.io/en/news/blog/" />
  <updated>2026-04-06T00:00:00Z</updated>
  <id>https://ceph.io/en/news/blog/</id>
  <entry>
    <title>v20.2.1 Tentacle released</title>
    <link href="https://ceph.io/en/news/blog/2026/v20-2-1-tentacle-released/" />
    <updated>2026-04-06T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/v20-2-1-tentacle-released/</id>
    <author>
      <name>Yuri Weinstein</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="release" />
      <category term="reef" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/v20-2-1-tentacle-released/">&lt;p&gt;This is the first minor release in the Tentacle series.
We recommend that all users update to this release.&lt;/p&gt;
&lt;h2 id=&quot;release-date&quot;&gt;Release Date &lt;a class=&quot;link-anchor&quot; href=&quot;#release-date&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;April 06, 2026&lt;/p&gt;
&lt;h2 id=&quot;notable-changes&quot;&gt;Notable Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#notable-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h2 id=&quot;osd-%2F-bluestore&quot;&gt;OSD / BlueStore &lt;a class=&quot;link-anchor&quot; href=&quot;#osd-%2F-bluestore&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;EC Recovery: Fixed a length calculation bug in &lt;code&gt;erase_after_ro_offset()&lt;/code&gt; that caused empty shards to retain data, leading to &lt;code&gt;shard_size &amp;gt;= tobj_size&lt;/code&gt; assertion failures when recovering small objects in EC pools.&lt;/li&gt;
&lt;li&gt;BlueFS Volume Selector: Updated the BlueFS volume selector to properly account for file size changes when recovering the WAL in envelope mode.&lt;/li&gt;
&lt;li&gt;BlueFS: Fixed a bug where stat() missed the actual file size update after indexing WAL envelope files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;monitor-(mon)&quot;&gt;Monitor (mon) &lt;a class=&quot;link-anchor&quot; href=&quot;#monitor-(mon)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Fast EC Restrictions: Denied the ability to enable EC optimizations (&amp;quot;fast EC&amp;quot;) for non-4K-aligned chunk sizes. Unaligned chunk sizes handled by fast EC perform poorly and suffer from bugs, so attempts to force this configuration are now rejected.&lt;/li&gt;
&lt;li&gt;Peering: Ensured ceph pg repeer proposes a correctly sized pg temp, as optimized EC cannot cope with mismatched sizes.&lt;/li&gt;
&lt;li&gt;NVMeoF Gateway: Added a new &lt;code&gt;nvme-gw listeners&lt;/code&gt; command to display all existing listeners (including auto-listeners) inside a pool/group.&lt;/li&gt;
&lt;li&gt;NVMeoF Failover: Overhauled the NVMeoF Gateway fast-failover logic. Beacon timeouts are now evaluated within prepare_beacon to support shorter intervals, and the mechanism for detecting monitor slowness was improved.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;librbd-%26-rbd-mirror&quot;&gt;librbd &amp;amp; rbd-mirror &lt;a class=&quot;link-anchor&quot; href=&quot;#librbd-%26-rbd-mirror&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;RBD: Introduced a new &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT&lt;/code&gt; policy for &lt;code&gt;rbd_lock_acquire()&lt;/code&gt;. This is a low-level interface intended to let a peer grab the exclusive lock manually for short periods of time, with other peers pausing their activity and waiting for the lock to be released rather than instantly aborting I/O and returning an error. It is possible to switch between the &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE&lt;/code&gt; and &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT&lt;/code&gt; policies even if the lock is already held (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
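&lt;p&gt;A minimal usage sketch of the transient policy follows. It is illustrative only: the notes above name the C-level &lt;code&gt;rbd_lock_acquire()&lt;/code&gt; entry point, so exposing the policy as an &lt;code&gt;RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT&lt;/code&gt; constant in the &lt;code&gt;rbd&lt;/code&gt; Python bindings is an assumption, and the pool and image names are placeholders.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hedged sketch: take the exclusive lock transiently, do a short burst of
# work, then release it so waiting peers can resume I/O. The pool name, image
# name and the Python-level constant are assumptions, not release facts.
import rados
import rbd

with rados.Rados(conffile=&#39;/etc/ceph/ceph.conf&#39;) as cluster:
    with cluster.open_ioctx(&#39;rbd&#39;) as ioctx:           # placeholder pool
        with rbd.Image(ioctx, &#39;test-image&#39;) as image:  # placeholder image
            # Assumed Python-level binding of the new policy; other peers are
            # expected to pause their I/O and wait rather than erroring out.
            image.lock_acquire(rbd.RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT)
            try:
                pass  # short-lived exclusive work goes here
            finally:
                image.lock_release()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the mode is meant for short-lived exclusive access, the sketch releases the lock as soon as the work completes so that waiting peers can continue.&lt;/p&gt;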
&lt;h2 id=&quot;ceph-object-gateway-(rgw)&quot;&gt;Ceph Object Gateway (RGW) &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-object-gateway-(rgw)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Conditional Requests: Fixed conditional request validation in the Put, Delete, MultiDelete, and MultiWrite workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;mgr%2Fdashboard&quot;&gt;mgr/dashboard &lt;a class=&quot;link-anchor&quot; href=&quot;#mgr%2Fdashboard&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;UI Navigation: Redesigned the main landing page; the &amp;quot;Dashboard&amp;quot; navigation item was renamed to &amp;quot;Overview&amp;quot;, and the page now uses the Carbon Design System&#39;s productive card layout.&lt;/li&gt;
&lt;li&gt;NVMeoF Management: Added the &lt;code&gt;nvmeof get_subsystems&lt;/code&gt; CLI command, fixed JSON output indentation for NVMeoF CLI commands, and reverted the &lt;code&gt;server_addr&lt;/code&gt; API parameter back to &lt;code&gt;traddr&lt;/code&gt; for consistency.&lt;/li&gt;
&lt;li&gt;Hosts View: Fixed a bug causing the IP addresses of hosts to be hidden on the Hosts page due to an issue with fact merging.&lt;/li&gt;
&lt;li&gt;Forms &amp;amp; Modals: Standardized forms onto the Carbon Design System, including the pools form, service form, multi-site realm token export modal, delete zone modal, and password change forms.&lt;/li&gt;
&lt;li&gt;Form Validation: Generalized form error handling and validations using a new &lt;code&gt;cdValidate&lt;/code&gt; directive.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;mgr%2Fcephadm&quot;&gt;mgr/cephadm &lt;a class=&quot;link-anchor&quot; href=&quot;#mgr%2Fcephadm&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Monitoring Stack: Bumped the default container image versions for the monitoring stack: Prometheus to v3.6.0, Node-exporter to v1.9.1, Alertmanager to v0.28.1, and Grafana to v12.2.0.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;security-changes&quot;&gt;Security Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#security-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Monitoring Stack Images: Updated Prometheus, Alertmanager, and Grafana container image versions, picking up upstream security and stability fixes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;configuration-changes&quot;&gt;Configuration Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#configuration-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;bluefs_check_volume_selector_on_mount&lt;/code&gt;: The previous &lt;code&gt;bluefs_check_volume_selector_on_umount&lt;/code&gt; debug setting was renamed and repurposed. It now checks for volume selector inconsistencies on both mount and unmount phases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;mon_nvmeofgw_beacon_grace&lt;/code&gt;: The default grace period before marking a gateway as failed has been reduced from 10 seconds to 7 seconds for faster failover.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;nvmeof_mon_client_tick_period&lt;/code&gt;: The default beacon tick interval has been lowered from 2 seconds to 1 second (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
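&lt;p&gt;Operators who prefer the previous timings can override these defaults. Below is a minimal sketch, using the librados Python bindings&#39; &lt;code&gt;mon_command()&lt;/code&gt;, that restores the pre-20.2.1 values; the &lt;code&gt;ceph.conf&lt;/code&gt; path and the &lt;code&gt;global&lt;/code&gt; config section are assumptions about a typical deployment, and &lt;code&gt;ceph config set&lt;/code&gt; from the CLI achieves the same result.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hedged sketch: put the NVMeoF failover timings back to their pre-20.2.1
# defaults via the monitor &#39;config set&#39; command. The conffile path and the
# &#39;global&#39; section are assumptions about a typical deployment.
import json
import rados

with rados.Rados(conffile=&#39;/etc/ceph/ceph.conf&#39;) as cluster:
    for name, value in ((&#39;mon_nvmeofgw_beacon_grace&#39;, &#39;10&#39;),
                        (&#39;nvmeof_mon_client_tick_period&#39;, &#39;2&#39;)):
        cmd = json.dumps({&#39;prefix&#39;: &#39;config set&#39;, &#39;who&#39;: &#39;global&#39;,
                          &#39;name&#39;: name, &#39;value&#39;: value})
        ret, outbuf, outs = cluster.mon_command(cmd, b&#39;&#39;)
        print(name, ret, outs)
&lt;/code&gt;&lt;/pre&gt;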
&lt;h2 id=&quot;changelog&quot;&gt;Changelog &lt;a class=&quot;link-anchor&quot; href=&quot;#changelog&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;[rgw][tentacle] backport of cloud-restore related PRs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65830&quot;&gt;pr#65830&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add normalization and casesensitive options to the subvolume group creation command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65564&quot;&gt;pr#65564&lt;/a&gt;, Venky Shankar, Xavi Hernandez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;auth: msgr2 can return incorrect allowed_modes through AuthBadMethodFrame (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65336&quot;&gt;pr#65336&lt;/a&gt;, Miki Patel)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;backports variants improvements and Dockerfile&lt;span&gt;&lt;/span&gt;.build changes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66010&quot;&gt;pr#66010&lt;/a&gt;, John Mulligan, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Beacon diff (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66958&quot;&gt;pr#66958&lt;/a&gt;, Leonid Chernin, Samuel Just)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kernel: bring &amp;quot;bdev_async_discard&amp;quot; config parameter back (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65609&quot;&gt;pr#65609&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kernel: improve DiscardThread life cycle (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65213&quot;&gt;pr#65213&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;bluestore/BlueFS: fix bytes_written_slow counter with aio_write (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66355&quot;&gt;pr#66355&lt;/a&gt;, chungfengz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: add argument groups to organize options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65628&quot;&gt;pr#65628&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: build image variants (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65946&quot;&gt;pr#65946&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-mixin: Update monitoring mixin (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65692&quot;&gt;pr#65692&lt;/a&gt;, Aashish Sharma, SuperQ, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: fix UdevData initialisation from empty /run/udev/data/* file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65923&quot;&gt;pr#65923&lt;/a&gt;, Matteo Paramatti)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: lvm&lt;span&gt;&lt;/span&gt;.Lvm&lt;span&gt;&lt;/span&gt;.setup_metadata_devices refactor (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65925&quot;&gt;pr#65925&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: support additional dmcrypt params (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65544&quot;&gt;pr#65544&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: use udev data instead of LVM subprocess in get_devices() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65921&quot;&gt;pr#65921&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph_release, doc/dev: update tentacle as stable release (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65988&quot;&gt;pr#65988&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm, debian/rules: Use system packages for cephadm bundled dependencies (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66256&quot;&gt;pr#66256&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm: fix building rpm-sourced cephadm zipapp on el10 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65292&quot;&gt;pr#65292&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm: set default image for tentacle release (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65719&quot;&gt;pr#65719&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephadm: support custom distros by falling back to ID_LIKE (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65696&quot;&gt;pr#65696&lt;/a&gt;, bachmanity1)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-journal-tool: Journal trimming issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65601&quot;&gt;pr#65601&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix async/sync I/O stalling due to buffer list exceeding INT_MAX (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65256&quot;&gt;pr#65256&lt;/a&gt;, Dhairya Parmar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix dump_mds_requests to valid json format (&lt;a href=&quot;http://tracker.ceph.com/issues/73639&quot;&gt;issue#73639&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/66156&quot;&gt;pr#66156&lt;/a&gt;, haoyixing)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix unmount hang after lookups (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65254&quot;&gt;pr#65254&lt;/a&gt;, Dhairya Parmar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: use path supplied in statfs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65132&quot;&gt;pr#65132&lt;/a&gt;, Christopher Hoffman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/frag: properly convert frag_t to net/store endianness (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66540&quot;&gt;pr#66540&lt;/a&gt;, Patrick Donnelly, Max Kellermann)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common: Allow PerfCounters to return a provided service ID (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65587&quot;&gt;pr#65587&lt;/a&gt;, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: add iproute2 to build dependencies (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66737&quot;&gt;pr#66737&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: Add libxsimd-dev build dependency for vendored Arrow (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66248&quot;&gt;pr#66248&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: record python3-packaging dependency for ceph-volume (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66590&quot;&gt;pr#66590&lt;/a&gt;, Thomas Lamprecht, Max R. Carrara)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: fix docs for pause_purging and pause_cloning (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66452&quot;&gt;pr#66452&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/smb: document the &#39;provider&#39; option for smb share (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65617&quot;&gt;pr#65617&lt;/a&gt;, Sachin Prabhu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: change all intra-docs links to use ref (1 of 6) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67043&quot;&gt;pr#67043&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: change all intra-docs links to use ref (2 of 6) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67084&quot;&gt;pr#67084&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic improvements and ref links in account&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67064&quot;&gt;pr#67064&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rbd/rbd-config-ref: add clone settings section (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66175&quot;&gt;pr#66175&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: add Tentacle to os recommendations (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66464&quot;&gt;pr#66464&lt;/a&gt;, Casey Bodley, Joseph Mundackal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: fetch releases from main branch (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67002&quot;&gt;pr#67002&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Pin pip to &amp;lt;25&lt;span&gt;&lt;/span&gt;.3 for RTD as a workaround for pybind in admin/doc-read-the-docs&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66106&quot;&gt;pr#66106&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Remove sphinxcontrib-seqdiag Python package from RTD builds (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67296&quot;&gt;pr#67296&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Update dashboard pending release notes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65984&quot;&gt;pr#65984&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;encode: Fix bad use of DENC_DUMP_PRE (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66565&quot;&gt;pr#66565&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fast failover (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67150&quot;&gt;pr#67150&lt;/a&gt;, leonidc, Leonid Chernin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fix multifs auth caps check (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65358&quot;&gt;pr#65358&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Form retains old data when switching from edit to create (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65654&quot;&gt;pr#65654&lt;/a&gt;, pujashahu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generalize error handling for angular forms (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66904&quot;&gt;pr#66904&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;github: pin GH Actions to SHA-1 commit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65761&quot;&gt;pr#65761&lt;/a&gt;, Ernesto Puerta)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps&lt;span&gt;&lt;/span&gt;.sh: install proper compiler version on Debian/Ubuntu (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66015&quot;&gt;pr#66015&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps: Replace apt-mirror (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66672&quot;&gt;pr#66672&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;libcephfs: New feature - add ceph_setlk and ceph_getlk functions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65258&quot;&gt;pr#65258&lt;/a&gt;, Giorgos Kappes)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: fix ExclusiveLock::accept_request() when !is_state_locked() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66628&quot;&gt;pr#66628&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: introduce RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67279&quot;&gt;pr#67279&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds/FSMap: fix join_fscid being incorrectly reset for active MDS during filesystem removal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65777&quot;&gt;pr#65777&lt;/a&gt;, ethanwu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds/MDSDaemon: unlock &lt;code&gt;mds_lock&lt;/code&gt; while shutting down Beacon and others (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64885&quot;&gt;pr#64885&lt;/a&gt;, Max Kellermann)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: dump export_ephemeral_random_pin as double (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65163&quot;&gt;pr#65163&lt;/a&gt;, Enrico Bocchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: fix rank 0 marked damaged if stopping fails after Elid flush (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65778&quot;&gt;pr#65778&lt;/a&gt;, ethanwu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix readdir when osd is full (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65346&quot;&gt;pr#65346&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: fix snapdiff result fragmentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65362&quot;&gt;pr#65362&lt;/a&gt;, Igor Fedotov, Md Mahamudur Rahaman Sajib)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: include auth credential in session dump (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65255&quot;&gt;pr#65255&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Return ceph&lt;span&gt;&lt;/span&gt;.dir&lt;span&gt;&lt;/span&gt;.subvolume vxattr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65779&quot;&gt;pr#65779&lt;/a&gt;, Edwin Rodriguez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: skip charmap handler check for MDS requests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64953&quot;&gt;pr#64953&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: wrong snap check for directory with parent snaps (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65259&quot;&gt;pr#65259&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/alerts: enforce ssl context to SMTP_SSL (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66140&quot;&gt;pr#66140&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Add some new fields to the cephadm NVMEoF spec file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66987&quot;&gt;pr#66987&lt;/a&gt;, Gil Bregman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: bump monitoring stack versions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65895&quot;&gt;pr#65895&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Change the default of max hosts per namespace in NVMEoF to 16 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66819&quot;&gt;pr#66819&lt;/a&gt;, Gil Bregman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: don&#39;t mark nvmeof daemons without pool and group in name as stray (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65594&quot;&gt;pr#65594&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: update grafana conf for disconnected environment (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66209&quot;&gt;pr#66209&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Use a persistent volume to store Loki DB (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66023&quot;&gt;pr#66023&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/DaemonServer: fixed mistype for mgr_osd_messages (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63345&quot;&gt;pr#63345&lt;/a&gt;, Konstantin Shalygin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/DaemonState: Minimise time we hold the DaemonStateIndex lock (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65464&quot;&gt;pr#65464&lt;/a&gt;, Brad Hubbard)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Carbonize pools form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66789&quot;&gt;pr#66789&lt;/a&gt;, Abhishek Desai, Ankit Kumar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard :  Fixed labels issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66603&quot;&gt;pr#66603&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Carbonize -&amp;gt; Report an issue modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66048&quot;&gt;pr#66048&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : fix - about modal tooltip issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66276&quot;&gt;pr#66276&lt;/a&gt;, Devika Babrekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : fix - CephFS Authorize Modal Update issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66419&quot;&gt;pr#66419&lt;/a&gt;, Devika Babrekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : fix css for carbon input fields (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65490&quot;&gt;pr#65490&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fix secure-monitoring-stack creds issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65943&quot;&gt;pr#65943&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fixed mirrored image usage info bar (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65491&quot;&gt;pr#65491&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fixed usage bar for secondary site in rbd mirroring (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65927&quot;&gt;pr#65927&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Fixed warning icon colour issue with carbon colour (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66271&quot;&gt;pr#66271&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Hide suppressed  alert on landing page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65737&quot;&gt;pr#65737&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Remove subalerts details for multiple subalerts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66295&quot;&gt;pr#66295&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard : Skip calls until secure_monitoring_stack is enabled (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65673&quot;&gt;pr#65673&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: --no-group-append default value to False, aligned with old cli (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65678&quot;&gt;pr#65678&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add Archive zone configuration to the Dashboard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67131&quot;&gt;pr#67131&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add customizations to table-actions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65956&quot;&gt;pr#65956&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add full page tearsheet component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66892&quot;&gt;pr#66892&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add generic wizard component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66893&quot;&gt;pr#66893&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add get_subsystem nvme command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66941&quot;&gt;pr#66941&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add indentation to the json output of nvmeof cli commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66940&quot;&gt;pr#66940&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add multiple ceph users deletion (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65658&quot;&gt;pr#65658&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add nsid param to ns add command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65677&quot;&gt;pr#65677&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add nsid param to ns list command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65749&quot;&gt;pr#65749&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add overview page and change &#39;Dashboard&#39; to &#39;Overview&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67118&quot;&gt;pr#67118&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add productive card component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67147&quot;&gt;pr#67147&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add text-label-list component (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66312&quot;&gt;pr#66312&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Adding QAT Compression dropdown on RGW Service form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66642&quot;&gt;pr#66642&lt;/a&gt;, Devika Babrekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: allow deletion of non-default zone and zonegroup (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66211&quot;&gt;pr#66211&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Allow FQDN in Connect Cluster form -&amp;gt; Cluster API URL (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65622&quot;&gt;pr#65622&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Blank entry for Storage Capacity in dashboard under Cluster &amp;gt; Expand Cluster &amp;gt; Review (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65705&quot;&gt;pr#65705&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: bump validator package to address vulnerability (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66227&quot;&gt;pr#66227&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize - Multisite Zone (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67117&quot;&gt;pr#67117&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize Administration module &amp;gt; Create Realm/Zone group/zone (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66986&quot;&gt;pr#66986&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize multisite sync policy forms (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66302&quot;&gt;pr#66302&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonize service form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66978&quot;&gt;pr#66978&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Carbonize the Change Password Form (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66401&quot;&gt;pr#66401&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonize-delete-zone-modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67100&quot;&gt;pr#67100&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonize-delete-zonegroup-modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67014&quot;&gt;pr#67014&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: carbonized-multisite-export-realm-token-modal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66649&quot;&gt;pr#66649&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: change the default max namespace from 4096 to None in subsystem add command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65951&quot;&gt;pr#65951&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Edit user via UI throwing multiple server errors (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66081&quot;&gt;pr#66081&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: empty-data-message (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66902&quot;&gt;pr#66902&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fetch all namespaces in a gateway group (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67140&quot;&gt;pr#67140&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix command alias help message (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65750&quot;&gt;pr#65750&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix dashboard freeze on missing smb permissions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65873&quot;&gt;pr#65873&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix data mismatch in Advance section in Tiering (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65672&quot;&gt;pr#65672&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix display of IP address in host page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67146&quot;&gt;pr#67146&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix icon alignment in navigation header (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66091&quot;&gt;pr#66091&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix misaligned text links on login page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66052&quot;&gt;pr#66052&lt;/a&gt;, prik73, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix missing schedule interval in rbd API (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65560&quot;&gt;pr#65560&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix multi-cluster route reload logic (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66504&quot;&gt;pr#66504&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix multisite wizard realm configuration mode (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66017&quot;&gt;pr#66017&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix None force param handling in ns add_host so it won&#39;t raise exceptions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65679&quot;&gt;pr#65679&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix ns add and resize commands help (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66939&quot;&gt;pr#66939&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix oauth2-service creation UI error (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66139&quot;&gt;pr#66139&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix prometheus API error when not configured (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65856&quot;&gt;pr#65856&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix rbd form mirroring toggle (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65874&quot;&gt;pr#65874&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix RBD mirror schedule inheritance in pool and image APIs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67107&quot;&gt;pr#67107&lt;/a&gt;, Imran Imtiaz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix smb button and table column (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65657&quot;&gt;pr#65657&lt;/a&gt;, Pedro Gonzalez Gomez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix table width expansion on manager module dropdown selection #74089 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66647&quot;&gt;pr#66647&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix the separation between CLI and API only commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65781&quot;&gt;pr#65781&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix timestamps in APIs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66029&quot;&gt;pr#66029&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix total capacity value in dashboard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65647&quot;&gt;pr#65647&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix typo in error when gw does not exist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66956&quot;&gt;pr#66956&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix zone update API forcing STANDARD storage class (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65619&quot;&gt;pr#65619&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fixes for quick-bootstrap script (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67040&quot;&gt;pr#67040&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: FS - Attach Command showing undefined for MountData (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65675&quot;&gt;pr#65675&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Group similar alerts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65493&quot;&gt;pr#65493&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Handle pool creation in tiering local storage class creation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65680&quot;&gt;pr#65680&lt;/a&gt;, Dnyaneshwari, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Maintain sentence case consistency in side nav bar titles (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66050&quot;&gt;pr#66050&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: ns list now support not passing nqn param (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65897&quot;&gt;pr#65897&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: raise exception if both size and rbd_image_size are being passed in ns add (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65816&quot;&gt;pr#65816&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: rbd consistency group and snapshot APIs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66935&quot;&gt;pr#66935&lt;/a&gt;, Imran Imtiaz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Remove illegible texts from the dashboard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66306&quot;&gt;pr#66306&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: remove not needed &#39;cli_version&#39; field from gw info com… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66942&quot;&gt;pr#66942&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Remove the time dropdown from grafana iframe (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65853&quot;&gt;pr#65853&lt;/a&gt;, Abhishek Desai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: removes nx folder (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67003&quot;&gt;pr#67003&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: rename &#39;Zone Group&#39; labels to &#39;Zonegroup&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66790&quot;&gt;pr#66790&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Rename Alerts tab to All Alerts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66532&quot;&gt;pr#66532&lt;/a&gt;, Sagar Gopale)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Rename side-nav panel items (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65846&quot;&gt;pr#65846&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: replace bootstrap badges with carbon tags (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66350&quot;&gt;pr#66350&lt;/a&gt;, pujaoshahu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: replace usage or progress bar with carbon meter chart (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66934&quot;&gt;pr#66934&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: rgw accounts form group mode disable option is not working (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66351&quot;&gt;pr#66351&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: server side table rendering improvements (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65828&quot;&gt;pr#65828&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: service creation fails if service name is same as service type (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66481&quot;&gt;pr#66481&lt;/a&gt;, Naman Munet)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Set max subsystem count to 512 rather than 4096 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66284&quot;&gt;pr#66284&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: support gw get_stats and listener info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65896&quot;&gt;pr#65896&lt;/a&gt;, Tomer Haskalovitch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Tiering form - Placement Target in Advanced Section (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65653&quot;&gt;pr#65653&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: update teuth_ref hash in api test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66706&quot;&gt;pr#66706&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard:[NFS] add Subvolume Groups and Subvolumes in &amp;quot;Edit NFS Export form&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65650&quot;&gt;pr#65650&lt;/a&gt;, Dnyaneshwari Talwekar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/prometheus: Handle empty/invalid JSON from orch get-security-config (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65906&quot;&gt;pr#65906&lt;/a&gt;, Sunnatillo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/telemetry: add &#39;ec_optimizations&#39; flag to &#39;basic_pool_flags&#39; collection (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65969&quot;&gt;pr#65969&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: handling the failed non-atomic operation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65728&quot;&gt;pr#65728&lt;/a&gt;, Neeraj Pratap Singh)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: keep and show clone source info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64650&quot;&gt;pr#64650&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65262&quot;&gt;pr#65262&lt;/a&gt;, Enrico Bocchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: remove unnecessary log error lines from earmark handling (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66991&quot;&gt;pr#66991&lt;/a&gt;, Avan Thakkar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: avoid explicit dropping of ref (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65005&quot;&gt;pr#65005&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr:python: avoid pyo3 errors by running certain cryptographic functions in a child process (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66794&quot;&gt;pr#66794&lt;/a&gt;, Nizamudeen A, John Mulligan, Paulo E. Castro)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/FSCommands: avoid unreachable code triggering compiler warning (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65261&quot;&gt;pr#65261&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/MgrMonitor: add a space before &amp;quot;is already disabled&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64687&quot;&gt;pr#64687&lt;/a&gt;, Zehua Qi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/OSDMonitor&lt;span&gt;&lt;/span&gt;.cc: optionally display availability status in json (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65794&quot;&gt;pr#65794&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: Add command &amp;quot;nvme-gw listeners&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66584&quot;&gt;pr#66584&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: ceph pg repeer should propose a correctly sized pg temp (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66324&quot;&gt;pr#66324&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: Deny EC optimizations (fast EC) for non-4k-aligned chunk-sizes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67319&quot;&gt;pr#67319&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monc: synchronize tick() of MonClient with shutdown() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66916&quot;&gt;pr#66916&lt;/a&gt;, Radoslaw Zarzynski)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix &amp;quot;In&amp;quot; OSDs in Cluster-Advanced grafana panel&lt;span&gt;&lt;/span&gt;. Also change units from decbytes to bytes wherever used in the panel (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65670&quot;&gt;pr#65670&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix &amp;quot;Total gateway&amp;quot; and &amp;quot;Ceph Health NVMeoF WARNING&amp;quot; grafana graphs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66225&quot;&gt;pr#66225&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix CephPgImbalance alert rule expression (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66828&quot;&gt;pr#66828&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: Fix Filesystem grafana dashboard units (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66018&quot;&gt;pr#66018&lt;/a&gt;, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix MTU Mismatch alert rule and expr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65708&quot;&gt;pr#65708&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix rgw_servers filtering in rgw sync overview grafana (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66989&quot;&gt;pr#66989&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: Fixes for smb overview (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66019&quot;&gt;pr#66019&lt;/a&gt;, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: make cluster matcher backward compatible for pre-reef metrics (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66984&quot;&gt;pr#66984&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: update NVMeoFTooManyNamespaces to 4096 ns (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67039&quot;&gt;pr#67039&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: upgrade grafana version to 12&lt;span&gt;&lt;/span&gt;.3&lt;span&gt;&lt;/span&gt;.1 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66963&quot;&gt;pr#66963&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;nvmeof: refactor beacon timer for exact frequency timing with drift correction (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66536&quot;&gt;pr#66536&lt;/a&gt;, Alexander Indenbaum)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Objecter: respect higher epoch subscription in tick (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66972&quot;&gt;pr#66972&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: cumulative patch to fix extent map resharding and around (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65964&quot;&gt;pr#65964&lt;/a&gt;, Igor Fedotov, Adam Kupczyk, Jaya Prakash)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: fix vselector update after enveloped WAL recovery (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67333&quot;&gt;pr#67333&lt;/a&gt;, Igor Fedotov, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: introduce device type specific allocation policy (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66839&quot;&gt;pr#66839&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/ECUtil: Fix erase_after_ro_offset length calculation and add tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66825&quot;&gt;pr#66825&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/PeeringState: re-evaluate full OSDs while waiting for recovery re… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65701&quot;&gt;pr#65701&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: do not reduce min chunk on preemption (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66214&quot;&gt;pr#66214&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: fix blocked scrub accounting (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66220&quot;&gt;pr#66220&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: new/modified perf counters for scrub preemption (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66234&quot;&gt;pr#66234&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Do not remove objects with divergent logs if only partial writes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66725&quot;&gt;pr#66725&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix fast EC truncate to whole stripe (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66543&quot;&gt;pr#66543&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix for num_bytes mismatch occurring from snapshot workloads with partial writes in fast_ec (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67137&quot;&gt;pr#67137&lt;/a&gt;, Jon Bailey)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix memory leak of ECDummyOp (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66977&quot;&gt;pr#66977&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Fix stats mismatch cluster error seen during scrubbing occasionally (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65793&quot;&gt;pr#65793&lt;/a&gt;, Jon Bailey)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: Relax missing entry assert for partial writes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65860&quot;&gt;pr#65860&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: stop scrub_purged_snaps() from ignoring osd_beacon_report_interval (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65478&quot;&gt;pr#65478&lt;/a&gt;, Radoslaw Zarzynski)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pickup object corpus 20&lt;span&gt;&lt;/span&gt;.2&lt;span&gt;&lt;/span&gt;.0 380 gdbcbbd3f281 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66592&quot;&gt;pr#66592&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;prometheus: Add Cephadm orch ps output metric to prometheus (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66760&quot;&gt;pr#66760&lt;/a&gt;, Ankush Behl)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr/dashboard: dashboard/requirements-lint&lt;span&gt;&lt;/span&gt;.txt: re-pin rsscheck (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66877&quot;&gt;pr#66877&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr/pg_autoscaler: Introduce dynamic threshold to improve scal… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66871&quot;&gt;pr#66871&lt;/a&gt;, Prashant D)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: pin cheroot version in requirements-required&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65635&quot;&gt;pr#65635&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/rados: Add list_lockers() and break_lock() to Rados Python interface (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65098&quot;&gt;pr#65098&lt;/a&gt;, Gil Bregman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/multisite: switch to boto3 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67318&quot;&gt;pr#67318&lt;/a&gt;, Shilpa Jagannath, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: bucket notifications use pynose (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67449&quot;&gt;pr#67449&lt;/a&gt;, Casey Bodley, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/standalone/availability&lt;span&gt;&lt;/span&gt;.sh: retry after feature is turned on (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67226&quot;&gt;pr#67226&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/nvmeof: add upgrade sub-suite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65583&quot;&gt;pr#65583&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/rados/thrash-old-clients: Add OSD warnings to ignore list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65369&quot;&gt;pr#65369&lt;/a&gt;, Naveen Naidu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/rbd/valgrind: don&#39;t hardcode os_type in memcheck&lt;span&gt;&lt;/span&gt;.yaml (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66196&quot;&gt;pr#66196&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/upgrade: add &amp;quot;Replacing daemon mds&amp;quot; to ignorelist (&lt;a href=&quot;http://tracker.ceph.com/issues/71615&quot;&gt;issue#71615&lt;/a&gt;, &lt;a href=&quot;http://tracker.ceph.com/issues/50279&quot;&gt;issue#50279&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/64888&quot;&gt;pr#64888&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites: wait longer before stopping OSDs with valgrind (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63716&quot;&gt;pr#63716&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/ceph_manager: population must be a sequence (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64746&quot;&gt;pr#64746&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/qemu: rocky 10 enablement (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67283&quot;&gt;pr#67283&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/rbd_mirror_thrash: don&#39;t use random&lt;span&gt;&lt;/span&gt;.randrange() on floats (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67163&quot;&gt;pr#67163&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/workunit: fix no module named &#39;pipes&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66250&quot;&gt;pr#66250&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: added initial draft for tentacle-p2p (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67765&quot;&gt;pr#67765&lt;/a&gt;, Patrick Donnelly, Yuri Weinstein)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: added messages to the whitelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65645&quot;&gt;pr#65645&lt;/a&gt;, Laura Flores, Yuri Weinstein)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: wait for module to be available for connection (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67196&quot;&gt;pr#67196&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/valgrind&lt;span&gt;&lt;/span&gt;.supp: make gcm_cipher_internal suppression more resilient (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67281&quot;&gt;pr#67281&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/nvmeof/basic_tests: use nvme-cli 2&lt;span&gt;&lt;/span&gt;.13 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67285&quot;&gt;pr#67285&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rados: remove cache tier test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65540&quot;&gt;pr#65540&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rbd: adapt rbd_mirror&lt;span&gt;&lt;/span&gt;.sh for trial nodes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67152&quot;&gt;pr#67152&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rbd: reduce randomized sleeps in live import tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67154&quot;&gt;pr#67154&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rbd: use the same qemu-iotests version throughout (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67282&quot;&gt;pr#67282&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/rgw: drop netstat usage (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67184&quot;&gt;pr#67184&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits: add Rocky Linux support to librados tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67091&quot;&gt;pr#67091&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Disable OSD benchmark from running for tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67068&quot;&gt;pr#67068&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: don&#39;t assume that /dev/sda or /dev/vda is present in unmap&lt;span&gt;&lt;/span&gt;.t (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67077&quot;&gt;pr#67077&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Fix test_with_health_warn_with_2_active_MDSs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65260&quot;&gt;pr#65260&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: ignore cluster warning (evicting unresponsive &lt;span&gt;&lt;/span&gt;.&lt;span&gt;&lt;/span&gt;.&lt;span&gt;&lt;/span&gt;.) with tasks/mgr-osd-full (&lt;a href=&quot;http://tracker.ceph.com/issues/73278&quot;&gt;issue#73278&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/66125&quot;&gt;pr#66125&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Improve scalability test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66224&quot;&gt;pr#66224&lt;/a&gt;, Vallari Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: krbd_blkroset&lt;span&gt;&lt;/span&gt;.t: eliminate a race in the open_count test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67075&quot;&gt;pr#67075&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Run RADOS suites with ec optimizations on and off (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65471&quot;&gt;pr#65471&lt;/a&gt;, Jamie Pryde)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: suppress OpenSSL valgrind leaks (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65660&quot;&gt;pr#65660&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: add cluster fsid to remote meta cache key (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66297&quot;&gt;pr#66297&lt;/a&gt;, Mykola Golub)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: allow incomplete demote snapshot to sync after rbd-mirror daemon restart (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66164&quot;&gt;pr#66164&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Relax scrub of shard sizes for upgraded EC pools (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66021&quot;&gt;pr#66021&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Revert &amp;quot;Merge pull request #66958 from Hezko/wip-74413-tentacle&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67750&quot;&gt;pr#67750&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Revert &amp;quot;PrimeryLogPG: don&#39;t accept ops with mixed balance_reads and rwordered flags&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66611&quot;&gt;pr#66611&lt;/a&gt;, Radoslaw Zarzynski)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;RGW | fix conditional Delete, MultiDelete and Put (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65949&quot;&gt;pr#65949&lt;/a&gt;, Ali Masarwa)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;RGW | fix conditional MultiWrite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67425&quot;&gt;pr#67425&lt;/a&gt;, Ali Masarwa)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/account: bucket acls are not completely migrated once the user is migrated to an account (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65666&quot;&gt;pr#65666&lt;/a&gt;, kchheda3)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/admin: Add max-entries and marker to bucket list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65485&quot;&gt;pr#65485&lt;/a&gt;, Tobias Urdin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/lc: LCOpAction_CurrentExpiration checks mtime for delete markers (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65965&quot;&gt;pr#65965&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/tentacle: clean up &lt;span&gt;&lt;/span&gt;.rgw_op&lt;span&gt;&lt;/span&gt;.cc&lt;span&gt;&lt;/span&gt;.swn file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66161&quot;&gt;pr#66161&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: add metric when send message with kafka and ampq (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65904&quot;&gt;pr#65904&lt;/a&gt;, Hoai-Thu Vuong)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix &#39;bucket rm --bypass-gc&#39; for copied objects (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66004&quot;&gt;pr#66004&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix &lt;code&gt;radosgw-admin object unlink ...&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66151&quot;&gt;pr#66151&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;RGW: multi object delete op; skip olh update for all deletes but the last one (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65488&quot;&gt;pr#65488&lt;/a&gt;, Oguzhan Ozmen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: update keystone repo stable branch to 2024&lt;span&gt;&lt;/span&gt;.2 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66241&quot;&gt;pr#66241&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rpm: default to gcc-toolset-13, not just for crimson (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65752&quot;&gt;pr#65752&lt;/a&gt;, John Mulligan, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;scripts/build/ceph&lt;span&gt;&lt;/span&gt;.spec&lt;span&gt;&lt;/span&gt;.in: fix rhel version checks (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66865&quot;&gt;pr#66865&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/ceph_osd, osd: Implement running benchmark during OSD creation - Phase 1 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65522&quot;&gt;pr#65522&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src: Move the decision to build the ISA plugin to the top level make file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67894&quot;&gt;pr#67894&lt;/a&gt;, Alex Ainscow)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;sync build-with-container patches from main (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65843&quot;&gt;pr#65843&lt;/a&gt;, John Mulligan, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;systemd services: fix installing ceph-volume@ (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66861&quot;&gt;pr#66861&lt;/a&gt;, Thomas Lamprecht)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tasks/cbt_performance: Tolerate exceptions during performance data up… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66102&quot;&gt;pr#66102&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/ceph_assert&lt;span&gt;&lt;/span&gt;.cc: Disable core files (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66334&quot;&gt;pr#66334&lt;/a&gt;, Bob Ham)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/neorados: Catch timeouts in Poll test (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65605&quot;&gt;pr#65605&lt;/a&gt;, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test: disable known flaky tests in run-rbd-unit-tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67559&quot;&gt;pr#67559&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tools: handle get-attr as read-only ops in ceph_objectstore_tool (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66537&quot;&gt;pr#66537&lt;/a&gt;, Jaya Prakash)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Q1 2026 Newsletter</title>
    <link href="https://ceph.io/en/news/blog/2026/Q1-community-newsletter/" />
    <updated>2026-03-31T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/Q1-community-newsletter/</id>
    <author>
      <name>Anthony Middleton</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="community" />
      <category term="governance" />
      <category term="ceph events" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/Q1-community-newsletter/">&lt;p&gt;During this quarter, the Ceph Foundation focused on strengthening its structure by establishing new governance charters, event strategies, and financial plans. To enhance transparency as our community evolves, this newsletter offers a look behind the scenes at these foundational details. Our aim is to update Ceph Community members on the latest developments within the Foundation and to clarify how they can contribute to our ongoing growth. There are several ways to get involved with the foundation. If you have a concept for a Ceph-related project, we encourage you to take the first step toward bringing your idea to the next level and &lt;a href=&quot;https://form.asana.com/?k=7aCHVRhp0x1Ga1nOCXlckQ&amp;amp;d=9283783873717&quot;&gt;submit a funding request&lt;/a&gt; whenever you are ready. Feedback and suggestions will be offered along your journey with Ceph.&lt;/p&gt;
&lt;h2 id=&quot;in-this-issue&quot;&gt;In this issue &lt;a class=&quot;link-anchor&quot; href=&quot;#in-this-issue&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;CSC and Ceph Foundation Board Meeting&lt;/li&gt;
&lt;li&gt;Ceph Foundation Charters Approved&lt;/li&gt;
&lt;li&gt;Ceph Governing Board Hiring a Technical Writer&lt;/li&gt;
&lt;li&gt;OVHcloud Spending Update&lt;/li&gt;
&lt;li&gt;Ceph Tech Talks Are Back (Monthly Schedule)&lt;/li&gt;
&lt;li&gt;Upcoming Ceph Days Events&lt;/li&gt;
&lt;li&gt;Ceph Community Slack Upgraded to Pro&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;csc-and-ceph-foundation-board-meeting&quot;&gt;CSC and Ceph Foundation Board Meeting &lt;a class=&quot;link-anchor&quot; href=&quot;#csc-and-ceph-foundation-board-meeting&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Foundation Board recently hosted the Ceph Steering Committee (CSC) for a collaborative discussion on the current state of the Ceph project and how both sides can work together to support the Ceph community.&lt;/p&gt;
&lt;p&gt;These quarterly meetings are designed to foster communication between the two committees working to build Ceph for the benefit of its users and contributors. The Ceph Board&#39;s goal is to help provide greater context to the CSC as they make decisions and to support their missions, thereby bridging the communication gap. The meeting&#39;s agenda is available &lt;a href=&quot;https://pad.ceph.com/p/foundation-board-csc-meeting&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;key-discussion-areas&quot;&gt;Key discussion areas &lt;a class=&quot;link-anchor&quot; href=&quot;#key-discussion-areas&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Operational complexity&lt;/li&gt;
&lt;li&gt;Friction in contributing and getting reviews&lt;/li&gt;
&lt;li&gt;Fragmented communication&lt;/li&gt;
&lt;li&gt;Unclear strategy in some areas&lt;/li&gt;
&lt;li&gt;Unclear ownership across parts of the ecosystem&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;major-takeaways&quot;&gt;Major takeaways &lt;a class=&quot;link-anchor&quot; href=&quot;#major-takeaways&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The Board will continue to work on closing the gap between developer experience and real-world operator needs, with help from the CSC around failure handling and upgrades.&lt;/li&gt;
&lt;li&gt;Both committees agreed that contributor experience remains a key challenge, with CI complexity and limited reviewer bandwidth identified as the biggest sources of friction, not the contribution process itself.&lt;/li&gt;
&lt;li&gt;The Board and the CSC recognize that perception matters. Having heard feedback that Ceph can feel “unwelcoming,” they are addressing this concern, and it will guide improvements in onboarding and engagement.&lt;/li&gt;
&lt;li&gt;The Ceph Foundation will keep focusing on areas like performance and efficiency. There&#39;s an opportunity to improve how project-wide priorities are communicated, helping to align contributors and ecosystem partners.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;action-items&quot;&gt;Action Items &lt;a class=&quot;link-anchor&quot; href=&quot;#action-items&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Ceph Community Manager: Document best practices for large contributions requiring early community engagement&lt;/li&gt;
&lt;li&gt;Ceph Community Manager: Improve discoverability of contributor guidelines and ambassador resources&lt;/li&gt;
&lt;li&gt;CSC: Prepare Q2 response on project priorities, pain points, and foundation delegation opportunities&lt;/li&gt;
&lt;li&gt;Ceph Community Manager: Share contributor feedback data with CSC for deeper analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;road-to-fully-onboarding-into-the-linux-foundation%3A-charters-approved&quot;&gt;Road to Fully Onboarding into the Linux Foundation: Charters Approved &lt;a class=&quot;link-anchor&quot; href=&quot;#road-to-fully-onboarding-into-the-linux-foundation%3A-charters-approved&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph now has a clearly defined governance model that separates technical decision-making from funding and community growth, making it easier to understand how decisions are made and how to get involved. This task began in September 2025, when the Ceph Foundation initiated its transition into The Linux Foundation, marking a significant step toward strengthening the long-term sustainability and neutrality of the Ceph project.&lt;/p&gt;
&lt;p&gt;The outcome of this work was the formal approval of two foundational governance frameworks. The Ceph Foundation Charter was established to define how the Foundation raises, allocates, and manages resources in support of the project, while also creating a clear and transparent structure for decision-making, community outreach, and ecosystem development. At its core, the Charter exists to ensure that Ceph operates as a vendor-neutral, community-driven project with sustainable funding and broad industry participation.&lt;/p&gt;
&lt;p&gt;In parallel, the Ceph Technical Charter was approved by the Ceph Steering Committee (CSC), reinforcing the independence of the project’s technical governance and clarifying how technical decisions are made in alignment with the needs of the community.&lt;/p&gt;
&lt;p&gt;Together, these milestones establish a balanced governance model: the Foundation focuses on funding, outreach, and ecosystem growth, while the technical community retains authority over the project’s technical direction and innovation.&lt;/p&gt;
&lt;p&gt;The third and final step of this process, currently in the works, is transferring the Ceph trademark to the Linux Foundation.&lt;/p&gt;
&lt;p&gt;Review the Foundation Charter: &lt;a href=&quot;https://cdn.platform.linuxfoundation.org/agreements/cephfoundation.pdf&quot;&gt;https://cdn.platform.linuxfoundation.org/agreements/cephfoundation.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Review the Ceph Technical Charter:&lt;br&gt;
&lt;a href=&quot;https://github.com/ceph/ceph/blob/main/doc/technical-charter.rst&quot;&gt;https://github.com/ceph/ceph/blob/main/doc/technical-charter.rst&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;opportunity%3A-technical-writer-role-open&quot;&gt;Opportunity: Technical Writer Role Open &lt;a class=&quot;link-anchor&quot; href=&quot;#opportunity%3A-technical-writer-role-open&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph is investing directly in better documentation, making it easier for new users to adopt Ceph, for experienced operators to scale, and for prospective contributors to get started with development. The Ceph Technical Committee is actively interviewing candidates for a &lt;strong&gt;Technical Writer&lt;/strong&gt; role to help improve the quality and accessibility of Ceph documentation.&lt;/p&gt;
&lt;h3 id=&quot;about-the-role&quot;&gt;About the role &lt;a class=&quot;link-anchor&quot; href=&quot;#about-the-role&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Focus on clarity, consistency, and usability of technical content&lt;/li&gt;
&lt;li&gt;Collaborate with developers and contributors across the community&lt;/li&gt;
&lt;li&gt;Help lower the barrier to entry for new users and operators&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-to-get-involved&quot;&gt;How to get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-get-involved&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Review docs and flag gaps&lt;/li&gt;
&lt;li&gt;Suggest onboarding pain points&lt;/li&gt;
&lt;li&gt;Participate in doc sprints&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;ovhcloud-spending-update&quot;&gt;OVHcloud Spending Update &lt;a class=&quot;link-anchor&quot; href=&quot;#ovhcloud-spending-update&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Governing Board, with Mark Nelson (Clyso) and Patrick Donnelly (IBM), is working to reduce infrastructure costs. Reports from Mark and Joachim Kraftmayer (Clyso) on OVHcloud show potential for lower monthly hosting expenses. Negotiations with OVHcloud could yield savings, which would be reinvested to boost the Foundation&#39;s support for community programs, events, and infrastructure.&lt;/p&gt;
&lt;h3 id=&quot;current-overview&quot;&gt;Current overview &lt;a class=&quot;link-anchor&quot; href=&quot;#current-overview&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Monthly costs over the past year have ranged between &lt;strong&gt;$5K–$10K USD&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Spending peaked around August and has since been reduced to approximately &lt;strong&gt;$7K/month&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Costs remain higher than earlier in the year, prompting further review&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;key-cost-drivers&quot;&gt;Key cost drivers &lt;a class=&quot;link-anchor&quot; href=&quot;#key-cost-drivers&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A significant portion of expenses comes from storage&lt;/li&gt;
&lt;li&gt;Four 10TB volumes (supporting Chacra nodes), along with associated snapshots, accounted for &lt;strong&gt;over half of total monthly costs (~$4K USD)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;recent-actions&quot;&gt;Recent actions &lt;a class=&quot;link-anchor&quot; href=&quot;#recent-actions&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Unused snapshots and resources have been removed by David Galloway (IBM)&lt;/li&gt;
&lt;li&gt;Ongoing collaboration with OVHcloud to identify additional optimization opportunities&lt;/li&gt;
&lt;li&gt;Further cost savings are expected as usage is refined&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;what%E2%80%99s-next&quot;&gt;What’s next &lt;a class=&quot;link-anchor&quot; href=&quot;#what%E2%80%99s-next&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Continued monitoring and cost optimization efforts&lt;/li&gt;
&lt;li&gt;Improved visibility into infrastructure usage and spending&lt;/li&gt;
&lt;li&gt;Updates will be shared in the next newsletter and monthly reports as new efficiencies are realized&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;ceph-tech-talks-are-back&quot;&gt;Ceph Tech Talks Are Back &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-tech-talks-are-back&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://ceph.io/en/community/tech-talks/&quot;&gt;Ceph Tech Talks&lt;/a&gt; have officially returned, giving the community a consistent way to learn directly from contributors and share real-world experience. So far, we have covered &lt;a href=&quot;https://youtu.be/6ovdJ79AqbM?si=an65PKQI9RyryzTf&quot;&gt;Running Teuthology Outside of the Sepia lab&lt;/a&gt; and &lt;a href=&quot;https://youtu.be/PqniV0qzq68?si=MTxdnp3xq35-QtVv&quot;&gt;How to Get Involved with the Ceph Ambassador Program&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Attend the Tech Talk on April 22, 2026, at 12 pm EDT / 9 am PDT. Our topic will be &lt;strong&gt;MAAS as a Backend: Provisioning Infrastructure for Teuthology Suites&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Follow the &lt;a href=&quot;https://ceph.io/en/community/meetups/&quot;&gt;Ceph Community Calendar&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h3 id=&quot;what-to-expect&quot;&gt;What to expect &lt;a class=&quot;link-anchor&quot; href=&quot;#what-to-expect&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Deep dives into real-world Ceph use cases and features&lt;/li&gt;
&lt;li&gt;Presentations led by community members and contributors&lt;/li&gt;
&lt;li&gt;A wide variety of topics for users and developers&lt;/li&gt;
&lt;li&gt;Interactive sessions with opportunities for Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;get-involved&quot;&gt;Get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Attend upcoming sessions to stay current&lt;/li&gt;
&lt;li&gt;Submit a proposal to present your work &lt;a href=&quot;https://airtable.com/apphc2dbSP8GuCdor/pagKnGCFWqHvgdCrm/form&quot;&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;upcoming-ceph-events-and-the-state-of-cephalocon&quot;&gt;Upcoming Ceph Events and the State of Cephalocon &lt;a class=&quot;link-anchor&quot; href=&quot;#upcoming-ceph-events-and-the-state-of-cephalocon&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Foundation is evolving its approach to events in order to better serve the global community.&lt;/p&gt;
&lt;p&gt;Instead of focusing all of our resources on a single large event like Cephalocon, the Foundation is now emphasizing &lt;strong&gt;local and regional engagement&lt;/strong&gt; to expand Ceph’s reach into nearby communities and projects. This shift was prompted by new guidelines regarding spending within the Foundation. Under these fiduciary guidelines, large events still require &lt;strong&gt;broader community support and sponsorship participation&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&quot;what%E2%80%99s-changing&quot;&gt;What’s changing &lt;a class=&quot;link-anchor&quot; href=&quot;#what%E2%80%99s-changing&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Increased focus on &lt;a href=&quot;https://ceph.io/en/community/events/&quot;&gt;Ceph Days&lt;/a&gt;, meetups, &lt;a href=&quot;https://events.linuxfoundation.org/&quot;&gt;data-driven conferences&lt;/a&gt;, and community-led events&lt;/li&gt;
&lt;li&gt;Strategic investment in opportunities that introduce Ceph to new audiences&lt;/li&gt;
&lt;li&gt;A shift toward distributed, community-driven engagement&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;important-note-on-future-large-events&quot;&gt;Important note on future large events &lt;a class=&quot;link-anchor&quot; href=&quot;#important-note-on-future-large-events&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The Foundation remains open to hosting large-scale events like Cephalocon when broader community sponsorship and participation can support them&lt;/li&gt;
&lt;li&gt;This approach keeps such events aligned with the Foundation’s financial responsibilities while enabling sustainable growth&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-to-get-involved-1&quot;&gt;How to get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-get-involved-1&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Help sponsor large-scale events: as part of the Foundation’s financial stewardship under the Linux Foundation, these require broader community sponsorship and participation&lt;/li&gt;
&lt;li&gt;Organize or support a Ceph event in your region&lt;/li&gt;
&lt;li&gt;Partner with related open source or data infrastructure communities&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;need-support%3F&quot;&gt;Need support? &lt;a class=&quot;link-anchor&quot; href=&quot;#need-support%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Community members can submit &lt;a href=&quot;https://form.asana.com/?k=7aCHVRhp0x1Ga1nOCXlckQ&amp;amp;d=9283783873717&quot;&gt;funding requests&lt;/a&gt; to attend or organize events, as well as to support Ceph-related activities&lt;/li&gt;
&lt;li&gt;Requests are reviewed by the Board (approval is not guaranteed)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;upcoming-ceph-days-%E2%80%93-march-2026&quot;&gt;Upcoming Ceph Days – March 2026 &lt;a class=&quot;link-anchor&quot; href=&quot;#upcoming-ceph-days-%E2%80%93-march-2026&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph Days continues to grow as the primary way the community connects locally. Two community-driven events took place this March:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ceph Days India&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sponsors: IBM and Clyso&lt;/li&gt;
&lt;li&gt;Attendees: 192&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Ceph Days Raleigh&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sponsor: IBM&lt;/li&gt;
&lt;li&gt;Attendees: 78&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;why-attend&quot;&gt;Why attend &lt;a class=&quot;link-anchor&quot; href=&quot;#why-attend&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Connect with other Ceph users and contributors&lt;/li&gt;
&lt;li&gt;Learn from real-world deployments and technical sessions&lt;/li&gt;
&lt;li&gt;Grow your local Ceph network&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;get-involved-1&quot;&gt;Get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved-1&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Share your ideas for a Ceph Days: &lt;a href=&quot;https://pad.ceph.com/p/ceph-days-2026&quot;&gt;https://pad.ceph.com/p/ceph-days-2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Help organize or promote events in your region&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;ceph-community-slack-upgraded-to-pro&quot;&gt;Ceph Community Slack Upgraded to Pro &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-community-slack-upgraded-to-pro&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The community can now access Slack messages beyond the 90-day limit of the free plan! The &lt;a href=&quot;https://join.slack.com/t/ceph-storage/shared_invite/zt-3jlvf8f6e-45tyKGpqkkfcC9feAUpgfQ&quot;&gt;Ceph Community Slack&lt;/a&gt; workspace has been upgraded to Slack Pro, courtesy of the Linux Foundation, as part of negotiations requested by the Ceph Board.&lt;/p&gt;
&lt;h3 id=&quot;what-this-means&quot;&gt;What this means &lt;a class=&quot;link-anchor&quot; href=&quot;#what-this-means&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No more losing conversations!&lt;/li&gt;
&lt;li&gt;There is no additional cost to the Ceph community&lt;/li&gt;
&lt;li&gt;A more robust platform for collaboration and communication&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;why-it-matters&quot;&gt;Why it matters &lt;a class=&quot;link-anchor&quot; href=&quot;#why-it-matters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Supports better knowledge sharing across contributors and users&lt;/li&gt;
&lt;li&gt;Improves accessibility of past discussions and technical insights&lt;/li&gt;
&lt;li&gt;Strengthens real-time collaboration within the community&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;get-involved-2&quot;&gt;Get involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved-2&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Become a moderator for the Ceph Slack workspace&lt;/li&gt;
&lt;li&gt;Email &lt;a href=&quot;mailto:amiddleton@linuxfoundation.org&quot;&gt;amiddleton@linuxfoundation.org&lt;/a&gt; for more information&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>v18.2.8 Reef released</title>
    <link href="https://ceph.io/en/news/blog/2026/v18-2-8-reef-released/" />
    <updated>2026-03-20T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/v18-2-8-reef-released/</id>
    <author>
      <name>Yuri Weinstein</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="release" />
      <category term="reef" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/v18-2-8-reef-released/">&lt;p&gt;This is the eighth, and expected to be last, backport release in the Reef series. We recommend that all users update to this release.&lt;/p&gt;
&lt;h2 id=&quot;release-date&quot;&gt;Release Date &lt;a class=&quot;link-anchor&quot; href=&quot;#release-date&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;March 20, 2026&lt;/p&gt;
&lt;h2 id=&quot;known-issues&quot;&gt;Known Issues &lt;a class=&quot;link-anchor&quot; href=&quot;#known-issues&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;During QA for v18.2.8, a bug was found affecting upgrades from
Pacific to Reef. Pacific OSDs (and other Ceph daemons) were still using a
deprecated connection feature bit that was adopted to indicate a Reef OSD.
This can cause an OSD_UPGRADE_FINISHED warning before all OSDs are actually
upgraded to Reef. There are no known issues associated with Pacific and Reef
OSDs interoperating where Pacific OSDs are &amp;quot;advertising&amp;quot; Reef compatibility;
however, out of an abundance of caution, we no longer recommend upgrading
from Pacific to Reef directly. A quick way to confirm the versions your OSDs
are actually running is sketched after this list.&lt;/li&gt;
&lt;/ul&gt;
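&lt;p&gt;For operators who hit this warning mid-upgrade, one way to confirm whether every OSD is actually running Reef is to inspect the versions the daemons report, rather than relying on OSD_UPGRADE_FINISHED alone. A minimal sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Summarize the running version of each daemon type; every OSD should report
# an 18.2.x (Reef) version before the upgrade is treated as complete.
ceph versions

# Break OSDs down by their reported ceph_version metadata for a per-version count.
ceph osd count-metadata ceph_version
&lt;/code&gt;&lt;/pre&gt;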
&lt;h2 id=&quot;security-fixes&quot;&gt;Security Fixes &lt;a class=&quot;link-anchor&quot; href=&quot;#security-fixes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;CephFS Client: A fix was merged to prohibit unprivileged users from modifying
the sgid or suid bits on a file. Previously, unprivileged users were
inadvertently permitted to set these bits if they were the sole bits being
modified.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Mgr Alerts: The SMTP SSL context was enforced in the mgr/alerts module to
resolve a security vulnerability (GHSA-xj9f-7g59-m4jx).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;notable-changes&quot;&gt;Notable Changes &lt;a class=&quot;link-anchor&quot; href=&quot;#notable-changes&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;RGW (RADOS Gateway):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fixed an issue where bucket rm --bypass-gc was mistakenly removing head objects instead of tail objects, potentially causing data inconsistencies.&lt;/li&gt;
&lt;li&gt;Fixed rgw-restore-bucket-index to handle objects with leading hyphens and to process versioned buckets correctly.&lt;/li&gt;
&lt;li&gt;Addressed an issue in the msg/async protocol that caused memory locks and hangs during connection shutdown.&lt;/li&gt;
&lt;li&gt;RGW STS: Made JWKS URL verification configurable for AWS compliance via the rgw_enable_jwks_url_verification configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CephFS / MDS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prevented the MDS from stalling (up to 5 seconds) during rename/stat workloads by forcing the log to nudge for unstable locks after early replies.&lt;/li&gt;
&lt;li&gt;Fixed cephfs-journal-tool so it no longer incorrectly resets the journal trim position during disaster recovery, which was causing stale journal objects to linger forever in the metadata pool.&lt;/li&gt;
&lt;li&gt;Fixed a bug where ll_walk incorrectly processed absolute paths as relative paths.&lt;/li&gt;
&lt;li&gt;Prevented the ceph fs volume create command from accidentally deleting user-created pools if the command aborted during cleanup.&lt;/li&gt;
&lt;li&gt;MDS Batched Operations: Added a new mds_allow_batched_ops configuration option (default: true) to control whether the MDS can batch lookup or getattr RPCs.&lt;/li&gt;
&lt;li&gt;CephFS Subvolumes: Added the ceph fs subvolume snapshot getpath command to allow users to retrieve the absolute path of a subvolume snapshot (a brief CLI sketch of this command and the new MDS option follows this list).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;BlueStore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fixed a bug where the bytes_written_slow performance counter incorrectly reported 0 when using aio_write.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
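&lt;p&gt;The new CephFS items above can be exercised from the CLI. The following is a minimal sketch; the argument order for the snapshot getpath command mirrors the existing ceph fs subvolume getpath command and is an assumption here, so consult the built-in help on your cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Retrieve the absolute path of a subvolume snapshot (argument order assumed;
# check: ceph fs subvolume snapshot getpath -h).
ceph fs subvolume snapshot getpath myvol sub0 snap0 --group_name mygroup

# Turn off MDS batching of lookup/getattr RPCs (it defaults to true).
ceph config set mds mds_allow_batched_ops false
&lt;/code&gt;&lt;/pre&gt;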
&lt;h2 id=&quot;changelog&quot;&gt;Changelog &lt;a class=&quot;link-anchor&quot; href=&quot;#changelog&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;span&gt;&lt;/span&gt;.github: Fix RTD build retrigger (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63616&quot;&gt;pr#63616&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;rgw&gt; Ensure the ETag format is consistent with AWS S3 API (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62608&quot;&gt;pr#62608&lt;/a&gt;, Casey Bodley, liubingrun)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[reef] os/bluestore: fix _extend_log seq advance (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61653&quot;&gt;pr#61653&lt;/a&gt;, Pere Diaz Bou)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[reef] RGW backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63031&quot;&gt;pr#63031&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[reef] rgw/dbstore: Update bucket attrs as part of put_info() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64488&quot;&gt;pr#64488&lt;/a&gt;, Soumya Koduri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;auth: msgr2 can return incorrect allowed_modes through AuthBadMethodFrame (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65334&quot;&gt;pr#65334&lt;/a&gt;, Miki Patel)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;backport build-with-container patches from main (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65188&quot;&gt;pr#65188&lt;/a&gt;, John Mulligan, Dan Mick, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Backport the hybrid_btree2 allocator and prereqs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62539&quot;&gt;pr#62539&lt;/a&gt;, Igor Fedotov, Jrchyang Yu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;backports variants improvements and Dockerfile&lt;span&gt;&lt;/span&gt;.build changes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66012&quot;&gt;pr#66012&lt;/a&gt;, John Mulligan, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kernel: improve DiscardThread life cycle (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65216&quot;&gt;pr#65216&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/KernelDevice: Introduce a cap on the number of pending discards (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62220&quot;&gt;pr#62220&lt;/a&gt;, Joshua Baergen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kerneldevice: notify_all only required when discard_drain wait for condition (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62152&quot;&gt;pr#62152&lt;/a&gt;, Yite Gu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;blk/kerneldevice: some fix for device discard (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62481&quot;&gt;pr#62481&lt;/a&gt;, Igor Fedotov, Yite Gu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;bluestore/BlueFS: fix bytes_written_slow counter with aio_write (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66353&quot;&gt;pr#66353&lt;/a&gt;, chungfengz)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65066&quot;&gt;pr#65066&lt;/a&gt;, John Mulligan, Zack Cerza)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: add argument groups to organize options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65630&quot;&gt;pr#65630&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: build image variants (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65944&quot;&gt;pr#65944&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;build-with-container: two small fixes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62339&quot;&gt;pr#62339&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-fuse: Improve fuse mount usage message (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61275&quot;&gt;pr#61275&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: allow zapping partitions on multipath devices (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62178&quot;&gt;pr#62178&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: do not convert LVs&#39;s symlink to real path (&lt;a href=&quot;https://github.com/ceph/ceph/pull/59989&quot;&gt;pr#59989&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph-volume: fix regex usage in &lt;code&gt;set_dmcrypt_no_workqueue&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62791&quot;&gt;pr#62791&lt;/a&gt;, Guillaume Abrioux)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph&lt;span&gt;&lt;/span&gt;.spec&lt;span&gt;&lt;/span&gt;.in: add man/rgw-gap-list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63999&quot;&gt;pr#63999&lt;/a&gt;, Matan Breizman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ceph&lt;span&gt;&lt;/span&gt;.spec&lt;span&gt;&lt;/span&gt;.in: Remove rgw-restore-bucket-index&lt;span&gt;&lt;/span&gt;.8* from packaging (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64130&quot;&gt;pr#64130&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs,mon: fs rename must require FS to be offline and refuse_client_session to be set (&lt;a href=&quot;http://tracker.ceph.com/issues/66088&quot;&gt;issue#66088&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61410&quot;&gt;pr#61410&lt;/a&gt;, Rishabh Dave, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-journal-tool: fix segfault during &#39;journal import&#39; from invalid dump file (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62114&quot;&gt;pr#62114&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-journal-tool: Journal trimming issue (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65603&quot;&gt;pr#65603&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-shell: add option to remove xattr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62409&quot;&gt;pr#62409&lt;/a&gt;, Neeraj Pratap Singh)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-top, qa: Remove unnecessary global statements in tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62606&quot;&gt;pr#62606&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs-top: exception when terminal size greater than PAD_WIDTH (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61773&quot;&gt;pr#61773&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cephfs: session tracker accounts for killing sessions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65253&quot;&gt;pr#65253&lt;/a&gt;, Abhishek Lekshmanan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fix d_reclen for readdir (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61519&quot;&gt;pr#61519&lt;/a&gt;, Xavi Hernandez)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: fixed a bug that read operation hung (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60695&quot;&gt;pr#60695&lt;/a&gt;, Tod Chen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: Handle empty pathnames for &lt;code&gt;ceph_chownat()&lt;/code&gt; and &lt;code&gt;ceph_statxat()&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61165&quot;&gt;pr#61165&lt;/a&gt;, Anoop C S)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: ll_walk will process absolute paths as relative (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62500&quot;&gt;pr#62500&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: prohibit unprivileged users from setting sgid/suid bits (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66040&quot;&gt;pr#66040&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;client: return EOPNOTSUPP for fallocate with mode 0 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60657&quot;&gt;pr#60657&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cls/rbd: write image mirror status if state is CREATING (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63236&quot;&gt;pr#63236&lt;/a&gt;, N Balachandran)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cls/rgw: non-versioned listings skip past version suffix (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62591&quot;&gt;pr#62591&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/options: fix the description of osd_max_scrubs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62378&quot;&gt;pr#62378&lt;/a&gt;, Satoru Takeuchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/options: fix typo in description (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64218&quot;&gt;pr#64218&lt;/a&gt;, Lorenz Bausch)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;common/pick_address: Add IPv6 support to is_addr_in_subnet (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62814&quot;&gt;pr#62814&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;container: small container image improvements (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62345&quot;&gt;pr#62345&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;crush: use std::vector instead of variable length arrays (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62014&quot;&gt;pr#62014&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian/control: add iproute2 to build dependencies (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66738&quot;&gt;pr#66738&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;debian: package mgr/rgw in ceph-mgr-modules-core (&lt;a href=&quot;https://github.com/ceph/ceph/pull/57874&quot;&gt;pr#57874&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/architecture: remove sentence (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61615&quot;&gt;pr#61615&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: Add mention of --zap for OSD removal (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62444&quot;&gt;pr#62444&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: Correct indentation in osd&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62428&quot;&gt;pr#62428&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: Fix formatting in osd&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62811&quot;&gt;pr#62811&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm/services: improve rgw&lt;span&gt;&lt;/span&gt;.rst and snmp-gateway&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62695&quot;&gt;pr#62695&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: Add admonition re restarting an OSD service (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62797&quot;&gt;pr#62797&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: Add PG autoscaler advice to upgrade&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62380&quot;&gt;pr#62380&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: clarify &amp;quot;Monitoring OSD State&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61665&quot;&gt;pr#61665&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: Correct formatting in upgrade&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63148&quot;&gt;pr#63148&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: correct markup in rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63074&quot;&gt;pr#63074&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: improve &amp;quot;Maintenance Mode&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63496&quot;&gt;pr#63496&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephadm: s/confg/config/ (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62645&quot;&gt;pr#62645&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: add a note about estimated replay completion time (&lt;a href=&quot;http://tracker.ceph.com/issues/71629&quot;&gt;issue#71629&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/65058&quot;&gt;pr#65058&lt;/a&gt;, Venky Shankar, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: correct ill-formatted command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63502&quot;&gt;pr#63502&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: correct reference structure in fs-volumes&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63545&quot;&gt;pr#63545&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Cosmetic changes and small fixes in cephfs-mirroring&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63468&quot;&gt;pr#63468&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: document first-damage&lt;span&gt;&lt;/span&gt;.py (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63978&quot;&gt;pr#63978&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit ceph-dokan&lt;span&gt;&lt;/span&gt;.rst (1 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64736&quot;&gt;pr#64736&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit ceph-dokan&lt;span&gt;&lt;/span&gt;.rst (2 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64760&quot;&gt;pr#64760&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit ceph-dokan&lt;span&gt;&lt;/span&gt;.rst (3 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64786&quot;&gt;pr#64786&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit disaster-recovery&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64645&quot;&gt;pr#64645&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit disaster-recovery&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64609&quot;&gt;pr#64609&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65380&quot;&gt;pr#65380&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65094&quot;&gt;pr#65094&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65091&quot;&gt;pr#65091&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65126&quot;&gt;pr#65126&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65123&quot;&gt;pr#65123&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65097&quot;&gt;pr#65097&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65078&quot;&gt;pr#65078&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65088&quot;&gt;pr#65088&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65047&quot;&gt;pr#65047&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65044&quot;&gt;pr#65044&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65041&quot;&gt;pr#65041&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65037&quot;&gt;pr#65037&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65026&quot;&gt;pr#65026&lt;/a&gt;, Zac Dover, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64904&quot;&gt;pr#64904&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64901&quot;&gt;pr#64901&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64879&quot;&gt;pr#64879&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64872&quot;&gt;pr#64872&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64853&quot;&gt;pr#64853&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: edit troubleshooting&lt;span&gt;&lt;/span&gt;.rst (Slow MDS) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65201&quot;&gt;pr#65201&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Improve mount-using-fuse&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64473&quot;&gt;pr#64473&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: link section for pausing async threads in section for (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62875&quot;&gt;pr#62875&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Update deprecation notice in experimental-features&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63949&quot;&gt;pr#63949&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/cephfs: Update quota&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65083&quot;&gt;pr#65083&lt;/a&gt;, Jannis Speer, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 1 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63299&quot;&gt;pr#63299&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 2 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63274&quot;&gt;pr#63274&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 3 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63548&quot;&gt;pr#63548&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/cephfs-mirroring: edit file 4 of x (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63661&quot;&gt;pr#63661&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/config: Document how to use :confval: directive for config op… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64167&quot;&gt;pr#64167&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/release-process&lt;span&gt;&lt;/span&gt;.rst: document new Jenkins job for containers (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62613&quot;&gt;pr#62613&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev/release-process&lt;span&gt;&lt;/span&gt;.rst: release builds cannot build containers (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61818&quot;&gt;pr#61818&lt;/a&gt;, Dan Mick, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev: Debuggging with gdb (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63994&quot;&gt;pr#63994&lt;/a&gt;, Matan Breizman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev: update link to backporter manual (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63991&quot;&gt;pr#63991&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/dev:update blkin&lt;span&gt;&lt;/span&gt;.rst doc for lttng trace (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65212&quot;&gt;pr#65212&lt;/a&gt;, lizhipeng)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/glossary: s/OMAP/omap/ (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63738&quot;&gt;pr#63738&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/man/8: Improve mount&lt;span&gt;&lt;/span&gt;.ceph&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65184&quot;&gt;pr#65184&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/ceph_api: edit index&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63198&quot;&gt;pr#63198&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/crash&lt;span&gt;&lt;/span&gt;.rst: remove outdated module enabling instructions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64285&quot;&gt;pr#64285&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr/dashboard_plugins: edit feature_toggles&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63705&quot;&gt;pr#63705&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit administrator&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63208&quot;&gt;pr#63208&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit alerts&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63201&quot;&gt;pr#63201&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit cli_api (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63744&quot;&gt;pr#63744&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit cli_api&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63690&quot;&gt;pr#63690&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit crash&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63539&quot;&gt;pr#63539&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit dashboard&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63316&quot;&gt;pr#63316&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit debug&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63394&quot;&gt;pr#63394&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit diskpredictor&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63424&quot;&gt;pr#63424&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit feature_toggles&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63397&quot;&gt;pr#63397&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit hello&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63508&quot;&gt;pr#63508&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit influx&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63455&quot;&gt;pr#63455&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit insights&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63511&quot;&gt;pr#63511&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit iostat&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63681&quot;&gt;pr#63681&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit iostat&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63514&quot;&gt;pr#63514&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit localpool&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63670&quot;&gt;pr#63670&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit localpool&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63551&quot;&gt;pr#63551&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit mds_autoscaler&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63493&quot;&gt;pr#63493&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit modules&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63667&quot;&gt;pr#63667&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit modules&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63578&quot;&gt;pr#63578&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit motd&lt;span&gt;&lt;/span&gt;.inc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63403&quot;&gt;pr#63403&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit nfs&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63664&quot;&gt;pr#63664&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit nfs&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63581&quot;&gt;pr#63581&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit orchestrator&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63584&quot;&gt;pr#63584&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit progress&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63658&quot;&gt;pr#63658&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit progress&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63587&quot;&gt;pr#63587&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit prometheus&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63590&quot;&gt;pr#63590&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63593&quot;&gt;pr#63593&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telegraf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63612&quot;&gt;pr#63612&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (1 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63769&quot;&gt;pr#63769&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (2 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63772&quot;&gt;pr#63772&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (3 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63775&quot;&gt;pr#63775&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry (4 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63778&quot;&gt;pr#63778&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64344&quot;&gt;pr#64344&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63810&quot;&gt;pr#63810&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63906&quot;&gt;pr#63906&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63865&quot;&gt;pr#63865&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63693&quot;&gt;pr#63693&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: edit telemetry&lt;span&gt;&lt;/span&gt;.rst (lines 300-400) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63868&quot;&gt;pr#63868&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: Improve prometheus&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62931&quot;&gt;pr#62931&lt;/a&gt;, Zac Dover, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/mgr: Small improvements in rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63626&quot;&gt;pr#63626&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/monitoring: correct list formatting (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63542&quot;&gt;pr#63542&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration/bluestore-config-ref: Fix lowcase typo (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62261&quot;&gt;pr#62261&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration/bluestore-config-ref: Fix lowercase typos (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62291&quot;&gt;pr#62291&lt;/a&gt;, Dan van der Ster)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Correct admonition in ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62621&quot;&gt;pr#62621&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Improve ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63943&quot;&gt;pr#63943&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Mention show-with-defaults and ceph-conf (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65207&quot;&gt;pr#65207&lt;/a&gt;, Niklas Hambüchen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/configuration: Small improvements in ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64288&quot;&gt;pr#64288&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations/stretch-mode: Improve doc (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61654&quot;&gt;pr#61654&lt;/a&gt;, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Actually mention &lt;code&gt;upmap_max_deviation&lt;/code&gt; setting … (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64119&quot;&gt;pr#64119&lt;/a&gt;, Niklas Hambüchen)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: add kernel client procedure to read balancer documentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65440&quot;&gt;pr#65440&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Add settings advice to balancer&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63536&quot;&gt;pr#63536&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Additional improvements to placement-groups&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63650&quot;&gt;pr#63650&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Address suggestions for stretch-mode&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63850&quot;&gt;pr#63850&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: edit cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63696&quot;&gt;pr#63696&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve erasure-code&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62574&quot;&gt;pr#62574&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve health-checks&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65239&quot;&gt;pr#65239&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve placement-groups&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63647&quot;&gt;pr#63647&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/operations: Improve stretch-mode&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63816&quot;&gt;pr#63816&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/ops: add caps restore command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64322&quot;&gt;pr#64322&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/ops: edit cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64497&quot;&gt;pr#64497&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados/ops: edit cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63831&quot;&gt;pr#63831&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: document section absent in release &amp;lt; T (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64868&quot;&gt;pr#64868&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: edit balancer&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63684&quot;&gt;pr#63684&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: edit ops/user-management&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63893&quot;&gt;pr#63893&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: enhance &amp;quot;pools&lt;span&gt;&lt;/span&gt;.rst&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63862&quot;&gt;pr#63862&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: improve markup in cache-tiering&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63505&quot;&gt;pr#63505&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: remove clonedata command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64394&quot;&gt;pr#64394&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: repair short underline (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65138&quot;&gt;pr#65138&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: s/enpty/empty/ in pgcalc doc (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63499&quot;&gt;pr#63499&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rados: Update mClock doc on steps to override OSD IOPS capacity config (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63072&quot;&gt;pr#63072&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/notifications: fix topic details (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62405&quot;&gt;pr#62405&lt;/a&gt;, Laimis Juzeliunas)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/admin&lt;span&gt;&lt;/span&gt;.rst: explain bucket and uid flags for bucket quota (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64022&quot;&gt;pr#64022&lt;/a&gt;, Hyun Jin Kim)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/cloud-transition: fix details (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62835&quot;&gt;pr#62835&lt;/a&gt;, Laimis Juzeliunas)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw/s3: Document delete-if-unmodified-since (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64316&quot;&gt;pr#64316&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: add &amp;quot;persistent_topic_size&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64140&quot;&gt;pr#64140&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: add rgw_enable_lc_threads &amp;amp; rgw_enable_gc_threads (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64339&quot;&gt;pr#64339&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic and formatting improvements in vault&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63230&quot;&gt;pr#63230&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic improvements in cloud-transition&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63449&quot;&gt;pr#63449&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Cosmetic improvements in dynamicresharding&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64059&quot;&gt;pr#64059&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit &amp;quot;Lifecycle Settings&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64548&quot;&gt;pr#64548&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit cloud-transition (1 of x) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64025&quot;&gt;pr#64025&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit config-ref&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64648&quot;&gt;pr#64648&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit metrics&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63813&quot;&gt;pr#63813&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: edit sentence in metrics&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63701&quot;&gt;pr#63701&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Fix RST syntax rendered as text in oidc&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62990&quot;&gt;pr#62990&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: improve &amp;quot;pubsub_push_pending&amp;quot; info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64114&quot;&gt;pr#64114&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve and more consistent formatting (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62910&quot;&gt;pr#62910&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve cloud-restore and cloud-transition (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62667&quot;&gt;pr#62667&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve formatting in layout&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63000&quot;&gt;pr#63000&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve layout&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62450&quot;&gt;pr#62450&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Improve rgw-cache&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64476&quot;&gt;pr#64476&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Promptify CLI commands and fix formatting in layout&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63916&quot;&gt;pr#63916&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Promptify CLI, cosmetic fixes (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62857&quot;&gt;pr#62857&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: remove &amp;quot;pubsub_event_lost&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64127&quot;&gt;pr#64127&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: remove &amp;quot;pubsub_event_triggered&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64156&quot;&gt;pr#64156&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: remove cloud-restore from reef (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65638&quot;&gt;pr#65638&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: update aws specification link (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64096&quot;&gt;pr#64096&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/radosgw: Use ref for hyperlinking to multisite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63312&quot;&gt;pr#63312&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rbd/rbd-config-ref: add clone settings section (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66173&quot;&gt;pr#66173&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rbd: add mirroring troubleshooting info (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63847&quot;&gt;pr#63847&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: add man documentation for the rgw-gap-list tool (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63997&quot;&gt;pr#63997&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: clarify path-style vs virtual-hosted-style access (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61987&quot;&gt;pr#61987&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: document Admin and System Users (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62882&quot;&gt;pr#62882&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: remove metrics&lt;span&gt;&lt;/span&gt;.rst which did not apply to reef (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66320&quot;&gt;pr#66320&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/rgw: use &#39;confval&#39; directive to render sts config options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63442&quot;&gt;pr#63442&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/src/common/options: mgr&lt;span&gt;&lt;/span&gt;.yaml&lt;span&gt;&lt;/span&gt;.in edit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63765&quot;&gt;pr#63765&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/src: edit osd&lt;span&gt;&lt;/span&gt;.yaml&lt;span&gt;&lt;/span&gt;.in (osd_deep_scrub_interval_cv) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63956&quot;&gt;pr#63956&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/start: edit documenting-ceph&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63653&quot;&gt;pr#63653&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc/start: edit documenting-ceph&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63708&quot;&gt;pr#63708&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: add note admonitions in two files (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64493&quot;&gt;pr#64493&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Clarify the status of MS Windows client support (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64482&quot;&gt;pr#64482&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: do not depend on typed-ast (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64400&quot;&gt;pr#64400&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Document ceph-mgr module configuration options (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64397&quot;&gt;pr#64397&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: fix formatting in cephfs_mirror dev doc (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63251&quot;&gt;pr#63251&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fix links to mClock config reference (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64798&quot;&gt;pr#64798&lt;/a&gt;, Pierre Riteau)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fix missing blank line Sphinx warnings (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63338&quot;&gt;pr#63338&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fix unterminated inline literal in ceph-conf&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64171&quot;&gt;pr#64171&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fixed a spelling error (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64148&quot;&gt;pr#64148&lt;/a&gt;, &lt;a href=&quot;http://Instelligence.io&quot;&gt;Instelligence.io&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Fixes a typo in balancer operations (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65740&quot;&gt;pr#65740&lt;/a&gt;, Tyler Brekke)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: mgr/dashboard: add OAuth2 SSO documentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64034&quot;&gt;pr#64034&lt;/a&gt;, Pedro Gonzalez Gomez, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Pin pip to &amp;lt;25&lt;span&gt;&lt;/span&gt;.3 for RTD as a workaround for pybind in admin/doc-read-the-docs&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66118&quot;&gt;pr#66118&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Remove sphinxcontrib-seqdiag Python package from RTD builds (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67528&quot;&gt;pr#67528&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Revert &amp;quot;doc/radosgw: add &amp;quot;persistent_topic_size&amp;quot;&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64179&quot;&gt;pr#64179&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Revert &amp;quot;doc: mgr/dashboard: add OAuth2 SSO documentation&amp;quot; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66796&quot;&gt;pr#66796&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: Revert doc/cephadm: correct markup in rgw&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66971&quot;&gt;pr#66971&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: src/pybind/mgr/dashboard: edit HACKING&lt;span&gt;&lt;/span&gt;.rst (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63697&quot;&gt;pr#63697&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: update cephfs-journal-tool docs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63109&quot;&gt;pr#63109&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;doc: update mgr modules notify_types (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64531&quot;&gt;pr#64531&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;fix: the RGW crash caused by special characters (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64052&quot;&gt;pr#64052&lt;/a&gt;, mertsunacoglu, Emin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;github: pin GH Actions to SHA-1 commit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65759&quot;&gt;pr#65759&lt;/a&gt;, Ernesto Puerta)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Handle failures in metric parsing (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65595&quot;&gt;pr#65595&lt;/a&gt;, Anmol Babu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps&lt;span&gt;&lt;/span&gt;.sh: install proper compiler version on Debian/Ubuntu (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66014&quot;&gt;pr#66014&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;install-deps: Replace apt-mirror (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66669&quot;&gt;pr#66669&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd/cache/pwl: fix memory leak in SyncPoint persist context cleanup (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64093&quot;&gt;pr#64093&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd/migration/QCOWFormat: don&#39;t complete read_clusters() inline (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64195&quot;&gt;pr#64195&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: disallow &amp;quot;rbd trash mv&amp;quot; if image is in a group (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62967&quot;&gt;pr#62967&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: images aren&#39;t closed in group_snap_*_by_record() on error (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64620&quot;&gt;pr#64620&lt;/a&gt;, Miki Patel)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;librbd: respect rbd_default_snapshot_quiesce_mode in group_snap_create() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62962&quot;&gt;pr#62962&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LogMonitor: set no_reply for forward MLog commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62212&quot;&gt;pr#62212&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds/Beacon: wake up the thread in shutdown() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61513&quot;&gt;pr#61513&lt;/a&gt;, Max Kellermann)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: add an asok command to dump export states (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61512&quot;&gt;pr#61512&lt;/a&gt;, Zhansong Gao)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: add more debug logs and log events (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61518&quot;&gt;pr#61518&lt;/a&gt;, Xiubo Li)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: do not process client metrics message with fast dispatch (&lt;a href=&quot;http://tracker.ceph.com/issues/68865&quot;&gt;issue#68865&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61339&quot;&gt;pr#61339&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: drop client metrics during recovery (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61299&quot;&gt;pr#61299&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: dump next_snap when checking dentry corruption (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61978&quot;&gt;pr#61978&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix invalid access of mdr-&amp;gt;dn[0]&lt;span&gt;&lt;/span&gt;.back() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61516&quot;&gt;pr#61516&lt;/a&gt;, Anoop C S)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix invalid access of mdr-&amp;gt;dn[0]&lt;span&gt;&lt;/span&gt;.back() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61450&quot;&gt;pr#61450&lt;/a&gt;, Anoop C S)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: Fix readdir when osd is full (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65348&quot;&gt;pr#65348&lt;/a&gt;, Kotresh HR)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: fix snapdiff result fragmentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65364&quot;&gt;pr#65364&lt;/a&gt;, Igor Fedotov, Md Mahamudur Rahaman Sajib)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: nudge log for unstable locks after early reply (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64540&quot;&gt;pr#64540&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: prevent duplicate wrlock acquisition for a single request (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61839&quot;&gt;pr#61839&lt;/a&gt;, Xiubo Li, Sunnatillo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: session in the importing state cannot be cleared if an export subtree task is interrupted while the state of importer is acking (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61514&quot;&gt;pr#61514&lt;/a&gt;, Zhansong Gao)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mds: use SimpleLock::WAIT_ALL for wait mask (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67495&quot;&gt;pr#67495&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;memory lock issues causing hangs during connection shutdown (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65786&quot;&gt;pr#65786&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/alerts: enforce ssl context to SMTP_SSL (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66142&quot;&gt;pr#66142&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/cephadm: Fix unfound progress events (&lt;a href=&quot;https://github.com/ceph/ceph/pull/58450&quot;&gt;pr#58450&lt;/a&gt;, Prashant D)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/DaemonState: Minimise time we hold the DaemonStateIndex lock (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65463&quot;&gt;pr#65463&lt;/a&gt;, Brad Hubbard)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: adapt service creation form to support nvmeof creation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63304&quot;&gt;pr#63304&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add &lt;span&gt;&lt;/span&gt;.nvmrc so ci can pick the node version (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64666&quot;&gt;pr#64666&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Add ceph_daemon filter to rgw overview grafana panel queries (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62268&quot;&gt;pr#62268&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: add prometheus read permission to cluster_mgr role (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62651&quot;&gt;pr#62651&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Dashboard not showing Object/Overview correctly (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62664&quot;&gt;pr#62664&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix access control permissions for roles (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62455&quot;&gt;pr#62455&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix empty ceph version in GET api/hosts (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62730&quot;&gt;pr#62730&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: Fix inline markup warning in API documentation (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64270&quot;&gt;pr#64270&lt;/a&gt;, Kefu Chai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix make check tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63186&quot;&gt;pr#63186&lt;/a&gt;, Afreen Misbah)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: fix zone update API forcing STANDARD storage class (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65621&quot;&gt;pr#65621&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: show non default realm sync status in rgw overview page (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65002&quot;&gt;pr#65002&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/dashboard: use system packages when running tox (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64612&quot;&gt;pr#64612&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/nfs: validate path when modifying cephfs export (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62278&quot;&gt;pr#62278&lt;/a&gt;, Dhairya Parmar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/rbd_support: always parse interval and start_time in Schedules::remove() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62964&quot;&gt;pr#62964&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/snap_schedule: fix typo in error message during retention add (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65295&quot;&gt;pr#65295&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/snap_schedule: handle volume delete (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61187&quot;&gt;pr#61187&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: add command to get snapshot path (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62917&quot;&gt;pr#62917&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: don&#39;t delete user-created pool in &amp;quot;volume create&amp;quot; command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63069&quot;&gt;pr#63069&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/vol: print proper message when subvolume metadata filename is too long (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62050&quot;&gt;pr#62050&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: allow disabling async job threads (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62436&quot;&gt;pr#62436&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: fix dangling symlink in clone index (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62109&quot;&gt;pr#62109&lt;/a&gt;, Neeraj Pratap Singh)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65297&quot;&gt;pr#65297&lt;/a&gt;, Enrico Bocchi)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr/volumes: periodically check for async work (&lt;a href=&quot;http://tracker.ceph.com/issues/61867&quot;&gt;issue#61867&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61230&quot;&gt;pr#61230&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: add status command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62505&quot;&gt;pr#62505&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: allow disabling always-on modules (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60563&quot;&gt;pr#60563&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mgr: process map before notifying clients (&lt;a href=&quot;https://github.com/ceph/ceph/pull/57065&quot;&gt;pr#57065&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon [stretch mode]: support disable_stretch_mode &amp;amp; qa/workunits/mon: ensure election strategy is &amp;quot;connectivity&amp;quot; for stretch mode (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60630&quot;&gt;pr#60630&lt;/a&gt;, Laura Flores, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/AuthMonitor: provide command to rotate the key for a user credential (&lt;a href=&quot;https://github.com/ceph/ceph/pull/58236&quot;&gt;pr#58236&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon/test_mon_osdmap_prune: Use first_pinned instead of first_committed (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63343&quot;&gt;pr#63343&lt;/a&gt;, Aishwarya Mathuria)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mon: Track and process pending pings after election (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62925&quot;&gt;pr#62925&lt;/a&gt;, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitor: Enhance historic ops command output and error handling (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64843&quot;&gt;pr#64843&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: add user-agent headers to the urllib (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65473&quot;&gt;pr#65473&lt;/a&gt;, Nizamudeen A)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;monitoring: fix MTU Mismatch alert rule and expr (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65710&quot;&gt;pr#65710&lt;/a&gt;, Aashish Sharma)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;objclass: deprecate cls_cxx_gather (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60195&quot;&gt;pr#60195&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: Disable invoking unittest_deferred (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66359&quot;&gt;pr#66359&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: locally cache compressor engines that have been used (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62145&quot;&gt;pr#62145&lt;/a&gt;, Igor Fedotov, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: fix bdev expansion and more (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62216&quot;&gt;pr#62216&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: Fix ExtentDecoderPartial::_consume_new_blob (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62054&quot;&gt;pr#62054&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: Fix race in BlueFS truncate / remove (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62840&quot;&gt;pr#62840&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: In BlueFS::truncate accept weird alloc_unit (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66056&quot;&gt;pr#66056&lt;/a&gt;, Adam Kupczyk)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;os/bluestore: make BlueFS an exclusive selector for volume reserved (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62721&quot;&gt;pr#62721&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scheduler/OpSchedulerItem: Fix calculation of recovery latency counters (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62801&quot;&gt;pr#62801&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: allow longer waits for replicas to respond (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63940&quot;&gt;pr#63940&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd/scrub: discard repair_oinfo_oid() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62569&quot;&gt;pr#62569&lt;/a&gt;, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: add clear_shards_repaired command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60566&quot;&gt;pr#60566&lt;/a&gt;, Daniel Radjenovic)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: don&#39;t send stale hb msgr&#39;s addresses in MOSDBoot (&lt;a href=&quot;https://github.com/ceph/ceph/pull/56520&quot;&gt;pr#56520&lt;/a&gt;, Radosław Zarzyński)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd: fix osd mclock queue item leak (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62364&quot;&gt;pr#62364&lt;/a&gt;, Samuel Just)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OSD: Split osd_recovery_sleep into settings applied to degraded or clean PGs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62399&quot;&gt;pr#62399&lt;/a&gt;, Md Mahamudur Rahaman Sajib)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;osd_types: Restore new_object marking for delete missing entries (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63152&quot;&gt;pr#63152&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OSDMonitor: exclude destroyed OSDs from &amp;quot;ceph node ls&amp;quot; output (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62326&quot;&gt;pr#62326&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;OSDMonitor: Make sure pcm is initialised (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63805&quot;&gt;pr#63805&lt;/a&gt;, Brad Hubbard)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;PendingReleaseNotes; doc/rados/operations: document &amp;quot;rm-pg-upmap-primary-{all}&amp;quot; commands (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62468&quot;&gt;pr#62468&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;PGMap: remove pool max_avail scale factor (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61320&quot;&gt;pr#61320&lt;/a&gt;, Michael J. Kidd)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr/dashboard: Use teuthology&#39;s actual requirements (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65418&quot;&gt;pr#65418&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: attempt to fix mypy importing from python-common (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63313&quot;&gt;pr#63313&lt;/a&gt;, John Mulligan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: Fix missing empty lines in mgr_module&lt;span&gt;&lt;/span&gt;.py (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64267&quot;&gt;pr#64267&lt;/a&gt;, Ville Ojamo)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/mgr: pin cheroot version in requirements-required&lt;span&gt;&lt;/span&gt;.txt (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65637&quot;&gt;pr#65637&lt;/a&gt;, Nizamudeen A, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: ignore warning that pg is stuck peering for upgrade jobs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65448&quot;&gt;pr#65448&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: randomize configs in &lt;code&gt;fs:thrash:workloads&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61341&quot;&gt;pr#61341&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: switch to ubuntu 22&lt;span&gt;&lt;/span&gt;.04 for stock kernel testing (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62492&quot;&gt;pr#62492&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/cephfs: update ignorelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61383&quot;&gt;pr#61383&lt;/a&gt;, Rishabh Dave)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/multisite: add extra checkpoints in datalog_autotrim testcase (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61508&quot;&gt;pr#61508&lt;/a&gt;, Shilpa Jagannath)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rbd/iscsi: ignore MON_DOWN warning in logs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64596&quot;&gt;pr#64596&lt;/a&gt;, Adam King)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: bump maven version in hadoop task to resolve 404 Not Found (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63927&quot;&gt;pr#63927&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: fix perl tests missing Amazon::S3 module (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64281&quot;&gt;pr#64281&lt;/a&gt;, Mark Kogan)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: remove hadoop-s3a subsuite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64669&quot;&gt;pr#64669&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/rgw: run verify tests with garbage collection disabled (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62953&quot;&gt;pr#62953&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/krbd: use a standard fixed-1 cluster in unmap subsuite (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64918&quot;&gt;pr#64918&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites/orch/cephadm: add PG_DEGRADED to ignorelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63055&quot;&gt;pr#63055&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/suites: wait longer before stopping OSDs with valgrind (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63717&quot;&gt;pr#63717&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/ceph_manager: population must be a sequence (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64748&quot;&gt;pr#64748&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/cephfs/mount: use &#39;ip route&#39; instead of &#39;route&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63129&quot;&gt;pr#63129&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tasks/workunit: fix no module named &#39;pipes&#39; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66252&quot;&gt;pr#66252&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/tests: added initial test for &lt;code&gt;client-upgrade-reef-tentacle&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64761&quot;&gt;pr#64761&lt;/a&gt;, Yuri Weinstein)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa/workunits/fs/misc: remove data pool cleanup (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63017&quot;&gt;pr#63017&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: add missing &lt;span&gt;&lt;/span&gt;.qa links (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67529&quot;&gt;pr#67529&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: Disable OSD benchmark from running for tests (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67067&quot;&gt;pr#67067&lt;/a&gt;, Sridhar Seshasayee)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: enable debug mds/client for fs/nfs suite (&lt;a href=&quot;http://tracker.ceph.com/issues/63482&quot;&gt;issue#63482&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/65251&quot;&gt;pr#65251&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: fix multi-fs tests in test_mds_metrics&lt;span&gt;&lt;/span&gt;.py (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64340&quot;&gt;pr#64340&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: fix test_cephfs_mirror_stats failure (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62116&quot;&gt;pr#62116&lt;/a&gt;, Jos Collin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: ignore pg availability/degraded warnings (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61297&quot;&gt;pr#61297&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: ignore variant of down fs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62092&quot;&gt;pr#62092&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: increase the http&lt;span&gt;&lt;/span&gt;.maxRequestBuffer to 100MB and enable the git debug logs (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61279&quot;&gt;pr#61279&lt;/a&gt;, Xiubo Li)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: suppress OpenSSL valgrind leaks (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65663&quot;&gt;pr#65663&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;qa: use a larger timeout for kernel_untar_build workunit (&lt;a href=&quot;http://tracker.ceph.com/issues/68855&quot;&gt;issue#68855&lt;/a&gt;, &lt;a href=&quot;https://github.com/ceph/ceph/pull/61340&quot;&gt;pr#61340&lt;/a&gt;, Venky Shankar)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rados/test_crash&lt;span&gt;&lt;/span&gt;.sh: add PG_DEGRADED to ignorelist (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62396&quot;&gt;pr#62396&lt;/a&gt;, Shraddha Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: add cluster fsid to remote meta cache key (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66272&quot;&gt;pr#66272&lt;/a&gt;, Mykola Golub)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: allow incomplete demote snapshot to sync after rbd-mirror daemon restart (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66163&quot;&gt;pr#66163&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: prevent image deletion if remote image is not primary (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64738&quot;&gt;pr#64738&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd-mirror: release lock before calling m_async_op_tracker&lt;span&gt;&lt;/span&gt;.finish_op() (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64091&quot;&gt;pr#64091&lt;/a&gt;, VinayBhaskar-V)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rbd: display mirror state creating (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62939&quot;&gt;pr#62939&lt;/a&gt;, N Balachandran)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Recent pipeline backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65250&quot;&gt;pr#65250&lt;/a&gt;, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;resolve pacific/quincy upgrade failures (&lt;a href=&quot;https://github.com/ceph/ceph/pull/67657&quot;&gt;pr#67657&lt;/a&gt;, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/iam: add policy evaluation for Arn-based Conditions (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62434&quot;&gt;pr#62434&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/rados: enable object deletion at rados pool quota (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62094&quot;&gt;pr#62094&lt;/a&gt;, Casey Bodley, Samuel Just)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw/sts: Implementation of validating JWT using modulus and exponent (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63053&quot;&gt;pr#63053&lt;/a&gt;, Pritha Srivastava)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: Try to handle unwatch errors sensibly (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62403&quot;&gt;pr#62403&lt;/a&gt;, Adam C. Emerson)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: add force option to &lt;code&gt;radosgw-admin object rm ...&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64311&quot;&gt;pr#64311&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: add missing last_modified field to swift API (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61553&quot;&gt;pr#61553&lt;/a&gt;, Andrei Ivashchenko)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: allow bucket notification send message to kafka with multiple br… (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61825&quot;&gt;pr#61825&lt;/a&gt;, Hoai-Thu Vuong)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: bring rgw-restore-bucket-index up to current version (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64514&quot;&gt;pr#64514&lt;/a&gt;, J. Eric Ivancich, Michael J. Kidd)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: Changed discard buffer size (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63711&quot;&gt;pr#63711&lt;/a&gt;, Artem Vasilev)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: check all JWKS for STS (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64937&quot;&gt;pr#64937&lt;/a&gt;, Alex Wojno)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: correctly set worker thread names (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63095&quot;&gt;pr#63095&lt;/a&gt;, Milind Changire)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: don&#39;t use merge_and_store_attrs() when recreating a bucket (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64411&quot;&gt;pr#64411&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix &#39;bucket rm --bypass-gc&#39; for copied objects (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66002&quot;&gt;pr#66002&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix bug with rgw-gap-list (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62723&quot;&gt;pr#62723&lt;/a&gt;, J. Eric Ivancich, Michael J. Kidd)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix empty storage class on display of multipart uploads (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64312&quot;&gt;pr#64312&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: fix to correctly store updated attrs in backend store after erasing an attr/attrs for delete ops on a bucket (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61996&quot;&gt;pr#61996&lt;/a&gt;, Pritha Srivastava, Wei Wang)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: Head/GetObject support partNumber (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62544&quot;&gt;pr#62544&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: keep the tails when copying object to itself (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62656&quot;&gt;pr#62656&lt;/a&gt;, Jane Zhu)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: make incomplete multipart upload part of bucket check efficient (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64464&quot;&gt;pr#64464&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: make keystone work without admin token (service ac requirement) (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64200&quot;&gt;pr#64200&lt;/a&gt;, Deepika Upadhyay)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: make rgw-restore-bucket-index more robust (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64622&quot;&gt;pr#64622&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: optimize bucket listing to skip past regions of namespaced entries (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62234&quot;&gt;pr#62234&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: prevent crash in &lt;code&gt;radosgw-admin bucket object shard ...&lt;/code&gt; (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62885&quot;&gt;pr#62885&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: PutObjectLockConfiguration can enable object lock on existing buckets (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62063&quot;&gt;pr#62063&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: radoslist improvements primarily to better support gap list tool (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62418&quot;&gt;pr#62418&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: trigger resharding of versioned buckets sooner (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63598&quot;&gt;pr#63598&lt;/a&gt;, J. Eric Ivancich)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;rgw: update keystone repo stable branch to 2024&lt;span&gt;&lt;/span&gt;.2 (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66243&quot;&gt;pr#66243&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rocky 9/10 support backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64658&quot;&gt;pr#64658&lt;/a&gt;, Zack Cerza, John Mulligan, David Galloway, Alexander Indenbaum)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;run-make-check&lt;span&gt;&lt;/span&gt;.sh backports (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65837&quot;&gt;pr#65837&lt;/a&gt;, John Mulligan, luo rixin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;run-make&lt;span&gt;&lt;/span&gt;.sh: Typo in argument addition (&lt;a href=&quot;https://github.com/ceph/ceph/pull/66690&quot;&gt;pr#66690&lt;/a&gt;, David Galloway)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;scrub: use a generic interface for scheduling timer based events (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63558&quot;&gt;pr#63558&lt;/a&gt;, Samuel Just, Ronen Friedman)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/common/options: Clarify scope of scrub intervals in osd&lt;span&gt;&lt;/span&gt;.yaml&lt;span&gt;&lt;/span&gt;.in (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63490&quot;&gt;pr#63490&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/common: add guidance for deep-scrubbing ratio warning (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62503&quot;&gt;pr#62503&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/common: add guidance for mon_warn_pg_not_scrubbed (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62552&quot;&gt;pr#62552&lt;/a&gt;, Zac Dover)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src/mon/OSDMonitor&lt;span&gt;&lt;/span&gt;.cc: [Stretch Mode] WRN non-existent CRUSH location assigned to MON (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62040&quot;&gt;pr#62040&lt;/a&gt;, Kamoltat Sirivadhna)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;src: modernize sample&lt;span&gt;&lt;/span&gt;.ceph&lt;span&gt;&lt;/span&gt;.conf (&lt;a href=&quot;https://github.com/ceph/ceph/pull/61642&quot;&gt;pr#61642&lt;/a&gt;, Anthony D&#39;Atri)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;suites/rados: cache tier deprecated, no need to keep the tests for it (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62210&quot;&gt;pr#62210&lt;/a&gt;, Nitzan Mordechai)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;sync build-with-container patches from main (&lt;a href=&quot;https://github.com/ceph/ceph/pull/65845&quot;&gt;pr#65845&lt;/a&gt;, John Mulligan, Dan Mick)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tasks/cephfs/mount: use 192&lt;span&gt;&lt;/span&gt;.168&lt;span&gt;&lt;/span&gt;.144&lt;span&gt;&lt;/span&gt;.0&lt;span&gt;&lt;/span&gt;.0/20 for brxnet (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63134&quot;&gt;pr#63134&lt;/a&gt;, Kyr Shatskyy)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/common: unittest_fault_injector omits unit-main target (&lt;a href=&quot;https://github.com/ceph/ceph/pull/63979&quot;&gt;pr#63979&lt;/a&gt;, Casey Bodley)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/librbd/test_notify&lt;span&gt;&lt;/span&gt;.py: conditionally ignore some errors (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62688&quot;&gt;pr#62688&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/librbd/test_notify&lt;span&gt;&lt;/span&gt;.py: force line-buffered output (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62751&quot;&gt;pr#62751&lt;/a&gt;, Ilya Dryomov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;test/rbd: remove unit tests about cache tiering (&lt;a href=&quot;https://github.com/ceph/ceph/pull/64588&quot;&gt;pr#64588&lt;/a&gt;, Laura Flores)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;TEST_backfill_grow fails after finding &amp;quot;num_bytes mismatch&amp;quot; in osd log (&lt;a href=&quot;https://github.com/ceph/ceph/pull/60901&quot;&gt;pr#60901&lt;/a&gt;, Mohit Agrawal)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tools/ceph-objectstore-tool: tricks to tolerate disk errors for &amp;quot;pg export&amp;quot; command (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62122&quot;&gt;pr#62122&lt;/a&gt;, Igor Fedotov)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wip trackers 50371 67352 67489 69639 reef (&lt;a href=&quot;https://github.com/ceph/ceph/pull/62473&quot;&gt;pr#62473&lt;/a&gt;, Brad Hubbard, Patrick Donnelly)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>Assessing the performance of the CLAY Erasure Code Plugin</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part4/" />
    <updated>2026-02-11T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part4/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part4/">&lt;p&gt;CBT Performance Benchmarking - Part 4. What can we say about CLAY?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;&lt;a id=&quot;outline&quot;&gt;&lt;/a&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 4&lt;/strong&gt; - Assessing the performance of the CLAY erasure code plugin&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#client&quot;&gt;Client IO results for CLAY&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#down&quot;&gt;Client IO with an OSD down&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#good&quot;&gt;What is CLAY good at?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#probs&quot;&gt;Problems with using CLAY&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read&quot;&gt;How does CLAY read data from the drive?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#broke&quot;&gt;CLAY is broken in tentacle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;client-io-results-for-clay&quot;&gt;&lt;a id=&quot;client&quot;&gt;&lt;/a&gt;Client IO results for CLAY &lt;a class=&quot;link-anchor&quot; href=&quot;#client-io-results-for-clay&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As a refresher, let&#39;s quickly look back at the &lt;strong&gt;client IO&lt;/strong&gt; results of &lt;strong&gt;CLAY&lt;/strong&gt; compared to &lt;strong&gt;JErasure&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;If we look back to &lt;strong&gt;Step 3&lt;/strong&gt; in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; of the blog &lt;code&gt;(Generating a comparison report)&lt;/code&gt;, we saw that &lt;strong&gt;reads&lt;/strong&gt; had practically identical curves between CLAY &amp;amp; JErasure for both &lt;strong&gt;4K random reads&lt;/strong&gt; and &lt;strong&gt;1024K sequential reads&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;However, when we compared &lt;strong&gt;writes&lt;/strong&gt; we saw that the performance hit for CLAY was substantially larger, particularly at higher bandwidths. The &lt;strong&gt;1024K Sequential Writes&lt;/strong&gt; diagram illustrates this:&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Click to see Part 3 diagrams&lt;/summary&gt;
&lt;p&gt;&lt;img src=&quot;images/part_3_diag.jpg&quot; alt=&quot;alt text&quot; title=&quot;part 3 reference&quot;&gt;&lt;/p&gt;
&lt;/details&gt;
&lt;p&gt;&lt;strong&gt;So why was this?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is because CLAY&#39;s &lt;strong&gt;encoding process&lt;/strong&gt; is significantly more complex. While JErasure performs a single encoding pass, CLAY uses three phases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;50% of data is encoded using &lt;strong&gt;PRT&lt;/strong&gt; (Product Recovery Transform), 50% of the data is copied to form an intermediate set of buffers&lt;/li&gt;
&lt;li&gt;All the intermediate data is encoded using &lt;strong&gt;RS&lt;/strong&gt; (Reed-Solomon) to form a second set of intermediate buffers&lt;/li&gt;
&lt;li&gt;50% of the result is encoded using &lt;strong&gt;PFT&lt;/strong&gt; (Parity Fractional Transform), 50% of the data is copied to form the output buffers&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Essentially, CLAY performs &lt;strong&gt;2x&lt;/strong&gt; the encoding plus an &lt;strong&gt;additional&lt;/strong&gt; memcpy (memory copy) compared to JErasure&#39;s 1x encoding. This overhead therefore directly translates to &lt;strong&gt;lower write throughput&lt;/strong&gt; for CLAY, as shown by the diagrams above. The performance impact increases for larger IO sizes because more data is being encoded.&lt;/p&gt;
&lt;p&gt;See &lt;a href=&quot;https://people.iith.ac.in/mynav/pdfs/talks/Clay_Fast18.pdf&quot;&gt;&#39;Clay Codes: Moulding MDS Codes to Yield an MSR Code&#39;&lt;/a&gt; for more information on CLAY&#39;s encoding process.&lt;/p&gt;
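&lt;p&gt;To get a feel for the raw encode-cost difference on your own hardware, the &lt;code&gt;ceph_erasure_code_benchmark&lt;/code&gt; utility that ships with Ceph&#39;s test binaries can encode buffers with each plugin in isolation. The invocation below is only a sketch: the flag names are assumptions based on the tool&#39;s built-in help and may differ between releases, so check them before relying on the numbers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Baseline: JErasure 4+2, encoding 1 MiB buffers (illustrative invocation)
ceph_erasure_code_benchmark --plugin jerasure --workload encode &#92;
  --size 1048576 --iterations 1000 &#92;
  --parameter k=4 --parameter m=2 --parameter technique=reed_sol_van

# Same workload with CLAY 4+2 (d=5); the extra transform passes and memcpy
# should show up as a lower encode rate than the JErasure run above
ceph_erasure_code_benchmark --plugin clay --workload encode &#92;
  --size 1048576 --iterations 1000 &#92;
  --parameter k=4 --parameter m=2 --parameter d=5
&lt;/code&gt;&lt;/pre&gt;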
&lt;hr&gt;
&lt;h2 id=&quot;client-io-with-an-osd-down&quot;&gt;&lt;a id=&quot;down&quot;&gt;&lt;/a&gt;Client IO with an OSD down &lt;a class=&quot;link-anchor&quot; href=&quot;#client-io-with-an-osd-down&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;details&gt;
&lt;summary&gt;Click to see Part 3 diagram&lt;/summary&gt;
&lt;p&gt;&lt;img src=&quot;images/part_3_down_ref.png&quot; alt=&quot;alt text&quot; title=&quot;part 3 reference with OSD down&quot;&gt;&lt;/p&gt;
&lt;/details&gt;
&lt;p&gt;We then moved on to &lt;strong&gt;Step 4&lt;/strong&gt; in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; of the blog &lt;code&gt;(Running a test with an OSD down)&lt;/code&gt;, and we saw that CLAY&#39;s performance degraded further. The read curves are no longer near-identical (as shown by the above diagram). CLAY is clearly performing worse in this scenario, which we did not initially expect.&lt;/p&gt;
&lt;p&gt;This latency increase is due to the specific implementation of CLAY within Ceph. For &lt;strong&gt;degraded&lt;/strong&gt; read IOs (when a client requests data from a missing shard), the system is configured to read and decode all the data to reconstruct the missing information. Just as the &lt;strong&gt;encode&lt;/strong&gt; process (for write IOs) has higher overheads when using CLAY, the &lt;strong&gt;decode&lt;/strong&gt; process (for degraded read IOs) has similarly higher overheads. This is an implementation choice - when recovering objects (see next section) CLAY uses a more efficient method for recovering the data. This method could also have been used for degraded reads.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;what-is-clay-good-at%3F&quot;&gt;&lt;a id=&quot;good&quot;&gt;&lt;/a&gt;What is CLAY good at? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-is-clay-good-at%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now you may be thinking: if CLAY is slower for writes and degraded reads, why use it? The answer is &lt;strong&gt;network bandwidth optimisation&lt;/strong&gt; during &lt;strong&gt;backfill&lt;/strong&gt; and &lt;strong&gt;recovery&lt;/strong&gt;, the processes that use the erasure code to reconstruct and repair the missing parts of objects.&lt;/p&gt;
&lt;p&gt;While JErasure must read &lt;strong&gt;k&lt;/strong&gt; data shards&#39; worth of data to reconstruct &lt;strong&gt;one&lt;/strong&gt; missing shard, CLAY uses coupled layers to reconstruct the same shard from a significantly smaller amount of data read from the remaining shards. In a standard 4+2 setup, JErasure would need to pull 100% of the data from the other 4 shards to rebuild the 5th.&lt;/p&gt;
&lt;p&gt;This is what it would look like if we were to use JErasure and simulate a recovery of data when shard 0 is missing:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/jerasure1a.jpg&quot; alt=&quot;alt text&quot; title=&quot;jerasure eg&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now we will compare this to how CLAY would recover data if shard 0 was missing. &lt;code&gt;CLAY reduces this traffic by approximately 50%&lt;/code&gt; as you can see in the below example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/clay1b.jpg&quot; alt=&quot;alt text&quot; title=&quot;clay eg&quot;&gt;&lt;/p&gt;
&lt;p&gt;It&#39;s important to note that the above configuration has a chosen non-default stripe unit of &lt;strong&gt;32K&lt;/strong&gt; which, with a 4+2 CLAY code, results in a sub-chunk size of &lt;strong&gt;4K&lt;/strong&gt; and matches both the &lt;strong&gt;NVMe block size&lt;/strong&gt; and the &lt;strong&gt;Bluestore allocation unit&lt;/strong&gt;. See the Ceph documentation &lt;a href=&quot;https://docs.ceph.com/en/latest/rados/operations/erasure-code-clay/&quot;&gt;here&lt;/a&gt; for how to calculate the sub-chunk size for your configuration.&lt;/p&gt;
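&lt;p&gt;As a concrete illustration, such a profile could be created roughly as follows. This is only a sketch: the profile and pool names are hypothetical, and you should confirm the parameters against the CLAY documentation linked above before using them on a real cluster.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Illustrative 4+2 CLAY profile (d=5) with a 32K stripe unit
ceph osd erasure-code-profile set clay_4_2_32k plugin=clay k=4 m=2 d=5 &#92;
  stripe_unit=32K crush-failure-domain=host

# Sub-chunk arithmetic per the CLAY docs: q = d-k+1 = 2,
# sub-chunks per chunk = q^ceil((k+m)/q) = 2^3 = 8,
# so a 32K stripe unit gives 32K / 8 = 4K sub-chunks
# (and a 4K stripe unit would give 512-byte sub-chunks)

# Create an erasure-coded pool that uses the profile
ceph osd pool create clay_test_pool 64 64 erasure clay_4_2_32k
&lt;/code&gt;&lt;/pre&gt;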
&lt;p&gt;In the CLAY example above &lt;strong&gt;more&lt;/strong&gt; data shards are read, but overall &lt;strong&gt;less&lt;/strong&gt; data is read. For our configuration, CLAY is therefore more efficient at recovering data when a shard is missing.&lt;/p&gt;
&lt;p&gt;With this erasure code profile CLAY will always read &lt;strong&gt;50%&lt;/strong&gt; of each other shard to recover a missing shard, however the subchunks that are read will vary depending on which shard is missing. The next diagram shows which subchunks will be read for each missing shard:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/clay2.jpg&quot; alt=&quot;alt text&quot; title=&quot;clay 2eg&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: While the diagram shows a 50% saving in network traffic, this comes at the &lt;code&gt;cost of IOPS&lt;/code&gt;. Shards 4 and 5 must perform &lt;strong&gt;four&lt;/strong&gt; individual reads per stripe to gather those specific sub-chunks, so the IO cost depends on which shard is missing.&lt;/p&gt;
&lt;p&gt;In summary, CLAY reads much less data than JErasure during recovery/backfill, saving approximately &lt;strong&gt;50%&lt;/strong&gt; of the network bandwidth, which should improve recovery performance on systems that are limited by the network.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;problems-with-using-clay&quot;&gt;&lt;a id=&quot;probs&quot;&gt;&lt;/a&gt;Problems with using CLAY &lt;a class=&quot;link-anchor&quot; href=&quot;#problems-with-using-clay&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Choosing your stripe unit is critical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If stripe unit is 4K:&lt;/strong&gt; Sub-chunks become tiny (512 bytes) and reads of less than 4K are &lt;code&gt;rounded up&lt;/code&gt; to 4K.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This leads to &lt;strong&gt;extra&lt;/strong&gt; data reads because the NVMe block size is 4K. Recovery therefore reads &lt;strong&gt;1x to 4x&lt;/strong&gt; the amount of data from the drives while transmitting &lt;strong&gt;50% less&lt;/strong&gt; data across the network, and it still costs many more &lt;strong&gt;IOPS&lt;/strong&gt; and much more &lt;strong&gt;CPU&lt;/strong&gt; in this scenario.&lt;/p&gt;
&lt;p&gt;Let&#39;s break this down a step further using examples of shards 0, 1 and 2 missing. The blue shows the amount of data that we want to read, while the orange shows the amount of data that is actually read due to these alignment issues.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/helper.jpg&quot; alt=&quot;alt text&quot; title=&quot;helper eg&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If stripe unit is 32K:&lt;/strong&gt; This fixes the fragmentation issue that we see above (sub-chunks align better with 4K drive blocks), but introduces some problems for both classic and fast EC:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a classic EC pool, any overwrite requires reading the &lt;strong&gt;entire&lt;/strong&gt; stripe, even if you only changed &lt;strong&gt;one&lt;/strong&gt; byte. At 32K, small writes become incredibly expensive because of the &lt;code&gt;Read-Modify-Write&lt;/code&gt; overhead. In classic EC, objects are padded to a multiple of the stripe width, so a larger stripe unit &lt;strong&gt;increases&lt;/strong&gt; wasted capacity. In fast EC, objects are not padded, but a larger stripe unit still results in &lt;strong&gt;more&lt;/strong&gt; coding parity data and &lt;strong&gt;less&lt;/strong&gt; storage efficiency. So there are still negatives to bear in mind if you pick a stripe unit of 32K.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;how-does-clay-read-data-from-the-drive%3F&quot;&gt;&lt;a id=&quot;read&quot;&gt;&lt;/a&gt;How does CLAY read data from the drive? &lt;a class=&quot;link-anchor&quot; href=&quot;#how-does-clay-read-data-from-the-drive%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;fragmented-reads&quot;&gt;Fragmented Reads &lt;a class=&quot;link-anchor&quot; href=&quot;#fragmented-reads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As shown above, CLAY issues &lt;strong&gt;fragmented reads&lt;/strong&gt;. If the stripe unit gets smaller, for example &lt;strong&gt;4K&lt;/strong&gt;, the sub-chunk size drops to &lt;strong&gt;512 bytes&lt;/strong&gt;. Because NVMe and HDD drives have a minimum block size of &lt;strong&gt;4K&lt;/strong&gt;, any 512 byte read is &lt;strong&gt;rounded up&lt;/strong&gt; to this 4K minimum. This can result in CLAY reading the same 4K block multiple times to extract different 512 byte sub-chunks, discarding the rest of the data. This wastes &lt;strong&gt;CPU&lt;/strong&gt; and &lt;strong&gt;drive IOPs&lt;/strong&gt;, so if either of these is your performance bottleneck this is not a good scenario.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Squid&lt;/strong&gt; recovery also always tries to read &lt;strong&gt;2MB&lt;/strong&gt; from each stripe and expects the read to be truncated if the object is smaller than &lt;code&gt;2MB * number of stripes&lt;/code&gt;. With CLAY this results in a lot of small reads being issued beyond the end of the object. While these quickly fail and do not stop CLAY from recovering the data, they do waste &lt;strong&gt;additional CPU resources&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Referring to the same &lt;a href=&quot;https://people.iith.ac.in/mynav/pdfs/talks/Clay_Fast18.pdf&quot;&gt;paper&lt;/a&gt; as before: its results show that encoding data can take up to &lt;strong&gt;70%&lt;/strong&gt; longer in terms of CPU usage; if your cluster &lt;strong&gt;isn&#39;t&lt;/strong&gt; CPU limited you won&#39;t notice this. The same results also showed dramatic savings in &lt;strong&gt;backfill&lt;/strong&gt; and &lt;strong&gt;recovery&lt;/strong&gt; time - but they were obtained on a system that was network limited and used much wider erasure codes (a 26 node cluster) than most people would typically deploy.&lt;/p&gt;
&lt;p&gt;There is scope to improve the implementation of CLAY - currently the reads are issued &lt;strong&gt;serially&lt;/strong&gt;, which adds a lot of latency to the recovery. A more efficient approach would be to issue the reads in &lt;strong&gt;parallel&lt;/strong&gt; using &lt;code&gt;readv&lt;/code&gt;, or to read the entire stripe into memory once and then transmit only the required data over the network. The latter would be the better method: it would trade &lt;strong&gt;drive bandwidth&lt;/strong&gt; for a considerable saving in &lt;strong&gt;CPU utilisation&lt;/strong&gt; and &lt;strong&gt;drive IOPs&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&quot;more-in-depth%3A&quot;&gt;More in depth: &lt;a class=&quot;link-anchor&quot; href=&quot;#more-in-depth%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We went over the 3 phases of how CLAY encodes data earlier. Decoding is also done in 3 phases, but on half the quantity of data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;25% of the data is decoded using &lt;strong&gt;PRT&lt;/strong&gt;, 25% of the data is copied to form an intermediate set of buffers&lt;/li&gt;
&lt;li&gt;All (50%) of the intermediate data is decoded using &lt;strong&gt;RS&lt;/strong&gt; to form a 2nd set of intermediate buffers&lt;/li&gt;
&lt;li&gt;25% of the data is decoded using &lt;strong&gt;PFT&lt;/strong&gt;, 25% of the data is copied to form the output data&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore, CLAY has an additional &lt;strong&gt;0.5x memcpy&lt;/strong&gt; of the data and the &lt;strong&gt;same&lt;/strong&gt; decoding cost as JErasure. Hence there is slightly more overhead for CLAY (memcpys plus slight inefficiencies from performing several smaller decodes rather than one large decode). However, CLAY requires less data to perform the recovery, so we save on &lt;strong&gt;network bandwidth&lt;/strong&gt; (and, if implemented correctly, &lt;strong&gt;drive bandwidth&lt;/strong&gt;).&lt;/p&gt;
&lt;p&gt;To round off:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CLAY has &lt;strong&gt;higher&lt;/strong&gt; encoding costs and the &lt;strong&gt;same&lt;/strong&gt; decoding cost&lt;/li&gt;
&lt;li&gt;CLAY has some memcpy&#39;s that JErasure does not have&lt;/li&gt;
&lt;li&gt;CLAY has multiple encode/decode steps and there will be some small overheads/inefficiencies - for example, encoding 12K of data in 3 batches of 4K (CLAY) versus encoding 12K of data in 1 batch (JErasure)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;clay-is-broken-in-tentacle&quot;&gt;&lt;a id=&quot;broke&quot;&gt;&lt;/a&gt;CLAY is broken in Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#clay-is-broken-in-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When performing benchmarking on the Tentacle release, a significant issue was discovered: The recovery benefit was &lt;strong&gt;non-existent&lt;/strong&gt; for Tentacle.&lt;/p&gt;
&lt;p&gt;In the tests, recovery in Tentacle transmitted the &lt;strong&gt;full&lt;/strong&gt; amount of data, behaving like standard JErasure but with the &lt;strong&gt;higher&lt;/strong&gt; CPU overhead of CLAY. Squid is not affected, which is why it was used for the updated performance benchmarking shown throughout this blog.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;summary&quot;&gt;&lt;a id=&quot;summary&quot;&gt;&lt;/a&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;CLAY is a fascinating project and definitely has potential, but for the average user it remains niche.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I&#39;d recommend CLAY if:&lt;/strong&gt; Your cluster is strictly &lt;strong&gt;Network Bottlenecked&lt;/strong&gt; and you use wide erasure codes (e.g. 20+ nodes), where the 50% saving is very considerable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I&#39;d recommend you avoid CLAY if:&lt;/strong&gt; You are &lt;strong&gt;CPU&lt;/strong&gt; or &lt;strong&gt;IOPs&lt;/strong&gt; limited, or if you primarily use HDDs, as the fragmented serial reads will cripple recovery performance.&lt;/p&gt;
&lt;p&gt;For most production environments, I believe the simplicity and predictable performance of JErasure remain the better choice.&lt;/p&gt;
&lt;p&gt;Please note that there is a plan to end support for CLAY from the V release. Please see &lt;a href=&quot;https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/&quot;&gt;here&lt;/a&gt; for more details.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&quot;https://ceph.io/en/community/connect/&quot;&gt;Link to connect with Ceph on slack&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Contact us in the &lt;strong&gt;#cbt&lt;/strong&gt; channel in the Ceph on slack workspace above!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;#outline&quot;&gt;Link to previous parts of the blog series&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>RGW Bucket Resharding Without Pausing</title>
    <link href="https://ceph.io/en/news/blog/2026/rgw-improved-resharding/" />
    <updated>2026-02-01T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/rgw-improved-resharding/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/rgw-improved-resharding/">&lt;h2 id=&quot;introduction%3A-the-foundation-of-scalable-object-storage&quot;&gt;Introduction: The Foundation of Scalable Object Storage &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-the-foundation-of-scalable-object-storage&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the modern data landscape, object storage has evolved from a simple file
repository into the foundational layer for AI/ML pipelines, data lakehouses,
real-time analytics, and massive-scale archival systems. At the heart of this
evolution is a deceptively simple
question: &lt;strong&gt;How do you efficiently locate and access billions of objects stored in a single bucket?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The answer lies in one of Ceph&#39;s most critical performance
mechanisms: &lt;strong&gt;bucket index sharding&lt;/strong&gt;. This architectural
pattern divides a bucket&#39;s index into multiple parallel structures, enabling
concurrent operations across thousands of objects while maintaining the
consistency and reliability that enterprise workloads demand.&lt;/p&gt;
&lt;p&gt;But there&#39;s always been a catch. As workloads grow and evolve, buckets need
to be resharded. Historically, when the buckets to be resharded had a vast
number of objects, this operation came with a painful trade-off: blocking
client writes from seconds to minutes, with a chance of causing application
disruptions, 504 Gateway errors, and operational headaches.&lt;/p&gt;
&lt;p&gt;With Ceph Tentacle, we&#39;re eliminating this trade-off. The new near-zero
impact bucket resharding architecture transforms what was once a
maintenance window event into a seamless background operation
that your applications will never notice.&lt;/p&gt;
&lt;p&gt;Note: As of 2026/02/05, the functionality described in this article is expected
in an upcoming Tentacle update.&lt;/p&gt;
&lt;h3 id=&quot;executive-summary&quot;&gt;Executive Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#executive-summary&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The Challenge&lt;/strong&gt;: In Ceph Squid, resharding a 20-million-object bucket blocked
writes for 4+ minutes, returning 504 errors. Even larger buckets (500M objects)
required 94 minutes of complete write unavailability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Ceph Tentacle&#39;s two-phase architecture moves the heavy lifting to a non-blocking background phase, eliminating the impact on client I/O.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Results&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/24e09a3f-0c15-4b5d-8ec4-c37508617d7a.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;(note: in this graphic 8.1 refers to Squid and 9.0 to Tentacle)&lt;/p&gt;
&lt;p&gt;In this deep dive, we&#39;ll explore:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Why bucket sharding is essential for modern workloads&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The challenges of resharding in Ceph Squid and earlier versions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The enhanced two-phase architecture in Ceph Tentacle&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Before/after performance comparison from production testing&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The future of bucket indexing with in-order sharding&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;the-scalability-enabler%3A-understanding-bucket-index-sharding&quot;&gt;The Scalability Enabler: Understanding Bucket Index Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#the-scalability-enabler%3A-understanding-bucket-index-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;the-bucket-index-and-omaps&quot;&gt;The Bucket Index and omaps &lt;a class=&quot;link-anchor&quot; href=&quot;#the-bucket-index-and-omaps&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Ceph&#39;s Object Gateway (RGW), the ability to list bucket contents is fundamental
to object storage operations. The Object Gateway implements this using a
dedicated structure called the bucket index, which maintains an inventory of
all objects in a bucket. This index is stored using a special RADOS feature
called the Object Map (omap) - essentially a key-value store associated with a
RADOS object, physically residing in the RocksDB database on each OSD&#39;s DB partition.&lt;/p&gt;
&lt;p&gt;Without sharding, a bucket&#39;s entire index is stored in a single RADOS object.
While elegant in its simplicity, this creates a fundamental performance problem:&lt;/p&gt;
&lt;p&gt;The Single-Index Bottleneck: Since only one operation can modify this index at
a time, you&#39;re looking at complete serialization. Write operations must queue
and wait their turn to update the index. As your bucket grows to millions of
objects with thousands of concurrent write operations, this serialization
becomes a severe bottleneck.&lt;/p&gt;
&lt;p&gt;Think of it like a busy airport with only one runway. No matter how many planes
are waiting to land, only one can touch down at a time.&lt;/p&gt;
&lt;h3 id=&quot;sharding%3A-parallelism-through-distribution&quot;&gt;Sharding: Parallelism Through Distribution &lt;a class=&quot;link-anchor&quot; href=&quot;#sharding%3A-parallelism-through-distribution&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Bucket index sharding&lt;/strong&gt; solves this bottleneck by dividing the index into
multiple parts (shards), with each shard stored as a separate RADOS object
within the &lt;code&gt;.rgw.buckets.index&lt;/code&gt; pool. When an object is written, the Ceph
Object Gateway (RGW) calculates a hash of the object&#39;s name to determine
which shard should receive the index update. This enables multiple operations
to run concurrently across multiple Placement Groups (PGs), distributing
requests among the OSDs that host the index pool.&lt;/p&gt;
&lt;p&gt;Returning to our airport analogy: you now have multiple runways, each handling
different aircraft simultaneously. The more runways (shards) you have, the more
parallel operations you can support.&lt;/p&gt;
&lt;p&gt;The sharding mechanism uses the &lt;code&gt;rgw_max_objs_per_shard&lt;/code&gt; tunable (default:
100,000 objects per shard) to determine optimal distribution.&lt;/p&gt;
&lt;p&gt;We recommend maintaining no more than 102,400 objects per shard for optimal performance.&lt;/p&gt;
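&lt;p&gt;To see how this plays out on an existing bucket, the shard count and the tunable can be inspected with standard tooling. The commands below are a minimal sketch; the bucket name is hypothetical, and output field names can vary slightly between releases:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Current shard count and object counts for a bucket (name is illustrative)
radosgw-admin bucket stats --bucket=analytics-lakehouse | grep -E &amp;quot;num_shards|num_objects&amp;quot;

# Report buckets whose shards are approaching or exceeding the fill threshold
radosgw-admin bucket limit check

# The per-shard object threshold that dynamic resharding works from
ceph config get client.rgw rgw_max_objs_per_shard
&lt;/code&gt;&lt;/pre&gt;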
&lt;h3 id=&quot;why-single-bucket-scale-is-mission-critical-in-modern-object-workloads&quot;&gt;Why Single-Bucket Scale is Mission-Critical in Modern Object Workloads &lt;a class=&quot;link-anchor&quot; href=&quot;#why-single-bucket-scale-is-mission-critical-in-modern-object-workloads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Here&#39;s where bucket sharding becomes even more critical: modern analytics
architectures are converging on single-bucket designs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Data Lakehouse Pattern&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apache Iceberg, Apache Hudi, and Delta Lake (the table formats revolutionizing
data architecture) organize petabytes of data within a single bucket using
hierarchical prefixes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;s3://analytics-lakehouse/
├── warehouse/
│   ├── sales_db/
│   │   └── transactions/
│   │       ├── data/
│   │       │   ├── year=2025/month=11/
│   │       │   │   ├── 00045-23-a1b2c3d4.parquet
│   │       │   │   └── 00046-24-e5f6g7h8.parquet
│   │       │   └── year=2025/month=10/
│   │       └── metadata/
│   │           ├── v1.metadata.json
│   │           ├── v2.metadata.json
│   │           └── snap-1234567890.avro
│   └── customer_db/
│       └── profiles/
├── staging/
└── archive/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The implication?&lt;/strong&gt; Modern data platforms need buckets that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Scale to billions of objects distributed across thousands of prefixes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Handle mixed workloads: batch ETL, interactive queries, real-time streaming&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adapt dynamically to growth and contraction without downtime&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Maintain sub-second listing performance across massive object counts&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is precisely where seamless resharding becomes absolutely critical.&lt;/p&gt;
&lt;h2 id=&quot;the-challenge%3A-resharding-in-ceph-squid-and-earlier&quot;&gt;The Challenge: Resharding in Ceph Squid and Earlier &lt;a class=&quot;link-anchor&quot; href=&quot;#the-challenge%3A-resharding-in-ceph-squid-and-earlier&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Understanding the improvements in Ceph Tentacle requires understanding the challenges of the previous approach.&lt;/p&gt;
&lt;h3 id=&quot;the-blocking-resharding-process&quot;&gt;The Blocking Resharding Process &lt;a class=&quot;link-anchor&quot; href=&quot;#the-blocking-resharding-process&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Ceph Squid and earlier versions, bucket resharding followed this process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Resharding operation initiates (manually or via dynamic resharding)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All client write operations are blocked to the bucket&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Index entries are copied from source shards to destination shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Applications receive 504 Gateway Timeout errors&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Operations teams monitor progress&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In buckets with small object counts, resharding was almost unnoticeable, but as
object counts grew, the write pause could last from minutes to hours, depending
on bucket size. Read operations continued, but the write unavailability required
careful planning for production workloads.&lt;/p&gt;
&lt;h3 id=&quot;the-operational-impact&quot;&gt;The Operational Impact &lt;a class=&quot;link-anchor&quot; href=&quot;#the-operational-impact&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This blocking behavior created several operational constraints:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maintenance Windows&lt;/strong&gt;: Resharding typically requires scheduling during off-peak
hours with advance notification to application teams.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Capacity Planning Tradeoffs&lt;/strong&gt;: Teams set high initial shard counts based on
pre-sharding usage estimates for the bucket, but these are hard to calculate up front.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Resharding Concerns&lt;/strong&gt;: Automatic reshards could trigger during peak
business hours, potentially causing disruptions. Some organizations disabled dynamic resharding entirely and managed sharding manually.&lt;/p&gt;
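&lt;p&gt;For operators who took that route, the relevant knobs look roughly like this. This is a sketch using the standard RGW option and admin commands; confirm the defaults and behaviour for your release before changing anything:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Is dynamic resharding currently enabled for the RGW daemons?
ceph config get client.rgw rgw_dynamic_resharding

# Buckets currently queued for automatic resharding
radosgw-admin reshard list

# Organizations that wanted full control disabled it and resharded manually
ceph config set client.rgw rgw_dynamic_resharding false
&lt;/code&gt;&lt;/pre&gt;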
&lt;p&gt;Ceph Tentacle addresses these challenges with a fundamentally different approach.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1768993012078/46db01ac-dc0c-46e0-a36c-d1f99096f874.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;the-solution%3A-non-pausing-resharding-in-ceph-tentacle&quot;&gt;The Solution: Non-Pausing Resharding in Ceph Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-non-pausing-resharding-in-ceph-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now let&#39;s explore what changes in Ceph Tentacle - and why it&#39;s transformational.&lt;/p&gt;
&lt;h3 id=&quot;reshard-two-phase-architecture&quot;&gt;Reshard Two-Phase Architecture &lt;a class=&quot;link-anchor&quot; href=&quot;#reshard-two-phase-architecture&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;f0a0b81b-5779-474e-8920-8ede4da5d1a4.jpeg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) engineering team fundamentally redesigned the resharding
process from the ground up. Instead of blocking all writes while copying index entries,
Ceph Tentacle introduces an intelligent two-phase incremental approach that keeps your
applications running:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1: Log Record Phase (Non-Blocking)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;During this phase, which comprises the bulk of the resharding operation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Client writes continue normally&lt;/strong&gt; - no blocking whatsoever&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Index operations are logged&lt;/strong&gt; to source shards alongside regular write operations&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Background migration begins&lt;/strong&gt; - existing index entries start copying to destination shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Change tracking&lt;/strong&gt; - a sophisticated logging mechanism captures all modifications&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Phase 2: Progress Phase (Minimal Pause, zero client impact)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Only after the bulk of entries have been migrated does Phase 2 begin:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Brief write pause&lt;/strong&gt; - milliseconds to low seconds, with negligible client impact&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Log synchronization&lt;/strong&gt; - recent changes recorded during Phase 1 are applied to destination shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Conflict resolution&lt;/strong&gt; - entries modified during migration are reconciled&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bucket stats recalculation&lt;/strong&gt; - metadata is updated to reflect the new shard layout&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cutover&lt;/strong&gt; - bucket switches to the new index layout&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Normal operations resume&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The Key Innovation:&lt;/strong&gt; By recording changes as lightweight logs during Phase 1,
the system only needs to synchronize recent modifications during the brief Phase
2 pause. The bulk of the work - migrating millions of existing entries - happens
entirely in the background while your applications continue writing uninterrupted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backward Compatibility:&lt;/strong&gt; Ceph Tentacle&#39;s resharding maintains compatibility
as a superset of the previous implementation. If some OSD nodes haven&#39;t yet
upgraded, resharding safely fails rather than risking data loss, and the
system checks version compatibility before proceeding.&lt;/p&gt;
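&lt;p&gt;Operationally, the commands are the ones operators already know; what changes in Tentacle is the client-side impact while they run. A minimal sketch, assuming a hypothetical bucket named &lt;code&gt;analytics-lakehouse&lt;/code&gt; and a target of 10,001 shards:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Trigger a manual upshard; in Tentacle, Phase 1 runs in the background
# while client writes continue
radosgw-admin bucket reshard --bucket=analytics-lakehouse --num-shards=10001

# Check the reshard state/progress for the bucket
radosgw-admin reshard status --bucket=analytics-lakehouse

# Queue-driven (dynamic or scheduled) resharding is handled the same way as before
radosgw-admin reshard list
radosgw-admin reshard process
&lt;/code&gt;&lt;/pre&gt;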
&lt;h3 id=&quot;what-this-means-for-your-operations&quot;&gt;What This Means For Your Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#what-this-means-for-your-operations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The practical implications extend far beyond eliminating 504 errors:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Eliminate Maintenance Windows.&lt;/strong&gt; No more scheduling resharding operations for 2 AM on Sunday. Trigger reshards during peak business hours if needed - your applications won&#39;t notice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Enable True Dynamic Scaling&lt;/strong&gt;&lt;br&gt;
Dynamic bucket resharding can now be fully trusted. The automation you&#39;ve wanted - automatic scaling up and down with minimal client interruption.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Production Confidence.&lt;/strong&gt; Deploy resharding changes without coordination, without warning application teams, without anxiety. It just works.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Faster Response to Demand.&lt;/strong&gt; Workload explodes? Trigger an immediate upshard. No more waiting for a maintenance window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Simplified Operations.&lt;/strong&gt; One less thing requiring complex runbooks, escalation procedures, and off-hours coordination. Focus on value-add activities instead.&lt;/p&gt;
&lt;h2 id=&quot;performance-comparison%3A-before-and-after&quot;&gt;Performance Comparison: Before and After &lt;a class=&quot;link-anchor&quot; href=&quot;#performance-comparison%3A-before-and-after&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To validate the architectural improvements, we conducted extensive testing
comparing Ceph Squid and Tentacle under identical conditions. The results
demonstrate the transformational impact of near-zero-impact resharding.&lt;/p&gt;
&lt;h3 id=&quot;test-scenario%3A-small-scale-bucket-with-20-million-objects&quot;&gt;Test Scenario: Small-Scale Bucket with 20 Million Objects &lt;a class=&quot;link-anchor&quot; href=&quot;#test-scenario%3A-small-scale-bucket-with-20-million-objects&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;: Single-site deployment using &lt;code&gt;s3cmd&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bucket size&lt;/strong&gt;: ~20 million objects&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resharding operation&lt;/strong&gt;: Manual upshard (401 → 10,001 shards for 8.1, 307 → 10,001 for 9.0)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test action&lt;/strong&gt;: Upload a 300MB object during active reshard&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;9093c5b6-e337-45ce-bc3c-027b9490d5a6.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Impact:&lt;/strong&gt; Uploads that previously required 4+ minutes due to complete blocking
now complete in 17 seconds for 300MB objects, with zero errors. That&#39;s a 93% reduction
in client-perceived latency - or more accurately, the elimination of the problem entirely.&lt;/p&gt;
&lt;p&gt;From an application perspective, resharding is now completely transparent. Your
applications continue serving requests without any indication that a major
infrastructure operation is happening beneath them.&lt;/p&gt;
&lt;h3 id=&quot;test-scenario%3A-medium-scale-bucket-with-500-million-objects&quot;&gt;Test Scenario: Medium-Scale Bucket with 500 Million Objects &lt;a class=&quot;link-anchor&quot; href=&quot;#test-scenario%3A-medium-scale-bucket-with-500-million-objects&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For larger buckets, the improvements are even more dramatic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test Methodology Note&lt;/strong&gt;: This test was deliberately conducted as a stress
scenario to evaluate behavior under extreme conditions. The cluster was pushed
to near-saturation with concurrent large-object uploads during resharding
operations. This aggressive test configuration amplifies resharding times
significantly beyond typical production scenarios, allowing us to validate
the improvements under worst-case conditions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;: Single-site deployment using s3cmd&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt;: Upload 300MB and 1GB objects during downshard operation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resharding operation&lt;/strong&gt;: Downshard from 10,001 → 1,999 shards&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;: Concurrent large uploads pushing cluster toward capacity limits&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The Results:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/3460f4e3-7fd6-4b8b-9300-aad79fea10da.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Impact:&lt;/strong&gt; While typical production resharding in Ceph Squid would
complete faster than the 94 minutes shown here, this stress test reveals
critical behavior differences. Under load, Ceph Squid&#39;s blocking architecture
creates cascading issues - the longer the reshard takes, the longer
applications are blocked, potentially triggering timeouts and retry
storms. Ceph Tentacle&#39;s non-blocking architecture eliminates this
entire failure mode. Whether resharding takes 10 minutes or 90 minutes,
applications continue operating normally.&lt;/p&gt;
&lt;h3 id=&quot;at-a-glance%3A-the-transformation&quot;&gt;At a Glance: The Transformation &lt;a class=&quot;link-anchor&quot; href=&quot;#at-a-glance%3A-the-transformation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Ceph Squid&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Ceph Tentacle&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Client Impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complete write blocking&lt;/td&gt;
&lt;td&gt;Zero write blocking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;504 Gateway errors&lt;/td&gt;
&lt;td&gt;No errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;20M Object Upshard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4m23s blocked&lt;/td&gt;
&lt;td&gt;17s upload (no pause)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500M Object Downshard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;94 minutes blocked&lt;/td&gt;
&lt;td&gt;5-17s uploads (no pause)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance Window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;td&gt;Not required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic Resharding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Often disabled&lt;/td&gt;
&lt;td&gt;Enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;looking-forward%3A-the-future-of-bucket-indexing&quot;&gt;Looking Forward: The Future of Bucket Indexing &lt;a class=&quot;link-anchor&quot; href=&quot;#looking-forward%3A-the-future-of-bucket-indexing&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The near-zero-impact bucket resharding feature in Ceph Tentacle is
transformational, but it&#39;s part of a broader evolution in how Ceph
handles bucket indexing at scale.&lt;/p&gt;
&lt;h3 id=&quot;in-order-sharding%3A-the-next-frontier&quot;&gt;In-Order Sharding: The Next Frontier &lt;a class=&quot;link-anchor&quot; href=&quot;#in-order-sharding%3A-the-next-frontier&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Currently, RGW&#39;s hashed sharding optimizes for write distribution but presents
challenges for alphabetical listing operations. To fulfill a paginated list
request, RGW must perform a &amp;quot;scatter-gather&amp;quot; operation: querying every shard
and sorting the combined results. For buckets with thousands of shards, this
becomes a bottleneck.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In-order sharding&lt;/strong&gt; (ordered bucket listing) is in active development and will revolutionize listing performance:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Change&lt;/strong&gt;: Instead of using a hash function, objects will be placed into shards based on lexicographical name ordering.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Impact&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;List requests can target specific shard ranges instead of querying all shards.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Paginated listing becomes dramatically faster (query 1-2 shards instead of thousands).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Prefix-based queries (critical for data lakehouses) become highly efficient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Iterating through object keys becomes significantly more performant.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Why This Matters for Data Lakehouses:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apache Iceberg, Hudi, and Delta Lake all rely heavily on prefix-based object discovery:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;s3://lakehouse/warehouse/sales_db/transactions/data/year=2025/month=11/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With in-order sharding, a query for this prefix would hit only the specific
shards containing objects in that lexicographical range - not all 10,000 shards in the bucket.&lt;/p&gt;
&lt;p&gt;Combined with non-pausing resharding, Ceph is building toward virtually unlimited,
performant scalability within a single bucket - exactly what modern data platforms demand.&lt;/p&gt;
&lt;p&gt;For a detailed slide deck on the topic, check out
&lt;a href=&quot;https://cephalocon2025.sched.com/speaker/ivancich&quot;&gt;Eric&lt;/a&gt; Ivancich&#39;s excellent Cephalocon talk:&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=H-CRhw3XLGw&quot;&gt;Video&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://static.sched.com/hosted_files/cephalocon2025/80/Cephalocon%202025%20Ivancich.pdf?_gl=1*153a8oy*_gcl_au*MTgwNzY1MDQ2MC4xNzU4MjA2ODAy*FPAU*MTgwNzY1MDQ2MC4xNzU4MjA2ODAy&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-a-new-era-of-operational-excellence&quot;&gt;Conclusion: A New Era of Operational Excellence &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-a-new-era-of-operational-excellence&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph Tentacle&#39;s near-zero impact bucket resharding represents a
fundamental shift in production object storage operations,
eliminating one of the most significant pain points in large-scale deployments.&lt;/p&gt;
&lt;p&gt;As Ceph continues evolving with features like in-order sharding, the vision
becomes clear: &lt;strong&gt;single-bucket architectures that scale infinitely without operational complexity&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For data lakehouse architects building on Apache Iceberg, for AI/ML engineers
managing billions of training artifacts, and for enterprise architects demanding
the highest availability without operational friction, Ceph Tentacle
delivers the operational maturity that production workloads require.&lt;/p&gt;
&lt;p&gt;*All test configurations were performed on HDD production-equivalent hardware.
Results may vary based on hardware specifications, network topology, and workload
characteristics. Consult the official documentation for detailed configuration guidance and best practices.&lt;/p&gt;
&lt;p&gt;We would like to thank IBM for the time to author these articles.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Mastering IAM in Ceph: Multi-Tenancy, Access Control, and Why ACLs Must Die</title>
    <link href="https://ceph.io/en/news/blog/2026/mastering-iam/" />
    <updated>2026-01-24T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2026/mastering-iam/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2026/mastering-iam/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1767549864482/5e07de10-5b83-4de3-a013-fd9c3f77427a.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;introduction%3A-when-security-theater-becomes-a-real-disaster&quot;&gt;Introduction: When Security Theater Becomes a Real Disaster &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-when-security-theater-becomes-a-real-disaster&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In March 2017, a misconfigured S3 bucket at Verizon exposed the personal
information of 14 million customers. The root cause wasn&#39;t a sophisticated
attack; it was a simple oversight in access permissions. The bucket was
set to be publicly accessible due to S3 permission misconfiguration, and no one
noticed because ACLs were managed separately from the company&#39;s centralized IAM
policies. The security team had implemented careful, identity-based access controls,
but a resource-level ACL silently bypassed them by granting access to &amp;quot;All Users.&amp;quot;&lt;/p&gt;
&lt;p&gt;This scenario repeats constantly across the industry: ACLs creating invisible
access paths that security teams don&#39;t know exist, buckets accidentally exposed
to the public internet, and contractors uploading data that the bucket owner
cannot reliably read or administer, while still consuming capacity.&lt;/p&gt;
&lt;p&gt;Between 2017 and 2019, major companies exposed hundreds of millions of records
via misconfigured S3 permissions (ACLs and/or bucket policies):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verizon (2017)&lt;/strong&gt;: &lt;a href=&quot;https://www.techtarget.com/searchsecurity/news/450422709/Misconfigured-AWS-S3-bucket-exposes-millions-of-Verizon-customers-data&quot;&gt;14 million customers&lt;/a&gt; - An AWS S3 bucket configured for public access exposed names, addresses, account PINs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Facebook (2019)&lt;/strong&gt;: &lt;a href=&quot;https://www.upguard.com/breaches/facebook-user-data-leak&quot;&gt;540 million records&lt;/a&gt; - Third-party apps stored user data in publicly accessible S3 buckets&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Instagram (2019)&lt;/strong&gt;: &lt;a href=&quot;https://www.cpomagazine.com/cyber-security/instagram-breach-exposes-personal-data-of-49-million-users/&quot;&gt;49 million records&lt;/a&gt; - Marketing firm left influencer database unprotected in AWS S3&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The AWS response was clear: &lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2023/04/amazon-s3-security-best-practices-buckets-default/&quot;&gt;since April 2023&lt;/a&gt;,
&lt;strong&gt;all new S3 buckets default to &amp;quot;ACLs disabled&amp;quot;&lt;/strong&gt; (BucketOwnerEnforced) and &lt;strong&gt;Block Public Access enabled&lt;/strong&gt;.
AWS strongly recommends disabling ACLs on existing buckets and migrating to a pure
policy-based model with IAM Accounts architecture.&lt;/p&gt;
&lt;p&gt;If you&#39;re running the Ceph Object Gateway (RGW), you have access to the same IAM
Accounts model introduced in Ceph Squid 19.2.0. This post explains why ACLs must
be disabled immediately and how to implement modern, secure access control with IAM policies.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do This First (Quick Security Wins)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Before reading further, take these two actions on all production buckets:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enable Block Public Access&lt;/strong&gt; - Prevents public exposure via ACLs or bucket policies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deny ACL operations&lt;/strong&gt; - Add explicit deny for &lt;code&gt;s3:PutObjectAcl&lt;/code&gt; and &lt;code&gt;s3:PutBucketAcl&lt;/code&gt; as defense-in-depth&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These changes prevent the attack patterns described in this post; a minimal example of both follows this note. Continue reading to understand why and how.&lt;/p&gt;
&lt;/blockquote&gt;
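&lt;p&gt;A minimal sketch of those two steps against an RGW endpoint. The bucket name, profiles, and endpoint are hypothetical, and the first command assumes your Ceph release supports the S3 PublicAccessBlock API; verify against your environment before applying it broadly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ export RGW_ENDPOINT=&amp;quot;https://rgw.example.com&amp;quot;

# 1. Block public access on the bucket (stops public ACLs and public policies)
$ aws --profile account-root --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-public-access-block &#92;
  --bucket bucketacl &#92;
  --public-access-block-configuration &#92;
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# 2. Defense-in-depth: explicitly deny ACL modifications via a bucket policy
$ cat &amp;gt; deny-acl.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Sid&amp;quot;: &amp;quot;DenyACLWrites&amp;quot;,
    &amp;quot;Effect&amp;quot;: &amp;quot;Deny&amp;quot;,
    &amp;quot;Principal&amp;quot;: &amp;quot;*&amp;quot;,
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:PutObjectAcl&amp;quot;, &amp;quot;s3:PutBucketAcl&amp;quot;],
    &amp;quot;Resource&amp;quot;: [&amp;quot;arn:aws:s3:::bucketacl&amp;quot;, &amp;quot;arn:aws:s3:::bucketacl/*&amp;quot;]
  }]
}
EOF

$ aws --profile account-root --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-policy &#92;
  --bucket bucketacl --policy file://deny-acl.json
&lt;/code&gt;&lt;/pre&gt;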
&lt;hr&gt;
&lt;h2 id=&quot;why-acls-failed%3F&quot;&gt;Why ACLs Failed? &lt;a class=&quot;link-anchor&quot; href=&quot;#why-acls-failed%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Access Control Lists (ACLs) were S3&#39;s original permission system. They failed
for several critical reasons that made them fundamentally unsafe for production use.&lt;/p&gt;
&lt;h3 id=&quot;public-access-disasters&quot;&gt;Public Access Disasters &lt;a class=&quot;link-anchor&quot; href=&quot;#public-access-disasters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The most dangerous ACL failure was a silent public exposure. A single misconfigured
ACL could grant the entire internet access to your data, and your security team would
never know because ACLs weren&#39;t visible in centralized IAM policies.&lt;/p&gt;
&lt;p&gt;How it happened:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ export RGW_ENDPOINT=&amp;quot;https://rgw.example.com&amp;quot;

# Developer accidentally makes object public during testing
$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-object-acl &#92;
  --bucket bucketacl &#92;
  --key hosts &#92;
  --grant-read uri=http://acs.amazonaws.com/groups/global/AllUsers
$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket bucketacl &#92; 
  --key hosts
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;Type&amp;quot;: &amp;quot;Group&amp;quot;,
                &amp;quot;URI&amp;quot;: &amp;quot;http://acs.amazonaws.com/groups/global/AllUsers&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;READ&amp;quot;
        }
    ]
}

# Security team checks IAM policies - looks fine (against the same RGW endpoint)

$ aws --profile account-root --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; iam get-user-policy &#92;
  --user-name developer &#92;
  --policy-name S3Access

# ✓ Least privilege, no issues detected

# Meanwhile, the object is public to anyone who can reach the RGW endpoint:

$ curl &amp;quot;$RGW_ENDPOINT/bucketacl/hosts&amp;quot; 
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.2XX.0.X   ceph01 

# Full access, no authentication required
# The same risk exists at bucket scope; a public bucket ACL enables unauthenticated listing
# which can leak keys and metadata

$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
--bucket bucketacl --acl public-read

# Unauthenticated Access to list bucket contents

$ curl -s &amp;quot;$RGW_ENDPOINT/bucketacl&amp;quot; | xmllint --format -
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;
&amp;lt;ListBucketResult xmlns=&amp;quot;http://s3.amazonaws.com/doc/2006-03-01/&amp;quot;&amp;gt;
  &amp;lt;Name&amp;gt;bucketacl&amp;lt;/Name&amp;gt;
 ...
  &amp;lt;Contents&amp;gt;
    &amp;lt;Key&amp;gt;hosts&amp;lt;/Key&amp;gt;
    &amp;lt;LastModified&amp;gt;2025-12-31T08:58:21.346Z&amp;lt;/LastModified&amp;gt;
    &amp;lt;ETag&amp;gt;&amp;quot;71ae31ad9b6e7fda9cb5a8628b2e152a&amp;quot;&amp;lt;/ETag&amp;gt;
    &amp;lt;Size&amp;gt;415&amp;lt;/Size&amp;gt;
    &amp;lt;StorageClass&amp;gt;STANDARD&amp;lt;/StorageClass&amp;gt;
    &amp;lt;Owner&amp;gt;
      &amp;lt;ID&amp;gt;developer&amp;lt;/ID&amp;gt;
      &amp;lt;DisplayName&amp;gt;developer&amp;lt;/DisplayName&amp;gt;
 ...
&amp;lt;/ListBucketResult&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Why was it catastrophic&lt;/strong&gt;?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Decentralized control&lt;/strong&gt;: ACLs could be set per-bucket and per-object, creating millions of potential exposure points&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No visibility&lt;/strong&gt;: ACLs didn&#39;t appear in the IAM console - security teams had no way to audit them centrally&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silent bypasses&lt;/strong&gt;: Even perfect IAM policies couldn&#39;t prevent an ACL from granting public access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object-level chaos&lt;/strong&gt;: With millions of objects, each having its own ACL, comprehensive auditing was impossible&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Real-world impact&lt;/strong&gt;: The three breaches in our introduction (Verizon, Facebook, Instagram) all involved publicly
accessible S3 data caused by permission misconfiguration (ACLs, bucket policies, or both), combined with weak central
visibility and auditing; exactly the problems that policy-based access control solves.&lt;/p&gt;
&lt;h3 id=&quot;the-object-ownership-problem&quot;&gt;The Object Ownership Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-object-ownership-problem&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Beyond public access, ACLs created an ownership nightmare. When external accounts uploaded objects to your bucket, they owned those objects, not you.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Contractor uploads data to your bucket
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp sensitive.pdf s3://company-bucket/contractor-data/ --profile contractor
upload: ./sensitive.pdf to s3://company-bucket/contractor-data/sensitive.pdf

# Who owns this object?
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket company-bucket &#92;
  --key contractor-data/sensitive.pdf &#92;
  --profile contractor
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;Contractor Account&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;contractor&amp;quot;  ← Contractor owns it, not you!
    }
}

# You (bucket owner) can&#39;t READ the object
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp &#92;
  s3://company-bucket/contractor-data/sensitive.pdf &#92;
  ./test.pdf --profile company-admin
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

# You can&#39;t even GET the ACL to see permissions
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket company-bucket --key contractor-data/sensitive.pdf &#92;
  --profile company-admin
fatal error: An error occurred (AccessDenied) when calling the GetObjectAcl operation: Access Denied


# You can&#39;t MODIFY the ACL
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-object-acl &#92;
  --bucket company-bucket --key contractor-data/sensitive.pdf &#92;
  --acl private --profile company-admin
fatal error: An error occurred (AccessDenied) when calling the PutObjectAcl operation: Access Denied
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For on-premises Ceph deployments, while there&#39;s no per-GB billing surprise,
the &lt;strong&gt;operational and compliance problems are identical&lt;/strong&gt;: you can&#39;t read,
audit, or manage data in your own infrastructure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In Ceph RGW, bucket owners CAN delete objects they don&#39;t own. However, they
still can&#39;t read, view ACLs, or manage those objects, creating operational
blind spots and compliance risks.&lt;/p&gt;
&lt;/blockquote&gt;
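&lt;p&gt;As an illustration of that asymmetry (a minimal sketch, reusing the &lt;code&gt;company-admin&lt;/code&gt; profile from the example above), the bucket owner can delete the contractor&#39;s object even though reading it is denied:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Bucket owner cannot read the object (shown above), but CAN delete it
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api delete-object &#92;
  --bucket company-bucket &#92;
  --key contractor-data/sensitive.pdf &#92;
  --profile company-admin

# Deleting blindly is the only control left; it is no substitute for being
# able to read, audit, or re-permission the data
&lt;/code&gt;&lt;/pre&gt;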
&lt;h3 id=&quot;the-authenticated-read-trap-(over-sharing-inside-the-cluster)&quot;&gt;The authenticated-read trap (over-sharing inside the cluster) &lt;a class=&quot;link-anchor&quot; href=&quot;#the-authenticated-read-trap-(over-sharing-inside-the-cluster)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ACLs include grants that &lt;em&gt;appear&lt;/em&gt; safer than &amp;quot;public&amp;quot; but remain dangerously broad.
In S3, &lt;code&gt;authenticated-read&lt;/code&gt; grants read access to the &lt;code&gt;AuthenticatedUsers&lt;/code&gt; group;
in Ceph RGW terms, that can translate to &amp;quot;any identity that can authenticate to
this RGW endpoint/cluster,&amp;quot; not &amp;quot;only my team.&amp;quot; On a shared on-premises
platform (multiple accounts, tenants, service accounts, CI users, integrations),
this can lead to accidental cross-team or cross-tenant data exposure.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Finance team uploads &amp;quot;internal&amp;quot; data with authenticated-read
# (thinking it&#39;s safer than public)
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp finance-report.pdf &#92;
  s3://company-bucket/finance-report.pdf &#92;
  --acl authenticated-read --profile finance-team
 upload: ./finance-report.pdf to s3://company-bucket/finance-report.pdf

# Check the ACL - looks reasonable?
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-object-acl &#92;
  --bucket company-bucket &#92;
  --key finance-report.pdf --profile finance-team
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;Finance Team&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;finance-team&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;Type&amp;quot;: &amp;quot;Group&amp;quot;,
                &amp;quot;URI&amp;quot;: &amp;quot;http://acs.amazonaws.com/groups/global/AuthenticatedUsers&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;READ&amp;quot;  ← ANY authenticated user on the cluster!
        }
    ]
}

# DevOps team (completely different department) can read it!
$ aws --profile devops --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp &#92;
  s3://company-bucket/finance-report.pdf ./leaked.pdf
download: s3://company-bucket/finance-report.pdf to ./leaked.pdf

# Contractor user (or any other authenticated user) can also access it
$ aws --profile contractor --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp &#92;
  s3://company-bucket/finance-report.pdf ./contractor-copy.pdf
download: s3://company-bucket/finance-report.pdf to ./contractor-copy.pdf

# Anonymous users are still blocked
$ aws s3 cp s3://company-bucket/finance-report.pdf ./anon.pdf &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;public-write-is-an-integrity-disaster%2C-not-just-a-leak&quot;&gt;Public write is an integrity disaster, not just a leak &lt;a class=&quot;link-anchor&quot; href=&quot;#public-write-is-an-integrity-disaster%2C-not-just-a-leak&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ACL errors are not solely about &amp;quot;read&amp;quot; exposure. With bucket ACLs, &lt;code&gt;public-read-write&lt;/code&gt;
(or broad write grants) can enable untrusted PUT requests to a bucket. That turns into
an integrity incident: poisoned datasets, overwritten &amp;quot;golden&amp;quot; artifacts, malware
hosting, or backup tampering. Even an on-prem, &amp;quot;internal-only&amp;quot; deployment does not save
you; it merely changes the attacker&#39;s vector.&lt;/p&gt;
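&lt;p&gt;As a hedged illustration (reusing the &lt;code&gt;bucketacl&lt;/code&gt; bucket and &lt;code&gt;$RGW_ENDPOINT&lt;/code&gt; from earlier; the output shown is the expected behavior, not a captured session), a &lt;code&gt;public-read-write&lt;/code&gt; bucket ACL lets a completely unauthenticated client overwrite your data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Bucket owner (mis)configures a public-read-write ACL
$ aws --profile developer --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
  --bucket bucketacl --acl public-read-write

# Anyone who can reach the endpoint can now overwrite the existing object
$ aws s3 cp ./tampered-hosts s3://bucketacl/hosts &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
upload: ./tampered-hosts to s3://bucketacl/hosts
&lt;/code&gt;&lt;/pre&gt;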
&lt;h3 id=&quot;write_acp-is-the-%22permission-to-rewrite-permissions.%22&quot;&gt;WRITE_ACP is the &amp;quot;permission to rewrite permissions.&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#write_acp-is-the-%22permission-to-rewrite-permissions.%22&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ACLs don’t just control data-plane actions; they can delegate control-plane authority
over the ACL itself. In Ceph RGW S3 semantics, &lt;code&gt;WRITE_ACP&lt;/code&gt; is the permission that allows
changing a bucket&#39;s ACL (&lt;code&gt;PUT Bucket ACL&lt;/code&gt; requires &lt;code&gt;WRITE_ACP&lt;/code&gt;). If the wrong
principal has it, they can escalate later by granting broader access (including
public exposure), and this delegation is distributed across buckets and objects.
This is a governance anti-pattern because the system contains a hidden &amp;quot;permission to change permissions.&amp;quot;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Step 1: Bucket owner grants contractor WRITE + WRITE_ACP
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
  --bucket company-bucket &#92;
  --grant-write id=contractor &#92;
  --grant-write-acp id=contractor &#92;
  --profile developer

# Verify the ACL
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-bucket-acl &#92;
  --bucket company-bucket --profile developer
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;Contractor Account&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;contractor&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;WRITE&amp;quot;
        },
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;Contractor Account&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;contractor&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;WRITE_ACP&amp;quot;  ← Contractor can modify ACLs!
        },
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;FULL_CONTROL&amp;quot;
        }
    ]
}

# Step 2: Contractor abuses WRITE_ACP to make bucket PUBLIC
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-acl &#92;
  --bucket company-bucket &#92;
  --acl public-read --profile contractor
# Success! Contractor just made the bucket public

# Step 3: Verify the escalation
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api get-bucket-acl &#92;
  --bucket company-bucket --profile developer
{
    &amp;quot;Owner&amp;quot;: {
        &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
        &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;
    },
    &amp;quot;Grants&amp;quot;: [
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;Type&amp;quot;: &amp;quot;Group&amp;quot;,
                &amp;quot;URI&amp;quot;: &amp;quot;http://acs.amazonaws.com/groups/global/AllUsers&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;READ&amp;quot;  ← NOW PUBLIC! Anyone can list contents
        },
        {
            &amp;quot;Grantee&amp;quot;: {
                &amp;quot;DisplayName&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;ID&amp;quot;: &amp;quot;developer&amp;quot;,
                &amp;quot;Type&amp;quot;: &amp;quot;CanonicalUser&amp;quot;
            },
            &amp;quot;Permission&amp;quot;: &amp;quot;FULL_CONTROL&amp;quot;
        }
    ]
}

# Step 4: Anonymous users can now list the bucket
$ aws s3 ls s3://company-bucket/ &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
2025-12-31 05:00:00         27 finance-report.pdf
# Public exposure complete
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;the-solution%3A-stop-using-acls-immediately&quot;&gt;The Solution: Stop using ACLs immediately &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-stop-using-acls-immediately&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;AWS and the Ceph Object Gateway (RGW) provide controls to disable ACLs
entirely. This should be your first action on any production bucket.&lt;/p&gt;
&lt;h3 id=&quot;step-1%3A-block-public-access&quot;&gt;Step 1: Block Public Access &lt;a class=&quot;link-anchor&quot; href=&quot;#step-1%3A-block-public-access&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Enforce public access blocks to prevent bucket ACLs from granting public access.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ceph AWS CLI Configuration Note&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;All &lt;code&gt;aws&lt;/code&gt; CLI commands in this guide assume your AWS CLI profile is configured: See the &lt;a href=&quot;https://docs.ceph.com/en/latest/radosgw/s3/commons/#aws-cli-setup&quot;&gt;Ceph documentation on AWS CLI configuration&lt;/a&gt; and &lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html&quot;&gt;AWS CLI endpoint configuration&lt;/a&gt; for details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Bucket-level (Granularity per individual bucket):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Anon access is enabled on bucket from previous example

$ aws s3 ls s3://company-bucket/ &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
                           PRE contractor-data/
2025-12-31 07:13:55         26 finance-report.pdf

# We use public-access-block on our bucket

$ aws s3api put-public-access-block &#92;
  --bucket company-bucket &#92;
  --public-access-block-configuration &#92;
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  --profile developer

# Public access has been removed from the bucket,
# a non-authorized request fails after the put-public-access-block

$ aws s3 ls s3://company-bucket/ --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; --no-sign-request
fatal error: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
# Some AWS CLI versions surface certain error responses
# poorly; if you see a Python exception, re-run with
# --debug to confirm the underlying HTTP 403/AccessDenied.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What each setting does:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicAcls&lt;/strong&gt;: Prevents new public ACLs from being applied (redundant if BucketOwnerEnforced, but adds defense in depth)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IgnorePublicAcls&lt;/strong&gt;: Ignores existing public ACLs (treats them as private)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicPolicy&lt;/strong&gt;: Prevents bucket policies that grant public access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RestrictPublicBuckets&lt;/strong&gt;: Blocks public access to buckets even if policies exist&lt;/li&gt;
&lt;/ul&gt;
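&lt;p&gt;You can read the configuration back to confirm it stuck (a quick check against the same bucket and profile used above; the output shape follows the standard S3 API):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3api get-public-access-block &#92;
  --bucket company-bucket &#92;
  --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  --profile developer
{
    &amp;quot;PublicAccessBlockConfiguration&amp;quot;: {
        &amp;quot;BlockPublicAcls&amp;quot;: true,
        &amp;quot;IgnorePublicAcls&amp;quot;: true,
        &amp;quot;BlockPublicPolicy&amp;quot;: true,
        &amp;quot;RestrictPublicBuckets&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;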
&lt;h3 id=&quot;step-2%3A-deny-acl-operations-via-iam-policy&quot;&gt;Step 2: Deny ACL Operations via IAM Policy &lt;a class=&quot;link-anchor&quot; href=&quot;#step-2%3A-deny-acl-operations-via-iam-policy&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As the root account administrator, you should establish a security baseline that
prevents ACL usage &lt;strong&gt;by default&lt;/strong&gt; for all users and groups. This way, even if a
developer tries to use ACLs in the future, they&#39;ll get an immediate &lt;code&gt;Access Denied&lt;/code&gt;
error, preventing accidents before they happen.&lt;/p&gt;
&lt;p&gt;The governance pattern creates a standard &amp;quot;DenyACLs&amp;quot; policy that you attach to every
new user or group you create. This establishes ACL blocking as your organization&#39;s security baseline.&lt;/p&gt;
&lt;p&gt;Create the standard policy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt; deny-acl-operations.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [
    {
      &amp;quot;Sid&amp;quot;: &amp;quot;DenyACLOperations&amp;quot;,
      &amp;quot;Effect&amp;quot;: &amp;quot;Deny&amp;quot;,
      &amp;quot;Action&amp;quot;: [
        &amp;quot;s3:PutObjectAcl&amp;quot;,
        &amp;quot;s3:PutObjectVersionAcl&amp;quot;,
        &amp;quot;s3:PutBucketAcl&amp;quot;
      ],
      &amp;quot;Resource&amp;quot;: [
        &amp;quot;arn:aws:s3:::*&amp;quot;,
        &amp;quot;arn:aws:s3:::*/*&amp;quot;
      ]
    }
  ]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is an example of how to apply the policy to new users as you create them:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a new developer
$ aws iam create-user --user-name alice
{
    &amp;quot;User&amp;quot;: {
        &amp;quot;Path&amp;quot;: &amp;quot;/&amp;quot;,
        &amp;quot;UserName&amp;quot;: &amp;quot;alice&amp;quot;,
        &amp;quot;UserId&amp;quot;: &amp;quot;4abb3a59-7991-4644-8863-347b02adc48f&amp;quot;,
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153XXX:user/alice&amp;quot;,
        &amp;quot;CreateDate&amp;quot;: &amp;quot;2025-01-03T15:44:06.920034Z&amp;quot;
    }
}
$ aws iam create-access-key --user-name alice

# Immediately apply the ACL deny policy (before giving any other permissions)
$ aws iam put-user-policy &#92;
  --user-name alice &#92;
  --policy-name DenyACLs &#92;
  --policy-document file://deny-acl-operations.json

# Now grant the user their actual S3 permissions
$ aws iam attach-user-policy --user-name alice --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If Alice later tries to configure ACLs on any bucket, she will get &lt;code&gt;Access Denied&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a bucket as Alice, upload an Object and try to apply a public ACL on the Object
$ aws --profile alice --endpoint-url=&amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3 mb s3://alicebucket
$ aws --profile alice --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3 cp finance-report.pdf s3://alicebucket
$ aws --profile alice --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3api put-object-acl --bucket alicebucket --key &#92;
  finance-report.pdf --acl public-read
#  Error: Access Denied
fatal error: An error occurred (AccessDenied) when calling the PutObjectAcl operation: Access Denied
# Some AWS CLI versions surface certain error responses poorly; if you see a Python exception, re-run with --debug to confirm the underlying HTTP 403/AccessDenied.
&lt;/code&gt;&lt;/pre&gt;
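&lt;p&gt;Rather than attaching the policy user by user, the same baseline can be applied at group level (a sketch, assuming your RGW release supports IAM group policies; the &lt;code&gt;all-developers&lt;/code&gt; group name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a group that every regular user joins by default
$ aws iam create-group --group-name all-developers

# Attach the DenyACLs baseline to the group once
$ aws iam put-group-policy &#92;
  --group-name all-developers &#92;
  --policy-name DenyACLs &#92;
  --policy-document file://deny-acl-operations.json

# New users inherit the baseline simply by being added to the group
$ aws iam add-user-to-group --user-name alice --group-name all-developers
&lt;/code&gt;&lt;/pre&gt;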
&lt;h2 id=&quot;%22wait%2C-how-do-i-share-data-now%3F%22&quot;&gt;&amp;quot;Wait, How Do I Share Data Now?&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#%22wait%2C-how-do-i-share-data-now%3F%22&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With ACLs disabled, you might be wondering: How do I grant cross-account access
to share my datasets?&lt;/p&gt;
&lt;p&gt;Previously, you might have used ACLs to grant a contractor account read access
to specific objects or allowed a partner account to upload files. With ACLs gone,
how do you securely share data between accounts?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Two modern approaches exist&lt;/strong&gt;:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bucket policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource owner adds bucket policy; requesting account adds identity policy&lt;/td&gt;
&lt;td&gt;Direct, always-on access&lt;/td&gt;
&lt;td&gt;Static, permanent sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IAM Role assumption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource owner creates an assumable role; requesting account assumes it&lt;/td&gt;
&lt;td&gt;Temporary session (1-12h)&lt;/td&gt;
&lt;td&gt;Dynamic, auditable access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
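&lt;p&gt;For completeness, the bucket-policy approach from the table looks roughly like this (a hedged sketch; the bucket name, account ID, and user below are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Resource owner attaches a bucket policy naming a principal from another account
$ cat &amp;gt; share-bucket-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {&amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW00000000000000000:user/partner-analyst&amp;quot;},
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:GetObject&amp;quot;, &amp;quot;s3:ListBucket&amp;quot;],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::shared-datasets&amp;quot;,
      &amp;quot;arn:aws:s3:::shared-datasets/*&amp;quot;
    ]
  }]
}
EOF

$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3api put-bucket-policy &#92;
  --bucket shared-datasets &#92;
  --policy file://share-bucket-policy.json
# The requesting account must also allow these actions in its own identity policy
&lt;/code&gt;&lt;/pre&gt;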
&lt;p&gt;&lt;strong&gt;We&#39;ll focus on IAM role assumption&lt;/strong&gt; because it provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Temporary credentials&lt;/strong&gt; that auto-expire (vs. permanent keys)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Detailed audit trails&lt;/strong&gt; showing who assumed what role and when (vs. static access logs)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Instant revocation&lt;/strong&gt; by deleting the role (vs. updating multiple policies)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Least privilege&lt;/strong&gt; with time-bound access (vs. always-on permissions)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is also AWS&#39;s recommended pattern and follows zero-trust principles. Let&#39;s see how.&lt;/p&gt;
&lt;h2 id=&quot;iam-accounts%3A-the-modern-solution&quot;&gt;IAM Accounts: The Modern Solution &lt;a class=&quot;link-anchor&quot; href=&quot;#iam-accounts%3A-the-modern-solution&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) implements AWS-compatible IAM Accounts, introduced in
Squid/19.2.0. This provides proper multi-tenancy with policy-based access control instead of ACLs.&lt;/p&gt;
&lt;h3 id=&quot;what-is-an-iam-account%3F&quot;&gt;What is an IAM Account? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-is-an-iam-account%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;An &lt;strong&gt;IAM Account&lt;/strong&gt; provides isolation for identities and access control:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ini&quot;&gt;Account: finance-team (ID: RGW12345678901234567)
├── Users &amp;amp; Groups (isolated per account)
├── Roles (isolated per account)  
├── Policies (fine-grained permissions)
└── S3 Buckets (owned by account)
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;S3 bucket &lt;strong&gt;names&lt;/strong&gt; are globally unique across ALL accounts in a flat namespace
(just like AWS S3). If Finance creates a bucket called &lt;code&gt;financial-reports&lt;/code&gt;, no
other account can use that name. However, bucket ownership and access control
are account-specific: only Finance can manage their &lt;code&gt;financial-reports&lt;/code&gt; bucket.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Ceph accounts can optionally belong to a tenant for namespace isolation. Within
a tenant, bucket names are unique to that tenant; they are not globally unique across all tenants.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Key distinction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Account Root User&lt;/strong&gt;: Emergency admin access only, created with &lt;code&gt;--account-root&lt;/code&gt; flag (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM Users&lt;/strong&gt;: Day-to-day access, follows the least privilege principle&lt;/li&gt;
&lt;/ul&gt;
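&lt;p&gt;For context, creating an account and its account root user typically looks like the following sketch (names are illustrative; the full procedure is in the post linked below):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create the account itself
$ radosgw-admin account create --account-name=finance-team --account-id=RGW00893359550361292

# Create the account root user (emergency admin access only)
$ radosgw-admin user create --uid=finance-root --display-name=&amp;quot;Finance Root&amp;quot; &#92;
  --account-id=RGW00893359550361292 --account-root --gen-access-key --gen-secret
&lt;/code&gt;&lt;/pre&gt;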
&lt;p&gt;For this post, we&#39;ll assume you have two accounts already set up:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Finance Account&lt;/strong&gt; (ID: &lt;code&gt;RGW00893359550361292&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DevOps Account&lt;/strong&gt; (ID: &lt;code&gt;RGW89761398048153888&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;For a complete guide on creating IAM Accounts, users, and basic configuration,
see our previous post: &lt;a href=&quot;https://ceph.io/en/news/blog/2025/enhancing-ceph-multitenancy-with-iam-accounts/&quot;&gt;Enhancing Ceph Multitenancy with IAM Accounts&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&quot;cross-account-sharing%3A-the-modern-way&quot;&gt;Cross-Account Sharing: The Modern Way &lt;a class=&quot;link-anchor&quot; href=&quot;#cross-account-sharing%3A-the-modern-way&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Finance needs to give DevOps read-only access to backup data for
disaster recovery testing. Previously, this might have been done with ACLs. Now, we use cross-account role assumption.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DevOps can read backups, but cannot modify or delete them&lt;/li&gt;
&lt;li&gt;Access uses temporary credentials (not long-term keys)&lt;/li&gt;
&lt;li&gt;Finance can revoke access instantly&lt;/li&gt;
&lt;li&gt;Fully auditable (who accessed what, when)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-it-works&quot;&gt;How It Works &lt;a class=&quot;link-anchor&quot; href=&quot;#how-it-works&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The key insight: Create a role in the Finance account (same account as the bucket).
When DevOps assumes this role, they temporarily &amp;quot;become&amp;quot; a Finance account principal with Finance credentials.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765907670163/8b060b2d-6f10-4761-82b7-75717606b121.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is the same STS pattern we covered in our &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/&quot;&gt;previous post on temporary credentials&lt;/a&gt;,
but now applied to cross-account access.&lt;/p&gt;
&lt;h3 id=&quot;implementation&quot;&gt;Implementation &lt;a class=&quot;link-anchor&quot; href=&quot;#implementation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;1.-finance-creates-a-cross-account-role-for-the-devops-team&quot;&gt;1. Finance Creates a Cross-Account Role for the Devops Team &lt;a class=&quot;link-anchor&quot; href=&quot;#1.-finance-creates-a-cross-account-role-for-the-devops-team&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Finance creates a role called &lt;code&gt;devops-backup-reader&lt;/code&gt; in their account with two policies:&lt;/p&gt;
&lt;p&gt;The Trust Policy (who can assume this role):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt; trust-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {
      &amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:user/dave-backup-ops&amp;quot;
    },
    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This says: &lt;em&gt;&lt;strong&gt;&amp;quot;The DevOps account user &#39;dave-backup-ops&#39; can assume this role.&amp;quot;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the trust policy you can also use the &lt;code&gt;RGWXXXX:root&lt;/code&gt; format for the Principal.
That allows any user in the DevOps account to assume the role; the DevOps account can
then use its own IAM policies to restrict assumption of the Finance
&lt;code&gt;devops-backup-reader&lt;/code&gt; role to a specific IAM group.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And the Permission Policy (what the role can do):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt; role-permissions.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:GetObject&amp;quot;, &amp;quot;s3:ListBucket&amp;quot;],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::finance-backups&amp;quot;,
      &amp;quot;arn:aws:s3:::finance-backups/*&amp;quot;
    ]
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This says: &lt;strong&gt;&amp;quot;&lt;em&gt;This role can list &amp;amp; read the finance-backups bucket.&lt;/em&gt;&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Once we have the policy files created, we can go ahead and create the IAM role &lt;code&gt;devops-backup-reader&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --profile finance-admin s3 mb s3://finance-backups 
$ aws iam create-role &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader &#92;
  --assume-role-policy-document file://trust-policy.json

$ aws iam put-role-policy &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader &#92;
  --policy-name ReadBackups &#92;
  --policy-document file://role-permissions.json
&lt;/code&gt;&lt;/pre&gt;
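&lt;p&gt;Finance can verify the role and its attached policy before handing the ARN to DevOps (a simple read-back; output trimmed):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam get-role &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader
{
    &amp;quot;Role&amp;quot;: {
        &amp;quot;RoleName&amp;quot;: &amp;quot;devops-backup-reader&amp;quot;,
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:iam::RGW00893359550361292:role/devops-backup-reader&amp;quot;,
        ...
    }
}

$ aws iam list-role-policies --profile finance-admin --role-name devops-backup-reader
{
    &amp;quot;PolicyNames&amp;quot;: [
        &amp;quot;ReadBackups&amp;quot;
    ]
}
&lt;/code&gt;&lt;/pre&gt;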
&lt;h4 id=&quot;2.-devops-user-accesses-the-finance-account-dataset&quot;&gt;2. DevOps User Accesses the Finance Account Dataset &lt;a class=&quot;link-anchor&quot; href=&quot;#2.-devops-user-accesses-the-finance-account-dataset&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Dave from the DevOps team assumes the role and gets temporary Finance credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Assume Finance role
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; sts assume-role &#92;
  --profile dave-backup-ops &#92;
  --role-arn &amp;quot;arn:aws:iam::RGW00893359550361292:role/devops-backup-reader&amp;quot; &#92;
  --role-session-name david-devops-backup-finance &#92;
  --region default

{
    &amp;quot;Credentials&amp;quot;: {
        &amp;quot;AccessKeyId&amp;quot;: &amp;quot;ASIA****************&amp;quot;,
        &amp;quot;SecretAccessKey&amp;quot;: &amp;quot;REDACTED&amp;quot;,
        &amp;quot;SessionToken&amp;quot;: &amp;quot;REDACTED&amp;quot;,
        &amp;quot;Expiration&amp;quot;: &amp;quot;2025-0X-15TXX:00:00Z&amp;quot;
    }
}

# Use temporary credentials
$ export AWS_ACCESS_KEY_ID=ASIA****************
$ export AWS_SECRET_ACCESS_KEY=REDACTED
$ export AWS_SESSION_TOKEN=REDACTED

# Access Finance backups (using Finance account credentials!)
$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 ls s3://finance-backups/
2025-01-14 02:00:00  daily-backup-2025-01-14.tar.gz

$ aws --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; s3 cp s3://finance-backups/daily-backup-2025-01-14.tar.gz .
download: s3://finance-backups/daily-backup-2025-01-14.tar.gz to ./daily-backup-2025-01-14.tar.gz
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;why-this-works-(and-why-no-bucket-policy-is-needed)&quot;&gt;Why This Works (And Why No Bucket Policy Is Needed) &lt;a class=&quot;link-anchor&quot; href=&quot;#why-this-works-(and-why-no-bucket-policy-is-needed)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The role &lt;code&gt;devops-backup-reader&lt;/code&gt; is in the Finance account (same account as the
bucket). When Dave assumes this role, he receives temporary Finance account
credentials. From the bucket&#39;s perspective, this is same-account access:
only the role&#39;s policy is required; no bucket policy is needed.&lt;/p&gt;
&lt;p&gt;The only cross-account step is the AssumeRole call itself; the actual bucket
access is same-account, because the role and the bucket both live in Finance.&lt;/p&gt;
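&lt;p&gt;If Dave does this regularly, the AWS CLI can assume the role for him and refresh the temporary credentials automatically via a profile such as the following (a sketch; the profile name is illustrative, and the implicit STS call must also be routed to RGW, which depends on your CLI version and endpoint configuration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat &amp;gt;&amp;gt; ~/.aws/config &amp;lt;&amp;lt;&#39;EOF&#39;
[profile finance-backups-ro]
role_arn = arn:aws:iam::RGW00893359550361292:role/devops-backup-reader
source_profile = dave-backup-ops
role_session_name = david-devops-backup-finance
EOF

# Point the implicit STS call at RGW (supported by recent AWS CLI releases)
$ export AWS_ENDPOINT_URL_STS=&amp;quot;$RGW_ENDPOINT&amp;quot;

# The CLI now calls AssumeRole transparently and refreshes credentials as they expire
$ aws --profile finance-backups-ro --endpoint-url &amp;quot;$RGW_ENDPOINT&amp;quot; &#92;
  s3 ls s3://finance-backups/
&lt;/code&gt;&lt;/pre&gt;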
&lt;h3 id=&quot;security-benefits-of-this-approach&quot;&gt;Security Benefits of This Approach &lt;a class=&quot;link-anchor&quot; href=&quot;#security-benefits-of-this-approach&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Temporary credentials&lt;/strong&gt;: Expire after 1 hour (configurable up to 12 hours)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No shared secrets&lt;/strong&gt;: DevOps never sees Finance&#39;s long-term keys&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instant revocation&lt;/strong&gt;: Finance deletes the role → all access stops immediately (see the example after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;: Logs show role name, session name, and requesting account&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Least privilege&lt;/strong&gt;: Role has only read permissions, nothing more&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better than ACLs&lt;/strong&gt;: Centralized control, no object-level chaos&lt;/li&gt;
&lt;/ul&gt;
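&lt;p&gt;Revocation is a two-command operation on the Finance side (inline policies must be removed before the role itself can be deleted):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Finance revokes DevOps access by removing the role
$ aws iam delete-role-policy &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader &#92;
  --policy-name ReadBackups

$ aws iam delete-role &#92;
  --profile finance-admin &#92;
  --role-name devops-backup-reader

# New AssumeRole calls fail immediately; in-flight sessions lose their permissions
&lt;/code&gt;&lt;/pre&gt;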
&lt;h3 id=&quot;what-the-audit-logs-show&quot;&gt;What the Audit Logs Show &lt;a class=&quot;link-anchor&quot; href=&quot;#what-the-audit-logs-show&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) audit logs capture the
complete cross-account access pattern. Here&#39;s what you will see:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Ensure RGW audit logging is enabled. See the &lt;a href=&quot;https://docs.ceph.com/en/latest/radosgw/config-ref/#bucket-and-object-audit-logging&quot;&gt;Ceph documentation on bucket and object audit logging&lt;/a&gt; (OPS logs) for configuration details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Example audit log extract when DevOps assumes the Finance role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;$ tail -f /var/log/ceph/ops-log-ceph-client.rgw.default.ceph02.fvqogr.log | jq .
{
...
  &amp;quot;time&amp;quot;: &amp;quot;2025-01-04T17:34:07.711570Z&amp;quot;,
  &amp;quot;time_local&amp;quot;: &amp;quot;2025-01-04T17:34:07.711570+0000&amp;quot;,
  &amp;quot;remote_addr&amp;quot;: &amp;quot;10.251.0.21&amp;quot;,
  &amp;quot;user&amp;quot;: &amp;quot;98b5e284-bd74-4a54-922e-cf1ee1d460c2&amp;quot;,
  &amp;quot;operation&amp;quot;: &amp;quot;assume_role&amp;quot;,
  &amp;quot;uri&amp;quot;: &amp;quot;POST / HTTP/1.1&amp;quot;,
  &amp;quot;http_status&amp;quot;: &amp;quot;200&amp;quot;,
  &amp;quot;bytes_sent&amp;quot;: 999,
  &amp;quot;user_agent&amp;quot;: &amp;quot;aws-cli/1.38.34 md/Botocore#1.37.34 ua/2.1 os/linux#5.14.0-496.el9.x86_64 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython m/N cfg/retry-mode#legacy botocore/1.37.34&amp;quot;,
  &amp;quot;referrer&amp;quot;: &amp;quot;&amp;quot;,
  &amp;quot;trans_id&amp;quot;: &amp;quot;tx000001bb92497c13eba06-00695aa48f-494246-default&amp;quot;,
  &amp;quot;access_key_id&amp;quot;: &amp;quot;MPUWRVKZFH9XXXXXXX&amp;quot;,
  &amp;quot;temp_url&amp;quot;: false
}

# We can then get any specific details on this user
$ radosgw-admin user info --access-key=MPUWRVKZFH9XXXXXXX
{
    &amp;quot;user_id&amp;quot;: &amp;quot;98b5e284-bd74-4a54-922e-cf1ee1d460c2&amp;quot;,
    &amp;quot;display_name&amp;quot;: &amp;quot;dave-backup-ops&amp;quot;,
    &amp;quot;email&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;suspended&amp;quot;: 0,
    &amp;quot;max_buckets&amp;quot;: 1000,
    ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example audit log extract when Dave from the DevOps Account accesses the Finance bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;bucket&amp;quot;: &amp;quot;finance-backups&amp;quot;,
  &amp;quot;object&amp;quot;: &amp;quot;daily-backup-2025-01-14.tar.gz&amp;quot;,
  &amp;quot;time&amp;quot;: &amp;quot;2026-01-04T17:42:35.956711Z&amp;quot;,
  &amp;quot;time_local&amp;quot;: &amp;quot;2026-01-04T17:42:35.956711+0000&amp;quot;,
  &amp;quot;remote_addr&amp;quot;: &amp;quot;10.251.0.21&amp;quot;,
  &amp;quot;object_owner&amp;quot;: &amp;quot;RGW00893359550361292&amp;quot;,
  &amp;quot;user&amp;quot;: &amp;quot;98b5e284-bd74-4a54-922e-cf1ee1d460c2&amp;quot;,
  &amp;quot;operation&amp;quot;: &amp;quot;get_obj&amp;quot;,
  &amp;quot;uri&amp;quot;: &amp;quot;GET /finance-backups/daily-backup-2025-01-14.tar.gz HTTP/1.1&amp;quot;,
  &amp;quot;http_status&amp;quot;: &amp;quot;200&amp;quot;,
  &amp;quot;bytes_sent&amp;quot;: 26,
  &amp;quot;bytes_received&amp;quot;: 0,
  &amp;quot;object_size&amp;quot;: 26,
  &amp;quot;total_time&amp;quot;: 3,
  &amp;quot;user_agent&amp;quot;: &amp;quot;aws-cli/1.38.34 md/Botocore#1.37.34 ua/2.1 os/linux#5.14.0-496.el9.x86_64 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython m/N cfg/retry-mode#legacy botocore/1.37.34&amp;quot;,
  &amp;quot;trans_id&amp;quot;: &amp;quot;tx00000a13eeac4ce551ce2-00695aa68b-494246-default&amp;quot;,
  &amp;quot;authentication_type&amp;quot;: &amp;quot;STS&amp;quot;,
  &amp;quot;sts_info&amp;quot;: {
    &amp;quot;role_name&amp;quot;: &amp;quot;$devops-backup-reader&amp;quot;,
    &amp;quot;role_session&amp;quot;: &amp;quot;david-devops-backup-finance&amp;quot;
  },
  &amp;quot;temp_url&amp;quot;: false
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What this tells you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Who&lt;/strong&gt;: Dave from DevOps (identified by role session name and the user uid)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When&lt;/strong&gt;: &lt;code&gt;2026-01-04T17:42:35.956711Z&lt;/code&gt; (exact UTC timestamp)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What&lt;/strong&gt;: Downloaded &lt;code&gt;daily-backup-2025-01-14.tar.gz&lt;/code&gt; from &lt;code&gt;finance-backups&lt;/code&gt; bucket&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How&lt;/strong&gt;: Via STS temporary credentials (&lt;code&gt;authentication_type: &amp;quot;STS&amp;quot;&lt;/code&gt;)
&lt;ul&gt;
&lt;li&gt;Assumed role: &lt;code&gt;devops-backup-reader&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Session: &lt;code&gt;david-devops-backup-finance&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;From where&lt;/strong&gt;: IP address &lt;code&gt;10.251.0.21&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bucket owner&lt;/strong&gt;: Finance account &lt;code&gt;RGW00893359550361292&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Status&lt;/strong&gt;: Success (&lt;code&gt;http_status: 200&lt;/code&gt;, 26 bytes transferred)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key security insights from this log:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Authentication type is explicitly marked as &amp;quot;STS&amp;quot;&lt;/strong&gt; - You can easily filter all temporary credential access (see the filter example after this list)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User who assumed the role is identified&lt;/strong&gt; - (&lt;code&gt;98b5e284-bd74-4a54-922e-cf1ee1d460c2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Role name is captured&lt;/strong&gt; - You know which role was used (&lt;code&gt;devops-backup-reader&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session name is captured&lt;/strong&gt; - You can trace back to who initiated the session (Dave via &lt;code&gt;david-devops-backup-finance&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object owner is logged&lt;/strong&gt; - Confirms the bucket belongs to the Finance account, not the accessor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full HTTP details&lt;/strong&gt; - User agent shows it was AWS CLI, complete with version&lt;/li&gt;
&lt;/ol&gt;
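&lt;p&gt;Because &lt;code&gt;authentication_type&lt;/code&gt; is a structured field, filtering the ops log for every request made with temporary credentials is a one-liner (a small example using the same log file and &lt;code&gt;jq&lt;/code&gt; as above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Show who did what with STS credentials, and under which role/session
$ jq -c &#39;select(.authentication_type == &amp;quot;STS&amp;quot;)
    | {time, user, operation, bucket, object,
       role: .sts_info.role_name, session: .sts_info.role_session}&#39; &#92;
  /var/log/ceph/ops-log-ceph-client.rgw.default.ceph02.fvqogr.log
&lt;/code&gt;&lt;/pre&gt;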
&lt;blockquote&gt;
&lt;p&gt;Compared to ACLs: With ACLs, you had no audit trail showing who from which
account accessed what. The logs only showed &amp;quot;someone accessed the object&amp;quot;
with no attribution to the originating account or session context.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Comparison of IAM Roles Versus ACLs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ACLs: Decentralized, object-level, permanent, no audit trail of cross-account access&lt;/li&gt;
&lt;li&gt;IAM Roles: Centralized, temporary, revocable, full audit trail with account attribution&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;understanding-policy-evaluation&quot;&gt;Understanding Policy Evaluation &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-policy-evaluation&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To use IAM effectively, you need to understand how permissions are evaluated.&lt;/p&gt;
&lt;h3 id=&quot;the-basic-rule&quot;&gt;The Basic Rule &lt;a class=&quot;link-anchor&quot; href=&quot;#the-basic-rule&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When a user requests access to an S3 resource, the request is evaluated with the
workflow below, keeping in mind that any &lt;code&gt;DENY&lt;/code&gt; always wins over &lt;code&gt;ALLOW&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765908590713/83e2782a-fffe-4b8b-a03c-8017778ba232.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Explicit &lt;code&gt;DENY&lt;/code&gt; always wins, even if there are multiple &lt;code&gt;ALLOW&lt;/code&gt; statements.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;same-account-vs-cross-account&quot;&gt;Same-Account vs Cross-Account &lt;a class=&quot;link-anchor&quot; href=&quot;#same-account-vs-cross-account&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Same-Account Access&lt;/strong&gt; (user and bucket in the same account):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Permission needed in either the bucket policy or the identity policy&lt;/li&gt;
&lt;li&gt;One &lt;code&gt;ALLOW&lt;/code&gt; is sufficient&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cross-Account Access&lt;/strong&gt; (using role assumption):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Permission needed for AssumeRole (on both sides - trust policy + identity policy)&lt;/li&gt;
&lt;li&gt;Role&#39;s identity policy grants bucket access (same-account from bucket&#39;s perspective)&lt;/li&gt;
&lt;li&gt;No bucket policy needed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-security-roadmap%3A-enterprise-s3-security-coming-to-ceph&quot;&gt;The Security Roadmap: Enterprise S3 Security Coming to Ceph &lt;a class=&quot;link-anchor&quot; href=&quot;#the-security-roadmap%3A-enterprise-s3-security-coming-to-ceph&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ceph community is making a significant investment in enterprise S3 security.
Several critical features are under active development to bring Ceph RGW to full
feature parity with AWS S3&#39;s modern security model. Here&#39;s what&#39;s coming and why it matters.&lt;/p&gt;
&lt;h3 id=&quot;bucketownerenforced%3A-disabling-acls-(coming-in-a-tentacle-update)&quot;&gt;BucketOwnerEnforced: Disabling ACLs (Coming in a Tentacle update) &lt;a class=&quot;link-anchor&quot; href=&quot;#bucketownerenforced%3A-disabling-acls-(coming-in-a-tentacle-update)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Status: Merged into Ceph v20.3.0 (Tentacle) (&lt;a href=&quot;https://tracker.ceph.com/issues/63323&quot;&gt;Issue #63323)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;What it does: The &lt;code&gt;PutBucketOwnershipControls&lt;/code&gt; API with &lt;code&gt;BucketOwnerEnforced&lt;/code&gt;
setting disables ACLs entirely and forces all objects to be owned by the bucket
owner regardless of who uploaded them.&lt;/p&gt;
&lt;p&gt;The problem it solves:&lt;/p&gt;
&lt;p&gt;Before (with ACLs):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Contractor uploads → contractor owns object → you, as the owner of the bucket, can&#39;t read or manage it&lt;/li&gt;
&lt;li&gt;Developer sets ACL to public → bucket exposed to the internet&lt;/li&gt;
&lt;li&gt;Objects disappear from inventory (owned by other accounts)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After (BucketOwnerEnforced):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Anyone uploads → you own the object → you control it completely&lt;/li&gt;
&lt;li&gt;ACLs are ignored → impossible to make the bucket public accidentally via ACLs&lt;/li&gt;
&lt;li&gt;All objects visible in your inventory reports&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;How it will work:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Enable BucketOwnerEnforced on a bucket
$ aws s3api put-bucket-ownership-controls &#92;
  --bucket company-data &#92;
  --ownership-controls &#39;Rules=[{ObjectOwnership=BucketOwnerEnforced}]&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Once enabled, any request that includes ACL headers (e.g., &lt;code&gt;--acl public-read&lt;/code&gt;) will
fail. Audit your applications before enabling this feature on their buckets: any
workflow that &lt;strong&gt;still relies on ACLs&lt;/strong&gt; will start failing as soon as it sends
an ACL header.&lt;/p&gt;
&lt;/blockquote&gt;
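&lt;p&gt;Once the feature lands, the corresponding read-back should follow the standard S3 API shape (illustrative; verify the exact behavior against your Ceph release):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Confirm ACLs are disabled for the bucket
$ aws s3api get-bucket-ownership-controls --bucket company-data
{
    &amp;quot;OwnershipControls&amp;quot;: {
        &amp;quot;Rules&amp;quot;: [
            {
                &amp;quot;ObjectOwnership&amp;quot;: &amp;quot;BucketOwnerEnforced&amp;quot;
            }
        ]
    }
}
&lt;/code&gt;&lt;/pre&gt;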
&lt;h3 id=&quot;s3control-api-block-public-access-(coming-soon)&quot;&gt;S3Control API Block Public Access (Coming Soon) &lt;a class=&quot;link-anchor&quot; href=&quot;#s3control-api-block-public-access-(coming-soon)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Status: Active development, &lt;a href=&quot;https://github.com/ceph/ceph/pull/64293&quot;&gt;PR #64293&lt;/a&gt; under review&lt;/p&gt;
&lt;p&gt;You&#39;ve disabled ACLs in your Finance account. You&#39;ve enabled Block Public Access.
Your security team is confident the Finance buckets are locked down. Then someone
in the Marketing account creates a new IAM user, spins up a bucket, and accidentally
makes it public during a website deployment test. Your Finance settings didn&#39;t apply
to Marketing&#39;s account because each account manages its own configuration independently.&lt;/p&gt;
&lt;p&gt;This is where account-level controls become critical. While individual buckets can
have their own Block Public Access settings, managing hundreds or thousands of
buckets individually is error-prone. The S3Control API allows you to set
account-level defaults that apply automatically to all buckets in that
account, both existing buckets and any new ones created in the future.&lt;/p&gt;
&lt;p&gt;Account-level enforcement prevents all public access:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Block all public access for entire account
$ aws s3control put-public-access-block &#92;
  --account-id RGW11111111111111111 &#92;
  --public-access-block-configuration &#92;
    BlockPublicAcls=true,&#92;
    IgnorePublicAcls=true,&#92;
    BlockPublicPolicy=true,&#92;
    RestrictPublicBuckets=true
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Once the account administrator sets this policy using S3Control, regular account
users cannot override it. If a user later tries to disable Block Public Access
on a specific bucket, make a bucket public via ACL, or add a public bucket policy,
all those attempts will fail with &amp;quot;Access Denied.&amp;quot; The account-level setting takes
precedence and cannot be bypassed by bucket-level operations. This creates a
secure-by-default environment in which enabling public access using ACLs at
the bucket level is impossible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What each setting will do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicAcls&lt;/strong&gt;: Prevents new public ACLs from being applied to buckets/objects&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IgnorePublicAcls&lt;/strong&gt;: Ignores existing public ACLs (treats them as private)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BlockPublicPolicy&lt;/strong&gt;: Prevents bucket policies that grant public access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RestrictPublicBuckets&lt;/strong&gt;: Blocks public access even if policies exist&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Account-level Block Public Access is enforced by the account administrator on
regular users within that account, but the account administrator themselves
can still modify or disable it. For enforcement from a &lt;strong&gt;higher authority&lt;/strong&gt;,
you need organization-level controls. See the next section on Organizational
Units and SCPs, which enable Ceph/RGW cluster administrators to enforce
immutable policies across all accounts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;organizational-units-and-service-control-policies-(future)&quot;&gt;Organizational Units and Service Control Policies (Future) &lt;a class=&quot;link-anchor&quot; href=&quot;#organizational-units-and-service-control-policies-(future)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Status&lt;/strong&gt;: Roadmap item for future Ceph releases&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What it will do&lt;/strong&gt;: Enable cluster administrators to enforce immutable security
policies across multiple accounts—policies that even account administrators cannot disable or modify.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves&lt;/strong&gt;: Account-level controls rely on administrator discipline.
A determined (or compromised) account administrator can disable Block Public Access
or re-enable ACLs. Organization-level controls provide actual enforcement from a
higher authority that cannot be bypassed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example use cases&lt;/strong&gt; (when available):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Immutable Block Public Access&lt;/strong&gt;: Cluster admin sets organization-wide &amp;quot;no public buckets&amp;quot;
policy: account admins cannot disable it&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Required encryption&lt;/strong&gt;: Force all objects to use encryption → accounts cannot opt out&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-account access policies&lt;/strong&gt;: Restrict which accounts can share data with external accounts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit requirements&lt;/strong&gt;: Enforce logging and monitoring so that individual accounts cannot disable them&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This will provide enterprise multi-tenant governance that scales to thousands
of accounts with immutable top-down policy enforcement.&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-ceph&#39;s-enterprise-security-transformation&quot;&gt;Conclusion: Ceph&#39;s Enterprise Security Transformation &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-ceph&#39;s-enterprise-security-transformation&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1767550127997/3f2532cb-2bac-45d2-b99c-b69ff3f7fec6.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The migration from ACLs to IAM represents a fundamental shift in S3 security
philosophy: from decentralized, object-level chaos to centralized, policy-based control.&lt;/p&gt;
&lt;p&gt;Available today in Ceph Squid and later:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;IAM Accounts: Multi-tenant isolation with proper account boundaries&lt;/li&gt;
&lt;li&gt;Cross-account role assumption: Secure data sharing with temporary credentials&lt;/li&gt;
&lt;li&gt;Comprehensive audit logging: Full visibility into who accessed what, when, and how&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Coming soon (active development):&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;BucketOwnerEnforced (Upcoming Tentacle update): Disable ACLs, fix ownership chaos&lt;/li&gt;
&lt;li&gt;S3Control Block Public Access (Tentacle/Umbrella): Account-level public access prevention&lt;/li&gt;
&lt;li&gt;Organizational Units &amp;amp; SCPs (future): Immutable cluster-wide security policies&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Ceph community is making a substantial investment to bring Ceph Object
Gateway (RGW) to full feature parity with AWS S3&#39;s modern security model.
The roadmap is clear, and the commitment is real.&lt;/p&gt;
&lt;p&gt;The modern S3 security model is simpler, safer, and more auditable than ACLs ever
were. ACLs created invisible access paths that security teams couldn&#39;t see. IAM
policies are explicit, centralized, and visible in one place.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Disable ACLs today&lt;/strong&gt;. Your future self will thank you.&lt;/p&gt;
&lt;p&gt;Daniel would like to thank IBM for supporting the community with his time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Breaking the Static Key Habit: Modernizing Ceph RGW S3 Security with STS</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/" />
    <updated>2025-12-18T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-modernizing-sts/">&lt;h2 id=&quot;introduction%3A-the-usd-148-million-lesson&quot;&gt;Introduction: The USD 148 Million Lesson &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-the-usd-148-million-lesson&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In late 2016, &lt;a href=&quot;https://www.uber.com/en-CH/newsroom/2016-data-incident&quot;&gt;Uber&lt;/a&gt;
learned that intruders had accessed a trove of personal data stored in an
Amazon S3 bucket. The entry point was painfully mundane: attackers accessed
Uber&#39;s source code on GitHub using stolen credentials, found an AWS credential,
and used it to access Uber’s data. That single, long-lived credential exposed
data on roughly 57 million users and 600,000 drivers.&lt;/p&gt;
&lt;p&gt;The breach was bad; the duration risk was worse. Static access keys do not expire.
Once leaked, they remain active until someone notices, locates every instance in
use, and rotates them. That makes credential theft uniquely dangerous in cloud
and S3-style storage, because an attacker can repeatedly return, automate access,
and quietly expand their footprint.&lt;/p&gt;
&lt;p&gt;Uber ultimately agreed to a $148 million multistate settlement related to how
the incident was handled and disclosed. The exact dollar figure is not the
main lesson, though. The lesson is this: a single static key can turn a small
mistake into a durable breach.&lt;/p&gt;
&lt;p&gt;If you are running the Ceph Object Gateway (RGW), you face the same dynamic:
S3 credentials in an application configuration file &lt;code&gt;config.yaml&lt;/code&gt;, embedded
in scripts, or stored in CI/CD variables. Each one is a long-lived credential
that, once copied, can be used from anywhere the S3 endpoint is reachable.&lt;/p&gt;
&lt;p&gt;This post shows you how to eliminate static credentials using Security Token
Service (STS) with temporary credentials that expire automatically. By the end,
you&#39;ll understand how to implement the same security model that prevented these
breaches from being even worse, and how to adapt it for Ceph RGW.&lt;/p&gt;
&lt;h2 id=&quot;the-static-credential-problem&quot;&gt;The Static Credential Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-static-credential-problem&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Let&#39;s take a look at some examples of how most applications access S3 storage today:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# app-config.yaml (application config file)
s3:
  endpoint: https://s3.example.com
  access_key: AKIA1234567890ABCDEF
  secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  bucket: production-data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or with the credentials embedded directly in code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# backup.py
import boto3

s3 = boto3.client(
    &#39;s3&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,
    aws_access_key_id=&#39;AKIA1234567890ABCDEF&#39;,
    aws_secret_access_key=&#39;wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY&#39;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or in environment variables (slightly better, but not by much):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;export AWS_ACCESS_KEY_ID=AKIA1234567890ABCDEF
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;why-this-is-dangerous%3F&quot;&gt;Why Is This Dangerous? &lt;a class=&quot;link-anchor&quot; href=&quot;#why-this-is-dangerous%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;the-permanence-problem&quot;&gt;The Permanence Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-permanence-problem&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The fundamental issue with static credentials is that they never expire. Once
created, these keys authenticate requests indefinitely, working the same on
day one as they do five years later. This creates a dangerous organizational
memory gap. Keys made in 2020 still work in 2025, but no one remembers which
application uses them, what permissions they have, or whether they&#39;re even
still needed. When rotation finally becomes necessary, it requires coordinated
updates across all applications simultaneously, often in the middle of an
incident when coordination is most difficult.&lt;/p&gt;
&lt;h3 id=&quot;key-proliferation&quot;&gt;Key Proliferation &lt;a class=&quot;link-anchor&quot; href=&quot;#key-proliferation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Static credentials spread like a virus through an organization&#39;s infrastructure.
They start in a configuration file for a single application, then get copied
into container images where they&#39;re baked into immutable layers. They&#39;re added
to CI/CD pipelines where they&#39;re shared across multiple projects. Developers
copy them to their laptops for testing, where they sync to cloud backup services.
They end up in documentation and internal wikis, pasted as &amp;quot;helpful examples&amp;quot; for
other teams. Each copy represents another attack vector, another place where the
credentials might leak.&lt;/p&gt;
&lt;h3 id=&quot;the-revocation-nightmare&quot;&gt;The Revocation Nightmare &lt;a class=&quot;link-anchor&quot; href=&quot;#the-revocation-nightmare&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When credentials are eventually stolen (and with this level of exposure, it&#39;s
&lt;em&gt;when&lt;/em&gt;, not &lt;em&gt;if&lt;/em&gt;), the response options are replete with shortcomings. The
credentials work from anywhere where the S3 endpoint is accessible, so there&#39;s
no easy way to distinguish legitimate requests from attacker activity. Revoking
them immediately breaks every application that depends on those keys, forcing
an emergency deployment across potentially dozens of services. The alternative
is to leave them active while attackers maintain access, then race to update
applications before further damage occurs. Organizations need to coordinate
emergency updates during an active security incident, precisely when
coordination is hardest.&lt;/p&gt;
&lt;h3 id=&quot;the-permission-accumulation-problem&quot;&gt;The Permission Accumulation Problem &lt;a class=&quot;link-anchor&quot; href=&quot;#the-permission-accumulation-problem&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Static keys tend to accumulate permissions over time. They start with minimal
access, but as requirements evolve, it&#39;s easier to grant permissions than to
carefully audit what&#39;s truly necessary. &lt;em&gt;This key needs to read and write,
just to be safe.&lt;/em&gt; &lt;em&gt;Let&#39;s give it access to all buckets; we might expand to
new ones later.&lt;/em&gt; No one wants to risk disrupting production by restricting
access, mainly when credentials are spread across so many systems that tracking
down every usage point seems impossible.&lt;/p&gt;
&lt;h3 id=&quot;the-real-cost&quot;&gt;The Real Cost &lt;a class=&quot;link-anchor&quot; href=&quot;#the-real-cost&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Uber incident shows the real cost of a leaked static key. A single exposed
AWS access key pair leaked sensitive data on roughly 57 million users and
600,000 drivers, and Uber later agreed to a USD 148 million multistate settlement
related to the incident and its handling.&lt;/p&gt;
&lt;p&gt;The uncomfortable truth is that static keys turn small mistakes into persistent
breaches because credentials do not naturally &amp;quot;die&amp;quot;. Without expiration,
containment depends entirely on detection and coordinated rotation across
every place that the key has spread.&lt;/p&gt;
&lt;h2 id=&quot;the-solution%3A-temporary-credentials-via-sts&quot;&gt;The Solution: Temporary Credentials via STS &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-temporary-credentials-via-sts&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Security Token Service (STS) fundamentally reimagines how applications
authenticate with S3. Instead of using permanent credentials that live
forever, applications request temporary credentials that expire automatically
after a defined window, typically between fifteen minutes and twelve hours.
This simple shift transforms the entire security model.&lt;/p&gt;
&lt;p&gt;The mechanics work like this: Applications maintain a minimal service account
that is authorized to assume a role. When the application needs to access S3,
it calls the STS service using those service account credentials to request
temporary credentials for a specific role. STS validates that the service
account is authorized to assume that role, then issues time-limited credentials.
The application uses these temporary credentials for actual S3 operations. When
they expire, the application requests fresh credentials. The entire process is
transparent to the application&#39;s business logic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/sequence.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
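&lt;p&gt;As a minimal sketch of this flow with Boto3 (the endpoint URL, key values, and role ARN below are
placeholders, not values from a real deployment), the application side looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3

# Service account credentials: only allowed to call AssumeRole (placeholder values)
sts = boto3.client(
    &#39;sts&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,        # RGW endpoint (placeholder)
    region_name=&#39;default&#39;,
    aws_access_key_id=&#39;SERVICE_ACCESS_KEY&#39;,
    aws_secret_access_key=&#39;SERVICE_SECRET_KEY&#39;,
)

# Ask STS for time-limited credentials tied to a specific role
resp = sts.assume_role(
    RoleArn=&#39;arn:aws:iam::123456:role/backup-reader&#39;,   # placeholder role ARN
    RoleSessionName=&#39;backup-job-demo&#39;,
    DurationSeconds=3600,                               # one hour, the default
)
creds = resp[&#39;Credentials&#39;]   # AccessKeyId, SecretAccessKey, SessionToken, Expiration

# Use the temporary credentials for the actual S3 operations
s3 = boto3.client(
    &#39;s3&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,
    region_name=&#39;default&#39;,
    aws_access_key_id=creds[&#39;AccessKeyId&#39;],
    aws_secret_access_key=creds[&#39;SecretAccessKey&#39;],
    aws_session_token=creds[&#39;SessionToken&#39;],
)
print(s3.list_objects_v2(Bucket=&#39;backups&#39;).get(&#39;KeyCount&#39;, 0))
&lt;/code&gt;&lt;/pre&gt;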
&lt;h3 id=&quot;the-security-transformation&quot;&gt;The Security Transformation &lt;a class=&quot;link-anchor&quot; href=&quot;#the-security-transformation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With static keys, credentials remain valid indefinitely, and once stolen they keep
working until rotated. STS eliminates this problem through automatic expiration. When an
application calls &lt;code&gt;AssumeRole&lt;/code&gt;, it specifies a &lt;code&gt;DurationSeconds&lt;/code&gt; parameter that
defaults to 3600 seconds (one hour). The temporary credentials returned include
an expiration timestamp that cannot be modified or extended. If an attacker steals
temporary credentials from a compromised server or intercepts them in transit, those
credentials become worthless the moment they expire.&lt;/p&gt;
&lt;p&gt;The audit trail improves dramatically as well. Instead of seeing generic access
key IDs that could be used by any application anywhere, the RGW logs now show
which specific role was assumed (&lt;code&gt;role_name&lt;/code&gt;) and the session name provided when
the role was assumed (&lt;code&gt;role_session_name&lt;/code&gt;). When applications use descriptive
session names that include the application name and a timestamp, security teams
can immediately identify which application and which specific execution generated
each request. This attribution becomes critical during incident response, when
distinguishing legitimate traffic from attacker activity can mean the difference
between containing a breach and suffering a complete data exfiltration.&lt;/p&gt;
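&lt;p&gt;For example, a session name that embeds the application name and a timestamp (a convention,
not a requirement) can be built like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import time

# Descriptive session name: application name plus a timestamp, e.g. &#39;backup-job-1765579042&#39;
role_session_name = f&#39;backup-job-{int(time.time())}&#39;
# RGW records this value alongside the assumed role for every request made with the session
&lt;/code&gt;&lt;/pre&gt;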
&lt;p&gt;Consider the compromise scenario: An attacker gains access to a production server
and dumps memory, capturing the application&#39;s current S3 credentials. With static
keys, this can represent full, ongoing access to your data, potentially for months
before detection. With STS, the attacker has at most one hour before those credentials
expire and become useless. STS is not a silver bullet: it will not stop an attacker
already on the host. It does put every stolen credential on a timer, which sharply
limits persistence and reduces the “evergreen access key” problem. The application
continues to operate normally, automatically refreshing its credentials; incident
response can focus on evicting the attacker and preventing further refreshes rather
than racing to replace long-lived keys everywhere.&lt;/p&gt;
&lt;h4 id=&quot;%22wait%2C-aren&#39;t-we-still-using-a-static-key-to-assume-the-role%3F%22&quot;&gt;&amp;quot;Wait, aren&#39;t we still using a static key to assume the role?&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#%22wait%2C-aren&#39;t-we-still-using-a-static-key-to-assume-the-role%3F%22&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Yes, but with a critical difference. The service account (e.g., &lt;code&gt;backup-service&lt;/code&gt;)
possesses static Access and Secret Keys, but this user has zero permissions to
access S3 data. It cannot list buckets, read objects, or delete data.&lt;/p&gt;
&lt;p&gt;Its only capability is to call the STS API to assume a specific Role. If these
credentials are leaked, an attacker cannot directly steal data. They would have
to know which Role to assume and how to use it, which would add significant
friction. Furthermore, you have traces in the audit logs, and you can rotate
these service keys without disrupting the application&#39;s active S3 sessions.&lt;/p&gt;
&lt;h2 id=&quot;quick-primer%3A-understanding-roles-(just-what-you-need-for-sts)&quot;&gt;Quick Primer: Understanding Roles (Just What You Need for STS) &lt;a class=&quot;link-anchor&quot; href=&quot;#quick-primer%3A-understanding-roles-(just-what-you-need-for-sts)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Roles are part of the IAM (Identity and Access Management) API, which the Ceph
Object Gateway (RGW) implements to provide AWS-compatible identity management.
In this post, we focus on how roles enable STS-based authentication. We&#39;ll dive
deeper into the full IAM capabilities, including users, groups, policies, and
account-level governance, in a specific IAM security post coming soon.&lt;/p&gt;
&lt;h3 id=&quot;the-role-structure&quot;&gt;The Role Structure &lt;a class=&quot;link-anchor&quot; href=&quot;#the-role-structure&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Every role has two policies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Trust Policy - Defines who can assume the role&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Permission Policy - Defines what the role can do once assumed&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here&#39;s the complete flow: Your application holds a minimal service account that
is authorized to assume a role (via the role trust policy, an identity policy,
or both). When it needs to work (e.g., access S3 resources), it calls STS to
assume a role (e.g., &lt;code&gt;backup-reader&lt;/code&gt;). STS checks the role&#39;s trust policy,
validates the request, and issues temporary credentials (access key, secret
key, session token) that inherit the role&#39;s permissions. Those credentials
expire after one hour. The application uses them for S3 operations and
automatically requests new credentials as needed.&lt;/p&gt;
&lt;p&gt;Here is an example Trust Policy (who can assume the role) allowing the
user &lt;code&gt;backup-service&lt;/code&gt; to assume the role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {&amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::123456:user/backup-service&amp;quot;},
    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
  }]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is an example Permission Policy (what the role can do),
allowing read-only access to the bucket &lt;code&gt;backups&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Action&amp;quot;: [&amp;quot;s3:GetObject&amp;quot;, &amp;quot;s3:ListBucket&amp;quot;],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::backups&amp;quot;,
      &amp;quot;arn:aws:s3:::backups/*&amp;quot;
    ]
  }]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this post, we&#39;ll use inline policies (policies embedded directly in the role).
There are other canned policy types available in the IAM API, which we&#39;ll cover
in a future IAM post.&lt;/p&gt;
&lt;h3 id=&quot;beyond-service-accounts%3A-single-sign-on-authentication-with-oidc-integration&quot;&gt;Beyond Service Accounts: Single Sign-on Authentication with OIDC Integration &lt;a class=&quot;link-anchor&quot; href=&quot;#beyond-service-accounts%3A-single-sign-on-authentication-with-oidc-integration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The pattern we&#39;ll implement uses a service account with static credentials to
assume roles. However, RGW also supports &lt;code&gt;AssumeRoleWithWebIdentity&lt;/code&gt;, which
allows applications to assume roles using tokens from an enterprise identity
provider (such as RHSSO (Keycloak), IBM Security Verify, etc.) via OpenID
Connect (OIDC). This eliminates the need for static credentials: applications
authenticate via your existing SSO system to obtain a JWT, which they then use
to request a temporary credential directly from the STS API. This is the most
secure option for organizations with mature identity infrastructure, though it
requires additional OIDC provider configuration in RGW. We&#39;ll cover this
advanced pattern in a future post on identity federation.&lt;/p&gt;
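&lt;p&gt;As a rough sketch only (the endpoint, role ARN, and token below are placeholders, and the OIDC
provider setup is out of scope here), the call differs from &lt;code&gt;AssumeRole&lt;/code&gt; mainly in passing
the JWT instead of signing with static keys:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3

# Sketch only: the JWT comes from your OIDC provider (e.g. Keycloak) after SSO authentication
sts = boto3.client(&#39;sts&#39;, endpoint_url=&#39;https://s3.example.com&#39;, region_name=&#39;default&#39;)

creds = sts.assume_role_with_web_identity(
    RoleArn=&#39;arn:aws:iam::123456:role/backup-reader&#39;,   # placeholder role ARN
    RoleSessionName=&#39;backup-job-oidc&#39;,
    WebIdentityToken=&#39;eyJhbGciOi...&#39;,                    # JWT obtained from the identity provider
)[&#39;Credentials&#39;]
# The request is authenticated by the JWT itself rather than by static keys
&lt;/code&gt;&lt;/pre&gt;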
&lt;h2 id=&quot;implementing-sts-in-ceph-rgw%3A-step-by-step&quot;&gt;Implementing STS in Ceph RGW: Step by Step &lt;a class=&quot;link-anchor&quot; href=&quot;#implementing-sts-in-ceph-rgw%3A-step-by-step&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This implementation builds on the IAM foundation covered
in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/enhancing-ceph-multitenancy-with-iam-accounts&quot;&gt;Enhancing Ceph Multitenancy with IAM Accounts.&lt;/a&gt;
If you&#39;re new to Ceph IAM accounts, that post covers account creation, user
management, and policy basics. Here, we focus specifically on enabling STS
and using roles for temporary credentials.&lt;/p&gt;
&lt;p&gt;Let&#39;s build on an example use case. We&#39;ll create a role for a backup service
that needs read-only access to a specific bucket.&lt;/p&gt;
&lt;p&gt;To follow this guide, you will need:&lt;/p&gt;
&lt;p&gt;Admin access to the Ceph cluster: SSH access to a node where you can
run &lt;code&gt;ceph&lt;/code&gt; and &lt;code&gt;radosgw-admin&lt;/code&gt; commands.&lt;/p&gt;
&lt;p&gt;AWS CLI: Installed on your workstation to interact with the RGW S3 endpoint.&lt;/p&gt;
&lt;p&gt;Python 3 and Boto3: For running the automation scripts (&lt;code&gt;pip install boto3&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Ceph Squid or later: While basic STS works on older versions, the IAM Accounts
feature used in this guide requires Ceph Squid (19.2.0) or newer.&lt;/p&gt;
&lt;h3 id=&quot;step-1%3A-enable-sts-in-rgw-configuration&quot;&gt;Step 1: Enable STS in RGW Configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#step-1%3A-enable-sts-in-rgw-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;STS must be explicitly enabled in your RGW configuration. The configuration
uses the Ceph config database and requires two settings.&lt;/p&gt;
&lt;p&gt;Generate a secure STS key (must be exactly 16 characters):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Generate a 16-character random key
$ openssl rand -hex 8
# Example output: 0a1b2c3d4e5f6789
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure RGW to use STS:&lt;/p&gt;
&lt;p&gt;Most deployments use &lt;code&gt;client.rgw.default&lt;/code&gt; as the RGW client identifier. If your
deployment uses a custom service name, replace &lt;code&gt;default&lt;/code&gt; with your service name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Set the STS encryption key (MUST be exactly 16 characters)
$ ceph config set client.rgw.default rgw_sts_key 0a1b2c3d4e5f6789

# Enable STS authentication
$ ceph config set client.rgw.default rgw_s3_auth_use_sts true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Ceph-Specific Configuration Note&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Unlike AWS, where STS is a global service enabled by default, Ceph requires you
to explicitly configure the encryption key used to sign the session tokens.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Critical Requirement&lt;/em&gt;: The &lt;code&gt;rgw_sts_key&lt;/code&gt; must be exactly 16 characters long.
If it is 15 or 17 characters, the STS handshake will fail silently or with
opaque 500 errors.&lt;/p&gt;
&lt;p&gt;Restart all RGW instances to apply changes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# For default service
$ ceph orch restart client.rgw
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config get client.rgw.default rgw_s3_auth_use_sts
$ ceph config get client.rgw.default rgw_sts_key
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;step-2%3A-create-iam-account%2C-root-user%2C-and-service-user&quot;&gt;Step 2: Create IAM Account, Root User, and Service User &lt;a class=&quot;link-anchor&quot; href=&quot;#step-2%3A-create-iam-account%2C-root-user%2C-and-service-user&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;IAM accounts provide multi-tenancy and resource organization. We&#39;ll create an
account, a root user for administrative tasks, and a restricted service user
for applications.&lt;/p&gt;
&lt;h4 id=&quot;create-the-iam-account%3A&quot;&gt;Create the IAM account: &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-iam-account%3A&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin account create  --account-name=backup-team 
{
    &amp;quot;id&amp;quot;: &amp;quot;RGW89761398048153888&amp;quot;,
    &amp;quot;tenant&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;backup-team&amp;quot;,
    &amp;quot;email&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;quota&amp;quot;: {
        &amp;quot;enabled&amp;quot;: false,
        &amp;quot;check_on_raw&amp;quot;: false,
        &amp;quot;max_size&amp;quot;: -1,
        &amp;quot;max_size_kb&amp;quot;: 0,
        &amp;quot;max_objects&amp;quot;: -1
    },
    &amp;quot;bucket_quota&amp;quot;: {
        &amp;quot;enabled&amp;quot;: false,
        &amp;quot;check_on_raw&amp;quot;: false,
        &amp;quot;max_size&amp;quot;: -1,
        &amp;quot;max_size_kb&amp;quot;: 0,
        &amp;quot;max_objects&amp;quot;: -1
    },
    &amp;quot;max_users&amp;quot;: 1000,
    &amp;quot;max_roles&amp;quot;: 1000,
    &amp;quot;max_groups&amp;quot;: 1000,
    &amp;quot;max_buckets&amp;quot;: 1000,
    &amp;quot;max_access_keys&amp;quot;: 4
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;create-the-account-root-user-(for-administrative-tasks)&quot;&gt;Create the Account Root User (for administrative tasks) &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-account-root-user-(for-administrative-tasks)&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The account root user has full permissions on all resources within the account
by default, including the ability to use the IAM API to create roles and manage
policies. This is built into the account system; no additional capabilities are
needed.&lt;/p&gt;
&lt;h4 id=&quot;create-the-root-user-for-the-account%3A&quot;&gt;Create the root user for the account: &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-root-user-for-the-account%3A&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin user create --account-id=RGW89761398048153888 &#92;
  --uid=backup-admin --display-name=&amp;quot;Backup-Team-Admin&amp;quot; &#92;
  --account-root --gen-access-key --gen-secret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--account-root&lt;/code&gt; flag is critical: it designates this user as the account&#39;s
root user, granting full administrative permissions within the account&#39;s scope.&lt;/p&gt;
&lt;p&gt;The Ceph documentation states that: &lt;em&gt;Account owners are encouraged to use this
account root user for management only, and create users and roles with fine-grained
permissions for specific applications.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For this tutorial, we&#39;ll use the root user for setup tasks to keep things simple.
In production, you would typically use the root user to set up IAM users with
specific permissions, then remove or restrict the root user&#39;s credentials.&lt;/p&gt;
&lt;h3 id=&quot;create-the-backup-service-user-(for-applications)&quot;&gt;Create the Backup Service User (for applications) &lt;a class=&quot;link-anchor&quot; href=&quot;#create-the-backup-service-user-(for-applications)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This user will have minimal permissions, only the ability to assume roles.
No direct access to S3 resources.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin user create &#92;
  --account-id=RGW89761398048153888 &#92;
  --uid=backup-service &#92;
  --display-name=&amp;quot;backup-service&amp;quot; &#92;
  --gen-access-key &#92;
  --gen-secret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;The service account has no S3 permissions and no IAM capabilities. It can only
assume roles that explicitly trust it.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&quot;configure-aws-cli-profiles&quot;&gt;Configure AWS CLI Profiles &lt;a class=&quot;link-anchor&quot; href=&quot;#configure-aws-cli-profiles&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Configure two AWS CLI profiles, one for each user. Each profile contains the user&#39;s
credentials and the RGW/STS endpoint URL, so we don’t need to specify the endpoint
on each &lt;code&gt;AWS CLI&lt;/code&gt; command. See
the &lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html&quot;&gt;AWS CLI configuration documentation&lt;/a&gt;
for details.&lt;/p&gt;
&lt;h3 id=&quot;aws-profile-summary-for-this-setup%3A&quot;&gt;AWS Profile summary for this setup: &lt;a class=&quot;link-anchor&quot; href=&quot;#aws-profile-summary-for-this-setup%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;backup-admin&lt;/code&gt; profile: Uses root user credentials, S3/IAM/STS endpoint &lt;a href=&quot;https://s3.cephlabs.com&quot;&gt;https://s3.cephlabs.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;backup-service&lt;/code&gt; profile: Uses service account credentials, S3/IAM/STS endpoint &lt;a href=&quot;https://s3.cephlabs.com&quot;&gt;https://s3.cephlabs.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is an example &lt;code&gt;.aws/config&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ini&quot;&gt;[profile backup-admin]
region = default
output = json
services = ceph-rgw

[profile backup-service]
region = default
output = json
services = ceph-rgw

[services ceph-rgw]
s3 =
  endpoint_url = https://s3.cephlabs.com
s3api =
  endpoint_url = https://s3.cephlabs.com
iam =
  endpoint_url = https://s3.cephlabs.com
sts =
  endpoint_url = https://s3.cephlabs.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify both profiles:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Test root user (should work - has full permissions)
$ aws s3 ls --profile backup-admin

# Test service user (should fail - has no S3 permissions yet)
$ aws s3 ls --profile backup-service
# Expected to fail: the RGW logs show AccessDenied; the CLI may print an error such as:
argument of type &#39;NoneType&#39; is not iterable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;identity-summary&quot;&gt;Identity Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#identity-summary&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;At this point, you have two users in the IAM account:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;User&lt;/th&gt;&lt;th&gt;Type&lt;/th&gt;&lt;th&gt;Permissions&lt;/th&gt;&lt;th&gt;Used For&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;backup-admin&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Account root user (&lt;code&gt;--account-root&lt;/code&gt;)&lt;/td&gt;&lt;td&gt;Full permissions on all account resources + IAM API access&lt;/td&gt;&lt;td&gt;Creating buckets, creating/managing roles via AWS CLI&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;backup-service&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Regular user&lt;/td&gt;&lt;td&gt;None (can only assume roles)&lt;/td&gt;&lt;td&gt;Running backup applications with temporary credentials&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&quot;step-3%3A-create-the-backup-bucket&quot;&gt;Step 3: Create the Backup Bucket &lt;a class=&quot;link-anchor&quot; href=&quot;#step-3%3A-create-the-backup-bucket&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Run this as the backup admin user (who has S3 permissions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3 mb s3://backups --profile backup-admin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why the admin user? The service account (&lt;code&gt;backup-service&lt;/code&gt;) has no S3 permissions
yet; it can only assume roles. The admin user creates the infrastructure (buckets),
then creates roles that grant specific permissions to those buckets.&lt;/p&gt;
&lt;p&gt;Verify the bucket exists:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3 ls --profile backup-admin
2025-12-12 17:09:25 backups
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;step-4%3A-create-the-role&quot;&gt;Step 4: Create the Role &lt;a class=&quot;link-anchor&quot; href=&quot;#step-4%3A-create-the-role&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Run these commands as the account root user (&lt;code&gt;backup-admin&lt;/code&gt;),
who has full IAM API permissions.&lt;/p&gt;
&lt;p&gt;Create a role trust policy (who can assume this role):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat &amp;gt; trust-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Principal&amp;quot;: {&amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:user/backup-service&amp;quot;},
    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;ARNs in IAM Accounts (Ceph Object Gateway): In the IAM Accounts model, the user
ARN is built from the account ID plus the user name; in Ceph this “name” corresponds
to the user’s display-name (not the &lt;code&gt;--uid&lt;/code&gt;). If your &lt;code&gt;--uid&lt;/code&gt; and &lt;code&gt;--display-name&lt;/code&gt;
differ, ensure that your trust policy &lt;code&gt;Principal&lt;/code&gt; ARN uses the display-name value,
or the AssumeRole request will not match.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Authorization to assume a role can be granted in two ways. In this tutorial we grant
it via the role trust policy by naming the service user as the &lt;code&gt;Principal&lt;/code&gt;. In
same-account setups, this is sufficient; no user policy is required. If you instead
trust the whole account or you are doing cross-account access, attach an identity
policy to the user or group allowing &lt;code&gt;sts:AssumeRole&lt;/code&gt; on the specific role ARN.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Create the role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam create-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --assume-role-policy-document file://trust-policy.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create permission policy (what the role can do):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat &amp;gt; permissions-policy.json &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
  &amp;quot;Statement&amp;quot;: [{
    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
    &amp;quot;Action&amp;quot;: [
      &amp;quot;s3:GetObject&amp;quot;,
      &amp;quot;s3:ListBucket&amp;quot;
    ],
    &amp;quot;Resource&amp;quot;: [
      &amp;quot;arn:aws:s3:::backups&amp;quot;,
      &amp;quot;arn:aws:s3:::backups/*&amp;quot;
    ]
  }]
}
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Attach permissions to role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam put-role-policy &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --policy-name backup-read-policy &#92;
  --policy-document file://permissions-policy.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify that the role was created:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam get-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader
{
    &amp;quot;Role&amp;quot;: {
        &amp;quot;Path&amp;quot;: &amp;quot;/&amp;quot;,
        &amp;quot;RoleName&amp;quot;: &amp;quot;backup-reader&amp;quot;,
        &amp;quot;RoleId&amp;quot;: &amp;quot;8c8eec8c-c647-42bb-8a53-36c6d2fc747a&amp;quot;,
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:role/backup-reader&amp;quot;,
        &amp;quot;CreateDate&amp;quot;: &amp;quot;2025-12-12T22:10:18.644Z&amp;quot;,
        &amp;quot;AssumeRolePolicyDocument&amp;quot;: {
            &amp;quot;Version&amp;quot;: &amp;quot;2012-10-17&amp;quot;,
            &amp;quot;Statement&amp;quot;: [
                {
                    &amp;quot;Effect&amp;quot;: &amp;quot;Allow&amp;quot;,
                    &amp;quot;Principal&amp;quot;: {
                        &amp;quot;AWS&amp;quot;: &amp;quot;arn:aws:iam::RGW89761398048153888:user/backup-service&amp;quot;
                    },
                    &amp;quot;Action&amp;quot;: &amp;quot;sts:AssumeRole&amp;quot;
                }
            ]
        },
        &amp;quot;Description&amp;quot;: &amp;quot;&amp;quot;,
        &amp;quot;MaxSessionDuration&amp;quot;: 3600
    }
}

$ aws --profile backup-service sts assume-role --role-arn &amp;quot;arn:aws:iam::RGW89761398048153888:role/backup-reader&amp;quot; --output json --role-session-name testbr
{
    &amp;quot;Credentials&amp;quot;: {
        &amp;quot;AccessKeyId&amp;quot;: &amp;quot;reUwxxxxxxn&amp;quot;,
        &amp;quot;SecretAccessKey&amp;quot;: &amp;quot;CQGxxxxxxx&amp;quot;,
        &amp;quot;SessionToken&amp;quot;: &amp;quot;nADwRdQ5xxxx90qMZlDPl4ozBjcQKF1tceytgNVGD5D4h2FpoMvjybl31cXI9uh/nUrQePW+Ob3TmpMa4QXdXfml/gQYSYeQLJEzNncQPUQB9+QUl5TShDy4RYYziRulTMWrkYokL6kI0uN0LksQ56/qOyd59A1qbWtsBNYBdvxUUi7r3lhrifn4MNWQbErJKCVNdVOBSzN1L34JDMvjEqN2QyKWLQI16D+XhCq8V05OnQFMHsf128BealrX+KkWS6+74G960WzoHzWDwHF1uO08VlFYCdHO0A==&amp;quot;,
        &amp;quot;Expiration&amp;quot;: &amp;quot;2025-12-12T23:34:00.247844317Z&amp;quot;
    },
    &amp;quot;AssumedRoleUser&amp;quot;: {
        &amp;quot;Arn&amp;quot;: &amp;quot;arn:aws:sts::RGW89761398048153888:assumed-role/backup-reader/testbr&amp;quot;
    },
    &amp;quot;PackedPolicySize&amp;quot;: 0
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;step-5%3A-write-application-code-(python-example)&quot;&gt;Step 5: Write Application Code (Python Example) &lt;a class=&quot;link-anchor&quot; href=&quot;#step-5%3A-write-application-code-(python-example)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This code runs with the service account credentials (&lt;code&gt;backup-service&lt;/code&gt;), which have
no direct S3 access. The application calls STS to assume the &lt;code&gt;backup-reader&lt;/code&gt;
role and receives temporary credentials for S3 operations.&lt;/p&gt;
&lt;p&gt;Identity flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The application starts with &lt;code&gt;backup-service&lt;/code&gt; credentials (long-term, minimal permissions)&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;AssumeRole&lt;/code&gt; using those credentials to request the &lt;code&gt;backup-reader&lt;/code&gt; role&lt;/li&gt;
&lt;li&gt;Receives temporary credentials (access key + secret + session token)&lt;/li&gt;
&lt;li&gt;Uses temporary credentials for all S3 operations&lt;/li&gt;
&lt;li&gt;Temporary credentials expire after 1 hour (or configured duration)&lt;/li&gt;
&lt;li&gt;Application manually checks expiration before each operation and refreshes if needed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Upload test file (as admin user who has write permissions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ echo &amp;quot;test backup data&amp;quot; &amp;gt; test-backup.txt
$ aws s3 cp test-backup.txt s3://backups/ --profile backup-admin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To run the script, download it from the GitHub Gist and export the
&lt;code&gt;backup-service&lt;/code&gt; user&#39;s credentials and the endpoint as environment variables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Download script
$ wget -O backup_service.py https://gist.githubusercontent.com/likid0/f7c40c4851bf32c595c7a5e63cf21f35/raw/137bfeea46c20d46d37fa026e29f1b5193c3e281/gistfile1.txt
# Make it executable
$ chmod +x backup_service.py
# Export Vars
$ export AWS_ACCESS_KEY_ID=&#39;AKIAIOSFODNN7EXAMPLE&#39;
$ export AWS_SECRET_ACCESS_KEY=&#39;wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY&#39;
$ export S3_ENDPOINT_URL=&#39;https://s3.example.com&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the example &lt;code&gt;backup_service.py&lt;/code&gt; script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ python backup_service.py

================================================================================
 Backup Service - STS Temporary Credentials Demo
================================================================================

Configuration:
  Endpoint: https://s3.cephlabs.com
  User:     backup-service
  Key:      VQQPNR4XOW...

Note: These are the service account&#39;s PERMANENT credentials
      They have NO direct S3 permissions (can only assume roles)

================================================================================
Calling AssumeRole API to get temporary credentials...
================================================================================

AssumeRole Parameters:
  RoleArn:         arn:aws:iam::RGW89761398048153888:role/backup-reader
  RoleSessionName: backup-job-1765579042
  DurationSeconds: 3600 (1 hour)

Authentication:  Using service account credentials (backup-service)
                 AccessKey: VQQPNR4XOW...

Calling STS endpoint: https://s3.cephlabs.com

SUCCESS! Received temporary credentials:
  AccessKeyId:     YAhacPIIT4BcUWiyPC0M...
  SecretAccessKey: S4VSWFTM2U... (redacted)
  SessionToken:    Yn3A4Mt4VGQoIvloer2ByH3aecQAeP... (redacted)
  Expiration:      2025-12-12 23:37:22.878813+00:00
================================================================================


 Listing backups in bucket &#39;backups&#39;...
   Found 1 object(s):

    test-backup.txt
      Size: 0.00 MB (17 bytes)
      Modified: 2025-12-12 22:11:23.474000+00:00


================================================================================
Demo completed successfully!
================================================================================
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example script uses manual credential checking: the &lt;code&gt;_check_credentials()&lt;/code&gt;
method checks expiration time before each operation and calls &lt;code&gt;_refresh_credentials()&lt;/code&gt;
when needed. This is simple and works well for most use cases.&lt;/p&gt;
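&lt;p&gt;A condensed sketch of that manual pattern (not the full Gist; the endpoint and the
five-minute refresh margin are illustrative) looks roughly like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import datetime, timedelta, timezone

import boto3


class BackupService:
    # Minimal sketch of manual credential checking; details are illustrative

    def __init__(self, sts_client, role_arn):
        self.sts = sts_client
        self.role_arn = role_arn
        self.creds = None
        self.expiration = None

    def _refresh_credentials(self):
        resp = self.sts.assume_role(
            RoleArn=self.role_arn,
            RoleSessionName=f&#39;backup-job-{int(datetime.now(timezone.utc).timestamp())}&#39;,
            DurationSeconds=3600,
        )
        self.creds = resp[&#39;Credentials&#39;]
        self.expiration = self.creds[&#39;Expiration&#39;]   # timezone-aware datetime

    def _check_credentials(self):
        # Refresh if missing, or within five minutes of expiry, to avoid failing mid-request
        now = datetime.now(timezone.utc)
        if self.creds is None or now &amp;gt; self.expiration - timedelta(minutes=5):
            self._refresh_credentials()

    def list_backups(self, bucket):
        self._check_credentials()
        s3 = boto3.client(
            &#39;s3&#39;,
            endpoint_url=&#39;https://s3.example.com&#39;,   # placeholder RGW endpoint
            region_name=&#39;default&#39;,
            aws_access_key_id=self.creds[&#39;AccessKeyId&#39;],
            aws_secret_access_key=self.creds[&#39;SecretAccessKey&#39;],
            aws_session_token=self.creds[&#39;SessionToken&#39;],
        )
        resp = s3.list_objects_v2(Bucket=bucket)
        return [obj[&#39;Key&#39;] for obj in resp.get(&#39;Contents&#39;, [])]
&lt;/code&gt;&lt;/pre&gt;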
&lt;p&gt;For long-running jobs (hours or days), see the &amp;quot;Handling Long-Running Jobs:
Credential Refresh Strategies&amp;quot; section later in this post, which covers
automatic credential refresh using Boto3&#39;s &lt;code&gt;RefreshableCredentials&lt;/code&gt;. With
automatic refresh, Boto3 handles the timing and renewal for you so you never
have to think about expiration.&lt;/p&gt;
&lt;h2 id=&quot;handling-long-running-jobs%3A-credential-refresh-strategies&quot;&gt;Handling Long-Running Jobs: Credential Refresh Strategies &lt;a class=&quot;link-anchor&quot; href=&quot;#handling-long-running-jobs%3A-credential-refresh-strategies&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A critical consideration for production deployments is handling jobs that run
longer than the credential lifetime.&lt;/p&gt;
&lt;h3 id=&quot;the-challenge&quot;&gt;The Challenge &lt;a class=&quot;link-anchor&quot; href=&quot;#the-challenge&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;DurationSeconds&lt;/code&gt; parameter controls how long the temporary credentials remain valid:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Minimum: 900 seconds (15 minutes), configurable via &lt;code&gt;rgw_sts_min_session_duration&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Default: 3600 seconds (1 hour)&lt;/li&gt;
&lt;li&gt;Maximum: limited by the role&#39;s &lt;code&gt;max_session_duration&lt;/code&gt; attribute (defaults to 3600)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a role is created, it has a &lt;code&gt;max_session_duration&lt;/code&gt; of 3600 seconds by default.
This means even if you request &lt;code&gt;DurationSeconds=7200&lt;/code&gt; (2 hours), the request
will be limited to the role&#39;s maximum. To allow longer sessions, you would
need to modify the role&#39;s &lt;code&gt;max_session_duration&lt;/code&gt; when creating it (though for
security, shorter durations are recommended).&lt;/p&gt;
&lt;p&gt;Here, we share three example strategies for handling this.&lt;/p&gt;
&lt;h3 id=&quot;strategy-1%3A-increase-token-duration-(up-to-12-hours)&quot;&gt;Strategy 1: Increase Token Duration (Up to 12 Hours) &lt;a class=&quot;link-anchor&quot; href=&quot;#strategy-1%3A-increase-token-duration-(up-to-12-hours)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The most straightforward approach is to request longer-lived credentials and
configure the role to allow them.&lt;/p&gt;
&lt;h4 id=&quot;configure-maximum-session-duration-on-the-role%3A&quot;&gt;Configure maximum session duration on the role: &lt;a class=&quot;link-anchor&quot; href=&quot;#configure-maximum-session-duration-on-the-role%3A&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When creating the role, you can set a maximum session duration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam create-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --assume-role-policy-document file://trust-policy.json &#92;
  --max-session-duration 43200  # 12 hours
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or modify an existing role:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam update-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --max-session-duration 43200  # 12 hours
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the setting:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws iam get-role &#92;
  --profile backup-admin &#92;
  --role-name backup-reader &#92;
  --query &#39;Role.MaxSessionDuration&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;RGW Configuration: the following config option controls the global maximum:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ceph config set client.rgw.default rgw_sts_max_session_duration 43200
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maximum duration in Ceph RGW: 12 hours (43,200 seconds)&lt;/li&gt;
&lt;li&gt;Not an ideal solution, as it extends the duration of the tokens to twelve hours&lt;/li&gt;
&lt;li&gt;Suitable for jobs that can be completed within 12 hours&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;strategy-2%3A-automatic-credential-refresh-with-refreshablecredentials&quot;&gt;Strategy 2: Automatic Credential Refresh with RefreshableCredentials &lt;a class=&quot;link-anchor&quot; href=&quot;#strategy-2%3A-automatic-credential-refresh-with-refreshablecredentials&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For jobs longer than 12 hours, or to avoid managing token duration,
implement automatic refresh using botocore&#39;s &lt;code&gt;RefreshableCredentials&lt;/code&gt;.
This pattern continuously calls &lt;code&gt;AssumeRole&lt;/code&gt; to get fresh credentials
before expiration.&lt;/p&gt;
&lt;p&gt;An enhanced &lt;code&gt;BackupService&lt;/code&gt; example script with STS token Auto-Refresh
is available &lt;a href=&quot;https://gist.github.com/likid0/25519b2f46b63de89f7fe0d2dc9ff283&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;How it works (a minimal code sketch follows this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;RefreshableCredentials&lt;/code&gt; wraps your credential fetching logic&lt;/li&gt;
&lt;li&gt;Before each AWS API call, Boto3 checks if credentials are expired or expiring soon&lt;/li&gt;
&lt;li&gt;If needed, &lt;code&gt;boto3&lt;/code&gt; automatically calls &lt;code&gt;_refresh_credentials()&lt;/code&gt; to get fresh credentials&lt;/li&gt;
&lt;li&gt;Your application never sees authentication errors due to expiration&lt;/li&gt;
&lt;li&gt;Each refresh calls &lt;code&gt;AssumeRole&lt;/code&gt; using the original service account credentials&lt;/li&gt;
&lt;/ul&gt;
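&lt;p&gt;A rough sketch of this wiring (the linked Gist is more complete; the endpoint, keys, and
role ARN below are placeholders, and assigning the session&#39;s private &lt;code&gt;_credentials&lt;/code&gt;
attribute is a common pattern rather than a stable API) might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session

STS_ENDPOINT = &#39;https://s3.example.com&#39;               # placeholder RGW/STS endpoint
ROLE_ARN = &#39;arn:aws:iam::123456:role/backup-reader&#39;   # placeholder role ARN


def fetch_sts_credentials():
    # Called by botocore whenever the cached credentials are about to expire
    sts = boto3.client(
        &#39;sts&#39;,
        endpoint_url=STS_ENDPOINT,
        region_name=&#39;default&#39;,
        aws_access_key_id=&#39;SERVICE_ACCESS_KEY&#39;,       # backup-service long-term key (placeholder)
        aws_secret_access_key=&#39;SERVICE_SECRET_KEY&#39;,
    )
    creds = sts.assume_role(
        RoleArn=ROLE_ARN,
        RoleSessionName=&#39;backup-job-auto-refresh&#39;,
    )[&#39;Credentials&#39;]
    return {
        &#39;access_key&#39;: creds[&#39;AccessKeyId&#39;],
        &#39;secret_key&#39;: creds[&#39;SecretAccessKey&#39;],
        &#39;token&#39;: creds[&#39;SessionToken&#39;],
        &#39;expiry_time&#39;: creds[&#39;Expiration&#39;].isoformat(),
    }


# Wrap the fetcher so botocore refreshes transparently before each API call if needed
refreshable = RefreshableCredentials.create_from_metadata(
    metadata=fetch_sts_credentials(),
    refresh_using=fetch_sts_credentials,
    method=&#39;sts-assume-role&#39;,
)
botocore_session = get_session()
botocore_session._credentials = refreshable   # private attribute; common pattern, not a stable API
session = boto3.Session(botocore_session=botocore_session, region_name=&#39;default&#39;)

s3 = session.client(&#39;s3&#39;, endpoint_url=STS_ENDPOINT)
s3.list_objects_v2(Bucket=&#39;backups&#39;)   # credentials refresh automatically as needed
&lt;/code&gt;&lt;/pre&gt;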
&lt;p&gt;Key Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works for jobs of any length (days, weeks)&lt;/li&gt;
&lt;li&gt;No manual credential management needed&lt;/li&gt;
&lt;li&gt;Boto3 handles refresh timing automatically&lt;/li&gt;
&lt;li&gt;Original service account credentials remain secure (never exposed to S3 operations)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Important Notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The service account&#39;s long-term credentials must remain valid for the entire job&lt;/li&gt;
&lt;li&gt;Each refresh makes a new &lt;code&gt;AssumeRole&lt;/code&gt; call to STS&lt;/li&gt;
&lt;li&gt;Credentials are cached in memory only (not written to disk)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;strategy-3%3A-use-third-party-libraries&quot;&gt;Strategy 3: Use Third-Party Libraries &lt;a class=&quot;link-anchor&quot; href=&quot;#strategy-3%3A-use-third-party-libraries&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you prefer not to work with botocore internals, use a well-maintained library:&lt;/p&gt;
&lt;p&gt;Install the library &lt;code&gt;aws-assume-role-lib&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ pip install aws-assume-role-lib
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference the library in code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3
import aws_assume_role_lib

# Create session with automatic refresh
parent_session = boto3.Session(
    aws_access_key_id=&#39;BACKUP_SERVICE_KEY&#39;,
    aws_secret_access_key=&#39;your-secret-key&#39;
)

# This session automatically refreshes expired credentials
assumed_role_session = aws_assume_role_lib.assume_role(
    parent_session, 
    &#39;arn:aws:iam::RGW12345678901234567:role/backup-reader&#39;
)

# Use it like any boto3 session
s3 = assumed_role_session.client(&#39;s3&#39;, endpoint_url=&#39;https://s3.example.com&#39;)
s3.list_buckets()  # Credentials auto-refresh as needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;static-key-rotation%3A-completing-the-security-model&quot;&gt;Static Key Rotation: Completing the Security Model &lt;a class=&quot;link-anchor&quot; href=&quot;#static-key-rotation%3A-completing-the-security-model&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You&#39;ve now implemented STS for temporary credentials, but there&#39;s one
final layer to complete the security architecture: rotating the service
account&#39;s static keys.&lt;/p&gt;
&lt;h3 id=&quot;background%3A-the-create_date-field&quot;&gt;Background: The create_date Field &lt;a class=&quot;link-anchor&quot; href=&quot;#background%3A-the-create_date-field&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Starting with Tentacle, Ceph RGW now includes a &lt;code&gt;create_date&lt;/code&gt; timestamp for
each access key in the user metadata. This addition enables programmatic key
age tracking and automated rotation: a critical capability for eliminating
static credential risk.&lt;/p&gt;
&lt;p&gt;Example output from &lt;code&gt;radosgw-admin user info&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;user_id&amp;quot;: &amp;quot;backup-service&amp;quot;,
    &amp;quot;keys&amp;quot;: [{
        &amp;quot;user&amp;quot;: &amp;quot;backup-service&amp;quot;,
        &amp;quot;access_key&amp;quot;: &amp;quot;XXXXXXXX&amp;quot;,
        &amp;quot;secret_key&amp;quot;: &amp;quot;XtDhTWsb6vkNOsAnWBXSIhDhqdRBYXXXXXXX&amp;quot;,
        &amp;quot;active&amp;quot;: true,
        &amp;quot;create_date&amp;quot;: &amp;quot;2025-12-12T22:02:16.628205Z&amp;quot;  ← Key creation timestamp
    }]
}
&lt;/code&gt;&lt;/pre&gt;
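&lt;p&gt;One simple way to act on this field is to parse the &lt;code&gt;radosgw-admin user info&lt;/code&gt; output
and flag keys older than your rotation threshold. A minimal sketch (the threshold and user ID
are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
import subprocess
from datetime import datetime, timezone

MAX_KEY_AGE_DAYS = 30   # illustrative rotation policy

# Requires admin access to a node where radosgw-admin is available
out = subprocess.run(
    [&#39;radosgw-admin&#39;, &#39;user&#39;, &#39;info&#39;, &#39;--uid=backup-service&#39;],
    capture_output=True, text=True, check=True,
).stdout
user = json.loads(out)

for key in user.get(&#39;keys&#39;, []):
    created = datetime.fromisoformat(key[&#39;create_date&#39;].replace(&#39;Z&#39;, &#39;+00:00&#39;))
    age_days = (datetime.now(timezone.utc) - created).days
    if age_days &amp;gt; MAX_KEY_AGE_DAYS:
        print(f&#39;Key {key[&amp;quot;access_key&amp;quot;]} is {age_days} days old: rotate it&#39;)
&lt;/code&gt;&lt;/pre&gt;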
&lt;h3 id=&quot;recommended-approach%3A-use-a-secrets-manager&quot;&gt;Recommended Approach: Use a Secrets Manager &lt;a class=&quot;link-anchor&quot; href=&quot;#recommended-approach%3A-use-a-secrets-manager&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The best way to implement key rotation is with a secrets manager such as
HashiCorp Vault, IBM GKLM, AWS Secrets Manager, Google Secret Manager,
or Azure Key Vault. This approach enables zero-downtime rotation
without code changes.&lt;/p&gt;
&lt;p&gt;How it works (a code sketch of the read path follows this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Application queries secrets manager (no hardcoded credentials):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Application starts up and queries Vault/secrets manager for credentials&lt;/li&gt;
&lt;li&gt;Gets current &lt;code&gt;access_key&lt;/code&gt; and &lt;code&gt;secret_key&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Uses these to call &lt;code&gt;AssumeRole&lt;/code&gt; and get temporary STS credentials&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When keys rotate (automated monthly rotation using the &lt;code&gt;create_date&lt;/code&gt; field time stamp):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate a new Ceph access key with &lt;code&gt;radosgw-admin key create&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Update the secret in Vault/secrets manager with new credentials&lt;/li&gt;
&lt;li&gt;Keep both old and new keys active in Ceph for a 7-day transition period&lt;/li&gt;
&lt;li&gt;After 7 days, remove the old key from Ceph&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Application automatically gets new keys:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The next time the application restarts or refreshes credentials, it queries the secrets manager&lt;/li&gt;
&lt;li&gt;Gets the new credentials automatically&lt;/li&gt;
&lt;li&gt;No code changes required: the application doesn&#39;t know rotation happened&lt;/li&gt;
&lt;li&gt;No downtime: the old key still works during the transition&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
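&lt;p&gt;As an illustration of the read path only, here is a minimal sketch that assumes the
service keys are stored in a HashiCorp Vault KV v2 secret (the Vault URL, path, field names,
endpoint, and role ARN are assumptions) and fetched with the &lt;code&gt;hvac&lt;/code&gt; client:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3
import hvac   # pip install hvac

# 1. Fetch the current service-account keys from Vault (path and field names are assumptions)
vault = hvac.Client(url=&#39;https://vault.example.com&#39;, token=&#39;VAULT_TOKEN&#39;)
secret = vault.secrets.kv.v2.read_secret_version(path=&#39;ceph/backup-service&#39;)
keys = secret[&#39;data&#39;][&#39;data&#39;]   # e.g. {&#39;access_key&#39;: &#39;...&#39;, &#39;secret_key&#39;: &#39;...&#39;}

# 2. Use those keys only to call AssumeRole; after a rotation, the next fetch returns the new pair
sts = boto3.client(
    &#39;sts&#39;,
    endpoint_url=&#39;https://s3.example.com&#39;,   # placeholder RGW/STS endpoint
    region_name=&#39;default&#39;,
    aws_access_key_id=keys[&#39;access_key&#39;],
    aws_secret_access_key=keys[&#39;secret_key&#39;],
)
creds = sts.assume_role(
    RoleArn=&#39;arn:aws:iam::123456:role/backup-reader&#39;,   # placeholder role ARN
    RoleSessionName=&#39;backup-job-vault&#39;,
)[&#39;Credentials&#39;]
&lt;/code&gt;&lt;/pre&gt;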
&lt;h2 id=&quot;migration-strategy%3A-from-static-to-temporary&quot;&gt;Migration Strategy: From Static to Temporary &lt;a class=&quot;link-anchor&quot; href=&quot;#migration-strategy%3A-from-static-to-temporary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You can&#39;t flip a switch and convert all applications overnight. The transition
requires methodical planning, careful testing, and phased rollout. Organizations
that rush this process end up with broken applications, emergency rollbacks, and
frustrated teams. The ones that succeed treat it as a deliberate migration project
with clear phases and success criteria.&lt;/p&gt;
&lt;p&gt;The challenge isn&#39;t technical; the STS implementation is straightforward once
you understand roles. The challenge is organizational: identifying where
static credentials exist, understanding what each application actually needs,
and coordinating updates across teams that may not even realize they&#39;re using
S3. This is why the first phase isn&#39;t about changing anything; it&#39;s about
understanding what you have.&lt;/p&gt;
&lt;h2 id=&quot;coming-in-the-next-post&quot;&gt;Coming in the Next Post &lt;a class=&quot;link-anchor&quot; href=&quot;#coming-in-the-next-post&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You now have STS working in your Ceph environment. Your applications use temporary
credentials that expire automatically, dramatically reducing the blast radius of
credential theft. The permanent credentials your applications hold can&#39;t access
S3 directly; they can only assume specific roles with limited permissions. Each
role follows least privilege. Every access is logged with full attribution.&lt;/p&gt;
&lt;p&gt;We kept the IAM explanation minimal in this post, just enough to implement STS.
In the next post, we&#39;ll dive into IAM architecture and access control patterns.
We&#39;ll cover the new IAM Accounts model introduced in Ceph Squid, how it creates
proper multi-tenancy, and why the distinction between root account and IAM users
matters for security. We&#39;ll explore advanced least privilege patterns, trust policy
design for cross-account access, and how to test policies before deployment. We&#39;ll
also examine organizational mandates, such as blocking ACLs entirely and using the
new S3Control API for account-level governance.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>RocksDB Compression in Ceph: Space Savings with No Performance Cost</title>
    <link href="https://ceph.io/en/news/blog/2025/rocksdb-compression-ftw/" />
    <updated>2025-12-17T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rocksdb-compression-ftw/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rados" />
      <category term="rocksdb" />
      <category term="osd" />
      <category term="mon" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rocksdb-compression-ftw/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the world of data storage, engineers and architects constantly face a
fundamental dilemma: the trade-off between performance and efficiency.
It’s a balancing act. When you want to save space, you typically enable
features like compression, but the common assumption is that this will
cost you performance, a CPU cycle tax that slows throughput.&lt;/p&gt;
&lt;p&gt;But what if you could significantly reduce your metadata storage footprint
without slowing things down?&lt;/p&gt;
&lt;p&gt;The search for an answer to this question started with research work
from Mark Nelson, who published
a &lt;a href=&quot;https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive&quot;&gt;blog post&lt;/a&gt; on &lt;a href=&quot;http://ceph.io&quot;&gt;ceph.io&lt;/a&gt;
that covers RocksDB tuning in depth, exploring RocksDB compression with
positive results. These promising results sparked a conversation on the
upstream GitHub about enabling compression by default; a link to the PR
is available &lt;a href=&quot;https://github.com/ceph/ceph/pull/53343&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To build on the previous investigation, the Ceph performance team ran tests on a
robust hardware configuration running IBM Storage Ceph 7.1 (Reef). The cluster
used the BlueStore OSDs for an erasure-coded (EC 4+2) pool, with a hybrid OSD
storage setup: HDDs for object data and fast NVMe drives for the BlueStore WAL+DB.&lt;/p&gt;
&lt;p&gt;To understand the test, it&#39;s helpful to know what the WAL+DB is. In modern Ceph,
the BlueStore storage engine manages all data on the OSDs (physical devices).
To do this, it must maintain a vast catalog of internal metadata: think of it
as a high-speed index that quickly locates every piece of data.&lt;/p&gt;
&lt;p&gt;RocksDB, a high-performance key-value database, manages this critical index. In
our hybrid cluster, the RocksDB database runs on the fast NVMe devices, while
the actual object data resides on the slower HDDs.&lt;/p&gt;
&lt;p&gt;Because this metadata can grow very large, RocksDB&#39;s efficiency, how much space
it consumes on those expensive NVMe drives, is a critical factor in the cluster&#39;s
overall cost and performance. Our test, therefore, focuses on a simple,
high-stakes question:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Can we compress this metadata to save space &lt;em&gt;without&lt;/em&gt; paying a performance penalty?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;executive-overview&quot;&gt;Executive Overview &lt;a class=&quot;link-anchor&quot; href=&quot;#executive-overview&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The results were not just positive; they were counterintuitive, revealing a powerful
opportunity for optimization that comes with virtually no downside.&lt;/p&gt;
&lt;p&gt;The results confirm that using RocksDB compression has no detrimental effect on
either throughput or resource consumption in Ceph, while providing significant
savings in DB space (compression ratio), especially for smaller objects. As a
result of the tests, RocksDB compression is now enabled by default beginning with
the Squid release.&lt;/p&gt;
&lt;h2 id=&quot;test-environment-and-details&quot;&gt;Test Environment and Details &lt;a class=&quot;link-anchor&quot; href=&quot;#test-environment-and-details&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;All tests were run against the Ceph Gateway (RGW) to simulate a typical
Object Storage workload.&lt;/p&gt;
&lt;p&gt;Two different sets of object sizes were used in testing. Each workload leveraged
five clients and a range of fixed sizes (one object size per bucket, repeated
across the total bucket count), as listed below.&lt;/p&gt;
&lt;h3 id=&quot;testing-configuration&quot;&gt;Testing Configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#testing-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Smaller&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 KiB, 4 KiB, 8 KiB, 64 KiB, 256 KiB&lt;/li&gt;
&lt;li&gt;100K objects, 300 buckets, five clients (150M total objects)&lt;/li&gt;
&lt;li&gt;Fill Workload (~8%) - 3hr&lt;/li&gt;
&lt;li&gt;Hybrid workload (45% reads, 35% writes, 15% stats, 5% deletes)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Larger&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 MiB, 4 MiB, 8 MiB, 64 MiB, 256 MiB&lt;/li&gt;
&lt;li&gt;300 objects, 100 buckets, five clients (150K total objects)&lt;/li&gt;
&lt;li&gt;Fill Workload (~7%) - 40m&lt;/li&gt;
&lt;li&gt;Hybrid workload (45% reads, 35% writes, 15% stats, 5% deletes)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;hardware-used&quot;&gt;Hardware Used &lt;a class=&quot;link-anchor&quot; href=&quot;#hardware-used&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Two identical clusters, each with&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;3x Monitor / Manager nodes&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dell R630&lt;/li&gt;
&lt;li&gt;2x E5-2683 v3 (28 total cores, 56 threads)&lt;/li&gt;
&lt;li&gt;128 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;8x OSD / RGW nodes&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Supermicro 6048R&lt;/li&gt;
&lt;li&gt;2x Intel E5-2660 v4 (28 total cores, 56 threads)&lt;/li&gt;
&lt;li&gt;256 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;192x OSDs (BlueStore): 24x 2TB HDDs and 2x 800GB NVMe SSDs for WAL/DB per node&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pool: &lt;code&gt;site{1,2}.rgw.buckets.data&lt;/code&gt; EC 4+2, &lt;code&gt;pg_num=4096&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-%22free-lunch%22-is-real%3A-significant-space-savings-at-zero-performance-cost&quot;&gt;The &amp;quot;Free Lunch&amp;quot; is Real: Significant Space Savings at Zero Performance Cost &lt;a class=&quot;link-anchor&quot; href=&quot;#the-%22free-lunch%22-is-real%3A-significant-space-savings-at-zero-performance-cost&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The primary and most surprising finding from our tests is that enabling RocksDB
compression had no negative impact on performance. The specific algorithm used
was LZ4, a lightweight solution known for its high speed. Our analysis suggests
that modern CPUs are so efficient at processing algorithms like LZ4 that the
overhead is negligible, particularly when compression operations occur on the
high-speed NVMe devices where the RocksDB database resides.&lt;/p&gt;
&lt;p&gt;Across a variety of hybrid workloads (45% reads, 35% writes, 15% stats,
and 5% deletes), we observed no detrimental effect on throughput or CPU resource
consumption compared to running the same workloads without compression. This
effectively eliminates the traditional trade-off.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 1. CPU Consumption for Small Objects&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/7a80e345-6972-4b18-9c05-860c5a64ab05.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 2. CPU Consumption for Large Objects&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/d26761a5-1bd4-4115-8ac8-5c5976046026.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 3. Throughput for Small Object Writes&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/a588ee07-7f0c-4903-a8ab-3b11f0f543d8.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 4. Throughput for Small Object Reads&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/3c36de86-7470-4204-a50e-f644e166567d.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The results confirm that using RocksDB compression has no detrimental
effect on either throughput or resource consumption in Ceph, while
providing significant savings in DB space (compression ratio),
especially for smaller objects. This allows a smaller WAL+DB offload
partition for each OSD, or conversely helps avoid spillover of
RocksDB level data onto the BlueStore &lt;em&gt;slow&lt;/em&gt; device.&lt;/p&gt;
&lt;h2 id=&quot;small-objects%2C-massive-gains%3A-a-game-changer-for-object-storage-workloads.&quot;&gt;Small Objects, Massive Gains: A Game-Changer for Object Storage Workloads. &lt;a class=&quot;link-anchor&quot; href=&quot;#small-objects%2C-massive-gains%3A-a-game-changer-for-object-storage-workloads.&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While compression proved beneficial across the board, its impact was most
dramatic on small-object workloads. Our tests, which used object sizes
ranging from 1 KiB to 256 KiB, showed a remarkable reduction in the
storage required for RocksDB metadata. In a BlueStore configuration,
Ceph&#39;s internal metadata is managed by a RocksDB database running on
top of the BlueFS file system on a fast storage device, in
our case, an NVMe SSD.&lt;/p&gt;
&lt;p&gt;The single most impactful data point we recorded was this: with compression
enabled, &lt;code&gt;bluefs db_used_bytes&lt;/code&gt; for the small-object workload was 2.68
times lower during the cluster fill. This is a massive efficiency gain.
For any organization whose workload involves storing millions or even billions
of tiny objects, the metadata overhead can become a significant storage burden.
This feature directly and powerfully addresses that specific pain point by
compressing the metadata database on the fast offload device, not object data
on HDDs.&lt;/p&gt;
&lt;p&gt;This is particularly critical for Object Storage (RGW) workloads. When using
the Ceph Object Gateway (RGW), all rich metadata associated with an object,
such as its name, size, ACLs, and custom user tags, is stored in RocksDB instances
spread across the OSDs that comprise the index pool.
Furthermore, the bucket index, which lists all objects within a bucket, is
maintained as omap entries in this same database.&lt;/p&gt;
&lt;p&gt;For clusters with millions or billions of small objects, this metadata and
index data can swell to consume terabytes of space, often becoming the primary
capacity bottleneck on the expensive, high-speed NVMe drives. Compressing
RocksDB directly compresses this RGW metadata, providing massive and immediate
relief on that fast tier.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/1953bdb1-9871-44db-90b6-d1ce81fde788.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;it&#39;s-not-just-for-small-objects%3A-large-objects-also-see-a-clear-benefit&quot;&gt;It&#39;s Not Just for Small Objects: Large Objects Also See a Clear Benefit &lt;a class=&quot;link-anchor&quot; href=&quot;#it&#39;s-not-just-for-small-objects%3A-large-objects-also-see-a-clear-benefit&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The positive effects were not limited to small objects. Our tests on large-object
workloads, ranging from 1 MiB to 256 MiB, also showed clear benefits. While the
source report highlights the most dramatic space savings for small objects, it
explicitly notes that the positive effect across both sets of object sizes is
evident, making compression a clear win for large-object workloads as well.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/94f71417-270f-4009-8e87-37c762dcf998.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Furthermore, our test plan included stressful OSD failure scenarios to measure
behavior under duress. The overall conclusion of &amp;quot;no detrimental effect&amp;quot; on
performance or resource consumption held even during these fault and recovery
operations. This implies that RocksDB compression is not just efficient but
also a stable and robust feature under pressure.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 5. Throughput for Small Object Reads During Failure&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/23fe5bea-9c23-440a-9e84-b9bddd035c9e.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Graph 6. Throughput for Small Object Writes During Failure&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/34f1ad7f-cb59-4e47-8be8-e1e0659e9d5b.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-a-feature-that-should-be-enabled-by-default&quot;&gt;Conclusion: A Feature That Should Be Enabled By Default &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-a-feature-that-should-be-enabled-by-default&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Based on this comprehensive testing, RocksDB compression in a Ceph environment
is a low-risk, high-reward feature. It breaks the old rule that says efficiency
must come at the expense of performance. The evidence points to a clear win:
substantial storage savings on the metadata layer, with no measurable
trade-off in throughput or CPU usage.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/1de07550-5ebe-49cd-9e74-c58aaed0dea4.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This led to a simple conclusion: given the potential for substantial space
savings with no performance downside, the decision was to enable RocksDB LZ4
compression by default in the Squid release.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# ceph version
ceph version 19.2.1-222.el9cp (f2cd71cc2f7b46709c2351134ac89ea3e9f609b6) squid (stable)

# ceph config get osd bluestore_rocksdb_options
compression=kLZ4Compression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0
&lt;/code&gt;&lt;/pre&gt;
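&lt;p&gt;If you are running an older release where LZ4 is not yet the default and want to
experiment, a minimal sketch of one possible approach (assuming
the &lt;code&gt;bluestore_rocksdb_options_annex&lt;/code&gt; option, which appends to the default option
string rather than replacing it) is shown below. Note that the option is read at OSD
startup, and existing SST files are only compressed as compaction rewrites them; verify
the procedure against the documentation for your release before applying it broadly.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Sketch only: append LZ4 compression to the default RocksDB options for all OSDs
$ ceph config set osd bluestore_rocksdb_options_annex &amp;quot;compression=kLZ4Compression&amp;quot;

# Restart OSDs with your usual tooling so the new options take effect, then optionally
# trigger a compaction so existing metadata is rewritten (and compressed) sooner
$ ceph tell osd.&#92;* compact
&lt;/code&gt;&lt;/pre&gt;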
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph RGW Rate Limiting</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-rate-limiting/" />
    <updated>2025-12-16T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-rate-limiting/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-rate-limiting/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Tentacle release introduces significant enhancements to Object Gateway (RGW)
rate limiting, addressing a critical gap that has long challenged administrators
managing multi-tenant object storage environments. With the addition of rate
limiting for &lt;code&gt;LIST&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; operations, along with improved STS integration,
administrators now have more granular control over resource consumption across
their storage infrastructure.&lt;/p&gt;
&lt;h2 id=&quot;understanding-rate-limiting-in-the-ceph-object-gateway-(rgw)&quot;&gt;Understanding Rate Limiting in the Ceph Object Gateway (RGW) &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-rate-limiting-in-the-ceph-object-gateway-(rgw)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Rate limiting in the Ceph Object Gateway has been a powerful tool for controlling
resource consumption and preventing individual users or applications from
monopolizing cluster resources. Before this enhancement, RGW supported rate
limiting for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read operations (&lt;code&gt;max-read-ops&lt;/code&gt;): Controlling GET request frequency&lt;/li&gt;
&lt;li&gt;Write operations (&lt;code&gt;max-write-ops&lt;/code&gt;): Limiting PUT request rates&lt;/li&gt;
&lt;li&gt;Read bandwidth (&lt;code&gt;max-read-bytes&lt;/code&gt;): Throttling data egress&lt;/li&gt;
&lt;li&gt;Write bandwidth (&lt;code&gt;max-write-bytes&lt;/code&gt;): Controlling data ingress&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These limits operate within configurable time windows (controlled by
the &lt;code&gt;rgw_ratelimit_interval&lt;/code&gt; option), traditionally defaulting to 60
seconds. The system uses a token bucket algorithm to track resource
consumption, and when limits are exceeded, RGW returns &lt;code&gt;HTTP 503&lt;/code&gt;
responses to throttle clients.&lt;/p&gt;
&lt;h3 id=&quot;the-scope-of-rate-limiting&quot;&gt;The Scope of Rate Limiting &lt;a class=&quot;link-anchor&quot; href=&quot;#the-scope-of-rate-limiting&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Rate limits can be applied at multiple scopes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User scope: Limits apply to a specific user across all buckets&lt;/li&gt;
&lt;li&gt;Bucket scope: Limits apply to operations in a particular bucket&lt;/li&gt;
&lt;li&gt;Global scope: Limits apply cluster-wide across all users and buckets&lt;/li&gt;
&lt;li&gt;Anonymous scope: Limits for unauthenticated requests&lt;/li&gt;
&lt;/ul&gt;
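&lt;p&gt;Per-user and per-bucket limits are shown in the examples later in this post; for the
global and anonymous scopes, the existing &lt;code&gt;radosgw-admin global ratelimit&lt;/code&gt; commands
apply. A brief sketch (the values are placeholders; check &lt;code&gt;radosgw-admin help&lt;/code&gt; on
your release for the exact flags):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Show the current global rate limit configuration
$ radosgw-admin global ratelimit get

# Set and enable a cluster-wide default limit for all buckets
$ radosgw-admin global ratelimit set --ratelimit-scope=bucket --max-read-ops=1024
$ radosgw-admin global ratelimit enable --ratelimit-scope=bucket

# An anonymous scope is also supported for unauthenticated requests
$ radosgw-admin global ratelimit set --ratelimit-scope=anonymous --max-read-ops=64
$ radosgw-admin global ratelimit enable --ratelimit-scope=anonymous
&lt;/code&gt;&lt;/pre&gt;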
&lt;h3 id=&quot;important-architectural-considerations&quot;&gt;Important Architectural Considerations &lt;a class=&quot;link-anchor&quot; href=&quot;#important-architectural-considerations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Ceph Object Gateway&#39;s rate-limiting feature is not a complete QoS system. Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Per-RGW enforcement: Limits are enforced per RGW instance, not cluster-wide.
With 2 RGWs and a desired 10 ops/minute limit, configure each RGW for 5 ops/minute.
If the client request load isn&#39;t evenly distributed across the endpoints, the effective
aggregate limit may be lower than expected.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Limit intersection: Both user-level AND bucket-level limits must be satisfied.
Requests are rejected if either limit is exceeded.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No traffic shaping: Throttled requests are immediately rejected (503) rather than queued.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No mid-request throttling: Bandwidth is counted after a request completes, not
during. Users who exceed limits go into &amp;quot;debt&amp;quot; (max: 2x the limit) and are
blocked from new requests until the next interval(s) repay the debt.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-problem%3A-missing-control-for-list-and-delete-operations&quot;&gt;The Problem: Missing Control for List and Delete Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#the-problem%3A-missing-control-for-list-and-delete-operations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While read and write operation limits provided good coverage for data transfer
operations, two critical operation types remained uncontrolled:&lt;/p&gt;
&lt;h3 id=&quot;list-operations&quot;&gt;List Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#list-operations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Bucket listing operations, particularly against buckets with millions of objects,
can place a significant load on the cluster. These operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scan bucket indexes extensively&lt;/li&gt;
&lt;li&gt;Consume RADOS read IOPS on index pools&lt;/li&gt;
&lt;li&gt;Can impact overall cluster performance when executed at high frequency&lt;/li&gt;
&lt;li&gt;Are costly when using prefixes and delimiters that require filtering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Previous limitation: &lt;code&gt;LIST&lt;/code&gt; operations (which use &lt;code&gt;GET&lt;/code&gt;/&lt;code&gt;HEAD&lt;/code&gt; HTTP methods) were counted
as read operations under the &lt;code&gt;max-read-ops&lt;/code&gt; limit, making it impossible to control
listing separately from regular &lt;code&gt;GET&lt;/code&gt; operations. This meant administrators couldn&#39;t
prevent list-heavy workloads from consuming the entire read operation budget while
still allowing standard data retrieval.&lt;/p&gt;
&lt;p&gt;Consider a workload performing checkpoint validation by repeatedly listing with prefixes like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;$ aws s3api list-objects-v2 --bucket data --prefix checkpoint-flag --max-items 1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Even though this returns minimal data, each request triggers index scanning
operations that consume cluster resources.&lt;/p&gt;
&lt;p&gt;As an example, Apache Iceberg tables in data lakehouse environments have been
particularly challenging: Iceberg&#39;s &lt;code&gt;deleteOrphanFiles&lt;/code&gt; maintenance procedure,
which cleans up unreferenced data files, requires complete table listings that can
overwhelm object storage systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/f06a38ed-1eb5-4457-a13e-45a0eba48684.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;delete-operations&quot;&gt;Delete Operations &lt;a class=&quot;link-anchor&quot; href=&quot;#delete-operations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Single-object and multi-object delete operations were also uncontrolled, creating challenges for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Preventing abuse during bulk deletion scenarios&lt;/li&gt;
&lt;li&gt;Managing garbage collection workload&lt;/li&gt;
&lt;li&gt;Controlling the rate at which storage capacity is reclaimed&lt;/li&gt;
&lt;li&gt;Protecting against accidental or malicious mass deletion events&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Previous limitation: DELETE operations were classified as write
operations (non-GET/HEAD HTTP methods) and counted against &lt;code&gt;max-write-ops&lt;/code&gt;,
making it impossible to limit deletion rates from PUT operations separately.
Workloads that combined uploads and deletions had to balance their write-ops
budget across both operation types.&lt;/p&gt;
&lt;p&gt;Without dedicated controls for these operations, administrators had limited
options for managing workloads that mixed listing, reading, writing, and
deleting operations in different proportions.&lt;/p&gt;
&lt;h2 id=&quot;the-solution%3A-enhanced-rate-limiting-in-tentacle&quot;&gt;The Solution: Enhanced Rate Limiting in Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#the-solution%3A-enhanced-rate-limiting-in-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;images/a0ea6524-e006-496b-b2bf-855834372d56.jpeg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Tentacle introduces two new rate-limiting parameters that address these gaps.&lt;/p&gt;
&lt;h3 id=&quot;new-configuration-options&quot;&gt;New Configuration Options &lt;a class=&quot;link-anchor&quot; href=&quot;#new-configuration-options&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Max-list-ops: Specifies the maximum number of bucket listing requests per
accumulation interval. A value of 0 (default) disables this limit, maintaining
backward compatibility.&lt;/li&gt;
&lt;li&gt;Max-delete-ops: Specifies the maximum number of delete operations per accumulation
interval. A value of 0 (default) disables this limit.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;critical%3A-backward-compatibility-behavior&quot;&gt;Critical: Backward Compatibility Behavior &lt;a class=&quot;link-anchor&quot; href=&quot;#critical%3A-backward-compatibility-behavior&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Important: The new limits work &lt;em&gt;in conjunction with&lt;/em&gt; existing read/write operation limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;LIST&lt;/code&gt; operations: Count against both &lt;code&gt;max-read-ops&lt;/code&gt; AND &lt;code&gt;max-list-ops&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DELETE&lt;/code&gt; operations: Count against both &lt;code&gt;max-write-ops&lt;/code&gt; AND &lt;code&gt;max-delete-ops&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both limits must be satisfied for a request to proceed. Administrators upgrading
from earlier versions will see no behavior change unless they explicitly configure
the new parameters.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/36e1a637-6316-40de-90d6-df76a8fcb97f.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;configurable-time-windows&quot;&gt;Configurable Time Windows &lt;a class=&quot;link-anchor&quot; href=&quot;#configurable-time-windows&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;rgw_ratelimit_interval&lt;/code&gt; configuration option allows administrators to adjust
the interval for rate limit accumulation. This is particularly important for
workloads that exhibit bursty behavior:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config set client.rgw.rgw.1 rgw_ratelimit_interval 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The default 60-second interval may not be optimal for all workloads. Bursty
workloads, such as Apache Iceberg&#39;s metadata maintenance operations (snapshot
expiration, orphan file cleanup), can exhaust their LIST operation budget in
the first few seconds of a time window. Since Iceberg&#39;s &lt;code&gt;deleteOrphanFiles&lt;/code&gt;
procedure performs complete table listings across potentially thousands of
partitions in rapid succession, the accumulated operations can quickly exceed
the rate limit, resulting in extended throttling periods during which subsequent
maintenance tasks are blocked. Shorter intervals (1-10 seconds) can provide more
consistent behavior by allowing the operation budget to replenish more frequently,
preventing long stalls in critical table maintenance workflows.&lt;/p&gt;
&lt;h3 id=&quot;sts-integration&quot;&gt;STS Integration &lt;a class=&quot;link-anchor&quot; href=&quot;#sts-integration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A new enhancement to the STS/IAM feature ensures that rate limits now apply
correctly when users authenticate with temporary credentials obtained via the
Security Token Service (STS):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User rate limits configured on an account continue to be enforced when
that user assumes an IAM role and operates with temporary credentials.&lt;/li&gt;
&lt;li&gt;Bucket rate limits are enforced adequately for operations performed using
STS credentials, regardless of how the user authenticated.&lt;/li&gt;
&lt;li&gt;Global rate limits now work seamlessly with federated authentication flows,
such as AssumeRoleWithWebIdentity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This closes a previous gap where rate limiting enforcement may not have worked
correctly with STS sessions, ensuring consistent rate limit policies across all
authentication methods.&lt;/p&gt;
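&lt;p&gt;As a rough illustration of what that means in practice, here is a hedged sketch of an
STS flow against RGW; the role ARN, endpoint, and session name are placeholders. Requests
made with the temporary credentials are expected to count against the same user and bucket
limits as requests made with the user&#39;s permanent keys:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Obtain temporary credentials by assuming a role (placeholder ARN)
$ aws --endpoint-url http://rgw.example.com sts assume-role &#92;
    --role-arn &amp;quot;arn:aws:iam:::role/S3Access&amp;quot; &#92;
    --role-session-name ratelimit-test

# Export the returned AccessKeyId, SecretAccessKey and SessionToken, then issue requests;
# the configured user/bucket rate limits still apply to these calls
$ aws --endpoint-url http://rgw.example.com s3api list-objects-v2 --bucket test-bucket
&lt;/code&gt;&lt;/pre&gt;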
&lt;h2 id=&quot;rate-limiting-configuration-examples&quot;&gt;Rate Limiting Configuration Examples &lt;a class=&quot;link-anchor&quot; href=&quot;#rate-limiting-configuration-examples&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;example-1%3A-configuring-list-operation-rate-limits&quot;&gt;Example 1: Configuring LIST Operation Rate Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#example-1%3A-configuring-list-operation-rate-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Set up a user with list operation limits to control the frequency of bucket listings.&lt;/p&gt;
&lt;p&gt;Create a test user:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin user create --uid=testuser --display-name=&amp;quot;Test User&amp;quot;
{
    &amp;quot;user_id&amp;quot;: &amp;quot;testuser&amp;quot;,
    &amp;quot;display_name&amp;quot;: &amp;quot;Test User&amp;quot;,
    &amp;quot;email&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;suspended&amp;quot;: 0,
    &amp;quot;max_buckets&amp;quot;: 1000,
    &amp;quot;subusers&amp;quot;: [],
    &amp;quot;keys&amp;quot;: [
        {
            &amp;quot;user&amp;quot;: &amp;quot;testuser&amp;quot;,
            &amp;quot;access_key&amp;quot;: &amp;quot;TESTUSER_ACCESS_KEY&amp;quot;,
            &amp;quot;secret_key&amp;quot;: &amp;quot;TESTUSER_SECRET_KEY&amp;quot;
        }
    ],
    &amp;quot;caps&amp;quot;: [],
    &amp;quot;op_mask&amp;quot;: &amp;quot;read, write, delete&amp;quot;,
    &amp;quot;type&amp;quot;: &amp;quot;rgw&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set rate limits for list operations. We have two RGW services deployed in our
cluster, so if we want an effective limit of 10 list operations per interval, we need
to divide that limit by the number of RGWs and configure 5 per RGW:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit set --ratelimit-scope=user --uid=testuser &#92;
    --max-list-ops=5 &#92;
    --max-read-ops=100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Enable rate limiting for this user:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit enable --ratelimit-scope=user --uid=testuser
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit get --ratelimit-scope=user --uid=testuser
{
    &amp;quot;user_ratelimit&amp;quot;: {
        &amp;quot;max_read_ops&amp;quot;: 100,
        &amp;quot;max_write_ops&amp;quot;: 0,
        &amp;quot;max_list_ops&amp;quot;: 5,
        &amp;quot;max_delete_ops&amp;quot;: 0,
        &amp;quot;max_read_bytes&amp;quot;: 0,
        &amp;quot;max_write_bytes&amp;quot;: 0,
        &amp;quot;enabled&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;example-2%3A-configuring-delete-operation-rate-limits&quot;&gt;Example 2: Configuring DELETE Operation Rate Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#example-2%3A-configuring-delete-operation-rate-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Set up delete operation limits to control the rate of deletions.
Set rate limits for delete operations on the same user:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit set --ratelimit-scope=user --uid=testuser &#92;
    --max-delete-ops=10 &#92;
    --max-write-ops=100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the updated configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin ratelimit get --ratelimit-scope=user --uid=testuser
{
    &amp;quot;user_ratelimit&amp;quot;: {
        &amp;quot;max_read_ops&amp;quot;: 100,
        &amp;quot;max_write_ops&amp;quot;: 100,
        &amp;quot;max_list_ops&amp;quot;: 5,
        &amp;quot;max_delete_ops&amp;quot;: 10,
        &amp;quot;max_read_bytes&amp;quot;: 0,
        &amp;quot;max_write_bytes&amp;quot;: 0,
        &amp;quot;enabled&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;observing-rate-limiting-in-action&quot;&gt;Observing Rate Limiting in Action &lt;a class=&quot;link-anchor&quot; href=&quot;#observing-rate-limiting-in-action&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Let&#39;s see what happens when a user exceeds their configured limits.&lt;/p&gt;
&lt;h3 id=&quot;test-scenario%3A-exceeding-list-operation-limits&quot;&gt;Test Scenario: Exceeding List Operation Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#test-scenario%3A-exceeding-list-operation-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With the configuration from Example 1 (5 list ops per RGW, 10 list
ops total per minute), configure AWS CLI with test credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws configure set aws_access_key_id TESTUSER_ACCESS_KEY
$ aws configure set aws_secret_access_key TESTUSER_SECRET_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a test bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint-url http://rgw.example.com s3 mb s3://test-bucket
make_bucket: test-bucket
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Populate the bucket with test objects:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ for i in {1..100}; do
    echo &amp;quot;Test object $i&amp;quot; | aws --endpoint-url http://rgw.example.com s3 cp - s3://test-bucket/object-$i
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rapidly execute list operations to exceed the limit. I will use a script that
uses &lt;code&gt;curl&lt;/code&gt; to list the contents of the bucket &lt;code&gt;test-bucket&lt;/code&gt; repeatedly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;bash script.sh
Testing Rate Limit with list-objects-v2...
------------------------------------------------
Attempt 1: ✅ SUCCESS (200)
Attempt 2: ✅ SUCCESS (200)
Attempt 3: ✅ SUCCESS (200)
...
Attempt 10: ✅ SUCCESS (200)
Attempt 11: 🛑 BLOCKED (503) - Limit Reached
Attempt 12: 🛑 BLOCKED (503) - Limit Reached
Attempt 13: 🛑 BLOCKED (503) - Limit Reached
&lt;/code&gt;&lt;/pre&gt;
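&lt;p&gt;The exact script isn&#39;t reproduced here, but a minimal sketch of an equivalent test
loop using the AWS CLI (rather than signed &lt;code&gt;curl&lt;/code&gt; requests) might look like the
following; the endpoint and bucket are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;#!/usr/bin/env bash
# Issue repeated ListObjectsV2 calls and report whether each attempt was throttled.
export AWS_MAX_ATTEMPTS=1   # avoid the CLI&#39;s automatic retries masking the 503 responses
ENDPOINT=http://rgw.example.com
BUCKET=test-bucket
for i in $(seq 1 13); do
  if aws --endpoint-url &amp;quot;$ENDPOINT&amp;quot; s3api list-objects-v2 &#92;
        --bucket &amp;quot;$BUCKET&amp;quot; --max-items 1 &amp;gt; /dev/null 2&amp;gt;&amp;amp;1; then
    echo &amp;quot;Attempt $i: SUCCESS (200)&amp;quot;
  else
    echo &amp;quot;Attempt $i: BLOCKED (likely 503 - limit reached)&amp;quot;
  fi
done
&lt;/code&gt;&lt;/pre&gt;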
&lt;h3 id=&quot;testing-delete-rate-limits&quot;&gt;Testing Delete Rate Limits &lt;a class=&quot;link-anchor&quot; href=&quot;#testing-delete-rate-limits&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Attempt to delete objects beyond the configured limit (10 per RGW, 20 deletes per minute in total) with the AWS CLI client:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ for i in {1..25}; do
    echo &amp;quot;Delete attempt $i&amp;quot;
    aws --endpoint-url http://rgw.example.com s3 rm s3://test-bucket/object-$i 2&amp;gt;&amp;amp;1 | grep -E &amp;quot;delete:|error&amp;quot;
done
Delete attempt 1
delete: s3://test-bucket/object-1
Delete attempt 2
delete: s3://test-bucket/object-2
Delete attempt 3
delete: s3://test-bucket/object-3
...
Delete attempt 19
delete: s3://test-bucket/object-19
Delete attempt 20
delete: s3://test-bucket/object-20
Delete attempt 21
delete failed: s3://limits-bucket/object-21 argument of type &#39;NoneType&#39; is not iterable
Delete attempt 22
delete failed: s3://limits-bucket/object-22 argument of type &#39;NoneType&#39; is not iterable
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;known-limitations-and-future-enhancements&quot;&gt;Known Limitations and Future Enhancements &lt;a class=&quot;link-anchor&quot; href=&quot;#known-limitations-and-future-enhancements&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;current-limitations&quot;&gt;Current Limitations &lt;a class=&quot;link-anchor&quot; href=&quot;#current-limitations&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Backward Compatibility Constraint&lt;/em&gt;: LIST operations still count against
max-read-ops, and DELETE operations count against max-write-ops. The
new &lt;code&gt;max-list-ops&lt;/code&gt; and &lt;code&gt;max-delete-ops&lt;/code&gt; limits provide additional
constraints but do not replace the legacy limits. Both limits must be
satisfied for a request to proceed. This design choice maintains backward
compatibility but means you cannot completely isolate LIST/DELETE operations
from general read/write operation budgets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multi-Object Delete&lt;/em&gt;: The S3 DeleteObjects API (bulk delete) is not
currently rate-limited but is tracked for future
enhancement: &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2393080&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;IAM Account Limitation&lt;/em&gt;: Rate limits on IAM accounts (as opposed to
users) do not currently work. This is tracked as an RFE for a future
release. &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2394369&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multipart Upload Accounting&lt;/em&gt;: During multipart uploads with limited
write ops, the &lt;code&gt;CreateMultipartUpload&lt;/code&gt;, &lt;code&gt;UploadPart&lt;/code&gt;,
and &lt;code&gt;CompleteMultipartUpload operations&lt;/code&gt; each count against the write-ops
limit. For large files split into many parts, this can quickly consume the
operation budget. &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2396664&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Improved Logging Output&lt;/em&gt;: Currently, when hitting a rate limit, we see only
the following opaque errors in the RGW log, which don’t specify which rate
limit we have reached. &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2396664&quot;&gt;RFE&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;2025-11-20T16:39:40.030+0000 7f9e6423a640  2 req 15365199512736087891 0.001000024s s3:delete_obj check rate limiting
2025-11-20T16:39:40.030+0000 7f9e6423a640 20 req 15365199512736087891 0.001000024s op-&amp;gt;ERRORHANDLER: err_no=-2218 new_err_no=-2218
2025-11-20T16:39:40.030+0000 7f9e6423a640  2 req 15365199512736087891 0.001000024s s3:delete_obj http status=503
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The addition of &lt;code&gt;LIST&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; operation rate limiting in Tentacle represents
a significant maturity improvement for the Object Gateway. Combined with the new
STS integration and configurable time intervals, administrators now have
comprehensive tools for managing multi-tenant object storage workloads.&lt;/p&gt;
&lt;p&gt;These enhancements are particularly valuable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Enterprises&lt;/em&gt; implementing department-level chargebacks and resource governance&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Cloud-native applications&lt;/em&gt; using federated identity with OIDC&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Data analytics platforms&lt;/em&gt; with mixed read-heavy and metadata-intensive operations&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While some limitations remain (particularly around multi-object delete and IAM
accounts), the current implementation provides production-ready capabilities
that have been extensively tested with workloads ranging from small-object writes
to multi-million object listings.&lt;/p&gt;
&lt;h2 id=&quot;get-involved&quot;&gt;Get Involved &lt;a class=&quot;link-anchor&quot; href=&quot;#get-involved&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We encourage you to test these new capabilities in your environment and share
your experiences with &lt;a href=&quot;https://docs.ceph.com/en/latest/start/get-involved&quot;&gt;the community&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
&lt;p&gt;Special thanks to the Ceph community and the IBM Storage Ceph QE team for their
extensive testing and validation of these features, covering functional, scale,
and regression scenarios with millions of objects and hundreds of gigabytes of
test data.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ending Support for some Erasure Code Plugins</title>
    <link href="https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/" />
    <updated>2025-12-16T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/</id>
    <author>
      <name>Jamie Pryde (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="erasure-encoding" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/ending-support-for-ec-plugins/">&lt;p&gt;A plan to end support for some erasure code plugins and techniques
in the Ceph V release.&lt;/p&gt;
&lt;h2 id=&quot;the-erasure-code-plugin-interface&quot;&gt;The Erasure Code Plugin Interface &lt;a class=&quot;link-anchor&quot; href=&quot;#the-erasure-code-plugin-interface&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph uses a plugin interface for erasure coded pools. These plugins are
external code libraries that are used to do the encoding and decoding of data.
Ceph passes chunks of data to the plugin. The plugin uses an encoding algorithm
to produce additional chunks called parity (or coding) chunks.&lt;/p&gt;
&lt;p&gt;When an erasure coded pool is created, an erasure code profile must be selected.
Among other things, the profile includes the plugin and the technique that will be
used for the pool. The technique defines the algorithm that the plugin will use for
encoding and decoding, and some plugins support multiple different techniques.
Because the parity chunks generated are different for each combination of plugin
and technique, there is no way to change the plugin and technique after the pool has been
created.&lt;/p&gt;
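&lt;p&gt;For reference, both choices are made when the profile and pool are created. A quick
sketch (the profile name, k/m values, and failure domain here are examples only):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a profile that pins the plugin and technique for any pool that uses it
$ ceph osd erasure-code-profile set myprofile &#92;
    plugin=isa technique=reed_sol_van k=4 m=2 crush-failure-domain=host

# Create an erasure coded pool using that profile; the plugin/technique cannot be changed later
$ ceph osd pool create ecpool erasure myprofile
&lt;/code&gt;&lt;/pre&gt;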
&lt;p&gt;Ceph currently supports five erasure code plugins, some of which support multiple
techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jerasure
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;li&gt;reed_sol_r6_op&lt;/li&gt;
&lt;li&gt;cauchy_orig&lt;/li&gt;
&lt;li&gt;cauchy_good&lt;/li&gt;
&lt;li&gt;liberation&lt;/li&gt;
&lt;li&gt;blaum_roth&lt;/li&gt;
&lt;li&gt;liber8tion&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ISA-L (Intel Intelligent Storage Acceleration Library)
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;li&gt;cauchy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;SHEC (Shingled Erasure Code)&lt;/li&gt;
&lt;li&gt;CLAY (Coupled Layer)&lt;/li&gt;
&lt;li&gt;LRC (Locally Repairable Erasure Code)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Why are there so many options?&lt;/p&gt;
&lt;p&gt;In the distant past, before CPUs supported SIMD instructions
(Single Instruction, Multiple Data) and could encode and decode lots of data in parallel,
Jerasure&#39;s XOR-optimized techniques (cauchy, liberation, liber8tion and blaum_roth) offered a
performance improvement when encoding and decoding data. Now, with SSE, AVX (and other) instructions,
the need for techniques that rely on XOR operations has been greatly reduced, and reed_sol_van is very close to,
or in some cases better than, the XOR-optimized techniques. See the comparison charts later
in this post for the data!&lt;/p&gt;
&lt;p&gt;SHEC and CLAY both focus on trying to improve the recovery efficiency (by optimizing network and disk usage
when decoding data) when an OSD or server fails and data must be rebuilt. Both of these plugins build
on top of Jerasure, with additional logic that aims to speed up recovery.&lt;/p&gt;
&lt;p&gt;LRC also builds on top of Jerasure and intends to improve recovery efficiency by using locally available
data (e.g. data in the same data centre or same rack) to minimise transfers between racks or sites.&lt;/p&gt;
&lt;h2 id=&quot;ending-support-for-plugins&quot;&gt;Ending Support For Plugins &lt;a class=&quot;link-anchor&quot; href=&quot;#ending-support-for-plugins&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the Ceph Tentacle release we introduced a new version of erasure coding
that has become known as Fast EC. Fast EC offers significant performance
improvements and some capacity savings when using erasure coding, particularly for block
and file workloads. There are even more improvements to Fast EC coming in future releases. See Lee Sanders&#39; blog post
&lt;a href=&quot;https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/&quot;&gt;https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/&lt;/a&gt; for more details about Fast EC.&lt;/p&gt;
&lt;p&gt;Fast EC changes the interface between Ceph and the erasure code plugins. In Tentacle, only ISA-L
(using reed_sol_van or cauchy) and Jerasure (using reed_sol_van) support Fast EC. The old EC code has been kept
as a separate code path in Ceph, and the other plugins (and other Jerasure techniques) continue to use old EC.&lt;/p&gt;
&lt;p&gt;Our proposal is that we should end support for the least used (and least useful) plugins
and techniques in the V release. Ceph clusters using these plugins and techniques will not be
able to upgrade to the V release unless data is first migrated to a pool that uses
a supported plugin and technique.&lt;/p&gt;
&lt;p&gt;Why not continue to support all of these plugins using the old EC code path?&lt;/p&gt;
&lt;p&gt;The Fast EC work exposed the amount of development effort required to continue to support
such a big list of plugins and techniques. Even though only the most important plugins and techniques
support Fast EC, code changes were still required to ensure that the other plugins continue working
correctly. We now have two separate erasure code paths to maintain. Along with extra development work, supporting
a big list of plugins also means lots more testing needs to be done to ensure nothing gets broken.&lt;/p&gt;
&lt;p&gt;We don&#39;t think this effort is justified given the small number of users using some plugins, and
the lack of benefits that these plugins and techniques provide according to performance benchmarks. Developer
focus would be better spent on improving other parts of Ceph.&lt;/p&gt;
&lt;p&gt;The proposed list of plugins and techniques that we will support in the V release are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jerasure
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ISA-L
&lt;ul&gt;
&lt;li&gt;reed_sol_van&lt;/li&gt;
&lt;li&gt;cauchy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;LRC (Although LRC doesn&#39;t currently support Fast EC and we wouldn&#39;t recommend using it yet, we think
we will be able to use LRC in future to improve support for erasure coded pools in stretched clusters.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;telemetry-data&quot;&gt;Telemetry Data &lt;a class=&quot;link-anchor&quot; href=&quot;#telemetry-data&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;So let&#39;s look at some data. How many people are actually using each plugin? Not every Ceph cluster has opted in
to upload usage data to Telemetry, but enough have to give us a good idea about the plugins and techniques that
people are using. Here is a recent snapshot of the clusters using erasure coded pools:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/ec_plugin_telemetry.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;performance-data&quot;&gt;Performance Data &lt;a class=&quot;link-anchor&quot; href=&quot;#performance-data&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;My talk at Cephalocon 2024 (&lt;a href=&quot;https://www.youtube.com/watch?v=aM8sJgDD-x4&quot;&gt;https://www.youtube.com/watch?v=aM8sJgDD-x4&lt;/a&gt;) discussed why we&#39;ve made ISA-L the
default plugin for new EC pools. The talk included performance data captured using Ceph&#39;s EC benchmarking
tool, and I&#39;ve included that here. These charts demonstrate how advancements in SIMD instructions have
brought the performance of the reed_sol_van technique to a point where reed_sol_van is almost as good as
or better than other techniques. Note that the ISA-L vandermonde and cauchy lines are overlapping in the encode
graph:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/encode_perf.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/decode_perf.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;As mentioned earlier, the goal of both SHEC and CLAY is to improve recovery efficiency when an OSD is down.
A recent blog post written by Jake Squelch uses the Ceph Benchmarking Tool (CBT) to compare performance of the
Jerasure and CLAY plugins. His results show that there is a trade-off when using CLAY.
Although CLAY can reduce network bandwidth usage during recovery by around 50%, there is a performance penalty
for client I/O during normal operation, and when an OSD is down and the cluster needs to recover data. Data is being
read in a very inefficient way, particularly when using the default stripe_unit value.
See &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&lt;/a&gt; for more detail.&lt;/p&gt;
&lt;h2 id=&quot;pool-migration&quot;&gt;Pool Migration &lt;a class=&quot;link-anchor&quot; href=&quot;#pool-migration&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;What if you&#39;re the owner of the single cluster using Jerasure&#39;s blaum_roth technique in the telemetry data?
As we end support for the above list of plugins and techniques, we will need a way to move
data from those pools into new pools that use supported plugins. In the Umbrella release
we plan to add such a pool migration feature. This new feature will provide a way to non-disruptively move data
from one pool to another. The migration will run as a background task, similar to backfill and recovery,
with no downtime where data is inaccessible. This will allow you to migrate all the objects from a pool that uses
an unsupported plugin to a new pool that uses a supported plugin and technique, and then upgrade the cluster to the
V release.
See &lt;a href=&quot;https://github.com/ceph/ceph/pull/65703&quot;&gt;https://github.com/ceph/ceph/pull/65703&lt;/a&gt; for the pool migration design document.&lt;/p&gt;
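&lt;p&gt;In the meantime, it is easy to check whether any of your pools would be affected.
A short sketch using existing commands (profile names will differ per cluster):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# List pools and note the erasure code profile each erasure coded pool uses
$ ceph osd pool ls detail | grep erasure

# List profiles, then inspect the plugin and technique configured in each one
$ ceph osd erasure-code-profile ls
$ ceph osd erasure-code-profile get default
&lt;/code&gt;&lt;/pre&gt;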
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As we&#39;ve developed Fast EC, it&#39;s become clear that continuing to support such a big list of plugins and techniques
is too much effort for the value that some of the plugins and techniques provide.&lt;/p&gt;
&lt;p&gt;In the Umbrella release we will deprecate the plugins and techniques not included in the supported list
mentioned above. In the V release we will end support for those plugins and techniques. You will not be
able to upgrade to the V release if your cluster has any pools that use those plugins and techniques.
You will be able to use the new pool migration feature in Umbrella to migrate data from a pool to a new pool
that uses one of the supported plugins and techniques.&lt;/p&gt;
&lt;p&gt;Reducing our list of supported plugins and techniques will allow us to focus our development efforts
and continue to improve Fast EC, without the risk of breaking lesser-used plugins.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Object Storage Deep Dive Series Part 3: Version and Object Lock</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-deep-dive-3/" />
    <updated>2025-12-11T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-deep-dive-3/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-deep-dive-3/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-1&quot;&gt;first&lt;/a&gt;
and &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;second&lt;/a&gt; parts of
this deep dive series, we dissected the core foundations of Ceph RGW: stateless
frontends, specialized RADOS pools, bucket index mechanics, and the head/tail
data layout. We explored how the Ceph Object Gateway (RGW) achieves massive
scalability through dynamic bucket sharding and how background processes, including
Garbage Collection and Lifecycle Management, automate data governance.&lt;/p&gt;
&lt;p&gt;We now turn to two critical features for enterprise data
protection: &lt;em&gt;S3 Object Versioning&lt;/em&gt; and &lt;em&gt;S3 Object Lock&lt;/em&gt;.
These features transform the Ceph Object Gateway (RGW) from a simple object store
into a robust data preservation platform capable of meeting regulatory compliance
requirements, protecting against accidental deletions, and supporting immutable
storage patterns.&lt;/p&gt;
&lt;p&gt;In this third deep dive, we will first explore the concepts behind versioning
and object lock from the S3 API perspective. Then, we&#39;ll peel back the layers
to reveal how RGW implements these features internally, focusing on a crucial
architectural component: the &lt;em&gt;Object Logical Head (OLH)&lt;/em&gt;. Understanding this
mechanism is key to understanding how RGW efficiently maintains version history
while preserving the performance characteristics we expect.&lt;/p&gt;
&lt;h2 id=&quot;s3-object-versioning%3A-concepts-and-rationale&quot;&gt;S3 Object Versioning: Concepts and Rationale &lt;a class=&quot;link-anchor&quot; href=&quot;#s3-object-versioning%3A-concepts-and-rationale&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Object versioning is a mechanism that allows users to preserve, retrieve, and
restore every version of every object stored in a bucket. When versioning is
enabled, each object modification (PUT) or deletion creates a new, immutable
record rather than overwriting or removing existing data.&lt;/p&gt;
&lt;h3 id=&quot;why-versioning-matters&quot;&gt;Why Versioning Matters &lt;a class=&quot;link-anchor&quot; href=&quot;#why-versioning-matters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Without versioning, object storage follows a &amp;quot;last write wins&amp;quot; model. Uploading
an object with the same key as an existing object silently replaces it. A DELETE
operation permanently removes the object. While simple, this model offers no
protection against:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Accidental overwrites&lt;/em&gt;: A user uploads a corrupted file over a critical dataset&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Accidental deletions&lt;/em&gt;: A script with a bug issues DELETE commands against production data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Malicious actions&lt;/em&gt;: A compromised credential is used to destroy data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Audit requirements&lt;/em&gt;: Regulations requiring historical record preservation&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Versioning addresses all of these concerns by maintaining a complete history of every object.
When combined with an RGW &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-multisite-replication_part7&quot;&gt;Archive Zone&lt;/a&gt;,
versioned objects enable all of the above while keeping production buckets lean and mean.&lt;/p&gt;
&lt;h3 id=&quot;versioning-states&quot;&gt;Versioning States &lt;a class=&quot;link-anchor&quot; href=&quot;#versioning-states&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Each bucket has one of three versioning states:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Unversioned&lt;/em&gt; (Default)&lt;/td&gt;
&lt;td&gt;Objects have a &lt;code&gt;null&lt;/code&gt; version ID. Overwrites replace data; deletes remove data permanently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Enabled&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Every PUT creates a new version with a unique version ID.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Suspended&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;New writes get &lt;code&gt;null&lt;/code&gt; version ID, but existing versions are preserved.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Once versioning is enabled on a bucket, it can never be fully disabled, only suspended.
This is a deliberate design choice to prevent accidental or malicious destruction of version history.&lt;/p&gt;
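&lt;p&gt;Checking and changing the state uses the standard S3 API, for example (bucket name is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Check the current versioning state of a bucket
$ aws s3api get-bucket-versioning --bucket my-bucket

# Suspend versioning: existing versions are preserved, new writes get a null version ID
$ aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Suspended
&lt;/code&gt;&lt;/pre&gt;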
&lt;h3 id=&quot;version-ids-and-the-current-version&quot;&gt;Version IDs and the Current Version &lt;a class=&quot;link-anchor&quot; href=&quot;#version-ids-and-the-current-version&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When versioning is enabled, every write to an object generates a unique,
system-assigned &lt;em&gt;Version ID&lt;/em&gt;. This ID is an opaque string that uniquely
identifies the object&#39;s version. When a client issues a GET request without
specifying a version ID, RGW returns the &lt;em&gt;current version&lt;/em&gt;: the most
recently written version of that object.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create bucket
$ aws s3api create-bucket --bucket my-bucket
# Enable versioning on the bucket
$ aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
# Upload creates a new version
$ aws s3api put-object --bucket my-bucket --key report.pdf --body report.pdf
{
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;ChecksumCRC64NVME&amp;quot;: &amp;quot;viq2x5cBzls=&amp;quot;,
    &amp;quot;ChecksumType&amp;quot;: &amp;quot;FULL_OBJECT&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;
}

# List all versions, currently there is only one
$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf | jq .Versions
[
  {
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;Size&amp;quot;: 1012,
    &amp;quot;StorageClass&amp;quot;: &amp;quot;STANDARD&amp;quot;,
    &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;,
    &amp;quot;IsLatest&amp;quot;: true,
    &amp;quot;LastModified&amp;quot;: &amp;quot;2025-12-05T11:37:52.802000+00:00&amp;quot;,
    &amp;quot;Owner&amp;quot;: {
      &amp;quot;DisplayName&amp;quot;: &amp;quot;user&amp;quot;,
      &amp;quot;ID&amp;quot;: &amp;quot;RGW42603947660038067&amp;quot;
    }
  }
]
# We do another PUT to the same Object/key
$ aws s3api put-object --bucket my-bucket --key report.pdf --body report.pdf
# We now have 2 versions of the same Object/Key
$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf | jq .Versions
[
  {
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;Size&amp;quot;: 1012,
    &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;QhSnbf7bYMGHMshc0S-fyF3.SPMjIju&amp;quot;,
    &amp;quot;IsLatest&amp;quot;: true,
    &amp;quot;LastModified&amp;quot;: &amp;quot;2025-12-05T11:39:56.974000+00:00&amp;quot;
  },
  {
    &amp;quot;ETag&amp;quot;: &amp;quot;&#92;&amp;quot;959f45520adcbe51b3d7b24e1379d3c0&#92;&amp;quot;&amp;quot;,
    &amp;quot;Size&amp;quot;: 1012,
    &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
    &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;,
    &amp;quot;IsLatest&amp;quot;: false,
    &amp;quot;LastModified&amp;quot;: &amp;quot;2025-12-05T11:37:52.802000+00:00&amp;quot;
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;delete-markers%3A-soft-deletes&quot;&gt;Delete Markers: Soft Deletes &lt;a class=&quot;link-anchor&quot; href=&quot;#delete-markers%3A-soft-deletes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When you delete an object in a versioned bucket (without specifying a version ID),
RGW does not remove any data. Instead, it creates a special zero-byte object called
a &lt;em&gt;Delete Marker&lt;/em&gt;. This marker becomes the current version, causing subsequent GET
requests to return a &lt;code&gt;404 Not Found&lt;/code&gt; error, even though all previous versions remain intact.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Delete creates a marker, not actual deletion
$ aws s3api delete-object --bucket my-bucket --key report.pdf
{
    &amp;quot;DeleteMarker&amp;quot;: true,
    &amp;quot;VersionId&amp;quot;: &amp;quot;77d9Np158AOrYrDod98ev7EhONah2G.&amp;quot;
}

# GET now returns 404 because the DeleteMarker&#39;s IsLatest is set to true
$ aws s3api get-object --bucket my-bucket --key report.pdf output.pdf
An error occurred (NoSuchKey) when calling the GetObject operation: Unknown

# But all versions still exist.
$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf
{
    &amp;quot;DeleteMarkers&amp;quot;: [
        {
            &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;VersionId&amp;quot;: &amp;quot;77d9Np158AOrYrDod98ev7EhONah2G.&amp;quot;,
            &amp;quot;IsLatest&amp;quot;: true
        }
    ],
    &amp;quot;Versions&amp;quot;: [
        {
            &amp;quot;Key&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;VersionId&amp;quot;: &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;,
            &amp;quot;IsLatest&amp;quot;: false,
            &amp;quot;Size&amp;quot;: 1012
        },
        ...
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;recovering-deleted-objects&quot;&gt;Recovering Deleted Objects &lt;a class=&quot;link-anchor&quot; href=&quot;#recovering-deleted-objects&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recovery is straightforward: either delete the Delete Marker or copy a specific version
back to the current position:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Method 1: Remove the Delete Marker
$ aws s3api delete-object --bucket my-bucket --key report.pdf &#92;
    --version-id &amp;quot;77d9Np158AOrYrDod98ev7EhONah2G.&amp;quot;

# Method 2: Copy a specific version to restore it as current
$ aws s3api copy-object &#92;
    --copy-source &amp;quot;my-bucket/report.pdf?versionId=5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot; &#92;
    --bucket my-bucket --key report.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;permanent-deletion&quot;&gt;Permanent Deletion &lt;a class=&quot;link-anchor&quot; href=&quot;#permanent-deletion&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To permanently remove data from a versioned bucket, you must explicitly delete each
version by specifying its version ID:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Permanent deletion of the object requires the version ID of each version to get deleted
$ aws s3api delete-object --bucket my-bucket --key report.pdf &#92;
    --version-id &amp;quot;5ch0kwnw2Nv1l5JctIrUFDY1zd55.va&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;common-misconception%3A-%22delete-markers%22&quot;&gt;Common Misconception: &amp;quot;Delete Markers&amp;quot; &lt;a class=&quot;link-anchor&quot; href=&quot;#common-misconception%3A-%22delete-markers%22&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Question: &amp;quot;If I delete all versions of an object, will the delete markers be
automatically removed by the garbage collection process?&amp;quot;&lt;/p&gt;
&lt;p&gt;No! Delete markers are permanent metadata that preserve deletion history.
They persist indefinitely unless explicitly removed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Lifecycle policy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;{
  &amp;quot;Rules&amp;quot;: [
    {
      &amp;quot;ID&amp;quot;: &amp;quot;remove-expired-delete-markers&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: {},
      &amp;quot;Expiration&amp;quot;: {
        &amp;quot;ExpiredObjectDeleteMarker&amp;quot;: true
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Manual deletion: &lt;code&gt;$ aws s3api delete-object --version-id &amp;lt;delete-marker-id&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Why this matters: With high-churn workloads (frequent PUT/DELETE cycles), delete
markers accumulate silently, causing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Bucket index bloat (millions of entries with no data)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Severe ListObjects performance degradation&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The fix: Configure lifecycle policies for versioned buckets to periodically
clean up expired delete markers.&lt;/p&gt;
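&lt;p&gt;For completeness, the rule shown above could be applied and verified along these lines
(a sketch; the bucket name and file path are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Apply the lifecycle rule (saved locally as lifecycle.json) to the versioned bucket
$ aws s3api put-bucket-lifecycle-configuration --bucket my-bucket &#92;
    --lifecycle-configuration file://lifecycle.json

# Confirm the rule is in place
$ aws s3api get-bucket-lifecycle-configuration --bucket my-bucket
&lt;/code&gt;&lt;/pre&gt;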
&lt;h3 id=&quot;critical-consideration%3A-every-version-is-a-full-copy&quot;&gt;Critical Consideration: Every Version is a Full Copy &lt;a class=&quot;link-anchor&quot; href=&quot;#critical-consideration%3A-every-version-is-a-full-copy&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A crucial detail that catches many users off guard: &lt;em&gt;each version is a
complete, independent copy of the object&lt;/em&gt;. Unlike filesystem snapshots or
incremental backups, S3 versioning does not store deltas or differences
between versions. When you upload a 1 GB file and then modify a single
byte, you now have two 1 GB objects stored in your cluster. Tiering, however, can
be employed to shift older revisions to more cost-effective storage.&lt;/p&gt;
&lt;p&gt;This design has significant implications for specific workloads:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Pattern&lt;/th&gt;
&lt;th&gt;Impact with Versioning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Large files with frequent minor updates&lt;/td&gt;
&lt;td&gt;Storage multiplies rapidly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log files with append operations&lt;/td&gt;
&lt;td&gt;Each append creates a complete copy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database dumps overwritten daily&lt;/td&gt;
&lt;td&gt;N days = N complete copies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration files updated often&lt;/td&gt;
&lt;td&gt;Manageable (small files)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Example: The Log Append Anti-Pattern&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Consider an application that appends log entries to an S3 object throughout the day:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Hour 1: Create 10 MB log file
$ aws s3 cp app.log s3://versioned-bucket/logs/app.log  # 10 MB stored

# Hour 2: Append 1 MB, re-upload 
$ aws s3 cp app.log s3://versioned-bucket/logs/app.log  # Now 11 MB + 10 MB = 21 MB total

# Hour 3: Append 1 MB, re-upload
$ aws s3 cp app.log s3://versioned-bucket/logs/app.log  # Now 12 MB + 11 MB + 10 MB = 33 MB total

# After 24 hourly appends...
# Actual log data: ~34 MB
# Storage consumed: ~528 MB (sum of all versions)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For workloads that involve frequent modifications to large objects, consider these alternatives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Use unique keys&lt;/em&gt;: Write &lt;code&gt;app-2025-01-15-10.log&lt;/code&gt;, &lt;code&gt;app-2025-01-15-11.log&lt;/code&gt; instead of overwriting&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Disable versioning selectively&lt;/em&gt;: Use separate buckets for append-heavy vs. versioning-critical data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Aggressive Lifecycle policies&lt;/em&gt;: Use &lt;code&gt;NoncurrentVersionExpiration&lt;/code&gt; with short retention periods&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Best Practice&lt;/em&gt;: Before enabling versioning on a bucket, analyze your workload
patterns. Versioning is ideal for objects that change infrequently but need
protection (documents, images, backups). It can be costly for objects that
change constantly (logs, metrics, temporary files).&lt;/p&gt;
&lt;h3 id=&quot;operational-consideration%3A-bucket-index-sharding-and-many-versions&quot;&gt;Operational Consideration: Bucket Index Sharding and Many Versions &lt;a class=&quot;link-anchor&quot; href=&quot;#operational-consideration%3A-bucket-index-sharding-and-many-versions&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Another consideration for versioned buckets concerns how RGW manages the bucket index.
As discussed in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;
of this series, RGW distributes bucket index entries across multiple shards to maintain
performance. However, versioning introduces a constraint: entries for all versions of a single object
must reside on the same bucket index shard.&lt;/p&gt;
&lt;p&gt;This design has several implications:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Uneven Shard Distribution&lt;/em&gt;: Even with hashed sharding, a single object with
thousands of versions can create &amp;quot;hot spots&amp;quot; where one shard holds significantly
more entries than others. This undermines the even distribution that sharding
is designed to provide.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Large omap Warnings&lt;/em&gt;: Each version of an object requires multiple index
entries: approximately 2 + 2N entries for an object with N
versions. Since all these entries must reside on the same shard, a single
heavily-versioned object can push a shard past the RADOS &amp;quot;large omap object&amp;quot; warning threshold:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Versions per Object&lt;/th&gt;
&lt;th&gt;Approximate Index Entries&lt;/th&gt;
&lt;th&gt;Threshold Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;~2,002&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;~100,002&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;~200,002&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Threshold exceeded&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When a shard exceeds 200,000 entries (the default &lt;code&gt;osd_deep_scrub_large_omap_object_key_threshold&lt;/code&gt;),
Ceph raises a &lt;code&gt;LARGE_OMAP_OBJECTS&lt;/code&gt; health warning.&lt;/p&gt;
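&lt;p&gt;If you suspect a heavily versioned object is behind such a warning, two
read-only checks narrow it down. This is a rough sketch: the bucket name is
illustrative, and the exact field names in the stats output can vary between releases.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Cluster-wide: has the warning fired, and on which pool/PG?
$ ceph health detail | grep -A 3 LARGE_OMAP_OBJECTS

# Per-bucket: object count vs. shard count gives a feel for entries per shard
$ radosgw-admin bucket stats --bucket versioned-bucket &#92;
    | jq &#39;{objects: .usage.&amp;quot;rgw.main&amp;quot;.num_objects, shards: .num_shards}&#39;
&lt;/code&gt;&lt;/pre&gt;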
&lt;p&gt;&lt;em&gt;Future Improvements&lt;/em&gt;: The RGW development team is actively working on
enhancements to ordered bucket indexes that will allow version entries
for a single object to span multiple index shards. This architectural
change will  effectively eliminate the current practical limit on the number of
versions per object (currently constrained by omap size limits to roughly 100,000
in the worst case). This work is part of the broader ordered bucket index initiative
discussed in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;
of our blog series.&lt;/p&gt;
&lt;h2 id=&quot;s3-object-lock%3A-immutable-storage&quot;&gt;S3 Object Lock: Immutable Storage &lt;a class=&quot;link-anchor&quot; href=&quot;#s3-object-lock%3A-immutable-storage&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While versioning protects against accidental changes, it doesn&#39;t prevent a
privileged user from deliberately deleting all versions. &lt;em&gt;S3 Object Lock&lt;/em&gt;
provides an additional layer of protection by implementing &lt;em&gt;Write-Once-Read-Many
(WORM)&lt;/em&gt; semantics. Once an object is locked, it cannot be deleted or overwritten
through the S3 endpoint, not even by an RGW admin account, until the lock expires.&lt;/p&gt;
&lt;h3 id=&quot;object-lock-prerequisites&quot;&gt;Object Lock Prerequisites &lt;a class=&quot;link-anchor&quot; href=&quot;#object-lock-prerequisites&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Object Lock has a critical prerequisite: &lt;em&gt;versioning must be enabled&lt;/em&gt;. This
tight coupling exists because Object Lock protects specific &lt;em&gt;object versions&lt;/em&gt;
rather than just object keys.&lt;/p&gt;
&lt;h3 id=&quot;historical-limitation-(pre-tentacle)&quot;&gt;Historical Limitation (Pre-Tentacle) &lt;a class=&quot;link-anchor&quot; href=&quot;#historical-limitation-(pre-tentacle)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before Ceph Tentacle, Object Lock could &lt;em&gt;only&lt;/em&gt;
be enabled at bucket creation time. This was a significant operational constraint:
if you created a bucket without Object Lock and later needed WORM protection, your
only option was to create a new bucket and migrate all data.&lt;/p&gt;
&lt;h3 id=&quot;new-in-ceph-tentacle&quot;&gt;New in Ceph Tentacle &lt;a class=&quot;link-anchor&quot; href=&quot;#new-in-ceph-tentacle&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Starting with Ceph Tentacle, you can now enable
Object Lock on existing versioned buckets (ceph/ceph#62063). This removes a major
operational pain point, allowing you to add compliance protection to production buckets without data migration.&lt;/p&gt;
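&lt;p&gt;A quick sketch of what this looks like in practice, assuming the standard
&lt;code&gt;put-object-lock-configuration&lt;/code&gt; call is used against a bucket that already
has versioning enabled (the bucket name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Versioning must already be enabled on the bucket
$ aws s3api get-bucket-versioning --bucket existing-bucket
{
    &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;
}

# Enable Object Lock on the existing bucket (Tentacle and later)
$ aws s3api put-object-lock-configuration --bucket existing-bucket &#92;
    --object-lock-configuration &#39;{&amp;quot;ObjectLockEnabled&amp;quot;: &amp;quot;Enabled&amp;quot;}&#39;
&lt;/code&gt;&lt;/pre&gt;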
&lt;h3 id=&quot;retention-modes%3A-governance-vs.-compliance&quot;&gt;Retention Modes: Governance vs. Compliance &lt;a class=&quot;link-anchor&quot; href=&quot;#retention-modes%3A-governance-vs.-compliance&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Object Lock supports two retention modes, each with different enforcement characteristics:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Governance&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Regular users cannot delete protected objects. However, users with the &lt;code&gt;s3:BypassGovernanceRetention&lt;/code&gt; permission can override the lock. Useful for internal policies that may require exceptions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Compliance&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Absolutely immutable. No user, including an RGW administrator, can delete the object or shorten the retention period through the S3 endpoint. Even the bucket owner cannot override. Required for regulatory compliance (SEC 17a-4, FINRA, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
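&lt;p&gt;The practical difference shows up when someone tries to remove a protected
version. A hedged sketch using the standard AWS CLI flags (the key and version
ID are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# GOVERNANCE mode: a plain delete of a protected version is refused...
$ aws s3api delete-object --bucket worm-bucket --key policy-draft.pdf --version-id v1abc

# ...but a user granted s3:BypassGovernanceRetention can override the lock
$ aws s3api delete-object --bucket worm-bucket --key policy-draft.pdf --version-id v1abc &#92;
    --bypass-governance-retention

# COMPLIANCE mode: the same delete fails for every user until the retention date passes
&lt;/code&gt;&lt;/pre&gt;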
&lt;h3 id=&quot;retention-periods&quot;&gt;Retention Periods &lt;a class=&quot;link-anchor&quot; href=&quot;#retention-periods&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A retention period specifies &lt;em&gt;how long&lt;/em&gt; the lock remains in effect. Once set to
Compliance mode, this period cannot be shortened; it can only be extended.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a bucket with Object Lock Enabled
$ aws s3api create-bucket --bucket worm-bucket --object-lock-enabled-for-bucket

# Set retention on upload
$ aws s3api put-object --bucket worm-bucket --key financial-record.pdf &#92;
    --body financial-record.pdf &#92;
    --object-lock-mode COMPLIANCE &#92;
    --object-lock-retain-until-date &amp;quot;2032-12-31T23:59:59Z&amp;quot;

# Or set default retention for all objects in the bucket
$ aws s3api put-object-lock-configuration --bucket worm-bucket &#92;
    --object-lock-configuration &#39;{
        &amp;quot;ObjectLockEnabled&amp;quot;: &amp;quot;Enabled&amp;quot;,
        &amp;quot;Rule&amp;quot;: {
            &amp;quot;DefaultRetention&amp;quot;: {
                &amp;quot;Mode&amp;quot;: &amp;quot;COMPLIANCE&amp;quot;,
                &amp;quot;Years&amp;quot;: 7
            }
        }
    }&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;legal-hold%3A-indefinite-protection&quot;&gt;Legal Hold: Indefinite Protection &lt;a class=&quot;link-anchor&quot; href=&quot;#legal-hold%3A-indefinite-protection&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In addition to time-based retention, Object Lock supports &lt;em&gt;Legal Hold&lt;/em&gt;, a flag
that prevents deletion regardless of retention settings. Legal Hold acts as a
binary switch (On/Off) and does &lt;em&gt;not&lt;/em&gt; require a retention period; it remains in
effect until explicitly removed. This is designed, for example, for litigation
scenarios where data must be preserved indefinitely until legal proceedings conclude.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Apply Legal Hold to a specific version (example version ID)
$ aws s3api put-object-legal-hold --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot; &#92;
    --legal-hold Status=ON

# Remove Legal Hold (requires s3:PutObjectLegalHold permission)
$ aws s3api put-object-legal-hold --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot; &#92;
    --legal-hold Status=OFF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Important&lt;/em&gt;: An object can have both a retention period AND a Legal Hold.
The object remains protected until BOTH conditions are cleared.&lt;/p&gt;
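&lt;p&gt;To see which protection is keeping a particular version alive, you can query
both independently (key and version ID illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Time-based retention on a specific version
$ aws s3api get-object-retention --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot;

# Legal Hold status on the same version
$ aws s3api get-object-legal-hold --bucket worm-bucket &#92;
    --key evidence.pdf --version-id &amp;quot;abc123&amp;quot;
&lt;/code&gt;&lt;/pre&gt;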
&lt;h3 id=&quot;regulatory-compliance%3A-third-party-validation&quot;&gt;Regulatory Compliance: Third-Party Validation &lt;a class=&quot;link-anchor&quot; href=&quot;#regulatory-compliance%3A-third-party-validation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For organizations in regulated industries, Ceph&#39;s Object Lock
implementation has been independently assessed by Cohasset Associates, a
consulting firm specializing in records management and information
governance. &lt;a href=&quot;https://www.ibm.com/downloads/cas/PJZN8VE3&quot;&gt;Their October 2023 compliance assessment&lt;/a&gt;
confirms that Ceph with Object Lock meets the electronic recordkeeping
requirements of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;SEC Rules 17a-4(f) and 18a-6(e)&lt;/em&gt;: Non-rewriteable, non-erasable record format (WORM) requirements for broker-dealers and security-based swap entities&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;FINRA Rule 4511(c)&lt;/em&gt;: Which defers to SEC Rule 17a-4 for format and media requirements&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;CFTC Rule 1.31(c)-(d)&lt;/em&gt;: Principles-based requirements for commodity futures trading firms&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;understanding-object-lock-protection-boundaries&quot;&gt;Understanding Object Lock Protection Boundaries &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-object-lock-protection-boundaries&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It&#39;s essential to understand what Object Lock protects against and what it does
not. &lt;em&gt;Object Lock enforcement occurs at the S3 API layer&lt;/em&gt;. When a DELETE request
arrives at the Object Gateway (RGW) endpoint, the gateway checks the lock status
and denies the operation if the object is protected. This means:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What Object Lock Protects Against:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Accidental deletion via S3 clients (aws cli, SDKs, applications)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Malicious deletion by compromised S3 credentials&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deletion by any user, including the bucket owner and RGW admin account (in Compliance mode)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Programmatic bulk deletions from rogue scripts or ransomware targeting S3 APIs&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;What Object Lock Does NOT Protect Against:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Direct RADOS-level operations (&lt;code&gt;rados rm&lt;/code&gt;, &lt;code&gt;radosgw-admin bucket rm --purge-objects&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Physical destruction of storage media&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Cluster-level administrative actions by users with Ceph admin credentials&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# This is blocked by Object Lock
$ aws s3api delete-object --bucket compliance-bucket --key locked-file.pdf --version-id abc123
An error occurred (AccessDenied) when calling the DeleteObject operation: forbidden by object lock

# But someone with RADOS admin access could still do this (DON&#39;T DO THIS!)
$ rados -p default.rgw.buckets.data rm &amp;lt;bucket_marker&amp;gt;_locked-file.pdf
# This bypasses Object Lock entirely - the data is gone
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is not a limitation unique to Ceph; it&#39;s inherent to any software-enforced
protection. Object Lock protects your data at the application layer (S3 API),
but someone with root access to the underlying storage infrastructure operates
at a different trust boundary entirely.&lt;/p&gt;
&lt;p&gt;Object Lock provides strong protection against S3-layer threats and satisfies
regulatory requirements (SEC 17a-4, etc.) when combined with appropriate access
controls at the infrastructure layer. The RADOS bypass scenario requires
privileged cluster access that should be tightly controlled and audited through
separate mechanisms.&lt;/p&gt;
&lt;h2 id=&quot;rgw-internals%3A-the-object-logical-head-(olh)&quot;&gt;RGW Internals: The Object Logical Head (OLH) &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-internals%3A-the-object-logical-head-(olh)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we understand the API semantics, let&#39;s examine how RGW implements
versioning under the hood. This is where the &lt;em&gt;Object Logical Head (OLH)&lt;/em&gt; becomes essential.&lt;/p&gt;
&lt;h3 id=&quot;the-problem%3A-resolving-ambiguity&quot;&gt;The Problem: Resolving Ambiguity &lt;a class=&quot;link-anchor&quot; href=&quot;#the-problem%3A-resolving-ambiguity&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Consider a simple GET request: &lt;code&gt;GET /bucket/photo.jpg&lt;/code&gt;. In an unversioned bucket,
this is unambiguous: there&#39;s exactly one object with that key. But with versioning
enabled, &amp;quot;photo.jpg&amp;quot; could have dozens of versions. Which one should RGW return?&lt;/p&gt;
&lt;p&gt;The naive solution is to scan all versions and select the one with the most recent
timestamp. But this approach has profound performance implications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Every GET would require a range scan of the bucket index&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The cost would grow linearly with the number of versions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Concurrent writes could create race conditions&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RGW solves this with a layer of indirection: the &lt;em&gt;Object Logical Head&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id=&quot;what-is-the-olh%3F&quot;&gt;What is the OLH? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-is-the-olh%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The OLH is a mechanism that tracks which version instance is the &amp;quot;current&amp;quot;
version of an object. When you access &lt;code&gt;photo.jpg&lt;/code&gt; without a version ID, RGW
uses the OLH to determine which version instance to return.&lt;/p&gt;
&lt;p&gt;The Ceph source code defines distinct entry types in the bucket index (&lt;code&gt;cls_rgw_types.h&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;enum class BIIndexType : uint8_t {
  Invalid        = 0,
  Plain          = 1,   // Non-versioned object entries
  Instance       = 2,   // Individual version instances
  OLH            = 3,   // Object Logical Head
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When versioning is enabled:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Each object version is stored as an &lt;em&gt;Instance&lt;/em&gt; entry with a unique version ID&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;em&gt;OLH&lt;/em&gt; entry tracks which instance is current&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Non-versioned objects use &lt;em&gt;Plain&lt;/em&gt; entries&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;olh-epochs%3A-ordering-versions&quot;&gt;OLH Epochs: Ordering Versions &lt;a class=&quot;link-anchor&quot; href=&quot;#olh-epochs%3A-ordering-versions&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;RGW uses an &lt;code&gt;olh_epoch&lt;/code&gt; counter to establish version ordering. As described in
the Ceph GitHub repo:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: &amp;quot;The existing algorithm uses an OLH epoch, incremented with each new version of
a name, that is used to sort its versions from newest to oldest.&amp;quot; — &lt;a href=&quot;https://github.com/ceph/ceph/pull/31325&quot;&gt;ceph/ceph PR #31325&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When a new version is written:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;olh_epoch&lt;/code&gt; is incremented&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new Instance entry is created in the bucket index&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The OLH is updated to reflect the new current version&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This epoch-based approach ensures consistent ordering even in concurrent write
scenarios and is critical for multi-site replication where versions may arrive
out of order.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/olh.jpg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;the-olh-log&quot;&gt;The OLH Log &lt;a class=&quot;link-anchor&quot; href=&quot;#the-olh-log&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The OLH mechanism includes an &lt;code&gt;olh_log&lt;/code&gt; that records modifications to the
version history. Rather than updating the OLH pointer directly, changes
are logged and then applied. This log-based approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Enables safe concurrent modifications&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Supports multi-site synchronization (each zone maintains its own &lt;code&gt;olh_log&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Allows recovery from partial failures&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;olh_log&lt;/code&gt; is processed by functions like &lt;code&gt;apply_olh_log()&lt;/code&gt; in the RGW
codebase, which evaluates pending changes and updates the current version
pointer accordingly.&lt;/p&gt;
&lt;h3 id=&quot;delete-markers-and-the-olh&quot;&gt;Delete Markers and the OLH &lt;a class=&quot;link-anchor&quot; href=&quot;#delete-markers-and-the-olh&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When deleting an object in a versioned bucket, RGW creates a Delete Marker
using a dedicated operation (&lt;code&gt;CLS_RGW_OP_LINK_OLH_DM&lt;/code&gt;). This operation:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Creates a special zero-byte Instance entry marked as a delete marker&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Updates the OLH to point to this delete marker as the current version&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Subsequent GET requests (without a version ID) will resolve to the delete
marker and return &lt;code&gt;404&lt;/code&gt;, while direct version access still works for all
previous versions.&lt;/p&gt;
&lt;h3 id=&quot;examining-the-bucket-index&quot;&gt;Examining the Bucket Index &lt;a class=&quot;link-anchor&quot; href=&quot;#examining-the-bucket-index&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You can examine the bucket index entries using &lt;code&gt;radosgw-admin&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bi list --bucket my-bucket
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;The OLH Entry&lt;/em&gt; tracks the current version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;olh&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801001_report.pdf&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;key&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;instance&amp;quot;: &amp;quot;sTsGobhZm2cGravZvOmc9IbpXgIEM8R&amp;quot;
        },
        &amp;quot;delete_marker&amp;quot;: false,
        &amp;quot;epoch&amp;quot;: 6,
        &amp;quot;pending_log&amp;quot;: [],
        &amp;quot;exists&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The OLH tells us: the current version of &lt;code&gt;report.pdf&lt;/code&gt; is instance &lt;code&gt;sTsGobhZm2cGravZvOmc9IbpXgIEM8R&lt;/code&gt;.
The current epoch is &lt;code&gt;6&lt;/code&gt;, and it&#39;s not a delete marker.&lt;/p&gt;
&lt;p&gt;Instance Type Entries exist for each version of the object:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;instance&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801000_report.pdf&#92;u0000isTsGobhZm2cGravZvOmc9IbpXgIEM8R&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
        &amp;quot;instance&amp;quot;: &amp;quot;sTsGobhZm2cGravZvOmc9IbpXgIEM8R&amp;quot;,
        &amp;quot;exists&amp;quot;: true,
        &amp;quot;meta&amp;quot;: {
            &amp;quot;size&amp;quot;: 1012,
            &amp;quot;mtime&amp;quot;: &amp;quot;2025-12-05T11:47:54.163133Z&amp;quot;,
            &amp;quot;etag&amp;quot;: &amp;quot;959f45520adcbe51b3d7b24e1379d3c0&amp;quot;
        },
        &amp;quot;versioned_epoch&amp;quot;: 6
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice how &lt;code&gt;versioned_epoch&lt;/code&gt; establishes ordering. Our three versions have
epochs &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;3&lt;/code&gt;, and &lt;code&gt;6&lt;/code&gt;; the OLH points to epoch &lt;code&gt;6&lt;/code&gt;, confirming it&#39;s the current version.&lt;/p&gt;
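&lt;p&gt;The same ordering is visible from the S3 side: &lt;code&gt;list-object-versions&lt;/code&gt;
reports each instance ID together with an &lt;code&gt;IsLatest&lt;/code&gt; flag, which should
agree with the instance the OLH entry points at:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3api list-object-versions --bucket my-bucket --prefix report.pdf &#92;
    --query &#39;Versions[].{VersionId: VersionId, IsLatest: IsLatest, LastModified: LastModified}&#39;
&lt;/code&gt;&lt;/pre&gt;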
&lt;p&gt;When we delete &lt;code&gt;report.pdf&lt;/code&gt; without specifying a version ID, a Delete Marker is created:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws s3api delete-object --bucket my-bucket --key report.pdf
{
    &amp;quot;DeleteMarker&amp;quot;: true,
    &amp;quot;VersionId&amp;quot;: &amp;quot;NtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the OLH has changed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;olh&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801001_report.pdf&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;key&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
            &amp;quot;instance&amp;quot;: &amp;quot;NtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;
        },
        &amp;quot;delete_marker&amp;quot;: true,
        &amp;quot;epoch&amp;quot;: 7,
        &amp;quot;pending_log&amp;quot;: [],
        &amp;quot;exists&amp;quot;: true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The OLH now points to a new instance with &lt;code&gt;&amp;quot;delete_marker&amp;quot;: true&lt;/code&gt; and epoch &lt;code&gt;7&lt;/code&gt;.
The delete marker&#39;s instance entry confirms it&#39;s a zero-byte marker:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;type&amp;quot;: &amp;quot;instance&amp;quot;,
    &amp;quot;idx&amp;quot;: &amp;quot;&#92;u00801000_report.pdf&#92;u0000iNtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;,
    &amp;quot;entry&amp;quot;: {
        &amp;quot;name&amp;quot;: &amp;quot;report.pdf&amp;quot;,
        &amp;quot;instance&amp;quot;: &amp;quot;NtxFanesdl99IjNYXyJ-QGSGNETrlko&amp;quot;,
        &amp;quot;exists&amp;quot;: false,
        &amp;quot;meta&amp;quot;: {
            &amp;quot;size&amp;quot;: 0,
            &amp;quot;mtime&amp;quot;: &amp;quot;2025-12-05T13:35:37.561880Z&amp;quot;
        },
        &amp;quot;tag&amp;quot;: &amp;quot;delete-marker&amp;quot;,
        &amp;quot;versioned_epoch&amp;quot;: 7
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin object stat&lt;/code&gt; command can be useful, providing a
higher-level view down to a specific object version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin object stat --bucket my-bucket --object report.pdf
$ radosgw-admin object stat --bucket my-bucket --object report.pdf --object-version sTsGobhZm2cGravZvOmc9IbpXgIEM8R
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This shows the object&#39;s metadata, manifest, and version information from RGW&#39;s perspective.&lt;/p&gt;
&lt;h3 id=&quot;key-takeaways&quot;&gt;Key Takeaways &lt;a class=&quot;link-anchor&quot; href=&quot;#key-takeaways&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The OLH mechanism provides several essential properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Efficient lookups&lt;/em&gt;: GET requests without a version ID can quickly resolve to the current version without scanning all versions&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Consistent ordering&lt;/em&gt;: The epoch-based system ensures deterministic version ordering&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Multi-site compatibility&lt;/em&gt;: The &lt;code&gt;olh_log&lt;/code&gt; design supports replication scenarios where versions may be created concurrently in different zones&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Safe concurrent access&lt;/em&gt;: The log-and-apply model handles race conditions between concurrent writers&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;lifecycle-management-with-versioning&quot;&gt;Lifecycle Management with Versioning &lt;a class=&quot;link-anchor&quot; href=&quot;#lifecycle-management-with-versioning&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As we discussed in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;,
Lifecycle management automates data governance through policy-based rules. With
versioning enabled, lifecycle policies gain additional capabilities for managing
version history.&lt;/p&gt;
&lt;h3 id=&quot;expiration-actions-for-versioned-buckets&quot;&gt;Expiration Actions for Versioned Buckets &lt;a class=&quot;link-anchor&quot; href=&quot;#expiration-actions-for-versioned-buckets&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Expiration&lt;/code&gt; (Days/Date)&lt;/td&gt;
&lt;td&gt;Adds a Delete Marker to current versions (does not delete data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NoncurrentVersionExpiration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Permanently deletes noncurrent versions after specified days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ExpiredObjectDeleteMarker&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removes Delete Markers when they&#39;re the only remaining version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NewerNoncurrentVersions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Limits how many noncurrent versions to retain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&quot;example%3A-version-retention-policy&quot;&gt;Example: Version Retention Policy &lt;a class=&quot;link-anchor&quot; href=&quot;#example%3A-version-retention-policy&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This policy keeps the current version indefinitely, retains the last three
noncurrent versions, and permanently deletes older noncurrent versions after 90 days:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;
{
  &amp;quot;Rules&amp;quot;: [
    {
      &amp;quot;ID&amp;quot;: &amp;quot;Version Retention Policy&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: {
        &amp;quot;Prefix&amp;quot;: &amp;quot;&amp;quot;
      },
      &amp;quot;NoncurrentVersionExpiration&amp;quot;: {
        &amp;quot;NoncurrentDays&amp;quot;: 90,
        &amp;quot;NewerNoncurrentVersions&amp;quot;: 3
      },
      &amp;quot;Expiration&amp;quot;: {
        &amp;quot;ExpiredObjectDeleteMarker&amp;quot;: true
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply the policy using the AWS CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://rgw:80 s3api put-bucket-lifecycle-configuration --bucket versioned-bucket --lifecycle-configuration file://lifecycle-policy.json
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;understanding-noncurrentversionexpiration-parameters&quot;&gt;Understanding NoncurrentVersionExpiration Parameters &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-noncurrentversionexpiration-parameters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;NoncurrentVersionExpiration&lt;/code&gt; rule takes two parameters that work together
to control version retention:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;&amp;quot;NoncurrentVersionExpiration&amp;quot;: {
  &amp;quot;NoncurrentDays&amp;quot;: 90,
  &amp;quot;NewerNoncurrentVersions&amp;quot;: 3
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;How it works:&lt;/em&gt; Both conditions must be &lt;code&gt;true&lt;/code&gt; for a version to be deleted:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Version must be noncurrent for at least &lt;code&gt;NoncurrentDays&lt;/code&gt; (90 days), &lt;em&gt;AND&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;There must be at least &lt;code&gt;NewerNoncurrentVersions&lt;/code&gt; (3) newer noncurrent versions, i.e. the version is not among the three newest noncurrent versions&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let&#39;s say you have an object &lt;code&gt;report.pdf&lt;/code&gt; with this version history:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;Current version (latest):
└─ v10 - 2025-12-11 (current version, not affected by this rule)

Noncurrent versions (older versions):
├─ v9  - 2025-12-10 (1 day noncurrent)   ← Newer noncurrent #1
├─ v8  - 2025-12-08 (3 days noncurrent)  ← Newer noncurrent #2
├─ v7  - 2025-12-05 (6 days noncurrent)  ← Newer noncurrent #3
├─ v6  - 2025-09-01 (102 days noncurrent) ✅ DELETE (&amp;gt;90 days AND &amp;gt;3 newer versions)
├─ v5  - 2025-08-15 (118 days noncurrent) ✅ DELETE
├─ v4  - 2025-08-01 (132 days noncurrent) ✅ DELETE
├─ v3  - 2025-07-15 (149 days noncurrent) ✅ DELETE
├─ v2  - 2025-07-01 (163 days noncurrent) ✅ DELETE
└─ v1  - 2025-06-15 (179 days noncurrent) ✅ DELETE
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;lifecycle-and-object-lock-interaction&quot;&gt;Lifecycle and Object Lock Interaction &lt;a class=&quot;link-anchor&quot; href=&quot;#lifecycle-and-object-lock-interaction&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When both Lifecycle policies and Object Lock are active, Object Lock takes precedence.
If a Lifecycle rule attempts to delete a locked object version, the deletion is blocked:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/ocol.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;This ensures that compliance requirements always take precedence over automated cleanup policies.&lt;/p&gt;
&lt;h3 id=&quot;cloud-transition-and-object-lock&quot;&gt;Cloud Transition and Object Lock &lt;a class=&quot;link-anchor&quot; href=&quot;#cloud-transition-and-object-lock&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;RGW&#39;s policy-based cloud transition feature allows you to tier data to external
S3-compatible endpoints (public cloud, tape gateways, etc.) using Lifecycle
policies. When Object Lock is active, &lt;em&gt;locked objects are automatically skipped
during cloud transitions&lt;/em&gt; to preserve the WORM contract.&lt;/p&gt;
&lt;p&gt;This behavior is intentional: cloud transition is a &lt;em&gt;destructive&lt;/em&gt; operation
from Ceph&#39;s perspective: after transition, the local copy is typically removed
and replaced with a stub. Allowing this for locked objects would violate the
immutability guarantee.&lt;/p&gt;
&lt;p&gt;From the RGW lifecycle code (&lt;code&gt;rgw_lc.cc&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;if (!oc.o.is_current() &amp;amp;&amp;amp;
    !pass_object_lock_check(oc.driver, oc.obj.get(), oc.dpp)) {
  /* Skip objects which has object lock enabled. */
  ldpp_dout(oc.dpp, 10) &amp;lt;&amp;lt; &amp;quot;Object(key:&amp;quot; &amp;lt;&amp;lt; oc.o.key 
                        &amp;lt;&amp;lt; &amp;quot;) is locked. Skipping transition to cloud-s3 tier&amp;quot;
                        &amp;lt;&amp;lt; dendl;
  return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that compliance data remains on your Ceph cluster until the
retention period expires, regardless of any cloud tiering policies that
might otherwise apply.&lt;/p&gt;
&lt;h2 id=&quot;operational-considerations&quot;&gt;Operational Considerations &lt;a class=&quot;link-anchor&quot; href=&quot;#operational-considerations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;storage-capacity-planning&quot;&gt;Storage Capacity Planning &lt;a class=&quot;link-anchor&quot; href=&quot;#storage-capacity-planning&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With versioning enabled, storage consumption can grow rapidly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Every modification creates a new complete object version&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Delete operations don&#39;t free space (they add Delete Markers)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Space is only reclaimed when versions are permanently deleted&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Monitoring recommendation&lt;/em&gt;: Track both logical (S3-reported) and physical (RADOS-reported) usage:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# S3-level bucket statistics
$ radosgw-admin bucket stats --bucket versioned-bucket | jq &#39;.usage&#39;

# RADOS pool usage
$ ceph df detail
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You may find it useful to add this or a similar exporter, such as &lt;a href=&quot;https://github.com/pcuzner/rgw-exporter&quot;&gt;rgw-exporter&lt;/a&gt;,
to your Prometheus stack.&lt;/p&gt;
&lt;h3 id=&quot;index-shard-sizing-for-versioned-buckets&quot;&gt;Index Shard Sizing for Versioned Buckets &lt;a class=&quot;link-anchor&quot; href=&quot;#index-shard-sizing-for-versioned-buckets&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Each object version creates additional entries in the bucket index. A bucket with one million objects
and an average of ten versions per object has 10 million index entries. Plan shard counts accordingly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Check current shard count
$ radosgw-admin bucket stats --bucket versioned-bucket | jq &#39;.num_shards&#39;

# Consider pre-sharding for expected growth
$ radosgw-admin bucket reshard --bucket versioned-bucket --num-shards XXX
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that as of December 2025 there is work underway to enhance dynamic resharding to account for
versioned objects. Clusters running earlier releases should factor versioning into their manual
shard count or dynamic resharding threshold.&lt;/p&gt;
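&lt;p&gt;In the meantime, it is worth knowing where the relevant knobs live. A short
sketch of how to inspect them (option names as of recent releases; verify against
your version&#39;s documentation before changing anything):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Is dynamic resharding enabled at all?
$ ceph config get client.rgw rgw_dynamic_resharding

# Objects-per-shard target used to trigger a reshard (default 100000)
$ ceph config get client.rgw rgw_max_objs_per_shard
&lt;/code&gt;&lt;/pre&gt;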
&lt;h3 id=&quot;mfa-delete-for-additional-security&quot;&gt;MFA Delete for Additional Security &lt;a class=&quot;link-anchor&quot; href=&quot;#mfa-delete-for-additional-security&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;RGW supports MFA Delete, which requires multi-factor authentication to permanently
delete object versions or change a bucket&#39;s versioning state.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Generate current TOTP code
$ oathtool -d6 --totp b4902c641a1363541b32abc2a26817
293651

# Enable MFA Delete (note: serial + space + code)
$ aws --endpoint=http://rgw:80 s3api put-bucket-versioning &#92;
    --bucket secure-bucket &#92;
    --versioning-configuration MFADelete=Enabled,Status=Enabled &#92;
    --mfa &amp;quot;my-mfa-device 293651&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once enabled, any attempt to permanently delete a version without providing a
valid MFA code will fail with &lt;code&gt;AccessDenied&lt;/code&gt;.&lt;/p&gt;
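&lt;p&gt;With MFA Delete active, a permanent version delete has to carry the device
serial and a current code, for example (serial, code, and version ID are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://rgw:80 s3api delete-object &#92;
    --bucket secure-bucket --key report.pdf &#92;
    --version-id &amp;quot;abc123&amp;quot; &#92;
    --mfa &amp;quot;my-mfa-device 293651&amp;quot;
&lt;/code&gt;&lt;/pre&gt;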
&lt;p&gt;For complete MFA setup instructions, including creating TOTP tokens
with &lt;code&gt;radosgw-admin mfa create&lt;/code&gt;, see the &lt;a href=&quot;https://docs.ceph.com/en/latest/radosgw/mfa&quot;&gt;Ceph Object Gateway Multi-Factor Authentication documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;conclusion%3A-the-complete-data-protection-stack&quot;&gt;Conclusion: The Complete Data Protection Stack &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-the-complete-data-protection-stack&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;images/lock2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Across this third deep dive, we&#39;ve explored how Ceph RGW implements two
cornerstone features for enterprise data protection. &lt;em&gt;Versioning&lt;/em&gt;
provides a complete history of every object, enabling recovery from
accidental modifications and deletions. &lt;em&gt;Object Lock&lt;/em&gt; adds WORM
semantics for regulatory compliance and ransomware protection.&lt;/p&gt;
&lt;p&gt;At the heart of these features is the &lt;em&gt;Object Logical Head (OLH)&lt;/em&gt;,
an elegant architectural solution that maintains version history efficiently through a layer of indirection.&lt;/p&gt;
&lt;p&gt;Combined with the Lifecycle Management capabilities from &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt;,
you now have a complete picture of RGW&#39;s data governance stack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Versioning + OLH&lt;/em&gt;: Preserves history and enables point-in-time recovery&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Object Lock&lt;/em&gt;: Enforces immutability for compliance&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Lifecycle Management&lt;/em&gt;: Automates version cleanup within policy constraints&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Garbage Collection&lt;/em&gt;: Reclaims space from permanently deleted versions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In upcoming articles, we&#39;ll continue our exploration with topics including multi-site replication and STS/IAM integration. Stay tuned!&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>KV Caching with vLLM, LMCache, and Ceph</title>
    <link href="https://ceph.io/en/news/blog/2025/vllm-kv-caching/" />
    <updated>2025-12-10T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/vllm-kv-caching/</id>
    <author>
      <name>Kyle Bader, Tushar Gohad</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/vllm-kv-caching/">&lt;p&gt;Inference accounts for &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S2210537923000124&quot;&gt;90% of machine learning
costs&lt;/a&gt; for deployed AI
systems, and it is no surprise that inference optimization is a burgeoning topic
in the research community. &lt;a href=&quot;https://info.idc.com/futurescape-generative-ai-2025-predictions.html&quot;&gt;IDC
estimates&lt;/a&gt; that global enterprises will invest
$307 billion USD on AI solutions in 2025, and that number is expected to grow
aggressively year-over-year.&lt;/p&gt;
&lt;h2 id=&quot;understanding-the-workload&quot;&gt;Understanding the workload &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-the-workload&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Unlike training, inference for autoregressive language models only involves the
forward pass, which itself is broken up into two distinct phases: prefill and
decode. Each phase has a unique workload profile – prefill tends to be
computation bound, consuming every ounce of floating-point arithmetic capability
the system can garner, followed by decode, which is principally limited by
memory bandwidth.&lt;/p&gt;
&lt;p&gt;The computational complexity of both prefill and decode phases grows
quadratically with each additional token. Prefill is easily parallelized across
GPUs - all prompt tokens are known up front when a request arrives at the model
API. The decode phase brings in the transformer multi-headed attention mechanism
and must compute the attention states across all previous tokens - including any
prompt(s) and generated responses. This complicates the deployment of inference
services where context lengths are growing rapidly to accommodate larger code
bases, longer documents, and retrieval augmented generation. KV caching is where
the computed key and value tensors that correspond with token sequences in a
prompt are saved for later, and then retrieved when they are used in a
subsequent prompt to avoid the cost of computation (GPU hours) and to reduce
the time between when the prompt was submitted as a request and the first
response token (time-to-first-token, or TTFT).&lt;/p&gt;
&lt;h2 id=&quot;cache-blocks-in-vllm-and-lmcache&quot;&gt;Cache blocks in vLLM and LMCache &lt;a class=&quot;link-anchor&quot; href=&quot;#cache-blocks-in-vllm-and-lmcache&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;vLLM takes a hierarchical approach to KV caching. First it checks for the
existence of cache blocks in GPU memory; if there is a cache miss it will
progress to CPU memory, and if there is again a cache miss it will try to
retrieve cache blocks over any configured KV connectors. LMCache works with vLLM
over this KV connector interface - vLLM sends or requests cache blocks and
LMCache works to diligently store or stream cache blocks it locates. vLLM also
introduced the technique of &lt;a href=&quot;https://arxiv.org/pdf/2309.06180&quot;&gt;Paged Attention&lt;/a&gt;, which breaks up prompts into fixed
sized token sequences referred to as a block, 16 tokens by default. LMCache uses
a larger 256 token block by default, presumably to reduce the overhead of
managing references to many blocks and to better amortize the per-block transfer
overhead. Storage folks, being unfamiliar with a token as a unit of measurement
for space and IO, might naturally wonder what this translates to in terms of
block sizes expressed in bytes. The bytes-per-token is model dependent, because
it’s a product of the model’s hidden size, number of key-value heads, number of
hidden layers, head dimension, and data type size. For a model like Qwen3-32B
this works out to be approximately 62.5 MiB per 256-token cache block. There is a convenient &lt;a href=&quot;https://docs.lmcache.ai/getting_started/kv_cache_calculator.html&quot;&gt;KV Cache
calculator&lt;/a&gt; available on the documentation page for LMCache if you want to see
how much KV space would be required for any given model or number of tokens.&lt;/p&gt;
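&lt;p&gt;As a back-of-the-envelope sketch of that arithmetic: the KV footprint per token
is 2 (keys and values) × layers × KV heads × head dimension × bytes per element.
The values below are placeholder assumptions for a Qwen3-32B-class configuration;
read the real ones from your model&#39;s config, and prefer the calculator above for
anything load-bearing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Placeholder model parameters - replace with your model&#39;s actual config
LAYERS=64 KV_HEADS=8 HEAD_DIM=128 BYTES_PER_ELEM=2 BLOCK_TOKENS=256

# The leading 2 accounts for storing both the key and the value per token
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM ))
echo &amp;quot;KV bytes per token: ${PER_TOKEN}&amp;quot;
echo &amp;quot;KV bytes per ${BLOCK_TOKENS}-token block: $(( PER_TOKEN * BLOCK_TOKENS ))&amp;quot;
&lt;/code&gt;&lt;/pre&gt;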
&lt;h2 id=&quot;content-addressable-kv-storage&quot;&gt;Content addressable KV storage &lt;a class=&quot;link-anchor&quot; href=&quot;#content-addressable-kv-storage&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;vLLM and LMCache both calculate a hash of the token sequence that represents a
block and use that as a cache block identifier. This means that vLLM will pass
over the kv-connector interface the hashes of cache blocks that it is interested
in, and LMCache will return a bitmask indicating which cache blocks it can
provide. Under the covers the LMCache S3 connector will make GetObjectAttributes
calls with each block identifier (hash of the token sequence) and for each block
that exists it will flip the corresponding bit in the mask. The elegance of this
approach is that there is no cache block map that needs to be persisted, and no
coordination necessary when there are multiple instances of vLLM+LMCache running
across different hosts. In fact, there is no requirement that the &lt;a href=&quot;https://docs.lmcache.ai/kv_cache_management/index.html&quot;&gt;LMCache
controller&lt;/a&gt; be configured at all. This design also
permits flexible eviction: a storage system could implement time-based
expiration via Lifecycle configurations, and any deleted block simply registers
as a miss. In the end you get fully elastic content addressable storage for KV
cache blocks with flexible eviction. Anyone familiar with Ceph will truly
appreciate the notion of computing the location of data over performing a
lookup.&lt;/p&gt;
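&lt;p&gt;To make that concrete, the existence probe is conceptually just a metadata call
against a key derived from the block hash. A hand-run equivalent with the AWS CLI
might look like the following; the bucket, prefix, and hash placeholder are
illustrative and not LMCache&#39;s exact key format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A present block returns its attributes; a missing one returns NoSuchKey,
# which LMCache records as a miss in the bitmask it hands back to vLLM.
aws --endpoint-url http://s3.cephlab.com s3api get-object-attributes &#92;
    --bucket lmcache --key &amp;quot;test/&amp;lt;token-sequence-hash&amp;gt;&amp;quot; &#92;
    --object-attributes ObjectSize
&lt;/code&gt;&lt;/pre&gt;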
&lt;h2 id=&quot;retrieving-cache-blocks&quot;&gt;Retrieving cache blocks &lt;a class=&quot;link-anchor&quot; href=&quot;#retrieving-cache-blocks&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We began exploring LMCache by testing its native S3 connector with Ceph, as it
provides an accessible entry point for most existing environments. The other
appeal of the native S3 connector in LMCache is that it leverages an AWS common
runtime library (CRT), which means that the connections in the client’s
connection pool will be multiplexed across endpoints that are returned in the
DNS response for the object store’s FQDN. The downside is that the bindings in
the AWS common runtime library for Python only support &lt;code&gt;recv_filepath&lt;/code&gt; and
&lt;code&gt;send_filepath&lt;/code&gt;, which limits the ability of LMCache to stream the response body
of a GetObject call directly to page-locked memory buffers allocated by the
LocalCPUBackend. To work around this limitation the connector pre-allocates and
mmaps files on a tmpfs mounted at /dev/shm (one per concurrent request); in this
way the CRT client can pass the file descriptors of memory-mapped files and then
memcpy from their corresponding buffers to page-locked LocalCPUBackend buffers
that are used for DMA transfers to the GPU. This is a clever way of working
around most of the limitations of aws-crt-python, but to get true zero-copy it
will require changes to the bindings.&lt;/p&gt;
&lt;p&gt;After some preliminary testing with the native S3 connector &lt;a href=&quot;https://github.com/LMCache/LMCache/pull/1939&quot;&gt;LMCache
PR#1939&lt;/a&gt;
caught our eye because it leveraged the NVIDIA Inference Xfer Library (NIXL). This
PR introduces the ability to directly read S3 data into page-locked NIXL
buffers, bypassing files on /dev/shm and the associated memory copy. It also
introduced a presence cache to eliminate redundant GetObjectInfo requests that
are used to determine if a cache block exists for a given sequence. We had
experimented with the NIXL obj plugin already and ran some rudimentary nixlbench
tests. What we found was that the NIXL obj plugin alone wanted a pre-allocated
pool of object keys, and that it required either the LMCache coordinator or
Dynamo KVBM to maintain device ID, offset, and length information for each cache
block. Unlike other NIXL plugins, the obj plugin could only write a single cache
block to each device ID (1:1 mapping with object key), because object APIs like
S3 do not support writes to arbitrary offsets. This is all addressed by PR1939,
because instead of using a pool of object keys and tracking cache block
metadata, it preserves the content addressable approach of LMCache’s native S3
connector. The only remaining downside with NIXL is that it used S3Client
instead of S3CrtClient, the latter of which supports multipathing across S3
endpoints.&lt;/p&gt;
&lt;h2 id=&quot;hyperscale-ai-deployments&quot;&gt;Hyperscale AI deployments &lt;a class=&quot;link-anchor&quot; href=&quot;#hyperscale-ai-deployments&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Drawing from over a decade of experience selecting hardware for Ceph storage
systems we had an idea of what sort of system we would want to build to
maximize throughput, while also drawing inspiration from choices made by major
AI practitioners like Meta and OpenAI. Enter Meta’s contribution to the Open
Compute project – the &lt;a href=&quot;https://www.opencompute.org/documents/yosemite-v3-5-platform-design-specification-v1-2-pdf&quot;&gt;Yosemite
V3.5&lt;/a&gt; Sierra Point server platform. The YV3.5
cubby occupies 3 OU and can be populated with 6x Sierra Point blades. Unlike
conventional enterprise blade systems the YV3.5 platform does not have an
integrated ethernet switch; instead, each Sierra Point blade has an OCP 3.0 slot
for direct to host network connectivity. We wanted a system that was a spiritual
successor to YV3.5 and Sierra Point, that reaped the advantages of cutting-edge
processor designs and lithography. While surveying the server landscape across a
whole host of OEMs there was one system that caught our attention, the
Supermicro X14 2U 4-node GrandTwin Rear IO.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-x14-grandtwin.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/system/datasheet/sys-212gt-hnr&quot;&gt;Supermicro X14 2U 4-node GrandTwin Rear
IO&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Each node:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1x Intel Xeon 6 6740E 96C/96T, 205W&lt;/li&gt;
&lt;li&gt;16x16GB DDR5-6400&lt;/li&gt;
&lt;li&gt;1x Broadcom 57608 2x200GbE&lt;/li&gt;
&lt;li&gt;6x 2.5” Kioxia CM6-R, 7.68TB Gen4 NVMe SSD&lt;/li&gt;
&lt;li&gt;RAID1 2x 480GB NVMe (boot)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This system is utilized to provide high-bandwidth all-flash object storage for
the AI solution using IBM Storage Ceph 8.1.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-gaudi3.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/system/datasheet/sys-822ga-ngr3&quot;&gt;Supermicro Gaudi 3 AI Server
SYS-822GA-NGR3&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;2x Intel Xeon 6 6960P 72C/144T&lt;/li&gt;
&lt;li&gt;24x 64GB DDR5-6400&lt;/li&gt;
&lt;li&gt;8x Gaudi 3 HL-325L accelerators&lt;/li&gt;
&lt;li&gt;Up to 8x 2.5&amp;quot; Gen5 NVMe SSD&lt;/li&gt;
&lt;li&gt;Scale-up networking: 21x 200GbE Gaudi NICs&lt;/li&gt;
&lt;li&gt;2x Broadcom 57608 1x400GbE&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This system is utilized to run inference workloads with the combination of vLLM
and LMCache, leveraging Gaudi 3 accelerators from Intel.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-gpu-aplus.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/system/datasheet/as-8125gs-tnmr2&quot;&gt;Supermicro GPU A+ Server AS
-8125GS-TNMR2&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1x AMD EPYC 9654 96C/192T&lt;/li&gt;
&lt;li&gt;24x 96GB DDR5-4800&lt;/li&gt;
&lt;li&gt;8x AMD MI300X accelerators&lt;/li&gt;
&lt;li&gt;Up to 8x 2.5&amp;quot; Gen5 NVMe SSD&lt;/li&gt;
&lt;li&gt;Scale-up networking: 4x400GbE&lt;/li&gt;
&lt;li&gt;Storage and GPU scale-out networking: 4x NVIDIA MT28908 ConnectX-6 200GbE&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This system is utilized to run inference workloads with the combination of vLLM
and LMCache, leveraging MI300X accelerators from AMD.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/smci-sw.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.supermicro.com/en/products/accessories/Networking/SSE-T7132SR.php&quot;&gt;SSE-T7132S - 400Gb Ethernet
Switch&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;32x QSFP-DD 400GbE, or 64x QSFP56 / 128x QSFP28 with breakout cables&lt;/li&gt;
&lt;li&gt;25.6Tb/s switching capacity&lt;/li&gt;
&lt;li&gt;SONiC OS&lt;/li&gt;
&lt;li&gt;RoCEv2/RDMA support with PFC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For simplicity we used a single fixed-port 400Gb switch for both GPU-to-GPU and
the storage fabric.&lt;/p&gt;
&lt;h2 id=&quot;host-configuration&quot;&gt;Host configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#host-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Performance profile set in BIOS&lt;/li&gt;
&lt;li&gt;Set the tuned profile to network-latency&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;tuned-adm profile network-latency
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;All hosts were configured with bonded NICs in 802.3ad (LACP) mode with xmit_hash_policy=layer3+4 (a configuration sketch follows below)&lt;/li&gt;
&lt;/ul&gt;
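&lt;p&gt;A sketch of that bonding setup with NetworkManager; the interface names are
illustrative and should be adapted to your NIC naming:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create the bond with LACP and layer3+4 transmit hashing
nmcli con add type bond ifname bond0 con-name bond0 &#92;
  bond.options &amp;quot;mode=802.3ad,xmit_hash_policy=layer3+4,miimon=100&amp;quot;

# Enslave the two 200GbE ports, then bring the bond up
nmcli con add type ethernet ifname ens1f0np0 master bond0
nmcli con add type ethernet ifname ens1f1np1 master bond0
nmcli con up bond0
&lt;/code&gt;&lt;/pre&gt;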
&lt;h2 id=&quot;ceph-configuration&quot;&gt;Ceph configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;osd-service&quot;&gt;OSD service &lt;a class=&quot;link-anchor&quot; href=&quot;#osd-service&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;---
service_type: osd
service_id: nvme
placement:
  hosts:
    - ceph-osd01
    - ceph-osd02
    - ceph-osd03
data_devices:
  paths:
    - /dev/disk/by-path/pci-0000:63:00.5-pci-10001:81:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:63:00.5-pci-10001:82:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:01:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:02:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:03:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:04:00.0-nvme-1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;pool-configuration&quot;&gt;Pool configuration &lt;a class=&quot;link-anchor&quot; href=&quot;#pool-configuration&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We decided to pre-create metadata and data pools for RGW before initializing the
RGW service.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ceph osd pool set noautoscale
ceph osd pool create default.rgw.buckets.data 2048 2048 replicated
ceph osd pool create default.rgw.buckets.index 64 64 replicated
ceph osd pool create default.rgw.buckets.non-ec 64 64 replicated
ceph osd pool set default.rgw.buckets.data size 2
ceph osd pool set default.rgw.buckets.data min_size 1
ceph osd pool application enable default.rgw.buckets.data rgw
ceph osd pool application enable default.rgw.buckets.index rgw
ceph osd pool application enable default.rgw.buckets.non-ec rgw
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;rgw-service&quot;&gt;RGW service &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-service&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This RGW service configuration will create 4x RGW instances on each of the 4
hosts, with a concentrator bound to the host IP address at port 80.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
service_type: rgw
service_id: standard
service_name: rgw.standard
placement:
  count_per_host: 4
  label: rgw
networks:
  - 10.67.67.0/24
spec:
  rgw_exit_timeout_secs: 120
  rgw_frontend_port: 8080
  concentrator: haproxy
  concentrator_frontend_port: 80
  concentrator_monitor_port: 1967
  concentrator_monitor_user: admin
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;traffic-management&quot;&gt;Traffic management &lt;a class=&quot;link-anchor&quot; href=&quot;#traffic-management&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Like many applications, LMCache expects a single S3 endpoint. To maximize
bandwidth to the storage cluster we decided to leverage &lt;a href=&quot;https://ceph.io/en/news/blog/2025/consul-lb1/&quot;&gt;Hashicorp Consul and CoreDNS&lt;/a&gt;
to return multiple DNS records in response to queries for our chosen object
FQDN. As stated earlier, this works perfectly with AWS CRT libraries like those
utilized by LMCache’s native S3 connector.&lt;/p&gt;
&lt;h4 id=&quot;consul&quot;&gt;Consul &lt;a class=&quot;link-anchor&quot; href=&quot;#consul&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;/etc/consul.d/consul.hcl&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datacenter = &amp;quot;smci&amp;quot;
data_dir = &amp;quot;/opt/consul&amp;quot;
bind_addr = &amp;quot;172.19.65.41&amp;quot;
client_addr = &amp;quot;0.0.0.0&amp;quot;
retry_join = [
  &amp;quot;172.19.65.41&amp;quot;,
  &amp;quot;172.19.65.42&amp;quot;,
  &amp;quot;172.19.65.43&amp;quot;,
  &amp;quot;172.19.65.44&amp;quot;
]
server = true
bootstrap_expect = 3

# Register one &amp;quot;s3&amp;quot; service per local RGW frontend; service ids must be unique per agent
services = [
  {
    id   = &amp;quot;s3-8080&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8080
    check = {
      id       = &amp;quot;s3-8080-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8080&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  },
  {
    id   = &amp;quot;s3-8081&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8081
    check = {
      id       = &amp;quot;s3-8081-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8081&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  },
  {
    id   = &amp;quot;s3-8082&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8082
    check = {
      id       = &amp;quot;s3-8082-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8082&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  },
  {
    id   = &amp;quot;s3-8083&amp;quot;
    name = &amp;quot;s3&amp;quot;
    port = 8083
    check = {
      id       = &amp;quot;s3-8083-tcp&amp;quot;
      name     = &amp;quot;S3 TCP&amp;quot;
      tcp      = &amp;quot;localhost:8083&amp;quot;
      interval = &amp;quot;10s&amp;quot;
      timeout  = &amp;quot;2s&amp;quot;
    }
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;coredns&quot;&gt;CoreDNS &lt;a class=&quot;link-anchor&quot; href=&quot;#coredns&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;/etc/coredns/Corefile&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.:53 {
    log
    errors
    forward . 8.8.8.8
}

cephlab.com {
    file /etc/coredns/cephlab.com
    prometheus
    errors
    log
    debug
}

consul {
  forward . 172.19.65.41:8600 172.19.65.42:8600 172.19.65.43:8600 172.19.65.44:8600
  log
  errors
}

s3.cephlab.com {
    rewrite stop {
        name exact s3.cephlab.com s3.service.consul.
        answer name s3.service.consul. s3.cephlab.com.
    }
    rewrite stop {
        name regex (.*)&#92;.s3&#92;.cephlab&#92;.com s3.service.consul.
        answer auto
    }
    forward . 172.19.65.41:8600 172.19.65.42:8600 172.19.65.43:8600 172.19.65.44:8600
    log
    errors
    debug
}

example.hosts s3.ecmp.cephlab.com {
    hosts {
        10.67.67.67 s3.ecmp.cephlab.com
        10.67.67.67 nixl.s3.ecmp.cephlab.com
        fallthrough
    }
    whoami
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;testing-dns-balancing&quot;&gt;Testing DNS balancing &lt;a class=&quot;link-anchor&quot; href=&quot;#testing-dns-balancing&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;To validate that the Hashicorp Consul and CoreDNS based approach is functioning
properly, we can test DNS resolution of our object endpoint&#39;s FQDN.
Note that we’re seeing 4 records returned, which is exactly what we want.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[cephuser@ceph-osd01 ~]$ dig s3.cephlab.com

; &amp;lt;&amp;lt;&amp;gt;&amp;gt; DiG 9.16.23-RH &amp;lt;&amp;lt;&amp;gt;&amp;gt; s3.cephlab.com
;; global options: +cmd
;; Got answer:
;; -&amp;gt;&amp;gt;HEADER&amp;lt;&amp;lt;- opcode: QUERY, status: NOERROR, id: 12051
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;s3.cephlab.com.                        IN      A

;; ANSWER SECTION:
s3.cephlab.com.         0       IN      A       172.19.65.41
s3.cephlab.com.         0       IN      A       172.19.65.42
s3.cephlab.com.         0       IN      A       172.19.65.43
s3.cephlab.com.         0       IN      A       172.19.65.44

;; Query time: 1 msec
;; SERVER: 172.19.65.41#53(172.19.65.41)
;; WHEN: Tue Nov 04 12:33:03 PST 2025
;; MSG SIZE  rcvd: 163
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;baseline-performance&quot;&gt;Baseline performance &lt;a class=&quot;link-anchor&quot; href=&quot;#baseline-performance&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To establish the baseline performance of the storage cluster before introducing
vLLM and LMCache, we assessed the performance using
&lt;a href=&quot;https://github.com/breuner/elbencho&quot;&gt;elbencho&lt;/a&gt; to generate load
from the Gaudi3 GPU host and direct it towards the Ceph S3 endpoints. We used a
62MB block size to match the expected size of KV cache blocks being persisted by
LMCache. The results show that we’re able to multiplex connections across the
concentrator endpoints on each host and drive a considerable amount of S3
traffic from even a single host, topping out at nearly 60 GB/s.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/elbencho.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;vllm&quot;&gt;vLLM &lt;a class=&quot;link-anchor&quot; href=&quot;#vllm&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;At the time of our testing the vLLM production stack did not support our
end-to-end workflows, so we created customized vLLM container images that
incorporated an LMCache development release, including one that also pulled in the
latest &lt;a href=&quot;https://github.com/vllm-project/vllm-gaudi&quot;&gt;vllm-gaudi&lt;/a&gt; development work for our testing.&lt;/p&gt;
&lt;p&gt;AMD Container&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;vLLM:&lt;/li&gt;
&lt;li&gt;LMCache:&lt;/li&gt;
&lt;li&gt;NIXL:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Gaudi Container&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;vLLM:&lt;/li&gt;
&lt;li&gt;LMCache:&lt;/li&gt;
&lt;li&gt;NIXL:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below you will find the configuration files and command line arguments we used
to run vLLM and LMCache together.&lt;/p&gt;
&lt;p&gt;.aws/credentials&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[lmcache]
region = default
endpoint_url = http://s3.cephlab.com:80
aws_access_key_id = xxx
aws_secret_access_key = yyy
response_checksum_validation = when_required
preferred_transfer_client = crt
&lt;/code&gt;&lt;/pre&gt;
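&lt;p&gt;With the credentials profile in place, a quick way to confirm that the gateways are reachable and that the bucket LMCache writes into exists is to use the AWS CLI against the same endpoint (the bucket name is taken from the LMCache configuration below):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create the bucket used by LMCache (if it does not already exist) and list buckets
aws --profile lmcache --endpoint-url http://s3.cephlab.com s3 mb s3://lmcache
aws --profile lmcache --endpoint-url http://s3.cephlab.com s3 ls
&lt;/code&gt;&lt;/pre&gt;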
&lt;p&gt;lmcache-ceph.yaml&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chunk_size: 256
local_cpu: False
max_local_cpu_size: 100
remote_url: &amp;quot;s3://lmcache.s3.cephlab.com&amp;quot;
save_unfull_chunk: False
enable_async_loading: True
remote_serde: &amp;quot;naive&amp;quot;
blocking_timeout_secs: 100
extra_config:
  s3_max_io_concurrency: 1024
  s3_max_inflight_reqs: 1024
  s3_prefer_http2: False
  s3_region: &amp;quot;default&amp;quot;
  s3_enable_s3express: False
  save_chunk_meta: False
  s3_file_prefix: &amp;quot;test&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;lmcache-nixl-ceph.yaml&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chunk_size: 512
local_cpu: false
max_local_cpu_size: 50
remote_serde: &amp;quot;naive&amp;quot;
nixl_buffer_size: 1073741824
nixl_buffer_device: cpu
extra_config:
  enable_nixl_storage: true
  nixl_backend: OBJ
  nixl_pool_size: 512
  nixl_backend_params:
    endpoint_override: http://s3.cephlab.com
    access_key: CR98FOT054QZJ60NR7E3
    secret_key: 15CTFkiAdwPkkiSh4gOlQ5zF14KZ0uCnZloYVo3w
    scheme: http
    region: default
    req_checksum: required
    bucket: lmcache
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;lmcache-dram.yaml&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chunk_size: 256
local_cpu: True
max_local_cpu_size: 50
save_unfull_chunk: False
enable_async_loading: True
remote_serde: &amp;quot;naive&amp;quot;
blocking_timeout_secs: 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Starting vLLM&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export LMCACHE_CONFIG_FILE=&amp;quot;/root/lmcache-nixl-s3.yaml&amp;quot;
export LMCACHE_USE_EXPERIMENTAL=True
export PYTHONHASHSEED=67
export AWS_PROFILE=&#39;lmcache&#39;
vllm serve Qwen/Qwen3-32B  &#92;
       --gpu-memory-utilization 0.55 &#92;
       --rope-scaling &#39;{&amp;quot;rope_type&amp;quot;:&amp;quot;yarn&amp;quot;,&amp;quot;factor&amp;quot;:4.0,&amp;quot;original_max_position_embeddings&amp;quot;:32768}&#39; &#92;
       --max-model-len 131072 &#92;
       --kv-transfer-config &#39;{&amp;quot;kv_connector&amp;quot;:&amp;quot;LMCacheConnectorV1&amp;quot;,&amp;quot;kv_role&amp;quot;:&amp;quot;kv_both&amp;quot;,&amp;quot;kv_parallel_size&amp;quot;:&amp;quot;16&amp;quot;}&#39; &#92;
       --tensor-parallel-size 2
&lt;/code&gt;&lt;/pre&gt;
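&lt;p&gt;Once vLLM is up it exposes the usual OpenAI-compatible API on port 8000 (the port the benchmark below targets), so a quick smoke test of the endpoint before running any benchmarks can look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List the models served by this vLLM instance
curl -s http://localhost:8000/v1/models

# Issue a small completion request against the served model
curl -s http://localhost:8000/v1/completions &#92;
     -H &#39;Content-Type: application/json&#39; &#92;
     -d &#39;{&amp;quot;model&amp;quot;: &amp;quot;Qwen/Qwen3-32B&amp;quot;, &amp;quot;prompt&amp;quot;: &amp;quot;Hello&amp;quot;, &amp;quot;max_tokens&amp;quot;: 16}&#39;
&lt;/code&gt;&lt;/pre&gt;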
&lt;p&gt;For the Gaudi3 accelerator testing we set the following additional environmental
variables:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PT_HPU_GPU_MIGRATION=1
VLLM_USE_V1=1
VLLM_SKIP_WARMUP=True
VLLM_EXPONENTIAL_BUCKETING=False
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;benchmark&quot;&gt;Benchmark &lt;a class=&quot;link-anchor&quot; href=&quot;#benchmark&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We wanted to characterize the reduction in time-to-first-token for a 100% cache
hit rate from remote storage with Ceph across various context lengths, and chart
it relative to computational prefill. For this we selected the LMCache
&lt;a href=&quot;https://github.com/LMCache/LMCache/blob/dev/benchmarks/long_doc_qa/long_doc_qa.py&quot;&gt;long_doc_qa.py&lt;/a&gt; benchmark. We developed the following methodology for TTFT data collection:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start vLLM&lt;/li&gt;
&lt;li&gt;Run long_doc_qa.py and record TTFT for the warm-up round (computational
prefill result)&lt;/li&gt;
&lt;li&gt;Restart vLLM&lt;/li&gt;
&lt;li&gt;Run long_doc_qa.py and record TTFT for the warm-up round (KV cache hit from
remote storage result)&lt;/li&gt;
&lt;li&gt;Stop vLLM&lt;/li&gt;
&lt;li&gt;Remove cache blocks from remote storage&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By restarting vLLM in step 3 we ensure that the results are not skewed by KV
caching in GPU HBM or CPU memory, and by stopping vLLM and removing cache blocks
from remote storage we ensure that each subsequent context length is not
benefitting from remote storage KV caching from the previous context length.
With this methodology all KV caches are cold at the beginning of each test,
except for remote storage KV caching which we want to measure the benefit of in
step 4.&lt;/p&gt;
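&lt;p&gt;Step 6 can be as simple as emptying the bucket that LMCache persists its KV blocks into; a sketch, assuming the lmcache bucket and the credentials profile shown earlier:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Remove all persisted KV cache blocks between context-length sweeps
aws --profile lmcache --endpoint-url http://s3.cephlab.com s3 rm s3://lmcache --recursive
&lt;/code&gt;&lt;/pre&gt;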
&lt;p&gt;long_doc_qa.py example command line&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 ~/LMCache/benchmarks/long_doc_qa/long_doc_qa.py &#92;
      --model Qwen/Qwen3-32B &#92;
      --port 8000 &#92;
      --num-documents 1 &#92;
      --document-length ${len} &#92;
      --output-len 100 &#92;
      --repeat-count 1 &#92;
      --repeat-mode interleave &#92;
      --max-inflight-requests 1 &#92;
      --output results/ttft_${len}.out
&lt;/code&gt;&lt;/pre&gt;
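&lt;p&gt;The &lt;code&gt;${len}&lt;/code&gt; variable above comes from the sweep over context lengths. A minimal wrapper, with hypothetical document lengths that should be substituted with the context lengths under test:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical sweep values; substitute the context lengths being measured
for len in 8000 16000 32000 64000 128000; do
    python3 ~/LMCache/benchmarks/long_doc_qa/long_doc_qa.py &#92;
          --model Qwen/Qwen3-32B --port 8000 &#92;
          --num-documents 1 --document-length ${len} &#92;
          --output-len 100 --repeat-count 1 --repeat-mode interleave &#92;
          --max-inflight-requests 1 --output results/ttft_${len}.out
done
&lt;/code&gt;&lt;/pre&gt;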
&lt;h2 id=&quot;results&quot;&gt;Results &lt;a class=&quot;link-anchor&quot; href=&quot;#results&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;intel-gaudi-3-results&quot;&gt;Intel Gaudi 3 Results &lt;a class=&quot;link-anchor&quot; href=&quot;#intel-gaudi-3-results&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/gaudi3-tp2-sweep-qwen.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/gaudi3-tp2-sweep-llama.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/gaudi3-tp-charts.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;amd-mi300x-results&quot;&gt;AMD MI300X Results &lt;a class=&quot;link-anchor&quot; href=&quot;#amd-mi300x-results&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/amd-tp1-sweep-qwen.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/amd-tp-charts.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;We measured a considerable reduction in TTFT with both Intel Gaudi3 and AMD MI300X
accelerators, with the largest measured speed-up being a 23x reduction. This testing
also illustrates how KV caching can reduce TTFT more than using tensor parallelism
to spread prefill across multiple GPUs in a system, and that combining these
techniques can deliver the lowest TTFT. It’s also worth pointing out that in
addition to reducing TTFT, prefix caching derives additional value by conserving
GPU cycles for decode, potentially reducing time-per-output-token (TPOT).&lt;/p&gt;
&lt;h2 id=&quot;what&#39;s-next%3F&quot;&gt;What&#39;s next? &lt;a class=&quot;link-anchor&quot; href=&quot;#what&#39;s-next%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We shared our results with the llm-d team at Red Hat and have started to work
with them to commodify KV caching by establishing KV caching with Ceph as a
&lt;a href=&quot;https://www.redhat.com/en/topics/ai/what-is-llm-d#what-are-well-lit-paths&quot;&gt;well-lit
path&lt;/a&gt;. We believe that our approach is perhaps the most accessible
because it uses standard object protocols like S3 and standard TCP/IP networking,
works with a variety of accelerators from different vendors, and because Ceph
object storage is ubiquitously deployed in OpenShift clusters through OpenShift Data
Foundation and IBM Fusion. Our next phase of testing will utilize llm-d, with
the GPU hosts serving as worker nodes, and will explore more sophisticated
scenarios like PD disaggregation and cache blending.&lt;/p&gt;
&lt;p&gt;Finally, we&#39;d like to thank Supermicro for providing the environment for these
testing efforts. If you have any questions about data or AI workloads for Ceph,
please &lt;a href=&quot;mailto:kbader@ibm.com&quot;&gt;reach out&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Benchmarking Performance with CBT: Running and Analysing a Performance Test. Part Three</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/" />
    <updated>2025-12-08T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/">&lt;p&gt;CBT Performance Benchmarking - Part 3. How do we run and analyse a performance test?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;&lt;a id=&quot;outline&quot;&gt;&lt;/a&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 3&lt;/strong&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#intro&quot;&gt;Introduction: Running a performance benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read&quot;&gt;How to read response time curves&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#values&quot;&gt;What values to read from a response curve?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;introduction&quot;&gt;&lt;a id=&quot;intro&quot;&gt;&lt;/a&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we have created our erasure coded (EC) cluster (from &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt;) and defined our YAML file and workloads (from &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt;), we can now start a CBT performance benchmark test.&lt;/p&gt;
&lt;p&gt;This part will cover:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Running a performance test&lt;/li&gt;
&lt;li&gt;Generating a performance report&lt;/li&gt;
&lt;li&gt;How to read response time curves&lt;/li&gt;
&lt;li&gt;Comparing performance benchmarks&lt;/li&gt;
&lt;li&gt;Running a performance test with an OSD down&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 1: Run the performance test&lt;/summary&gt;
&lt;p&gt;First, clone the &lt;a href=&quot;https://github.com/ceph/cbt&quot;&gt;CBT GitHub repository&lt;/a&gt; into a directory of your choice on the machine you are using and &lt;code&gt;cd&lt;/code&gt; into it.&lt;/p&gt;
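&lt;p&gt;For example (&lt;code&gt;/cbt&lt;/code&gt; is simply the checkout location used throughout this post):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/ceph/cbt.git /cbt
cd /cbt
&lt;/code&gt;&lt;/pre&gt;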
&lt;p&gt;This is an example of the command to run a CBT performance test:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;  python /cbt/cbt.py -a /tmp/cbt -c /example/ceph.conf /example/&amp;lt;yaml_file&amp;gt; 2&amp;gt;&amp;amp;1 | tee /tmp/cbt.out
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You specify the location of the CBT script (&lt;code&gt;cbt.py&lt;/code&gt;), an archive directory where your results will be generated (&lt;code&gt;/tmp/cbt&lt;/code&gt;), and the Ceph configuration file (&lt;code&gt;/example/ceph.conf&lt;/code&gt;) that allows CBT to connect to the cluster. Finally, we specify our (&lt;code&gt;yaml_file&lt;/code&gt;), which outlines what tests/workloads will be run.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 2: Generate a performance report&lt;/summary&gt;
&lt;p&gt;Once you have run the performance test by following &lt;strong&gt;Step 1&lt;/strong&gt;, your result files will be output at the location you specified in &lt;strong&gt;Step 1&lt;/strong&gt; after the archive argument (&lt;code&gt;-a&lt;/code&gt;). For me, the previous command referenced &lt;code&gt;/tmp/cbt&lt;/code&gt;, so my results are there.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can now copy these result files to a new directory if you wish. In this case I want them within &lt;code&gt;/perftests/my_test&lt;/code&gt;: I like to keep a directory of all my CBT test results, and because I delete &lt;code&gt;/tmp/cbt&lt;/code&gt; before each performance test, that is not a suitable place to keep them stored. So I would do this, for example:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cp -r /tmp/cbt/* /perftests/my_test
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Next, it is a case of generating the performance report, which can be done by the following command for myself in this example:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;PYTHONPATH=/cbt/ /cbt/tools/generate_performance_report.py --archive /perftests/my_test --output_directory /perftests/my_test_results --create_pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the command above, &lt;code&gt;PYTHONPATH&lt;/code&gt; points at the CBT checkout again, and we then invoke the script that generates the performance report (&lt;code&gt;generate_performance_report.py&lt;/code&gt;). I pass the directory that holds the results from the performance run (&lt;code&gt;/perftests/my_test&lt;/code&gt; in this case) as the archive, and a desired &lt;code&gt;output_directory&lt;/code&gt;, which is where the files for the performance report will be written.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Side note:&lt;/strong&gt; you do not need to have already created the specified &lt;code&gt;output_directory&lt;/code&gt; you see in the command above; it will be created automatically for you if need be. After these steps, you should now have the report files inside your new &lt;code&gt;output_directory&lt;/code&gt;, the &lt;code&gt;my_test_results&lt;/code&gt; folder in my case. You have now successfully generated your &lt;strong&gt;performance report&lt;/strong&gt;! I normally upload these result files to a GitHub repository to store and view the reports.&lt;/p&gt;
&lt;p&gt;The next section will go over the performance report generated, and how to understand your own one.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h2 id=&quot;how-to-read-response-time-curves&quot;&gt;&lt;a id=&quot;read&quot;&gt;&lt;/a&gt;How to read response time curves &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-read-response-time-curves&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that you have generated the performance report for your test, you may be looking at the PDF or MD file and be slightly confused by the graphs shown. This section will cover how we read the response time curves and reach conclusions based on the data points.&lt;/p&gt;
&lt;p&gt;So let&#39;s go back to our example CBT test run and the question we started with: &lt;strong&gt;&amp;quot;Does using the CLAY erasure code plugin give better performance than using the default Jerasure plugin?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I generated a performance report for a &lt;strong&gt;Jerasure&lt;/strong&gt; plugin EC pool, the results can be found &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/24th_Sep_Jerasure_4%2B2_results/performance_report_250924_094912.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I then did the same for the &lt;strong&gt;CLAY&lt;/strong&gt; plugin, &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/13th_Oct_Clay_4%2B2%2B5_results/performance_report_251013_094658.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Within the generated reports above you will see hockey stick curves plotted to show the performance of each configuration.&lt;/p&gt;
&lt;h3 id=&quot;so-how-do-we-read-the-curves-generated%3F&quot;&gt;So how do we read the curves generated? &lt;a class=&quot;link-anchor&quot; href=&quot;#so-how-do-we-read-the-curves-generated%3F&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Here is an example of a curve generated within a performance report:
&lt;img src=&quot;images/example_curve.png&quot; alt=&quot;alt text&quot; title=&quot;How to read graphs&quot;&gt;
Each point on the curve corresponds to one of the &lt;code&gt;total_iodepth&lt;/code&gt; values specified for the test. We can find these by checking the YAML file we used for the test, and they are also listed within the performance report under the “Configuration yaml” section. For the above example they are:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;total_iodepth: [ 2, 4, 8, 12, 16, 24, 32, 64, 96, 128, 192, 288, 384 ] 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The vertical red lines (error bars) show the amount of standard deviation/variance in the performance for that specific point on the curve. If the standard deviations are small, it shows that performance is stable with that workload. As the response curve starts to bend upwards, performance becomes more variable and the standard deviation increases.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For an FIO workload, CBT will start 1 instance of FIO per volume.&lt;/li&gt;
&lt;li&gt;It&#39;s also worth noting that the graphs produced in the reports do not include results from the &amp;quot;ramp&amp;quot; period.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The post-processing tools sum the IOPS across all the volumes to generate a total IOPS figure and calculate an average latency over all the volumes. This IOPS vs latency pair is then plotted as the point on the response curve for that specific iodepth.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;what-values-to-read-from-a-response-curve%3F&quot;&gt;&lt;a id=&quot;values&quot;&gt;&lt;/a&gt;What values to read from a response curve? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-values-to-read-from-a-response-curve%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;If you know how much I/O your application is generating then you can use the response curve to work out what latency you should expect&lt;/li&gt;
&lt;li&gt;If you want to see the maximum amount of I/O that the storage controller can process, look for the rightmost point on the curve and find the value on the X axis.&lt;/li&gt;
&lt;li&gt;If you have a latency requirement such as all I/O must complete in under 2ms then you can find out the maximum I/Os the storage controller can do by finding the point on the curve at this latency.&lt;/li&gt;
&lt;li&gt;Most of the time you don&#39;t know exactly how much I/O an application is going to generate, and want to ensure that if there are any peaks or bursts in the amount of I/O that this doesn&#39;t cause a big change in latency. Where the response curve is flat there will be little change in latency if the amount of I/O varies, where the response curve is bending upwards a fairly small variation in amount of I/O can have a big impact on latency. Choosing a point on the response curve just before it starts increasing too rapidly gives a good indication of the maximum amount of I/O you can do with stable performance.&lt;/li&gt;
&lt;li&gt;Most users do not want to operate above around 70% of maximum throughput, as staying below this provides some headroom for expansion and allows sudden bursts in a workload to be absorbed without a large latency penalty.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As mentioned in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; of the blog, the perfect response curve would be a flat horizontal line showing constant latency as the quantity of I/O increases until we reach the saturation point where the system can handle no more I/O. This is because it highlights that performance is consistent with less variance.&lt;/p&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 3: Generating a comparison report&lt;/summary&gt;
&lt;p&gt;With CBT, as well as performance reports, we can also generate &lt;strong&gt;comparison reports&lt;/strong&gt; quickly. Now that I have run the tests for &lt;strong&gt;CLAY&lt;/strong&gt; and &lt;strong&gt;Jerasure&lt;/strong&gt;, we can generate a comparison report between them. I will use the following command to do so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;PYTHONPATH=/cbt/ /cbt/tools/generate_comparison_performance_report.py --baseline /perftests/jerasure_test/ --archives /perftests/clay_test/ --output_directory /perftests/clay_vs_jerasure_comparison --create_pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the above command we have to specify what our baseline is: we use the &lt;strong&gt;Jerasure&lt;/strong&gt; test folder as the &lt;strong&gt;baseline curve&lt;/strong&gt;, as shown above. Our &lt;strong&gt;archive curve&lt;/strong&gt; will be our &lt;strong&gt;CLAY&lt;/strong&gt; test folder. It is important that in the above command you pass the &lt;strong&gt;test&lt;/strong&gt; folders for Jerasure and CLAY, &lt;strong&gt;NOT&lt;/strong&gt; the &lt;strong&gt;results&lt;/strong&gt; folders that were generated in the previous steps. The above command will generate a comparison report in our specified &lt;code&gt;output_directory&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You have now successfully generated your &lt;strong&gt;comparison report&lt;/strong&gt;! Mine can be found &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/Jerasure_Vs_Clay_comparison/comparitive_performance_report_251015_142011.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;basic-analysis-of-the-comparison-report%3A&quot;&gt;Basic analysis of the comparison report: &lt;a class=&quot;link-anchor&quot; href=&quot;#basic-analysis-of-the-comparison-report%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let&#39;s first give a bit of background on our two erasure coding plugins: &lt;strong&gt;Jerasure&lt;/strong&gt; is a generic Reed-Solomon erasure coding library; it is matrix-based and not CPU-optimised, and it is fairly balanced between read and write. &lt;strong&gt;CLAY&lt;/strong&gt; is designed for faster recovery at the cost of a more complicated write path. So we are expecting to see potentially &lt;strong&gt;better&lt;/strong&gt; performance from CLAY when it comes to &lt;strong&gt;smaller&lt;/strong&gt; IO sizes, but as the writes get &lt;strong&gt;larger&lt;/strong&gt; we may see a decline in performance from CLAY, leading to better Jerasure results. Furthermore, in terms of reads we expect fairly similar results across the board as the two are implemented very similarly; the main difference is when it comes to the writes.&lt;/p&gt;
&lt;p&gt;So let&#39;s now take a look at our comparison report, starting with the smaller workloads. First, &lt;strong&gt;4K Random Reads&lt;/strong&gt;; this is the corresponding graph:
&lt;img src=&quot;images/4k_rand_read.png&quot; alt=&quot;alt text&quot; title=&quot;4k random read curve&quot;&gt;
As shown by the diagram, the orange curve is our CLAY EC pool, and the blue curve is our Jerasure EC pool. We can see that for 4K random reads there is very little difference in performance, as we expected. Both curves have almost identical latencies and IOPS.&lt;/p&gt;
&lt;p&gt;We can also take a look at the &lt;strong&gt;4K Random Writes&lt;/strong&gt;:
&lt;img src=&quot;images/4k_rand_write.png&quot; alt=&quot;alt text&quot; title=&quot;4k random write curve&quot;&gt;
The performance is similar until we get to the saturation point around &lt;strong&gt;14,000&lt;/strong&gt; IOPS, where we can see latency skyrocket for both Jerasure and CLAY. The IOPS for &lt;strong&gt;Jerasure&lt;/strong&gt; are marginally better than CLAY at this point, but nothing substantial.&lt;/p&gt;
&lt;p&gt;So overall, we can see at small workloads there is very similar performance between &lt;strong&gt;Jerasure&lt;/strong&gt; and &lt;strong&gt;CLAY&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Let&#39;s now move on to larger workloads, starting with &lt;strong&gt;1024K Sequential Read&lt;/strong&gt;:
&lt;img src=&quot;images/1024k_seq_read.png&quot; alt=&quot;alt text&quot; title=&quot;1024k Sequential Read curve&quot;&gt;
Once again the two curves barely differ and they follow very similar paths, which was expected. This is because for a normal read, Ceph only needs to fetch data chunks (not parity chunks). Both Jerasure and CLAY are practically just returning the stored object; there is no real difference unless a failure occurs.&lt;/p&gt;
&lt;p&gt;Now let&#39;s look at the &lt;strong&gt;1024K Sequential Write&lt;/strong&gt;:
&lt;img src=&quot;images/1024k_seq_write.png&quot; alt=&quot;alt text&quot; title=&quot;1024k Sequential Write curve&quot;&gt;
When we look at the writes we see that &lt;strong&gt;CLAY&lt;/strong&gt; has 20-60% higher latency, with throughput dropping compared to &lt;strong&gt;Jerasure&lt;/strong&gt;. This is likely due to extra CPU and network demands in CLAY. Larger writes mean bigger encoding matrices/layers, and CLAY has more complexity per write than Jerasure, likely leading to the higher latency shown.&lt;/p&gt;
&lt;p&gt;Our sequential write benchmarks show that Jerasure delivers more consistent write performance across all the block sizes, while CLAY is more volatile, performing better at some smaller sizes but much worse at large sequential writes. This reflects CLAY’s design priorities: it is optimised for reduced recovery bandwidth rather than raw write performance.&lt;/p&gt;
&lt;p&gt;This means that if your I/O workload is mainly large sequential reads, for example a &lt;strong&gt;data lake&lt;/strong&gt; for AI training, then switching to CLAY isn&#39;t going to affect performance. However, if your I/O workload is mainly heavy sequential writes, for example &lt;strong&gt;storage archives or backups&lt;/strong&gt;, then switching to CLAY will have a substantial negative performance impact, as shown by the diagrams.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 4: Running a test with OSD down&lt;/summary&gt;
&lt;p&gt;So far we have compared a CLAY and a Jerasure EC pool with one another. The results supported our hypothesis that Jerasure would likely perform better for normal I/O, because CLAY uses more complex computations in order to make data recovery cheaper. Now we will do an additional run and deliberately kill an OSD prior to running the CBT test, to simulate a real-world failure and to see how the performance of the two plugins differs when it comes to OSD recovery.&lt;/p&gt;
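&lt;p&gt;How you take an OSD down depends on how the cluster is deployed. As a rough sketch, on a cluster with systemd-managed OSD daemons (not containerised), something like the following stops one OSD and keeps the cluster degraded for the duration of the test rather than backfilling:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Optionally prevent the down OSD from being marked out and backfilled during the test
ceph osd set noout

# Stop one OSD daemon (osd.0 here is just an example id)
sudo systemctl stop ceph-osd@0

# Confirm the OSD is reported down before starting the CBT run
ceph osd tree | grep down
&lt;/code&gt;&lt;/pre&gt;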
&lt;p&gt;The following comparison report shows CLAY and Jerasure curves where both of the plugins have 1 OSD down. The report can be found &lt;a href=&quot;https://github.com/Jakesquelch/cbt_results/blob/main/Blog/Jerasure_Vs_Clay_down_comparison/comparitive_performance_report_251015_154505.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We will now take a look at &lt;strong&gt;1024K Sequential Read&lt;/strong&gt; from the above comparison report:
&lt;img src=&quot;images/down_1024_seq_read.png&quot; alt=&quot;alt text&quot; title=&quot;1024k sequential read&quot;&gt;
We expect CLAY to have better performance here due to its supposedly more efficient data recovery. However, this is not the case, as shown by the diagram above.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h2 id=&quot;summary&quot;&gt;&lt;a id=&quot;summary&quot;&gt;&lt;/a&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Within this part we have used CBT to successfully compare Jerasure and CLAY for a variety of different workloads. We have generated results that are repeatable and show that for both good-path I/O and I/O when there is an OSD down (hence data needs to be reconstructed using erasure coding), there is no benefit to using CLAY. In fact, there are extra overheads which mean that performance may be worse when using CLAY.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a id=&quot;conclusion&quot;&gt;&lt;/a&gt;Conclusion &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In conclusion, this blog has demonstrated how you can take a CBT performance benchmark run from start to finish, generating performance reports along the way and enabling analysis and comparison of performance. We used &lt;strong&gt;CLAY&lt;/strong&gt; and &lt;strong&gt;Jerasure&lt;/strong&gt; as an example of how to easily do a performance benchmark, but sometimes the results can be unexpected and raise more questions than they answer. This can lead to further experiments to deep-dive into why certain results occurred, and this is what I&#39;ll be doing in &lt;strong&gt;Part 4&lt;/strong&gt; of the blog, which will be coming in the near future. &lt;strong&gt;Part 4&lt;/strong&gt; will provide more detailed analysis and an IO breakdown for CLAY and Jerasure to give more clarity on why CLAY performance was worse!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;#outline&quot;&gt;Links to previous parts of the blog series&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Benchmarking Performance with CBT: Defining YAML Contents. Part Two</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/" />
    <updated>2025-12-04T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/">&lt;p&gt;CBT Performance Benchmarking - Part 2. What is a YAML file and how do we use them within CBT?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 2&lt;/strong&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#intro&quot;&gt;Introduction: What goes into the YAML file?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#key&quot;&gt;Key sections of the YAML file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#express&quot;&gt;Expressing queue depth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#diff&quot;&gt;Why do we have lots of different IO values in the yaml?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;introduction%3A-what-goes-into-the-yaml-file%3F&quot;&gt;&lt;a id=&quot;intro&quot;&gt;&lt;/a&gt;Introduction: What goes into the YAML file? &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-what-goes-into-the-yaml-file%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Once you have finished &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; you should have an erasure coded Ceph cluster set up, and you&#39;re nearly ready to run a CBT test on it! However, before we can do that, we need to understand what &lt;strong&gt;YAML contents&lt;/strong&gt; we want.&lt;/p&gt;
&lt;p&gt;The YAML file defines what tests we will run on the cluster.&lt;/p&gt;
&lt;p&gt;We could briefly describe the YAML file as having 3 main sections to it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;cluster&lt;/code&gt; section: Where the YAML describes how CBT communicates with the cluster, e.g. user ID, clients, OSDs, Ceph binary paths, etc.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;monitoring_profiles&lt;/code&gt; section: Where the YAML describes the monitoring tools used (collectl in our case) to collect statistics.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;benchmarks&lt;/code&gt; section: Where the benchmarking technique is specified (librbdfio in our case), and also where the workloads are placed.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id=&quot;key-sections-of-the-yaml-file%3A&quot;&gt;&lt;a id=&quot;key&quot;&gt;&lt;/a&gt;Key sections of the YAML file: &lt;a class=&quot;link-anchor&quot; href=&quot;#key-sections-of-the-yaml-file%3A&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;details&gt;
&lt;summary&gt;Cluster&lt;/summary&gt; 
&lt;p&gt;Here you will be describing your Ceph cluster configuration.&lt;/p&gt;
&lt;p&gt;The reason the &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;clients&lt;/code&gt;, &lt;code&gt;osds&lt;/code&gt;, &lt;code&gt;mons&lt;/code&gt;, etc. fields are required is that CBT uses a parallel distributed shell (&lt;strong&gt;pdsh&lt;/strong&gt;) with SSH to log in to the various entities of the cluster that have been defined in the cluster section. This enables &amp;quot;ceph&amp;quot; commands and also the ability to start up the benchmark tool (such as &lt;strong&gt;FIO&lt;/strong&gt;) on the client endpoints (which are defined in the &amp;quot;&lt;strong&gt;clients&lt;/strong&gt;&amp;quot; field).&lt;/p&gt;
&lt;p&gt;A typical use case of Ceph is that there is a &lt;strong&gt;separately attached&lt;/strong&gt; host server dedicated for reading and writing data to the storage. Therefore it is possible to run CBT on a completely separate server from the cluster itself, and the performance data can be collected on the attached server. So the separately attached server is orchestrating the starting and stopping of the benchmark tools on the Ceph cluster.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important side note:&lt;/strong&gt; A requirement of CBT is that passwordless SSH has to be &lt;code&gt;enabled&lt;/code&gt; from the server running CBT to the Ceph nodes defined in the &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;clients&lt;/code&gt; and &lt;code&gt;osds&lt;/code&gt; fields.&lt;/p&gt;
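&lt;p&gt;If passwordless SSH is not already configured, a minimal way to set it up from the server running CBT is shown below (the user and host names are the same placeholders used in the example that follows):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Generate a key on the server running CBT (skip if one already exists)
ssh-keygen -t ed25519 -N &#39;&#39; -f ~/.ssh/id_ed25519

# Copy the public key to every node listed under head, clients and osds
ssh-copy-id exampleUser@exampleHostAddress

# Verify that a non-interactive login works
ssh exampleUser@exampleHostAddress true
&lt;/code&gt;&lt;/pre&gt;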
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;cluster:
  user: &#39;exampleUser&#39; # the SSH user ID that is going to be used for accessing the ceph cluster
  head: &amp;quot;exampleHostAddress&amp;quot; # node where general ceph commands are run
  clients: [&amp;quot;exampleHostAddress&amp;quot;] # nodes that will run benchmarks or other client tools
  osds: [&amp;quot;exampleHostAddress&amp;quot;] # nodes where OSDs will live
  mons: # nodes where mons will live
    exampleHostAddress:
      a: &amp;quot;exampleIPAddress&amp;quot;
  mgrs:
    exampleHostAddress:
      a: ~
  osds_per_node: 8
  conf_file: &#39;/etc/ceph/ceph.conf&#39;
  clusterid: &amp;quot;ceph&amp;quot;
  tmp_dir: &amp;quot;/tmp/cbt&amp;quot;
  ceph-osd_cmd: &amp;quot;/usr/bin/ceph-osd&amp;quot;
  ceph-mon_cmd: &amp;quot;/usr/bin/ceph-mon&amp;quot;
  ceph-run_cmd: &amp;quot;/usr/bin/ceph-run&amp;quot;
  rados_cmd: &amp;quot;/usr/bin/rados&amp;quot;
  ceph_cmd: &amp;quot;/usr/bin/ceph&amp;quot;
  rbd_cmd: &amp;quot;/usr/bin/rbd&amp;quot;
  ceph-mgr_cmd: &amp;quot;/usr/bin/ceph-mgr&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Monitoring Profiles&lt;/summary&gt; 
&lt;p&gt;In our example, we will be using &lt;strong&gt;collectl&lt;/strong&gt; to collect statistics.&lt;/p&gt;
&lt;p&gt;In more detail, the benchmark IO exerciser (&lt;strong&gt;FIO&lt;/strong&gt;) starts up. When the &lt;code&gt;ramp&lt;/code&gt; period expires, the monitoring tool (&lt;strong&gt;collectl&lt;/strong&gt;) is started to begin statistics collection, so that no data is collected during the warmup/ramp period. Once the &lt;code&gt;time&lt;/code&gt; period of the IO exerciser has expired, CBT stops the monitoring tool.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;monitoring_profiles:
  collectl:
     args: &#39;-c 18 -sCD -i 10 -P -oz -F0 --rawtoo --sep &amp;quot;;&amp;quot; -f {collectl_dir}&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Benchmark module&lt;/summary&gt; 
&lt;p&gt;In our example, we will be using &lt;strong&gt;librbdfio&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;benchmarks:
  librbdfio:
    rbdname: &amp;quot;test-image&amp;quot;
    poolname: &amp;quot;rbd_replicated&amp;quot;
    cmd_path: &#39;/usr/local/bin/fio&#39;
    &amp;lt;insert details here&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now within the &lt;strong&gt;librbdfio&lt;/strong&gt; section you will have to specify some details, including the &lt;strong&gt;volume name&lt;/strong&gt; and &lt;strong&gt;pool name&lt;/strong&gt; you created in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; in Step 5. CBT will append &lt;code&gt;&#39;hostname -f&#39;&lt;/code&gt; followed by a volume ID &lt;code&gt;&#39;-X&#39;&lt;/code&gt; onto the end of your &lt;code&gt;rbdname&lt;/code&gt; stated above, where &lt;code&gt;X&lt;/code&gt; is the volume ID, running from 0 up to the number specified in your &lt;code&gt;volumes_per_client&lt;/code&gt; field (see the &lt;code&gt;Number of volumes&lt;/code&gt; section).&lt;/p&gt;
&lt;p&gt;For example:
&lt;code&gt;rbdname=&amp;quot;test-image&amp;quot;&lt;/code&gt; will use:
&lt;code&gt;--rbdname=test-image-mycephhost1.com-1&lt;/code&gt;, if:
&lt;code&gt;hostname -f&lt;/code&gt; returned: &lt;a href=&quot;http://mycephhost1.com&quot;&gt;mycephhost1.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&#39;s important to have the &lt;code&gt;rbdname&lt;/code&gt; reflect your &lt;strong&gt;volume name&lt;/strong&gt; and the &lt;code&gt;poolname&lt;/code&gt; to reflect your &lt;strong&gt;pool name&lt;/strong&gt; that you used to create the volume. So the example YAML above, follows on from what we did in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt;, here:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;rbd create --pool rbd_replicated --data-pool rbd_erasure --size 10G test-image
&lt;/code&gt;&lt;/pre&gt;
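&lt;p&gt;If you set &lt;code&gt;use_existing_volumes: True&lt;/code&gt; (as in the full example YAML later in this post), the volumes presumably need to already exist under the naming scheme described above. A hypothetical sketch of pre-creating 8 such volumes, reusing the pools from &lt;strong&gt;Part 1&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Hypothetical helper: pre-create volumes matching the &amp;lt;rbdname&amp;gt;-&amp;lt;hostname -f&amp;gt;-&amp;lt;id&amp;gt; naming
host=$(hostname -f)
for i in $(seq 0 7); do
    rbd create --pool rbd_replicated --data-pool rbd_erasure --size 10G &amp;quot;test-image-${host}-${i}&amp;quot;
done
&lt;/code&gt;&lt;/pre&gt;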
&lt;p&gt;Also, the &lt;code&gt;cmd_path&lt;/code&gt; attribute shown above is important; this has to be the path where FIO is located on the client driving the IO.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h3 id=&quot;other-important-sections-of-the-yaml-file%3A&quot;&gt;Other important sections of the YAML file: &lt;a class=&quot;link-anchor&quot; href=&quot;#other-important-sections-of-the-yaml-file%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;details&gt;
&lt;summary&gt;Length of the benchmark&lt;/summary&gt; 
&lt;p&gt;We configure a &lt;strong&gt;ramp&lt;/strong&gt; and a &lt;strong&gt;time&lt;/strong&gt; for each test:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ramp&lt;/strong&gt; → warmup period where no data is collected.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time&lt;/strong&gt; → duration for which each test will run and collect results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;ramp&lt;/code&gt; time ensures that the I/O test gets into a steady state before the I/O measurement starts. It is quite common for &lt;strong&gt;write&lt;/strong&gt; caches to give unrealistically high performance at the start of the test while the cache fills up, and for &lt;strong&gt;read&lt;/strong&gt; caches to give slightly lower performance at the start of the test while they are being populated. Caches may be implemented in the drives or in the software.&lt;/p&gt;
&lt;p&gt;A very short &lt;code&gt;duration&lt;/code&gt; test will get performance measurements quicker but might not reflect the performance you will see in real use. Reasons for this include background processes that periodically perform work to clean up and issues such as fragmentation that typically become worse the longer the test is run for.
If doing a performance run multiple times gives different results then it is possible that the test duration is too short.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It&#39;s important to note that the specified amount of time and ramp within librbdfio will apply to all workloads elsewhere specified in the YAML.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;However&lt;/strong&gt;, these can be overridden by specifying a time or ramp within a specific workload. You will see an example of this within the precondition section, where time is overridden to 600 (10 minutes).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    time: 90 #in seconds
    ramp: 30 #in seconds
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Volume size&lt;/summary&gt;
&lt;p&gt;Storage systems may give different performance depending on how full they are: where there are fixed-size caches, the cache hit ratio will be higher when testing a smaller quantity of storage, and dealing with fragmentation and garbage collection takes more time when there is less free capacity.
Ideally, configure the performance test to use over 50% of the physical storage to get measurements representative of real-world use. We went over how to calculate the RBD volume size in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt;, so it&#39;s important that your calculation there matches the &lt;code&gt;vol_size&lt;/code&gt; attribute within your YAML file.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ideally, this should match the volume size created in &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; when setting up the EC profile.&lt;/li&gt;
&lt;li&gt;If this value is lower than the RBD image size, then only that amount of data specified will be written.&lt;/li&gt;
&lt;li&gt;If the value is greater, then only the amount of data equivalent to the RBD image size will be written.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    vol_size: 52500 #in megabytes
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Number of volumes&lt;/summary&gt;
&lt;p&gt;This is the same number of volumes you defined in &lt;strong&gt;Part 1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    volumes_per_client: [8]
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Prefill &amp; Precondition &lt;/summary&gt; 
&lt;p&gt;These are discussed more in depth in &lt;strong&gt;part 1&lt;/strong&gt; so please refer to that section if you need a recap.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prefill&lt;/strong&gt; → filling all volumes with sequential writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Precondition&lt;/strong&gt; → adding random writes to simulate real-world workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  librbdfio:
    prefill:
      blocksize: &#39;64k&#39;
      numjobs: 1

    workloads:
      precondition:
        jobname: &#39;precond1rw&#39;
        mode: &#39;randwrite&#39;
        time: 600
        op_size: 65536
        numjobs: [ 1 ]
        total_iodepth: [ 16 ]
        monitor: False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above issues random 64K writes at a total_iodepth of 16 (across all volumes), so with an 8-volume configuration each volume will be using a queue depth of 2.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Note: The time here is overriding the time specified in the librbdfio (global) section of the YAML. Not specifying a time will use the default value specified in the outer (librbdfio) section.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;  
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Workloads&lt;/summary&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;librbdfio:
  workloads:
    Seq32kwrite:
      jobname: &#39;seqwrite&#39;
      mode: &#39;write&#39;
      op_size: 32768
      numjobs: [ 1 ]
      total_iodepth: [ 2, 4, 8, 16, 32, 64, 128, 256, 512, 768 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above is an example of a 32K sequential write workload where we configure different levels of total_iodepth. The test starts with a total_iodepth of 2, with a ramp of 30 seconds and then 90 seconds of IO with stats collected; the same then occurs for total_iodepth 4, and so on through the increasing total_iodepth values. Each of these total_iodepth points is one of the points represented on the curve diagram.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;p&gt;An example of workloads from a YAML file:
&lt;img src=&quot;images/yaml-contents.png&quot; alt=&quot;alt text&quot; title=&quot;Example of YAML workload&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;expressing-queue-depth&quot;&gt;&lt;a id=&quot;express&quot;&gt;&lt;/a&gt;Expressing queue depth &lt;a class=&quot;link-anchor&quot; href=&quot;#expressing-queue-depth&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Firstly, what is &lt;strong&gt;queue depth&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;Queue depth can be defined as the number of concurrent commands that are outstanding.&lt;/p&gt;
&lt;p&gt;There are two ways of expressing the queue depth per volume in CBT:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Using the &lt;code&gt;iodepth&lt;/code&gt; attribute&lt;/li&gt;
&lt;li&gt;Using the &lt;code&gt;total_iodepth&lt;/code&gt; attribute&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;iodepth&lt;/code&gt; &lt;code&gt;n&lt;/code&gt; will use the same queue depth of &lt;code&gt;n&lt;/code&gt; for each &lt;strong&gt;volume&lt;/strong&gt;. For example, if the number of configured &lt;strong&gt;volumes&lt;/strong&gt; is 8 then a setting of &lt;code&gt;iodepth&lt;/code&gt; 2 will generate a &lt;code&gt;total_iodepth&lt;/code&gt; of 16 with each &lt;strong&gt;volume&lt;/strong&gt; having a queue of 2 I/Os. As the queue depth is increased the total amount of queued I/O will increase in &lt;strong&gt;multiples&lt;/strong&gt; of the number of &lt;strong&gt;volumes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;total_iodepth&lt;/code&gt; &lt;code&gt;n&lt;/code&gt; will try to spread &lt;code&gt;n&lt;/code&gt; I/O requests across the set of volumes. For example, if &lt;code&gt;total_iodepth&lt;/code&gt; is 16 and the number of configured &lt;strong&gt;volumes&lt;/strong&gt; is 8, then the queue depth per &lt;strong&gt;volume&lt;/strong&gt; will be 2 (16/8). &lt;code&gt;Total_iodepth&lt;/code&gt; does not need to be exactly divisible by the number of volumes; in these cases some volumes will have a queue depth 1 higher than others (for example, a total_iodepth of 20 across 8 volumes gives four volumes a queue depth of 3 and the other four a queue depth of 2).&lt;/p&gt;
&lt;h3 id=&quot;the-main-drawback-of-iodepth-over-total_iodepth%3A&quot;&gt;The main drawback of iodepth over total_iodepth: &lt;a class=&quot;link-anchor&quot; href=&quot;#the-main-drawback-of-iodepth-over-total_iodepth%3A&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Example: If you have a large number of volumes eg. 32. If you specified:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;  iodepth: [1, 2, 4, 8]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All 32 volumes will be exercised, and therefore this is equivalent to writing a YAML that does:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;total_iodepth: [32, 64, 128, 256]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, your control over the queue depth scales according to the number of volumes you have configured in the YAML.&lt;/p&gt;
&lt;p&gt;Now with &lt;code&gt;total_iodepth&lt;/code&gt;, you can go finer grain than this, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;total_iodepth: [1, 2, 4, 8, 16, 32]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;CBT will only use a subset of the volumes if the &lt;code&gt;total_iodepth&lt;/code&gt; configured is less than the number of volumes in the YAML. Where the number of volumes configured does not divide into &lt;code&gt;total_iodepth&lt;/code&gt; evenly, some volumes will have a different &lt;code&gt;queue depth&lt;/code&gt; than others, but CBT will try to start FIO with an iodepth that is as even as possible over the volumes.&lt;/p&gt;
&lt;p&gt;A good way to look at the relationship between these terms if you&#39;re struggling, is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;total_iodepth = volumes x queue depth&lt;/code&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;why-do-we-have-lots-of-different-io-values-in-the-yaml%3F&quot;&gt;&lt;a id=&quot;diff&quot;&gt;&lt;/a&gt;Why do we have lots of different IO values in the yaml? &lt;a class=&quot;link-anchor&quot; href=&quot;#why-do-we-have-lots-of-different-io-values-in-the-yaml%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We have lots of different IO levels for our writes and reads within the YAML because we want to get test results for all the different scenarios that happen in the real world, and also to test the different bottlenecks that could be holding back the Ceph cluster.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In terms of bottlenecks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Short IOs&lt;/strong&gt; will usually have a CPU bottleneck (this is why the x axis is IOPs for small IOs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Larger IOs&lt;/strong&gt; are more likely to suffer from network and device storage bottlenecks (this is why the x axis turns to Bandwidth for the larger IO sizes)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In terms of real world scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A database, or more generally &lt;strong&gt;OLTP&lt;/strong&gt; (Online Transaction Processing) running on block or file storage generally issues small &lt;strong&gt;random read&lt;/strong&gt; and &lt;strong&gt;write&lt;/strong&gt; I/Os. Often there is a higher percentage of read I/Os to write I/Os so this might be represented by a 70% read, 30% overwrite 4K I/O workload.&lt;/li&gt;
&lt;li&gt;An application creating a backup is likely to make larger &lt;strong&gt;read&lt;/strong&gt; and &lt;strong&gt;write&lt;/strong&gt; I/Os and these are likely to be fairly sequential. If the backup is being written to other storage then the I/O workload will be 100% sequential reads, if the backup is being read from elsewhere and written to the storage the I/O workload will be 100% sequential writes.&lt;/li&gt;
&lt;li&gt;A traditional S3 object store contains large objects that are &lt;strong&gt;read&lt;/strong&gt; and &lt;strong&gt;written sequentially&lt;/strong&gt;. S3 objects are not overwritten so the I/O workload would be a mixture of large sequential reads and writes. While the S3 object may be GB in size, RGW will typically split the S3 object into 4MB chunks.&lt;/li&gt;
&lt;li&gt;S3 object stores can be used to store small objects as well, and some applications store indexes and tables within objects and make &lt;strong&gt;short random&lt;/strong&gt; accesses to data within the object. These applications may generate I/O workloads where the reads are more similar to OLTP workloads.&lt;/li&gt;
&lt;li&gt;A storage cluster is likely to be used by more than one application, each with its own I/O workload. The I/O workload to the cluster can consequently become quite complicated.
Measuring the performance for I/O workloads with just one type of I/O is a good way of characterising the performance. This data can then be used to predict the performance of more complex I/O workloads with a mixture of I/O types in different ratios by calculating a harmonic mean.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Here is an example of a full YAML file, containing the components mentioned above:&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Example YAML file&lt;/summary&gt; 
&lt;p&gt;Here is an example of a YAML file. You can have a lot more workloads than this, of course; I have just included a few for simplicity.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;cluster:

  user: #specify user here 
  head: #specify head here
  clients: #specify clients here
  osds: #specify OSDs here
  mons:
    #specify mons here
  mgrs:
    #specify mgrs here
  osds_per_node: 8
  fs: &#39;xfs&#39;
  mkfs_opts: &#39;-f -i size=2048&#39;
  mount_opts: &#39;-o inode64,noatime,logbsize=256k&#39;
  conf_file: &#39;/cbt/ceph.conf.4x1x1.fs&#39;
  iterations: 1
  use_existing: True
  clusterid: &amp;quot;ceph&amp;quot;
  tmp_dir: &amp;quot;/tmp/cbt&amp;quot;
  ceph-osd_cmd: &amp;quot;/usr/bin/ceph-osd&amp;quot;
  ceph-mon_cmd: &amp;quot;/usr/bin/ceph-mon&amp;quot;
  ceph-run_cmd: &amp;quot;/usr/bin/ceph-run&amp;quot;
  rados_cmd: &amp;quot;/usr/bin/rados&amp;quot;
  ceph_cmd: &amp;quot;/usr/bin/ceph&amp;quot;
  rbd_cmd: &amp;quot;/usr/bin/rbd&amp;quot;
  ceph-mgr_cmd: &amp;quot;/usr/bin/ceph-mgr&amp;quot;
  pdsh_ssh_args: &amp;quot;-a -x -l%u %h&amp;quot;

monitoring_profiles:
  collectl:
     args: &#39;-c 18 -sCD -i 10 -P -oz -F0 --rawtoo --sep &amp;quot;;&amp;quot; -f {collectl_dir}&#39;

benchmarks:
  librbdfio:
    time: 90
    ramp: 30
    time_based: True
    norandommap: True
    vol_size: 52500
    use_existing_volumes: True
    procs_per_volume: [1]
    volumes_per_client: [16]
    osd_ra: [4096]
    cmd_path: &#39;/usr/local/bin/fio&#39;
    create_report: True
    wait_pgautoscaler_timeout: 20
    log_iops: True
    log_bw:  True
    log_lat: True
    fio_out_format: &#39;json&#39;
    log_avg_msec: 100
    rbdname: &amp;quot;test-image&amp;quot;
    poolname: &amp;quot;rbd_replicated&amp;quot;
    prefill:
      blocksize: &#39;64k&#39;
      numjobs: 1

    workloads:
      precondition:
        jobname: &#39;precond1rw&#39;
        mode: &#39;randwrite&#39;
        time: 600
        op_size: 65536
        numjobs: [ 1 ]
        total_iodepth: [ 16 ]
        monitor: False 

      seq32kwrite:
        jobname: &#39;seqwrite&#39;
        mode: &#39;write&#39;
        op_size: 32768
        numjobs: [ 1 ]
        total_iodepth: [ 2, 4, 8, 16, 32, 64, 128, 256, 512, 768 ]
      4krandomread:
        jobname: &#39;randread&#39;
        mode: &#39;randread&#39;
        op_size: 4096
        numjobs: [ 1 ]
        total_iodepth: [ 4, 8, 12, 16, 32, 48, 64, 128, 256, 384, 588, 768 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In part 2 you have learnt about YAML files, workloads, and how they are incorporated within CBT performance benchmarking. We will now move onto &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; of the blog, which will discuss factors to consider and how to start your first CBT performance benchmark!&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Benchmarking Performance with CBT: A guide to setup a Ceph cluster. Part One</title>
    <link href="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/" />
    <updated>2025-12-03T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/</id>
    <author>
      <name>Jake Squelch (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="benchmarks" />
      <category term="performance" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part1/">&lt;p&gt;CBT Performance Benchmarking - Part 1. What is CBT and how can we use it?&lt;/p&gt;
&lt;h2 id=&quot;outline-of-the-blog-series&quot;&gt;Outline of the Blog Series &lt;a class=&quot;link-anchor&quot; href=&quot;#outline-of-the-blog-series&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Part 1&lt;/strong&gt; - How to start a Ceph cluster for a performance benchmark with CBT&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; - Defining YAML contents&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part3/&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; - How to start a CBT performance benchmark&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#intro&quot;&gt;Introduction: What is CBT (Ceph Benchmarking Tool)?  &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#consider&quot;&gt;What do you have to consider when you are benchmarking storage systems?  &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#achieve&quot;&gt;What are you looking to achieve from the performance benchmark?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#start&quot;&gt;Starting up a ceph cluster for a performance benchmark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;introduction%3A-what-is-cbt-(ceph-benchmarking-tool)%3F&quot;&gt;&lt;a id=&quot;intro&quot;&gt;&lt;/a&gt;Introduction: What is CBT (Ceph Benchmarking Tool)? &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-what-is-cbt-(ceph-benchmarking-tool)%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ceph/cbt&quot;&gt;CBT&lt;/a&gt; can be used to standardise the performance evaluation process by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simplifying the cluster creation process and having CBT do it&lt;/li&gt;
&lt;li&gt;Running a deterministic suite of tests with response curves (throughput vs latency) with a wide variety of workloads&lt;/li&gt;
&lt;li&gt;Providing tooling to automatically post-process data from a performance benchmark and generate performance and comparison reports, with the ability to compare two or more (up to 6) response curve runs and identify differences in performance within the response curves&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is an example of what a CBT comparison report would look like: (this will all be explained in more detail later, in &lt;strong&gt;part 3&lt;/strong&gt;)&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/cbt_example_results.png&quot; alt=&quot;alt text&quot; title=&quot;Example CBT comparison report&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now, I understand that the above example curves could be a totally new concept for a lot of people, so I will go over the fundamentals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The perfect response curve would be a flat horizontal line showing constant latency as the quantity of I/O increases, until we reach the saturation point. This is where we hit a bottleneck in the system, such as CPU, network, drive utilisation or some other resource limitation (which could also be in the software). At this point we would expect the curve to become a vertical line, showing that attempting to do more I/O than the system can handle just results in I/Os being queued and hence the latency increasing.&lt;/li&gt;
&lt;li&gt;In practice response curves are never perfect; a good response curve will have a fairly horizontal line with the latency increasing gradually as the I/O load increases, curving upwards towards a vertical line where we reach the saturation point.&lt;/li&gt;
&lt;li&gt;Our comparison curves will be explained in more detail in &lt;strong&gt;part 3&lt;/strong&gt; of the blog, so a basic understanding is more than fine for now.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The objective of this blog is to demonstrate how CBT (Ceph Benchmarking Tool) can be used to run tests for Ceph in a deterministic manner. It&#39;s also to show how to set up a Ceph cluster for use with CBT to make your life simpler by automating a lot of the manual effort that is required to set up a performance test.&lt;/p&gt;
&lt;p&gt;For a real-life example, this blog will try to answer the question &amp;quot;Does using the CLAY erasure code plugin give better performance than using the default JErasure plugin?&amp;quot;, showing how CBT can be used to conduct a set of experiments and produce reports to answer it.&lt;/p&gt;
&lt;p&gt;I hope you find this tutorial simple to understand and you will get to learn the benefits of using CBT to make your performance benchmarking a whole lot easier.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;what-do-you-have-to-consider-when-you-are-benchmarking-storage-systems%3F&quot;&gt;&lt;a id=&quot;consider&quot;&gt;&lt;/a&gt;What do you have to consider when you are benchmarking storage systems? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-do-you-have-to-consider-when-you-are-benchmarking-storage-systems%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are several aspects to consider when evaluating performance. The main one is the goal of measuring performance, which may be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regression testing a fix to see if performance has degraded or improved&lt;/li&gt;
&lt;li&gt;Regression testing a build to see if other contributors have degraded performance&lt;/li&gt;
&lt;li&gt;Comparing a feature&lt;/li&gt;
&lt;li&gt;Comparing the effect of scale-up (adding more OSDs to a node) or scale-out (adding more nodes)&lt;/li&gt;
&lt;li&gt;Comparing the performance of one pool type over another&lt;/li&gt;
&lt;li&gt;The effect of additional network bandwidth&lt;/li&gt;
&lt;li&gt;The effect of upgrading CPU in a Ceph Node&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Therefore you need to consider:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The results generated must be compared against a like-for-like system with the test repeated in the same way as the original results.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This includes the &lt;strong&gt;same&lt;/strong&gt; CPU, number of OSDs, drive type, number of RBD volumes, Ceph nodes, and ethernet port/type.&lt;/li&gt;
&lt;li&gt;Even client attach is important.&lt;/li&gt;
&lt;li&gt;Two seemingly like-for-like systems could produce varying performance results because one drive could have a different generation of Flash memory within it.&lt;/li&gt;
&lt;li&gt;So ideally, to get like-for-like comparisons, tests need to be run on the same system.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The system must be prefilled (if applicable, perhaps not so important for Object/RGW evaluation) and preconditioned in the same way.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pre-filling involves filling the volume or pool with sequential writes prior to any performance benchmarking. Filling to 100% of the physical capacity is not needed; most production systems will have sufficient capacity available to allow for expansion. For benchmarking, therefore, filling to around 50% of the physical capacity is sufficient to represent real-world storage.&lt;/li&gt;
&lt;li&gt;Pre-conditioning is adding random overwrites after prefilling the system to simulate a real-world application and introduce some garbage collection/fragmentation, since most production systems will have been running for many months or years and will therefore have generated many overwrites and updates to the data written.&lt;/li&gt;
&lt;li&gt;A storage system that is almost empty will perform very differently from one that has a lot of data on it due to metadata access, garbage collection, fragmentation etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The same workload I/O sizes, e.g. 1M, 4k, 8k, 64k etc., and with the same sequential/random access pattern.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is always going to be some element of variance in the results, even if everything is done like for like.&lt;br&gt;
This could be down to something as small as workload ordering, which can affect the performance of later workloads. For example, if you sequentially write then read, that will have significantly better performance than if you were to randomly write then sequentially read.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;So if the test produces a pass/fail result, you need to allow for variance; typically 10% is acceptable if you are just looking at the average performance over the duration of the run.&lt;/li&gt;
&lt;li&gt;The shorter the run time, the greater the degree of variance.&lt;/li&gt;
&lt;li&gt;Also, to help minimise variance it’s important to pick an appropriate run time for each test; 5 minutes is usually a good amount.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Turning off the balancer, scrub, deep scrub and autoscaler will help generate more repeatable results, as the benchmark will then just be measuring client performance and not any of the background processes in Ceph that can affect performance, such as backfill, PG splitting/merging, and scrubbing (see the example commands after this list). Leaving these features enabled will generate real-world results, but will likely introduce more variance and a few % difference in performance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
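&lt;p&gt;For reference, here is a minimal sketch of disabling those background processes before a run (the pool name is an example; substitute your own, and re-enable everything afterwards):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ceph balancer off                # stop the PG balancer
ceph osd set noscrub             # pause scrubbing
ceph osd set nodeep-scrub        # pause deep scrubbing
ceph osd pool set rbd_replicated pg_autoscale_mode off   # per-pool autoscaler
&lt;/code&gt;&lt;/pre&gt;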
&lt;hr&gt;
&lt;h2 id=&quot;what-are-you-looking-to-achieve-from-the-performance-benchmark%3F&quot;&gt;&lt;a id=&quot;achieve&quot;&gt;&lt;/a&gt;What are you looking to achieve from the performance benchmark? &lt;a class=&quot;link-anchor&quot; href=&quot;#what-are-you-looking-to-achieve-from-the-performance-benchmark%3F&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;If the same performance test is repeated on the same system you want to be able to measure the same results (or with as little variance between runs as possible). This predictability is important if you are going to try and compare different configurations to see which is better.&lt;/p&gt;
&lt;p&gt;Ideally you also want to be able to come back and run the same test 6 months later, on the same system, and get the same results. This is harder because things can change over time. Likewise, if someone configures an equivalent system to the one the performance test was run on, you would like to get the same results.
If done correctly, the amount of manual effort needed to regression test performance will be significantly reduced.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;starting-up-a-ceph-cluster-for-a-performance-benchmark&quot;&gt;&lt;a id=&quot;start&quot;&gt;&lt;/a&gt;Starting up a ceph cluster for a performance benchmark &lt;a class=&quot;link-anchor&quot; href=&quot;#starting-up-a-ceph-cluster-for-a-performance-benchmark&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For these blogs we will be focusing on using &lt;code&gt;Cephadm&lt;/code&gt; to start our ceph clusters, though &lt;code&gt;vstart&lt;/code&gt; or by hand are also feasible options. It&#39;s also important to note that I am using &lt;code&gt;RBD volumes&lt;/code&gt; as the storage type with &lt;code&gt;FIO&lt;/code&gt; as the IO exerciser interface. The same rules for capacity filling etc apply equally to other storage types, except the maths for calculating the pool size will differ.&lt;/p&gt;
&lt;p&gt;This section will describe the basic steps to get a ceph cluster up and running, ready to start a performance benchmark.&lt;/p&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 1: Setup&lt;/summary&gt;
&lt;p&gt;You will want to SSH into the machine that you will be using.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;My system has the following setup:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 SATA SSDs, 210GB each&lt;/li&gt;
&lt;li&gt;ceph version &lt;code&gt;20.3.0-2198-gb0ae68b0 (b0ae68b0ccceed5a913d81c5a8cb0b4e9c5a5f6b)&lt;/code&gt; tentacle (dev)&lt;/li&gt;
&lt;li&gt;OS: Red Hat Enterprise Linux 9.6 (Plow)&lt;br&gt;
Note: This is a single node system and you are running the IO client on the same system as Ceph. However, there is nothing stopping you from running CBT on a multi-node server. The YAML format allows it to SSH into Ceph nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 2: Clean up&lt;/summary&gt;
&lt;p&gt;When you create a cluster using cephadm and run a CBT test, log files will be created in specified locations.&lt;/p&gt;
&lt;p&gt;So if you have done a test before and know there will be old log files at a specific location, begin by deleting them. If you have never done a CBT run before, you can move on to &lt;code&gt;Step 3: Building a container&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now I will remove a previous cluster that I had running, so that I am starting from a clean slate.&lt;/p&gt;
&lt;p&gt;There are 2 areas you will have to delete to complete this step:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Wherever the &lt;code&gt;tmp_dir&lt;/code&gt; line within your yaml file points to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;tmp_dir: &amp;quot;/tmp/cbt&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This directory contains the temporary log files from the IO exerciser, e.g. the FIO JSON files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The -a argument when you run a performance benchmark:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;-a /tmp/cbt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This argument specifies the directory that will contain the results of the performance benchmark.
As you can see, both my YAML and this argument point to the same directory, so before a CBT run I will always make sure to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;rm -rf /tmp/cbt/*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We delete these files because, if you don&#39;t, CBT assumes there is already a run ongoing and will attempt to protect the previous data by skipping tests throughout the YAML.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 3: Building a container&lt;/summary&gt;
&lt;p&gt;Next you will have to get a build container that you are going to use to construct your ceph cluster. You can obtain this container id from &lt;a href=&quot;https://shaman.ceph.com/builds/ceph&quot;&gt;Builds ceph&lt;/a&gt;. Click on your desired build and then copy the &lt;strong&gt;sha1&lt;/strong&gt;; this is also known as the &lt;strong&gt;container id&lt;/strong&gt;. The build I’m using can be seen in the &lt;code&gt;Step 1: Setup&lt;/code&gt; section above.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You will now pull down the desired build container using podman&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
&lt;summary&gt;Click to see details for upstream&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;podman pull quay.ceph.io/ceph-ci/ceph:&amp;lt;sha1&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure to paste your specific &lt;strong&gt;sha1&lt;/strong&gt; into the above command!&lt;/p&gt;
&lt;/details&gt;
&lt;p&gt;Note: The above is using the development build containers. You can also pull released build containers (for Squid/Reef etc), from &lt;a href=&quot;https://quay.io/repository/ceph/ceph?tab=tags&amp;amp;tag=latest&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Click to see details for downstream&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;podman pull quay.ceph.io/ceph/ceph:v19.2.3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above is an example for the latest Squid build.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;podman pull quay.ceph.io/ceph/ceph:v20.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above is an example for the Tentacle release candidate.&lt;/p&gt;
&lt;/details&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 4: Creating a cluster&lt;/summary&gt;
&lt;p&gt;Firstly, run command &lt;code&gt;lsblk&lt;/code&gt; to see if there are any ceph partitions on the block devices you are going to use for ceph. If so, you will need to run the &lt;code&gt;removevgs&lt;/code&gt; script below, to remove the volume groups:&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Click here to see removevgs script&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Remove any leftover Ceph LVM volumes so the devices can be reused
for i in /dev/ceph*
do
    lvremove -y $i
done
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;Next, use cephadm with the container &lt;code&gt;id&lt;/code&gt; you previously pulled down to create your ceph cluster, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cephadm --image quay.ceph.io/ceph-ci/ceph:&amp;lt;sha1&amp;gt; bootstrap --single-host-defaults --log-to-file --mon-ip &amp;lt;ip_of_node&amp;gt; --allow-mismatched-release
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course replace &lt;code&gt;sha1&lt;/code&gt; and &lt;code&gt;ip_of_node&lt;/code&gt; with your corresponding values. You are specifying the container image and using &lt;code&gt;bootstrap&lt;/code&gt; to initialise a new Ceph cluster. &lt;code&gt;--single-host-defaults&lt;/code&gt; optimises the bootstrap for a single node; note that if you are creating a multi-node Ceph cluster, this option is not needed. &lt;code&gt;--log-to-file&lt;/code&gt; makes Ceph daemons log to files on disk. &lt;code&gt;--mon-ip&lt;/code&gt; tells cephadm which IP address to bind the first monitor to. &lt;code&gt;--allow-mismatched-release&lt;/code&gt; lets you bootstrap with an image that does not match the cephadm version of the host.&lt;/p&gt;
&lt;p&gt;It is also common in performance benchmarking to reset the system into a known state prior to starting any benchmarks because factors such as fragmentation of stored data can affect results. Therefore it is advisable to delete and recreate the cluster between every run.&lt;/p&gt;
&lt;/details&gt;
&lt;hr&gt;
&lt;details&gt;
&lt;summary&gt;Step 5: Configure cluster&lt;/summary&gt;
Now that you have a basic cluster set up, you can inspect it to make sure it is up and running:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ceph orch device ls&lt;/code&gt; to check all the OSDs you need are available&lt;/li&gt;
&lt;li&gt;If not available, you have to use &lt;code&gt;ceph orch device zap &amp;lt;host&amp;gt; &amp;lt;device&amp;gt;&lt;/code&gt; to make them available. A script like this will solve the OSD unavailability problem:&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
&lt;summary&gt;Click to see zap OSD script&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;#! /bin/bash
# Temporary files for the device list and the generated zap commands
file=/tmp/$$.out
out=/tmp/$$b.out

# List all SSD devices known to the orchestrator
cephadm shell ceph orch device ls 2&amp;gt;&amp;amp;1 | grep ssd &amp;gt;$file

# Build a zap command for each host/device pair
cat $file | while read -a line_array; do

host=${line_array[0]}
device=${line_array[1]}

echo ceph orch device zap ${host} ${device} --force &amp;gt;&amp;gt;$out
done

echo exit &amp;gt;&amp;gt;$out

# Run all the zap commands inside a cephadm shell
cephadm shell &amp;lt;$out

# Clean up the temporary files
rm -f $file
rm -f $out
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;ul&gt;
&lt;li&gt;Next, you will create your Erasure Coding (EC) setup. This script can be customised however you’d like your EC setup to be; I will provide a simple example version of mine here:&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
&lt;summary&gt;Click to see details&lt;/summary&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ceph osd erasure-code-profile set reedsol plugin=isa k=4 m=2 technique=reed_sol_van stripe_unit=4K crush-failure-domain=osd
ceph osd pool create rbd_erasure 64 64 erasure reedsol
ceph osd pool create rbd_replicated 64 64 replicated
ceph osd pool set rbd_erasure allow_ec_overwrites true
ceph osd pool set rbd_erasure allow_ec_optimizations true
rbd pool init rbd_erasure
rbd pool init rbd_replicated
rbd create --pool rbd_replicated --data-pool rbd_erasure --size 10G test-image
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;It&#39;s very important to take a note of the &lt;strong&gt;volume name&lt;/strong&gt; and &lt;strong&gt;pool name&lt;/strong&gt; you create, in my example above this is &lt;code&gt;test-image&lt;/code&gt; and &lt;code&gt;rbd_replicated&lt;/code&gt; respectively. As we are creating an erasure coded profile set up, we use the replicated pool name. (In &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;part 2&lt;/strong&gt;&lt;/a&gt; within the &lt;code&gt;Benchmark Module&lt;/code&gt; section you will need to refer to these names)&lt;/p&gt;
&lt;p&gt;So the above is an example of a similar script to what I run. It defines a 4 + 2 EC profile named &lt;strong&gt;reedsol&lt;/strong&gt;. An EC profile is essentially a template that defines how Ceph should encode and store data using EC. You create two pools (&lt;strong&gt;rbd_erasure&lt;/strong&gt; &amp;amp; &lt;strong&gt;rbd_replicated&lt;/strong&gt;), enable EC overwrites and EC optimisations, then initialise pools and create an RBD image backed by the EC pool.&lt;/p&gt;
&lt;p&gt;Within creating the EC setup you will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Defining the amount of data OSDs (k) and parity OSDs (m)&lt;/li&gt;
&lt;li&gt;Defining the size of your drives&lt;/li&gt;
&lt;li&gt;Defining the percentage of prefill&lt;/li&gt;
&lt;li&gt;Defining the number of volumes&lt;/li&gt;
&lt;li&gt;Defining the volume size&lt;/li&gt;
&lt;li&gt;Defining the EC profile, specifying the plugin, technique, stripe width etc&lt;/li&gt;
&lt;li&gt;Creating your EC pool&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My EC (Erasure Coding) setup is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4 + 2 setup (k=4, m=2)&lt;/li&gt;
&lt;li&gt;210GB drive size&lt;/li&gt;
&lt;li&gt;50% prefill&lt;/li&gt;
&lt;li&gt;8 volumes&lt;/li&gt;
&lt;li&gt;52.5GB volume size&lt;/li&gt;
&lt;li&gt;Single EC pool&lt;/li&gt;
&lt;li&gt;Chunk size = 4K&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now you have set up and configured an erasure coded ceph cluster!&lt;/p&gt;
&lt;p&gt;I will go a bit more in depth here regarding prefilling; as mentioned above, you are aiming to prefill 50% of the physical capacity. Choosing a &lt;strong&gt;working set&lt;/strong&gt; (the amount of logical capacity to use over the course of the benchmark) is very important, so that all the IO doesn&#39;t just go straight into cache on systems with large amounts of memory. Therefore, the total working set needs to be significantly larger than the RAM in the system.&lt;/p&gt;
&lt;p&gt;In this example you are using RBD volumes with erasure coding. This is the calculation you would do to find out how much you need to write to fill the physical capacity to 50% (represented by the 0.5); the result is known as the RBD Volume size:
&lt;code&gt;(Physical drive size * K * 0.5) / No. of volumes&lt;/code&gt;
For our example above, you would get:
&lt;code&gt;(210000 * 4 * 0.5) / 8&lt;/code&gt;, therefore the &lt;strong&gt;RBD Volume size&lt;/strong&gt; = 52500 (52.5GB)&lt;/p&gt;
&lt;p&gt;You can then calculate the &lt;strong&gt;total working set&lt;/strong&gt; by doing:
&lt;code&gt;RBD Volume size * No. of volumes&lt;/code&gt;
Which, for our example, would be:
&lt;code&gt;52500 * 8&lt;/code&gt;, therefore the &lt;strong&gt;working set&lt;/strong&gt; = 420000 (420GB)&lt;/p&gt;
&lt;p&gt;You can see that for our example the working set is 420GB and the RAM is 210GB, therefore this is satisfactory.&lt;/p&gt;
&lt;p&gt;If you are not using RBD volumes with EC and are using replicated pools instead, the maths to get the &lt;strong&gt;RBD Volume size&lt;/strong&gt; would look like this:
&lt;code&gt;(Physical drive size * Number of OSDs / Number of copies * 0.5) / Number of Volumes&lt;/code&gt;&lt;/p&gt;
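&lt;p&gt;As a quick sanity check, here is a minimal sketch of both calculations. The EC numbers are the ones from this example; the replicated-pool numbers are purely hypothetical placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# EC example from above: 210GB (210000MB) drives, k=4, 50% fill, 8 volumes
drive_size_mb=210000
k=4
volumes=8
vol_size=$(( drive_size_mb * k / 2 / volumes ))    # &amp;quot;* 0.5&amp;quot; written as &amp;quot;/ 2&amp;quot;
echo &amp;quot;EC RBD volume size: ${vol_size} MB&amp;quot;                    # 52500
echo &amp;quot;EC working set:     $(( vol_size * volumes )) MB&amp;quot;      # 420000

# Replicated-pool equivalent (hypothetical 3-copy pool across 6 OSDs)
osds=6
copies=3
echo &amp;quot;Replica RBD volume size: $(( drive_size_mb * osds / copies / 2 / volumes )) MB&amp;quot;
&lt;/code&gt;&lt;/pre&gt;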
&lt;/details&gt;
&lt;hr&gt;
&lt;p&gt;Now move onto &lt;a href=&quot;https://ceph.io/en/news/blog/2025/cbt-performance-benchmarking-part2/&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; of the blog if you so wish, where you can take a look at defining a YAML file that will outline the workloads (tests) that you will be running on your ceph cluster!&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Fast Erasure Coding for Tentacle Performance Updates</title>
    <link href="https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/" />
    <updated>2025-11-20T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/</id>
    <author>
      <name>Lee Sanders (IBM)</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="performance" />
      <category term="erasure-encoding" />
      <category term="tentacle" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/">&lt;p&gt;A deep-dive into the benefits of the FastEC improvements in Tentacle.
This blog discusses in detail how we have improved Erasure Coding to be a viable alternative to replica and reduce the TCO of your Ceph clusters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contents:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#introduction&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#erasure-coding-basics&quot;&gt;Erasure Coding Basics&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#choosing-an-erasure-code-profile&quot;&gt;Choosing an Erasure Code Profile&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-optimizations-partial-reads&quot;&gt;Read Optimizations (Partial Reads)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#space-efficiency-improvements---small-objects-padding&quot;&gt;Space efficiency improvements - small objects padding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#increasing-stripe_unit-size-to-16k&quot;&gt;Increasing stripe_unit size to 16k&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-optimizations&quot;&gt;Write optimizations&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#partial-writes&quot;&gt;Partial Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#parity-delta-writes&quot;&gt;Parity Delta Writes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#performance-results&quot;&gt;Performance Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#how-to-read-a-response-curve&quot;&gt;How to read a response curve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-results---small-writes&quot;&gt;Write Results - Small Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-results---large-writes&quot;&gt;Write Results - Large Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-results---small-reads&quot;&gt;Read Results - Small Reads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-results---large-reads&quot;&gt;Read Results - Large Reads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-append-results&quot;&gt;Write Append Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#mixed-readwrite-workloads&quot;&gt;Mixed Read/Write Workloads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;introduction&quot;&gt;&lt;a id=&quot;introduction&quot;&gt;&lt;/a&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Users of Ceph within the community have been getting very excited about the Fast EC feature within the Tentacle release of Ceph. This blog discusses the performance benefits of enabling Fast EC in Tentacle compared to Squid.&lt;/p&gt;
&lt;p&gt;The optimizations are primarily intended to benefit Block and File workloads; there may be benefits for S3 object workloads with small objects or random-access reads.&lt;/p&gt;
&lt;p&gt;Enabling Fast EC in Tentacle is on a per-pool basis with:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ceph osd pool set &amp;lt;mypool&amp;gt; allow_ec_optimizations true&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It is important to note that once &lt;code&gt;allow_ec_optimizations&lt;/code&gt; is enabled, it cannot be disabled.&lt;/p&gt;
&lt;p&gt;The Fast Erasure coding improvements are summarised as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read optimizations - partial reads&lt;/li&gt;
&lt;li&gt;Space efficiency improvements - small objects padding&lt;/li&gt;
&lt;li&gt;Write optimizations – partial writes, parity delta writes&lt;/li&gt;
&lt;li&gt;Recommending users increase the &lt;code&gt;stripe_unit&lt;/code&gt; size to 16k for pools with &lt;code&gt;allow_ec_optimizations&lt;/code&gt; enabled.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;erasure-coding-basics&quot;&gt;&lt;a id=&quot;erasure-coding-basics&quot;&gt;&lt;/a&gt;Erasure Coding Basics &lt;a class=&quot;link-anchor&quot; href=&quot;#erasure-coding-basics&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Before we jump into discussing the optimizations, let us briefly talk about the basics of Erasure Coding and RAID.&lt;/p&gt;
&lt;p&gt;Ceph erasure coding works by splitting an object into &lt;strong&gt;K&lt;/strong&gt; data chunks and &lt;strong&gt;M&lt;/strong&gt; parity coding chunks, which are then stored across different Object Storage Daemons (OSDs). If one or more OSDs fail, the missing data can be reconstructed by using the remaining data and parity coding chunks. This method is more storage-efficient than traditional replication because it doesn&#39;t store full copies of data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/blog-erasure-basics.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Data splitting: An object is divided into &lt;strong&gt;K&lt;/strong&gt; data chunks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Parity generation: An erasure code algorithm, such as &lt;a href=&quot;https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction&quot;&gt;Reed-Solomon&lt;/a&gt;, computes &lt;strong&gt;M&lt;/strong&gt; parity coding chunks based on the data chunks. The number of parity chunks &lt;strong&gt;M&lt;/strong&gt; determines how many OSDs can fail without data loss. The user can configure the erasure code algorithm with the different plug-ins available. The choice of plug-in is outside the scope of this blog.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chunk distribution: The &lt;strong&gt;K&lt;/strong&gt; data chunks and &lt;strong&gt;M&lt;/strong&gt; parity chunks are distributed and stored on separate OSDs according to a CRUSH rule.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The user decides what the size of a chunk is, this is called a &lt;code&gt;stripe_unit&lt;/code&gt; and this can be specified when the &lt;code&gt;erasure-code-profile&lt;/code&gt; is created. There is a section later that discusses the choice of &lt;code&gt;stripe_unit&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;stripe_unit&lt;/code&gt; size is the amount of data that is written to a data chunk before the next part of an object is written to the next chunk on the next OSD. A stripe is the collection of strips across the &lt;strong&gt;K&lt;/strong&gt; data chunks, together with the coding parities that protect the data in the event of an OSD loss.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Within the community, &lt;code&gt;stripe_unit&lt;/code&gt; is commonly referred to as a &lt;strong&gt;chunk&lt;/strong&gt;. For the purpose of this blog, &lt;code&gt;stripe_unit&lt;/code&gt; is synonymous to chunk size.&lt;/p&gt;
&lt;h3 id=&quot;choosing-an-erasure-code-profile&quot;&gt;&lt;a id=&quot;choosing-an-erasure-code-profile&quot;&gt;&lt;/a&gt;Choosing an Erasure Code Profile &lt;a class=&quot;link-anchor&quot; href=&quot;#choosing-an-erasure-code-profile&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Users have been mainly using replica-3 pools for block and file workloads. A replica-3 pool stores 3 copies of the data on different OSDs so can survive two OSD failures without loss of data. The most common double failure is a drive failure plus a medium error on another drive. Replica-3 pools have a 300% storage overhead - for every 3GB of raw capacity you can store 1GB of application data.&lt;/p&gt;
&lt;p&gt;With erasure coding pools you create an erasure code profile choosing values for &lt;strong&gt;K+M&lt;/strong&gt;. The minimum number of OSDs required for an erasure code pool is &lt;strong&gt;K+M&lt;/strong&gt;, and just like replica-3 pools it is recommended that these OSDs are in different servers for fault tolerance. The choice of &lt;strong&gt;M&lt;/strong&gt; defines how much redundancy you have, &lt;strong&gt;M=2&lt;/strong&gt; means you can survive two OSD failures - the same as a replica-3 pool. The storage overhead for an erasure coded pool is &lt;strong&gt;(K+M) / K&lt;/strong&gt;, so a 4+2 pool has a 150% storage overhead.&lt;/p&gt;
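&lt;p&gt;As a quick illustration of that overhead formula (raw capacity required as a percentage of the data stored):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Storage overhead = raw capacity / usable capacity
awk &#39;BEGIN {
    printf &amp;quot;replica-3: %d%%\n&amp;quot;, 3 * 100;            # three full copies
    printf &amp;quot;EC 4+2:    %d%%\n&amp;quot;, (4 + 2) / 4 * 100;   # (K+M)/K
    printf &amp;quot;EC 6+2:    %d%%\n&amp;quot;, (6 + 2) / 6 * 100;
}&#39;
&lt;/code&gt;&lt;/pre&gt;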
&lt;p&gt;This blog focuses on Erasure code performance with &lt;strong&gt;M=2&lt;/strong&gt; as this gives the same level of protection as a replica-3 pool.&lt;/p&gt;
&lt;h2 id=&quot;read-optimizations-(partial-reads)&quot;&gt;&lt;a id=&quot;read-optimizations-partial-reads&quot;&gt;&lt;/a&gt;Read Optimizations (Partial Reads) &lt;a class=&quot;link-anchor&quot; href=&quot;#read-optimizations-(partial-reads)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In Squid, reads to an individual strip in a stripe read the whole stripe, extract the data needed by the client request from the stripe data and then discard the rest. For small reads, the greater the &lt;strong&gt;K&lt;/strong&gt; value (data strips) in the erasure code profile, the greater the amount of wasted IOs to the OSDs.&lt;/p&gt;
&lt;p&gt;In Tentacle, Partial Reads is an improvement that reads only the minimal data needed to honour the client request. There are two benefits to this improvement: firstly, read performance is unaffected by increasing &lt;strong&gt;K&lt;/strong&gt; and your drive media gets better utilization through fewer wasted IOs; secondly, with a larger &lt;code&gt;stripe_unit&lt;/code&gt;, client reads will only need to read part of a strip and there will be less wasted bandwidth from the other OSDs.&lt;/p&gt;
&lt;p&gt;This means that in Tentacle, with fast EC, you can now choose to use a higher value of &lt;strong&gt;K&lt;/strong&gt; so that you get better capacity utilization without the performance penalties that we see in Squid.&lt;/p&gt;
&lt;h2 id=&quot;space-efficiency-improvements---small-objects-padding&quot;&gt;&lt;a id=&quot;space-efficiency-improvements&quot;&gt;&lt;/a&gt;Space efficiency improvements - small objects padding &lt;a class=&quot;link-anchor&quot; href=&quot;#space-efficiency-improvements---small-objects-padding&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In Squid, small objects are padded to a whole stripe, which results in wasted space as well as a write performance loss due to needlessly writing to multiple OSDs. Fast EC does not pad small objects to a whole stripe; instead it writes the object to just the strips that it needs to, resulting in a performance improvement as well as a capacity saving.&lt;/p&gt;
&lt;h2 id=&quot;increasing-stripe_unit-size-to-16k&quot;&gt;&lt;a id=&quot;increasing-stripe-unit-size-to-16k&quot;&gt;&lt;/a&gt;Increasing stripe_unit size to 16k &lt;a class=&quot;link-anchor&quot; href=&quot;#increasing-stripe_unit-size-to-16k&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Having a small &lt;code&gt;stripe_unit&lt;/code&gt; increases the probability that client I/Os get split up into multiple requests for different OSDs. For large I/Os (e.g. 1MB reads) there is a performance advantage in splitting the I/O into smaller requests to separate OSDs that can be processed in parallel. For smaller I/Os splitting the I/O just increases the work for the drives, CPU and network and reduces performance.&lt;/p&gt;
&lt;p&gt;Increasing the &lt;code&gt;stripe_unit&lt;/code&gt; reduces the overheads for processing small I/Os whilst still splitting and getting a performance advantage for large I/Os.&lt;/p&gt;
&lt;p&gt;In squid and earlier, there are two reasons why the &lt;code&gt;stripe_unit&lt;/code&gt; was small:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Lack of partial read support essentially was a blocker to allowing the increase of the &lt;code&gt;stripe_unit&lt;/code&gt; size, as greater values of &lt;strong&gt;K&lt;/strong&gt; with a larger &lt;code&gt;stripe_unit&lt;/code&gt; meant reads of 4-16k would have resulted in even greater IO wastage to the OSDs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;EC used to pad all objects to be a multiple of the stripe size. A bigger &lt;code&gt;stripe_unit&lt;/code&gt; means more padding which wasted storage capacity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There is still a compromise between performance and capacity usage. Increasing the &lt;code&gt;stripe_unit&lt;/code&gt; above 16K, perhaps as high as 256K, would improve performance more, but for small files or objects it will still waste storage capacity. The choice of 16K for the &lt;code&gt;stripe_unit&lt;/code&gt; is a good compromise – it gives very similar capacity utilization to the old EC but better performance.&lt;/p&gt;
&lt;p&gt;The default &lt;code&gt;stripe_unit&lt;/code&gt; is still 4K in Tentacle, but we recommend that you specify a 16K &lt;code&gt;stripe_unit&lt;/code&gt; when you create a new fast EC pool for a bigger performance gain.&lt;/p&gt;
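&lt;p&gt;For example, a new fast EC profile and pool along these lines (the names are illustrative; adjust the plugin, k, m and failure domain to suit your cluster):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# 6+2 profile with a 16K stripe_unit, then a pool with EC optimizations enabled
ceph osd erasure-code-profile set fastec62 plugin=isa k=6 m=2 stripe_unit=16K crush-failure-domain=host
ceph osd pool create ec_fast 64 64 erasure fastec62
ceph osd pool set ec_fast allow_ec_overwrites true
ceph osd pool set ec_fast allow_ec_optimizations true
&lt;/code&gt;&lt;/pre&gt;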
&lt;p&gt;For existing pools it is not possible to change the &lt;code&gt;stripe_unit&lt;/code&gt;; fast EC can still be enabled for these pools, but the performance improvement will be slightly smaller.&lt;/p&gt;
&lt;h2 id=&quot;write-optimizations&quot;&gt;&lt;a id=&quot;write-optimizations&quot;&gt;&lt;/a&gt;Write optimizations &lt;a class=&quot;link-anchor&quot; href=&quot;#write-optimizations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;partial-writes&quot;&gt;&lt;a id=&quot;partial-writes&quot;&gt;&lt;/a&gt;Partial Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#partial-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Squid, all sub-stripe writes are handled by reading the whole stripe, merging in the new data from the client, encoding the new parities and writing the whole stripe back, data along with the coding parities.&lt;/p&gt;
&lt;p&gt;This meant that EC was more optimised for large block and large object workloads, but it is not optimal for small object or small write workloads such as CephFS or transactional workloads, since greater values of &lt;strong&gt;K&lt;/strong&gt; with small writes mean that IO operations are amplified.&lt;/p&gt;
&lt;p&gt;Partial Writes reads only the data strips that are not being written, encodes the new parities and writes back only the modified data and parity strips.&lt;/p&gt;
&lt;p&gt;This optimisation means for small writes and large values of &lt;strong&gt;K&lt;/strong&gt;, Fast EC saves on drive operations for reading and writing unchanged data within the stripe.&lt;/p&gt;
&lt;h3 id=&quot;parity-delta-writes&quot;&gt;&lt;a id=&quot;parity-delta-writes&quot;&gt;&lt;/a&gt;Parity Delta Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#parity-delta-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Parity delta writes (PDW) builds on the partial write improvement within Fast EC.&lt;/p&gt;
&lt;p&gt;A common technique used by block storage controllers implementing RAID-5 and RAID-6 is the parity delta write. When a small part of the stripe is being overwritten it is possible to perform the update by reading the old data, XORing this with the new data to create a delta, and then reading each coding parity, applying the delta and writing the new parity. The advantage of this technique is that it can involve a lot less I/O, especially for &lt;strong&gt;K+M&lt;/strong&gt; encodings with larger values of &lt;strong&gt;K&lt;/strong&gt;. The technique is not specific to &lt;strong&gt;M=1&lt;/strong&gt; and &lt;strong&gt;M=2&lt;/strong&gt;; it can be applied with any number of coding parities. For &lt;strong&gt;M=2&lt;/strong&gt;, this technique involves doing 3 reads and 3 writes per strip within the client request, and then updates to the parity are coalesced via a cache to minimize the number of parity updates within the stripe. (For &lt;strong&gt;M=1&lt;/strong&gt;, 2 reads and 2 writes are needed for each write.)&lt;/p&gt;
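&lt;p&gt;A toy illustration of the delta idea, using a single XOR parity (RAID-5 style) for simplicity; the numbers are arbitrary, and the real implementation applies the same principle to each Reed-Solomon coding parity:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Parity over four data strips: P = D1 ^ D2 ^ D3 ^ D4
d1=0x11; d2=0x22; d3=0x33; d4=0x44
p=$(( d1 ^ d2 ^ d3 ^ d4 ))

# Overwrite D2 only: read old D2 and old P, write new D2 and new P
new_d2=0x99
delta=$(( d2 ^ new_d2 ))          # delta between old and new data
new_p=$(( p ^ delta ))            # apply the delta to the parity

full=$(( d1 ^ new_d2 ^ d3 ^ d4 )) # what a full re-encode would give
printf &#39;delta-updated parity 0x%x matches full re-encode 0x%x\n&#39; &amp;quot;$new_p&amp;quot; &amp;quot;$full&amp;quot;
&lt;/code&gt;&lt;/pre&gt;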
&lt;p&gt;In some scenarios depending on the value of &lt;strong&gt;K&lt;/strong&gt; and the size of the write operation, it may be more beneficial to not use PDW.&lt;/p&gt;
&lt;p&gt;The implementation of PDW within Fast EC dynamically adjusts the write technique for each IO for optimal write performance.&lt;/p&gt;
&lt;p&gt;Here is an example table of profile vs the write size just for illustration:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Erasure Code&lt;/th&gt;
&lt;th&gt;stripe_unit&lt;/th&gt;
&lt;th&gt;Write size&lt;/th&gt;
&lt;th&gt;PDW Write&lt;/th&gt;
&lt;th&gt;PDW Off (Partial Write)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;3 reads+3 writes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1 read+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3 reads+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3 reads+3 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3 reads+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5 reads+3 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;32k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4 reads+4 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 reads+4 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;4 to 16k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3 reads+3 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7 reads+3 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8+2&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;32k&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4 reads+4 writes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6 reads+4 writes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Figure 1: Table to explain write overhead using PDW and Partial Write techniques&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The highlighted &lt;code&gt;text&lt;/code&gt; indicates the more efficient method for the scenario.&lt;/p&gt;
&lt;p&gt;In scenarios where the total number of I/O operations is the same between PDW on and off (i.e. using the Partial Write methodology), FastEC will favour using PDW, because reading and writing the same OSD is more efficient than reading and writing different OSDs since BlueStore caches metadata.&lt;/p&gt;
&lt;h2 id=&quot;performance-results&quot;&gt;&lt;a id=&quot;performance-results&quot;&gt;&lt;/a&gt;Performance Results &lt;a class=&quot;link-anchor&quot; href=&quot;#performance-results&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For the purpose of this blog, we ran the performance tests with a single node. Running with a single node means that there are no network bottlenecks and we can focus on CPU and drive bottlenecks. The absolute performance measurements won’t be great, but we can still compare relative performance, as the optimizations will demonstrate that we have extracted more performance in workloads that are limited by CPU or drives.&lt;/p&gt;
&lt;p&gt;The configuration of the system is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single Node – 8 OSDs - NVME Flash&lt;/li&gt;
&lt;li&gt;2 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz – 28 cores per socket&lt;/li&gt;
&lt;li&gt;LibRBD FIO client – 16 volumes – 1 client per RBD volume&lt;/li&gt;
&lt;li&gt;ISAL plugin&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;how-to-read-a-response-curve&quot;&gt;&lt;a id=&quot;how-to-read-a-response-curve&quot;&gt;&lt;/a&gt;How to read a response curve &lt;a class=&quot;link-anchor&quot; href=&quot;#how-to-read-a-response-curve&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For the next sections, I need to explain a response curve (also known as a hockey stick curve).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-read-responsecurve.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 2: How to read a response curve.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A response curve plots I/Os per second (IOPs) against latency. Starting from the bottom left of the chart, the expectation is that as the queue depth to the storage system increases, the IOPs increase also. At a certain point in the curve, otherwise known as the knee, we reach the saturation point where throughput no longer increases and adding extra work onto the storage system’s queue will just increase latency. This is what generates the “hockey stick” shape of the curve.&lt;/p&gt;
&lt;p&gt;For each point of the curve, the I/O workload is run at a specified queue depth for several minutes (typically 3 to 5, with a warm-up period) and then an average IOPS and latency is calculated.&lt;/p&gt;
&lt;p&gt;The saturation point is system specific and there may be many reasons why this limit is hit depending on the workload: the CPU, a single CPU core, a drive, a network interface, or some resource limit in the software are just a few possible reasons, and there may be others.&lt;/p&gt;
&lt;p&gt;Typically, response curves are used during client system sizing estimates to understand the limits of the system being sold and to evaluate how much headroom remains on the system. Typically, clients don’t go beyond around 70% of the maximum throughput, to allow sufficient headroom for expansion.
A flat line at the beginning of the curve through to the knee is an indication that latency is consistent, with low variance in the throughput and latency.&lt;/p&gt;
&lt;p&gt;The topic of how a response curve is created or factors that can affect the response curve is subject to performance best practices. This is outside the scope of this blog and will be discussed in a series of blogs on CBT (Ceph Benchmarking Tool) which will be available soon on &lt;a href=&quot;https://ceph.io&quot;&gt;ceph.io&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When comparing response curves, it is inevitable that there is some variance, typically around 5 to 10%.&lt;/p&gt;
&lt;p&gt;For now, let us move on to discussing the Fast EC performance improvements in Tentacle.
To explain the legend of the charts in the next section (and all other charts in this blog), take the following example:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;squid-ec-6+2-4K&lt;/strong&gt; means we are running the squid build, using erasure coding with a 6+2 profile and a 4K stripe unit. Therefore, these graphs are comparing a Squid build with a 4K &lt;code&gt;stripe_unit&lt;/code&gt; 6+2 erasure code to a Tentacle build with FastEC enabled with a 16K &lt;code&gt;stripe_unit&lt;/code&gt; and a 6+2 profile. There are other charts that use a different erasure code profile.&lt;/p&gt;
&lt;h3 id=&quot;write-results---small-writes&quot;&gt;&lt;a id=&quot;write-results---small-writes&quot;&gt;&lt;/a&gt;Write Results - Small Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#write-results---small-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallwrites-part1.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-smallwrites-part2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 3: Small Writes - Squid 4k 6+2 EC vs Tentacle 16k Fast EC 6+2&lt;/strong&gt;
Small writes are common for Ceph FS, RBD and small object workloads.&lt;/p&gt;
&lt;p&gt;In Figure 3, we start by comparing a Squid 4k &lt;code&gt;stripe_unit&lt;/code&gt; 6+2 erasure code to Tentacle with FastEC enabled in a 16k &lt;code&gt;stripe_unit&lt;/code&gt; 6+2 configuration. This is a small, single-node system with 8 OSDs. By all means, 20K IOPS isn’t a particularly great throughput for a storage system, however it isn’t the absolute numbers we are interested in here. We are interested in the relative performance of the two pieces of software; this highlights that we can at least double the throughput of the drives with Fast EC at the same latency, or in some cases improve the latency. If you want more performance, you can add more drives and nodes to the configuration.&lt;/p&gt;
&lt;p&gt;The improvement in performance in the 6+2 configuration between Squid 4K and Tentacle 16k is largely due to the Parity Delta Writes feature of FastEC, as explained in Figure 1, which compares the number of read/write operations depending on the value of &lt;strong&gt;K&lt;/strong&gt; and the size of the write IO request.&lt;/p&gt;
&lt;p&gt;Your choice of &lt;strong&gt;K&lt;/strong&gt; can affect the performance you get from the system. Here is a set of charts that perform the same 4/8/16k random writes test, comparing 2+2, 4+2 and 6+2 EC configurations:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallwrites-part3.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-smallwrites-part4.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 4: Small Writes - Squid – EC 2+2, 4+2 and 6+2 – 4k compared to tentacle – FastEC in 2+2,4+2, 6+2 profiles&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Previously in Squid, write performance reduced as &lt;strong&gt;K&lt;/strong&gt; increased. The reason is that the whole stripe is always being read and written, which means that for wider erasure codes (e.g. 4+2 and 6+2) the overheads get higher and performance reduces. Increasing &lt;strong&gt;K&lt;/strong&gt; above 6 would lead to further drops in performance.&lt;/p&gt;
&lt;p&gt;For Tentacle with Fast EC, the parity delta write optimization means that performance improves for wider erasure codes as &lt;strong&gt;K&lt;/strong&gt; increases. Performance is not expected to improve beyond 6+2.&lt;/p&gt;
&lt;p&gt;We’ll discuss later in this blog how we are recommending choosing greater values of &lt;strong&gt;K&lt;/strong&gt; as this improves storage efficiency with less capacity overhead.&lt;/p&gt;
&lt;h3 id=&quot;write-results---large-writes&quot;&gt;&lt;a id=&quot;write-results---large-writes&quot;&gt;&lt;/a&gt;Write Results - Large Writes &lt;a class=&quot;link-anchor&quot; href=&quot;#write-results---large-writes&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-largewrites.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 5: Large Writes - Comparing Squid 4k, Tentacle 16k to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For large writes and large S3 objects there is a small increase in throughput and lower latency compared to Squid. You can expect to see the same performance with FastEC enabled for larger 1MByte objects; performance is near 3-way replica.&lt;/p&gt;
&lt;h3 id=&quot;read-results---small-reads&quot;&gt;&lt;a id=&quot;read-results---small-reads&quot;&gt;&lt;/a&gt;Read Results - Small Reads &lt;a class=&quot;link-anchor&quot; href=&quot;#read-results---small-reads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallreads-part1.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 6: Small reads comparing 4k stripe_unit Squid 6+2 EC to 16k Tentacle 6+2 EC&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Small reads yield a 2-3x improvement due to the Partial Read feature added in Fast EC. This is good for RBD, Ceph FS and Small object workloads.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-smallreads-part2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 7: Small Reads - Comparing 2+2,4+2, 6+2 Squid to Tentacle to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Comparing the different erasure code profiles between Squid and Tentacle.&lt;/p&gt;
&lt;p&gt;These results highlight the following observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For Squid, as &lt;strong&gt;K&lt;/strong&gt; increases from 2+2 -&amp;gt; 4+2 -&amp;gt; 6+2, maximum throughput degrades because, as explained earlier in this blog, Squid does not have partial reads. As &lt;strong&gt;K&lt;/strong&gt; increases, more data is thrown away for small read operations, therefore increasing OSD and CPU utilization.&lt;/li&gt;
&lt;li&gt;For Tentacle, as &lt;strong&gt;K&lt;/strong&gt; increases, maximum throughput scales to the point where we can achieve nearly the same read performance as 3-way replica.&lt;/li&gt;
&lt;li&gt;The latency gap between Tentacle and 3-way replica is because reads to the non-primary OSDs are being redirected to the primary OSD.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Direct Reads, a feature coming in a future release of Ceph, will remove the hop to the primary OSD, improving latency to be equivalent to 3-way replica performance.&lt;/p&gt;
&lt;p&gt;It is currently targeted at the Umbrella timeframe; there will be a blog on this feature at a future date.&lt;/p&gt;
&lt;h3 id=&quot;read-results---large-reads&quot;&gt;&lt;a id=&quot;read-results---large-reads&quot;&gt;&lt;/a&gt;Read Results - Large Reads &lt;a class=&quot;link-anchor&quot; href=&quot;#read-results---large-reads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-largereads-part1.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-largereads-part2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 8: Large Reads - Comparing 6+2 Squid 4K to Tentacle 6+2 16k to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For large reads (backup, large S3 object and streaming workloads), Fast EC offers slightly lower latency and around a 1.2x increase in throughput over Squid.&lt;/p&gt;
&lt;p&gt;Direct Reads is expected to significantly improve EC throughput further to be much closer to 3-way replica whilst also reducing latency due to dividing up of the large requests into chunks and issuing the IOs in parallel to all the OSDs in the stripe.&lt;/p&gt;
&lt;h3 id=&quot;write-append-results&quot;&gt;&lt;a id=&quot;write-append-results&quot;&gt;&lt;/a&gt;Write Append Results &lt;a class=&quot;link-anchor&quot; href=&quot;#write-append-results&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Write appends are where new data is being appended to the end of an existing object. This is common in sequential write, backup, AI or RGW PUT workloads.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-writeappends-part1.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-writeappends-part2.png&quot; alt=&quot;&quot;&gt;
&lt;img src=&quot;images/tentacle-fastec-writeappends-part3.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 9: Write Appends – Squid 4k to Tentacle 16k &lt;code&gt;stripe_unit&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;These results highlight the following benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A significant latency reduction for all writes up to 512k and a modest improvement at 1MByte.&lt;/li&gt;
&lt;li&gt;For small block writes up to 16k, there is a significant increase in IOPs throughput available.&lt;/li&gt;
&lt;li&gt;For writes 16k to 64k there is a modest increase in throughput available also.&lt;/li&gt;
&lt;li&gt;No degradation in performance for 512k and 1Mbyte writes whilst improving latency significantly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is interesting to note that increasing &lt;strong&gt;K&lt;/strong&gt; (e.g. going from 2+2 to 4+2/6+2) increases the latency. The reason for this is that in a 2+2 configuration, 50% of your I/O is writing to the primary OSD of the PG, whereas in a 4+2 configuration, 25% of your I/O is writing to the primary OSD of the PG. Writing to a non-primary OSD requires forwarding the request to the primary OSD, resulting in an extra messenger hop operation.&lt;/p&gt;
&lt;h3 id=&quot;mixed-read%2Fwrite-workloads&quot;&gt;&lt;a id=&quot;mixed-readwrite-workloads&quot;&gt;&lt;/a&gt;Mixed Read/Write Workloads &lt;a class=&quot;link-anchor&quot; href=&quot;#mixed-read%2Fwrite-workloads&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;images/tentacle-fastec-mixed.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 10: Mixed 16k 70/30 - Squid 4k to Tentacle 16k to 3-way replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Transactional and File workloads contain a mixture of reads and writes, typically small block with 70% reads and 30% writes of around 16k in size. This chart contains a typical 70/30 16k mix workload.&lt;/p&gt;
&lt;p&gt;Compared with Squid, there is at least a doubling in throughput with FastEC. Three-way replica is still faster; a 6+2 16k &lt;code&gt;stripe_unit&lt;/code&gt; erasure pool with Fast EC achieves around 50% of its performance. However, you need to consider that a 6+2 erasure code has only 33% capacity overhead, compared to needing 3x physical capacity for a three-way configuration. Three-way replica is a significantly more expensive option compared to using EC. Therefore, EC in 6+2 form has a much better cost vs performance ratio than 3-way replica.&lt;/p&gt;
&lt;p&gt;On the same storage system, write-dominated workloads with EC (due to the 3 reads/3 writes per update) are never going to perform as well as replica, simply because the EC algorithms need to do more I/Os than replica does. However, you can offset this cost with the reduced physical capacity requirement and restructure your storage accordingly.&lt;/p&gt;
&lt;p&gt;It is important to note that traditional storage controllers often offer a choice between RAID-1 (mirroring) and RAID-6 (erasure coding, K+2), and they present a similar cost vs performance trade-off.&lt;/p&gt;
&lt;p&gt;Using a wider erasure code such as 6+2 requires 9 nodes, so you may need to add more nodes to your Ceph cluster. However, the cost of a storage solution is typically dominated by the cost of the drives you install to store the data, especially if you are using flash. With erasure code you get roughly half the performance at less than half the cost, giving you the opportunity to scale out and build the same level of performance as replica.&lt;/p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary &lt;a class=&quot;link-anchor&quot; href=&quot;#summary&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The objective of the EC performance enhancements is to make performance good enough to make it viable to use EC for block and file storage, especially when you consider the cost performance ratio benefits of using EC over 3-way replica.&lt;/p&gt;
&lt;p&gt;For the most part, users should not be choosing the value of &lt;strong&gt;K&lt;/strong&gt; based on performance. Users should use higher values of &lt;strong&gt;K&lt;/strong&gt; (such as 6+2) for better storage efficiency whilst maintaining the same redundancy as replica.&lt;/p&gt;
&lt;p&gt;Using Fast EC in a 6+2 configuration, you could use this capacity saving to increase the number of nodes, redistribute your drives across them, and achieve the same performance as three-way replica while still saving money.&lt;/p&gt;
&lt;p&gt;The Fast EC feature in Tentacle reduces the total cost of ownership of your Ceph cluster by making Erasure Coding a viable, more space-efficient way of storing your data, with a significantly better cost vs performance ratio than Replica pools.&lt;/p&gt;
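&lt;p&gt;As a rough illustration of what this looks like in practice, the sketch below creates a 6+2 profile with a 16k &lt;code&gt;stripe_unit&lt;/code&gt;, creates an erasure-coded pool from it, and enables the EC optimizations flag on that pool. The pool and profile names are made up and the exact flag syntax may vary by release, so treat this as a starting point rather than a recipe:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical example: a 6+2 Fast EC pool with a 16k stripe unit
$ ceph osd erasure-code-profile set fastec-6-2 k=6 m=2 plugin=isa stripe_unit=16K
$ ceph osd pool create ecpool erasure fastec-6-2
$ ceph osd pool set ecpool allow_ec_optimizations true
&lt;/code&gt;&lt;/pre&gt;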
&lt;p&gt;I hope this blog has helped you appreciate the performance benefits of Fast EC. The team are working on many more improvements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Direct Reads – This feature will significantly improve reads, offering performance on par with Replica pools.&lt;/li&gt;
&lt;li&gt;Object packing – This feature benefits users wanting to increase the &lt;code&gt;stripe_unit&lt;/code&gt; beyond 16k without degrading space utilization, which in turn brings further performance improvements for reads and writes larger than 16k. This will be a useful improvement for larger (4MB) objects.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Direct Reads is targeted for the Umbrella release. Object packing will be in a future release. More performance data on these features will be available nearer the time.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>v20.2.0 Tentacle released</title>
    <link href="https://ceph.io/en/news/blog/2025/v20-2-0-tentacle-released/" />
    <updated>2025-11-18T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/v20-2-0-tentacle-released/</id>
    <author>
      <name>Laura Flores</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="release" />
      <category term="tentacle" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/v20-2-0-tentacle-released/">&lt;p&gt;Tentacle is the 20th stable release of Ceph.&lt;/p&gt;
&lt;p&gt;This is the first stable release of Ceph Tentacle.&lt;/p&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#changes&quot;&gt;Major Changes from Squid&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#upgrade&quot;&gt;Upgrading from Reef or Squid&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#upgrade-from-older-release&quot;&gt;Upgrading from pre-Reef releases (like Quincy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#contributors&quot;&gt;Thank You to Our Contributors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;major-changes-from-squid&quot;&gt;&lt;a id=&quot;changes&quot;&gt;&lt;/a&gt;Major Changes from Squid &lt;a class=&quot;link-anchor&quot; href=&quot;#major-changes-from-squid&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;highlights&quot;&gt;Highlights &lt;a class=&quot;link-anchor&quot; href=&quot;#highlights&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;See the sections below for more details on these items.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;CephFS&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Directories may now be configured with case-insensitive or normalized
directory entry names.&lt;/li&gt;
&lt;li&gt;Modifying the FS setting variable &lt;code&gt;max_mds&lt;/code&gt; when a cluster is unhealthy
now requires users to pass the confirmation flag (&lt;code&gt;--yes-i-really-mean-it&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EOPNOTSUPP&lt;/code&gt; (Operation not supported) is now returned by the CephFS FUSE
client for &lt;code&gt;fallocate&lt;/code&gt; for the default case (i.e. &lt;code&gt;mode == 0&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Crimson&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SeaStore Tech Preview: SeaStore object store is now deployable
alongside Crimson-OSD, mainly for early testing and experimentation.
Community feedback is encouraged to help with future improvements.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dashboard&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support has been added for NVMe/TCP gateway groups and multiple
namespaces, multi-cluster management, OAuth 2.0 integration, and enhanced
RGW/SMB features including multi-site automation, tiering, policies,
lifecycles, notifications, and granular replication.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Integrated SMB support&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ceph clusters now offer an SMB Manager module that works like the existing
NFS subsystem. The new SMB support allows the Ceph cluster to automatically
create Samba-backed SMB file shares connected to CephFS. The &lt;code&gt;smb&lt;/code&gt; module
can configure both basic Active Directory domain or standalone user
authentication. The Ceph cluster can host one or more virtual SMB clusters
which can be truly clustered using Samba&#39;s CTDB technology. The &lt;code&gt;smb&lt;/code&gt;
module requires a cephadm-enabled Ceph cluster and deploys container images
provided by the &lt;code&gt;samba-container&lt;/code&gt; project. The Ceph dashboard can be used
to configure SMB clusters and shares. A new &lt;code&gt;cephfs-proxy&lt;/code&gt; daemon is
automatically deployed to improve scalability and memory usage when connecting
Samba to CephFS.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MGR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Users now have the ability to force-disable always-on modules.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;restful&lt;/code&gt; and &lt;code&gt;zabbix&lt;/code&gt; modules (deprecated since 2020) have been
officially removed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RADOS&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;FastEC: Long-anticipated performance and space amplification
optimizations are added for erasure-coded pools.&lt;/li&gt;
&lt;li&gt;BlueStore: Improved compression and a new, faster WAL (write-ahead-log).&lt;/li&gt;
&lt;li&gt;Data Availability Score: Users can now track a data availability score
for each pool in their cluster.&lt;/li&gt;
&lt;li&gt;OMAP: All components have been switched to the faster OMAP iteration
interface, which improves RGW bucket listing and scrub operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RBD&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New live migration features: RBD images can now be instantly imported
from another Ceph cluster (native format) or from a wide variety of
external sources/formats.&lt;/li&gt;
&lt;li&gt;There is now support for RBD namespace remapping while mirroring between
Ceph clusters.&lt;/li&gt;
&lt;li&gt;Several commands related to group and group snap info were added or
improved, and &lt;code&gt;rbd device map&lt;/code&gt; command now defaults to &lt;code&gt;msgr2&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RGW&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Added support for S3 &lt;code&gt;GetObjectAttributes&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For compatibility with AWS S3, &lt;code&gt;LastModified&lt;/code&gt; timestamps are now truncated
to the second. Note that during upgrade, users may observe these timestamps
moving backwards as a result.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bucket resharding now does most of its processing before it starts to block
write operations. This should significantly reduce the client-visible impact
of resharding on large buckets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The User Account feature introduced in Squid provides first-class support for
IAM APIs and policy. Our preliminary STS support was based on tenants, and
exposed some IAM APIs to admins only. This tenant-level IAM functionality is now
deprecated in favor of accounts. While we&#39;ll continue to support the tenant feature
itself for namespace isolation, the following features will be removed no sooner
than the V release:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tenant-level IAM APIs including CreateRole, PutRolePolicy and PutUserPolicy,&lt;/li&gt;
&lt;li&gt;Use of tenant names instead of accounts in IAM policy documents,&lt;/li&gt;
&lt;li&gt;Interpretation of IAM policy without cross-account policy evaluation,&lt;/li&gt;
&lt;li&gt;S3 API support for cross-tenant names such as &lt;code&gt;Bucket=&#39;tenant:bucketname&#39;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;STS Lite and &lt;code&gt;sts:GetSessionToken&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;cephadm&quot;&gt;Cephadm &lt;a class=&quot;link-anchor&quot; href=&quot;#cephadm&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A new cephadm-managed &lt;code&gt;mgmt-gateway&lt;/code&gt; service provides a single, TLS-terminated
entry point for Ceph management endpoints such as the Dashboard and the monitoring
stack. The gateway is implemented as an nginx-based reverse proxy that fronts Prometheus,
Grafana, and Alertmanager, so users no longer need to connect to those daemons directly or
know which hosts they run on. When combined with the new &lt;code&gt;oauth2-proxy&lt;/code&gt; service, which
integrates with external identity providers using the OpenID Connect (OIDC) / OAuth 2.0
protocols, the gateway can enforce centralized authentication and single sign-on (SSO) for
both the Ceph Dashboard and the rest of the monitoring stack.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;High availability for the Ceph Dashboard and the Prometheus-based monitoring stack is now
provided via the cephadm-managed &lt;code&gt;mgmt-gateway&lt;/code&gt;. nginx high-availability mechanisms allow
the mgmt-gateway to detect healthy instances of the Dashboard, Prometheus, Grafana, and Alertmanager,
route traffic accordingly, and handle manager failover transparently. When deployed with a virtual
IP and multiple &lt;code&gt;mgmt-gateway&lt;/code&gt; instances, this architecture keeps management access available
even during daemon or host failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new &lt;code&gt;certmgr&lt;/code&gt; cephadm subsystem centralizes certificate lifecycle management for cephadm-managed
services. certmgr acts as a cluster-internal root CA for cephadm-signed certificates, it can also
consume user-provided certificates, and tracks how each certificate was provisioned. It standardizes
HTTPS configuration for services such as RGW and the mgmt-gateway, automates renewal and rotation of
cephadm-signed certificates, and raises health warnings when certificates are invalid, expiring or misconfigured.
With certmgr, cephadm-signed certificates are available across all cephadm-managed services, providing
secure defaults out of the box.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;cephfs&quot;&gt;CephFS &lt;a class=&quot;link-anchor&quot; href=&quot;#cephfs&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Directories may now be configured with case-insensitive or
normalized directory entry names. This is an inheritable configuration,
making it apply to an entire directory tree.&lt;/p&gt;
&lt;p&gt;For more information, see &lt;a href=&quot;https://docs.ceph.com/en/tentacle/cephfs/charmap/&quot;&gt;https://docs.ceph.com/en/tentacle/cephfs/charmap/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is now possible to pause the threads that asynchronously purge
deleted subvolumes by using the config option
&lt;code&gt;mgr/volumes/pause_purging&lt;/code&gt;.&lt;/p&gt;
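&lt;p&gt;A minimal sketch of toggling this option, assuming it is set like any other mgr module option via &lt;code&gt;ceph config&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph config set mgr mgr/volumes/pause_purging true   # pause async purging
$ ceph config set mgr mgr/volumes/pause_purging false  # resume
&lt;/code&gt;&lt;/pre&gt;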
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is now possible to pause the threads that asynchronously clone
subvolume snapshots by using the config option
&lt;code&gt;mgr/volumes/pause_cloning&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modifying the setting &lt;code&gt;max_mds&lt;/code&gt; when a cluster is
unhealthy now requires users to pass the confirmation flag
(&lt;code&gt;--yes-i-really-mean-it&lt;/code&gt;). This has been added as a precaution to inform
users that modifying &lt;code&gt;max_mds&lt;/code&gt; may not help with troubleshooting or recovery
efforts. Instead, it might further destabilize the cluster.&lt;/p&gt;
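&lt;p&gt;For example (hypothetical file system name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On an unhealthy cluster this now requires the confirmation flag
$ ceph fs set cephfs max_mds 2 --yes-i-really-mean-it
&lt;/code&gt;&lt;/pre&gt;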
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;EOPNOTSUPP&lt;/code&gt; (Operation not supported) is now returned by the CephFS
FUSE client for &lt;code&gt;fallocate&lt;/code&gt; in the default case (i.e., &lt;code&gt;mode == 0&lt;/code&gt;) since
CephFS does not support disk space reservation. The only flags supported are
&lt;code&gt;FALLOC_FL_KEEP_SIZE&lt;/code&gt; and &lt;code&gt;FALLOC_FL_PUNCH_HOLE&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;ceph fs subvolume snapshot getpath&lt;/code&gt; command now allows users
to get the path of a snapshot of a subvolume. If the snapshot is not present,
&lt;code&gt;ENOENT&lt;/code&gt; is returned.&lt;/p&gt;
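&lt;p&gt;A sketch of the expected usage, assuming the command follows the same argument pattern as the other &lt;code&gt;subvolume snapshot&lt;/code&gt; commands (volume, subvolume, then snapshot name, with an optional group):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# argument order assumed; check the command help on your cluster
$ ceph fs subvolume snapshot getpath myvol mysubvol mysnap --group_name mygroup
&lt;/code&gt;&lt;/pre&gt;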
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;ceph fs volume create&lt;/code&gt; command now allows users to pass
metadata and data pool names to be used for creating the volume. If either
is not passed, or if either is a non-empty pool, the command will abort.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The format of the pool namespace name for CephFS volumes has been changed
from &lt;code&gt;fsvolumens__&amp;lt;subvol-name&amp;gt;&lt;/code&gt; to
&lt;code&gt;fsvolumens__&amp;lt;subvol-grp-name&amp;gt;_&amp;lt;subvol-name&amp;gt;&lt;/code&gt; to avoid namespace collisions
when two subvolumes located in different subvolume groups have the same name.
Even with namespace collisions, there were no security issues, since the MDS
auth cap is restricted to the subvolume path. Now, with this change, the
namespaces are completely isolated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the subvolume name passed to the command &lt;code&gt;ceph fs subvolume info&lt;/code&gt;
is a clone, the output will now also contain a &amp;quot;source&amp;quot; field that tells the
user the name of the source snapshot along with the name of the volume,
subvolume group, and subvolume in which the source snapshot is located.
For clones created with Tentacle or an earlier release, the value of this
field will be &lt;code&gt;N/A&lt;/code&gt;. Regular subvolumes do not have a source subvolume and
therefore the output for them will not contain a &amp;quot;source&amp;quot; field regardless of
the release.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;crimson-%2F-seastore&quot;&gt;Crimson / SeaStore &lt;a class=&quot;link-anchor&quot; href=&quot;#crimson-%2F-seastore&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Crimson project continues to progress, with the Squid release marking the
first technical preview available for Crimson.
The Tentacle release introduces a host of improvements and new functionalities
that enhance the robustness, performance, and usability
of both Crimson-OSD and the SeaStore object store.
In this release, SeaStore can now be deployed alongside the Crimson-OSD!
Early testing and experimentation are highly encouraged and we’d greatly
appreciate any initial feedback rounds from the community to help guide future
improvements.
Check out the Crimson project updates blog post for Tentacle
where we highlight some of the work included in the latest release, moving us
closer to fully replacing the existing Classical OSD in the future:
&lt;a href=&quot;https://ceph.io/en/news/blog/2025/crimson-T-release/&quot;&gt;https://ceph.io/en/news/blog/2025/crimson-T-release/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you&#39;re new to the Crimson project, please visit the project
page for more information and resources: &lt;a href=&quot;https://ceph.io/en/news/crimson&quot;&gt;https://ceph.io/en/news/crimson&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&quot;dashboard&quot;&gt;Dashboard &lt;a class=&quot;link-anchor&quot; href=&quot;#dashboard&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;There is now added support for NVMe/TCP gateway groups and multiple
namespaces, multi-cluster management, OAuth 2.0 integration, and enhanced
RGW/SMB features including multi-site automation, tiering, policies,
lifecycles, notifications, and granular replication.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;mgr&quot;&gt;MGR &lt;a class=&quot;link-anchor&quot; href=&quot;#mgr&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The Ceph Manager&#39;s always-on modules/plugins can now be force-disabled.
This can be necessary in cases where we wish to prevent the manager from being
flooded by module commands when Ceph services are down or degraded.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;mgr/restful&lt;/code&gt;, &lt;code&gt;mgr/zabbix&lt;/code&gt;: both modules, deprecated since 2020, have
finally been removed. They have not been actively maintained in recent years
and have started suffering from vulnerabilities in their dependency chain (e.g.
CVE-2023-46136). An alternative for the &lt;code&gt;restful&lt;/code&gt; module is the &lt;code&gt;dashboard&lt;/code&gt; module,
which provides a richer and better maintained RESTful API. Regarding the &lt;code&gt;zabbix&lt;/code&gt; module,
there are alternative monitoring solutions, like &lt;code&gt;prometheus&lt;/code&gt;, which is the most
widely adopted among the Ceph user community.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;rados&quot;&gt;RADOS &lt;a class=&quot;link-anchor&quot; href=&quot;#rados&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Long-anticipated performance and space amplification optimizations (FastEC)
are added for erasure-coded pools, including partial reads and partial writes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new implementation of the Erasure Coding I/O code provides substantial
performance improvements and some capacity improvements. The new code is
designed to optimize performance when using Erasure Coding with block storage
(RBD) and file storage (CephFS) but will have benefits for object storage
(RGW), in particular when using smaller sized objects. A new flag
&lt;code&gt;allow_ec_optimizations&lt;/code&gt; must be set on each pool to switch to using the
new code. Existing pools can be upgraded once the OSD and Monitor daemons
have been updated. There is no need to update the clients.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The default plugin for erasure coded pools has been changed from Jerasure to
ISA-L. Clusters created on Tentacle or later releases will use ISA-L as the
default plugin when creating a new pool. Clusters that upgrade to the T release
will continue to use their existing default values. The default values can be
overridden by creating a new erasure code profile and selecting it when
creating a new pool. ISA-L is recommended for new pools because the Jerasure
library is no longer maintained.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;BlueStore now has better compression and a new, faster WAL (write-ahead-log).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All components have been switched to the faster OMAP iteration interface, which
improves RGW bucket listing and scrub operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is now possible to bypass &lt;code&gt;ceph_assert()&lt;/code&gt; in extreme cases to help with
disaster recovery.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Testing improvements for dencoding verification were added.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new command, &lt;code&gt;ceph osd pool availability-status&lt;/code&gt;, has been added that
allows users to view the availability score for each pool in a cluster. A pool
is considered unavailable if any PG in the pool is not &lt;code&gt;active&lt;/code&gt; or if
there are unfound objects. Otherwise the pool is considered available. The
score is updated every second by default. This interval can be changed
using the new config option &lt;code&gt;pool_availability_update_interval&lt;/code&gt;. The feature
is off by default. A new config option &lt;code&gt;enable_availability_tracking&lt;/code&gt; can be
used to turn on the feature if required. Another command is added to clear the
availability status for a specific pool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd pool clear-availability-status &amp;lt;pool-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This feature is in tech preview.&lt;/p&gt;
&lt;p&gt;Related links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Feature ticket: &lt;a href=&quot;https://tracker.ceph.com/issues/67777&quot;&gt;https://tracker.ceph.com/issues/67777&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href=&quot;https://docs.ceph.com/en/tentacle/rados/operations/monitoring/&quot;&gt;https://docs.ceph.com/en/tentacle/rados/operations/monitoring/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
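&lt;p&gt;A minimal sketch of turning the feature on and querying it, assuming the option is applied as a monitor-level setting via &lt;code&gt;ceph config&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# assumed to be a monitor-level option
$ ceph config set mon enable_availability_tracking true
$ ceph osd pool availability-status
&lt;/code&gt;&lt;/pre&gt;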
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leader monitor and stretch mode status are now included in the &lt;code&gt;ceph status&lt;/code&gt;
output.&lt;/p&gt;
&lt;p&gt;Related tracker: &lt;a href=&quot;https://tracker.ceph.com/issues/70406&quot;&gt;https://tracker.ceph.com/issues/70406&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;ceph df&lt;/code&gt; command reports incorrect &lt;code&gt;MAX AVAIL&lt;/code&gt; for stretch mode pools
when CRUSH rules use multiple take steps for datacenters. &lt;code&gt;PGMap::get_rule_avail&lt;/code&gt;
incorrectly calculates available space from only one datacenter. As a workaround,
define CRUSH rules with &lt;code&gt;take default&lt;/code&gt; and &lt;code&gt;choose firstn 0 type datacenter&lt;/code&gt;.
See &lt;a href=&quot;https://tracker.ceph.com/issues/56650#note-6&quot;&gt;https://tracker.ceph.com/issues/56650#note-6&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;Upgrading a cluster configured with a CRUSH rule with multiple take steps can
lead to data shuffling, as the new CRUSH changes may necessitate data
redistribution. In contrast, a stretch rule with a single-take configuration
will not cause any data movement during the upgrade process.&lt;/p&gt;
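&lt;p&gt;A sketch of a single-take stretch rule along those lines (rule name, id, and replica counts are illustrative, not prescriptive):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# illustrative single-take stretch rule; adjust names and counts for your cluster
rule stretch_rule {
    id 2
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
&lt;/code&gt;&lt;/pre&gt;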
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added convenience function &lt;code&gt;librados::AioCompletion::cancel()&lt;/code&gt; with the same
behavior as &lt;code&gt;librados::IoCtx::aio_cancel()&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The configuration parameter &lt;code&gt;osd_repair_during_recovery&lt;/code&gt; has been removed.
That configuration flag used to control whether an operator-initiated &amp;quot;repair
scrub&amp;quot; would be allowed to start on an OSD that is performing a recovery. In
this Ceph version, operator-initiated scrubs and repair scrubs are never blocked
by a repair being performed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fixed issue of recovery/backfill hang due to improper handling of items in the
dmclock&#39;s background clean-up thread.&lt;/p&gt;
&lt;p&gt;Related tracker: &lt;a href=&quot;https://tracker.ceph.com/issues/61594&quot;&gt;https://tracker.ceph.com/issues/61594&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The OSD&#39;s IOPS capacity used by the mClock scheduler is now also checked to
determine if it&#39;s below a configured threshold value defined by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;osd_mclock_iops_capacity_low_threshold_hdd&lt;/code&gt; – set to 50 IOPS&lt;/li&gt;
&lt;li&gt;&lt;code&gt;osd_mclock_iops_capacity_low_threshold_ssd&lt;/code&gt; – set to 1000 IOPS&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The check is intended to handle cases where the measured IOPS is unrealistically
low. If such a case is detected, the IOPS capacity is either set to the last
valid value or the configured default to avoid affecting cluster performance
(slow or stalled ops).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Documentation has been updated with steps to override OSD IOPS capacity
configuration.&lt;/p&gt;
&lt;p&gt;Related links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tracker ticket: &lt;a href=&quot;https://tracker.ceph.com/issues/70774&quot;&gt;https://tracker.ceph.com/issues/70774&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href=&quot;https://docs.ceph.com/en/tentacle/rados/configuration/mclock-config-ref/&quot;&gt;https://docs.ceph.com/en/tentacle/rados/configuration/mclock-config-ref/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pybind/rados: Fixed &lt;code&gt;WriteOp.zero()&lt;/code&gt;, which passed its &lt;code&gt;offset&lt;/code&gt; and
&lt;code&gt;length&lt;/code&gt; arguments in reverse order. Previously, the arguments pybind passed did
not match &lt;code&gt;rados_write_op_zero&lt;/code&gt;; offset and length were swapped, which
resulted in an unexpected response.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;rbd&quot;&gt;RBD &lt;a class=&quot;link-anchor&quot; href=&quot;#rbd&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;RBD images can now be instantly imported from another Ceph cluster. The
migration source spec for &lt;code&gt;native&lt;/code&gt; format has grown &lt;code&gt;cluster_name&lt;/code&gt; and
&lt;code&gt;client_name&lt;/code&gt; optional fields for connecting to the source cluster after
parsing the respective &lt;code&gt;ceph.conf&lt;/code&gt;-like configuration file.&lt;/p&gt;
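&lt;p&gt;A sketch of what a &lt;code&gt;native&lt;/code&gt; source spec using these new fields might look like (all names are placeholders; the &lt;code&gt;pool_name&lt;/code&gt; and &lt;code&gt;image_name&lt;/code&gt; fields are part of the existing native format):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cat source-spec.json
{
  &amp;quot;type&amp;quot;: &amp;quot;native&amp;quot;,
  &amp;quot;cluster_name&amp;quot;: &amp;quot;prod&amp;quot;,
  &amp;quot;client_name&amp;quot;: &amp;quot;client.migration&amp;quot;,
  &amp;quot;pool_name&amp;quot;: &amp;quot;rbd&amp;quot;,
  &amp;quot;image_name&amp;quot;: &amp;quot;legacy-image&amp;quot;
}
$ rbd migration prepare --import-only --source-spec-path source-spec.json rbd/imported-image
&lt;/code&gt;&lt;/pre&gt;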
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With the help of the new NBD stream (&lt;code&gt;&amp;quot;type&amp;quot;: &amp;quot;nbd&amp;quot;&lt;/code&gt;), RBD images can now
be instantly imported from a wide variety of external sources/formats. The
exact set of supported formats and their features depends on the capabilities
of the NBD server.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While mirroring between Ceph clusters, the local and remote RBD namespaces
don&#39;t need to be the same anymore (but the pool names still do). Using the
new &lt;code&gt;--remote-namespace&lt;/code&gt; option of &lt;code&gt;rbd mirror pool enable&lt;/code&gt; command, it&#39;s
now possible to pair a local namespace with an arbitrary remote namespace in
the respective pool, including mapping a default namespace to a non-default
namespace and vice versa, at the time mirroring is configured.&lt;/p&gt;
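&lt;p&gt;For example, pairing a local namespace with a differently named remote namespace (pool and namespace names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ rbd mirror pool enable rbd/ns-a image --remote-namespace ns-b
&lt;/code&gt;&lt;/pre&gt;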
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All Python APIs that produce timestamps now return &amp;quot;aware&amp;quot; &lt;code&gt;datetime&lt;/code&gt;
objects instead of &amp;quot;naive&amp;quot; ones (i.e., those including time zone information
instead of those not including it). All timestamps remain in UTC, but
including &lt;code&gt;timezone.utc&lt;/code&gt; makes it explicit and avoids the potential of the
returned timestamp getting misinterpreted. In Python 3, many &lt;code&gt;datetime&lt;/code&gt;
methods treat &amp;quot;naive&amp;quot; &lt;code&gt;datetime&lt;/code&gt; objects as local times.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rbd group info&lt;/code&gt; and &lt;code&gt;rbd group snap info&lt;/code&gt; commands are introduced to
show information about a group and a group snapshot respectively.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rbd group snap ls&lt;/code&gt; output now includes the group snapshot IDs. The header
of the column showing the state of a group snapshot in the unformatted CLI
output is changed from &lt;code&gt;STATUS&lt;/code&gt; to &lt;code&gt;STATE&lt;/code&gt;. The state of a group snapshot
that was shown as &lt;code&gt;ok&lt;/code&gt; is now shown as &lt;code&gt;complete&lt;/code&gt;, which is more
descriptive.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In &lt;code&gt;rbd mirror image status&lt;/code&gt; and &lt;code&gt;rbd mirror pool status --verbose&lt;/code&gt;
outputs, &lt;code&gt;mirror_uuids&lt;/code&gt; field has been renamed to &lt;code&gt;mirror_uuid&lt;/code&gt; to
highlight that the value is always a single UUID and never a list of any
kind.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Moving an image that is a member of a group to trash is no longer
allowed. The &lt;code&gt;rbd trash mv&lt;/code&gt; command now behaves the same way as &lt;code&gt;rbd rm&lt;/code&gt;
in this scenario.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rbd device map&lt;/code&gt; command now defaults to &lt;code&gt;msgr2&lt;/code&gt; for all device types.
&lt;code&gt;-o ms_mode=legacy&lt;/code&gt; can be passed to continue using &lt;code&gt;msgr1&lt;/code&gt; with krbd.&lt;/p&gt;
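&lt;p&gt;For example (hypothetical image name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ rbd device map rbd/myimage                    # msgr2 by default
$ rbd device map rbd/myimage -o ms_mode=legacy  # stick with msgr1 for krbd
&lt;/code&gt;&lt;/pre&gt;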
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The family of diff-iterate APIs has been extended to allow diffing from or
between non-user type snapshots which can only be referred to by their IDs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fetching the mirroring mode of an image is invalid if the image is
disabled for mirroring. The public APIs -- C++ &lt;code&gt;mirror_image_get_mode()&lt;/code&gt;,
C &lt;code&gt;rbd_mirror_image_get_mode()&lt;/code&gt;, and Python &lt;code&gt;Image.mirror_image_get_mode()&lt;/code&gt;
-- will return &lt;code&gt;EINVAL&lt;/code&gt; when mirroring is disabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Promoting an image is invalid if the image is not enabled for mirroring.
The public APIs -- C++ &lt;code&gt;mirror_image_promote()&lt;/code&gt;,
C &lt;code&gt;rbd_mirror_image_promote()&lt;/code&gt;, and Python &lt;code&gt;Image.mirror_image_promote()&lt;/code&gt;
-- will return EINVAL instead of ENOENT when mirroring is not enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Requesting a resync on an image is invalid if the image is not enabled
for mirroring. The public APIs -- C++ &lt;code&gt;mirror_image_resync()&lt;/code&gt;,
C &lt;code&gt;rbd_mirror_image_resync()&lt;/code&gt;, and Python &lt;code&gt;Image.mirror_image_resync()&lt;/code&gt;
-- will return EINVAL instead of ENOENT when mirroring is not enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;rgw&quot;&gt;RGW &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Multiple fixes: Lua scripts will no longer run uselessly against health checks,
and properly quoted &lt;code&gt;ETag&lt;/code&gt; values are now returned by S3 &lt;code&gt;CopyPart&lt;/code&gt;, &lt;code&gt;PostObject&lt;/code&gt;, and
&lt;code&gt;CompleteMultipartUpload&lt;/code&gt; responses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;IAM policy evaluation now supports conditions &lt;code&gt;ArnEquals&lt;/code&gt; and &lt;code&gt;ArnLike&lt;/code&gt;,
along with their &lt;code&gt;Not&lt;/code&gt; and &lt;code&gt;IfExists&lt;/code&gt; variants.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added BEAST frontend option &lt;code&gt;so_reuseport&lt;/code&gt; which facilitates running multiple
RGW instances on the same host by sharing a single TCP port.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Replication policies now validate permissions using
&lt;code&gt;s3:ReplicateObject&lt;/code&gt;, &lt;code&gt;s3:ReplicateDelete&lt;/code&gt;, and &lt;code&gt;s3:ReplicateTags&lt;/code&gt; for
destination buckets. For source buckets, both
&lt;code&gt;s3:GetObjectVersionForReplication&lt;/code&gt; and &lt;code&gt;s3:GetObject(Version)&lt;/code&gt; are
supported. Actions like &lt;code&gt;s3:GetObjectAcl&lt;/code&gt;, &lt;code&gt;s3:GetObjectLegalHold&lt;/code&gt;, and
&lt;code&gt;s3:GetObjectRetention&lt;/code&gt; are also considered when fetching the source object.
Replication of tags is controlled by the
&lt;code&gt;s3:GetObject(Version)Tagging&lt;/code&gt; permission.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added missing quotes to the &lt;code&gt;ETag&lt;/code&gt; values returned by S3 &lt;code&gt;CopyPart&lt;/code&gt;,
&lt;code&gt;PostObject&lt;/code&gt;, and &lt;code&gt;CompleteMultipartUpload&lt;/code&gt; responses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;PutObjectLockConfiguration&lt;/code&gt; can now be used to enable S3 Object Lock on an
existing versioning-enabled bucket that was not created with Object Lock enabled.&lt;/p&gt;
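&lt;p&gt;A sketch of what this looks like from an S3 client&#39;s perspective, here using the AWS CLI against an RGW endpoint (endpoint and bucket name are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ aws --endpoint-url https://rgw.example.com s3api put-object-lock-configuration \
      --bucket mybucket \
      --object-lock-configuration &#39;{&amp;quot;ObjectLockEnabled&amp;quot;: &amp;quot;Enabled&amp;quot;}&#39;
&lt;/code&gt;&lt;/pre&gt;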
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;x-amz-confirm-remove-self-bucket-access&lt;/code&gt; header is now supported by
&lt;code&gt;PutBucketPolicy&lt;/code&gt;. Additionally, the root user will always have access to
modify the bucket policy, even if the current policy explicitly denies access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Added support for the &lt;code&gt;RestrictPublicBuckets&lt;/code&gt; property of the S3
&lt;code&gt;PublicAccessBlock&lt;/code&gt; configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The HeadBucket API now reports the &lt;code&gt;X-RGW-Bytes-Used&lt;/code&gt; and &lt;code&gt;X-RGW-Object-Count&lt;/code&gt;
headers only when the &lt;code&gt;read-stats&lt;/code&gt; querystring is explicitly included in the
API request.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;telemetry&quot;&gt;Telemetry &lt;a class=&quot;link-anchor&quot; href=&quot;#telemetry&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;basic&lt;/code&gt; channel in telemetry now captures the &lt;code&gt;ec_optimizations&lt;/code&gt;
flag, which will allow us to gauge feature adoption for the new
FastEC improvements.
To opt into telemetry, run &lt;code&gt;ceph telemetry on&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;upgrading-from-reef-or-squid&quot;&gt;&lt;a id=&quot;upgrade&quot;&gt;&lt;/a&gt;Upgrading from Reef or Squid &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-from-reef-or-squid&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Before starting, ensure that your cluster is stable and healthy with no
&lt;code&gt;down&lt;/code&gt;, &lt;code&gt;recovering&lt;/code&gt;, &lt;code&gt;incomplete&lt;/code&gt;, &lt;code&gt;undersized&lt;/code&gt; or &lt;code&gt;backfilling&lt;/code&gt; PGs.
You can temporarily disable the PG autoscaler for all pools during the upgrade
by running &lt;code&gt;ceph osd pool set noautoscale&lt;/code&gt; before beginning, and if the
autoscaler is desired after completion, running &lt;code&gt;ceph osd pool unset noautoscale&lt;/code&gt;
after upgrade success is confirmed.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You can monitor the progress of your upgrade at each stage with the
&lt;code&gt;ceph versions&lt;/code&gt; command, which will tell you what Ceph version(s) are running
for each type of daemon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;upgrading-cephadm-clusters&quot;&gt;Upgrading Cephadm Clusters &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-cephadm-clusters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If your cluster is deployed with cephadm (first introduced in Octopus), then the upgrade process is entirely automated. To initiate the upgrade,&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade start --image quay.io/ceph/ceph:v20.2.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same process is used to upgrade to future minor releases.&lt;/p&gt;
&lt;p&gt;Upgrade progress can be monitored with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade status
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Upgrade progress can also be monitored with &lt;code&gt;ceph -s&lt;/code&gt; (which provides a simple progress bar) or more verbosely with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph -W cephadm
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The upgrade can be paused or resumed with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade pause  # to pause
$ ceph orch upgrade resume # to resume
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or canceled with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph orch upgrade stop
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that canceling the upgrade simply stops the process. There is no ability to downgrade back to Reef or Squid.&lt;/p&gt;
&lt;h3 id=&quot;upgrading-non-cephadm-clusters&quot;&gt;Upgrading Non-cephadm Clusters &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-non-cephadm-clusters&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If your cluster is running Reef (18.2.x) or later, you might choose
to first convert it to use cephadm so that the upgrade to Tentacle is automated (see above).
For more information, see &lt;a href=&quot;https://docs.ceph.com/en/tentacle/cephadm/adoption/&quot;&gt;https://docs.ceph.com/en/tentacle/cephadm/adoption/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your cluster is running Reef (18.2.x) or later, systemd unit file
names have changed to include the cluster fsid. To find the correct
systemd unit file name for your cluster, run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl -l | grep &amp;lt;daemon type&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl -l | grep mon | grep active

ceph-6ce0347c-314a-11ee-9b52-000af7995d6c@mon.f28-h21-000-r630.service loaded active running Ceph mon.f28-h21-000-r630 for 6ce0347c-314a-11ee-9b52-000af7995d6c
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Set the &lt;code&gt;noout&lt;/code&gt; flag for the duration of the upgrade. (Optional, but recommended.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd set noout
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade Monitors by installing the new packages and restarting the Monitor daemons. For example, on each Monitor host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-mon.target
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once all Monitors are up, verify that the Monitor upgrade is complete by looking for the &lt;code&gt;tentacle&lt;/code&gt; string in the mon map. The command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph mon dump | grep min_mon_release
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;should report:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;min_mon_release 20 (tentacle)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If it does not, that implies that one or more Monitors haven&#39;t been upgraded and restarted and/or the quorum does not include all Monitors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade &lt;code&gt;ceph-mgr&lt;/code&gt; daemons by installing the new packages and restarting all Manager daemons. For example, on each Manager host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-mgr.target
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the &lt;code&gt;ceph-mgr&lt;/code&gt; daemons are running by checking &lt;code&gt;ceph -s&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph -s

...
  services:
   mon: 3 daemons, quorum foo,bar,baz
   mgr: foo(active), standbys: bar, baz
...
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade all OSDs by installing the new packages and restarting the &lt;code&gt;ceph-osd&lt;/code&gt; daemons on all OSD hosts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-osd.target
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade all CephFS MDS daemons. For each CephFS file system:&lt;/p&gt;
&lt;p&gt;5.1. Disable standby_replay:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph fs set &amp;lt;fs_name&amp;gt; allow_standby_replay false     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.2. Reduce the number of ranks to 1. (Make note of the original number of MDS daemons first if you plan to restore it later.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph fs set &amp;lt;fs_name&amp;gt; max_mds 1     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.3. Wait for the cluster to deactivate any non-zero ranks by periodically checking the status:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph status     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.4. Take all standby MDS daemons offline on the appropriate hosts with:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ systemctl stop ceph-mds@&amp;lt;daemon_name&amp;gt;     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.5. Confirm that only one MDS is online and is rank 0 for your FS:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph status     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.6. Upgrade the last remaining MDS daemon by installing the new packages and restarting the daemon:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ systemctl restart ceph-mds.target     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.7. Restart all standby MDS daemons that were taken offline:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ systemctl start ceph-mds.target     &lt;/code&gt;&lt;/p&gt;
&lt;p&gt;5.8. Restore the original value of &lt;code&gt;max_mds&lt;/code&gt; for the volume:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;     $ ceph fs set &amp;lt;fs_name&amp;gt; max_mds &amp;lt;original_max_mds&amp;gt;     &lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgrade all &lt;code&gt;radosgw&lt;/code&gt; daemons by upgrading packages and restarting daemons on all hosts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ systemctl restart ceph-radosgw.target
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Complete the upgrade by disallowing pre-Tentacle OSDs and enabling all new Tentacle-only functionality:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd require-osd-release tentacle
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you set &lt;code&gt;noout&lt;/code&gt; at the beginning, be sure to clear it with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph osd unset noout
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Consider transitioning your cluster to use the cephadm deployment and orchestration framework to simplify
cluster management and future upgrades. For more information on converting an existing cluster to cephadm,
see &lt;a href=&quot;https://docs.ceph.com/en/tentacle/cephadm/adoption/&quot;&gt;https://docs.ceph.com/en/tentacle/cephadm/adoption/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;post-upgrade&quot;&gt;Post-upgrade &lt;a class=&quot;link-anchor&quot; href=&quot;#post-upgrade&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Verify the cluster is healthy with &lt;code&gt;ceph health&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Consider enabling telemetry to send anonymized usage statistics
and crash information to Ceph upstream developers. To see what would
be reported without actually sending any information to anyone:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph telemetry preview-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are comfortable with the data that is reported, you can opt-in to automatically report high-level cluster metadata with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ceph telemetry on
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The public dashboard that aggregates Ceph telemetry can be found at &lt;a href=&quot;https://telemetry-public.ceph.com/&quot;&gt;https://telemetry-public.ceph.com/&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;upgrading-from-pre-reef-releases-(like-quincy)&quot;&gt;&lt;a id=&quot;upgrade-from-older-release&quot;&gt;&lt;/a&gt;Upgrading from Pre-Reef Releases (like Quincy) &lt;a class=&quot;link-anchor&quot; href=&quot;#upgrading-from-pre-reef-releases-(like-quincy)&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;You &lt;strong&gt;must&lt;/strong&gt; first upgrade to Reef (18.2.z) or Squid (19.2.z) before upgrading to Tentacle.&lt;/p&gt;
&lt;h2 id=&quot;thank-you-to-our-contributors&quot;&gt;&lt;a id=&quot;contributors&quot;&gt;&lt;/a&gt;Thank You to Our Contributors &lt;a class=&quot;link-anchor&quot; href=&quot;#thank-you-to-our-contributors&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We express our gratitude to all members of the Ceph community who contributed by proposing pull requests, testing this release,
providing feedback, and offering valuable suggestions.&lt;/p&gt;
&lt;p&gt;If you are interested in helping test the next release, Umbrella, please join us at the
&lt;a href=&quot;https://ceph-storage.slack.com/archives/C04Q3D7HV1T&quot;&gt;#ceph-at-scale&lt;/a&gt; Slack channel.&lt;/p&gt;
&lt;p&gt;The Tentacle release would not be possible without the contributions of the
community:&lt;/p&gt;
&lt;p&gt;Aashish Sharma ▪
Abhishek Desai ▪
Abhishek Kane ▪
Abhishek Lekshmanan ▪
Achint Kaur ▪
Achintk1491 ▪
Adam C. Emerson ▪
Adam King ▪
Adam Kupczyk ▪
Adam Lyon-Jones ▪
Adarsh Ashokan ▪
Afreen Misbah ▪
Aishwarya Mathuria ▪
Alex Ainscow ▪
Alex Kershaw ▪
Alex Wojno ▪
Alexander Indenbaum ▪
Alexey Odinokov ▪
Alexon Oliveira ▪
Ali Maredia ▪
Ali Masarwa ▪
Aliaksei Makarau ▪
Anatoly Scheglov ▪
Andrei Ivashchenko ▪
Ankit Kumar ▪
Ankush Behl ▪
Anmol Babu ▪
Anoop C S ▪
Anthony D Atri ▪
Anuradha Gadge ▪
Anushruti Sharma ▪
arm7star ▪
Artem Vasilev ▪
Avan Thakkar ▪
Aviv Caro ▪
Benedikt Heine ▪
Bernard Landon ▪
Bill Scales ▪
Brad Hubbard ▪
Brian P ▪
bugwz ▪
cailianchun ▪
Casey Bodley ▪
Chanyoung Park ▪
Chen Yuanrun ▪
Chengen Du ▪
Christian Rohmann ▪
Christopher Hoffman ▪
chungfengz ▪
Chunmei Liu ▪
Connor Fawcett ▪
Cory Snyder ▪
Cybertinus ▪
daijufang ▪
Dan Mick ▪
Dan van der Ster ▪
Daniel Gryniewicz ▪
Danny Al-Gaaf ▪
DanWritesCode ▪
David Galloway ▪
Deepika Upadhyay ▪
Dhairya Parmar ▪
Divyansh Kamboj ▪
Dnyaneshwari ▪
Dominique Leuenberger ▪
Dongdong Tao ▪
Doug Whitfield ▪
Drunkard Zhang ▪
Effi Ofer ▪
Emin ▪
Emin Mert Sunacoglu ▪
Enrico Bocchi ▪
Enrico De Fent ▪
er0k ▪
Erik Sjölund ▪
Ernesto Puerta ▪
Ethan Wu ▪
Feng, Hualong ▪
Florent Carli ▪
Gabriel BenHanokh ▪
Gal Salomon ▪
Garry Drankovich ▪
Gil Bregman ▪
Gilad Sid ▪
gitkenan ▪
Gregory O&#39;Neill ▪
Guillaume Abrioux ▪
gukaifeng ▪
Hannes Baum ▪
haoyixing ▪
hejindong ▪
Hezko ▪
Hoai-Thu Vuong ▪
Hualong Feng ▪
Hyun Jin Kim ▪
igomon ▪
Igor Fedotov ▪
Igor Golikov ▪
Ilya Dryomov ▪
imtzw ▪
Indira Sawant ▪
Ivo Almeida ▪
J. Eric Ivancich ▪
Jakob Haufe ▪
James Oakley ▪
Jamie Pryde ▪
Jane Zhu ▪
Janne Heß ▪
Jannis Speer ▪
Jared Yu ▪
Jaya Prakash ▪
Jayaprakash-ibm ▪
Jesse F. Williamson ▪
Jesse Williamson ▪
Jianwei Zhang ▪
Jianxin Li ▪
jiawd ▪
Jiffin Tony Thottan ▪
Joao Eduardo Luis ▪
Joel Davidow ▪
John Agombar ▪
John Mulligan ▪
Jon Bailey ▪
Jos Collin ▪
Jose J Palacios-Perez ▪
Joshua Baergen ▪
Joshua Blanch ▪
Juan Ferrer Toribio ▪
Juan Miguel Olmo Martínez ▪
julpark ▪
junxiang Mu ▪
Kalpesh Pandya ▪
Kamoltat Sirivadhna ▪
kchheda3 ▪
Kefu Chai ▪
Ken Dreyer ▪
Kevin Niederwanger ▪
Kevin Zhao ▪
Kotresh Hiremath Ravishankar ▪
Kritik Sachdeva ▪
Kushal Deb ▪
Kushal Jyoti Deb ▪
Kyrylo Shatskyy ▪
Laimis Juzeliūnas ▪
Laura Flores ▪
Lee Sanders ▪
Leo Mylonas ▪
Leonid Chernin ▪
Leonid Usov ▪
lightmelodies ▪
Linjing Li ▪
liubingrun ▪
lizhipeng ▪
Lorenz Bausch ▪
Luc Ritchie ▪
Lucian Petrut ▪
Luo Rixin ▪
Ma Jianpeng ▪
Marc Singer ▪
Marcel Lauhoff ▪
Mark Kogan ▪
Mark Nelson ▪
Martin Nowak ▪
Matan Breizman ▪
Matt Benjamin ▪
Matt Vandermeulen ▪
Matteo Paramatti ▪
Matthew Vernon ▪
Max Carrara ▪
Max Kellermann ▪
Md Mahamudur Rahaman Sajib ▪
Michael J. Kidd ▪
Michal Nasiadka ▪
Mike Perez ▪
Miki Patel ▪
Milind Changire ▪
Mindy Preston ▪
Mingyuan Liang ▪
Mohit Agrawal ▪
molpako ▪
mosayyebzadeh ▪
Mouratidis Theofilos ▪
Mykola Golub ▪
Myoungwon Oh ▪
N Balachandran ▪
Naman Munet ▪
Naveen Naidu ▪
nbalacha ▪
Neeraj Pratap Singh ▪
Neha Ojha ▪
Niklas Hambüchen ▪
Nithya Balachandran ▪
Nitzan Mordechai ▪
Nizamudeen A ▪
Oguzhan Ozmen ▪
Omid Yoosefi ▪
Omri Zeneva ▪
Or Ozeri ▪
Orit Wasserman ▪
Oshrey Avraham ▪
Patrick Donnelly ▪
Paul Cuzner ▪
Paul Stemmet ▪
Paulo E. Castro ▪
Pedro Gonzalez Gomez ▪
Pere Diaz Bou ▪
Peter Sabaini ▪
Pierre Riteau ▪
Piotr Parczewski ▪
Piyush Agarwal ▪
Ponnuvel Palaniyappan ▪
Prachi Goel ▪
Prashant D ▪
prik73 ▪
Pritha Srivastava ▪
Puja Shahu ▪
pujashahu ▪
qn2060 ▪
Radoslaw Zarzynski ▪
Raja Sharma ▪
Ramana Raja ▪
Redouane Kachach ▪
rhkelson ▪
Richard Poole ▪
Rishabh Dave ▪
Robin Geuze ▪
Ronen Friedman ▪
Rongqi Sun ▪
Rostyslav Khudov ▪
Roy Sahar ▪
Ryotaro Banno ▪
Sachin Prabhu ▪
Sachin Punadikar ▪
Sam Goyal ▪
Samarah Uriarte ▪
Samuel Just ▪
Satoru Takeuchi ▪
Seena Fallah ▪
Shachar Sharon ▪
Shasha Lu ▪
Shawn Edwards ▪
Shen Jiatong ▪
Shilpa Jagannath ▪
shimin ▪
Shinya Hayashi ▪
Shraddha Agrawal ▪
Shreya Sapale ▪
Shreyansh Sancheti ▪
Shrish0098 ▪
Shua Lv ▪
Shweta Bhosale ▪
Shweta Sodani ▪
Shwetha K Acharya ▪
Sidharth Anupkrishnan ▪
Silent ▪
Simon Jürgensmeyer ▪
Soumya Koduri ▪
Sridhar Seshasayee ▪
Srinivasa Bharath Kanta ▪
Stellios Williams ▪
Steven Chien ▪
Sun Lan ▪
Sungjoon Koh ▪
Sungmin Lee ▪
Sunil Angadi ▪
Sunnat Samadov ▪
Surya Kumari Jangala ▪
Suyash Dongre ▪
T K Chandra Hasan ▪
Taha Jahangir ▪
Tan Changzhi ▪
Teng Jie ▪
Teoman Onay ▪
Thomas Lamprecht ▪
Tobias Fischer ▪
Tobias Urdin ▪
Tod Chen ▪
Tomer Haskalovitch ▪
TomNewChao ▪
Toshikuni Fukaya ▪
Trang Tran ▪
TruongSinh Tran-Nguyen ▪
Tyler Brekke ▪
Tyler Stachecki ▪
Umesh Muthuvara ▪
Vallari Agrawal ▪
Venky Shankar ▪
Victoria Mackie ▪
Ville Ojamo ▪
Vinay Bhaskar Varada ▪
Wang Chao ▪
wanglinke ▪
Xavi Hernandez ▪
Xiubo Li ▪
Xuehan Xu ▪
XueYu Bai ▪
Yaarit Hatuka ▪
Yan, Zheng ▪
Yantao Xue ▪
Yao guotao ▪
Yehuda Sadeh ▪
Yingxin Cheng ▪
Yite Gu ▪
Yonatan Zaken ▪
Yuri Weinstein ▪
Yuval Lifshitz ▪
Zac Dover ▪
Zack Cerza ▪
Zaken ▪
Zhang Song ▪
zhangjianwei2 ▪
Zhansong Gao ▪
Zhipeng Li ▪
胡玮文&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Finding My Place in the Ceph Community: Reflections Ahead of Cephalocon 2025</title>
    <link href="https://ceph.io/en/news/blog/2025/PoweredbyPeopleBlog/" />
    <updated>2025-10-22T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/PoweredbyPeopleBlog/</id>
    <author>
      <name>Anthony Middleton</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="cephalocon" />
      <category term="community" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/PoweredbyPeopleBlog/">&lt;h2 id=&quot;finding-my-place-in-the-ceph-community%3A-reflections-ahead-of-cephalocon-2025&quot;&gt;Finding My Place in the Ceph Community: Reflections Ahead of Cephalocon 2025 &lt;a class=&quot;link-anchor&quot; href=&quot;#finding-my-place-in-the-ceph-community%3A-reflections-ahead-of-cephalocon-2025&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Six months ago, I stepped into an exciting chapter in my career by joining the Ceph Foundation as Community Manager. I saw it as an opportunity to grow a community, enhance my marketing skills, and harness my passion for organizing events to make a positive impact. What I didn&#39;t expect was how truly rewarding this journey would be. So far, I&#39;ve managed many campaigns, but the most significant was preparing for &lt;a href=&quot;https://events.linuxfoundation.org/cephalocon/&quot;&gt;Cephalocon&lt;/a&gt;, taking place in &lt;strong&gt;Vancouver, BC, on October 28–29&lt;/strong&gt;. They say what doesn&#39;t break you makes you stronger, and after six months with Ceph, I honestly feel like the Incredible Hulk: stronger, more resilient, and inspired by the power of open collaboration.&lt;/p&gt;
&lt;p&gt;Cephalocon is the annual gathering of the global Ceph community, where contributors, users, and developers share ideas, exchange knowledge, and celebrate open-source storage progress. It&#39;s a space for innovation and collaboration, often sparking the next breakthrough for the project. This year&#39;s event in Vancouver aims to increase user engagement and showcase Ceph&#39;s versatility with real-world use cases.&lt;/p&gt;
&lt;p&gt;As the Ceph Community Manager, I was invited to attend Cephalocon this year and deliver a presentation. This will be my first Cephalocon and my first visit to Vancouver, BC. I&#39;ve spent my time with Ceph connecting with community members around the world, and all of those interactions have been through screens. I look forward to meeting many community members with whom I have partnered, including the collection of developers, operators, and advocates who make Ceph what it is.&lt;/p&gt;
&lt;p&gt;Along with organizing the details of Cephalocon with the Ceph Events Team, I&#39;ve had the privilege of supporting incredible contributors who share their stories through blog posts, tech talks, and open discussions. I’ve collaborated with participants in programs including the Google Summer of Code and the Ceph Developer Summit, which have shown me just how passionate, innovative, and collaborative this community really is. I’ve also been fortunate to work alongside people like Gaurav Sitlani, whose insights helped shape the Ceph Ambassador Program into a growing network of talented advocates; Frédéric Nass, who has been an amazing collaborator on Ceph blogs and events; Anthony D’Atri, who has been instrumental in refining our communication tools; and Joseph Mundackal, who taught me the ropes of making GitHub pull requests for &lt;a href=&quot;http://ceph.io&quot;&gt;ceph.io&lt;/a&gt; updates.&lt;/p&gt;
&lt;p&gt;The Ceph community has quickly proven to be one of the most inspiring groups I’ve ever worked with. Every person I’ve met brings a story of solving challenges, scaling systems, and believing in open collaboration. My talk at Cephalocon 2025, &lt;em&gt;Powered by People: Growing the Ceph Community Through User Engagement&lt;/em&gt;, will explore where the Ceph community has been and where we&#39;re headed next. I&#39;ll share how user engagement, storytelling, and cross-community collaboration are shaping the next chapter of the Ceph Foundation&#39;s work, and how every contributor plays a role in building a stronger, more connected ecosystem.&lt;/p&gt;
&lt;p&gt;If you&#39;re attending Cephalocon this year, I&#39;d love for you to join my session. If you aren&#39;t, there&#39;s still time to register! You&#39;ll learn how to get more involved with the Ceph Foundation, how we’re building tools to recognize contributors across the ecosystem, and how you can make your voice heard. Ceph’s greatest strength has always been its people, and together, we’re building something extraordinary.&lt;/p&gt;
&lt;p&gt;Cephalocon 2025 will take place in Vancouver, BC, on October 28-29.&lt;/p&gt;
&lt;p&gt;&lt;a class=&quot;button&quot; href=&quot;https://events.linuxfoundation.org/cephalocon/register/&quot;&gt;Register Today!&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Object Storage Deep Dive Series. Part 1</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-deep-dive-1/" />
    <updated>2025-10-15T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-deep-dive-1/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-deep-dive-1/">&lt;h2 id=&quot;ceph-rgw-architecture%3A-a-deep-dive-into-its-core-foundations&quot;&gt;Ceph RGW Architecture: A Deep Dive into its Core Foundations &lt;a class=&quot;link-anchor&quot; href=&quot;#ceph-rgw-architecture%3A-a-deep-dive-into-its-core-foundations&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;introduction%3A-the-stateless-powerhouse&quot;&gt;Introduction: The Stateless Powerhouse &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction%3A-the-stateless-powerhouse&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Ceph Object Gateway (RGW) is far more than just a proxy; it&#39;s a high-level
abstraction layer that seamlessly provides Amazon S3 and OpenStack Swift
RESTful APIs on top of the underlying Reliable Autonomic Distributed
Object Store (RADOS). For storage architects, the RGW is crucial because
it translates standard HTTP requests for object operations into native RADOS
operations executed directly against the cluster. This allows applications
built for popular cloud object storage ecosystems to leverage a Ceph cluster as
their storage backend without modification.&lt;/p&gt;
&lt;p&gt;A fundamental principle governing RGW&#39;s design is its stateless nature. This
critical architectural decision is the bedrock of its massive horizontal
scalability and high availability. Since RGW daemons maintain no persistent
state related to client sessions, you can achieve near-linear performance
scaling simply by deploying more RGW instances behind a standard load balancer.
The failure of any single RGW daemon is a non-critical event because the load
balancer can redirect client traffic to the remaining healthy instances, making
the outage transparent to end-users. All vital state, including user metadata,
bucket definitions, ACLs, and object data, is durably stored within the RADOS cluster in designated pools.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img1.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
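&lt;p&gt;As a hedged illustration of that elasticity (the service name and daemon count below are arbitrary, not taken from a real deployment), adding RGW instances with cephadm is a one-line operation; the load balancer in front simply picks up the new endpoints:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Scale the (hypothetical) rgw.myrealm.myzone service out to three daemons
$ ceph orch apply rgw myrealm.myzone --placement=3
&lt;/code&gt;&lt;/pre&gt;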
&lt;p&gt;In this first deep dive, we peel back the layers to examine the RGW frontend
components, the specialized RADOS pools that house its internal metadata, and
the critical mechanics of bucket indexing and sharding that enable high-performance
object operations.&lt;/p&gt;
&lt;h3 id=&quot;rgw-frontends&quot;&gt;RGW Frontends &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-frontends&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;An incoming client request to an RGW daemon traverses several internal layers,
beginning with the frontend web server that handles the initial HTTP connection.
RGW has historically supported two primary embedded frontends: Civetweb, the legacy
default, and Beast, the modern, high-performance default choice.&lt;/p&gt;
&lt;p&gt;Civetweb operates on a synchronous, thread-per-connection model. In contrast,
Beast is a modern frontend built upon the Boost.Asio C++ library, which facilitates
an asynchronous, event-driven I/O model. Instead of dedicating a thread to each
connection, Beast uses a small pool of worker threads to service thousands of
connections concurrently. This model is significantly more efficient in terms of
CPU and memory utilization, as threads are not blocked waiting for I/O, and the
per-connection memory overhead is drastically reduced. The architectural shift
from Civetweb to Beast was a direct response to the demands of modern
cloud-native applications, which often generate high-concurrency, high-IOPS workloads.&lt;/p&gt;
&lt;h4 id=&quot;frontend-configuration-in-action&quot;&gt;Frontend Configuration in Action &lt;a class=&quot;link-anchor&quot; href=&quot;#frontend-configuration-in-action&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When deploying or modifying RGW services using cephadm, the frontend type and its
settings can be specified directly within the service specification file. Beast is
the default and recommended option for the RGW frontend:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;service_type: rgw
service_id: myrealm.myzone
spec:
  rgw_realm: myrealm
  rgw_zone: myzone
  ssl: true
  rgw_frontend_port: 1234
  rgw_frontend_type: beast
  rgw_frontend_ssl_certificate: ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This YAML snippet illustrates how cephadm deploys an RGW service, specifying the
realm and zone, enabling SSL termination, and explicitly setting the &lt;code&gt;rgw_frontend_type&lt;/code&gt;
to &lt;code&gt;beast&lt;/code&gt; on TCP port 1234.&lt;/p&gt;
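&lt;p&gt;Assuming the specification above is saved to a file (the filename here is illustrative), it can be applied through the cephadm orchestrator:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph orch apply -i rgw-myrealm-myzone.yaml
&lt;/code&gt;&lt;/pre&gt;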
&lt;h3 id=&quot;understanding-rgw-rados-pools&quot;&gt;Understanding RGW RADOS Pools &lt;a class=&quot;link-anchor&quot; href=&quot;#understanding-rgw-rados-pools&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For RGW to operate as a truly stateless component, every piece of critical information
(user data, metadata, and logs) must be stored persistently within the RADOS layer. This
persistence is achieved through a set of specialized, dedicated RADOS pools.&lt;/p&gt;
&lt;p&gt;RGW&#39;s multi-pool architecture is a deliberate design choice that allows operators
to physically separate different classes of data onto different hardware tiers,
enabling a highly optimized balance of performance and cost. For example,
latency-sensitive metadata and logs can be placed on fast replicated pools backed
by SSD media, while capacity-heavy object payloads can reside on
erasure-coded pools supported by slower, more cost-effective HDDs or, increasingly,
QLC-class SSDs. NVMe SSDs are preferable to legacy SAS/SATA SSDs as they offer
future-proofing, better density, and better performance for the money; an NVMe
server can actually cost less than a SATA server.&lt;/p&gt;
&lt;h4 id=&quot;key-rgw-pools-and-their-purposes&quot;&gt;Key RGW Pools and Their Purposes &lt;a class=&quot;link-anchor&quot; href=&quot;#key-rgw-pools-and-their-purposes&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pool Name Suffix&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Typical Data Protection&lt;/th&gt;
&lt;th&gt;Recommended Media&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.root&lt;/td&gt;
&lt;td&gt;Stores global RGW configuration (realms, zonegroups, zones)&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.control&lt;/td&gt;
&lt;td&gt;Internal RGW daemon coordination&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.meta&lt;/td&gt;
&lt;td&gt;User and bucket metadata&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.log&lt;/td&gt;
&lt;td&gt;Operation and replication logs&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.buckets.index&lt;/td&gt;
&lt;td&gt;Bucket object listings (omaps). Critical for performance&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.buckets.data&lt;/td&gt;
&lt;td&gt;Main object data payload&lt;/td&gt;
&lt;td&gt;Erasure Coded&lt;/td&gt;
&lt;td&gt;TLC/QLC SSD, HDD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;.rgw.buckets.non-ec&lt;/td&gt;
&lt;td&gt;Auxiliary pool for operations incompatible with EC&lt;/td&gt;
&lt;td&gt;Replicated&lt;/td&gt;
&lt;td&gt;SSD / HDD&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When the RGW service first tries to operate on a RADOS pool that does not exist, it
will create that pool with the values of the config options &lt;code&gt;osd_pool_default_pg_num&lt;/code&gt;
and &lt;code&gt;osd_pool_default_pgp_num&lt;/code&gt;. These defaults are sufficient for some pools, but others
(especially those listed in placement_pools for the bucket index and data) will require
additional tuning. Note that when the PG autoscaler is enabled it will adjust the placement
group values for these pools automatically, with an increased &lt;code&gt;BIAS&lt;/code&gt; for &lt;code&gt;.index&lt;/code&gt; pools
so that they are allocated more PGs than their stored data alone would suggest.
For the autoscaler to work best with the constellation of RGW pools, we suggest raising the
following values from their defaults:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# ceph config set global mon_target_pg_per_osd 300
# ceph config set global mon_max_pg_per_osd 600
&lt;/code&gt;&lt;/pre&gt;
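&lt;p&gt;As a quick sanity check, the autoscaler&#39;s view of every pool, including the elevated &lt;code&gt;BIAS&lt;/code&gt; applied to index pools and the resulting PG targets, can be inspected with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# ceph osd pool autoscale-status
# ceph osd pool get default.rgw.buckets.index pg_num
&lt;/code&gt;&lt;/pre&gt;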
&lt;p&gt;Pool names specific to an RGW zone follow the naming convention &lt;code&gt;zone-name.pool-name&lt;/code&gt;.
For example, a zone named &lt;code&gt;us-east&lt;/code&gt; will have the following pools:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.rgw.root
us-east.rgw.control
us-east.rgw.meta
us-east.rgw.log
us-east.rgw.buckets.index
us-east.rgw.buckets.data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The structure of these pools is vital for understanding RGW&#39;s operational mechanics.
Many logical pools are consolidated using RADOS namespaces within the main RADOS
pools (e.g., &lt;code&gt;default.rgw.log&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;We can list RADOS namespaces with a command of the following form. Here we can see
how the &lt;code&gt;rgw.meta&lt;/code&gt; pool contains three different RADOS namespaces:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# rados ls -p default.rgw.meta --all | awk &#39;{ print $1 }&#39; | sort -u
root
users.keys
users.uid
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pools with their namespaces are exposed when querying the RGW zone configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin zone get --rgw-zone default
{
    &amp;quot;id&amp;quot;: &amp;quot;d9c4f708-5598-4c44-9d36-849552a08c4d&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;default&amp;quot;,
    &amp;quot;domain_root&amp;quot;: &amp;quot;default.rgw.meta:root&amp;quot;,
    &amp;quot;control_pool&amp;quot;: &amp;quot;default.rgw.control&amp;quot;,
    &amp;quot;gc_pool&amp;quot;: &amp;quot;default.rgw.log:gc&amp;quot;,
    &amp;quot;lc_pool&amp;quot;: &amp;quot;default.rgw.log:lc&amp;quot;,
    &amp;quot;log_pool&amp;quot;: &amp;quot;default.rgw.log&amp;quot;,
    &amp;quot;intent_log_pool&amp;quot;: &amp;quot;default.rgw.log:intent&amp;quot;,
    &amp;quot;usage_log_pool&amp;quot;: &amp;quot;default.rgw.log:usage&amp;quot;,
    &amp;quot;roles_pool&amp;quot;: &amp;quot;default.rgw.meta:roles&amp;quot;,
    &amp;quot;reshard_pool&amp;quot;: &amp;quot;default.rgw.log:reshard&amp;quot;,
    &amp;quot;user_keys_pool&amp;quot;: &amp;quot;default.rgw.meta:users.keys&amp;quot;,
    &amp;quot;user_email_pool&amp;quot;: &amp;quot;default.rgw.meta:users.email&amp;quot;,
    &amp;quot;user_swift_pool&amp;quot;: &amp;quot;default.rgw.meta:users.swift&amp;quot;,
    &amp;quot;user_uid_pool&amp;quot;: &amp;quot;default.rgw.meta:users.uid&amp;quot;,
    &amp;quot;otp_pool&amp;quot;: &amp;quot;default.rgw.otp&amp;quot;,
   ...
    &amp;quot;placement_pools&amp;quot;: [
        {
            &amp;quot;key&amp;quot;: &amp;quot;default-placement&amp;quot;,
            &amp;quot;val&amp;quot;: {
                &amp;quot;index_pool&amp;quot;: &amp;quot;default.rgw.buckets.index&amp;quot;,
                &amp;quot;storage_classes&amp;quot;: {
                    &amp;quot;STANDARD&amp;quot;: {
                        &amp;quot;data_pool&amp;quot;: &amp;quot;default.rgw.buckets.data&amp;quot;
                    }
                },
                &amp;quot;data_extra_pool&amp;quot;: &amp;quot;default.rgw.buckets.non-ec&amp;quot;,
                &amp;quot;index_type&amp;quot;: 0
            }
        }
    ],
    &amp;quot;realm_id&amp;quot;: &amp;quot;&amp;quot;,
    &amp;quot;notif_pool&amp;quot;: &amp;quot;default.rgw.log:notif&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This JSON output details the configuration for the default zone.
Notice how many different logical functions (GC, LC, usage logs)
are mapped to the RADOS pool &lt;code&gt;default.rgw.log&lt;/code&gt; but are separated using RADOS Namespaces (e.g., &lt;code&gt;default.rgw.log:gc&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id=&quot;a-detailed-overview-of-the-bucket-index-and-sharding&quot;&gt;A Detailed Overview of the Bucket Index and Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#a-detailed-overview-of-the-bucket-index-and-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The ability to list the contents of a bucket is fundamental to object storage.
RGW implements this using a dedicated structure called the Bucket Index,
which is responsible for listing bucket content, maintaining a journal
for versioned operations, storing quota metadata, and serving as a log
for multi-zone synchronization.&lt;/p&gt;
&lt;h4 id=&quot;the-bucket-index-and-omaps&quot;&gt;The Bucket Index and OMAPs &lt;a class=&quot;link-anchor&quot; href=&quot;#the-bucket-index-and-omaps&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The bucket index relies on a special feature of RADOS objects called the Object
Map (OMAP). An OMAP is a key-value store associated with a RADOS object, similar
in concept to Extended Attributes on a POSIX file. For each bucket, RGW creates
one or more dedicated index objects in the &lt;code&gt;.rgw.buckets.index&lt;/code&gt; pool. The listing
information for the objects within that bucket is stored within the OMAP of these index objects.&lt;/p&gt;
&lt;p&gt;Crucially, the performance of the bucket index relies entirely on the underlying
key-value database: OMAPs are physically stored within the RocksDB database residing
on the OSD&#39;s DB partition. This mandates that index pools like &lt;code&gt;default.rgw.buckets.index&lt;/code&gt;
must currently use a replicated data protection scheme, as OMAP operations are not
compatible with erasure-coded pools. Investing in fast flash devices (SSDs, ideally NVMe)
for the OSD&#39;s DB partition is paramount for bucket listing performance.  RGW index
pools may select a CRUSH rule that places them on pure SSD OSDs, or on hybrid OSDs
with the DB offloaded to SSDs.  Since omaps are purely in the DB portion of a given OSD,
either strategy suffices.&lt;/p&gt;
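&lt;p&gt;As a hedged sketch of the pure-SSD approach (the rule name is arbitrary and the &lt;code&gt;ssd&lt;/code&gt; device class is assumed to exist on the cluster), a replicated CRUSH rule restricted to flash OSDs can be created and assigned to the index pool:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Create a replicated rule that only selects OSDs carrying the ssd device class
$ ceph osd crush rule create-replicated rgw-index-ssd default host ssd
# Point the bucket index pool at the new rule
$ ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd
&lt;/code&gt;&lt;/pre&gt;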
&lt;h4 id=&quot;tuning-the-index-pool-for-performance&quot;&gt;Tuning the Index Pool for Performance &lt;a class=&quot;link-anchor&quot; href=&quot;#tuning-the-index-pool-for-performance&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;While fast storage for OSD DBs is critical, the distribution of the bucket index across
the cluster is equally essential. This is controlled by the Placement Group (PG) count
of the index pool. Poor PG tuning is a common cause of poor listing performance, especially
in large clusters.&lt;/p&gt;
&lt;h5 id=&quot;placement-group-(pg)-count-and-parallelism&quot;&gt;Placement Group (PG) Count and Parallelism &lt;a class=&quot;link-anchor&quot; href=&quot;#placement-group-(pg)-count-and-parallelism&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Each PG is mapped to a set of OSDs, with one acting as the primary. When RGW performs a bucket
listing, it sends parallel read requests to the OMAPs of many different bucket index shard
objects. A higher PG count for the index pool distributes these shards across a greater
number of primary OSDs. This increases the parallelism of the listing operation, as more
physical devices can concurrently service the I/O requests. A low PG count can create a
bottleneck where many requests are funneled to just a few OSDs, which then become saturated.&lt;/p&gt;
&lt;p&gt;We suggest that each index pool have at least one PG for every OSD on which it is placed.
When using the PG autoscaler, index pools should automatically have a BIAS value of 4 so
that they receive a higher number of PGs. See above for recommendations on central configuration
settings to allow the autoscaler to provision enough PGs to index pools.&lt;/p&gt;
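&lt;p&gt;If the autoscaler is not in use, the index pool&#39;s PG count can instead be raised manually; the value below is purely illustrative and should be sized to provide at least one PG per participating OSD, as suggested above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd pool set default.rgw.buckets.index pg_num 128
&lt;/code&gt;&lt;/pre&gt;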
&lt;h4 id=&quot;visualizing-the-bucket-index-log&quot;&gt;Visualizing the Bucket Index Log &lt;a class=&quot;link-anchor&quot; href=&quot;#visualizing-the-bucket-index-log&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;First, we confirm the existence and Pool ID of the bucket index pool:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd lspools | grep default.rgw.buckets.index
6 default.rgw.buckets.index
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we see that RADOS pool with ID &lt;code&gt;6&lt;/code&gt; is the dedicated index pool for the &lt;code&gt;default&lt;/code&gt; zone.&lt;/p&gt;
&lt;p&gt;Now, let’s get a bucket name to use as an example: &lt;code&gt;bucket1&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket list | grep bucket1
    &amp;quot;bucket1&amp;quot;,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can examine the index entries for a specific bucket from the &lt;code&gt;default&lt;/code&gt; zone, &lt;code&gt;bucket1&lt;/code&gt;, using &lt;code&gt;radosgw-admin&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bi list --bucket bucket1
[
    {
        &amp;quot;type&amp;quot;: &amp;quot;plain&amp;quot;,
        &amp;quot;idx&amp;quot;: &amp;quot;hosts5&amp;quot;,
        &amp;quot;entry&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;hosts5&amp;quot;,
            &amp;quot;instance&amp;quot;: &amp;quot;&amp;quot;,
            &amp;quot;ver&amp;quot;: {
                &amp;quot;pool&amp;quot;: 16,
                &amp;quot;epoch&amp;quot;: 3
            },
            &amp;quot;locator&amp;quot;: &amp;quot;&amp;quot;,
            &amp;quot;exists&amp;quot;: &amp;quot;true&amp;quot;,
            &amp;quot;meta&amp;quot;: {
                &amp;quot;category&amp;quot;: 1,
                &amp;quot;size&amp;quot;: 4066,
                &amp;quot;mtime&amp;quot;: &amp;quot;2022-12-14T16:27:02.562603Z&amp;quot;,
                &amp;quot;etag&amp;quot;: &amp;quot;71ad37de1d442f5ee2597a28fe07461e&amp;quot;,
                &amp;quot;storage_class&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;owner&amp;quot;: &amp;quot;test&amp;quot;,
                &amp;quot;owner_display_name&amp;quot;: &amp;quot;test&amp;quot;,
                &amp;quot;content_type&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;accounted_size&amp;quot;: 4066,
                &amp;quot;user_data&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;appendable&amp;quot;: &amp;quot;false&amp;quot;
            },
            &amp;quot;tag&amp;quot;: &amp;quot;_iDrB7rnO7jqyyQ2po8bwqE0vL_Al6ZH&amp;quot;,
            &amp;quot;flags&amp;quot;: 0,
            &amp;quot;pending_map&amp;quot;: [],
            &amp;quot;versioned_epoch&amp;quot;: 0
        }
    }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin bi list&lt;/code&gt; output displays the stored metadata for
an S3 object (&lt;code&gt;hosts5&lt;/code&gt;), including size, modification time (mtime), and ETag.&lt;/p&gt;
&lt;h4 id=&quot;the-scalability-enabler%3A-bucket-sharding&quot;&gt;The Scalability Enabler: Bucket Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#the-scalability-enabler%3A-bucket-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A significant performance problem arises when a bucket index grows very large.
If a bucket&#39;s index is stored in a single RADOS object, only one
operation can be performed at a time. This serialization limits parallelism and
can become a severe bottleneck for high-throughput write workloads.&lt;/p&gt;
&lt;p&gt;To circumvent this limitation, RGW employs Bucket Index Sharding. This mechanism
divides the bucket index into multiple parts, with each shard stored on a separate
RADOS object within the index pool. When an object is written, the update is
directed to a specific shard determined by a hash of the object&#39;s name. This
allows multiple operations to occur concurrently across different Placement
Groups (PGs) and OSDs, improving overall scalability. The number of shards
should be a prime number, and is configurable with the &lt;code&gt;bucket_index_max_shards&lt;/code&gt;
config option, which defaults to &lt;code&gt;11&lt;/code&gt;. We can retrieve relevant metadata about
a bucket and its objects, such as the shard count, bucket usage, quota,
versioning, object lock, and owner, using the
&lt;code&gt;radosgw-admin bucket stats&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket stats --bucket bucket1 | grep shards
    &amp;quot;num_shards&amp;quot;: 11,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bucket Index pool for the default zone:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd lspools | grep default.rgw.buckets.index
6 default.rgw.buckets.index
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can visually confirm the existence of these shards as discrete OMAP RADOS objects:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index ls
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.9
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.0
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.10
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.1
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.7
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.8
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.6
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.5
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.4
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each &lt;code&gt;.dir&lt;/code&gt; RADOS object listed here is a separate bucket index shard. In this
example, 11 shards are visible, matching the default number of shards per bucket.&lt;/p&gt;
&lt;p&gt;At bucket creation time, the initial number of shards is set
by the &lt;code&gt;bucket_index_max_shards&lt;/code&gt; option at the zonegroup level, and it is used
for all buckets. If a different number of shards is required for a specific bucket,
it is possible to change it.&lt;/p&gt;
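&lt;p&gt;For example (the target shard count below is illustrative; prime values are suggested above), an individual bucket can be resharded manually:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket reshard --bucket bucket1 --num-shards 23
&lt;/code&gt;&lt;/pre&gt;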
&lt;p&gt;Note: we recommend a maximum of 102,400 S3 objects per bucket index shard.&lt;/p&gt;
&lt;p&gt;We can get the marker for a bucket using the stats command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin bucket stats --bucket bucket1 | grep marker
    &amp;quot;marker&amp;quot;: &amp;quot;7fb0a3df-9553-4a76-938d-d23711e67677.34162.1&amp;quot;,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we know that the &lt;em&gt;marker&lt;/em&gt; for &lt;code&gt;bucket1&lt;/code&gt; is &lt;code&gt;7fb0a3df-9553-4a76-938d-d23711e67677.34162.1&lt;/code&gt;.
Let’s upload an object named &lt;code&gt;file1&lt;/code&gt; to &lt;code&gt;bucket1&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://ceph-node02:8080 s3 cp /etc/hosts s3://bucket1/file1 --region default
upload: ../etc/hosts to s3://bucket1/file1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s investigate the bucket index for this bucket at the RADOS level. By
listing the omapkeys on the bucket index object, we can see a key called &lt;code&gt;file1&lt;/code&gt;,
which matches the uploaded object name. Here we run &lt;code&gt;listomapkeys&lt;/code&gt; on one of
the 11 available shard objects, in this case shard 2. As mentioned before, objects
will be spread among the different shards during creation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index listomapkeys .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
file1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we check the values, we can see that the key/value entry in the bucket index shard &lt;code&gt;2&lt;/code&gt;
omap object for &lt;code&gt;bucket1&lt;/code&gt; is 217 bytes in size. In the hex dump we see info including the object name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index listomapvals .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
file1
value (217 bytes) :
00000000  08 03 d3 00 00 00 05 00  00 00 66 69 6c 65 31 01  |..........file1.|
00000010  00 00 00 00 00 00 00 01  07 03 5a 00 00 00 01 32  |..........Z....2|
00000020  05 00 00 00 00 00 00 4b  ab a1 63 95 74 ba 04 20  |.......K..c.t.. |

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we add more S3 objects to our bucket, we see new key/value entries for each
added to the shards available for the bucket. In this example &lt;code&gt;file1&lt;/code&gt;, &lt;code&gt;file2&lt;/code&gt;, &lt;code&gt;file4&lt;/code&gt;, and &lt;code&gt;file10&lt;/code&gt;
landed in shard &lt;code&gt;2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index listomapkeys .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
file1
file2
file4
file10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can confirm the placement of a specific shard, shard &lt;code&gt;2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph osd map default.rgw.buckets.index .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
osdmap e90 pool &#39;default.rgw.buckets.index&#39; (9) object &#39;.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2&#39; -&amp;gt; pg 9.6fa75bc9 (9.9) -&amp;gt; up ([1,2], p5) acting ([1,2], p5)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This output shows that the index shard is replicated across the cluster and lives on specific OSDs. Distributing the index across multiple PGs (and therefore OSDs) enables parallelism.&lt;/p&gt;
&lt;h4 id=&quot;the-zero-byte-mystery%3A-why-the-index-pool-appears-empty&quot;&gt;The Zero-Byte Mystery: Why the Index Pool Appears Empty &lt;a class=&quot;link-anchor&quot; href=&quot;#the-zero-byte-mystery%3A-why-the-index-pool-appears-empty&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When you query the space usage of the bucket index pool, the result often surprises
engineers unfamiliar with Ceph&#39;s OMAP architecture:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados df -p default.rgw.buckets.index
POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS       RD  WR_OPS      WR  USED COMPR  UNDER COMPR
default.rgw.buckets.index   0 B       11       0      33                   0        0         0     208  207 KiB      41  20 KiB         0 B          0 B
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even inspecting a single shard object (shard 2) shows a size of zero:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.index stat .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2 mtime 2022-12-20T07:32:11.000000-0500, size 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Despite containing 11 RADOS objects (shards), the pool reports 0 bytes used.
This is because bucket index listing data is stored entirely as OMAP entries
within the RocksDB database of each OSD, not as payload data in the RADOS
object itself. This confirms why leveraging fast flash media (SSDs) for at least
the OSD DB partition is essential for maximizing bucket index performance.&lt;/p&gt;
&lt;h4 id=&quot;managing-index-growth-with-dynamic-bucket-resharding&quot;&gt;Managing Index Growth with Dynamic Bucket Resharding &lt;a class=&quot;link-anchor&quot; href=&quot;#managing-index-growth-with-dynamic-bucket-resharding&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;As a bucket scales to hundreds of thousands or millions of S3 objects, its
index can become a performance bottleneck. By default, a single shard can
become &amp;quot;hot&amp;quot; as it accumulates too many entries. The threshold for the number of
S3 objects per shard is configurable, defaulting to 100,000. Very large numbers of
S3 objects per bucket reintroduce the serialization problem that sharding was
designed to solve. To combat this, RGW features an advanced, automated mechanism
known as Dynamic Bucket Resharding (DBR).&lt;/p&gt;
&lt;p&gt;DBR is a background process that continuously monitors the number of entries in
each bucket index shard. When a shard grows beyond its configured threshold, DBR
automatically and online triggers a resharding operation. This process creates a
new set of index objects with a greater number of shards and then safely migrates
the existing index entries from the old, smaller layout to the new, larger one.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
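&lt;p&gt;To observe DBR in action, the reshard queue and the per-bucket status can be inspected with standard &lt;code&gt;radosgw-admin&lt;/code&gt; subcommands (the bucket name is the example used throughout this post):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# List buckets currently queued for resharding
$ radosgw-admin reshard list
# Show the resharding status of a specific bucket
$ radosgw-admin reshard status --bucket bucket1
&lt;/code&gt;&lt;/pre&gt;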
&lt;h5 id=&quot;the-evolution-of-online-resharding%3A-minimizing-impact&quot;&gt;The Evolution of Online Resharding: Minimizing Impact &lt;a class=&quot;link-anchor&quot; href=&quot;#the-evolution-of-online-resharding%3A-minimizing-impact&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Historically, the resharding operation required temporarily pausing write I/O to the bucket.
While read operations remained unaffected, this write pause could be noticeable and
painful on very active workloads.&lt;/p&gt;
&lt;p&gt;However, a significant enhancement coming in a forthcoming Tentacle release drastically minimizes
this write freeze. The new implementation makes the resharding process nearly transparent,
allowing writes to proceed with minimal interruption. This improvement is a vital step forward,
making dynamic resharding a seamless, production-safe feature for even the most demanding environments.&lt;/p&gt;
&lt;h5 id=&quot;not-just-growing%2C-but-shrinking%3A-the-power-of-shard-merging&quot;&gt;Not Just Growing, but Shrinking: The Power of Shard Merging &lt;a class=&quot;link-anchor&quot; href=&quot;#not-just-growing%2C-but-shrinking%3A-the-power-of-shard-merging&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Dynamic resharding is not limited to just scaling up. Consider a scenario in which
a bucket that once held millions of objects has a massive number of them deleted.
The bucket now contains many sparsely populated or even empty index shards. This is
inefficient, as listing operations must still check every shard, adding unnecessary overhead.&lt;/p&gt;
&lt;p&gt;To address this, the DBR mechanism was enhanced to support shard merging as well.
As detailed in the Ceph documentation and development
trackers (e.g., &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=2135354&quot;&gt;BZ#2135354&lt;/a&gt;),
if the object count in a bucket drops significantly, DBR can trigger a &amp;quot;downsizing&amp;quot;
resharding operation. It will migrate the entries from many sparse shards into a new,
smaller, and more densely packed set of index objects.&lt;/p&gt;
&lt;p&gt;While DBR is a powerful automated feature, for scenarios where you know a bucket will
be enormous from its inception, a standard best practice remains to pre-shard the
bucket at creation time. By setting an appropriate initial number of shards, you can
avoid the first dynamic resharding event altogether, ensuring optimal performance from
the very first object written.&lt;/p&gt;
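&lt;p&gt;One hedged way to pre-shard (the value and config section below are illustrative and depend on how your RGW daemons are named) is to raise the default shard count that newly created buckets inherit via the &lt;code&gt;rgw_override_bucket_index_max_shards&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config set client.rgw rgw_override_bucket_index_max_shards 127
&lt;/code&gt;&lt;/pre&gt;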
&lt;h5 id=&quot;the-future-is-ordered%3A-a-glimpse-into-in-order-sharding&quot;&gt;The Future is Ordered: A Glimpse into In-Order Sharding &lt;a class=&quot;link-anchor&quot; href=&quot;#the-future-is-ordered%3A-a-glimpse-into-in-order-sharding&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Currently, RGW&#39;s hashed sharding is optimized for write distribution, but it
presents a challenge for listing objects in alphabetical order. To fulfill a
paginated list request, RGW must perform a &amp;quot;scatter-gather&amp;quot; operation, querying
every single shard and sorting the combined results. This can become a bottleneck
for buckets with a very large number of shards.&lt;/p&gt;
&lt;p&gt;To solve this, a significant new feature known as in-order sharding (or ordered
bucket listing) is in development. This upcoming evolution will change the
sharding logic to place objects into shards based on their lexicographical
name rather than a hash.&lt;/p&gt;
&lt;p&gt;The impact of this change will be transformative. Instead of querying all shards,
a request to list objects will be directed to the specific shard(s) that contain
the requested alphabetical range. This will make paginated listing operations
dramatically faster and more efficient, particularly for workloads that rely
heavily on browsing or iterating through object keys.&lt;/p&gt;
&lt;p&gt;By combining the automated scaling of Dynamic Bucket Resharding with the listing
efficiency of in-order sharding, Ceph RGW is on a clear path to providing virtually
limitless and performant scalability within a single bucket, catering to the most
demanding data lake and AI/ML use cases of the future.&lt;/p&gt;
&lt;h3 id=&quot;conclusion%3A-the-engine-of-scalability&quot;&gt;Conclusion: The Engine of Scalability &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-the-engine-of-scalability&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;So far, we have journeyed through the high-performance path of a client request,
from the initial connection at the Beast frontend, through the specialized RADOS
pools, and deep into the intricate mechanics of the bucket index. You now
understand how OMAPs form the backbone of object listings and how Dynamic
Bucket Resharding acts as the engine of scalability, allowing a single bucket
to grow to billions of objects while maintaining performance. We&#39;ve uncovered
the core mechanisms that handle object discovery and listing at massive scale.&lt;/p&gt;
&lt;p&gt;However, our deep dive has so far focused on the index, which holds the pointers
to the data. But what about the data itself? And what about the crucial control
plane metadata that defines the users, accounts, and rules governing the entire system?&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-2&quot;&gt;Part 2&lt;/a&gt; of our series,
we will answer these questions. We&#39;ll shift our focus
to explore the elegant head/tail model of RGW&#39;s data layout, examine the system&#39;s
core metadata, and uncover the robust background processes that manage data
throughout its entire lifecycle.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ceph Object Storage Deep Dive Series. Part 2</title>
    <link href="https://ceph.io/en/news/blog/2025/rgw-deep-dive-2/" />
    <updated>2025-10-14T00:00:00Z</updated>
    <id>https://ceph.io/en/news/blog/2025/rgw-deep-dive-2/</id>
    <author>
      <name>Daniel Alexander Parkes, Anthony D&#39;Atri</name>
    </author>
      <category term="en-blog-post" />
      <category term="en-article" />
      <category term="blog-post" />
      <category term="ceph" />
      <category term="rgw" />
      <category term="s3" />
    <content type="html" xml:base="https://ceph.io/en/news/blog/2025/rgw-deep-dive-2/">&lt;h2 id=&quot;a-deep-dive-into-ceph-rgw%3A-data-path%2C-sharding%2C-and-automated-management&quot;&gt;A Deep Dive into Ceph RGW: Data Path, Sharding, and Automated Management &lt;a class=&quot;link-anchor&quot; href=&quot;#a-deep-dive-into-ceph-rgw%3A-data-path%2C-sharding%2C-and-automated-management&quot;&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;introduction&quot;&gt;Introduction &lt;a class=&quot;link-anchor&quot; href=&quot;#introduction&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In the &lt;a href=&quot;https://ceph.io/en/news/blog/2025/rgw-deep-dive-1&quot;&gt;first part of this deep dive&lt;/a&gt;,
we dissected the high-performance request path within the Ceph RGW. We covered its
stateless frontends, foundational RADOS pools, and the critical bucket index,
revealing how dynamic sharding enables virtually limitless scalability for
object listings within a single bucket.&lt;/p&gt;
&lt;p&gt;We established how RGW efficiently locates and lists objects at scale. Now, we
shift our focus from the index to the objects themselves and the broader system
that manages them. In this second deep dive, we will explore the control plane by
examining the RGW metadata layout. We will then uncover how S3 objects are physically
stored using the head/tail data model and conclude with a look at the critical
background processes, Garbage Collection, and Lifecycle Management, that automate
data governance.&lt;/p&gt;
&lt;h3 id=&quot;rgw-metadata-layout%3A-the-control-plane&#39;s-blueprint&quot;&gt;RGW Metadata Layout: The Control Plane&#39;s Blueprint &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-metadata-layout%3A-the-control-plane&#39;s-blueprint&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Just as the data for a single S3 object is meticulously organized across RADOS,
the entire state of the RGW system, its users, buckets, and policies, is also
durably stored within dedicated RADOS pools. This design is fundamental to the
stateless nature of RGW daemons; all control plane information lives within the
cluster itself, not on the gateways. This metadata is primarily housed in
the &lt;code&gt;.rgw.meta&lt;/code&gt; pool, while operational logs for processes like garbage
collection and lifecycle management reside in the &lt;code&gt;.rgw.log&lt;/code&gt; pool.&lt;/p&gt;
&lt;p&gt;These metadata objects are stored in an internal binary format. For this reason,
it is critical to use the &lt;code&gt;radosgw-admin&lt;/code&gt; command-line tool for administration
and interaction. This utility reliably decodes the binary records into human-readable
JSON and ensures that any modifications are performed safely.&lt;/p&gt;
&lt;p&gt;Note: Never attempt to modify objects in the &lt;code&gt;.rgw.meta&lt;/code&gt; pool directly with the &lt;code&gt;rados&lt;/code&gt; tool.&lt;/p&gt;
&lt;h5 id=&quot;key-metadata-categories&quot;&gt;Key Metadata Categories &lt;a class=&quot;link-anchor&quot; href=&quot;#key-metadata-categories&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The &lt;code&gt;.rgw.meta&lt;/code&gt; pool uses RADOS namespaces to separate different types of
information logically. When you query the metadata, you will encounter several
top-level categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;user&lt;/code&gt;: Stores S3 user records, including access keys, capabilities, usage quotas, and contact information including email.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;bucket&lt;/code&gt;: The high-level named bucket record. This contains essential information including the bucket owner, its placement policy (which zone it belongs to), and various flags.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;bucket.instance&lt;/code&gt;: Represents the concrete, physical instance of a bucket. This record tracks the bucket&#39;s unique ID, shard count for the index, versioning status, and creation timestamps. A single bucket name can have multiple instances over its lifetime, such as when it is deleted and recreated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;roles&lt;/code&gt;: Contains STS (Security Token Service) and IAM role definitions used by the policy evaluation engine to grant temporary credentials.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;group&lt;/code&gt;: Defines user groups, which can be used for administrative operations or policy management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;topic&lt;/code&gt;: Stores configuration for S3 bucket event notifications.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;otp&lt;/code&gt;: Holds one-time password credentials for multi-factor authentication.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;account&lt;/code&gt;: Used for Swift account metadata if the Swift API is enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;inspecting-metadata-with-radosgw-admin&quot;&gt;Inspecting Metadata with radosgw-admin &lt;a class=&quot;link-anchor&quot; href=&quot;#inspecting-metadata-with-radosgw-admin&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin&lt;/code&gt; tool provides a safe and structured way to explore this
control plane data. First, you can list all available metadata categories:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin metadata list
[
    &amp;quot;user&amp;quot;,
    &amp;quot;bucket&amp;quot;,
    &amp;quot;bucket.instance&amp;quot;,
    &amp;quot;roles&amp;quot;,
    ...
]
$ radosgw-admin metadata list account
[
    &amp;quot;RGW42603947660038067&amp;quot;,
    &amp;quot;RGW46950437120753278&amp;quot;,
    &amp;quot;RGW40572530565246530&amp;quot;,
    &amp;quot;RGW66892093834478914&amp;quot;,
    &amp;quot;RGW63384910224424377&amp;quot;,
    &amp;quot;RGW94705908964376531&amp;quot;,
    &amp;quot;RGW25531238860968914&amp;quot;
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, list the specific keys within a category, such as &lt;code&gt;bucket&lt;/code&gt; or &lt;code&gt;bucket.instance&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# List all bucket names
$ radosgw-admin metadata list bucket | grep bucket1
   &amp;quot;bucket1&amp;quot;,

# List all concrete bucket instances
$ radosgw-admin metadata list bucket.instance | grep bucket1
&amp;quot;bucket1:7fb0a3df-9553-4a76-938d-d23711e67677.34162.1&amp;quot;,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, here is an example of retrieving and decoding a specific record using its key.
Piping the output to &lt;code&gt;jq&lt;/code&gt; formats the JSON output for readability:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Get bucket metadata by its name
$ radosgw-admin metadata get bucket:bucket1 | jq .

# Get a user record by their UID
$ radosgw-admin metadata get user:my-user-id | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&#39;s worth mentioning that &lt;code&gt;radosgw-admin&lt;/code&gt; also provides dedicated
subcommands for interacting with this metadata directly, for
example &lt;code&gt;radosgw-admin user&lt;/code&gt;, &lt;code&gt;radosgw-admin account&lt;/code&gt;, and &lt;code&gt;radosgw-admin bucket&lt;/code&gt;.&lt;/p&gt;
&lt;h5 id=&quot;linking-metadata-to-usage&quot;&gt;Linking Metadata to Usage &lt;a class=&quot;link-anchor&quot; href=&quot;#linking-metadata-to-usage&quot;&gt;¶&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;To bridge the gap between abstract metadata and real-world usage, &lt;code&gt;radosgw-admin&lt;/code&gt;
offers commands that aggregate this information:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Get detailed stats for a bucket, including its shard count, object count, and size
$ radosgw-admin bucket stats --bucket &amp;lt;BUCKET_NAME&amp;gt; | jq .

# Get the complete metadata for a single object as RGW sees it
$ radosgw-admin object stat --bucket &amp;lt;BUCKET_NAME&amp;gt; --object &amp;lt;OBJECT_KEY&amp;gt; | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This &lt;code&gt;object stat&lt;/code&gt; command is handy, as it shows you the manifest, placement
information, and all system attributes for a specific S3 object, providing a
complete view from the gateway&#39;s perspective.&lt;/p&gt;
&lt;h3 id=&quot;rgw-data-layout%3A-the-head%2Ftail-object-model&quot;&gt;RGW Data Layout: The Head/Tail Object Model &lt;a class=&quot;link-anchor&quot; href=&quot;#rgw-data-layout%3A-the-head%2Ftail-object-model&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A single logical S3 object often consists of several physical RADOS objects. RGW
employs a flexible head/tail object model that enables optimizations for various
file sizes and complex operations including MultiPart Upload (MPU).&lt;/p&gt;
&lt;p&gt;The primary RADOS object associated with any S3 object is the head object. Its
RADOS object name is typically formed by concatenating the bucket&#39;s internal
marker with the object&#39;s key, separated by an
underscore, for example &lt;code&gt;&amp;lt;bucket_marker&amp;gt;_&amp;lt;object_key&amp;gt;&lt;/code&gt;. The head object serves
two primary purposes. First, it is the authoritative store for all object-level
metadata, including ACLs, HTTP content type, ETag, and any user-defined metadata.
This information is stored efficiently as RADOS extended attributes (xattrs) on
the head object. Second, for small objects (by default, those up to the
configurable &lt;code&gt;rgw_max_chunk_size&lt;/code&gt;), the entire data payload of the S3 object
is stored directly within the data portion of the head object. This is a crucial
performance optimization, as it allows both the data and its associated metadata
to be written to the cluster in a single, atomic RADOS operation, minimizing I/O
amplification and latency for small-file workloads.&lt;/p&gt;
&lt;p&gt;For objects that exceed this inline data size, the head object&#39;s data payload is
used to store a manifest. This manifest is a metadata structure that describes
how the rest of the object&#39;s data is physically laid out across the cluster.
It contains an ordered list of the other RADOS objects, known as tail objects,
that hold the remaining data chunks. Each entry in the manifest specifies the
name of a tail object, its size, and its logical offset within the complete S3 object.&lt;/p&gt;
&lt;p&gt;If the object size exceeds the &lt;code&gt;rgw_max_chunk_size&lt;/code&gt; (default: 4MB), the data
is striped across multiple RADOS objects: a head object (containing only
metadata/manifest) and one or more tail objects (holding the bulk data).&lt;/p&gt;
&lt;p&gt;We can retrieve the default striping size, which governs when data splitting occurs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config get mon rgw_obj_stripe_size
4194304
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This output confirms the default RGW object stripe size is 4,194,304 bytes (4MB).&lt;/p&gt;
&lt;p&gt;The interaction between the client-defined part size and RGW&#39;s internal striping
size (&lt;code&gt;rgw_obj_stripe_size&lt;/code&gt;) can result in the creation of specifically named
tail objects. If a client uploads a part (e.g., 5 MiB) that is larger than the
RGW stripe size (e.g., 4 MiB), RGW will automatically stripe that part across
multiple RADOS objects. For instance, it might create a 4 MiB object named with
a &lt;code&gt;__multipart&lt;/code&gt; prefix if MPU is used, and a 1 MiB object named with
a &lt;code&gt;__shadow&lt;/code&gt; prefix to hold the remainder. These are simply tail objects whose
names follow a specific convention, and both will be referenced correctly in the final manifest.&lt;/p&gt;
&lt;p&gt;Here, we observe the head object for a large file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws --endpoint=http://ceph-node02:8080 s3 cp awscliv2.zip s3://bucket1/bigfile
$ aws --endpoint=http://ceph-node02:8080 s3 ls s3://bucket1/bigfile
2022-12-20 15:10:16   20971520 bigfile
$ rados -p default.rgw.buckets.data ls | grep bigfile$
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_bigfile
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the head object for &lt;code&gt;bigfile&lt;/code&gt;. It contains the object&#39;s xattrs metadata,
including the &lt;code&gt;user.rgw.manifest&lt;/code&gt;, which lists the locations of all tail objects.&lt;/p&gt;
&lt;p&gt;The head object stores its metadata efficiently as extended attributes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.data listxattr 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_bigfile
user.rgw.acl
user.rgw.content_type
user.rgw.etag
user.rgw.idtag
user.rgw.manifest
user.rgw.pg_ver
user.rgw.source_zone
user.rgw.tail_tag
user.rgw.x-amz-content-sha256
user.rgw.x-amz-date
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The listed extended attributes (xattr) confirm that the head object stores critical object
metadata, notably &lt;code&gt;user.rgw.manifest&lt;/code&gt;, which describes how the large object&#39;s data
payload is split into tail objects.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;radosgw-admin object stat&lt;/code&gt; command can show the object’s manifest
striping/parts via RGW metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin object stat --bucket BUCKET --object OBJECT | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tail objects in our example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# rados -p default.rgw.buckets.data ls | grep shadow_bigfile
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_bigfile.2~E_PYNwiBq0la0EuZcCOY30KgmRrf1pV.1_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_bigfile.2~E_PYNwiBq0la0EuZcCOY30KgmRrf1pV.2_1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The tail objects typically hold 4MB chunks of data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_bigfile.2_E_PYNwiBq0la0EuZcCOY30KgmRrf1pV.1_1 mtime 2022-12-20T15:10:16.000000-0500, size 4194304
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;images/img1.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;s3-multipart-upload%3A-an-atomic-commit-operation&quot;&gt;S3 Multipart Upload: An Atomic Commit Operation &lt;a class=&quot;link-anchor&quot; href=&quot;#s3-multipart-upload%3A-an-atomic-commit-operation&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The S3 Multipart Upload (MPU) feature is designed for efficiently uploading
large objects by dividing them into smaller parts that can be uploaded
independently and in parallel. RGW implements this elegantly as a metadata-only
commit operation.&lt;/p&gt;
&lt;p&gt;The workflow involves three key steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multipart Upload Initiation&lt;/em&gt;: A request is sent to get a unique Upload ID.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Parts Upload&lt;/em&gt;: Individual parts are uploaded using both the Upload ID and a unique Part ID. Each part is stored as a distinct, temporary RADOS object. If a part size exceeds the RGW stripe size (default 4MB), it is internally segmented.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Multipart Upload Completion (Atomic Commit)&lt;/em&gt;: When all parts are uploaded, the client sends a completion request. RGW avoids costly data copying. Instead, it creates the final head object and populates its internal manifest with pointers to the temporary RADOS objects that constitute the parts. This results in near-instantaneous completion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design makes the completion of a large object upload nearly instantaneous
from the cluster&#39;s perspective. The head object itself contains no user data
in this case, which is why low-level tools will report its size as 0 bytes;
its payload is the manifest, not the object content.&lt;/p&gt;
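&lt;p&gt;A hedged sketch of the same three steps using the low-level &lt;code&gt;aws s3api&lt;/code&gt; interface (the bucket, key, upload ID, and file names are illustrative; the high-level &lt;code&gt;aws s3 cp&lt;/code&gt; command used below drives this flow automatically):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# 1. Initiate: returns an UploadId
$ aws --endpoint=http://ceph-node02:8080 s3api create-multipart-upload \
      --bucket bucket1 --key bigupload
# 2. Upload each part, quoting the UploadId; each call returns an ETag
$ aws --endpoint=http://ceph-node02:8080 s3api upload-part \
      --bucket bucket1 --key bigupload --part-number 1 \
      --upload-id UPLOADID --body part1.bin
# 3. Complete: RGW writes the manifest into the head object, no data is copied
$ aws --endpoint=http://ceph-node02:8080 s3api complete-multipart-upload \
      --bucket bucket1 --key bigupload --upload-id UPLOADID \
      --multipart-upload file://parts.json   # parts.json lists each PartNumber/ETag pair
&lt;/code&gt;&lt;/pre&gt;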
&lt;h4 id=&quot;mpu-structure-in-rados&quot;&gt;MPU Structure in RADOS &lt;a class=&quot;link-anchor&quot; href=&quot;#mpu-structure-in-rados&quot;&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a file is uploaded in chunks (e.g., 5MB chunks) and the RGW stripe width
is 4 MiB, RGW handles the internal splitting: it takes the first 4 MiB to create a
&amp;quot;multipart&amp;quot; RADOS object and the remaining 1 MiB to create a &amp;quot;shadow&amp;quot; tail RADOS object.&lt;/p&gt;
&lt;p&gt;Let’s check it out with an example. We will set the client chunk size to 5 MiB, and upload a 20 MiB file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ aws configure set default.s3.multipart_chunksize 5MB
$ aws --endpoint=http://ceph-node02:8080 s3 cp text.txt s3://bucket1/5chuncks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We send 5 MiB chunks to RGW, and RGW has a stripe width of 4 MiB, which means
RGW will take the first 4 MiB and create a &amp;quot;multipart&amp;quot; RADOS object and then
a 1 MiB &amp;quot;shadow&amp;quot; RADOS tail object.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.data ls | grep 5chuncks
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.3_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.4_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.4
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.1_1
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.3
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_5chuncks
7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output shows creation of the various components, including the final head
object (...&lt;code&gt;_5chuncks&lt;/code&gt;), as well as multiple multipart and shadow objects
corresponding to the striped parts.&lt;/p&gt;
&lt;p&gt;The size verification of these objects demonstrates the RGW splitting logic: the
multipart RADOS object is 4 MiB, and the shadow (tail) RADOS object is 1 MiB.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Check the size of the main 4MB chunk
$ rados -p default.rgw.buckets.data stat 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2
default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__multipart_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2 mtime 2022-12-21T03:07:49.000000-0500, size 4194304

# Check the size of the remaining 1MB chunk
$ rados -p default.rgw.buckets.data stat 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2_1
default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1__shadow_5chuncks.2_r3yyxqL2hYs5DW32L9UXR3uawF4VEKL.2_1 mtime 2022-12-21T03:07:49.000000-0500, size 1048576
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These parts are not assembled or merged in RADOS: this is their final state.&lt;/p&gt;
&lt;p&gt;Finally, the completed S3 object&#39;s head RADOS object contains only the metadata
manifest, which is why it reports a size of zero bytes at the RADOS level:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ rados -p default.rgw.buckets.data stat 7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_5chuncks
default.rgw.buckets.data/7fb0a3df-9553-4a76-938d-d23711e67677.34162.1_5chuncks mtime 2022-12-21T03:07:49.000000-0500, size 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More information on Multipart Upload can be found at &lt;a href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html&quot;&gt;AWS Multipart Upload&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;the-asynchronous-garbage-collector-(gc)&quot;&gt;The Asynchronous Garbage Collector (GC) &lt;a class=&quot;link-anchor&quot; href=&quot;#the-asynchronous-garbage-collector-(gc)&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When clients delete S3 objects or overwrite them, the underlying RADOS objects
are not immediately removed. The primary function of object deletion is to
update the bucket index (or place a delete marker, if versioning is active).
Once the S3 object is removed from the index, its underlying RADOS objects
are effectively &amp;quot;orphaned.&amp;quot;&lt;/p&gt;
&lt;p&gt;These orphaned RADOS objects are then inserted into the Garbage Collection (GC)
queue. The Garbage Collector is a critical background process in RGW responsible
for asynchronously reclaiming the storage space consumed by these deleted objects.
This design ensures that client &lt;code&gt;DELETE&lt;/code&gt; requests return quickly without waiting
for the slow process of physically purging data blocks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img2.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;For workloads with high object churn (many creations and deletions), the GC process
can lag behind, causing a build-up of reclaimable space. To combat this,
administrators can tune several key parameters to make GC more aggressive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_obj_min_wait&lt;/code&gt;: The minimum time before a deleted object becomes eligible
for collection. Reducing this (default is 2 hours) accelerates space reclamation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_max_concurrent_io&lt;/code&gt;: The number of parallel RADOS delete operations a GC
thread can issue. Increasing this from the default of &lt;code&gt;10&lt;/code&gt; allows GC to process
more objects simultaneously, at the cost of higher background I/O on the cluster.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_processor_period&lt;/code&gt;: The interval between GC processing cycles. A lower
value means the GC thread runs more frequently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_gc_max_trim_chunk&lt;/code&gt;: The number of log entries to process in a single batch.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
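&lt;p&gt;A hedged example of making GC more aggressive using the options above (the values are illustrative starting points, not universal recommendations, and the config section may differ depending on how your RGW daemons are named):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ ceph config set client.rgw rgw_gc_obj_min_wait 600
$ ceph config set client.rgw rgw_gc_max_concurrent_io 20
$ ceph config set client.rgw rgw_gc_processor_period 1800
&lt;/code&gt;&lt;/pre&gt;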
&lt;p&gt;We can use the below commands to list all objects scheduled for removal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin gc list
$ radosgw-admin gc list --include-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, a deleted object must wait 2 hours (&lt;code&gt;rgw_gc_obj_min_wait&lt;/code&gt;) before
GC will reclaim its space. To manually run the GC deletion process, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin gc process --include-all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command can be executed to force the Garbage Collector to process its backlog
manually, ensuring the quick reclamation of space without waiting for the next scheduled run.&lt;/p&gt;
&lt;p&gt;Note: The &lt;code&gt;rgw_gc_max_objs&lt;/code&gt; option should NEVER be modified from its default value
in a running cluster. This value should only be modified (if at all) before deploying RGWs.&lt;/p&gt;
&lt;p&gt;Note also: &lt;code&gt;radosgw-admin&lt;/code&gt; can accept the &lt;code&gt;--bypass-gc&lt;/code&gt; switch to delete underlying
storage immediately, but we strongly recommend &lt;strong&gt;not&lt;/strong&gt; passing this option.&lt;/p&gt;
&lt;p&gt;Deployments with heavy S3 object churn may also find value in deploying a dedicated
cohort of RGW daemons that only process GC, with GC processing disabled in the
client-facing cohort.&lt;/p&gt;
&lt;h3 id=&quot;lifecycle-(lc)-management&quot;&gt;Lifecycle (LC) Management &lt;a class=&quot;link-anchor&quot; href=&quot;#lifecycle-(lc)-management&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Lifecycle (LC) Management engine automates data management based on user-defined
policies applied to buckets. These policies consist of rules that trigger actions
based on an object&#39;s age or other criteria. Common actions include &lt;code&gt;Expiration&lt;/code&gt;,
which deletes an object, and &lt;code&gt;Transition&lt;/code&gt;, which moves an object to a different
storage class. Lifecycle Transition can be defined between arbitrary storage
classes (Tiers) within a cluster or to external S3-compatible endpoints, including
AWS, IBM Cloud, or S3 tape endpoints:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;images/img3.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can refine S3 Lifecycle expiration in RGW with fine-grained filters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Current vs Noncurrent object versions&lt;/li&gt;
&lt;li&gt;Expire delete markers (&lt;code&gt;ExpiredObjectDeleteMarker&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Automatically abort incomplete multipart uploads (&lt;code&gt;AbortIncompleteMultipartUpload&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Cap retained older versions via &lt;code&gt;NewerNoncurrentVersions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Scope rules by object size using &lt;code&gt;ObjectSizeGreaterThan&lt;/code&gt; and &lt;code&gt;ObjectSizeLessThan&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These filters, along with the use of S3 Tags, can be mixed to control cleanup behavior
at scale with incredible granularity.&lt;/p&gt;
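&lt;p&gt;As an illustrative sketch of combining these filters, the following two-rule policy is
applied with the AWS CLI pointed at an RGW endpoint. The endpoint URL, bucket name, tag,
and thresholds are placeholders chosen for the example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ cat lifecycle.json
{
  &amp;quot;Rules&amp;quot;: [
    {
      &amp;quot;ID&amp;quot;: &amp;quot;expire-large-processed-objects&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: {
        &amp;quot;And&amp;quot;: {
          &amp;quot;ObjectSizeGreaterThan&amp;quot;: 1048576,
          &amp;quot;Tags&amp;quot;: [ { &amp;quot;Key&amp;quot;: &amp;quot;processed&amp;quot;, &amp;quot;Value&amp;quot;: &amp;quot;true&amp;quot; } ]
        }
      },
      &amp;quot;Expiration&amp;quot;: { &amp;quot;Days&amp;quot;: 30 }
    },
    {
      &amp;quot;ID&amp;quot;: &amp;quot;housekeeping&amp;quot;,
      &amp;quot;Status&amp;quot;: &amp;quot;Enabled&amp;quot;,
      &amp;quot;Filter&amp;quot;: { &amp;quot;Prefix&amp;quot;: &amp;quot;&amp;quot; },
      &amp;quot;NoncurrentVersionExpiration&amp;quot;: { &amp;quot;NoncurrentDays&amp;quot;: 7, &amp;quot;NewerNoncurrentVersions&amp;quot;: 3 },
      &amp;quot;AbortIncompleteMultipartUpload&amp;quot;: { &amp;quot;DaysAfterInitiation&amp;quot;: 2 },
      &amp;quot;Expiration&amp;quot;: { &amp;quot;ExpiredObjectDeleteMarker&amp;quot;: true }
    }
  ]
}
$ aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
      --bucket mybucket --lifecycle-configuration file://lifecycle.json
&lt;/code&gt;&lt;/pre&gt;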
&lt;p&gt;The LC engine is implemented as a set of multi-threaded worker processes. These workers
periodically scan the bucket indexes across the cluster. For each object they encounter,
they evaluate its properties against the bucket&#39;s lifecycle policy. If a rule&#39;s
conditions are met, the corresponding action is executed. An &lt;code&gt;Expiration&lt;/code&gt;
action effectively triggers a standard delete, removing the object&#39;s index
entry and enqueuing its data for GC. A &lt;code&gt;Transition&lt;/code&gt; action involves copying the
object&#39;s data to the target storage pool (which could be on a different media tier
or even a remote cloud tier), and then updating the object&#39;s metadata to reflect its
new location. To scale across large clusters, the LC engine&#39;s parallelism is tunable
(see the sketch after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_lc_max_worker&lt;/code&gt;: This controls the number of main worker threads, which
process multiple bucket index shards in parallel. This should be increased for
clusters with a vast number of buckets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rgw_lc_max_wp_worker&lt;/code&gt;: This defines the number of sub-threads within each
worker&#39;s pool, which process objects within a single shard in parallel.
This should be increased for clusters with a few buckets that each contain a very large
number of S3 objects.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
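&lt;p&gt;As a rough sketch (values are illustrative, not recommendations), both settings can be
raised with &lt;code&gt;ceph config set&lt;/code&gt;, and a lifecycle pass can be started by hand with
&lt;code&gt;radosgw-admin lc process&lt;/code&gt; to observe the effect:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Example only: widen LC parallelism.
$ ceph config set client.rgw rgw_lc_max_worker 5      # main worker threads across index shards
$ ceph config set client.rgw rgw_lc_max_wp_worker 5   # sub-threads within each worker pool

# Start a lifecycle pass now instead of waiting for the next scheduled window
$ radosgw-admin lc process
&lt;/code&gt;&lt;/pre&gt;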
&lt;p&gt;&lt;img src=&quot;images/img4.jpg&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here is a &lt;code&gt;radosgw-admin&lt;/code&gt; command listing the configured LC jobs in the cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin lc list | jq .
[
  {
    &amp;quot;bucket&amp;quot;: &amp;quot;:ingest:fcabdf4a-86f2-452f-a13f-e0902685c655.47553.1&amp;quot;,
    &amp;quot;shard&amp;quot;: &amp;quot;lc.0&amp;quot;,
    &amp;quot;started&amp;quot;: &amp;quot;Sat, 11 Oct 2025 11:20:59 GMT&amp;quot;,
    &amp;quot;status&amp;quot;: &amp;quot;COMPLETE&amp;quot;
  },
  {
    &amp;quot;bucket&amp;quot;: &amp;quot;:tierbucket:fcabdf4a-86f2-452f-a13f-e0902685c655.323278.10&amp;quot;,
    &amp;quot;shard&amp;quot;: &amp;quot;lc.3&amp;quot;,
    &amp;quot;started&amp;quot;: &amp;quot;Sat, 11 Oct 2025 11:20:56 GMT&amp;quot;,
    &amp;quot;status&amp;quot;: &amp;quot;COMPLETE&amp;quot;
  },
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can get the information for a specific bucket using a command of the following
form. This rule uses the object tag key/value pair &lt;code&gt;processed&lt;/code&gt;:&lt;code&gt;true&lt;/code&gt; as a
filter to expire matching objects older than one day.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ radosgw-admin lc get --bucket ingest
{
    &amp;quot;prefix_map&amp;quot;: {
        &amp;quot;&amp;quot;: {
            &amp;quot;status&amp;quot;: true,
            &amp;quot;dm_expiration&amp;quot;: false,
            &amp;quot;expiration&amp;quot;: 1,
            &amp;quot;noncur_expiration&amp;quot;: 0,
            &amp;quot;mp_expiration&amp;quot;: 0,
            &amp;quot;obj_tags&amp;quot;: {
                &amp;quot;tagset&amp;quot;: {
                    &amp;quot;processed&amp;quot;: &amp;quot;true&amp;quot;
                }
            },
            &amp;quot;transitions&amp;quot;: {},
            &amp;quot;noncur_transitions&amp;quot;: {}
        }
    },
    &amp;quot;rule_map&amp;quot;: [
        {
            &amp;quot;id&amp;quot;: &amp;quot;Delete objects that are older than 24 hours&amp;quot;,
            &amp;quot;rule&amp;quot;: {
                &amp;quot;id&amp;quot;: &amp;quot;Delete objects that are older than 24 hours&amp;quot;,
                &amp;quot;prefix&amp;quot;: &amp;quot;&amp;quot;,
                &amp;quot;status&amp;quot;: &amp;quot;Enabled&amp;quot;,
                &amp;quot;expiration&amp;quot;: {
                    &amp;quot;days&amp;quot;: &amp;quot;1&amp;quot;,
                    &amp;quot;date&amp;quot;: &amp;quot;&amp;quot;
                },
                &amp;quot;noncur_expiration&amp;quot;: {
                    &amp;quot;days&amp;quot;: &amp;quot;&amp;quot;,
                    &amp;quot;date&amp;quot;: &amp;quot;&amp;quot;
                },
                &amp;quot;mp_expiration&amp;quot;: {
                    &amp;quot;days&amp;quot;: &amp;quot;&amp;quot;,
                    &amp;quot;date&amp;quot;: &amp;quot;&amp;quot;
                },
                &amp;quot;filter&amp;quot;: {
                    &amp;quot;prefix&amp;quot;: &amp;quot;&amp;quot;,
                    &amp;quot;obj_tags&amp;quot;: {
                        &amp;quot;tagset&amp;quot;: {
                            &amp;quot;processed&amp;quot;: &amp;quot;true&amp;quot;
                        }
                    }
                },
                &amp;quot;transitions&amp;quot;: {},
                &amp;quot;noncur_transitions&amp;quot;: {},
                &amp;quot;dm_expiration&amp;quot;: false
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;conclusion%3A-the-engine-room-revealed&quot;&gt;Conclusion: The Engine Room Revealed &lt;a class=&quot;link-anchor&quot; href=&quot;#conclusion%3A-the-engine-room-revealed&quot;&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Across this two-part deep dive, we&#39;ve journeyed through the core architectural
pillars of Ceph RGW. From the high-performance frontends and the intricate
mechanics of bucket index sharding to the elegant head/tail data layout and
the automated background processes, you now have a comprehensive, end-to-end
understanding of how RGW achieves its remarkable scalability and flexibility.&lt;/p&gt;
&lt;p&gt;Understanding the engine&#39;s anatomy is just the first step. To truly master Ceph
RGW, we must learn how to tune, secure, and operate it in complex, real-world environments.&lt;/p&gt;
&lt;p&gt;This architectural exploration is the foundation for our ongoing series on Ceph RGW mastery.&lt;/p&gt;
&lt;p&gt;The authors would like to thank IBM for supporting the community with our time to create these posts.&lt;/p&gt;
</content>
  </entry>
</feed>
