What’s New in the Land of OSD?

sajust

It’s been a few months since the last named release, Argonaut, and we’ve been busy! Well, in retrospect, most of the time was spent on finding a cephalopod name that starts with “b”, but once we got that done, we still had a few weeks left to devote to technical improvements. In particular, the OSD has seen some new and interesting developments.

OSD Internals Overview

Let’s start with some background for those not familiar with Ceph internals. Objects in a Ceph Object Store are placed into pools, each of which comprises some number of placement groups (PGs). An object “foo” in pool “bar” would be mapped onto a set of OSDs as follows:

The first mapping hashes “foo” to 0x3F4AE323 and maps “bar” to its pool id, 3. The next mapping takes 0x3F4AE323 mod 256 (the number of PGs in pool “bar”) to get PG 3.23 (pg 0x23 of pool 3). This pgid is then mapped onto the OSDs [24, 3, 12] via CRUSH. OSD 24 is the primary; 3 and 12 are the replicas.

PGs serve several critical roles in the ceph-osd design. First, they are the unit of placement. If we calculated placement directly on a per-object basis, changes in the cluster might require us to recalculate the location of each and every object! This way, we only need to re-run CRUSH on a per-PG basis when the cluster changes. Second, writes on objects are sequenced on a per-PG basis. Each PG contains an ordered log of all operations on objects in that PG. Finally, recovery is done on a per-PG basis. By comparing their PG logs, two OSDs can agree on which objects need to be recovered to which OSD.
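
For the curious, the name-to-OSD mapping above looks roughly like this in Python. Everything here is invented for illustration: the pool table, the OSD ids, the crc32 stand-in for Ceph’s object hash, and especially the crush() placeholder, which just derives stable ids so the example runs end to end.

```python
import zlib

# Hypothetical pool table: pool name -> (pool id, pg count).
POOLS = {"bar": (3, 256)}

def object_to_pg(pool_name, object_name):
    """Map (pool, object) to a placement group id, mirroring the text above."""
    pool_id, pg_num = POOLS[pool_name]
    # Stand-in for Ceph's object-name hash (Ceph does not use crc32).
    obj_hash = zlib.crc32(object_name.encode())
    return (pool_id, obj_hash % pg_num)

def crush(pgid, replica_count=3):
    """Placeholder for CRUSH.

    Real CRUSH walks the cluster description in the OSDMap and guarantees
    distinct OSDs; this toy version just derives deterministic ids.
    """
    pool_id, pg = pgid
    seed = (pool_id << 16) | pg
    return [(seed * (i + 7)) % 40 for i in range(replica_count)]

pgid = object_to_pg("bar", "foo")
acting = crush(pgid)
print("pg %d.%x -> osds %s (primary: %d)" % (pgid[0], pgid[1], acting, acting[0]))
```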

Scrub

With that out of the way, let’s move on to some work on keeping your cluster’s data honest. It turns out that data redundancy isn’t particularly useful if you fail to notice a corrupted object until you finally go to read it, possibly months after the last good copy has become unreadable. To deal with this, Ceph has long included a “scrub” feature which, during periods of low IO, chooses PGs in sequence and compares their contents across replicas. Alas, our implementation suffered from two shortcomings. The first is that we compared the set of objects contained in each PG across replicas, as well as object metadata, but not the object contents. In the upcoming Bobtail release, we hash the object contents as we scan and compare the hashes across replicas to detect corrupt copies.
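
As a rough illustration of the idea (not Ceph’s actual on-disk format, hashing, or scrub messages), here is a sketch that hashes every object’s contents on each replica and flags objects whose digests disagree:

```python
import hashlib

def scan_replica(objects):
    """Hash the contents of every object held by one replica.

    `objects` is a hypothetical dict of object name -> bytes; the real OSD
    reads these from its object store and compares metadata as well.
    """
    return {name: hashlib.sha1(data).hexdigest() for name, data in objects.items()}

def compare_scrub_maps(primary_map, replica_map):
    """Report objects that are missing or whose content hashes differ."""
    inconsistent = [name for name, digest in primary_map.items()
                    if replica_map.get(name) != digest]
    inconsistent += [name for name in replica_map if name not in primary_map]
    return inconsistent

primary = scan_replica({"foo": b"hello", "baz": b"world"})
replica = scan_replica({"foo": b"hello", "baz": b"w0rld"})   # bit rot on "baz"
print(compare_scrub_maps(primary, replica))                   # -> ['baz']
```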

The second shortcoming is that, in the name of simplicity, we essentially scrubbed an entire PG at once. The tricky part of performing a scrub efficiently is that comparing the contents of the primary and the replica is only useful if both scans are performed at the same version! Scrubbing while writes are in flight might result in a scrub scanning the replica at version 200 and the primary at 197 because the replica happens to be a bit ahead of the primary. A simple way to ensure that the versions match is to stop writes on the entire PG and wait for them to flush before scanning the primary and replica stores. In fact, the Argonaut approach is a bit more sophisticated: we scan the primary and replica collections without stopping writes, and then stop writes to rescan any objects which changed in the meantime. However, for a large PG, that last step could take a long time, so a better approach is needed. Enter ChunkyScrub! In Bobtail, we scrub a PG in chunks, pausing writes only on the set of objects we are currently scrubbing. This way, no object has its writes blocked for long.
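
Here is a rough sketch of the chunked approach; all of the function names are hypothetical stand-ins, not the OSD’s actual interfaces:

```python
def chunky_scrub(pg_objects, chunk_size, pause_writes, resume_writes, scan_range):
    """Scrub a PG one chunk of objects at a time.

    `pg_objects` is the sorted list of object names in the PG; `pause_writes`
    and `resume_writes` block and unblock writes for just one range, and
    `scan_range` compares that range across the primary and replicas.
    """
    results = []
    for start in range(0, len(pg_objects), chunk_size):
        chunk = pg_objects[start:start + chunk_size]
        pause_writes(chunk)          # only this chunk is write-blocked...
        try:
            results.append(scan_range(chunk))
        finally:
            resume_writes(chunk)     # ...and only until its scan completes
    return results

# Toy usage: no-op pause/resume, and a scan that just returns what it saw.
out = chunky_scrub([f"obj{i}" for i in range(10)], 4,
                   lambda c: None, lambda c: None, lambda c: list(c))
print(out)   # three chunks: 4, 4, and 2 objects
```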

OSD Internals Refactor

The ceph-osd daemon internals received a bit of a rework as well. As mentioned above, PGs act as the unit of sequencing for object operations. This is reflected in the code: each PG the OSD is responsible for maps onto a PG object. The OSD object’s primary responsibility is to shuffle messages from clients and other OSDs over to the appropriate PG object. A happy consequence of this is that operations on different objects on the same OSD can be processed independently (and in parallel!) as long as the objects are in different PGs. There is, however, one annoying detail which tends to prevent us from fully exploiting this opportunity for parallelism: that pesky OSDMap.
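
To make that structure concrete, here is a toy sketch of the message shuffling in Python. The class names and the one-thread-per-PG layout are inventions for illustration (the real OSD uses shared thread pools), but the point is the same: operations for different PGs can proceed in parallel.

```python
import queue, threading

class ToyPG:
    """Stand-in for the PG object: applies operations for one placement group."""
    def __init__(self, pgid):
        self.pgid = pgid
        self.ops = queue.Queue()

    def worker(self):
        while True:
            op = self.ops.get()
            if op is None:
                break
            # Real PG code would sequence the op in the PG log and apply it.
            print(f"pg {self.pgid}: applied {op}")

class ToyOSD:
    """Routes each message to the right PG; unrelated PGs proceed in parallel."""
    def __init__(self, pgids):
        self.pgs = {pgid: ToyPG(pgid) for pgid in pgids}
        self.threads = [threading.Thread(target=pg.worker) for pg in self.pgs.values()]
        for t in self.threads:
            t.start()

    def handle_message(self, pgid, op):
        self.pgs[pgid].ops.put(op)

    def shutdown(self):
        for pg in self.pgs.values():
            pg.ops.put(None)
        for t in self.threads:
            t.join()

osd = ToyOSD(["3.23", "3.42"])
osd.handle_message("3.23", "write foo")
osd.handle_message("3.42", "write baz")   # handled concurrently with the op above
osd.shutdown()
```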

The OSDMap is what makes the “CRUSH MAGIC” step in the mapping above work. CRUSH really takes two inputs: a pgid and a description of the cluster. Together these determine the OSDs on which the PG will reside. That description is encoded in the OSDMap. Changes to the cluster, such as the death of an OSD, are encoded into a new OSDMap by the ceph-mon cluster and sent out to the OSDs. The maps are given sequential epoch numbers. Essentially every decision within an OSD depends on the contents of the OSDMap. Complicating the situation even further, OSDMap updates don’t reach all OSDs at the same time. The ceph-mon cluster sends out new maps to a few OSDs as they are created, and the OSDs then gossip each new map around as they discover peers with old maps. Every OSD-OSD (including regular heartbeats) or OSD-client message includes the sender’s current OSDMap epoch, allowing the receiver to respond with whatever maps the sender is missing. So, how do we handle an OSDMap update arriving while other threads are busy with client requests for various PGs?
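
A sketch of the epoch-sharing idea, with invented message fields; the real protocol is more economical (it ships incremental maps, for one thing), but the shape is the same:

```python
# Hypothetical in-memory store of OSDMaps this daemon has seen, keyed by epoch.
known_maps = {1: "map@1", 2: "map@2", 3: "map@3"}
my_epoch = 3

def handle_incoming(msg):
    """Share any maps the peer is missing, based on the epoch stamped on its message."""
    peer_epoch = msg["osdmap_epoch"]
    if peer_epoch < my_epoch:
        # Send every map the peer hasn't seen yet so it can catch up.
        return {"missing_maps": [known_maps[e] for e in range(peer_epoch + 1, my_epoch + 1)]}
    return {"missing_maps": []}

print(handle_incoming({"osdmap_epoch": 1, "op": "ping"}))
# -> {'missing_maps': ['map@2', 'map@3']}
```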

Originally, the OSD halted the threads responsible for handling PG requests (including client IO) while updating the global map. This was a useful simplification, since each PG might need to update local state due to the map change and it would be complicated to coordinate that update with in-progress operations. However, it was also a somewhat expensive simplification, since halting all IO during the map switchover tends to be costly. Bobtail includes a rework of how the OSD processes PG messages. First, the PG internal code has been reworked to rely as little as possible on global OSD state. Second, each PG has its own notion of the “current” OSDMap epoch, distinct from that of other PGs and from the OSD as a whole. Each PG’s internal map state is updated to the current OSDMap epoch before it processes a message. The OSD can therefore update its OSDMap-related state without bothering the PG threads and then atomically publish the new map epoch for PG thread consumption once it’s ready. The end result of all of this is that the OSD should handle map changes much more efficiently. This might not seem like much, but map changes tend to come in rapid succession when the cluster is under heavy load due to OSD failures, which is exactly when you don’t want extraneous overhead!
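
Here is a minimal sketch of the per-PG epoch idea, with invented names; the real peering and map-advance logic is far more involved, but the shape of the handoff is the interesting part:

```python
import threading

class ToyOSDService:
    """Holds the OSD-wide published map epoch; updated without stopping PG work."""
    def __init__(self):
        self._lock = threading.Lock()
        self._published_epoch = 1

    def publish(self, epoch):
        with self._lock:
            self._published_epoch = epoch   # one atomic switch, no PG involvement

    def published_epoch(self):
        with self._lock:
            return self._published_epoch

class ToyPG:
    def __init__(self, pgid, service):
        self.pgid = pgid
        self.service = service
        self.epoch = service.published_epoch()

    def advance_map(self, new_epoch):
        # Real code walks each intervening map and updates peering state.
        print(f"pg {self.pgid}: advancing {self.epoch} -> {new_epoch}")
        self.epoch = new_epoch

    def handle_op(self, op):
        target = self.service.published_epoch()
        if self.epoch < target:
            self.advance_map(target)        # catch up lazily, per PG
        print(f"pg {self.pgid} @ epoch {self.epoch}: {op}")

svc = ToyOSDService()
pg = ToyPG("3.23", svc)
svc.publish(5)             # a map change lands; no PG threads are paused
pg.handle_op("write foo")  # the PG catches up only when it next does work
```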

Filestore Performance

Another area that got a fresh coat of paint is the synchronization design of the backend IO system. The ceph-osd daemon uses standard file systems such as xfs or btrfs as its backing store. However, as you might imagine, it’s much simpler to work in terms of transactions on an abstract data store than to work directly on top of a file system (particularly considering the differences between xfs, btrfs, and ext4). Thus, the ceph-osd daemon talks to the file system via the FileStore, which presents a uniform transactional interface in terms of objects and flat collections on top of the user’s underlying file system.
The journal is crucial to providing these transactional guarantees. In xfs (btrfs is somewhat different), the FileStore writes out each transaction to the journal prior to applying it to the file system. Each write must pass through:

  • OSD op thread (responsible for handling client requests)
  • FileStore journal thread (responsible for appending writes to the journal)
  • FileStore work queue (responsible for applying writes to the backing file system)
  • Messenger (responsible for managing inter-node communication), which sends the reply back to the client.
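
As a rough sketch of that write path, here is the journal-then-apply-then-reply flow collapsed into one object. All names are invented, and unlike the real FileStore (which acknowledges once the journal entry is durable and applies to the file system asynchronously via the work queue above), everything here runs inline for brevity:

```python
import json

class ToyFileStore:
    """Write-ahead flow only: journal first, then apply, then ack."""
    def __init__(self, journal_path):
        self.journal = open(journal_path, "a")
        self.store = {}                      # stand-in for the backing file system

    def queue_transaction(self, txn, on_commit):
        # 1. Journal: make the transaction durable before touching the store
        #    (the real journal uses direct IO and proper syncing, not flush()).
        self.journal.write(json.dumps(txn) + "\n")
        self.journal.flush()
        # 2. Apply: replay the operations against the backing store.
        for op, obj, data in txn:
            if op == "write":
                self.store[obj] = data
            elif op == "remove":
                self.store.pop(obj, None)
        # 3. Reply: the messenger would send this back to the client.
        on_commit()

fs = ToyFileStore("/tmp/toy_journal")
fs.queue_transaction([("write", "foo", "hello")], lambda: print("ack to client"))
```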

It’s crucial to maximize throughput and minimize latency in this pipeline if we want to avoid torpedoing performance. To approach this problem, we took a shiny new server with 192GB of memory and started running benchmarks against the FileStore module in isolation, mounted on a ramdisk. We were able to shove small writes through at a rate of around 6k IOPS. That’s pretty good if we plan on running on a ~150 IOPS spinning disk. It is considerably less good if we plan on running on 20k+ IOPS SSDs. So, we went to work. Instrumenting our Mutex object to add up the time spent waiting on each lock yielded several promising “problem locks”, each of which, for reasons of simplicity, protected several unrelated structures. Restructuring the code for finer-grained synchronization around these structures, along with disabling in-memory logging, bumped us up to around 22k IOPS. For the next release, we’ll move on to attacking latency and throughput bottlenecks in the upper layers of the OSD daemon.
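
The lock instrumentation is easy to picture. Here is a sketch of the idea using Python’s threading.Lock rather than our Mutex: wrap the lock and accumulate how long callers spend blocked acquiring it, so the worst offenders stand out after a benchmark run.

```python
import collections
import threading
import time

wait_totals = collections.defaultdict(float)   # lock name -> seconds spent waiting

class InstrumentedLock:
    """Lock wrapper that accumulates time spent blocked in acquire()."""
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def __enter__(self):
        start = time.monotonic()
        self._lock.acquire()
        wait_totals[self.name] += time.monotonic() - start
        return self

    def __exit__(self, *exc):
        self._lock.release()

# Usage: swap a hot lock for an instrumented one and inspect the totals later.
journal_lock = InstrumentedLock("journal_queue")
with journal_lock:
    pass
print(sorted(wait_totals.items(), key=lambda kv: -kv[1]))
```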

Recovery QoS

One of Ceph’s nicer properties is self-healing. The death of OSD 10 eventually triggers a new OSDMap with OSD 10 marked down and out, which in turn triggers any PGs which had lived on OSD 10 to rebalance onto a new set of OSDs. Of course, there is no escaping the fact that recovering those PGs must involve copying their objects from the surviving replicas to the new OSDs. With Argonaut’s default settings, this looks like a long series of 1MB transfers from surviving replicas to new replicas. So, how might these large transfers interact with, say, the flurry of latency-sensitive 4k writes generated by the VMs running on RBD in your cluster?

Argonaut already has some facilities you may be familiar with for limiting the impact of recovery on client workloads. Most prominently, “osd recovery max active” limits the number of concurrent recovery operations any single OSD will start. Regrettably, it only limits the number started at any single OSD. It does not, for example, prevent 20 OSDs from simultaneously pushing “osd recovery max active” objects each to a single OSD you have just added to your cluster! That’s where Bobtail’s new “osd max backfills” option comes in: it limits how many PGs are allowed to recover to or from a single OSD at any one time.
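
Both knobs live in the [osd] section of ceph.conf. The values below are purely illustrative, not recommended defaults; check the documentation for what your release actually ships with.

```ini
[osd]
    ; Illustrative values only, not recommendations.
    osd recovery max active = 5    ; recovery ops a single OSD will start at once
    osd max backfills = 2          ; PGs allowed to backfill to/from one OSD at once
```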

Argonaut also includes a simple mechanism for prioritizing Messages for processing at the OSD. Each message is tagged with a numerical priority, and messages are processed in order first by priority and then by time of arrival, so all messages of priority 128 will be processed before any of priority 63. This is useful for some pieces of the OSD. For example, replies from replicas to the primary indicating that the replica has persisted a client op are given a high priority to reduce client op latency, since they are quick to process. However, if we give client messages a higher priority than recovery messages using this mechanism, any significant amount of client IO will tend to starve recovery. You really don’t want that, since the longer a PG goes without re-replicating its data, the more likely a second or third OSD death takes it out completely!
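
The strict scheme amounts to ordering by (priority, arrival time). A toy version (not the OSD’s actual queue implementation, and with made-up message names) makes the starvation risk easy to see: as long as high-priority messages keep arriving, nothing else is ever dequeued.

```python
import heapq
import itertools

arrival = itertools.count()
pending = []

def enqueue(priority, msg):
    # Higher priority first, then earlier arrival; heapq is a min-heap, so negate.
    heapq.heappush(pending, (-priority, next(arrival), msg))

def dequeue():
    return heapq.heappop(pending)[2]

enqueue(63, "recovery push")
enqueue(128, "replica ack")
enqueue(63, "recovery push 2")
print([dequeue() for _ in range(3)])
# -> ['replica ack', 'recovery push', 'recovery push 2']
```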

Bobtail introduces a more flexible Message prioritization scheme. Messages can be sent such that a message with priority 40 will be allowed through at twice the rate of messages with priority 20, but won’t starve them. This has been leveraged to allow recovery messages to be sent with a lower priority than client IO messages without starving recovery. As a nice bonus, consider trying to read an object which has not yet been recovered to the primary. The primary must complete the recovery operation on that object before serving the read. Now, we can give that recovery operation client-IO priority so it can bypass any lower-priority recovery operations queued at other OSDs! Future work in this area will focus on reducing the burden that coordinating a PG’s recovery operations places on the PG’s primary OSD, since that burden still has an impact on client IO arriving at that OSD.
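
One way to picture the weighted behavior is a queue that, on each pass, drains every priority class in proportion to its priority, so higher classes get more bandwidth without shutting lower ones out. This is only a simplified sketch of the idea, not Bobtail’s actual implementation, and it assumes unit-cost messages.

```python
from collections import deque

class WeightedPrioQueue:
    """Each round, a class of priority p may dequeue up to p messages, so
    priority 40 gets roughly twice the throughput of priority 20 while
    priority 20 still makes progress every round."""
    def __init__(self):
        self.classes = {}                      # priority -> deque of messages

    def enqueue(self, priority, msg):
        self.classes.setdefault(priority, deque()).append(msg)

    def dequeue_round(self):
        out = []
        for priority in sorted(self.classes, reverse=True):
            q = self.classes[priority]
            budget = priority                  # proportional share per round
            while q and budget > 0:
                out.append(q.popleft())
                budget -= 1                    # assume unit-cost messages
        return out

q = WeightedPrioQueue()
for i in range(100):
    q.enqueue(40, f"client io {i}")
    q.enqueue(20, f"recovery {i}")
print(len([m for m in q.dequeue_round() if m.startswith("client")]))    # 40
print(len([m for m in q.dequeue_round() if m.startswith("recovery")]))  # 20
```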

So, those are some of the new developments in the OSD. And that’s just the OSD! RBD is getting layering and write-back cache! CephFS is getting substantial stability and performance enhancements! And if we stay on track, our next release will have even more exclamation points! That is, it will if we have time left after we come up with a cephalopod name that starts with “c”…