We would like to thank BBVA, Cisco, and Intel for providing the cutting-edge hardware used to run this Red Hat Ceph Storage 3.2 all-flash performance POC. The tests and results presented in this blog series are a joint effort of the partnership formed by BBVA, Intel, Cisco, and Red Hat. All partners worked together to produce a performance baseline for a software-defined storage building block based on Red Hat Ceph Storage running on Cisco and Intel hardware.
All-flash storage systems offer several benefits to businesses: high performance in terms of both higher throughput and lower latency, and lower TCO thanks to reduced power and cooling requirements compared with traditional hard-drive-based systems. The global digital transformation requires modernization of enterprise IT infrastructure to improve the performance, responsiveness, and resiliency of almost all critical business applications. With the proliferation of complex application stacks, databases, and a new breed of workloads, storage has become the nervous system of today’s businesses. The way to reach storage nirvana is software-defined, scale-out storage technology built for exascale.
Red Hat Ceph Storage (RHCS) is an open source, massively scalable, software-defined storage platform with the resilience, durability, and reliability needed for enterprise storage workloads. With advances in NAND technology, flash media is becoming more affordable, and RHCS takes center stage by delivering millions of IOPS at low latency to serve customers’ most demanding, storage-intensive workloads.
High-performance and latency-sensitive workloads often consume storage via the block device interface. Ceph delivers block storage to clients through RBD, using librbd, a thin library that sits on top of RADOS (Figure 1) and takes advantage of everything RADOS has to offer. RBD block devices are highly distributed by nature: a block device image is striped over multiple objects in the Red Hat Ceph Storage cluster, each object is mapped to a placement group, and the placement groups are spread across separate Ceph OSDs throughout the cluster. This greatly enhances parallelism when accessing data on an RBD volume.
RHCS block devices are thin-provisioned, resizable, and store data striped over multiple OSDs in a Ceph cluster. RHCS block devices leverage RADOS capabilities such as self-healing, replication, and consistency.
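To make the block interface concrete, the following is a minimal sketch of creating a replicated pool and a thin-provisioned RBD image from the command line; the pool name, PG count, and image sizes are illustrative placeholders, not the values used in this POC.

```bash
# Hypothetical example: pool name, PG count and sizes are placeholders.
ceph osd pool create rbdpool 1024 1024 replicated
ceph osd pool set rbdpool size 2        # 2x replication, as used later in this POC
rbd pool init rbdpool

# Thin-provisioned image: no space is consumed until data is written.
rbd create rbdpool/testvol01 --size 200G
rbd resize rbdpool/testvol01 --size 400G
rbd info rbdpool/testvol01
```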
Ceph is an open source project with a thriving community. Over the last few releases there has been a significant effort to optimize performance for all-flash clusters; some of these enhancements are:
BlueStore is a new backend object store for the Ceph OSD daemons. The original object store, FileStore, requires a file system on top of raw block devices; objects are then written to that file system. Unlike FileStore, BlueStore stores objects directly on the block devices without any file system interface, which improves cluster performance. To learn more about BlueStore, refer to the Red Hat Ceph Storage documentation.
Figure 2 shows how BlueStore interacts with a block device. Data is written directly to the raw block device, and all metadata operations are managed by RocksDB. The device backing the OSD is divided between RocksDB metadata and the actual user data stored in the cluster. User data objects are stored as blobs directly on the raw block device; once the data has been written, the RocksDB metadata is updated with the details of the new data blobs.
RocksDB is an embedded high-performance key-value store that excels on flash storage. RocksDB cannot write directly to the raw block device; it needs an underlying file system to store its persistent data. This is where BlueFS comes in: BlueFS is a file system that implements the minimal feature set RocksDB needs to store its SST files.
RocksDB uses a WAL as a transaction log on persistent storage. Unlike FileStore, where all writes went to the journal first, BlueStore has two different data paths for writes: one where data is written directly to the block device, and one that uses deferred writes, where data is written to the WAL device first and later asynchronously flushed to disk.
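As an aside (and not part of the POC tuning), the cut-over between the two write paths is governed by a size threshold; on flash-backed OSDs the relevant option is bluestore_prefer_deferred_size_ssd. A minimal sketch of inspecting it on a running OSD, with osd.0 as a placeholder:

```bash
# Sketch: writes smaller than this threshold (in bytes) take the deferred path
# (WAL first, flushed to the block device later); larger writes go direct.
# osd.0 is a placeholder; run this on the node hosting that OSD.
ceph daemon osd.0 config get bluestore_prefer_deferred_size_ssd
ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd
```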
Bluestore has a number of possible storage layout configurations:
For example, in our test environment we configured the main device on the Intel P4500 drive; this is where user data is stored. We then placed the WAL and RocksDB devices on a faster Intel Optane P4800X drive; keeping the WAL and RocksDB on a higher-performing drive yields more IOPS at lower latency.
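As a hedged illustration of this layout (the POC deployment itself was driven by ceph-ansible, and the device paths here are placeholders), a single OSD with this split could be created with ceph-volume like so:

```bash
# Sketch: one BlueStore OSD with user data on a P4500 namespace (main device)
# and RocksDB metadata plus WAL on partitions of the Optane P4800X.
# Device paths are placeholders.
ceph-volume lvm create --bluestore \
  --data /dev/nvme1n1 \
  --block.db /dev/nvme0n1p1 \
  --block.wal /dev/nvme0n1p2
```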
A new feature of BlueStore is that it enables compression of data at the lowest level: if compression is enabled, the data blobs allocated on the raw device are compressed. This means that any data written into Red Hat Ceph Storage, no matter the client used (RBD, RADOS, etc.), can benefit from this feature.
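Compression was not enabled in this POC, but for reference it can be switched on per pool with standard pool properties, roughly as follows ("rbdpool" is a placeholder):

```bash
# Sketch: enable BlueStore compression on an existing pool.
# "aggressive" compresses all writes unless the client hints the data is incompressible.
ceph osd pool set rbdpool compression_algorithm snappy
ceph osd pool set rbdpool compression_mode aggressive
```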
An additional benefit of BlueStore is that it stores data and metadata in the cluster with checksums for increased integrity. Whenever data is read from persistent storage, its checksum is verified.
The test lab consists of 5 x RHCS all-flash (NVMe) servers and 7 x client nodes; the detailed hardware and software configurations are shown in tables 1 and 2, respectively.
| 5 x RHCS OSD Node Configuration | |
| --- | --- |
| Chassis | Cisco UCS C220-M5SN Rack Server |
| CPU | 2 x Intel® Xeon® Platinum 8180, 28 cores (56 HT cores) @ 2.50 GHz |
| Memory | 196 GB |
| NIC | Cisco UCS VIC 1387, 2 ports x 40Gb |
| Storage (Ceph data) | 7 x Intel® SSD DC P4500 4.0 TB |
| Storage (Ceph metadata: RocksDB/WAL) | 1 x Intel® Optane™ SSD DC P4800X 375 GB |
| Ceph pool placement groups | 4096 |
| Software Configuration | RHEL 7.6, Linux kernel 3.10, RHCS 3.2 (12.2.8-52) |
| 7 x Client Hardware Configuration | |
| --- | --- |
| Chassis | Cisco UCS B200 M4 Blade Server |
| CPU | 2 x Intel® Xeon® CPU E5-2640 v4 @ 2.40 GHz |
| Memory | 528 GB |
| NIC | Cisco UCS VIC 1387, 2 ports x 10Gb |
| Software Configuration | RHOSP 10, RHEL 7.6, Linux kernel 3.10, Pbench-FIO 3.3 |
Starting with RHCS 3.2, the Containerized Storage Daemons (CSD) deployment strategy is supported, and it is what was used for this benchmarking exercise. Unlike spinning media, where one OSD per device is recommended, each NVMe device used in this testing was configured to host two Ceph OSDs in order to get maximum performance from the device.
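In a ceph-ansible driven deployment, these two choices boil down to a couple of group variables; the sketch below is an approximation, and the exact variable names depend on the ceph-ansible release shipped with RHCS 3.2.

```bash
# Sketch only: containerized deployment with two OSDs per NVMe device.
# Variable names follow ceph-ansible conventions and may differ between releases.
cat >> group_vars/all.yml <<'EOF'
containerized_deployment: true
EOF

cat >> group_vars/osds.yml <<'EOF'
osd_scenario: lvm
osds_per_device: 2
devices:
  - /dev/nvme1n1   # placeholder; repeated for each of the 7 data NVMe devices
EOF
```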
For a highly performant and fault-tolerant storage cluster, the network architecture is as important as the nodes running the monitors and OSD daemons, so the deployed network must have the capacity to handle the expected client bandwidth. As shown in Figure 3, two physical networks are used:
The lab environment was deployed with an ACI leaf-spine topology, and both the public network and the cluster network are configured to use jumbo frames with an MTU of 9000.
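A hedged sketch of the corresponding host and ceph.conf settings is shown below; interface names and subnets are placeholders rather than the POC values.

```bash
# Sketch: jumbo frames on the storage-facing interfaces plus the two Ceph networks.
ip link set dev ens785f0 mtu 9000   # public (client-facing) network interface
ip link set dev ens785f1 mtu 9000   # cluster (replication) network interface

cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
public_network  = 172.16.10.0/24
cluster_network = 172.16.20.0/24
EOF
```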
The network bandwidth from the OSP clients was limited by the Fabric Interconnect uplink; only 80 Gbit/s was available, so the maximum throughput from the clients to the Ceph cluster was ~10 GB/s.
Containerized deployment of Ceph daemons gives us the flexibility to co-locate multiple Ceph services on a single node. This eliminates the need for dedicated nodes for the monitor and manager services and helps reduce TCO. The first 3 nodes were therefore used to co-locate the Ceph MON, Ceph MGR, and Ceph OSD services, while the remaining two nodes were dedicated to Ceph OSDs. Table 3 details the per-container resource limits configured for the different Ceph daemons.
Each RHCS node has 2 x 28 physical cores, which with Hyper-Threading gives a total of 112 vCPUs. We divided these vCPUs between the RHCS services and reserved a few for the OS: 8 vCPUs for the OS and 3 vCPUs each for the Ceph monitor and manager containers. Each Ceph OSD container was limited to 7 vCPUs. Since we had 7 NVMe devices per node and each device was configured (partitioned) to host 2 Ceph OSDs in order to fully utilize it, the final calculation is 7 vCPUs/OSD * 7 NVMe devices * 2 OSDs per device = 98 vCPUs allocated to Ceph OSDs per node. Adding 3 vCPUs for the MON, 3 vCPUs for the MGR, and 8 vCPUs for the OS brings the total to 112 vCPUs, i.e., full subscription of the physical CPUs.
| | OS | Per OSD Container | Per MON Container | Per MGR Container |
| --- | --- | --- | --- | --- |
| vCPUs (HT cores) | 8 | 7 | 3 | 3 |
| Memory (GB) | 12 | 12 | 6 | 6 |

Table 3: Ceph container resource allocation
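With ceph-ansible, the limits in Table 3 translate into per-daemon Docker CPU and memory variables, roughly as sketched below; the variable names follow ceph-ansible conventions for containerized RHCS 3.x and should be treated as an approximation.

```bash
# Sketch: per-container resource limits matching Table 3 (variable names approximate).
cat >> group_vars/all.yml <<'EOF'
ceph_osd_docker_cpu_limit: 7
ceph_osd_docker_memory_limit: 12g
ceph_mon_docker_cpu_limit: 3
ceph_mon_docker_memory_limit: 6g
ceph_mgr_docker_cpu_limit: 3
ceph_mgr_docker_memory_limit: 6g
EOF
```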
FIO, the industry-standard tool for synthetic benchmarking, was used to exercise Ceph block storage; it provides a librbd-based ioengine to run tests against RBD volumes. We used a different number of volumes per client node depending on what we wanted to achieve in each test, mainly these two combinations:
Initially we used 12 x Ceph RBD volumes per client; however, for certain tests we were not able to stress the storage subsystem to its peak with 12 volumes, so in some tests we increased the count to 15 volumes per client, which squeezed out more performance. When using 15 volumes per client, we used a total of 21 TB of provisioned capacity on a 2x replicated pool (42 TB of raw capacity).
The FIO librbd ioengine allows fio to test the block storage performance of Ceph RBD volumes without any KVM/QEMU configuration, going through the userland librbd libraries. These are the same libraries used by the QEMU backend, so the results approximate KVM/QEMU performance: multiplying 7 clients by 15 RBD volumes creates a client workload similar to 105 OSP instances each using one RBD volume. The FIO template file used is available here.
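For illustration, a single job using the rbd ioengine looks roughly like the sketch below; the pool and image names, block size, and queue depth are placeholders, and the actual job template used in the POC is the one linked above.

```bash
# Sketch: a 4K random-write fio job against one RBD volume through librbd.
cat > rbd-randwrite.fio <<'EOF'
[global]
ioengine=rbd
clientname=admin
pool=rbdpool
rw=randwrite
bs=4k
iodepth=32
direct=1
time_based=1
runtime=900
ramp_time=180

[vol01]
rbdname=fio_vol01
EOF

fio rbd-randwrite.fio
```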
Before running the tests, the RBD volumes were preconditioned with a sequential write over the full size of each volume, eliminating the impact of Ceph's thin-provisioning mechanism and producing stable, reproducible results. Before each test, the OS cache was dropped on all OSD and client nodes. Each test was run four times with 900 seconds of runtime and 180 seconds of ramp-up time; the averages of those four runs are the results reported in this study.
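As an example, preconditioning one volume and dropping the caches could look like the following; names are placeholders, and in the actual runs pbench-fio orchestrated these steps across all nodes.

```bash
# Sketch: sequentially fill an RBD volume so every block is allocated,
# then drop the page cache before the measured run.
fio --name=precondition --ioengine=rbd --clientname=admin \
    --pool=rbdpool --rbdname=fio_vol01 \
    --rw=write --bs=4M --iodepth=16 --numjobs=1

# On every OSD and client node, as root:
sync && echo 3 > /proc/sys/vm/drop_caches
```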
RHCS with the BlueStore OSD backend performs great out of the box for large-block workloads; for small blocks, however, the default OSD configuration was found to be conservative. We therefore tuned the RHCS cluster to optimize small-block performance; for reference, the ceph.conf file used is available here.
Below are some of the most relevant tunings applied to Ceph OSDs.
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
bluestore_cache_autotune = 0
bluestore_cache_size_ssd = 8G
bluestore_cache_kv_ratio = 0.2
bluestore_cache_meta_ratio = 0.8
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
osd_pg_log_dups_tracked = 10
osd_pg_log_trim_min = 10
As reflected in Graph 1, tuning BlueStore resulted in higher IOPS and lower average and tail latencies compared to the default BlueStore configuration. Compared to the default configuration, the tuned BlueStore showed:
Ceph BlueStore Tuned Configuration

| Workload | IOPS | Average Latency | Tail Latency |
| --- | --- | --- | --- |
| Random Read | 43% higher | 30% lower | 36% lower |
| Random Read Write | 132% higher | 73% lower | 90% lower |
| Random Write | 134% higher | 70% lower | 91% lower |
Continuing the benchmarking blog series, in the next section you will find performance insights for small-, medium-, and large-block-size workloads running on the RHCS 3.2 BlueStore all-flash cluster.
This post was originally written by Karan Singh and Daniel Parkes.