A wide variety of Ceph deployment tools have emerged over the years with the aim of making Ceph easier to install and manage. Most of these have leveraged existing tools like Ansible, Puppet, and Salt, bringing with them an existing ecosystem of users and an opportunity to align with an organization's existing investment in a particular tool. As a result, however, the Ceph community's investment has been fragmented across many different efforts, new users face a difficult choice of tool when starting out, and attempts to simplify the experience and the integration with Ceph itself have been difficult.
Like many others, I’ve personally stuck with the dated ceph-deploy tool, which has the benefit of being extremely simple to use and understand (at least for someone familiar with Ceph), and has the nice property of not requiring an initial investment in installing and learning another tool. These days ceph-deploy is no longer maintained, however, and doesn’t even work with some newer distros like RHEL/CentOS 8.
Most importantly, however, none of these tools has done a great job of solving the core problem: making Ceph very easy to install for a new user, and making a Ceph cluster easy to maintain over time through seamless integration with the Ceph CLI and GUI. A new orchestrator API was first introduced in Ceph Nautilus to provide a generic way for Ceph–the CLI and the dashboard–to interact with its deployment environment, whether it’s Rook or ceph-ansible or DeepSea, but only with Octopus has this reached a level of maturity where it is providing meaningful abstraction across multiple backends: Rook for Kubernetes environments, and Cephadm for everyone else.
The goal of Cephadm is to provide a fully-featured, robust, and well-maintained install and management layer that can be used by anyone who is not running Ceph in Kubernetes.
The goal with all of this is to focus the attention of the Ceph developer and user community on just two platforms for deploying and managing Ceph–cephadm for “bare metal” deployments, and Rook for running Ceph in Kubernetes–and to provide a similar management experience for both of them.
The cephadm model is to have a simple “bootstrap” step that is started from a command line that brings up a minimal Ceph cluster (a single monitor and manager daemon) on the local host. The rest of the cluster is then deployed using “day 2” orchestrator commands to add additional hosts, consume storage devices, and deploy daemons for cluster services.
Getting a cluster bootstrapped is as simple as:
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
chmod +x cephadm
mkdir -p /etc/ceph
./cephadm bootstrap --mon-ip <ip>
After 30 to 60 seconds, a minimal Ceph cluster will be up and running, and cephadm will have printed out a command to access the Ceph CLI (via a containerized shell) and a URL to access the dashboard:
INFO:cephadm:Ceph Dashboard is now available at:
             URL: https://gnit:8443/
            User: admin
        Password: 07j394z550
INFO:cephadm:You can access the Ceph CLI with:
        sudo ./cephadm shell --fsid 2d2fd136-6df1-11ea-ae74-002590e526e8 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring
INFO:cephadm:Bootstrap complete.
Because Ceph is fully containerized (check podman ps or docker ps), no software has been installed on the host, and the usual ceph command won't work (not yet, at least). There are a few ways to interact with the new cluster.
One way is to use the cephadm shell command. The cephadm that was used to bootstrap can also launch a containerized shell that has all of the Ceph software (including the CLI) installed. And because bootstrap puts a copy of the ceph config and admin keyring in /etc/ceph by default, and the shell command looks there by default, you can launch a working shell and use the CLI with just

./cephadm shell
ceph status
The cephadm command also makes it easy to install “traditional” Ceph packages on the host. To install the Ceph CLI commands and the cephadm command in the standard locations,
./cephadm add-repo --release octopus
./cephadm install cephadm ceph-common
This supports a few common Linux distros to start (CentOS/RHEL, Debian/Ubuntu, OpenSUSE/SLE) and can easily be extended to support new ones.
Any real Ceph cluster spans multiple hosts. Cephadm manages the cluster by using SSH to connect from the ceph-mgr daemon to hosts in the cluster to introspect the environment, monitor Ceph daemons, and deploy or remove daemons. Each Ceph cluster generates a unique SSH identity and key that is used to connect to hosts. The bootstrap process adds this key to the local host's root user's authorized_keys file for you. Adding additional hosts, however, requires a few manual steps.
First, we need a copy of the public side of the cluster key. By default bootstrap puts a copy at /etc/ceph/ceph.pub, or you can fetch a copy from the cluster with ceph cephadm get-ssh-pub-key.
For each host, we first need to install the key on the remote system. This is most easily done with the ssh-copy-id command included with any recent version of SSH:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@new-host
This command may prompt for a root password if your current user does not already have password-less SSH access set up.
Next, we need to tell Ceph about the new host. We assume here that all hosts have a unique hostname that matches the configured hostname on the host itself. If your local environment does not have DNS configured such that we can connect to these hostnames, or if you would like to avoid a dependency on DNS, you can provide an IP address for each host as well:
ceph orch host add <new-host> [<new-host-ip>]
You can see all hosts in the cluster with
ceph orch host ls
Each service or collection of daemons in Cephadm has an associated placement spec, or description of where and how many daemons should be deployed. By default, a new cluster deployed with cephadm knows that it should have 5 monitors, 2 managers, and a few other services (like a crash dump collector) deployed on every host. New monitors and managers are deployed automatically as soon as additional hosts are added to the cluster. You can see the new cluster's services and deployed daemons with the ceph orch ls and ceph orch ps commands:
# ceph orch ls
NAME           RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME                           IMAGE ID
alertmanager       1/1  71s ago    22m  count:1    docker.io/prom/alertmanager:latest   0881eb8f169f
crash              1/1  71s ago    23m  *          docker.io/ceph/ceph:v15              204a01f9b0b6
grafana            1/1  71s ago    22m  count:1    docker.io/ceph/ceph-grafana:latest   87a51ecf0b1c
mgr                1/2  71s ago    23m  count:2    docker.io/ceph/ceph:v15              204a01f9b0b6
mon                1/5  71s ago    23m  count:5    docker.io/ceph/ceph:v15              204a01f9b0b6
node-exporter      1/1  71s ago    22m  *          docker.io/prom/node-exporter:latest  e5a616e4b9cf
prometheus         1/1  71s ago    22m  count:1    docker.io/prom/prometheus:latest     e935122ab143

# ceph orch ps
NAME                HOST  STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                           IMAGE ID      CONTAINER ID
alertmanager.gnit   gnit  running (21m)  96s ago    22m  0.20.0   docker.io/prom/alertmanager:latest   0881eb8f169f  15ceff5ae935
crash.gnit          gnit  running (22m)  96s ago    23m  15.2.0   docker.io/ceph/ceph:v15              204a01f9b0b6  0687711365e4
grafana.gnit        gnit  running (21m)  96s ago    22m  6.6.2    docker.io/ceph/ceph-grafana:latest   87a51ecf0b1c  fa1db4647c4c
mgr.gnit.xmfvjy     gnit  running (24m)  96s ago    24m  15.2.0   docker.io/ceph/ceph:v15              204a01f9b0b6  6a29bc868357
mon.gnit            gnit  running (24m)  96s ago    24m  15.2.0   docker.io/ceph/ceph:v15              204a01f9b0b6  072f5926faa8
node-exporter.gnit  gnit  running (22m)  96s ago    22m  0.18.1   docker.io/prom/node-exporter:latest  e5a616e4b9cf  eb5f715005fc
prometheus.gnit     gnit  running (22m)  96s ago    22m  2.16.0   docker.io/prom/prometheus:latest     e935122ab143  6ee6de1b3cc1
In the example output above, you'll notice that a number of non-Ceph daemons are deployed: Prometheus, Grafana, alertmanager, and node-exporter. These provide a basic but fully configured and functional monitoring stack that allows all of the metrics and graphs in the Ceph dashboard to work out of the box. If you already have an existing Prometheus deployment you'd like Ceph to use, you can tell cephadm to skip all of this by passing --skip-monitoring-stack to the bootstrap command.
For most users, this default behavior is all that you need. For advanced users who want to control exactly which hosts monitors are deployed on, or which IPs are chosen, some additional steps are needed to customize the placement of these daemons. Automated placement for specific services (like monitors) can even be disabled entirely, although there should be relatively few reasons to do so.
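As a sketch of what that customization can look like, a cephadm service specification can pin monitors to an explicit set of hosts; the hostnames below are hypothetical, and the spec would be applied with a command like ceph orch apply -i mon-spec.yaml:

```yaml
# Hypothetical placement spec pinning monitors to three named hosts.
# Replace host1..host3 with real hostnames from "ceph orch host ls".
service_type: mon
placement:
  hosts:
    - host1
    - host2
    - host3
```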
Once a cluster is up and running, a minimal but sufficient ceph.conf file for hosts accessing the cluster can be fetched with:

# ceph config generate-minimal-conf
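The output looks roughly like the following (the mon_host address here is made up for illustration; the fsid matches whatever bootstrap generated):

```
# minimal ceph.conf for 2d2fd136-6df1-11ea-ae74-002590e526e8
[global]
	fsid = 2d2fd136-6df1-11ea-ae74-002590e526e8
	mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0]
```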
Adding OSDs to the Ceph cluster is usually one of the trickiest parts of a deployment. There are a variety of ways that HDDs and SSDs can be combined to balance performance and cost, and telling Ceph which devices to consume can be tricky.
For many users, we hope, the following command will be sufficient:
ceph orch apply osd --all-available-devices
This will consume any device (HDD or SSD) on any host that is part of the Ceph cluster that passes all of the safety checks, which means there are no partitions, no LVM volumes, no file systems, etc. Each device will get a single OSD deployed, which is the simplest case that applies to many (if not most) users.
For the rest of us, we have several tools at our disposal. We can enumerate all devices on all hosts (and the status of the above safety checks) with:
ceph orch device ls
A single OSD can be created on a single device explicitly with:
ceph orch daemon add osd host-foo:/dev/foo
For more sophisticated automation, however, the orchestrator API introduces the concept of DriveGroups, which allow OSD deployments to be described in terms of device properties (SSD vs HDD, model name, size, host name patterns), and for “hybrid” OSDs that combine multiple devices (e.g., SSD for metadata and HDD for data) to be deployed in a semi-automated fashion.
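As an illustrative sketch (the service_id and host pattern are made up), a DriveGroups spec that puts data on HDDs and BlueStore metadata on SSDs might look like the following, applied with ceph orch apply osd -i osd-spec.yaml:

```yaml
# Hypothetical hybrid OSD spec: rotational devices hold data,
# non-rotational (SSD) devices hold the metadata (BlueStore DB).
service_type: osd
service_id: hybrid_osds
placement:
  host_pattern: 'storage-*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
```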
Other Ceph daemons are stateless, in the sense that they don't store any data locally and can be trivially redeployed on any host. These are easy for cephadm... and in the case of CephFS, their deployment is completely automated. For example, to create a CephFS file system called foo,

ceph fs volume create foo
will create the necessary data and metadata pools and deploy MDS daemons all in one step. The number and placement of daemons can be examined and adjusted later via the ceph orch ls and ceph orch apply mds ... commands, or an optional placement argument can be passed to the volume create command.
For object storage with RGW, things aren’t quite as streamlined (yet), but the orchestrator and cephadm infrastructure is there to manage the underlying daemons. For a standalone object storage cluster,
radosgw-admin realm create --rgw-realm=myorg --default
radosgw-admin zonegroup create --rgw-zonegroup=default --master --default
radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=us-east-1 --master --default
ceph orch apply rgw myorg us-east-1
For an existing (multi-site or standalone) deployment, deploying the daemons can be as simple as ceph orch apply rgw <realmname> <zonename>, provided the RGW configuration options are already stored in the cluster's config database (ceph config set client.rgw.$realmname.$zonename ...) and not in a local ceph.conf file.

Cephadm can also upgrade a running cluster to a new Ceph release with a single command:

ceph orch upgrade start --ceph-version 15.2.1
The upgrade progress can be monitored from the ceph status view, which will include a progress bar like:

Upgrade to docker.io/ceph/ceph:v15.2.1 (3m) [===.........................] (remaining: 21m)
It’s helpful (and, for me, comforting) to be able to take a closer look at what cephadm is doing behind the scenes to run services on remote hosts. The first thing you can take a look at is the running containers with podman ps or docker ps. You’ll notice that all containers have the cluster fsid UUID in the name so that multiple clusters may be present on the same host without bumping into each other. (This is mostly true, except for when daemons use fixed ports, like the Ceph monitor, or services like the prometheus node-exporter.)
The files are all separate as well. In /var/lib/ceph and /var/log/ceph you’ll find things separated out by cluster fsid. And in each daemon’s directory under /var/lib/ceph you’ll see a file called unit.run which has the docker or podman command that starts the daemon; this is what the systemd unit executes.
Although you may recall that the bootstrap step wrote files to /etc/ceph, it did that only for convenience so that in the common case of a single cluster on the host, simply installing the ceph-common package will allow the ceph CLI to work. Passing --output-dir . (or similar) to bootstrap will write those files elsewhere.
In fact, the only other changes present on the host OS are systemd unit files in /etc/systemd/system (for each cluster, a ceph-$fsid.target and a ceph-$fsid@.service shared by all of that cluster’s daemons, plus a ceph.target unit to start/stop all Ceph services) and /etc/logrotate.d/ceph-$fsid, in case logging to files is enabled. (By default, cephadm daemons log to stderr and logs are captured by the container runtime.)
Meanwhile, changes are being driven by the cephadm module running in the ceph-mgr daemon. Services are configured via the orchestrator interface, which is accessible either via the internal Python interfaces (e.g., for use by the dashboard) or via the CLI. To see all available commands, try ceph orch -h. ceph orch ls in particular will describe the currently configured services.
In the background, cephadm has a “reconciliation loop,” much like Kubernetes, that compares the current state to the desired state, as specified by the configured services. To monitor its activity, ceph -W cephadm will tail the log as it makes changes, or ceph log last cephadm will show recent messages. This background work can be paused at any time with ceph orch pause and resumed with ceph orch resume.
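As a rough sketch of the idea (illustrative Python, not cephadm's actual implementation; the function and field names are made up), a reconciliation loop boils down to diffing the desired daemon counts against what is currently running and emitting the actions needed to converge:

```python
# Illustrative reconciliation step: given a desired daemon count per
# service and the currently running counts, compute the deploy/remove
# actions that would converge the cluster toward the desired state.
def reconcile(desired: dict, running: dict) -> list:
    """Return a list of (action, service) tuples to converge state."""
    actions = []
    for service, want in desired.items():
        have = running.get(service, 0)
        if have < want:
            # Not enough daemons: schedule deployments.
            actions.extend([("deploy", service)] * (want - have))
        elif have > want:
            # Too many daemons: schedule removals.
            actions.extend([("remove", service)] * (have - want))
    return actions

# e.g. a fresh cluster that wants 5 mons and 2 mgrs but has 1 of each:
desired = {"mon": 5, "mgr": 2}
running = {"mon": 1, "mgr": 1}
print(reconcile(desired, running))
# [('deploy', 'mon'), ('deploy', 'mon'), ('deploy', 'mon'), ('deploy', 'mon'), ('deploy', 'mgr')]
```

In cephadm this loop also accounts for host placement, daemon health, and refresh intervals, but the core compare-and-converge shape is the same.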
With the initial Octopus release, cephadm has solid support for the core Ceph services: RADOS, CephFS, RBD, and RGW. A number of secondary services are under active development, including NFS and iSCSI gateway support, and CIFS support (via Samba) is expected to follow after that. All of these changes will be backported to Octopus as they are completed.
Meanwhile, we also expect to improve the robustness and intelligence of the “scheduling” algorithm that decides where to run services. Right now cephadm simply spreads service daemons across hosts, but (by default) chooses those hosts at random. We’d like to improve this by setting resource limits on daemon containers (e.g., CPU and memory) and choosing the location of daemons intelligently based on the available resources on each host.
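The current behavior described above is easy to picture with a toy model (illustrative only; this is not cephadm's scheduler code, and the names are made up): pick the requested number of distinct hosts at random and put one daemon on each.

```python
import random

# Toy model of naive spread scheduling: choose "count" distinct hosts
# at random; each chosen host gets one daemon of the service.
def place_daemons(hosts, count, seed=None):
    rng = random.Random(seed)
    return rng.sample(hosts, min(count, len(hosts)))

hosts = ["host1", "host2", "host3", "host4"]
print(place_daemons(hosts, 3))  # three distinct hosts, chosen at random
```

A smarter scheduler would weight this choice by each host's available CPU and memory rather than treating all hosts as interchangeable.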
Finally, we expect to spend a lot of time over the next development cycle surfacing more of the orchestrator functionality through the Ceph dashboard to simplify the overall user experience, especially for common operations like initial deployment, cluster expansion, and the replacement of failed storage devices.
Last but not least: Cephadm is brand new, and we’re looking for feedback from real users deploying it for the first time in the real world to hear about what works well and what doesn’t, and what we can do to improve things!
For more information on Cephadm, please see the online documentation.
A big thank you goes out to the team that made cephadm possible: Sebastian Wagner, Joshua Schmidt, Michael Fritch, Daniel Pivonka, Paul Cuzner, Kristoffer Grönlund, Kiefer Chang, Patrick Seidensal, and Volker Theile; to Noah Watkins, who wrote the first version of the “ssh orchestrator;” and to John Spray, who got the whole orchestrator abstraction started way back in Nautilus.