Device Telemetry

Since January 2020, users have been opting in to phone home anonymized, non-identifying data about their cluster’s deployment and configuration, along with the health metrics of their storage drives. The cluster data helps Ceph developers better understand how Ceph is being used in the wild, identify issues, and prioritize bugs. The drive health metrics data is aimed at building a large, diverse, dynamic, free and open data set to help data scientists create accurate failure prediction models.

This page holds the raw device health data for research. This data contains health metrics of various device types, interfaces, vendors, and models, from various deployment types (JBOD, HW RAID, VMs). Statistics on this data are available in public dashboards.

Source of data

First introduced in the 2019 Nautilus release, the devicehealth ceph-mgr module provides user-friendly drive management, including scraping and exposing device health metrics such as SMART and, where applicable, NVMe data. If the user opts in to sending anonymized telemetry, the telemetry module compiles a daily report and phones it home.

The data collected is anonymized on the client side before it is sent to the telemetry server. Ceph does not collect any identifying information, and replaces both host names and drive serial numbers with randomly generated UUIDs.
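To illustrate, here is a minimal sketch of this kind of client-side anonymization. The field names and mapping approach are assumptions for illustration, not the telemetry module's actual implementation; the key idea is that each identifier is replaced with a randomly generated UUID, and the original values never leave the client.

```python
import uuid

def anonymize(value, mapping):
    """Return a stable random UUID for `value`, creating one on first use."""
    # Reusing the mapping keeps the anonymized ID stable across reports,
    # so the same (unknown) device can still be tracked over time.
    if value not in mapping:
        mapping[value] = str(uuid.uuid4())
    return mapping[value]

host_map = {}
serial_map = {}

# Hypothetical raw report fields before anonymization:
report = {"host": "storage-node-01", "serial": "WD-1234ABCD"}
anon_report = {
    "host": anonymize(report["host"], host_map),
    "serial": anonymize(report["serial"], serial_map),
}
# Only the UUIDs in anon_report are sent; the raw identifiers stay local.
```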

Data format

The data is split into zipped CSV files with four columns:

  • 'ts' - timestamp - timestamp of the health metrics scrape
  • 'device_uuid' - text - anonymized unique identifier of the reporting device
  • 'invalid' - boolean - denotes whether the report contains invalid telemetry
  • 'report' - text - JSON blob that holds the output of the device report

Each file contains the data reported during a single month, as indicated by the file’s name (e.g., one file contains all device health reports of January 2020).

Please note: the raw data includes reports from devices whose health metrics could not be scraped due to software issues (e.g., an older smartmontools version that does not support the JSON output option); in these rows the ‘invalid’ column is set to ‘true’. This does not reflect any problem with the device’s health; it simply means that the health metrics could not be retrieved due to software issues on that host. We chose to include these invalid telemetry reports since they still indicate the device is active, which can contribute to certain model-training approaches.
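A minimal sketch of loading one monthly file, assuming it has been unzipped to a local CSV with the four columns above as a header row (the actual file names, and whether a header row is present, are assumptions). Rows flagged ‘invalid’ are skipped and the ‘report’ JSON blob is parsed for the rest:

```python
import csv
import json

def load_reports(path):
    """Load valid device reports from one monthly CSV (assumed header row)."""
    reports = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["invalid"].lower() == "true":
                continue  # metrics could not be scraped on that host; skip
            # 'report' holds the device report as a JSON string
            reports.append((row["ts"], row["device_uuid"], json.loads(row["report"])))
    return reports
```

Depending on the modeling goal, the invalid rows could instead be kept as evidence that the device was still active at that timestamp.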

The drive telemetry data does not contain drive state labels (i.e., healthy vs. failed). We have no way of comprehensively understanding why a system operator removed a disk, since there is no mechanism for operators to communicate with us and share the reason for the removal.

One reasonable heuristic is to label a drive as failed when it stops reporting while its host is still active (i.e., other drives attached to that host are still reporting), especially if a new drive starts reporting from the same host. In the future we plan to include the device’s port and controller information, which provides a better understanding of the device’s presence.
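The heuristic above can be sketched as follows. This is an assumption-laden illustration, not an official labeling method: it treats months as consecutive integers, and it assumes a host identifier can be recovered from the report blob (the dataset itself only exposes the anonymized device UUID).

```python
from collections import defaultdict

def candidate_failures(observations):
    """Flag devices that stop reporting while their host keeps reporting.

    `observations` is a hypothetical list of (month, host_uuid, device_uuid)
    tuples, with months encoded as consecutive integers.
    """
    devices_by_month_host = defaultdict(set)
    hosts_by_month = defaultdict(set)
    for month, host, dev in observations:
        devices_by_month_host[(month, host)].add(dev)
        hosts_by_month[month].add(host)

    candidates = set()
    for (month, host), devs in devices_by_month_host.items():
        next_month = month + 1
        if host in hosts_by_month.get(next_month, set()):
            # Host still active next month: devices that vanished from it
            # are candidate failures (they may also have been decommissioned).
            candidates |= devs - devices_by_month_host[(next_month, host)]
    return candidates
```

As the comment notes, a vanished drive may have been healthy but decommissioned, which is exactly why port and controller information would sharpen this signal.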

Data usage

This data is under Community Data License Agreement – Sharing – Version 1.0. By using this data you agree to its license.

How you can help

Please join us in this effort to improve storage drive reliability, and opt-in to telemetry. You can do this from your Ceph dashboard, or with:

ceph telemetry on
See the telemetry module documentation for more details.

Contact us

Please share with us how you use this data, any interesting findings, or any questions you may have:


Downloads

Monthly file sizes per year:

2021 | 2.41 GB | 1.75 GB | 1.07 GB | 957.15 MB | 662.50 MB | 504.91 MB | 342.07 MB | 302.73 MB
2020 | 244.05 MB | 97.38 MB | 79.06 MB | 58.31 MB | 52.05 MB | 34.87 MB | 25.64 MB | 16.52 MB | 13.25 MB | 8.84 MB | 7.61 MB | 385.63 KB