Ceph OSD: Where Is My Data?

laurentbarbe

The purpose is to verify where my data is stored on the Ceph cluster.

For this, I have just created a minimal cluster with 3 OSDs:

$ ceph-deploy osd create ceph-01:/dev/sdb ceph-02:/dev/sdb ceph-03:/dev/sdb

Where is my OSD directory on ceph-01?

$ mount | grep ceph
/dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime,attr2,delaylog,noquota)

The directory content:

$ cd /var/lib/ceph/osd/ceph-0; ls -l
total 52
-rw-r--r-- 1 root root  487 août  20 12:12 activate.monmap
-rw-r--r-- 1 root root    3 août  20 12:12 active
-rw-r--r-- 1 root root   37 août  20 12:12 ceph_fsid
drwxr-xr-x 133 root root 8192 août  20 12:18 current
-rw-r--r-- 1 root root   37 août  20 12:12 fsid
lrwxrwxrwx   1 root root   58 août  20 12:12 journal -> /dev/disk/by-partuuid/37180b7e-fe5d-4b53-8693-12a8c1f52ec9
-rw-r--r-- 1 root root   37 août  20 12:12 journal_uuid
-rw------- 1 root root   56 août  20 12:12 keyring
-rw-r--r-- 1 root root   21 août  20 12:12 magic
-rw-r--r-- 1 root root    6 août  20 12:12 ready
-rw-r--r-- 1 root root    4 août  20 12:12 store_version
-rw-r--r-- 1 root root    0 août  20 12:12 sysvinit
-rw-r--r-- 1 root root    2 août  20 12:12 whoami

$ du -hs *
4,0K  activate.monmap → The current monmap
4,0K  active      → "ok"
4,0K  ceph_fsid   → cluster fsid (same as returned by 'ceph fsid')
2,1M  current
4,0K  fsid        → UUID of this OSD
0 journal         → symlink to journal partition
4,0K  journal_uuid
4,0K  keyring     → the key
4,0K  magic       → "ceph osd volume v026"
4,0K  ready       → "ready"
4,0K  store_version   
0 sysvinit
4,0K  whoami      → id of the osd

The data is stored in the “current” directory. It contains a few files and many _head directories, one per placement group:

$ cd current; ls -l | grep -v head
total 20
-rw-r--r-- 1 root root     5 août  20 12:18 commit_op_seq
drwxr-xr-x 2 root root 12288 août  20 12:18 meta
-rw-r--r-- 1 root root     0 août  20 12:12 nosnap
drwxr-xr-x 2 root root   111 août  20 12:12 omap

In the omap directory:

$ cd omap; ls -l
-rw-r--r-- 1 root root     150 août  20 12:12 000007.sst
-rw-r--r-- 1 root root 2031616 août  20 12:18 000010.log 
-rw-r--r-- 1 root root      16 août  20 12:12 CURRENT
-rw-r--r-- 1 root root       0 août  20 12:12 LOCK
-rw-r--r-- 1 root root     172 août  20 12:12 LOG
-rw-r--r-- 1 root root     309 août  20 12:12 LOG.old
-rw-r--r-- 1 root root   65536 août  20 12:12 MANIFEST-000009

In the meta directory:

$ cd ../meta; ls -l
total 940
-rw-r--r-- 1 root root  710 août  20 12:14 inc\uosdmap.10__0_F4E9C003__none
-rw-r--r-- 1 root root  958 août  20 12:12 inc\uosdmap.1__0_B65F4306__none
-rw-r--r-- 1 root root  722 août  20 12:14 inc\uosdmap.11__0_F4E9C1D3__none
-rw-r--r-- 1 root root  152 août  20 12:14 inc\uosdmap.12__0_F4E9C163__none
-rw-r--r-- 1 root root  153 août  20 12:12 inc\uosdmap.2__0_B65F40D6__none
-rw-r--r-- 1 root root  574 août  20 12:12 inc\uosdmap.3__0_B65F4066__none
-rw-r--r-- 1 root root  153 août  20 12:12 inc\uosdmap.4__0_B65F4136__none
-rw-r--r-- 1 root root  722 août  20 12:12 inc\uosdmap.5__0_B65F46C6__none
-rw-r--r-- 1 root root  136 août  20 12:14 inc\uosdmap.6__0_B65F4796__none
-rw-r--r-- 1 root root  642 août  20 12:14 inc\uosdmap.7__0_B65F4726__none
-rw-r--r-- 1 root root  153 août  20 12:14 inc\uosdmap.8__0_B65F44F6__none
-rw-r--r-- 1 root root  722 août  20 12:14 inc\uosdmap.9__0_B65F4586__none
-rw-r--r-- 1 root root    0 août  20 12:12 infos__head_16EF7597__none
-rw-r--r-- 1 root root 2870 août  20 12:14 osdmap.10__0_6417091C__none
-rw-r--r-- 1 root root  830 août  20 12:12 osdmap.1__0_FD6E49B1__none
-rw-r--r-- 1 root root 2870 août  20 12:14 osdmap.11__0_64170EAC__none
-rw-r--r-- 1 root root 2870 août  20 12:14 osdmap.12__0_64170E7C__none   → current osdmap
-rw-r--r-- 1 root root 1442 août  20 12:12 osdmap.2__0_FD6E4941__none
-rw-r--r-- 1 root root 1510 août  20 12:12 osdmap.3__0_FD6E4E11__none
-rw-r--r-- 1 root root 2122 août  20 12:12 osdmap.4__0_FD6E4FA1__none
-rw-r--r-- 1 root root 2122 août  20 12:12 osdmap.5__0_FD6E4F71__none
-rw-r--r-- 1 root root 2122 août  20 12:14 osdmap.6__0_FD6E4C01__none
-rw-r--r-- 1 root root 2190 août  20 12:14 osdmap.7__0_FD6E4DD1__none
-rw-r--r-- 1 root root 2802 août  20 12:14 osdmap.8__0_FD6E4D61__none
-rw-r--r-- 1 root root 2802 août  20 12:14 osdmap.9__0_FD6E4231__none
-rw-r--r-- 1 root root  354 août  20 12:14 osd\usuperblock__0_23C2FCDE__none
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.0__0_103B076E__none     → Log for each pg
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.1__0_103B043E__none
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.11__0_5172C9DB__none
-rw-r--r-- 1 root root    0 août  20 12:12 pglog\u0.13__0_5172CE3B__none
-rw-r--r-- 1 root root    0 août  20 12:13 pglog\u0.15__0_5172CC9B__none
-rw-r--r-- 1 root root    0 août  20 12:13 pglog\u0.16__0_5172CC2B__none
............
-rw-r--r-- 1 root root    0 août  20 12:12 snapmapper__0_A468EC03__noneosd

Try decompiling the CRUSH map from the osdmap:

$ ceph osd stat
e12: 3 osds: 3 up, 3 in

$ osdmaptool osdmap.12__0_64170E7C__none --export-crush /tmp/crushmap.bin
osdmaptool: osdmap file 'osdmap.12__0_64170E7C__none'
osdmaptool: exported crush map to /tmp/crushmap.bin

$ crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

$ cat /tmp/crushmap.txt
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host ceph-01 {
  id -2       # do not change unnecessarily
  # weight 0.050
  alg straw
  hash 0  # rjenkins1
  item osd.0 weight 0.050
}
host ceph-02 {
  id -3       # do not change unnecessarily
  # weight 0.050
  alg straw
  hash 0  # rjenkins1
  item osd.1 weight 0.050
}
host ceph-03 {
  id -4       # do not change unnecessarily
  # weight 0.050
  alg straw
  hash 0  # rjenkins1
  item osd.2 weight 0.050
}
root default {
  id -1       # do not change unnecessarily
  # weight 0.150
  alg straw
  hash 0  # rjenkins1
  item ceph-01 weight 0.050
  item ceph-02 weight 0.050
  item ceph-03 weight 0.050
}

...

# end crush map

OK, it’s what I expected. :)
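
As a quick sanity check on the decompiled map, the weight of a bucket should equal the sum of its items' weights (0.150 for root default here). Below is a minimal sketch with awk, assuming the text layout produced by `crushtool -d` as shown in this post. The function name `bucket_weight` is a hypothetical helper of mine, not a Ceph tool.

```shell
# Sum the "item ... weight W" lines of one bucket in a decompiled CRUSH map.
# Hypothetical helper: assumes the text layout produced by `crushtool -d`.
bucket_weight() {
    # LC_ALL=C keeps the decimal separator a "." even under a French locale
    LC_ALL=C awk -v want="$2" '
        $2 == want && $3 == "{" { inside = 1; next }   # bucket header, e.g. "root default {"
        inside && $1 == "}"     { inside = 0 }         # end of the bucket
        inside && $1 == "item"  { sum += $4 }          # "item osd.0 weight 0.050"
        END { printf "%.3f\n", sum }
    ' "$1"
}

# Usage: bucket_weight /tmp/crushmap.txt default
```

For the map above, summing the three host items of `default` should give the 0.150 shown in its `# weight` comment.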

The cluster is empty:

$ find *_head -type f | wc -l
0

The directory list corresponds to the output of ‘ceph pg dump’:

$ for dir in `ceph pg dump | grep '\[0,' | cut -f1`; do if [ -d "${dir}_head" ]; then echo exist; else echo nok; fi; done | sort | uniq -c
dumped all in format plain
     69 exist
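
The same directory layout can be used to count objects per PG directly on the OSD. This is a small sketch, assuming the "current" directory layout shown above (one `<pgid>_head` directory per PG); `objects_per_pg` is a hypothetical helper name.

```shell
# Print "<pgid> <object count>" for every PG directory under an OSD's
# "current" directory. Hypothetical helper, not part of Ceph.
objects_per_pg() {
    current=$1
    for d in "$current"/*_head; do
        [ -d "$d" ] || continue              # skip if the glob matched nothing
        pgid=$(basename "$d" _head)          # "3.2_head" -> "3.2"
        count=$(find "$d" -type f | wc -l | tr -d ' ')
        echo "$pgid $count"
    done
}

# Usage: objects_per_pg /var/lib/ceph/osd/ceph-0/current
```

On the empty cluster every line should end in 0, which matches the `find` result above.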

To get all stats for a specific PG:

$ ceph pg 0.1 query
{ "state": "active+clean",
  "epoch": 12,
  "up": [
        0,
        1],
  "acting": [
        0,
        1],
  "info": { "pgid": "0.1",
      "last_update": "0'0",
      "last_complete": "0'0",
      "log_tail": "0'0",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 12,
          "last_epoch_clean": 12,
          "last_epoch_split": 0,
          "same_up_since": 9,
          "same_interval_since": 9,
          "same_primary_since": 5,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_clean_scrub_stamp": "0.000000"},
      "stats": { "version": "0'0",
          "reported_seq": "12",
          "reported_epoch": "12",
          "state": "active+clean",
          "last_fresh": "2013-08-20 12:16:22.709534",
          "last_change": "2013-08-20 12:16:22.105099",
          "last_active": "2013-08-20 12:16:22.709534",
          "last_clean": "2013-08-20 12:16:22.709534",
          "last_became_active": "0.000000",
          "last_unstale": "2013-08-20 12:16:22.709534",
          "mapping_epoch": 5,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 1,
          "last_epoch_clean": 12,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-20 12:12:37.851559",
          "last_clean_scrub_stamp": "0.000000",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 0,
              "num_write_kb": 0,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [
                0,
                1],
          "acting": [
                0,
                1]},
      "empty": 1,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 12},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-08-20 12:15:30.102250",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-08-20 12:14:51.501628"}]}

Retrieve an object on the cluster

In this test we create a standard pool (pg_num=8 and 2 replicas):

$ rados mkpool testpool
$ wget -q http://ceph.com/docs/master/_static/logo.png
$ md5sum logo.png
4c7c15e856737efc0d2d71abde3c6b28  logo.png

$ rados put -p testpool logo.png logo.png
$ ceph osd map testpool logo.png
osdmap e14 pool 'testpool' (3) object 'logo.png' -> pg 3.9e17671a (3.2) -> up [2,1] acting [2,1]

My Ceph logo is in PG 3.2 (primary on osd.2 and replica on osd.1).
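
The `ceph osd map` output shows the PG in two forms: 3.9e17671a is the pool id followed by the 32-bit hash of the object name, and 3.2 is that hash folded onto the 8 PGs of the pool. The sketch below reproduces the folding step under the assumption that pg_num is a power of two, in which case it reduces to a bit mask. `pg_for_hash` is a hypothetical helper name.

```shell
# Fold a 32-bit object hash onto the PGs of a pool.
# Assumption: pg_num is a power of two, so the mapping is a simple bit mask.
pg_for_hash() {
    hash=$1      # hex, as printed by `ceph osd map`, e.g. 9e17671a
    pg_num=$2    # number of PGs in the pool, e.g. 8
    # PG numbers are printed in hex, hence %x
    printf '%x\n' $(( 0x$hash & (pg_num - 1) ))
}

# Usage: pg_for_hash 9e17671a 8
```

With the hash above and 8 PGs, the low three bits of 0x1a select PG 2, which agrees with the 3.2 reported by the cluster.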

$ ceph osd tree
# id  weight  type name   up/down reweight
-1    0.15    root default
-2    0.04999     host ceph-01
0 0.04999         osd.0   up  1   
-3    0.04999     host ceph-02
1 0.04999         osd.1   up  1   
-4    0.04999     host ceph-03
2 0.04999         osd.2   up  1

And osd.2 is on ceph-03:

$ cd /var/lib/ceph/osd/ceph-2/current/3.2_head/
$ ls
logo.png__head_9E17671A__3
$ md5sum logo.png__head_9E17671A__3
4c7c15e856737efc0d2d71abde3c6b28  logo.png__head_9E17671A__3

It is exactly the same :)
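
The on-disk file name is built from pieces already printed by `ceph osd map`: the object name, the snapshot field (head for live data), the hash in upper case, and the pool id. Here is a rough sketch of that naming, inferred only from the listings in this post. `filestore_name` is a hypothetical helper, and printing the pool id in hex is an assumption on my part. It also ignores escaping: the meta directory shows that underscores in object names are escaped (for example `osd\usuperblock`).

```shell
# Rebuild the on-disk file name of an object from its name, hash and pool id.
# Hypothetical helper; naming scheme inferred from the directory listings,
# and it does not escape special characters in the object name.
filestore_name() {
    object=$1          # e.g. logo.png
    hash=$2            # hex hash from `ceph osd map`, e.g. 9e17671a
    pool=$3            # pool id, e.g. 3 (assumed to be printed in hex)
    snap=${4:-head}    # "head" for live data, or a snapshot id
    printf '%s__%s_%08X__%x\n' "$object" "$snap" "0x$hash" "$pool"
}

# Usage: filestore_name logo.png 9e17671a 3
```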

Import RBD

Same thing, but testing as a block device.

$ rbd import logo.png testpool/logo.png 
Importing image: 100% complete...done.
$ rbd info testpool/logo.png
rbd image 'logo.png':
  size 3898 bytes in 1 objects
  order 22 (4096 KB objects)
  block_name_prefix: rb.0.1048.2ae8944a
  format: 1

Only one object.
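
The object count follows from the image size and the order: order 22 means objects of 2^22 bytes (4096 KB), so a 3898-byte image fits in one. The sketch below does that arithmetic and builds the data object name for a format 1 image, assuming the name is the block name prefix followed by a 12-digit hex object index, as the `rados ls` listing suggests. Both function names are hypothetical.

```shell
# Number of RADOS objects needed for an image of a given size and order.
rbd_object_count() {
    size=$1; order=$2
    obj_size=$(( 1 << order ))                    # order 22 -> 4194304 bytes
    echo $(( (size + obj_size - 1) / obj_size ))  # round up
}

# Name of the object holding a given byte offset (format 1 naming assumed).
rbd_object_name() {
    prefix=$1; offset=$2; order=$3
    printf '%s.%012x\n' "$prefix" $(( offset >> order ))
}

# Usage:
#   rbd_object_count 3898 22
#   rbd_object_name rb.0.1048.2ae8944a 0 22
```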

$ rados ls -p testpool
logo.png
rb.0.1048.2ae8944a.000000000000
rbd_directory
logo.png.rbd
$ ceph osd map testpool logo.png.rbd
osdmap e14 pool 'testpool' (3) object 'logo.png.rbd' -> pg 3.d592352c (3.4) -> up [0,2] acting [0,2]

Let’s go.

$ cd /var/lib/ceph/osd/ceph-0/current/3.4_head/
$ cat logo.png.rbd__head_D592352C__3
<<< Rados Block Device Image >>>
rb.0.1048.2ae8944aRBD001.005:

Here we can read the block name prefix of the RBD image, ‘rb.0.1048.2ae8944a’:

$ ceph osd map testpool rb.0.1048.2ae8944a.000000000000
osdmap e14 pool 'testpool' (3) object 'rb.0.1048.2ae8944a.000000000000' -> pg 3.d512078b (3.3) -> up [2,1] acting [2,1]
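
Going from a `ceph osd map` line to a directory on disk is always the same two steps: take the PG id in parentheses and the first OSD of the acting set. The sketch below automates that, assuming the output format shown in this post, a cluster named "ceph", and the default OSD data path. `object_dir` is a hypothetical helper name.

```shell
# Turn one line of `ceph osd map` output into the PG directory on the primary OSD.
# Assumptions: output format as shown in this post, cluster name "ceph",
# OSD data under /var/lib/ceph/osd.
object_dir() {
    line=$1
    # PG id is the "(pool.pg)" just before "-> up"
    pgid=$(printf '%s\n' "$line" | sed -E 's/.*\(([0-9]+\.[0-9a-f]+)\) -> up.*/\1/')
    # Primary is the first OSD in the acting set
    primary=$(printf '%s\n' "$line" | sed -E 's/.*acting \[([0-9]+).*/\1/')
    echo "/var/lib/ceph/osd/ceph-$primary/current/${pgid}_head"
}

# Usage: object_dir "$(ceph osd map testpool rb.0.1048.2ae8944a.000000000000)"
```

The path only exists on the host that carries the primary OSD, so `ceph osd tree` is still needed to find which machine to log in to.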

On ceph-03:

$ cd /var/lib/ceph/osd/ceph-2/current/3.3_head
$ md5sum rb.0.1048.2ae8944a.000000000000__head_D512078B__3
4c7c15e856737efc0d2d71abde3c6b28  rb.0.1048.2ae8944a.000000000000__head_D512078B__3

We retrieve the file unchanged because it is not split: at 3898 bytes it fits in a single object :)

Try RBD snapshot

$ rbd snap create testpool/logo.png@snap1
$ rbd snap ls testpool/logo.png
SNAPID NAME        SIZE 
     2 snap1 3898 bytes
$ echo "testpool/logo.png" >> /etc/ceph/rbdmap
$ service rbdmap reload
[ ok ] Starting RBD Mapping: testpool/logo.png.
[ ok ] Mounting all filesystems...done.

$ dd if=/dev/zero of=/dev/rbd/testpool/logo.png 
dd: writing to '/dev/rbd/testpool/logo.png': No space left on device
8+0 records in
7+0 records out
3584 bytes (3.6 kB) copied, 0.285823 s, 12.5 kB/s

$ ceph osd map testpool rb.0.1048.2ae8944a.000000000000
osdmap e15 pool 'testpool' (3) object 'rb.0.1048.2ae8944a.000000000000' -> pg 3.d512078b (3.3) -> up [2,1] acting [2,1]

It’s the same place on ceph-03:

$ cd /var/lib/ceph/osd/ceph-2/current/3.3_head
$ md5sum *
4c7c15e856737efc0d2d71abde3c6b28  rb.0.1048.2ae8944a.000000000000__2_D512078B__3
dd99129a16764a6727d3314b501e9c23  rb.0.1048.2ae8944a.000000000000__head_D512078B__3

Notice that the file whose name contains 2 (snap id 2) holds the original data, and a new file has been created for the current data: head.
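
The two file names differ only in the field between the object name and the hash. To tell snapshot copies from live data when scanning a PG directory, that field can be extracted with sed. This is a sketch assuming the naming seen in the listings above (an 8-digit upper-case hash followed by the pool id); `snap_of_file` is a hypothetical helper name.

```shell
# Extract the snapshot field ("head" or a snap id) from an on-disk object file name.
# Assumes names of the form <object>__<snap>_<8-hex-digit hash>__<pool>.
snap_of_file() {
    printf '%s\n' "$1" | sed -E 's/^.*__([^_]+)_[0-9A-F]{8}__[^_]+$/\1/'
}

# Usage: for f in *; do echo "$(snap_of_file "$f") $f"; done
```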

For the next tests, I will try striped files, RBD format 2, and pool snapshots.