Ceph reliability testing: single failure-domain failure test, single disk failure test, single node failure test, single rack failure test, and failure data reconstruction test

1. Single disk failure test

1, Compute the MD5 checksums of file1 and file2 (a verification sketch follows this list).
2, Pull a disk while file1 is being copied to the Ceph storage system. Expected result: the write is not interrupted; after the write completes, recompute the MD5 checksum and it matches the original.
3, Plug the disk back in while file2 is being copied to the Ceph storage system. Expected result: writes are not interrupted during the data reconstruction period; after the write completes, recompute the MD5 checksum and it matches the original.
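
A minimal sketch of the checksum verification flow, assuming the Ceph pool is exposed through a mount at /mnt/ceph (a hypothetical mount point) and that the disk pulled in step 2 backs one of the OSDs serving the pool:

# md5sum file1 file2 > baseline.md5     # record the original checksums
# cp file1 /mnt/ceph/                   # pull the disk while this copy runs
# md5sum /mnt/ceph/file1                # compare against baseline.md5
# cp file2 /mnt/ceph/                   # plug the disk back in while this copy runs
# md5sum /mnt/ceph/file2                # compare against baseline.md5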

2. Host failure domain (host)

# ceph osd tree
ID WEIGHT  TYPE NAME                        UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-6 0.17599 root root_rulecopyhost1                                            
-5 0.08800     host ceph4_rulecopyhost1                                   
 0 0.08800         osd.0                         up  1.00000          1.00000 
-7 0.08800     host ceph2_rulecopyhost1                                   
 1 0.08800         osd.1                         up  1.00000          1.00000 
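
The tree above uses a dedicated CRUSH root with host as the failure domain, so the two replicas land on different hosts. A sketch of how such a rule and a pool using it might be created, assuming Ceph Luminous or later (the pool name poolhost and the PG count of 128 are illustrative; on older releases, ceph osd crush rule create-simple serves the same purpose):

# ceph osd crush rule create-replicated rulecopyhost1 root_rulecopyhost1 host
# ceph osd pool create poolhost 128 128 replicated rulecopyhost1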

3. Rack failure domain (rack)

# ceph osd tree
ID  WEIGHT  TYPE NAME                       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
 -7 0.26399 root root_rulerack                                                
 -6 0.08800     rack rack02_rulerack                                          
 -5 0.08800         host ceph3_rulerack                                   
  2 0.08800             osd.2                    up  1.00000          1.00000 
 -9 0.08800     rack rack03_rulerack                                          
 -8 0.08800         host ceph4_rulerack                                   
  0 0.08800             osd.0                    up  1.00000          1.00000
-11 0.08800     rack rack01_rulerack                                          
-10 0.08800         host ceph2_rulerack                                   
  1 0.08800             osd.1                    up  1.00000          1.00000 
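
A sketch of how the rack-level rule and the poolrack pool used below might be created, again assuming Luminous or later (the PG count of 128 is illustrative):

# ceph osd crush rule create-replicated rulerack root_rulerack rack
# ceph osd pool create poolrack 128 128 replicated rulerack
# ceph osd pool set poolrack size 2     # the test below runs with 2 replicas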

3.1 Rack failure domain (rack) data reconstruction

Power off one rack; after 10 minutes, data reconstruction (recovery) begins.
Analysis of data migration in the rack failure domain with 2 replicas:
1, Before the failure, the object is stored on osd.1 and osd.0:
# ceph osd map poolrack rbd_data.c9b2a6b8b4567.0000000000000079
osdmap e308 pool 'poolrack' (16) object 'rbd_data.c9b2a6b8b4567.0000000000000079' -> pg 16.bf45dc7c (16.7c) -> up ([1,0], p1) acting ([1,0], p1)

2, During the failure, the object is stored only on osd.1:
# ceph osd map poolrack rbd_data.c9b2a6b8b4567.0000000000000079
osdmap e310 pool 'poolrack' (16) object 'rbd_data.c9b2a6b8b4567.0000000000000079' -> pg 16.bf45dc7c (16.7c) -> up ([1], p1) acting ([1], p1)

3, 10 minutes after the failure, data reconstruction (recovery) starts; once it completes, the object is stored on osd.1 and osd.2:
# ceph osd map poolrack rbd_data.c9b2a6b8b4567.0000000000000079
osdmap e314 pool 'poolrack' (16) object 'rbd_data.c9b2a6b8b4567.0000000000000079' -> pg 16.bf45dc7c (16.7c) -> up ([1,2], p1) acting ([1,2], p1)
4, Power the failed rack back on; its OSDs come back up and the data is backfilled:
# ceph osd map poolrack rbd_data.c9b2a6b8b4567.0000000000000079
osdmap e325 pool 'poolrack' (16) object 'rbd_data.c9b2a6b8b4567.0000000000000079' -> pg 16.bf45dc7c (16.7c) -> up ([1,0], p1) acting ([1,0], p1)
5, Conclusion for rack failure domain (rack) data reconstruction: data on down OSDs migrates to other OSDs. After the failed rack is powered back on, the data is backfilled to the original OSDs.
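
The 10-minute delay before recovery starts corresponds to Ceph's mon_osd_down_out_interval option, which defaults to 600 seconds: a down OSD is only marked out, triggering data migration, after this interval. A sketch of inspecting and tuning it at runtime (mon.a is a hypothetical monitor ID):

# ceph daemon mon.a config get mon_osd_down_out_interval
# ceph tell mon.* injectargs '--mon-osd-down-out-interval 600'

Recovery and backfill progress can be watched live with ceph -w or ceph -s.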

4. Room failure domain (room)