corosync
Corosync is an implementation based on OpenAIS (the Application Interface Specification). In a high-availability cluster, corosync serves as the messaging layer. Compared with the heavyweight heartbeat, corosync is lightweight, and it has largely displaced heartbeat as the messaging layer of choice. Corosync 1.0 had a serious flaw, though: it lacked a voting system, so CMAN was needed to supply quorum voting; corosync 2.0 closed that gap completely. Unlike heartbeat, which ships its own CRM, corosync's CRM is pacemaker, the resource manager that was split out of heartbeat as an independent project around heartbeat 3.0. The examples below walk through configuring and using corosync + pacemaker.
Installing crmsh, corosync and pacemaker
Preparation
- Synchronize time on both machines
- Each machine's hostname must match the name reported by uname -n
- Configure name resolution in /etc/hosts
- Set up passwordless SSH mutual trust between the two hosts
- Any service the cluster will manage must be disabled at boot (a sketch of all these steps follows this list)
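A minimal sketch of these preparation steps, run on node1 (names and addresses follow the lab environment below; the mirrored steps on node2 are analogous, and ntpdate assumes internet access):

~]# ntpdate pool.ntp.org; ssh node2 'ntpdate pool.ntp.org'   # one-shot time sync
~]# hostname node1                                           # must match uname -n; persist it in /etc/sysconfig/network
~]# cat >> /etc/hosts << EOF
10.211.55.48 node1
10.211.55.49 node2
EOF
~]# ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''                 # key pair for mutual trust
~]# ssh-copy-id root@node2                                   # repeat in the other direction from node2
~]# chkconfig mysqld off                                     # cluster-managed services must not start at boot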
Lab environment
CentOS 6
crmsh 3.0.0
corosync 1.4.7
pacemaker 1.1.15
Nodes
node1 10.211.55.48
node2 10.211.55.49
Resources
MySQL 5.1
VIP 10.211.55.24
- Install corosync and pacemaker
Corosync and pacemaker are the mainstream building blocks for high-availability clusters, and both are carried directly in the yum repositories.
~]# yum install -y corosync pacemaker; ssh node2 'yum install -y corosync pacemaker'
- Install crmsh
crmsh is a front end to pacemaker that lets you define and manage pacemaker resources. Beginning with pacemaker 1.1.8, crmsh became an independent project and is no longer bundled with pacemaker; it is now maintained by SUSE. Red Hat ships a comparable tool, pcs, but crmsh is generally the more capable manager, so most cluster administration here is done with crmsh.
For RedHat RHEL-6, run the following commands as root:

cd /etc/yum.repos.d/
wget http://download.opensuse.org/repositories/network:ha-clustering:Stable/RedHat_RHEL-6/network:ha-clustering:Stable.repo
yum install crmsh

For CentOS CentOS-7, run the following commands as root:

cd /etc/yum.repos.d/
wget http://download.opensuse.org/repositories/network:ha-clustering:Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
yum install crmsh

For CentOS CentOS-6, run the following commands as root:

cd /etc/yum.repos.d/
wget http://download.opensuse.org/repositories/network:ha-clustering:Stable/CentOS_CentOS-6/network:ha-clustering:Stable.repo
yum install crmsh
These are the crmsh repositories published on the openSUSE build service, one per supported distribution.
Configuring corosync and pacemaker
Here we run pacemaker in corosync's plugin mode. Only corosync 1.x supports this; corosync 2.x no longer allows pacemaker to run as a plugin.
~]# cd /etc/corosync
~]# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
~]# cat /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank # compatible with releases before 0.8
totem {
        version: 2 # totem protocol version; do not change
        secauth: on # authentication/encryption; very CPU-intensive when aisexec is used
        threads: 0 # number of parallel threads used for authentication
        interface {
                ringnumber: 0 # ring number; with multiple NICs this keeps heartbeat traffic from looping back
                bindnetaddr: 10.211.55.0 # network address of the heartbeat network; corosync works out which local IP belongs to this network and uses that interface for multicast heartbeats
                mcastaddr: 239.245.14.1 # multicast address for heartbeat messages (must be identical on every node)
                mcastport: 5405 # multicast port
                ttl: 1 # send heartbeats only one hop, avoiding multicast loops
        }
}
# totem defines how the cluster nodes talk to each other; totem is itself a protocol, dedicated to inter-node communication in corosync, and it is versioned
logging {
        fileline: off # log source file and line numbers
        to_stderr: no # send log messages to standard error (recommended: no)
        to_logfile: yes # write to a log file
        to_syslog: no # write to syslog (such entries land in /var/log/messages)
        logfile: /var/log/cluster/corosync.log # log file location
        debug: off # leave debug off unless troubleshooting; it is extremely verbose and generates heavy disk I/O
        timestamp: on # timestamp entries; useful for pinpointing errors, but each one costs a system call and some CPU
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
service {
        ver: 0 # version
        name: pacemaker # module name; start pacemaker along with corosync
}
# when corosync starts it will automatically start pacemaker (as a plugin in this mode)
aisexec {
        user: root
        group: root
}
# identity used to run the AIS services; the default is root, and the aisexec section can be omitted

~]# corosync-keygen # secauth is enabled, so a key must be generated
~]# scp -p {authkey,corosync.conf} node2:/etc/corosync/
~]# service corosync start; ssh node2 "service corosync start"
Note:
corosync-keygen reads from /dev/random when generating the key.
/dev/random is the Linux random number generator. It draws on an entropy pool fed by system interrupts; encryption and key-generation programs consume randomness in bulk, so the pool can run dry. The defining behaviour of /dev/random is that once the pool is empty, reads block the calling process until fresh interrupts replenish it.
Because a 1024-bit key is generated here, the entropy pool may well be exhausted and corosync-keygen will appear to hang. Three ways around it:
1. Type lots of characters on the keyboard to generate interrupts (slow; not recommended)
2. Download a large file over the internet or from an FTP server (generates interrupts quickly; recommended)
3. Use dd to generate continuous disk I/O (handy when there is nothing to download); a sketch follows
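A minimal sketch of the dd approach: leave corosync-keygen blocked in one terminal and generate disk I/O in another until the key file appears (the path /tmp/entropy.tmp is just an example):

# terminal 1: blocks until enough entropy has accumulated
~]# corosync-keygen

# terminal 2: disk I/O produces the interrupts that refill the entropy pool
~]# while [ ! -f /etc/corosync/authkey ]; do
>     dd if=/dev/zero of=/tmp/entropy.tmp bs=1M count=512 conv=fsync
> done; rm -f /tmp/entropy.tmp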
Check the corosync startup log:
~]# grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
May 16 11:17:52 corosync [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
May 16 11:17:52 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.

~]# grep TOTEM /var/log/cluster/corosync.log
May 16 11:17:52 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
May 16 11:17:52 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 16 11:17:52 corosync [TOTEM ] The network interface [10.211.55.48] is now up.
May 16 11:17:53 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.

~]# grep ERROR: /var/log/cluster/corosync.log | grep -v unpack_resources
May 16 11:17:52 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
May 16 11:17:52 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
May 16 11:17:54 corosync [pcmk ] ERROR: pcmk_wait_dispatch: Child process mgmtd exited (pid=15431, rc=100)

These are the warnings about running pacemaker as a plugin; they can be ignored here.

~]# grep pcmk_startup /var/log/cluster/corosync.log
May 16 11:17:53 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
May 16 11:17:53 corosync [pcmk ] Logging: Initialized pcmk_startup
May 16 11:17:53 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
May 16 11:17:53 corosync [pcmk ] info: pcmk_startup: Service: 9
May 16 11:17:53 corosync [pcmk ] info: pcmk_startup: Local hostname: node1

This confirms that pacemaker started.
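Beyond the logs, corosync's own utilities can confirm ring health and membership; these are standard corosync 1.x tools (output omitted; addresses follow the lab setup above):

~]# corosync-cfgtool -s                                # local ring status; should report no faults
~]# corosync-objctl runtime.totem.pg.mrp.srp.members   # list the current cluster members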
Configuring crmsh
Check the cluster state with crm_mon:
~]# crm_mon
Stack: classic openais (with plugin)
Current DC: node1 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Tue May 16 12:38:28 2017 Last change: Tue May 16 11:41:36 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 0 resources configured

Online: [ node1 node2 ]

No active resources
~]# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid

As this shows, corosync enables STONITH by default, but the cluster has no STONITH device, so the configuration is invalid; we have to disable STONITH (done in the MySQL section below).
crmsh has two working modes
1. Command-line mode
~]# crm status
Stack: classic openais (with plugin)
Current DC: node1 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 03:04:39 2017 Last change: Tue May 16 11:41:36 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 0 resources configured

Online: [ node1 node2 ]

No resources

2. Interactive mode
~]# crm
crm(live)# status
Stack: classic openais (with plugin)
Current DC: node1 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 03:05:19 2017 Last change: Tue May 16 11:41:36 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 0 resources configured

Online: [ node1 node2 ]

No resources
A quick tour of the subcommands:
~]# crm
crm(live)# configure
crm(live)configure# help
node             define a cluster node
primitive        define a resource
monitor          add monitor operation to a primitive # e.g. timeout, or the action to take when a start fails
group            define a group # bundle several resources together
clone            define a clone # set the total number of clones and how many may run per node
ms               define a master-slave resource # only one node runs the master instance; the others stand by as slaves
rsc_template     define a resource template
location         a location preference # which node a resource prefers by default (with equal scores, the higher preference wins)
colocation       colocate resources # how strongly resources should run together
order            order resources # the order in which resources start
rsc_ticket       resources ticket dependency
property         set a cluster property
rsc_defaults     set resource defaults # e.g. default stickiness
fencing_topology node fencing order
role             define role access rights
user             define user access rights
op_defaults      set resource operations defaults
schema           set or display current CIB RNG schema
show             display CIB objects
edit             edit CIB objects # opens a vim-style editor
filter           filter CIB objects
delete           delete CIB objects
default-timeouts set timeouts for operations to minimums from the meta-data
rename           rename a CIB object
modgroup         modify group
refresh          refresh from CIB # re-read the CIB
erase            erase the CIB
ptest            show cluster actions if changes were committed
rsctest          test resources as currently configured
cib              CIB shadow management
cibstatus        CIB status management and editing
template         edit and import a configuration from a template
commit           commit the changes to the CIB
verify           verify the CIB with crm_verify # syntax check
upgrade          upgrade the CIB to version 1.0
save             save the CIB to a file # the file is written relative to the directory you were in before entering crm
load             import the CIB from a file
graph            generate a directed graph
xml              raw xml
help             show help (help topics for list of topics)
end              go back one level # back to crm(live)#
crm(live)configure# cd ..
crm(live)# resource
crm(live)resource# help
status           show status of resources
start            start a resource
stop             stop a resource
restart          restart a resource
promote          promote a master-slave resource
demote           demote a master-slave resource
manage           put a resource into managed mode
unmanage         put a resource into unmanaged mode
migrate          migrate a resource to another node
unmigrate        unmigrate a resource to another node
param            manage a parameter of a resource
secret           manage sensitive parameters
meta             manage a meta attribute
utilization      manage a utilization attribute
failcount        manage failcounts
cleanup          cleanup resource status
refresh          refresh CIB from the LRM status # refresh the CIB (cluster information base) from the LRM (local resource manager)
reprobe          probe for resources not started by the CRM
trace            start RA tracing # enable resource-agent (RA) tracing
untrace          stop RA tracing # disable resource-agent (RA) tracing
help             show help (help topics for list of topics)
end              go back one level
quit             exit the program
crm(live)resource# cd ..
crm(live)# node
crm(live)node# help
status           show nodes status as XML
show             show node
standby          put node into standby # simulate the node going offline (the node name given here must be the FQDN)
online           set node online # bring the node back online
maintenance      put node into maintenance mode
ready            put node into ready mode
fence            fence node
clearstate       Clear node state
delete           delete node
attribute        manage attributes
utilization      manage utilization attributes
status-attr      manage status attributes
help             show help (help topics for list of topics)
end              go back one level
quit             exit the program
crm(live)node# cd ..
crm(live)# ra
crm(live)ra# help
classes          list classes and providers
list             list RA for a class (and provider)
meta             show meta data for a RA # show a resource agent's available parameters (e.g. meta ocf:heartbeat:IPaddr2)
providers        show providers for a RA and a class
help             show help (help topics for list of topics)
end              go back one level
quit             exit the program
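Before defining resources, it is worth inspecting the agents you plan to use with the ra subcommands; a quick example (ordinary crmsh commands; the exact listing depends on the installed resource-agents package):

~]# crm ra classes                        # available agent classes (lsb, ocf, service, stonith, ...)
~]# crm ra list ocf heartbeat             # agents in the ocf:heartbeat class
~]# crm ra meta ocf:heartbeat:IPaddr      # parameters accepted by the IPaddr agent used below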
Configuring a highly available MySQL
~]# crm
crm(live)# configure
crm(live)configure# property stonith-enabled=false
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# show
node node1
node node2
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.15-5.el6-e174ec8 \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false
crm(live)configure# primitive mysqlip ocf:heartbeat:IPaddr params ip=10.211.55.24 iflabel=eth0 nic=eth0 op monitor interval=10s timeout=20s
crm(live)configure# primitive mysqlservice lsb:mysqld op monitor interval=10s timeout=20s
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Stack: classic openais (with plugin)
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 03:33:15 2017 Last change: Thu May 18 03:32:37 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 2 resources configured

Online: [ node1 node2 ]

Full list of resources:

mysqlip (ocf::heartbeat:IPaddr): Started node1
mysqlservice (lsb:mysqld): Started node2

crm(live)# configure
crm(live)configure# group mysql mysqlip mysqlservice
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Stack: classic openais (with plugin)
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 03:39:37 2017 Last change: Thu May 18 03:39:34 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 2 resources configured

Online: [ node1 node2 ]

Full list of resources:

Resource Group: mysql
     mysqlip (ocf::heartbeat:IPaddr): Started node2
     mysqlservice (lsb:mysqld): Started node2
A problem shows up here: in this two-node cluster, when one node goes down the resources stop rather than fail over to the surviving node. With only two nodes, the loss of one leaves the remaining node unable to win a quorum vote, so the partition ends up "without quorum". There are two ways to handle this:
- add an arbitration (quorum) node
- ignore the quorum requirement when the partition lacks the required votes
Note: ignoring quorum can split the cluster (split-brain); it is not recommended in production.
~]# crm
crm(live)# configure
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# commit

Note: no-quorum-policy={stop|freeze|suicide|ignore}; the default is stop
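With the policy set, failover is easy to exercise by putting the active node into standby and watching the resources move; a quick check (plain crmsh commands against the resources defined above):

~]# crm node standby node1     # simulate node1 going offline
~]# crm status                 # the resources should now be running on node2
~]# crm node online node1      # bring node1 back into the cluster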
Configuring resource constraints
First, delete the resource group:
~]# crm
crm(live)# resource
crm(live)resource# stop mysql
crm(live)resource# cd ..
crm(live)# status
Stack: classic openais (with plugin)
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 04:02:27 2017 Last change: Thu May 18 04:01:49 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 2 resources configured: 4 resources DISABLED and 0 BLOCKED from being started due to failures

Online: [ node1 node2 ]

Full list of resources:

Resource Group: mysql
     mysqlip (ocf::heartbeat:IPaddr): Stopped (disabled)
     mysqlservice (lsb:mysqld): Stopped (disabled)

crm(live)# configure
crm(live)configure# delete mysql
crm(live)configure# cd ..
There are changes pending. Do you want to commit them (y/n)? y
crm(live)# status
Stack: classic openais (with plugin)
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 04:05:46 2017 Last change: Thu May 18 04:05:38 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 2 resources configured

Online: [ node1 node2 ]

Full list of resources:

mysqlip (ocf::heartbeat:IPaddr): Started node1
mysqlservice (lsb:mysqld): Started node2
Colocation constraint
crm(live)# configure
crm(live)configure# colocation mysqlip_with_mysqlservice inf: mysqlip mysqlservice
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Stack: classic openais (with plugin)
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 04:18:48 2017 Last change: Thu May 18 04:14:22 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 2 resources configured

Online: [ node1 node2 ]

Full list of resources:

mysqlip (ocf::heartbeat:IPaddr): Started node2
mysqlservice (lsb:mysqld): Started node2
Order constraint
crm(live)# configure
crm(live)configure# order mysqlip_after_myserver
Mandatory   Optional    Serialize
crm(live)configure# order mysqlip_after_myserver Mandatory: mysqlip mysqlservice
crm(live)configure# commit
Mandatory makes the ordering compulsory: mysqlip and mysqlservice must start in exactly the order given (mysqlip first, then mysqlservice).
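To double-check the constraints defined so far, the committed configuration can be filtered with ordinary shell tools (a quick sketch; at this point the output should echo the colocation and order constraints entered above):

~]# crm configure show | grep -E '^(location|colocation|order)'
colocation mysqlip_with_mysqlservice inf: mysqlip mysqlservice
order mysqlip_after_myserver Mandatory: mysqlip mysqlservice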
- Define a location constraint
crm(live)configure# location mysqip_prefer_node1 mysqlip rule 100: #uname eq node1
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Stack: classic openais (with plugin)
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 04:32:40 2017 Last change: Thu May 18 04:32:36 2017 by root via cibadmin on node2
, 2 expected votes
2 nodes and 2 resources configured

Online: [ node1 node2 ]

Full list of resources:

mysqlip (ocf::heartbeat:IPaddr): Started node2
mysqlservice (lsb:mysqld): Started node2

crm(live)# status
Stack: classic openais (with plugin)
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu May 18 04:32:43 2017 Last change: Thu May 18 04:32:36 2017 by root via cibadmin on node1
, 2 expected votes
2 nodes and 2 resources configured

Online: [ node1 node2 ]

Full list of resources:

mysqlip (ocf::heartbeat:IPaddr): Started node1
mysqlservice (lsb:mysqld): Started node1

Because mysqlip now prefers node1 with a score of 100, and mysqlservice is colocated with mysqlip, both resources migrate to node1; the first status output was taken before the move completed, the second after.