Jusene's Blog

heartbeat: the heartbeat of a service

2017/05/09

HA Cluster

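Our earlier look at keepalived gave us a first understanding of high-availability clusters. The goal of an HA cluster is to improve system availability by combining several hosts into one cluster. keepalived does this through VRRP, but here we will meet other high-availability solutions. For hosts to know that their peers are still alive, they must send each other heartbeat messages, and there really is a service named heartbeat.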

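HA Cluster concepts

To a server, everything is just a resource. If resources are scheduled correctly and the resources a service depends on are present, the service is available. The hard part of HA is how to schedule these resources and how to keep nodes from competing for them, because resource contention is usually disastrous. A few concepts help here.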

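  • Split-brain

Split-brain can have many causes, but in essence heartbeats fail to reach the peer host. The peer receives no heartbeats, concludes that this host is dead, and takes over its resources to restore service, while the local host believes it is perfectly healthy. This contention partitions the cluster. An outage is the best possible outcome; more often the result is dirty data.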

  • vote system

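Voting system: when the cluster is partitioned, the partitions vote and the majority decides which part of the cluster keeps working. For this reason HA clusters usually have an odd number of nodes:
with quorum: votes > total/2
without quorum: votes <= total/2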
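
The quorum rule above can be written down in a few lines. This is a minimal sketch, assuming one vote per node; the function name `has_quorum` is made up for illustration and is not part of heartbeat.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Return True when a partition holds a strict majority of all votes."""
    if total_votes <= 0 or not 0 <= partition_votes <= total_votes:
        raise ValueError("need 0 <= partition_votes <= total_votes and total_votes > 0")
    # Integer arithmetic avoids rounding: votes > total/2  <=>  2*votes > total
    return 2 * partition_votes > total_votes


# A 5-node cluster split 3/2: only the 3-node side keeps working.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
# A 4-node cluster split 2/2: neither side has quorum.
print(has_quorum(2, 4))  # False
```

The 2/2 case shows why an even node count is awkward: both halves lose quorum at the same time.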

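  • Quorum device

A cluster may still end up with an even number of nodes, in which case the vote can tie at exactly total/2. Both partitions would then consider themselves right and contend for resources again, so a quorum device (tie-breaker) is introduced. Common arbitration methods:
– quorum disk: nodes write to a shared disk to decide which side of the cluster stays available
– ping node: nodes ping the gateway to decide which side of the cluster stays available
Either way the two sides can no longer be balanced, so the voting system always reaches a decision.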
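
As a rough illustration of how a tie-breaker changes the outcome, the sketch below treats "can this side still reach the ping node" as the deciding factor when the vote is exactly even. The function and parameter names are hypothetical, and real cluster stacks implement arbitration in their own ways.

```python
def partition_survives(votes: int, total: int, tiebreaker_reachable: bool) -> bool:
    """Decide whether a partition may keep running resources.

    A strict majority always wins. An exact tie is settled by the
    tie-breaker, for example whether this side can still ping the gateway.
    """
    if 2 * votes > total:
        return True
    if 2 * votes == total:
        return tiebreaker_reachable
    return False


# Two-node cluster, heartbeat link cut: each side has 1 of 2 votes.
print(partition_survives(1, 2, tiebreaker_reachable=True))   # True
print(partition_survives(1, 2, tiebreaker_reachable=False))  # False
```

Only the side that can still reach the ping node keeps the resources, so the two halves never both claim them, provided that only one side can reach it.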

  • failover&failback

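When a node becomes unavailable, the voting system automatically decides where its resources should move; this is failover. When the original node recovers and the resources move back to it, this is failback.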
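
The difference between failover and failback can be sketched as a small decision function. This is an illustrative model only: `choose_owner` and its parameters are invented here, and the `auto_failback` flag merely borrows its name from the heartbeat setting that controls whether resources return to their preferred node.

```python
from typing import Optional, Set


def choose_owner(current: Optional[str], preferred: str, alive: Set[str],
                 auto_failback: bool) -> Optional[str]:
    """Return the node that should hold the resources after a membership change."""
    if not alive:
        return None  # nobody is left to run the service
    if current not in alive:
        # Failover: the owner is gone, so pick the preferred node if it is
        # alive, otherwise any surviving node (alphabetical for determinism).
        return preferred if preferred in alive else sorted(alive)[0]
    if auto_failback and preferred in alive:
        # Failback: the preferred node is healthy again, move back to it.
        return preferred
    return current


owner = choose_owner(None, "node1", {"node1", "node2"}, auto_failback=True)
print(owner)  # node1
owner = choose_owner(owner, "node1", {"node2"}, auto_failback=True)
print(owner)  # node2 (failover)
owner = choose_owner(owner, "node1", {"node1", "node2"}, auto_failback=True)
print(owner)  # node1 (failback)
```

With `auto_failback=False` the last call would leave the resources on node2, which avoids a second service interruption at the cost of not running on the preferred node.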

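Resource concepts

We keep talking about resources. A resource can be an IP address, a file system, a service program, and so on, and resources depend on one another: to start a service we first bring up the IP, then mount the file system, then configure and start the program. The author of a program does not care about these relationships or about which resources must be bundled together to provide a service. High-availability clusters therefore rely on a dedicated component, the CRM (cluster resource manager), to manage and allocate these resources.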

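  • Resource constraints:
    location: location constraint; defines a resource's preference for a node, expressed as a score from -∞ to +∞
    colocation: colocation constraint; defines how strongly resources prefer to run together, also scored from -∞ to +∞ (a group can likewise bind several resources together)
    order: order constraint; defines the order in which resources start on the same node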
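  • Resource types:
    primitive: a primary resource that runs on only one node of the cluster at a time (also called native)
    group: a group resource, a container holding one or more resources that are scheduled together as a unit
    clone: a cloned resource; multiple copies can run on several nodes of the same cluster
    master/slave: a master/slave resource; two copies run on two nodes of the same cluster, one as master and one as slave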

  • Resource isolation:
    Levels:

    • Node level: STONITH (shoot the other node in the head)
      e.g. a power switch
    • Resource level: fencing
      e.g. an FC SAN switch
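
To make the location scores described earlier more concrete, here is a minimal sketch of score-based placement. It assumes a plain dictionary of per-node scores and treats -∞ as "never run here"; the names `pick_node` and `NEVER` are invented for this example and do not correspond to any CRM's actual API.

```python
import math
from typing import Dict, List, Optional

NEVER = -math.inf  # a score of -infinity forbids the node entirely


def pick_node(scores: Dict[str, float], online: List[str]) -> Optional[str]:
    """Pick the online node with the highest location score.

    Nodes without an explicit score default to 0; nodes scored -inf are skipped.
    """
    candidates = [n for n in online if scores.get(n, 0) > NEVER]
    if not candidates:
        return None
    return max(candidates, key=lambda n: scores.get(n, 0))


scores = {"node1": 100, "node2": 50, "node3": NEVER}
print(pick_node(scores, ["node1", "node2", "node3"]))  # node1
print(pick_node(scores, ["node2", "node3"]))           # node2
print(pick_node(scores, ["node3"]))                    # None
```

The last call returns `None` because the only remaining node is forbidden, which is how an infinite negative score differs from a merely low one.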

Those are the basic concepts of resource management.

Solutions:

  • message layer
  1. heartbeat
    v1,v2,v3
  2. corosync
  3. cman(redhat,rhcs)
  4. keepalived (works entirely differently from the three above)
  • CRM
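  1. heartbeat v1 haresources (configuration interface: a configuration file named haresources)
  2. heartbeat v2 crm (a crmd process runs on every node; configuration interface: the command-line client crmsh; GUI client: hb_gui)
  3. heartbeat v3 pacemaker (pacemaker can run as a plugin or standalone; CLI: crmsh, pcs; GUI: hawk (web GUI), LCMC, pacemaker-mgmt)
  4. rgmanager (resource group manager; CLI: clustat, cman_tool; GUI: conga (luci + ricci))
    Common combinations: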
    heartbeat v1
    heartbeat v2
    heartbeat v3 + pacemaker
    corosync + pacemaker
    cman + rgmanager(rhcs)
    cman + pacemaker
  • LRM

Local resource manager, provided by the CRM through a child process

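  • RA (resource agent)
  1. heartbeat legacy: traditional heartbeat RAs, usually located under /etc/ha.d/resource.d/
  2. lsb: Linux Standard Base; scripts under /etc/rc.d/init.d/ that accept at least four arguments {start|stop|restart|status}
  3. OCF: Open Cluster Framework
    subcategory: provider
  4. STONITH: resources dedicated to driving STONITH devices, usually of the clone type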
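
The contract of an LSB-style agent, where one entry point dispatches on {start|stop|restart|status} and reports the result through its exit code, can be sketched as follows. `DemoService` is a hypothetical in-memory stand-in for a real daemon; the status codes follow the LSB convention of 0 for running and 3 for not running, and 2 is used here for an unrecognized action.

```python
LSB_OK = 0                  # action succeeded, or status: service is running
LSB_INVALID_ARGUMENT = 2    # unknown action requested
LSB_STATUS_NOT_RUNNING = 3  # status: service is not running


class DemoService:
    """Hypothetical in-memory stand-in for a real daemon."""

    def __init__(self) -> None:
        self.running = False


def handle(service: DemoService, action: str) -> int:
    """Dispatch one init-script style action and return its exit code."""
    if action == "start":
        service.running = True
        return LSB_OK
    if action == "stop":
        service.running = False
        return LSB_OK
    if action == "restart":
        handle(service, "stop")
        return handle(service, "start")
    if action == "status":
        return LSB_OK if service.running else LSB_STATUS_NOT_RUNNING
    return LSB_INVALID_ARGUMENT


svc = DemoService()
print(handle(svc, "status"))  # 3
print(handle(svc, "start"))   # 0
print(handle(svc, "status"))  # 0
```

A cluster manager depends on these codes being truthful: a `status` that always returns 0 would hide a dead service from it.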

heartbeat v1

Goal: configure httpd and make both the httpd service and its IP address highly available

  • node1 10.211.55.45 centos6
  • node2 10.211.55.46 centos6

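Prerequisites:

  1. Time must be synchronized across the nodes
  2. Nodes communicate with each other by hostname, so every hostname must resolve to its IP address
  3. Consider whether a quorum device is needed
  4. It is generally recommended that the root users on all nodes can reach each other with key-based (SSH) authentication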

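Under /etc/ha.d:

  • ha.cf: the main configuration file, defining the basic behavior of the heartbeat HA cluster on each node
  • authkeys: the authentication algorithm and key used when nodes in the cluster exchange messages
  • haresources: the resource manager configuration interface for heartbeat v1, specific to v1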
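
The purpose of authkeys, letting a node check that a message really comes from a peer holding the same key, can be illustrated with a keyed hash. This is a minimal sketch of the general idea using Python's standard `hmac` module with SHA-1; the key, the payload, and the function names are made up, and the code does not reproduce heartbeat's actual packet format.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA1 tag for the message."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def verify(message: bytes, tag: str, key: bytes) -> bool:
    """Compare tags in constant time; False means wrong key or altered message."""
    return hmac.compare_digest(sign(message, key), tag)


shared_key = b"example-shared-secret"  # hypothetical key, identical on every node
packet = b"node1 status=active"        # hypothetical payload for illustration
tag = sign(packet, shared_key)

print(verify(packet, tag, shared_key))                # True
print(verify(b"node1 status=dead", tag, shared_key))  # False
print(verify(packet, tag, b"wrong-key"))              # False
```

This is also why authkeys must be readable by root only: anyone who learns the key can forge valid messages.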
- node1

~]# yum groupinstall -y 'Development Tools'
~]# yum install -y net-snmp-libs libnet httpd
~]# rpm -ivh heartbeat-2.1.4-12.el6.x86_64.rpm heartbeat-pils-2.1.4-12.el6.x86_64.rpm heartbeat-stonith-2.1.4-12.el6.x86_64.rpm
~]# cd /etc/ha.d/
~]# cp /usr/share/doc/heartbeat-2.1.4/{ha.cf,authkeys,haresources} .
~]# chmod 600 authkeys
~]# cat authkeys
auth 2
1 crc
2 sha1 HI!@@@
3 md5 Hello!
~]# cat ha.cf
logfile /var/log/ha-log
mcast eth0 225.0.12.1 694 1 0
ping 10.211.55.1
compression     bz2
compression_threshold 2
(other settings keep their defaults; ha.cf also needs a node <hostname> line per cluster member, which this excerpt does not show)
~]# cat haresources
node1 10.211.55.24/24/eth0/10.211.55.255 httpd
~]# service heartbeat start
~]# scp ha.cf authkeys haresources node2:/etc/ha.d/

- node2

~]# yum groupinstall -y 'Development Tools'
~]# yum install -y net-snmp-libs libnet PyXML httpd
~]# rpm -ivh heartbeat-2.1.4-12.el6.x86_64.rpm heartbeat-pils-2.1.4-12.el6.x86_64.rpm heartbeat-stonith-2.1.4-12.el6.x86_64.rpm
~]# cd /etc/ha.d/
~]# service heartbeat start

- node1

~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1C:42:31:37:0C
          inet addr:10.211.55.45  Bcast:10.211.55.255  Mask:255.255.255.0
          inet6 addr: fdb2:2c26:f4e4:0:21c:42ff:fe31:370c/64 Scope:Global
          inet6 addr: fe80::21c:42ff:fe31:370c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:408158 errors:0 dropped:0 overruns:0 frame:0
          TX packets:142600 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:177491294 (169.2 MiB)  TX bytes:11603752 (11.0 MiB)

eth0:0    Link encap:Ethernet  HWaddr 00:1C:42:31:37:0C
          inet addr:10.211.55.24  Bcast:10.211.55.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      5956/sshd
tcp        0      0 127.0.0.1:25                0.0.0.0:*                   LISTEN      1414/master
tcp        0      0 :::80                       :::*                        LISTEN      8657/httpd
tcp        0      0 :::22                       :::*                        LISTEN      5956/sshd
tcp        0      0 ::1:25                      :::*                        LISTEN      1414/master
~]# tail -f /var/log/ha-log
heartbeat[8935]: 2017/05/09_12:33:05 info: Version 2 support: false
heartbeat[8935]: 2017/05/09_12:33:05 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[8935]: 2017/05/09_12:33:05 info: **************************
heartbeat[8935]: 2017/05/09_12:33:05 info: Configuration validated. Starting heartbeat 2.1.4
heartbeat[8936]: 2017/05/09_12:33:05 info: heartbeat: version 2.1.4
heartbeat[8936]: 2017/05/09_12:33:05 info: Heartbeat generation: 1494342121
heartbeat[8936]: 2017/05/09_12:33:05 info: glib: UDP multicast heartbeat started for group 225.0.12.1 port 694 interface eth0 (ttl=1 loop=0)
heartbeat[8936]: 2017/05/09_12:33:05 info: glib: ping heartbeat started.
heartbeat[8936]: 2017/05/09_12:33:05 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[8936]: 2017/05/09_12:33:05 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[8936]: 2017/05/09_12:33:05 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[8936]: 2017/05/09_12:33:05 info: Local status now set to: 'up'
heartbeat[8936]: 2017/05/09_12:33:05 info: Link 10.211.55.1:10.211.55.1 up.
heartbeat[8936]: 2017/05/09_12:33:05 info: Status update for node 10.211.55.1: status ping
heartbeat[8936]: 2017/05/09_12:33:06 info: Link node2:eth0 up.
heartbeat[8936]: 2017/05/09_12:33:06 info: Status update for node node2: status active
harc[8945]:     2017/05/09_12:33:06 info: Running /etc/ha.d/rc.d/status status
heartbeat[8936]: 2017/05/09_12:33:06 info: Comm_now_up(): updating status to active
heartbeat[8936]: 2017/05/09_12:33:06 info: Local status now set to: 'active'
heartbeat[8936]: 2017/05/09_12:33:07 info: remote resource transition completed.
heartbeat[8936]: 2017/05/09_12:33:07 info: remote resource transition completed.
heartbeat[8936]: 2017/05/09_12:33:07 info: Local Resource acquisition completed. (none)
heartbeat[8936]: 2017/05/09_12:33:07 info: node2 wants to go standby [foreign]
heartbeat[8936]: 2017/05/09_12:33:08 info: standby: acquire [foreign] resources from node2
heartbeat[8963]: 2017/05/09_12:33:08 info: acquire local HA resources (standby).
ResourceManager[8976]:  2017/05/09_12:33:08 info: Acquiring resource group: node1 10.211.55.24/24/eth0/10.211.55.255 httpd
IPaddr[9002]:   2017/05/09_12:33:08 INFO:  Resource is stopped
ResourceManager[8976]:  2017/05/09_12:33:08 info: Running /etc/ha.d/resource.d/IPaddr 10.211.55.24/24/eth0/10.211.55.255 start
IPaddr[9099]:   2017/05/09_12:33:08 INFO: Using calculated netmask for 10.211.55.24: 255.255.255.0
IPaddr[9099]:   2017/05/09_12:33:08 INFO: eval ifconfig eth0:0 10.211.55.24 netmask 255.255.255.0 broadcast 10.211.55.255
IPaddr[9070]:   2017/05/09_12:33:08 INFO:  Success
ResourceManager[8976]:  2017/05/09_12:33:08 info: Running /etc/init.d/httpd  start
heartbeat[8963]: 2017/05/09_12:33:08 info: local HA resource acquisition completed (standby).
heartbeat[8936]: 2017/05/09_12:33:08 info: Standby resource acquisition done [foreign].
heartbeat[8936]: 2017/05/09_12:33:08 info: Initial resource acquisition complete (auto_failback)
heartbeat[8936]: 2017/05/09_12:33:08 info: remote resource transition completed.

The log shows that both the IP address and the httpd service have been configured and started.

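
The resource group line that ResourceManager reports acquiring ("node1 10.211.55.24/24/eth0/10.211.55.255 httpd") packs several fields into one string. The sketch below splits it apart and derives the netmask from the prefix length, which matches the 255.255.255.0 that IPaddr reports calculating. It handles only this exact form of line, the `ResourceGroup` type and function name are invented for illustration, and the reversed stop order is an assumption about how a group is released rather than something shown in the log.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceGroup:
    preferred_node: str
    ip: str
    netmask: str
    interface: str
    broadcast: str
    services: List[str]


def parse_haresources_line(line: str) -> ResourceGroup:
    """Parse 'node IP/prefix/interface/broadcast service...' (this exact form only)."""
    node, ip_spec, *services = line.split()
    ip, prefix, interface, broadcast = ip_spec.split("/")
    # strict=False lets us pass a host address rather than the network address
    netmask = str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False).netmask)
    return ResourceGroup(node, ip, netmask, interface, broadcast, services)


group = parse_haresources_line("node1 10.211.55.24/24/eth0/10.211.55.255 httpd")
start_order = [f"IPaddr {group.ip}"] + group.services  # IP first, as in the log
stop_order = list(reversed(start_order))               # assumed: release in reverse

print(group.preferred_node, group.netmask, group.interface)
print(start_order)
print(stop_order)
```

Bringing the address up before httpd means the service never starts without the IP that clients use to reach it.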
- node2

~]# service heartbeat start
~]# tail -f /var/log/ha-log
heartbeat[8561]: 2017/05/09_12:35:41 info: Version 2 support: false
heartbeat[8561]: 2017/05/09_12:35:41 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[8561]: 2017/05/09_12:35:41 info: **************************
heartbeat[8561]: 2017/05/09_12:35:41 info: Configuration validated. Starting heartbeat 2.1.4
heartbeat[8562]: 2017/05/09_12:35:41 info: heartbeat: version 2.1.4
heartbeat[8562]: 2017/05/09_12:35:42 info: Heartbeat generation: 1494342262
heartbeat[8562]: 2017/05/09_12:35:42 info: glib: UDP multicast heartbeat started for group 225.0.12.1 port 694 interface eth0 (ttl=1 loop=0)
heartbeat[8562]: 2017/05/09_12:35:42 info: glib: ping heartbeat started.
heartbeat[8562]: 2017/05/09_12:35:42 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[8562]: 2017/05/09_12:35:42 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[8562]: 2017/05/09_12:35:42 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[8562]: 2017/05/09_12:35:42 info: Local status now set to: 'up'
heartbeat[8562]: 2017/05/09_12:35:42 info: Link 10.211.55.1:10.211.55.1 up.
heartbeat[8562]: 2017/05/09_12:35:42 info: Status update for node 10.211.55.1: status ping
heartbeat[8562]: 2017/05/09_12:35:42 info: Link node1:eth0 up.
heartbeat[8562]: 2017/05/09_12:35:42 info: Status update for node node1: status active
harc[8573]:     2017/05/09_12:35:42 info: Running /etc/ha.d/rc.d/status status
heartbeat[8562]: 2017/05/09_12:35:43 info: Comm_now_up(): updating status to active
heartbeat[8562]: 2017/05/09_12:35:43 info: Local status now set to: 'active'
heartbeat[8562]: 2017/05/09_12:35:43 info: remote resource transition completed.
heartbeat[8562]: 2017/05/09_12:35:43 info: remote resource transition completed.
heartbeat[8562]: 2017/05/09_12:35:43 info: Local Resource acquisition completed. (none)
heartbeat[8562]: 2017/05/09_12:35:44 info: node1 wants to go standby [foreign]
heartbeat[8562]: 2017/05/09_12:35:44 info: standby: acquire [foreign] resources from node1
heartbeat[8591]: 2017/05/09_12:35:44 info: acquire local HA resources (standby).
heartbeat[8591]: 2017/05/09_12:35:44 info: local HA resource acquisition completed (standby).
heartbeat[8562]: 2017/05/09_12:35:44 info: Standby resource acquisition done [foreign].
heartbeat[8562]: 2017/05/09_12:35:44 info: Initial resource acquisition complete (auto_failback)
heartbeat[8562]: 2017/05/09_12:35:45 info: remote resource transition completed.

node2 is now running, but neither httpd nor the IP address has been brought up on it. Next we stop heartbeat on node1.

- node1

~]# service heartbeat stop

- node2

~]# tail -f /var/log/ha-log
heartbeat[8562]: 2017/05/09_12:39:03 info: Received shutdown notice from 'node1'.
heartbeat[8562]: 2017/05/09_12:39:03 info: Resources being acquired from node1.
heartbeat[8604]: 2017/05/09_12:39:03 info: acquire local HA resources (standby).
heartbeat[8604]: 2017/05/09_12:39:03 info: local HA resource acquisition completed (standby).
heartbeat[8562]: 2017/05/09_12:39:03 info: Standby resource acquisition done [foreign].
heartbeat[8605]: 2017/05/09_12:39:03 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys node2] to acquire.
harc[8630]:     2017/05/09_12:39:03 info: Running /etc/ha.d/rc.d/status status
mach_down[8645]:        2017/05/09_12:39:03 info: Taking over resource group 10.211.55.24/24/eth0/10.211.55.255
ResourceManager[8670]:  2017/05/09_12:39:03 info: Acquiring resource group: node1 10.211.55.24/24/eth0/10.211.55.255 httpd
IPaddr[8696]:   2017/05/09_12:39:03 INFO:  Resource is stopped
ResourceManager[8670]:  2017/05/09_12:39:03 info: Running /etc/ha.d/resource.d/IPaddr 10.211.55.24/24/eth0/10.211.55.255 start
IPaddr[8793]:   2017/05/09_12:39:03 INFO: Using calculated netmask for 10.211.55.24: 255.255.255.0
IPaddr[8793]:   2017/05/09_12:39:03 INFO: eval ifconfig eth0:0 10.211.55.24 netmask 255.255.255.0 broadcast 10.211.55.255
IPaddr[8764]:   2017/05/09_12:39:03 INFO:  Success
ResourceManager[8670]:  2017/05/09_12:39:03 info: Running /etc/init.d/httpd  start
mach_down[8645]:        2017/05/09_12:39:03 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[8645]:        2017/05/09_12:39:03 info: mach_down takeover complete for node node1.
heartbeat[8562]: 2017/05/09_12:39:03 info: mach_down takeover complete.
heartbeat[8562]: 2017/05/09_12:39:34 WARN: node node1: is dead
heartbeat[8562]: 2017/05/09_12:39:34 info: Dead node node1 gave up resources.
heartbeat[8562]: 2017/05/09_12:39:34 info: Link node1:eth0 dead.
~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      5961/sshd
tcp        0      0 127.0.0.1:25                0.0.0.0:*                   LISTEN      1414/master
tcp        0      0 :::80                       :::*                        LISTEN      8915/httpd
tcp        0      0 :::22                       :::*                        LISTEN      5961/sshd
tcp        0      0 ::1:25                      :::*                        LISTEN      1414/master
~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1C:42:E2:B9:94
          inet addr:10.211.55.46  Bcast:10.211.55.255  Mask:255.255.255.0
          inet6 addr: fdb2:2c26:f4e4:0:21c:42ff:fee2:b994/64 Scope:Global
          inet6 addr: fe80::21c:42ff:fee2:b994/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:427116 errors:0 dropped:0 overruns:0 frame:0
          TX packets:168588 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:178774654 (170.4 MiB)  TX bytes:15383371 (14.6 MiB)

eth0:0    Link encap:Ethernet  HWaddr 00:1C:42:E2:B9:94
          inet addr:10.211.55.24  Bcast:10.211.55.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

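With that, the IP address and the service have moved from one node to the other, which is the basic use of a high-availability cluster.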
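
When reading a long ha-log it can help to pull out just the resource scripts that were run during a takeover. The following is a small sketch that filters lines in the format shown in this article; it assumes that format stays as printed here, and the function name `resource_actions` is invented for illustration.

```python
from typing import List, Tuple


def resource_actions(log_lines: List[str]) -> List[Tuple[str, str]]:
    """Return (script, action) for each 'Running ...' line logged by ResourceManager."""
    marker = "info: Running "
    actions = []
    for line in log_lines:
        if line.startswith("ResourceManager[") and marker in line:
            parts = line.split(marker, 1)[1].split()
            if len(parts) >= 2:
                # first token is the script path, last token is the action
                actions.append((parts[0], parts[-1]))
    return actions


sample = [
    "harc[8630]:     2017/05/09_12:39:03 info: Running /etc/ha.d/rc.d/status status",
    "ResourceManager[8670]:  2017/05/09_12:39:03 info: Running /etc/ha.d/resource.d/IPaddr 10.211.55.24/24/eth0/10.211.55.255 start",
    "ResourceManager[8670]:  2017/05/09_12:39:03 info: Running /etc/init.d/httpd  start",
]
for script, action in resource_actions(sample):
    print(script, action)
```

On the node2 takeover log this lists the IPaddr script followed by the httpd init script, both with the start action, which is the order in which the resources moved over.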
