一次集群内部网络心跳消失导致的集群驱逐日志分析

作者: 归源 | 来源:发表于2022-03-11 10:06 被阅读0次

企业监控系统部署、搭建Harbor私有仓库
高可用集群之仲裁盘
分布式系统中的网络分区问题
k8s之nginx-ingress
六、Kubernetes系列之Kubernetes使用ingre
基于腾讯云Elasticsearch Service的ELK实践
网络与通信原理
【Linux-zookeeper】Unable to read
Elastic Stack学习笔记
k8s中负载均衡器【ingress-nginx】部署

之前的文章中提到过，Oracle集群从11gR2版本之后开始使用资源的形式管理集群中的众多组件，而在众多资源中，网络资源对集群的健康状态有着至关重要的作用，经常会因为集群中的网络心跳丢失导致集群异常重启或崩溃。

前几天，日常使用中的Oracle集群中的一个节点发生了驱逐事件，通过排查Oracle集群相关的日志，最后发现了本次集群驱逐事件的产生原因--一次因为节点间网络心跳丢失引起的集群驱逐。

集群节点1(正常节点)上crs相关的(alert.log)告警日志

2022-03-01 16:16:21.546 [CRSD(25659)]CRS-2878: Failed to restart resource 'ora.asmnet2.asmnetwork'
2022-03-01 16:16:21.547 [CRSD(25659)]CRS-2769: Unable to failover resource 'ora.asmnet2.asmnetwork'.

在下午16:16:21的时候，crs集群的告警日志中开始出现了连接服务失败的打印，ora.asmnet2.asmnetwork节点2上的资源连接失败

2022-03-01 16:18:21.190 [OCSSD(11331)]CRS-7503: The Oracle Grid Infrastructure process 'ocssd' observed communication issues between node 'node1' and node 'node2', interface list of local node 'node1' is '172.xx.xx.1:48068;172.xx.xx.1:21226;', interface list of remote node 'node2' is '172.xx.xx.2:12027;172.xx.xx.2:49228;'.
2022-03-01 16:18:23.191 [OCSSD(11331)]CRS-7503: The Oracle Grid Infrastructure process 'ocssd' observed communication issues between node 'node1' and node 'node2', interface list of local node 'node1' is '172.xx.xx.1:48068;172.xx.xx.1:21226;', interface list of remote node 'node2' is '172.xx.xx.2:12027;172.xx.xx.2:49228;'.
2022-03-01 16:18:26.710 [OCSSD(11331)]CRS-1612: Network communication with node node2 (2) has been missing for 50% of the timeout interval.  If this persists, removal of this node from cluster will occur in 14.460 seconds
2022-03-01 16:18:29.717 [EVMD(10076)]CRS-7503: The Oracle Grid Infrastructure process 'evmd' observed communication issues between node 'node1' and node 'node2', interface list of local node 'node1' is '172.xx.xx.1:42544;172.xx.xx.1:14112;', interface list of remote node 'node2' is '172.xx.xx.2:30847;172.xx.xx.2:53979;'.
2022-03-01 16:18:33.712 [OCSSD(11331)]CRS-1611: Network communication with node node2 (2) has been missing for 75% of the timeout interval.  If this persists, removal of this node from cluster will occur in 7.460 seconds
2022-03-01 16:18:36.926 [CRSD(25659)]CRS-7503: The Oracle Grid Infrastructure process 'crsd' observed communication issues between node 'node1' and node 'node2', interface list of local node 'node1' is '172.xx.xx.1:25920;172.xx.xx.1:33793;', interface list of remote node 'node2' is '172.xx.xx.2:64880;172.xx.xx.2:57460;'.
2022-03-01 16:18:38.713 [OCSSD(11331)]CRS-1610: Network communication with node node2 (2) has been missing for 90% of the timeout interval.  If this persists, removal of this node from cluster will occur in 2.460 seconds
2022-03-01 16:18:40.517 [OHASD(9218)]CRS-8011: reboot advisory message from host: node2, component: cssagent, with time stamp: L-2022-03-01-16:19:10.946
2022-03-01 16:18:40.517 [OHASD(9218)]CRS-8013: reboot advisory message text: oracssdagent is about to reboot this node due to unknown reason as it did not receive local heartbeats for 1260 ms amount of time

集群中节点2上的asmnetwork服务已经无法通信后，紧接着Oralce集群中CSS(集群同步服务)检测到了集群内的两台节点间网络无法通信，并开始了30s的探测。而在这30s内，节点1上仍然无法收到来自节点2的网络心跳包，最终因为长时间丢包(超出了ocssd配置的网络心跳的超时时间)触发了集群的节点驱逐，通过OHASD向节点2发送reboot advisory message来完成节点2的驱逐重启。

2022-03-01 16:18:41.177 [OCSSD(11331)]CRS-1632: Node node2 is being removed from the cluster in cluster incarnation 538918274
2022-03-01 16:18:41.299 [OCSSD(11331)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 .
2022-03-01 16:18:41.310 [CRSD(25659)]CRS-5504: Node down event reported for node 'node2'.
2022-03-01 16:18:42.523 [CRSD(25659)]CRS-2773: Server 'node2' has been removed from pool 'Generic'.
2022-03-01 16:18:42.524 [CRSD(25659)]CRS-2773: Server 'node2' has been removed from pool 'ora.rac'.
2022-03-01 16:18:50.364 [GIPCD(10549)]CRS-7503: The Oracle Grid Infrastructure process 'gipcd' observed communication issues between node 'node1' and node 'node2', interface list of local node 'node1' is '172.xx.xx.1:42613;172.xx.xx.1:43716;', interface list of remote node 'node2' is '172.xx.xx.2:16172;172.xx.xx.2:11463;'.
2022-03-01 16:18:50.364 [GIPCD(10549)]CRS-7503: The Oracle Grid Infrastructure process 'gipcd' observed communication issues between node 'node1' and node 'node2', interface list of local node 'node1' is '172.xx.xx.1:42613;172.xx.xx.1:43716;', interface list of remote node 'node2' is '172.xx.xx.2:16172;172.xx.xx.2:11463;'.

16:18:41，节点2成功从集群中被驱逐出去，CSSD完成了集群的重新配置，并确认了集群中存活的节点。在确认节点2已经离线后，集群将节点2相关的资源从Generic和ora.rac资源池中移除。

2022-03-01 16:21:12.599 [CLSECHO(203157)]ACFS-9316: Valid ADVM/ACFS distribution media detected at: '/xxxx/usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64/bin'
2022-03-01 16:23:39.293 [OCSSD(11331)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 .
2022-03-01 16:24:00.757 [CRSD(25659)]CRS-2772: Server 'node2' has been assigned to pool 'Generic'.
2022-03-01 16:24:00.758 [CRSD(25659)]CRS-2772: Server 'node2' has been assigned to pool 'ora.rac'.

16:21:12，节点2完成了驱逐重启后状态恢复正常，被集群服务所探测到，并再次进行了集群的重新配置(CSSD Reconfiguration complete)。之后，节点2上的资源被添加到资源池中，集群状态重新恢复正常。

集群中节点2(故障节点)上crs相关的(alert.log)告警日志

2022-03-01 16:15:16.999 [GIPCD(9581)]CRS-42216: No interfaces are configured on the local node for interface definition ib3(:.*)?:172.xx.xx.0: available interface definitions are [eno3(:.*)?:172.xx.xx.0][eno4:1(:.*)?:169.254.0.0][bond0(:.*)?:10.xx.xx.0][bond0:1(:.*)?:10.xx.xx.0][eno4(:.*)?:[fe80:0:0:0:0:0:0:0]][eno3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib2(:.*)?:[fe80:0:0:0:0:0:0:0]][bond0(:.*)?:[fe80:0:0:0:0:0:0:0]][eno4(:.*)?:172.xx.xx.0].
2022-03-01 16:15:18.001 [ORAROOTAGENT(8853)]CRS-5050: HAIP failover due to network interface ib3 not functioning
2022-03-01 16:15:21.102 [GIPCD(9581)]CRS-42216: No interfaces are configured on the local node for interface definition ib3(:.*)?:172.xx.xx.0: available interface definitions are [eno3(:.*)?:172.xx.xx.0][eno4:1(:.*)?:169.254.0.0][eno4:2(:.*)?:169.254.16.0][bond0(:.*)?:10.xx.xx.0][bond0:1(:.*)?:10.xx.xx.0][eno4(:.*)?:[fe80:0:0:0:0:0:0:0]][eno3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib2(:.*)?:[fe80:0:0:0:0:0:0:0]][bond0(:.*)?:[fe80:0:0:0:0:0:0:0]][eno4(:.*)?:172.xx.xx.0].

故障节点上16:15:16，GIPCD(Grid InterProcess Communication, 集群私网通信服务)守护进程检测到集群内部本地节点上的私网网卡异常，由于集群内部私网IP无法通信，HAIP进行了failover。

2022-03-01 16:15:45.753 [GIPCD(9581)]CRS-42216: No interfaces are configured on the local node for interface definition ib3(:.*)?:172.xx.xx.0: available interface definitions are [eno3(:.*)?:172.xx.xx.0][eno4:1(:.*)?:169.254.0.0][eno4:2(:.*)?:169.254.16.0][bond0(:.*)?:10.xx.xx.0][bond0:1(:.*)?:10.xx.xx.0][eno4(:.*)?:[fe80:0:0:0:0:0:0:0]][eno3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib2(:.*)?:[fe80:0:0:0:0:0:0:0]][bond0(:.*)?:[fe80:0:0:0:0:0:0:0]][eno4(:.*)?:172.xx.xx.0].
2022-03-01 16:15:48.871 [ORAROOTAGENT(24905)]CRS-5017: The resource action "ora.asmnet2.asmnetwork start" encountered the following error: 
2022-03-01 16:15:48.871+CRS-5006: Unable to automatically select a network interface which has subnet mask  and subnet number 172.xx.xx.0
2022-03-01 16:15:49.860 [GIPCD(9581)]CRS-42216: No interfaces are configured on the local node for interface definition ib3(:.*)?:172.xx.xx.0: available interface definitions are [eno3(:.*)?:172.xx.xx.0][eno4:1(:.*)?:169.254.0.0][eno4:2(:.*)?:169.254.16.0][bond0(:.*)?:10.xx.xx.0][bond0:1(:.*)?:10.xx.xx.0][eno4(:.*)?:[fe80:0:0:0:0:0:0:0]][eno3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib3(:.*)?:[fe80:0:0:0:0:0:0:0]][ib2(:.*)?:[fe80:0:0:0:0:0:0:0]][bond0(:.*)?:[fe80:0:0:0:0:0:0:0]][eno4(:.*)?:172.xx.xx.0].

16:15:48，orarootagent代理进程尝试启动故障节点上的ora.asmnet2.asmnetwork，但是失败了，当前节点上无法发现集群所需要使用的私网网卡服务。

2022-03-01 16:18:57.269 [OCSSD(10043)]CRS-1612: Network communication with node node1 (1) has been missing for 50% of the timeout interval.  If this persists, removal of this node from cluster will occur in 14.410 seconds
...
2022-03-01 16:19:04.272 [OCSSD(10043)]CRS-1611: Network communication with node node1 (1) has been missing for 75% of the timeout interval.  If this persists, removal of this node from cluster will occur in 7.410 seconds
...
2022-03-01 16:19:09.274 [OCSSD(10043)]CRS-1610: Network communication with node node1 (1) has been missing for 90% of the timeout interval.  If this persists, removal of this node from cluster will occur in 2.410 seconds

16:18:57至16:19:09两个时间点内，可以看到当前集群已经一段时间内无法接收到集群内其他节点的网络心跳了。

2022-03-01 16:19:09.706 [OCSSD(10043)]CRS-1606: The number of voting files available, 0, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /xxxx/diag/crs/node2/crs/trace/ocssd.trc
2022-03-01 16:19:09.709 [OCSSD(10043)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /xxxx/diag/crs/node2/crs/trace/ocssd.trc
2022-03-01 16:19:09.731 [OCSSD(10043)]CRS-1652: Starting clean up of CRSD resources.
2022-03-01 16:19:09.738 [ORAROOTAGENT(24905)]CRS-5818: Aborted command 'check' for resource 'ora.asmnet1.asmnetwork'. Details at (:CRSAGF00113:) {2:25114:2} in /xxxx/diag/crs/node2/crs/trace/crsd_orarootagent_root.trc.
2022-03-01 16:19:09.738 [ORAROOTAGENT(24905)]CRS-5818: Aborted command 'check' for resource 'ora.asmnet2.asmnetwork'. Details at (:CRSAGF00113:) {0:2:61} in /xxxx/diag/crs/node2/crs/trace/crsd_orarootagent_root.trc.
2022-03-01 16:19:10.326 [CSSDMONITOR(9991)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 610 milliseconds; details are at (:CLSN00121:) in /xxxx/diag/crs/node2/crs/trace/ohasd_cssdmonitor_root.trc.
...
2022-03-01 16:19:10.646 [CSSDAGENT(10025)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 290 milliseconds; details are at (:CLSN00121:) in /xxxx/diag/crs/node2/crs/trace/ohasd_cssdagent_root.trc.

由于长时间丢失网络心跳，节点2上的CSS守护服务被迫终止服务，本地节点上检查集群内部网络的asmnetwork进程也被强制关闭(aborted)。由于节点2上CSS守护进程无法得到响应，如果在接下来的610ms内仍无法响应，集群将会将节点2进行驱逐重启。最终节点2被集群重启。

2022-03-01 16:23:50.542 [OHASD(8521)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 8521
2022-03-01 16:23:50.603 [OHASD(8521)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2022-03-01 16:23:50.618 [OHASD(8521)]CRS-2112: The OLR service started on node node2.
2022-03-01 16:23:50.780 [OHASD(8521)]CRS-8011: reboot advisory message from host: node2, component: cssagent, with time stamp: L-2022-03-01-16:19:10.946
2022-03-01 16:23:50.781 [OHASD(8521)]CRS-1301: Oracle High Availability Service started on node node2.
2022-03-01 16:23:50.782 [OHASD(8521)]CRS-8013: reboot advisory message text: oracssdagent is about to reboot this node due to unknown reason as it did not receive local heartbeats for 1260 ms amount of time

16:23:50，节点2发生重启后开始启动OHASD服务，并完成集群初始化的一系列动作。

2022-03-01 16:23:51.439 [CSSDMONITOR(9245)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 9245
2022-03-01 16:23:51.439 [CSSDAGENT(9241)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 9241
2022-03-01 16:23:51.468 [ORAROOTAGENT(9224)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 9224
2022-03-01 16:23:51.487 [ORAAGENT(9234)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 9234
2022-03-01 16:23:52.115 [ORAROOTAGENT(9224)]CRS-5016: Process "/xxxx/bin/acfsload" spawned by agent "ORAROOTAGENT" for action "check" failed: details at "(:CLSN00010:)" in "/xxxx/diag/crs/node2/crs/trace/ohasd_orarootagent_root.trc"
2022-03-01 16:23:52.126 [ORAAGENT(9330)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 9330

OHASD服务启动成功后，检查本地节点上的信息正常后，OHASD开始启动集群内部的各个资源对应的代理进程，包括CSSDMONITOR、CSSDAGENT、ORAROOTAGENT、ORAAGENT。

2022-03-01 16:23:52.447 [MDNSD(9378)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 9378
2022-03-01 16:23:52.452 [EVMD(9379)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 9379
2022-03-01 16:23:53.288 [CLSECHO(9551)]ACFS-9154: Loading 'oracleoks.ko' driver.
2022-03-01 16:23:53.472 [GPNPD(9555)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 9555
2022-03-01 16:23:54.358 [GPNPD(9555)]CRS-2328: GPNPD started on node node2. 
2022-03-01 16:23:54.496 [GIPCD(9855)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 9855

启动MDNSD、EVMD、GPNPD、GIPCD等集群内部的通信进程

2022-03-01 16:24:04.732 [OCSSD(10648)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 10648
2022-03-01 16:24:05.805 [OCSSD(10648)]CRS-1713: CSSD daemon is started in hub mode
...
2022-03-01 16:24:09.559 [OCSSD(10648)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 .
2022-03-01 16:24:10.847 [OCSSD(10648)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2022-03-01 16:24:11.240 [OCTSSD(18913)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 18913
2022-03-01 16:24:12.039 [OCTSSD(18913)]CRS-2403: The Cluster Time Synchronization Service on host node2 is in observer mode.
2022-03-01 16:24:13.352 [OCTSSD(18913)]CRS-2407: The new Cluster Time Synchronization Service reference node is host node1.
2022-03-01 16:24:13.352 [OCTSSD(18913)]CRS-2401: The Cluster Time Synchronization Service started on host node2.
...
2022-03-01 16:24:21.919 [CRSD(24912)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 24912
2022-03-01 16:24:24.413 [CRSD(24912)]CRS-1012: The OCR service started on node node2.
2022-03-01 16:24:24.439 [CRSD(24912)]CRS-1201: CRSD started on node node2.

集群启动通信结束后，OCSSD守护进程开始启动，并在之后启动CRSD守护进程。

2022-03-01 16:44:17.979 [OCSSD(10648)]CRS-1626: A Configuration change request completed successfully
2022-03-01 16:44:17.995 [OCSSD(10648)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 .

CSSD守护进程启动成功后，并完成集群的重新配置。

最后，Oracle的任何事件在对应的日志中都有详细的记录，了解每个资源对应的功能对于排查Oracle的故障有重大帮助。日常使用中，一定要定期对集群的健康状态进行检查，尽可能的开启Oracle的健康状态监控，并配置足够详细的系统监控指标，否则，等到问题发生的时候，往往没有足够的数据支撑你的故障诊断。

企业监控系统部署、搭建Harbor私有仓库
ES集群安装部署ES集群,用于ELK日志分析平台的构建主机名称IP地址角色规格es-0001192.168.1...
高可用集群之仲裁盘
背景：高可用集群共享盘场景下，多个节点间共享一块数据盘，该集群对外提供一个VIP。当集群中各个节点间心跳网络不通，...
分布式系统中的网络分区问题
网络分区问题什么是网络分区？网络分区只在分布式集群中，节点之间由于网络不通，导致集群中节点形成不同的子集，子集...
k8s之nginx-ingress
一、Ingress 简介在Kubernetes中，服务和Pod的IP地址仅可以在集群网络内部使用，对于集群外的应...
六、Kubernetes系列之Kubernetes使用ingre
一、Ingress简介在Kubernetes中，服务和Pod的IP地址仅可以在集群网络内部使用，对于集群外的应用...
基于腾讯云Elasticsearch Service的ELK实践
ELK日志分析实践腾讯云Elasticsearch Service提供的实例包含Elasticsearch集群和...
网络与通信原理
网络与通信原理集群网络搭建集群集群创建阶段会指定集群CIDR，并将这个值以参数传递的形式传递给集群节点初始脚...
【Linux-zookeeper】Unable to read
内部维护一个zookeeper集群，近期有用户反馈集群不稳定，导致业务侧有波动。根据用户提供的IP定位到异常并提供...
Elastic Stack学习笔记
简介 Elastic Stack可用于集群、多应用的日志分析。 Elastic Stack包含Elasticsea...
k8s中负载均衡器【ingress-nginx】部署
在Kubernetes中，服务和Pod的IP地址仅可以在集群网络内部使用，对于集群外的应用是不可见的。为了使外部的...

一次集群内部网络心跳消失导致的集群驱逐日志分析

集群节点1(正常节点)上crs相关的(alert.log)告警日志

集群中节点2(故障节点)上crs相关的(alert.log)告警日志

相关文章

企业监控系统部署、搭建Harbor私有仓库

高可用集群之仲裁盘

分布式系统中的网络分区问题

k8s之nginx-ingress

六、Kubernetes系列之Kubernetes使用ingre

基于腾讯云Elasticsearch Service的ELK实践

网络与通信原理

【Linux-zookeeper】Unable to read

Elastic Stack学习笔记

k8s中负载均衡器【ingress-nginx】部署

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读