-
Symptom:
After installing EFK on OpenShift, all pods appear to be running normally, but the kibana pod keeps logging the following warnings:
log [22:52:02.901] [warning][elasticsearch] Unable to revive connection: https://logging-es:9200/
log [22:52:02.902] [warning][elasticsearch] No living connections
log [22:52:05.442] [warning][elasticsearch] Unable to revive connection: https://logging-es:9200/
log [22:52:05.443] [warning][elasticsearch] No living connections
-
Troubleshooting:
1. Verify that Kibana can reach the ES server (logging-es)
On a master node of the OKD cluster, run the following command.
np="openshift-logging"; \
oc exec `oc get pods -l component=kibana-ops -o name -n $np |cut -d/ -f2` \
-c kibana \
-n $np \
-- curl -s \
--cacert /etc/kibana/keys/ca \
--cert /etc/kibana/keys/cert \
--key /etc/kibana/keys/key \
https://logging-es:9200/
The output shows that Kibana can connect to the ES service, which rules out a problem on the Kibana side.
{
"name" : "logging-es-data-master-ypgh5heq",
"cluster_name" : "logging-es",
"cluster_uuid" : "uHhppucFTI2JwZdYbStTaA",
"version" : {
"number" : "5.6.13",
"build_hash" : "4d5320b",
"build_date" : "2018-10-30T19:05:08.237Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"
},
"tagline" : "You Know, for Search"
}
2. Verify that the ES server (logging-es) cluster itself is healthy
Check the cluster status:
np="openshift-logging"; \
oc exec `oc get pods -l component=es -o name -n $np |cut -d/ -f2` \
-c elasticsearch \
-n $np \
-- curl -s \
--cacert /etc/elasticsearch/secret/admin-ca \
--cert /etc/elasticsearch/secret/admin-cert \
--key /etc/elasticsearch/secret/admin-key \
'https://localhost:9200/_cluster/health?pretty=true'
The output shows status is yellow, unassigned_shards is 14, and active_shards_percent_as_number is 50.
{
"cluster_name" : "logging-es",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 14,
"active_shards" : 14,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 14,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 50.0
}
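The 50% figure follows directly from the numbers above: of 28 total shards (14 primaries plus 14 unassigned replicas), only 14 are active. A quick sanity check of active_shards_percent_as_number in plain shell arithmetic:

```shell
# Recompute active_shards_percent_as_number from the health output above:
# active_shards / (active_shards + unassigned_shards) * 100
active=14
unassigned=14
echo $(( active * 100 / (active + unassigned) ))   # prints 50
```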
3. From unassigned_shards and active_shards_percent_as_number, infer that some index shards are not being allocated
Check the index status:
np="openshift-logging"; \
oc exec `oc get pods -l component=es -o name -n $np |cut -d/ -f2` \
-c elasticsearch \
-n $np \
-- curl -s \
--cacert /etc/elasticsearch/secret/admin-ca \
--cert /etc/elasticsearch/secret/admin-cert \
--key /etc/elasticsearch/secret/admin-key \
'https://localhost:9200/_cat/indices?v'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open .operations.2019.08.01 eb5nHgZdSYmU91YBKoMXWA 1 1 42731 0 31.5mb 31.5mb
yellow open .operations.2019.07.30 XAcy27SxQ-6eKw4e3NHtRA 1 1 67211 0 74.2mb 74.2mb
yellow open .operations.2019.08.03 fTcaIEx-TD-BQuYhocYOjA 1 1 275384 0 173.4mb 173.4mb
yellow open .kibana.d033e22ae348aeb5660fc2140aec35850c4da997 fN0s-AywSDieHYiI0yor2A 1 1 5 0 57.7kb 57.7kb
yellow open .operations.2019.07.31 KYJ0wn8TTl6cVYjUplW3Ag 1 1 61817 0 63.8mb 63.8mb
yellow open .operations.2019.08.08 FmKDNSmLQ469GXZzh3-oTQ 1 1 73322 0 64.1mb 64.1mb
yellow open .searchguard E_CTiBPXQy29uAceCLfEoA 1 1 5 0 66.3kb 66.3kb
yellow open .operations.2019.08.05 BbDX_5rlRAO-kMbIrS7Wag 1 1 249518 0 174.7mb 174.7mb
yellow open .operations.2019.08.02 Tvnk20pKS-eBz2mPtR-rmA 1 1 293582 0 199.5mb 199.5mb
yellow open .kibana g8Py9rE0TNe6K37XIXmXiw 1 1 1 0 3.2kb 3.2kb
yellow open .operations.2019.08.09 MIbESM4KTWOzzoCPe8b9Jg 1 1 20257 0 17.3mb 17.3mb
yellow open .operations.2019.08.04 oZi_MuSNS9-HUvmXT61CNA 1 1 271250 0 169mb 169mb
yellow open .operations.2019.08.06 uOgHA4EVT42WgiDkKPf97g 1 1 124307 0 102.4mb 102.4mb
yellow open .operations.2019.07.29 aBQSJHprTK-bhW0JIX3UpA 1 1 39028 0 29.4mb 29.4mb
yellow open .operations.2019.07.27 m1QQJSjjRXS9gBMpos_2Qw 1 1 36097 0 30.4mb 30.4mb
The output shows that rep is 1 for every index, which is not what we want on this deployment.
The main purpose of a replica shard is failover: as discussed in the Elasticsearch guide's chapter on life inside a cluster, if the node holding a primary shard goes down, a replica shard is promoted to primary.
This implies that a replica shard can never be allocated to the same node as its primary. In a single-node cluster there is no other node to place the replicas on, so every replica shard remains unassigned. And with only one node there is nothing to fail over to anyway: if the node holding the primaries dies, the whole cluster is down, and no replica promotion could take place.
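To see exactly which shards are unassigned, the `_cat/shards` endpoint (available in ES 5.x, and callable through the same `oc exec`/curl pattern as above) prints one row per shard, including a prirep column (`p` for primary, `r` for replica) and a state column. A minimal sketch using hypothetical sample rows in that column order, counting the unassigned replicas with awk:

```shell
# Hypothetical `_cat/shards`-style rows: index shard prirep state
# On a single-node cluster, every replica row ("r") stays UNASSIGNED.
cat <<'EOF' | awk '$3 == "r" && $4 == "UNASSIGNED" { n++ } END { print n }'
.operations.2019.08.09 0 p STARTED
.operations.2019.08.09 0 r UNASSIGNED
.kibana                0 p STARTED
.kibana                0 r UNASSIGNED
EOF
```

This prints 2, one per unassigned replica in the sample.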
-
Fix:
Since the cluster has only one node and replica shards can never be allocated, set number_of_replicas to 0 on the indices so the ES cluster returns to a healthy state.
Run:
np="openshift-logging"; \
oc exec `oc get pods -l component=es -o name -n $np |cut -d/ -f2` \
-c elasticsearch \
-n $np \
-- curl -s \
--cacert /etc/elasticsearch/secret/admin-ca \
--cert /etc/elasticsearch/secret/admin-cert \
--key /etc/elasticsearch/secret/admin-key \
-H "Content-Type: application/json" \
-XPUT 'https://localhost:9200/_settings' \
-d '{
"index" : {
"number_of_replicas" : 0
}
}'
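One caveat: `PUT /_settings` only updates indices that already exist, so new daily `.operations.*` indices will still be created with the default of one replica unless that default is overridden, for example with an index template (in ES 5.x the pattern key is `template`; `index_patterns` only arrived in 6.x). A sketch of such a template body, under a hypothetical name `zero-replicas`, which could be sent with the same curl invocation as above via `-XPUT 'https://localhost:9200/_template/zero-replicas'` (on OpenShift logging the supported route is normally the installer's replica-count setting rather than hand-editing templates):

```json
{
  "template": "*",
  "order": 0,
  "settings": {
    "index": { "number_of_replicas": 0 }
  }
}
```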
Query the cluster health again; the cluster is back to green.
{
"cluster_name" : "logging-es",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 15,
"active_shards" : 15,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}