A customer using our Spark product (via Spark Thrift Server) reported that after about a day of use it starts throwing errors. Restarting the Spark Thrift Server clears them, but that only treats the symptom, not the cause, so the underlying problem still had to be dug out and fixed.
The error messages were as follows:
- 1.jpg (error screenshot)
- 2.jpg (error screenshot)
- 3.jpg (error screenshot)
Since the error log points at HDFS's BlockManager, I looked at that class's chooseTarget4NewBlock method in the source:
```java
/**
 * Choose target datanodes for creating a new block.
 *
 * @throws IOException
 *           if the number of targets < minimum replication.
 * @see BlockPlacementPolicy#chooseTarget(String, int, Node,
 *      Set, long, List, BlockStoragePolicy)
 */
public DatanodeStorageInfo[] chooseTarget4NewBlock(final String src,
    final int numOfReplicas, final Node client,
    final Set<Node> excludedNodes,
    final long blocksize,
    final List<String> favoredNodes,
    final byte storagePolicyID) throws IOException {
  List<DatanodeDescriptor> favoredDatanodeDescriptors =
      getDatanodeDescriptors(favoredNodes);
  final BlockStoragePolicy storagePolicy =
      storagePolicySuite.getPolicy(storagePolicyID);
  final DatanodeStorageInfo[] targets = blockplacement.chooseTarget(src,
      numOfReplicas, client, excludedNodes, blocksize,
      favoredDatanodeDescriptors, storagePolicy);
  if (targets.length < minReplication) {
    throw new IOException("File " + src + " could only be replicated to "
        + targets.length + " nodes instead of minReplication (="
        + minReplication + "). There are "
        + getDatanodeManager().getNetworkTopology().getNumOfLeaves()
        + " datanode(s) running and "
        + (excludedNodes == null ? "no" : excludedNodes.size())
        + " node(s) are excluded in this operation.");
  }
  return targets;
}
```
- This method shows it is an HDFS block allocation problem: the write is rejected whenever fewer than minReplication target datanodes can be chosen. So I ran `hdfs dfsadmin -report` and found two machines whose DFS Remaining and DFS Remaining% were critically low. (The same problem can of course also be spotted in the datanode logs.)
- Ways to resolve it:
1. Add new DataNode machines to the cluster.
2. Add disks to the DataNodes that are short on space (it is also possible that enough disk space exists but was never successfully mounted into HDFS).
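To spot the low-space datanodes without eyeballing the whole report, the output of `hdfs dfsadmin -report` can be filtered with awk. Below is a minimal sketch; the addresses, sizes, and the 10% threshold are made-up assumptions for illustration, not values from the real cluster:

```shell
# Hypothetical excerpt of `hdfs dfsadmin -report` output (sample data only).
report='Name: 10.0.0.1:50010
DFS Remaining: 1073741824 (1 GB)
DFS Remaining%: 2.10%

Name: 10.0.0.2:50010
DFS Remaining: 536870912000 (500 GB)
DFS Remaining%: 65.00%'

# Remember each datanode's address, then flag it if its
# DFS Remaining% falls below an (arbitrary) 10% threshold.
echo "$report" | awk '
  /^Name:/           { node = $2 }
  /^DFS Remaining%:/ { pct = $3; sub(/%/, "", pct)
                       if (pct + 0 < 10) print node, "is low on space:", pct "%" }'
```

On a live cluster you would pipe the command straight into the filter: `hdfs dfsadmin -report | awk '…'`.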