in HDFS by
Who takes care of replication consistency in a Hadoop cluster and what do under/over replicated blocks mean?

► Click here to show 1 Answer

0 votes

In a cluster, it is always the NameNode that takes care of the replication consistency. The fsck command provides information regarding the over and under-replicated block. 

Under-replicated blocks:

These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.

Consider a cluster with three nodes and replication set to three. At any point, if one of the NameNodes crashes, the blocks would be under-replicated. It means that there was a replication factor set, but there are not enough replicas as per the replication factor. If the NameNode does not get information about the replicas, it will wait for a limited amount of time and then start the re-replication of missing blocks from the available nodes. 

Over-replicated blocks:

These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete excess replicas.

Consider a case of three nodes running with the replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node is back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of blocks from one of the nodes. 

MapReduce Interview Questions

After HDFS, let’s now move on to some of the interview questions related to the processing framework of Hadoop: MapReduce.

Apache HDFS
Learn More with Madanswer

Related questions