What is the fundamental difference between a MapReduce InputSplit and HDFS block?

1 Answer


By definition

1) Block- A block is a contiguous region of storage on disk where HDFS stores data. In general, a file system stores data as a collection of blocks. In the same way, HDFS stores each file as a sequence of blocks and distributes them across the Hadoop cluster.

2) InputSplit- An InputSplit represents the data that an individual Mapper will process. Each split is further divided into records, and each record (a key-value pair) is processed by the map function.

Data representation

1) Block- It is the physical representation of data.

2) InputSplit- It is the logical representation of data. A MapReduce program (or any other processing framework that uses the same input layer) works with InputSplits during data processing, not with raw blocks. The important point is that an InputSplit does not contain the input data. It is just a reference to the data.
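
To make "just a reference" concrete, here is a minimal Python sketch. The class name `SplitSketch` and the sample path and host names are hypothetical; the four fields are modelled on what Hadoop's `FileSplit` exposes through `getPath()`, `getStart()`, `getLength()` and `getLocations()`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SplitSketch:
    """Hypothetical stand-in for a file-based split: metadata only, no file bytes."""
    path: str                # file the split points into
    start: int               # byte offset where the split begins
    length: int              # number of bytes the split covers
    hosts: Tuple[str, ...]   # nodes that store the data (a scheduling hint)


# Illustrative values only; the path and host names are made up.
split = SplitSketch(
    path="/data/input.txt",
    start=0,
    length=128 * 1024 * 1024,
    hosts=("node1", "node2", "node3"),
)
print(split)
```

Note that nothing in the object holds file content. A mapper has to open the file and seek to `start` to get at the actual records.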

Size

1) Block- The default size of an HDFS block is 128 MB, and it can be configured as per our requirement (the dfs.blocksize property). All blocks of a file are the same size except the last one, which can be the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in the Hadoop file system.

2) InputSplit- By default, the split size is approximately equal to the block size. It can be adjusted per job through the minimum and maximum split size settings.
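
As a rough illustration of how the split size relates to the block size, the sketch below mirrors the rule used by Hadoop's `FileInputFormat`, which clamps the block size between a minimum and a maximum split size. The Python function name and its default arguments are my own; in a real job the bounds come from `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Clamp the block size between the configured min and max split size."""
    return max(min_size, min(max_size, block_size))


default_size = compute_split_size(128 * MB)
capped_size = compute_split_size(128 * MB, max_size=64 * MB)
raised_size = compute_split_size(128 * MB, min_size=256 * MB)

print(default_size // MB)  # 128: with default bounds the split size equals the block size
print(capped_size // MB)   # 64: a smaller max gives more, smaller splits
print(raised_size // MB)   # 256: a larger min gives fewer, larger splits
```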

Example

Consider an example where we need to store a file in HDFS. HDFS stores files as blocks. A block is the smallest unit of data that can be stored on or retrieved from the disk. The default block size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. If we have a file of 130 MB, HDFS will break it into 2 blocks: one of 128 MB and one of 2 MB.
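
The block arithmetic for this example can be checked with a few lines of Python. This is only a sketch of the size calculation; the helper name `block_lengths` is hypothetical and it does not talk to HDFS.

```python
MB = 1024 * 1024


def block_lengths(file_size, block_size):
    """Return the length of every block of a file, in bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        lengths.append(remainder)  # the last block holds whatever is left over
    return lengths


blocks = block_lengths(130 * MB, 128 * MB)
print([b // MB for b in blocks])  # [128, 2]
```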

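Now suppose we want to run a MapReduce job on this file. Blocks are cut at fixed byte offsets with no regard for record boundaries, so a record can start in the first block and end in the second. A mapper that read exactly one block would then see an incomplete record. InputSplit solves this problem. A split is a logical byte range over the file, so it can span the boundary between blocks. It records the start offset and length of the data together with the locations of the nodes that hold it, and when a record crosses the end of a split, the record reader keeps reading into the next block until the record is complete.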
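
The boundary handling described above can be simulated in plain Python. The sketch below imitates the approach of a line-oriented record reader such as Hadoop's `LineRecordReader`: a reader whose split does not start at offset 0 skips its first (possibly partial) line, and every reader is allowed to read past the end of its split to finish the last line. The function name, the sample data and the split offsets are all made up for illustration.

```python
def read_split(data: bytes, start: int, length: int):
    """Return the newline-terminated records that belong to one split."""
    end = start + length
    pos = start
    if start != 0:
        # The first line here is partial or already claimed by the previous
        # split's reader, so skip forward to the next line start.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A line is ours if it starts at or before the split end, even when it
    # finishes beyond it (in HDFS that means reading into the next block).
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append(data[pos:line_end].decode())
        pos = line_end + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"       # 26 bytes, 4 records
splits = [(0, 9), (9, 9), (18, 8)]             # cut points fall mid-record
per_split = [read_split(data, s, n) for s, n in splits]
print(per_split)  # [['alpha', 'bravo'], ['charlie'], ['delta']]
```

Even though both cut points land in the middle of a word, every record is read exactly once and none is truncated.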

From this, we can conclude that an InputSplit is only a logical chunk of data, i.e. it holds just the address or location information for the blocks, not the data itself. When a MapReduce job is set up, Hadoop uses the file's block information to create the InputSplits, and each split is handed to one mapper. The split acts as a broker between block and mapper.
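
As a sketch of how splits might be derived from a file's size, the Python below follows the loop in Hadoop's stock `FileInputFormat`, which, as far as I know, lets the final split be up to 10% larger than the split size (a slop factor of 1.1) to avoid creating a tiny trailing split. The function name is hypothetical, and real splits also carry host locations, which are left out here.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size


def file_splits(file_size, split_size):
    """Return (start, length) pairs covering a file of file_size bytes."""
    splits = []
    remaining = file_size
    # Cut full-size splits while clearly more than one split's worth is left.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_size - remaining, split_size))
        remaining -= split_size
    if remaining > 0:
        splits.append((file_size - remaining, remaining))  # final split takes the rest
    return splits


small = file_splits(130 * MB, 128 * MB)
large = file_splits(300 * MB, 128 * MB)
print([(s // MB, n // MB) for s, n in small])  # [(0, 130)]
print([(s // MB, n // MB) for s, n in large])  # [(0, 128), (128, 128), (256, 44)]
```

Under these assumptions the 130 MB file from the example is stored as two blocks but processed as a single 130 MB split, which shows that splits and blocks need not map one to one.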
