InputFormat describes the input-specification for a MapReduce job.
The MapReduce framework relies on the InputFormat of the job to:
Validate the input-specification of the job.
Split-up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormat implementations, typically sub-classes of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.
Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, who is responsible for respecting record-boundaries and presents a record-oriented view of the logical InputSplit to the individual task.
TextInputFormat is the default InputFormat.
If TextInputFormat is the InputFormat for a given job, the framework detects input-files with the .gz extensions and automatically decompresses them using the appropriate CompressionCodec. However, it must be noted that compressed files with the above extensions cannot be split and each compressed file is processed in its entirety by a single mapper.
InputSplit
InputSplit represents the data to be processed by an individual Mapper.
Typically InputSplit presents a byte-oriented view of the input, and it is the responsibility of RecordReader to process and present a record-oriented view.
FileSplit is the default InputSplit. It sets mapreduce.map.input.file to the path of the input file for the logical split.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented to the Mapper implementations for processing. RecordReader thus assumes the responsibility of processing record boundaries and presents the tasks with keys and values.