in Big Data | Hadoop by (5.1k points)
Job Configuration MapReducer

1 Answer

0 votes
by (5.1k points)
Job represents a MapReduce job configuration.

Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by Job, however:

Some configuration parameters may have been marked as final by administrators (see Final Parameters) and hence cannot be altered.

While some job parameters are straight-forward to set (e.g. Job.setNumReduceTasks(int)) , other parameters interact subtly with the rest of the framework and/or job configuration and are more complex to set (e.g. Configuration.set(JobContext.NUM_MAPS, int)).

Job is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, OutputFormat implementations. FileInputFormat indicates the set of input files (FileInputFormat.setInputPaths(Job, Path…)/ FileInputFormat.addInputPath(Job, Path)) and (FileInputFormat.setInputPaths(Job, String…)/ FileInputFormat.addInputPaths(Job, String)) and where the output files should be written (FileOutputFormat.setOutputPath(Path)).

Optionally, Job is used to specify other advanced facets of the job such as the Comparator to be used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), whether job tasks can be executed in a speculative manner (setMapSpeculativeExecution(boolean))/ setReduceSpeculativeExecution(boolean)), maximum number of attempts per task (setMaxMapAttempts(int)/ setMaxReduceAttempts(int)) etc.

Of course, users can use Configuration.set(String, String)/ Configuration.get(String) to set/get arbitrary parameters needed by applications. However, use the DistributedCache for large amounts of (read-only) data.

Related questions

0 votes
asked Jan 8, 2020 in Big Data | Hadoop by GeorgeBell (5.1k points)