Job is the primary interface by which user-job interacts with the ResourceManager.
Job provides facilities to submit jobs, track their progress, access component-tasks’ reports and logs, get the MapReduce cluster’s status information and so on.
The job submission process involves:
Checking the input and output specifications of the job.
Computing the InputSplit values for the job.
Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
Copying the job’s jar and configuration to the MapReduce system directory on the FileSystem.
Submitting the job to the ResourceManager and optionally monitoring it’s status.
Job history files are also logged to user specified directory mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir, which defaults to job output directory.
User can view the history logs summary in specified directory using the following command $ mapred job -history output.jhist This command will print job details, failed and killed tip details. More details about the job such as successful tasks and task attempts made for each task can be viewed using the following command $ mapred job -history all output.jhist
Normally the user uses Job to create the application, describe various facets of the job, submit the job, and monitor its progress.
Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.
However, this also means that the onus on ensuring jobs are complete (success/failure) lies squarely on the clients. In such cases, the various job-control options are:
Job.submit() : Submit the job to the cluster and return immediately.
Job.waitForCompletion(boolean) : Submit the job to the cluster and wait for it to finish.