Job Output in MapReducer
Jan 8, 2020 in Big Data | Hadoop

1 Answer, 0 votes (answered Jan 8, 2020)
OutputFormat describes the output-specification for a MapReduce job.
The MapReduce framework relies on the OutputFormat of the job to:
Validate the output-specification of the job; for example, check that the output directory doesn’t already exist.
Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.
TextOutputFormat is the default OutputFormat.
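For reference, here is a minimal driver sketch showing where the OutputFormat is configured, assuming the new org.apache.hadoop.mapreduce API; the class name OutputFormatDemo is made up, and mapper/reducer configuration is omitted for brevity:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "output-format-demo");
    job.setJarByClass(OutputFormatDemo.class);

    // TextOutputFormat is the default; set explicitly here for clarity.
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Output-spec validation: the job fails fast if this directory exists.
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```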
OutputCommitter
OutputCommitter describes the commit of task output for a MapReduce job.
The MapReduce framework relies on the OutputCommitter of the job to:
Set up the job during initialization. For example, create the temporary output directory for the job. Job setup is done by a separate task while the job is in the PREP state, after tasks are initialized. Once the setup task completes, the job moves to the RUNNING state.
Clean up the job after job completion. For example, remove the temporary output directory. Job cleanup is done by a separate task at the end of the job. The job is declared SUCCEEDED/FAILED/KILLED after the cleanup task completes.
Set up the task's temporary output. Task setup is done as part of the same task, during task initialization.
Check whether a task needs a commit, to avoid the commit procedure for tasks that have no output to commit.
Commit the task output. Once a task is done, it commits its output if required.
Discard the task commit. If the task failed or was killed, its output is cleaned up. If the task could not clean up (for example, it died in its exception block), a separate task is launched with the same attempt-id to do the cleanup.
FileOutputCommitter is the default OutputCommitter. Job setup/cleanup tasks occupy map or reduce containers, whichever is available on the NodeManager. The JobCleanup task, TaskCleanup tasks, and JobSetup task have the highest priority, in that order.
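A minimal skeleton of these hooks, assuming the new org.apache.hadoop.mapreduce API (a sketch, not FileOutputCommitter's real implementation; the comments note what FileOutputCommitter does at each step):

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class SketchOutputCommitter extends OutputCommitter {
  @Override
  public void setupJob(JobContext job) throws IOException {
    // Job setup: FileOutputCommitter creates the job's _temporary directory here.
  }

  @Override
  public void setupTask(TaskAttemptContext task) throws IOException {
    // Task setup runs as part of the task itself; FileOutputCommitter is a no-op here.
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext task) throws IOException {
    // Return false to skip the commit procedure for tasks with nothing to commit.
    return true;
  }

  @Override
  public void commitTask(TaskAttemptContext task) throws IOException {
    // Promote the task-attempt's temporary output to the job output directory.
  }

  @Override
  public void abortTask(TaskAttemptContext task) throws IOException {
    // Discard the failed/killed task-attempt's temporary output.
  }
}
```

The other job-level hooks (commitJob, abortJob) have default implementations in the abstract OutputCommitter class, so only these five methods must be overridden.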
Task Side-Effect Files
In some applications, component tasks need to create and/or write to side-files, which differ from the actual job-output files.
In such cases there could be issues with two instances of the same Mapper or Reducer running simultaneously (for example, speculative tasks) trying to open and/or write to the same file (path) on the FileSystem. Hence the application-writer has to pick unique names per task-attempt (using the attempt-id, say attempt_200709221812_0001_m_000000_0), not just per task.
To avoid these issues the MapReduce framework, when the OutputCommitter is FileOutputCommitter, maintains a special ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory accessible via ${mapreduce.task.output.dir} for each task-attempt on the FileSystem where the output of the task-attempt is stored. On successful completion of the task-attempt, the files in the ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This process is completely transparent to the application.
The application-writer can take advantage of this feature by creating any side-files required in ${mapreduce.task.output.dir} during execution of a task via FileOutputFormat.getWorkOutputPath(Context), and the framework will promote them similarly for successful task-attempts, thus eliminating the need to pick unique paths per task-attempt.
Note: The value of ${mapreduce.task.output.dir} during execution of a particular task-attempt is actually ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid}, and this value is set by the MapReduce framework. So, just create any side-files in the path returned by FileOutputFormat.getWorkOutputPath(Context) from the MapReduce task to take advantage of this feature.
The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to HDFS.
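A sketch of a mapper that writes a side-file into the work directory, assuming the new API (the file name side-file.txt and the pass-through map logic are illustrative only):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
  private FSDataOutputStream sideFile;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Resolves to ${mapreduce.task.output.dir}, this attempt's _temporary
    // sub-directory; files created here are promoted to the job output
    // directory only if the task-attempt succeeds.
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    FileSystem fs = workDir.getFileSystem(context.getConfiguration());
    sideFile = fs.create(new Path(workDir, "side-file.txt"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, value);          // normal job output
    sideFile.writeBytes(value + "\n");    // side-effect output
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    sideFile.close();
  }
}
```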
RecordWriter
RecordWriter writes the output <key, value> pairs to an output file.
RecordWriter implementations write the job outputs to the FileSystem.
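As an illustration, here is a minimal custom FileOutputFormat whose RecordWriter writes each pair as a key=value line; the class name KeyEqualsValueOutputFormat is made up for this sketch:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyEqualsValueOutputFormat extends FileOutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    // getDefaultWorkFile places the file under the committer's work directory,
    // so the OutputCommitter can promote it on successful task commit.
    Path file = getDefaultWorkFile(context, ".txt");
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    final FSDataOutputStream out = fs.create(file, false);

    return new RecordWriter<Text, Text>() {
      @Override
      public void write(Text key, Text value) throws IOException {
        out.writeBytes(key + "=" + value + "\n");
      }

      @Override
      public void close(TaskAttemptContext ctx) throws IOException {
        out.close();
      }
    };
  }
}
```

It would be enabled in the driver with job.setOutputFormatClass(KeyEqualsValueOutputFormat.class).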