Debugging in MapReduce

1 Answer

The MapReduce framework provides a facility to run user-provided scripts for debugging. When a MapReduce task fails, the user can run a debug script, for example to process the task logs. The script is given access to the task’s stdout and stderr outputs, syslog and jobconf. The output from the debug script’s stdout and stderr is displayed in the console diagnostics and also as part of the job UI.

In the following sections we discuss how to submit a debug script with a job. The script file needs to be distributed and submitted to the framework.

How to distribute the script file:

The user needs to use DistributedCache to distribute the script file and create a symlink to it.

How to submit the script:

A quick way to submit the debug script is to set values for the properties mapreduce.map.debug.script and mapreduce.reduce.debug.script, for debugging map and reduce tasks respectively. These properties can also be set through the APIs Configuration.set(MRJobConfig.MAP_DEBUG_SCRIPT, String) and Configuration.set(MRJobConfig.REDUCE_DEBUG_SCRIPT, String). In streaming mode, a debug script can be submitted with the command-line options -mapdebug and -reducedebug, for debugging map and reduce tasks respectively.
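For a streaming job, the submission options above can be assembled roughly as follows. This is only a sketch: the helper name build_streaming_command, the jar path, the HDFS paths and the script name debug.py are all hypothetical, and it assumes the script is shipped with the generic -files option and is reachable in the task's working directory under its symlink name.

```python
import shlex


def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper, reducer, debug_script_uri, link_name):
    """Return an argv list for a streaming job with debug scripts attached.

    All arguments are placeholders chosen for this sketch.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic option (assumed here): distribute the script and
        # symlink it as link_name in the task's working directory.
        "-files", f"{debug_script_uri}#{link_name}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        # Streaming options named in the text above.
        "-mapdebug", f"./{link_name}",
        "-reducedebug", f"./{link_name}",
    ]


if __name__ == "__main__":
    argv = build_streaming_command(
        "hadoop-streaming.jar",        # hypothetical jar location
        "/data/in", "/data/out",       # hypothetical HDFS paths
        "cat", "wc",
        "hdfs:///tools/debug.py",      # hypothetical script location
        "debug.py",
    )
    # Print a shell-quoted command line for inspection.
    print(shlex.join(argv))
```

Building the command as a list keeps each option and its value separate, so paths containing spaces are not split by accident when the list is passed to subprocess.run.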

The arguments to the script are the task’s stdout, stderr, syslog and jobconf files. The debug command, run on the node where the MapReduce task failed, is:

$script $stdout $stderr $syslog $jobconf

Pipes programs have the C++ program name as a fifth argument for the command. Thus, for pipes programs, the command is:

$script $stdout $stderr $syslog $jobconf $program
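A debug script can be any executable that accepts these arguments. Below is a minimal sketch in Python that prints the last lines of each task log. The file name debug.py, the 20-line tail length and the report layout are arbitrary choices for illustration, not anything the framework prescribes.

```python
#!/usr/bin/env python3
import sys
from collections import deque

TAIL_LINES = 20  # arbitrary tail length chosen for this sketch


def tail(path, count=TAIL_LINES):
    """Return the last `count` lines of a file, or a note if it is unreadable."""
    try:
        with open(path, errors="replace") as handle:
            return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
    except OSError as exc:
        return [f"<could not read {path}: {exc}>"]


def summarize(args):
    """Build a report from the arguments passed to the debug script."""
    if len(args) < 4:
        return ["usage: debug.py STDOUT STDERR SYSLOG JOBCONF [PROGRAM]"]
    stdout_path, stderr_path, syslog_path, jobconf_path = args[:4]
    report = [f"jobconf: {jobconf_path}"]
    if len(args) > 4:
        # Pipes jobs pass the C++ program name as a fifth argument.
        report.append(f"pipes program: {args[4]}")
    logs = (("stdout", stdout_path), ("stderr", stderr_path),
            ("syslog", syslog_path))
    for label, path in logs:
        report.append(f"--- last {TAIL_LINES} lines of task {label} ---")
        report.extend(tail(path))
    return report


if __name__ == "__main__":
    # Whatever is printed here becomes the debug script's own stdout.
    print("\n".join(summarize(sys.argv[1:])))
```

The script only reads the files it is given, so it can be tried locally by passing four existing file paths on the command line.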

Default Behavior:

For pipes, a default script is run that processes core dumps under gdb, prints the stack trace, and gives information about running threads.
