You cannot change the number of mappers through the new Java API, because there the job is configured through the Job class, which has no method for it. In the old (deprecated) API, you can set the number of mappers with the setNumMapTasks(int n) method on the JobConf object, although the framework treats that value only as a hint. Either way, this is not the best way to set or change the number of mappers.
By default, each slave node runs at most 2 mappers at a time. You can change this value with the mapreduce.tasktracker.map.tasks.maximum parameter, which is set in the mapred-site.xml file. Do not pick a random value for it.
Ideally, an independent mapper (map dynamic container) is launched for each logical InputSplit. With the default setting, the NodeManager on each slave node can run only two mappers or map dynamic containers in parallel, regardless of how many logical input splits there are. Initially, two input splits are assigned to the two map dynamic containers on slave node 1 (SN1), and the remaining input splits wait in a queue. In some cases, a waiting input split may be moved to another slave node (say SN2) that has a map dynamic container sitting idle. That mapper can then process the transferred input split on SN2.
Even if you specify 2 as the number of mappers in the configuration file, the NodeManager does not necessarily launch all of them in parallel. That decision is made by the ResourceManager, based on the input splits available on a particular slave node. The configured value is only an upper bound: that slave node can run at most 2 map dynamic containers in parallel.
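The queueing behaviour described above can be sketched with a little arithmetic. This is only a rough model, assuming every slave node offers the same number of map slots and every input split occupies one slot for one round; real scheduling also depends on data locality and task duration. The class and method names (MapWaves, clusterSlots, waves) are made up for illustration and are not part of Hadoop.

```java
// Simplified model, NOT Hadoop code: each node offers the same number of
// map slots, and each input split occupies one slot for one round ("wave").
class MapWaves {
    /** Map tasks that can run at the same time across the whole cluster. */
    static int clusterSlots(int nodes, int slotsPerNode) {
        return nodes * slotsPerNode;
    }

    /** Rounds needed to work through all input splits. */
    static int waves(int inputSplits, int nodes, int slotsPerNode) {
        int slots = clusterSlots(nodes, slotsPerNode);
        return (inputSplits + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        // Hypothetical job: 10 input splits, 2 slave nodes (SN1, SN2),
        // and the default of 2 map slots per node.
        System.out.println(clusterSlots(2, 2)); // 4
        System.out.println(waves(10, 2, 2));    // 3
    }
}
```

With 2 slots on each of 2 nodes, only 4 splits are processed at a time, so the other 6 wait in the queue until a slot becomes free.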
The following explains how to choose the maximum number of mappers for a slave node so that its resources are used optimally.
When you set up the cluster, decide the maximum number of mappers that should run in parallel on the slave nodes. Basically, the number of mappers is decided by two factors:
1) Number of cores
2) RAM
Let’s say you have 10 cores on your system. You can run 10 mappers if you allot one core to each mapper (one mapper = one core), since each mapper/map dynamic container can run on one core. This does not hold in every case, though.
Let’s say you have 10 cores on your slave node and 25GB of RAM. Your job needs 5GB of memory per map task, so only 25GB / 5GB = 5 map tasks fit in memory at once, and only 5 of the cores on each slave node are used. So you can run at most 5 mappers in parallel. The slave node does not have enough memory to run more than 5 mappers in parallel, even though more cores are available. In this case, the maximum number of mappers is limited by the amount of RAM in your system, not by the number of cores.
If your job requires every map task to be loaded with 5GB of memory, you are wasting cores by having 10 cores on each slave node: only 5 cores are used and the remaining 5 sit idle. Go with either “10 cores with 50GB of RAM” or “5 cores with 25GB of RAM”. Either gives optimal use of resources.
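The sizing rule above boils down to taking the smaller of two limits: one from the cores and one from the memory. Here is a minimal sketch of that arithmetic, assuming one core per mapper and whole-GB values. MapperCapacity and its methods are hypothetical helper names, not anything provided by Hadoop.

```java
// Illustrative arithmetic only, NOT Hadoop code. Assumes one core per mapper.
class MapperCapacity {
    /** Max mappers a node can run in parallel, limited by cores and by RAM. */
    static int maxMappers(int cores, int ramGb, int ramPerMapperGb) {
        int byCores = cores;                   // one mapper = one core
        int byMemory = ramGb / ramPerMapperGb; // whole mappers that fit in RAM
        return Math.min(byCores, byMemory);
    }

    /** Cores left unused once the mapper limit is reached. */
    static int idleCores(int cores, int ramGb, int ramPerMapperGb) {
        return cores - maxMappers(cores, ramGb, ramPerMapperGb);
    }

    public static void main(String[] args) {
        System.out.println(maxMappers(10, 25, 5)); // 5  (memory is the limit)
        System.out.println(idleCores(10, 25, 5));  // 5  (wasted cores)
        System.out.println(maxMappers(10, 50, 5)); // 10 (balanced)
        System.out.println(maxMappers(5, 25, 5));  // 5  (balanced)
    }
}
```

A node is balanced when both limits give the same number, which is the case for the two suggested configurations.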
In general, allot 1 to 1.5 cores to each mapper. If the processing is light, go with 1 core per map dynamic container. If the processing is heavy, go with 1.5 cores per map dynamic container. Also keep the two factors above in mind to arrive at an optimized solution.
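To see how the 1 to 1.5 cores-per-mapper guideline changes the numbers, here is a rough sketch that combines it with the memory limit. The 10-core, 50GB, 5GB-per-mapper figures are the example values used earlier; the class and method names are invented for illustration and do not come from Hadoop.

```java
// Illustrative arithmetic only, NOT Hadoop code.
class CoresPerMapper {
    /** Whole mappers that fit on the cores at the given cores-per-mapper ratio. */
    static int mappersByCores(int cores, double coresPerMapper) {
        if (coresPerMapper <= 0) {
            throw new IllegalArgumentException("coresPerMapper must be positive");
        }
        return (int) Math.floor(cores / coresPerMapper);
    }

    /** Smaller of the core-based limit and the memory-based limit. */
    static int maxMappers(int cores, double coresPerMapper,
                          int ramGb, int ramPerMapperGb) {
        int byMemory = ramGb / ramPerMapperGb;
        return Math.min(mappersByCores(cores, coresPerMapper), byMemory);
    }

    public static void main(String[] args) {
        System.out.println(mappersByCores(10, 1.0));    // 10 (light processing)
        System.out.println(mappersByCores(10, 1.5));    // 6  (heavy processing)
        System.out.println(maxMappers(10, 1.5, 50, 5)); // 6  (cores are the limit)
        System.out.println(maxMappers(10, 1.0, 25, 5)); // 5  (memory is the limit)
    }
}
```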