Distribute By

When we have a large set of data, it is preferable to use SORT BY, as it can use more than one reducer (unlike ORDER BY, which sends all rows through a single reducer to produce a total order).

With SORT BY alone, records of a particular category can appear in every output file. This is not duplicate data: the rows are spread across the reducers and then sorted within each reducer, which is not ideal. So, when you want all the records of the same category to land in one file, use DISTRIBUTE BY, and combine it with SORT BY if the rows must also be sorted within that file.

All rows that have the same values in the DISTRIBUTE BY columns are sent to the same reducer.

hive> select id, name from person distribute by id;
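
DISTRIBUTE BY on its own only decides which reducer each row goes to; it does not order the rows inside a reducer. As a rough sketch, reusing the person table and its id and name columns from the query above, the two clauses can be combined. The reducer count of 2 is an arbitrary value chosen for illustration, not something the original query sets.

```sql
-- Illustrative only: ask for two reducers so the distribution is visible
-- as two output files. The value 2 is an assumption for this sketch.
SET mapreduce.job.reduces=2;

-- Rows with the same id go to the same reducer (DISTRIBUTE BY),
-- and each reducer sorts the rows it receives (SORT BY).
SELECT id, name
FROM person
DISTRIBUTE BY id
SORT BY id, name;

-- When the distribute and sort columns are identical, CLUSTER BY is
-- shorthand for DISTRIBUTE BY id SORT BY id.
SELECT id, name
FROM person
CLUSTER BY id;
```

Each output file is sorted internally, but there is no global order across files; if a total order is required, ORDER BY is the clause that provides it.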
