When we have a large data set, it is preferable to use SORT BY, as it can use more than one reducer (ORDER BY sends everything through a single reducer to guarantee a total order).
With SORT BY alone, records of a particular category can appear in all of the output files. This is not duplicate data: the output is distributed among the reducers and then sorted within each reducer, so each file is sorted but a category is spread across files, which is not ideal. So, when you want all the records of the same category to end up sorted in one file, use DISTRIBUTE BY together with SORT BY.
All rows with the same values in the DISTRIBUTE BY columns are sent to the same reducer. DISTRIBUTE BY on its own does not sort the rows within a reducer.
hive> select id, name from person distribute by id;
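To get each category both grouped on one reducer and sorted there, the two clauses can be combined. The following is a minimal sketch, assuming the same person table with id and name columns as in the query above; the reducer count of 2 is an arbitrary value chosen for illustration, and the property name assumes the MapReduce engine (newer Hadoop versions call it mapreduce.job.reduces).

```sql
-- Assumption: run on the MapReduce engine; 2 reducers is an arbitrary
-- illustrative value so that the distribution becomes visible.
SET mapred.reduce.tasks=2;

-- DISTRIBUTE BY id: all rows with the same id go to the same reducer.
-- SORT BY id, name: each reducer then sorts only the rows it received.
SELECT id, name
FROM person
DISTRIBUTE BY id
SORT BY id, name;
```

Each reducer writes its own output file, so every id appears in exactly one file and the rows inside that file are sorted. The files are not sorted relative to each other; a total order across all rows would still need ORDER BY.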