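With bucketing in Hive, we can divide a dataset into a fixed number of manageable parts. It works much like hashing: each row is assigned to a bucket based on the hash of a chosen column. Even if the size of the dataset varies, the number of buckets stays fixed.
Bucketing also lets us do map-side joins in Hive when the tables being joined are bucketed on the join column.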
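To see why bucketing helps with joins, here is a minimal Python sketch of the idea, not Hive's actual implementation. It assumes both tables are bucketed on the same key with the same number of buckets, and that an integer key is placed with a simple modulo. The names bucketize, bucket_join, users and orders are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; assumed to be the same for both tables


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows into buckets by key modulo num_buckets."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucket_join(left, right, num_buckets=NUM_BUCKETS):
    """Join two bucketed tables one pair of matching buckets at a time."""
    left_b = bucketize(left, num_buckets)
    right_b = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Only the matching right-hand bucket is held in memory as a lookup,
        # which is what makes a map-side join feasible.
        lookup = defaultdict(list)
        for key, value in right_b.get(b, []):
            lookup[key].append(value)
        for key, value in left_b.get(b, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


# Hypothetical sample data: (user_id, name) and (user_id, item)
users = [(1, "alice"), (2, "bob"), (5, "carol")]
orders = [(1, "book"), (5, "pen"), (5, "ink"), (7, "cup")]
print(bucket_join(users, orders))
```

Because equal keys always land in the same bucket number, bucket i of one table only ever needs to be compared with bucket i of the other, never with the whole table.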
For example, let's say we have a table with date as the first-level partition and user_id as the second-level partition.
The date column may produce a small number of partitions, but user_id may produce a very large number. If we have millions of users, there will be millions of second-level partitions and files.
To avoid creating so many partitions, we can bucket on user_id instead of partitioning on it. With bucketing, a hash function assigns each user_id to a bucket, so many different user_ids share the same bucket. This way we can keep the number of buckets manageable no matter how many distinct user_id values there are.
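The bucket assignment can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Hive's exact code: it assumes an integer user_id hashes to its own value and that the bucket is the non-negative hash modulo the bucket count. NUM_BUCKETS and bucket_for are hypothetical names.

```python
from collections import Counter

NUM_BUCKETS = 32  # hypothetical bucket count, fixed when the table is defined


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user_id to a bucket index in the range [0, num_buckets)."""
    # Assumption: an integer key hashes to itself; the mask keeps it non-negative.
    return (user_id & 0x7FFFFFFF) % num_buckets


# One million hypothetical user_ids still land in only NUM_BUCKETS buckets.
counts = Counter(bucket_for(uid) for uid in range(1, 1_000_001))
print("buckets used:", len(counts))
print("rows per bucket (min, max):", min(counts.values()), max(counts.values()))
```

With partitioning, the same million users would need a million second-level partitions. Here the number of buckets is set once and does not grow with the number of users.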