Jan 12, 2020 in Big Data | Hadoop
Q: What is Bucketing in Hive?

1 Answer

Jan 12, 2020

Bucketing in Hive divides a dataset into a fixed number of manageable parts called buckets. It works like hashing: a hash of the bucketing column decides which bucket each row goes into. Even if the size of the dataset varies, the number of buckets stays fixed.

Bucketing also lets Hive do map-side joins between tables that are bucketed on the join column.
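To see why bucketing helps a join, here is a rough Python sketch (not Hive code). It assumes both tables are bucketed on user_id with the same bucket count, and it uses a plain modulo as a stand-in for Hive's hash. The names bucket_of, bucketize, bucket_join and the sample rows are made up for illustration.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; both tables are assumed to use the same count


def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    # Stand-in hash: plain modulo on an integer key (an assumption,
    # not necessarily what a given Hive version does).
    return user_id % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    # Group rows by the bucket of their first field (user_id).
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def bucket_join(left_rows, right_rows, num_buckets=NUM_BUCKETS):
    left = bucketize(left_rows, num_buckets)
    right = bucketize(right_rows, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Only bucket b of the right table is held in memory here,
        # because matching user_ids can only be in the same bucket.
        lookup = defaultdict(list)
        for user_id, name in right[b]:
            lookup[user_id].append(name)
        for user_id, amount in left[b]:
            for name in lookup.get(user_id, []):
                joined.append((user_id, name, amount))
    return joined


orders = [(1, 9.99), (2, 5.00), (5, 12.50), (8, 3.25)]   # (user_id, amount)
users = [(1, "ana"), (5, "bo"), (8, "cy"), (9, "di")]    # (user_id, name)
print(sorted(bucket_join(orders, users)))
```

Each step of the loop only needs one bucket of the second table in memory, which is the idea behind doing the join on the map side.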

For example, say we have a table with date as the first-level partition and user_id as the second-level partition.

The date column may produce a fairly small number of partitions, but user_id can produce a very large number. With millions of users, there will be millions of second-level partitions and files.
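A quick back-of-the-envelope calculation in Python shows the scale. The numbers are hypothetical, and it assumes every user has data on every date, so it is an upper bound.

```python
days = 365          # hypothetical: one year of date partitions
users = 1_000_000   # hypothetical number of distinct user_id values

# Upper bound: one second-level partition per (date, user_id) pair.
second_level_partitions = days * users
print(second_level_partitions)
```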

To avoid creating so many partitions, we can bucket on user_id instead of partitioning on it. With bucketing, a hash function assigns each user_id to a bucket, so we get a manageable, fixed number of buckets for all user_id values.
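In a Hive table definition this is declared with CLUSTERED BY (user_id) INTO n BUCKETS. The assignment itself can be sketched in Python as below. This is only an illustration: the modulo stands in for Hive's hash (which depends on the column type and Hive version), and bucket_for and the bucket count of 8 are made-up values.

```python
from collections import Counter

NUM_BUCKETS = 8  # hypothetical fixed bucket count


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in for hash(user_id) mod num_buckets.
    return user_id % num_buckets


# 100,000 distinct user_ids still land in only NUM_BUCKETS buckets.
counts = Counter(bucket_for(uid) for uid in range(1, 100_001))
print(len(counts))       # number of buckets actually used
print(dict(counts))      # rows per bucket
```

However many user_id values there are, the number of buckets does not grow; only the number of rows per bucket does.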
