In Apache Spark, shuffling is the process of redistributing data across partitions, which can involve moving data between executors (and therefore across the network). The shuffle is implemented quite differently in Spark than in Hadoop MapReduce.
Shuffling has two important compression parameters:
- `spark.shuffle.compress`: controls whether the engine compresses shuffle (map) output files.
- `spark.shuffle.spill.compress`: controls whether data spilled to disk during shuffles is compressed.
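
As a minimal sketch, both flags can be set when building the session (the app name here is hypothetical; in stock Spark both settings already default to `true`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ShuffleCompressionDemo")
  .config("spark.shuffle.compress", "true")        // compress shuffle (map) output files
  .config("spark.shuffle.spill.compress", "true")  // compress data spilled during shuffles
  .getOrCreate()
```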
Shuffling comes into play when we join two tables or perform *ByKey operations such as `groupByKey` or `reduceByKey`, as the sketch below shows.
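
A minimal sketch of a shuffle being triggered (assuming a local session for illustration): `reduceByKey` repartitions records by key so that matching keys from different input partitions land on the same executor, creating a stage boundary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ShuffleDemo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Toy word counts; the pairs start out spread across partitions.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey combines values map-side first, then shuffles the
// partial sums so all records for a key meet on one partition.
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println)  // e.g. (a,2) and (b,1)
spark.stop()
```

Note that `reduceByKey` shuffles only the map-side partial aggregates, whereas `groupByKey` shuffles every record, which is why the former is usually preferred for aggregations.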