0 votes
in Apache Spark by
What is shuffling in Apache Spark? When does it occur?

1 Answer

0 votes
by

In Apache Spark, shuffling is the process of redistributing data across partitions that may lead to data movement across the executors. The implementation of shuffle operation is entirely different in Spark as compared to Hadoop.

Shuffling has two important compression parameters:

  • shuffle.compress: It is used to check whether the engine would compress shuffle outputs or not.
  • shuffle.spill.compress: It is used to decide whether to compress intermediate shuffle spill files or not.
  • Shuffling comes in the scene when we join two tables or perform byKey operations such as GroupByKey or ReduceByKey.

Related questions

0 votes
asked Mar 29, 2022 in Apache Spark by sharadyadav1986
0 votes
asked Mar 29, 2022 in Apache Spark by sharadyadav1986
...