in Apache Spark by

What is the difference between repartition and coalesce?

1 Answer


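Repartition: This method can increase or decrease the number of partitions in an RDD, DataFrame, or Dataset. It always performs a full shuffle, redistributing the data across the cluster, which is expensive because most records are sent over the network. When called with only a partition count, it spreads the data roughly evenly across the new partitions, so it is the usual choice for raising parallelism or fixing skewed partitions.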

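Coalesce: This method only decreases the number of partitions in an RDD, DataFrame, or Dataset. It avoids a full shuffle by merging existing partitions, so less data moves and it is usually cheaper than repartition when reducing the partition count. The trade-off is that the merged partitions can be uneven in size. On a DataFrame, asking coalesce for more partitions than currently exist leaves the count unchanged. On an RDD, coalesce also accepts shuffle = true, and repartition is implemented as that call.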
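
To make the difference in data movement concrete, here is a toy model in plain Python. It is not Spark code and not how Spark is implemented: the function names mirror the Spark methods only for readability, the round-robin placement and the contiguous grouping are simplifying assumptions, and the sample data is made up.

```python
from typing import List

Partitions = List[List[int]]


def repartition(parts: Partitions, n: int) -> Partitions:
    """Toy full shuffle: every record is reassigned, round-robin, to one of n partitions."""
    if n <= 0:
        raise ValueError("n must be positive")
    out: Partitions = [[] for _ in range(n)]
    records = [r for p in parts for r in p]
    for i, r in enumerate(records):
        out[i % n].append(r)  # a record can land anywhere, regardless of where it started
    return out


def coalesce(parts: Partitions, n: int) -> Partitions:
    """Toy merge without shuffle: whole input partitions are concatenated; the count never grows."""
    if n <= 0:
        raise ValueError("n must be positive")
    if n >= len(parts):
        return [list(p) for p in parts]  # cannot increase the number of partitions
    out: Partitions = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        # Assumption: neighbouring input partitions are grouped together.
        out[i * n // len(parts)].extend(p)
    return out


# Hypothetical skewed input: 4 partitions holding 6, 1, 1 and 2 records.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

print([len(p) for p in repartition(skewed, 2)])  # [5, 5]  balanced, but every record was reshuffled
print([len(p) for p in coalesce(skewed, 2)])     # [7, 3]  uneven, but partitions were only merged
print(len(coalesce(skewed, 8)))                  # 4       coalesce does not add partitions
print(len(repartition(skewed, 8)))               # 8       repartition can
```

In Spark itself the corresponding calls are df.repartition(n) and df.coalesce(n), and df.rdd.getNumPartitions() reports the resulting partition count.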

...