Do Compressed Data Sources Like .csv.gz Get Distributed in Apache Spark?

1 Answer

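A single gzip-compressed file such as .csv.gz is not splittable, so Spark reads it serially in one task on one core, however large the file is. Only that initial read is single-threaded: once the data is loaded, it can be repartitioned (for example with repartition on the DataFrame), and later stages then run distributed across the cluster. Gzip files are hard to split because the stream can only be decompressed from the beginning. Splittable formats such as uncompressed CSV are different: they are stored as multiple extents or blocks in Azure Data Lake or a Hadoop file system, and Spark reads those chunks in parallel. If the data is broken up into many compressed files, Spark reads one file per task, so the parallelism of the read is bounded by the number of files.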
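To see why a .csv.gz file cannot be chunked while a plain .csv can, here is a minimal sketch in plain Python (standard library only, no Spark involved). The helper names make_csv, can_start_gzip_at and read_plain_from are made up for this illustration; they only mimic what a split reader has to do, not how Spark implements it.

```python
import gzip
import zlib


def make_csv(rows: int) -> bytes:
    """Build a small two-column CSV in memory (illustrative data only)."""
    lines = ["id,value"] + [f"{i},{i * i}" for i in range(rows)]
    return ("\n".join(lines) + "\n").encode("utf-8")


def can_start_gzip_at(blob: bytes, offset: int) -> bool:
    """Return True if gzip decompression can begin at the given byte offset."""
    try:
        # wbits=31 tells zlib to expect a gzip header at the start of the input.
        zlib.decompressobj(wbits=31).decompress(blob[offset:])
        return True
    except zlib.error:
        return False


def read_plain_from(blob: bytes, offset: int) -> list[str]:
    """Mimic a split reader on uncompressed text: skip the (possibly partial)
    first line after the offset, then return the whole lines that follow."""
    chunk = blob[offset:]
    first_newline = chunk.find(b"\n")
    return chunk[first_newline + 1:].decode("utf-8").splitlines()


csv_bytes = make_csv(1000)
gz_bytes = gzip.compress(csv_bytes)

# Plain text: a reader can jump to the middle and still recover whole rows.
rows_from_middle = read_plain_from(csv_bytes, len(csv_bytes) // 2)
print("rows recovered from the middle of the plain file:", len(rows_from_middle))

# Gzip: decoding works from byte 0 but not from the middle of the stream.
print("gzip readable from start: ", can_start_gzip_at(gz_bytes, 0))
print("gzip readable from middle:", can_start_gzip_at(gz_bytes, len(gz_bytes) // 2))
```

Because a reader cannot begin in the middle of the gzip stream, the whole file has to go to a single task, whereas the plain file can be handed out in byte ranges to several tasks.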
