What is a Parquet file in Spark?

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. It can be used regardless of the data processing framework, data model, or programming language.

It is an efficient, compressed format with built-in encoding schemes, and it is common across projects in the Hadoop ecosystem.

Spark SQL supports both reading and writing Parquet files. Parquet files also automatically preserve the schema of the original data.

During write operations, all columns in a Parquet file are converted to nullable columns by default, for compatibility reasons.

