
Jan 26 in Big Data | Hadoop
Q:

How will you choose among the various file formats for storing and processing data with Apache Hadoop?

1 Answer


 

The choice of a particular file format is based on the following factors:

 

i) Schema evolution: whether the format allows fields to be added, altered, and renamed over time.

 

ii) Usage pattern: reading, say, 5 columns out of 50 versus reading most of the columns in every row.

 

iii) Splittability: whether the file can be split into chunks that are processed in parallel.

 

iv) Read/write/transfer performance versus block-level compression that saves storage space.

 

File formats that can be used with Hadoop include CSV, JSON, SequenceFile, Avro, and columnar formats such as Parquet.

 

CSV Files

 

CSV files are an ideal fit for exchanging data between Hadoop and external systems. It is advisable to avoid header and footer lines when using CSV files with Hadoop, because each mapper reads raw lines from its own split, and a header or footer line is processed as an ordinary data record (see the mapper sketch below).
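To make the header problem concrete, here is a minimal sketch of a MapReduce mapper over a CSV input; the class name, field layout, and the "id," header prefix are hypothetical. The byte-offset check can only catch a header in the split that starts at the beginning of the file, which is exactly why headers are best avoided.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CsvLineMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // TextInputFormat supplies the byte offset of each line as the key.
        // Offset 0 identifies the header only in the very first split, so
        // mappers over other splits would pass a header through as data.
        if (offset.get() == 0 && line.toString().startsWith("id,")) {
            return; // skip the (hypothetical) header row
        }
        String[] fields = line.toString().split(",", -1);
        // Emit the first column as the key and the whole row as the value.
        ctx.write(new Text(fields[0]), line);
    }
}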

 

JSON Files

 

In a JSON file, every line is a complete record. JSON stores both the data and its schema (the field names) together in each record, which enables complete schema evolution and splittability. However, JSON files do not support block-level compression.
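A minimal sketch of what "schema travels with every record" means in practice, parsing one line of a JSON-lines file with Jackson (the field names are made up; assumes jackson-databind is on the classpath):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonLineDemo {
    public static void main(String[] args) throws Exception {
        // One line = one complete record; the field names (the "schema")
        // are repeated inside every record, which is what makes JSON
        // self-describing but also verbose.
        String line = "{\"id\": 1, \"name\": \"alice\", \"city\": \"Pune\"}";
        JsonNode record = new ObjectMapper().readTree(line);
        System.out.println(record.get("name").asText()); // prints: alice
    }
}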

 

Avro Files

 

This file format is best suited for long-term storage with a schema. Avro files store the schema as metadata along with the data, and they also let you specify an independent reader schema when reading the files, which is how Avro supports schema evolution.
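As a minimal sketch (the record and field names are made up; assumes the Avro Java library), this writes one record to an Avro data file. The schema is embedded in the file header, which is what makes the format suitable for long-term storage:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // The schema is written into the file header, so a reader can open
        // the file years later without any external metadata.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"}," +
            "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1L);
        user.put("name", "alice");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}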

 

Parquet Files

 

A columnar file format that supports block-level compression and is optimized for query performance, since queries can select a handful of columns (say, 10 or fewer out of 50+) without reading entire records.
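A minimal sketch of that column selection (the file name and fields are hypothetical; assumes the parquet-avro module): by requesting a projection, only the wanted column chunks are read from disk, and the remaining columns are never materialized.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetProjectionDemo {
    public static void main(String[] args) throws Exception {
        // Request only two columns of a (hypothetically) much wider file.
        Schema projection = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"}," +
            "{\"name\":\"name\",\"type\":\"string\"}]}");

        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path("users.parquet"))
                     .withConf(conf)
                     .build()) {
            GenericRecord rec;
            while ((rec = reader.read()) != null) {
                System.out.println(rec.get("id") + " " + rec.get("name"));
            }
        }
    }
}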

 

Test Your Practical Hadoop Knowledge

Scenario-Based Hadoop Interview Question:

You have a file that contains 200 billion URLs. How will you find the first unique URL using Hadoop MapReduce? 
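One common approach (a sketch, not the only answer): 200 billion URLs cannot be held in memory, so let MapReduce group identical URLs. Assuming a single input file read with TextInputFormat, so a line's byte offset gives its order of appearance, the mapper emits (URL, offset) and the reducer keeps only URLs that occur exactly once; a second, single-reducer pass (not shown) then picks the surviving URL with the smallest offset, i.e. the first unique URL.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueUrl {

    public static class UrlMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text url, Context ctx)
                throws IOException, InterruptedException {
            // The line's byte offset stands in for "order of appearance"
            // (valid for a single input file).
            ctx.write(url, offset);
        }
    }

    public static class UrlReducer
            extends Reducer<Text, LongWritable, LongWritable, Text> {
        @Override
        protected void reduce(Text url, Iterable<LongWritable> offsets, Context ctx)
                throws IOException, InterruptedException {
            long count = 0, first = Long.MAX_VALUE;
            for (LongWritable o : offsets) {
                count++;
                first = Math.min(first, o.get());
            }
            if (count == 1) {
                // URL occurs exactly once; keep its position so a follow-up
                // single-reducer job can select the minimum offset.
                ctx.write(new LongWritable(first), url);
            }
        }
    }
}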

 

See also: Hadoop Hive interview question on finding unique URLs using Hive.

 


 
