Explain Hadoop Archives?

1 Answer


Apache Hadoop HDFS stores and processes very large (terabyte-scale) data sets. However, storing a large number of small files in HDFS is inefficient: each file occupies at least one block, and the metadata for every file and block is held in memory by the NameNode.

Reading many small files also causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each file, which is an inefficient data-access pattern.

Hadoop Archives (HAR files) address this small-files issue. A HAR packs a number of small files into a single larger file, while the original files can still be accessed transparently (without expanding the archive) and in parallel.
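As a sketch, an archive is created with the `hadoop archive` tool; the directory names below are hypothetical:

```shell
# Pack all files under /user/alice/logs into logs.har,
# placing the archive in /user/alice/archives.
# -p sets the parent path relative to which the sources are archived;
# creation runs a MapReduce job, and the source files are NOT deleted.
hadoop archive -archiveName logs.har -p /user/alice/logs /user/alice/archives
```
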

A Hadoop Archive is a special archive format that maps to a file-system directory and always carries the *.har extension. In particular, Hadoop MapReduce can use a Hadoop Archive directly as input.
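The archived files are then addressed through the `har://` file-system scheme, so existing tools and jobs read them without unpacking; the paths below are hypothetical:

```shell
# List the contents of the archive through the har filesystem
# (host-less URI form for an archive stored in HDFS).
hdfs dfs -ls har:///user/alice/archives/logs.har

# The same URI can be supplied as a MapReduce input path, e.g.
# -input har:///user/alice/archives/logs.har
```
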
