+1 vote
in Hadoop
What is a “Distributed Cache” in Apache Hadoop?

1 Answer

0 votes


In Hadoop, data chunks are processed independently and in parallel across the DataNodes, using a program written by the user. If we want a file to be accessible from every DataNode, we put that file into the distributed cache.


The MapReduce framework provides the Distributed Cache to cache files needed by applications. It can cache read-only text files, archives, JAR files, etc.

Once we have cached a file for our job, Hadoop makes it available on every DataNode where map/reduce tasks are running, so we can read the file from within our map and reduce tasks on any node.
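As a sketch of how this looks with the current MapReduce Job API (the file path, host name, and class names here are hypothetical, and running it requires a Hadoop installation):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-example");
        // Register a read-only lookup file with the distributed cache;
        // Hadoop copies it to each worker node before tasks start.
        // (The HDFS path is a placeholder.)
        job.addCacheFile(new URI("hdfs://namenode:8020/data/lookup.txt"));
        // ... set mapper/reducer classes, input/output paths, then submit ...
    }
}

// Inside a mapper, the cached files can be inspected in setup():
class MyMapper extends Mapper<Object, Object, Object, Object> {
    @Override
    protected void setup(Context context)
            throws java.io.IOException, InterruptedException {
        URI[] cached = context.getCacheFiles(); // URIs registered above
        // The local copy can then be opened by its file name,
        // e.g. new java.io.File("lookup.txt").
    }
}
```

Older Hadoop versions exposed the same feature through the `DistributedCache` class (e.g. `DistributedCache.addCacheFile(...)`), which is now deprecated in favor of the `Job` methods shown above.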

An application that wants to use the distributed cache must make sure the files are available at URLs, which can be either hdfs:// or http://. If the file is present at the given URL, the user registers it as a cache file with the distributed cache, and the framework copies it to all the worker nodes before any tasks start on those nodes. By default the size of the distributed cache is 10 GB; it can be adjusted with the local.cache.size property.
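For reference, the size limit can be changed through the classic MapReduce configuration; the value is given in bytes (10 GB is the default). A minimal fragment, assuming the setting lives in mapred-site.xml on your cluster:

```xml
<!-- mapred-site.xml -->
<property>
  <name>local.cache.size</name>
  <!-- 10 GB expressed in bytes (the default) -->
  <value>10737418240</value>
</property>
```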
