The extra read-only data that a Hadoop job needs in order to process the main dataset is referred to as side data. Hadoop offers two techniques for distributing side data:
i) Using the job configuration - This technique should not be used to transfer more than a few kilobytes of data, because it puts pressure on the memory of the Hadoop daemons, particularly if your system is running several Hadoop jobs.
ii) Distributed cache - Rather than serializing side data in the job configuration, it is preferable to distribute larger side data using Hadoop's distributed cache mechanism, which copies the files to the task nodes so that tasks can read them locally.
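
To make the second technique more concrete, here is a minimal sketch of the task-side half of the pattern: parse the side file once into an in-memory map, then consult that map for every record. It uses only the Java standard library so it can run without a cluster. In a real job the file would typically be registered in the driver (for example with Job.addCacheFile(URI) or the -files command-line option) and loaded in the mapper's setup() method. The class name SideDataLookup, the tab-separated file layout, and the comma-separated record format are all assumptions made for illustration, not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: stands in for what a mapper would do with a
// side file that the distributed cache has already copied to the node.
class SideDataLookup {

    // Load "key<TAB>value" lines into memory. In a mapper this would
    // run once per task (in setup()), not once per record.
    static Map<String, String> load(Path sideFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(sideFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip blank or malformed lines
                }
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }

    // Per-record work (the map() step): append the looked-up value to a
    // record whose first comma-separated field is the key.
    static String enrich(String record, Map<String, String> lookup) {
        int comma = record.indexOf(',');
        String key = comma < 0 ? record : record.substring(0, comma);
        return record + "," + lookup.getOrDefault(key, "UNKNOWN");
    }

    public static void main(String[] args) throws IOException {
        // A temporary file simulates the localized copy of the side data.
        Path sideFile = Files.createTempFile("countries", ".tsv");
        Files.write(sideFile, Arrays.asList("US\tUnited States", "IN\tIndia"));

        Map<String, String> lookup = load(sideFile);
        System.out.println(enrich("IN,42", lookup)); // prints IN,42,India

        Files.delete(sideFile);
    }
}
```

Note that the whole lookup table is held in each task's memory, so this approach only suits side data that fits comfortably in the task heap. For the first technique no file is involved at all: the driver would store a small string with Configuration.set(name, value) and the task would read it back with Configuration.get(name).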