Friday, November 9, 2012

Hadoop Architecture – Hadoop Distributed File System - Part 1


A Hadoop cluster is a collection of racks. Every rack holds a number of nodes, which are simply computers. Nodes are grouped into racks, and the collection of racks makes up the Hadoop cluster.
Hadoop Cluster









Hadoop has two major components:
1. Hadoop Distributed File System, a.k.a. HDFS
2. MapReduce
I will be discussing HDFS in more detail in this post. HDFS runs on top of the existing file system on each node in a Hadoop cluster. Hadoop works best with very large files. The larger the file, the less time Hadoop spends seeking to the next data location on disk, and the more time it runs at the limit of the bandwidth of your disks.

Seeks are generally expensive operations that are useful when you only need to analyze a small subset of your dataset. Since Hadoop is designed to run over your entire dataset, it is best to minimize seeks by using large files. Hadoop is designed for streaming or sequential data access rather than random access. Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block and reads sequentially from there. Hadoop uses blocks to store a file or parts of a file.
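To get a feel for why larger reads keep the disk busy with useful work, here is a rough back-of-the-envelope sketch. The 10 ms seek time and 100 MB/s transfer rate are assumptions I picked for illustration, not measurements from any particular cluster, and `seek_overhead` is a hypothetical helper, not part of Hadoop.

```python
def seek_overhead(read_size_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of total read time spent on the single initial seek.

    seek_ms and transfer_mb_per_s are illustrative assumptions about the disk.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = read_size_mb / transfer_mb_per_s
    return seek_s / (seek_s + transfer_s)


# Compare a small read with reads the size of typical HDFS blocks.
for size_mb in (1, 64, 128):
    share = seek_overhead(size_mb)
    print(f"{size_mb:>4} MB sequential read: {share:.2%} of the time is seeking")
```

Under these assumed numbers, a 1 MB read spends about half its time seeking, while a 64 MB or 128 MB sequential read spends only a small fraction of its time on the seek.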









HDFS-Blocks
The default block size is 64 megabytes, and most systems run with block sizes of 128 megabytes or larger. A Hadoop block is a file on the underlying file system. Since the underlying file system also stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system, as shown in the figure.
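As a quick sanity check on "many blocks in the underlying file system", the arithmetic can be sketched like this. The 4 KB block size for the underlying file system is an assumption (a common value on local file systems, but yours may differ), and `fs_blocks_per_hdfs_block` is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB


def fs_blocks_per_hdfs_block(hdfs_block_bytes, fs_block_bytes=4 * KB):
    """Number of underlying file system blocks needed to hold one HDFS block.

    fs_block_bytes defaults to 4 KB, which is an assumption for illustration.
    """
    # Ceiling division: a partially filled file system block still counts.
    return -(-hdfs_block_bytes // fs_block_bytes)


for hdfs_block_mb in (64, 128):
    count = fs_blocks_per_hdfs_block(hdfs_block_mb * MB)
    print(f"{hdfs_block_mb} MB HDFS block = {count} file system blocks of 4 KB")
```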





Advantages of Blocks
1. Because blocks have a fixed size, it is easy to calculate how many fit on a disk.
2. A file can be larger than any single disk in the network, because its blocks can be spread across many disks.
3. If a file is smaller than the block size, only the space it needs is used. This mainly matters for the last block of a file. E.g. to store a 420 MB file, the split is shown below:
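The 420 MB example can also be worked out with a few lines of code. This is only a sketch of the arithmetic, not how HDFS itself is implemented; `split_into_blocks` is a hypothetical helper, and the block size is a parameter so you can try both the 64 MB default and 128 MB.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks needed to store a file.

    Every block is full except possibly the last one, which only
    takes up as much space as the remaining data needs.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the smaller final block
    return blocks


print(split_into_blocks(420, 64))   # block size of 64 MB
print(split_into_blocks(420, 128))  # block size of 128 MB
```

With 64 MB blocks the file takes six full blocks plus a 36 MB block; with 128 MB blocks it takes three full blocks plus the same 36 MB remainder.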





1 comment:

Brent Salisbury said...

Great series!!!
Thanks,
-Brent