Friday, November 9, 2012

Hadoop Architecture – Hadoop Distributed File System - Part 1


A Hadoop cluster is a collection of racks. Each rack holds a number of nodes, which are simply computers. The nodes in the same rack are grouped together, and the collection of racks makes up the Hadoop cluster.
Hadoop Cluster
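
As a rough illustration of the cluster, rack, and node relationship described above, here is a minimal Python sketch. The rack and node names, and the helper functions `all_nodes` and `rack_of`, are made up for this example and are not part of any Hadoop API.

```python
# Hypothetical cluster layout: each rack name maps to the nodes it holds.
cluster = {
    "rack-1": ["node-1", "node-2", "node-3"],
    "rack-2": ["node-4", "node-5", "node-6"],
}


def all_nodes(cluster):
    """Return every node in the cluster, rack by rack."""
    return [node for nodes in cluster.values() for node in nodes]


def rack_of(cluster, node):
    """Return the name of the rack that holds the given node, or None."""
    for rack, nodes in cluster.items():
        if node in nodes:
            return rack
    return None


print(all_nodes(cluster))
print(rack_of(cluster, "node-5"))
```

A plain dictionary is enough here because the only point is that nodes belong to racks and racks belong to the cluster.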









Hadoop has two major components:
1. Hadoop Distributed File System, a.k.a. HDFS
2. MapReduce
I will be discussing more on HDFS in this post. HDFS runs on top of the existing file systems on each node in a Hadoop cluster. Hadoop works best with very large files. The larger the file, the less time Hadoop spends seeking the next data location on disk and the more time it runs at the limit of your disks' bandwidth.

Seeks are generally expensive operations that are useful when you only need to analyze a small subset of your dataset. Since Hadoop is designed to run over your entire dataset, it is best to minimize seeks by using large files. Hadoop is designed for streaming or sequential data access rather than random access. Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block and reads sequentially from there. Hadoop uses blocks to store a file or parts of a file.
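
To make the seek-versus-streaming trade-off concrete, here is a small back-of-the-envelope sketch in Python. The 10 ms seek time and 100 MB/s transfer rate are assumed round numbers chosen for illustration, not measurements, and `seek_overhead` is a hypothetical helper name.

```python
def seek_overhead(block_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of the time to read one block that is spent seeking.

    Assumes one seek to the start of the block followed by a purely
    sequential read. The default seek time and transfer rate are
    illustrative assumptions only.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = block_mb / transfer_mb_per_s
    return seek_s / (seek_s + transfer_s)


for size_mb in (1, 64, 128):
    share = seek_overhead(size_mb)
    print(f"{size_mb:>4} MB block: {share:.1%} of read time spent seeking")
```

Under these assumptions, the larger the block, the smaller the share of time lost to the single seek at its start, which is the reason large files and large blocks suit Hadoop.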









HDFS-Blocks
The default block size is 64 megabytes, and most systems run with block sizes of 128 megabytes or larger. A Hadoop block is a file on the underlying file system. Since the underlying file system stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system, as shown in the figure.
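
A quick calculation shows how many underlying file system blocks one Hadoop block can span. The 4 KB file system block size below is an assumption (a common value on local file systems), and `underlying_blocks` is a hypothetical helper written for this sketch.

```python
def underlying_blocks(hdfs_block_mb, fs_block_kb=4):
    """Number of local file system blocks needed to hold one full HDFS block.

    fs_block_kb=4 is an assumed, typical local block size.
    """
    hdfs_block_bytes = hdfs_block_mb * 1024 * 1024
    fs_block_bytes = fs_block_kb * 1024
    # Ceiling division: a partly filled file system block still counts.
    return -(-hdfs_block_bytes // fs_block_bytes)


print(underlying_blocks(64))   # 64 MB HDFS block
print(underlying_blocks(128))  # 128 MB HDFS block
```

With these numbers, a full 64 MB Hadoop block occupies 16,384 file system blocks of 4 KB each, and a 128 MB block occupies 32,768.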





Advantages of Blocks
1. It is easy to calculate how many blocks can fit on a disk.
2. A file may be larger than any single disk in the network.
3. If a file is smaller than the block size, only the space it needs is used. This is mainly the case for the last block of a file. E.g. to store a 420 MB file with the default 64 MB block size, the split is six full 64 MB blocks followed by one 36 MB block.
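
The 420 MB example can be worked through with a short Python sketch. The function names `split_into_blocks` and `blocks_that_fit` are hypothetical, and the 1 TB disk size is an assumed figure used only to illustrate the first advantage.

```python
def split_into_blocks(file_mb, block_mb=64):
    """Return the sizes (in MB) of the blocks needed to store a file.

    Every block is full except possibly the last one, which only
    takes up the space that remains.
    """
    full_blocks, remainder = divmod(file_mb, block_mb)
    blocks = [block_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


def blocks_that_fit(disk_mb, block_mb=64):
    """Number of full blocks that fit on a disk of the given size."""
    return disk_mb // block_mb


print(split_into_blocks(420))        # 64 MB blocks
print(split_into_blocks(420, 128))   # 128 MB blocks
print(blocks_that_fit(1024 * 1024))  # assumed 1 TB disk, 64 MB blocks
```

With 64 MB blocks the file takes six full blocks and a final 36 MB block; with 128 MB blocks it takes three full blocks and the same 36 MB final block.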




