Sunday, November 11, 2012

Hadoop Architecture – Types of Hadoop Nodes in Cluster - Part 2

Continuing from the previous post (Hadoop Architecture – Hadoop Distributed File System), a Hadoop cluster is made up of the following main node types:
1. NameNode
2. DataNode
3. JobTracker
4. TaskTracker

The above is the logical architecture of the Hadoop nodes. Physically, however, a DataNode and a TaskTracker are often placed on the same machine, as shown in the diagram below.

There are a few other auxiliary node types, namely the Secondary NameNode, the Backup Node, and the Checkpoint Node. The diagram above shows some of the communication paths between the different types of nodes in a Hadoop cluster: a client communicates with the JobTracker, with the NameNode, and with any DataNode.

There is only one active NameNode in the cluster. A redundant NameNode can be provisioned, but failover to it has to be performed manually. While a file's data is stored in blocks on the DataNodes, the metadata for the file is stored on the NameNode. If there is one node in the cluster worth spending money on for the best enterprise-grade hardware, it is the NameNode. The NameNode should also have as much RAM as possible, because it keeps the entire filesystem metadata in memory; DataNodes, by contrast, can run on commodity hardware.
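The split between NameNode metadata and DataNode storage can be pictured with a small sketch. This is a conceptual Python simulation, not Hadoop's actual data structures; all class and variable names here are illustrative.

```python
# Conceptual sketch: the NameNode holds only metadata (in RAM), while
# the actual block bytes live on DataNodes. Illustrative names only.

class NameNode:
    def __init__(self):
        # file path -> ordered list of block IDs
        self.file_blocks = {}
        # block ID -> set of DataNode IDs holding a replica
        self.block_locations = {}

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def register_replica(self, block_id, datanode_id):
        # DataNodes report their blocks; the NameNode records locations.
        self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def get_block_locations(self, path):
        """Return (block_id, replica locations) pairs for a file."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_blocks[path]]

nn = NameNode()
nn.add_file("/logs/2012-11-11.log", ["blk_1", "blk_2"])
for dn in ("dn1", "dn2", "dn3"):
    nn.register_replica("blk_1", dn)
for dn in ("dn2", "dn4", "dn5"):
    nn.register_replica("blk_2", dn)

print(nn.get_block_locations("/logs/2012-11-11.log"))
# [('blk_1', ['dn1', 'dn2', 'dn3']), ('blk_2', ['dn2', 'dn4', 'dn5'])]
```

Note that the NameNode never stores file contents, only the mapping from files to blocks and from blocks to DataNodes — which is exactly why its entire state fits (and must fit) in memory.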

A typical HDFS cluster has many DataNodes. They store the blocks of data. When a client requests a file, it first finds out from the NameNode which DataNodes store the blocks that make up that file, and then reads the blocks directly from those individual DataNodes. Each DataNode also reports to the NameNode periodically with the list of blocks it stores.

The JobTracker node manages MapReduce jobs, and there is only one of these per cluster. It receives jobs submitted by clients, schedules the Map and Reduce tasks on the appropriate TaskTrackers in a rack-aware manner (Hadoop knows the network topology), and monitors for failing tasks that need to be rescheduled on a different TaskTracker. To achieve parallelism for your map and reduce tasks, there are many TaskTrackers in a Hadoop cluster. Each TaskTracker spawns Java Virtual Machines to run your map or reduce tasks.
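The read path described above can be sketched in a few lines. This is a toy Python simulation of the protocol, with hypothetical dictionaries standing in for the NameNode and DataNodes; it is not the real HDFS client API.

```python
# Illustrative sketch of the HDFS read path: one metadata round-trip to
# the NameNode, then block reads straight from the DataNodes. The
# dictionaries below are stand-ins, not real Hadoop structures.

# NameNode view: file path -> [(block ID, replica DataNodes), ...]
namenode_metadata = {
    "/data/file.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3", "dn1"])],
}

# DataNode view: node -> {block ID: raw bytes}
datanode_storage = {
    "dn1": {"blk_1": b"Hello, ", "blk_2": b"world!"},
    "dn2": {"blk_1": b"Hello, "},
    "dn3": {"blk_2": b"world!"},
}

def read_file(path):
    # Step 1: ask the NameNode which blocks make up the file and where
    # their replicas live.
    blocks = namenode_metadata[path]
    data = b""
    # Step 2: fetch each block directly from a replica; the NameNode
    # never handles the file's bytes.
    for block_id, replicas in blocks:
        data += datanode_storage[replicas[0]][block_id]
    return data

print(read_file("/data/file.txt").decode())  # Hello, world!
```

The key property this illustrates is that the NameNode sits only on the metadata path, so file reads scale with the number of DataNodes rather than bottlenecking on the single NameNode.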
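The JobTracker's rack-aware placement can also be sketched as a simple preference order: run a task where the data is, else in the same rack, else anywhere. This is a toy illustration of the idea only; the real scheduler is far more involved, and all names here are invented.

```python
# Toy sketch of rack-aware task placement: prefer data-local, then
# rack-local, then off-rack. Purely illustrative, not Hadoop's scheduler.

def pick_tracker(block_replicas, free_trackers, racks):
    """block_replicas: nodes holding the block's replicas.
    free_trackers: nodes with a free TaskTracker slot.
    racks: node -> rack ID (the network topology Hadoop knows)."""
    # 1. Data-local: a free TaskTracker on a node that holds the block.
    for node in block_replicas:
        if node in free_trackers:
            return node
    # 2. Rack-local: a free TaskTracker in the same rack as a replica.
    replica_racks = {racks[n] for n in block_replicas}
    for node in free_trackers:
        if racks[node] in replica_racks:
            return node
    # 3. Off-rack: any free TaskTracker.
    return free_trackers[0]

racks = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
print(pick_tracker(["dn1"], ["dn2", "dn3"], racks))  # dn2 (rack-local)
print(pick_tracker(["dn3"], ["dn3", "dn4"], racks))  # dn3 (data-local)
```

Data-local placement matters because moving computation to a free slot next to the block is far cheaper than moving a 64 MB block across the network to the computation.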
