Monday, 28 March 2016

Apache Hadoop HDFS Architecture



HDFS achieves its scale-out capability through the interaction of two main components: NameNodes, which manage and store the file system metadata, such as access permissions, modification times, and data locations (i.e. data blocks); and DataNodes, which store and retrieve the data itself on local, direct-attached disks.

HDFS, or the Hadoop Distributed File System, is a block-structured file system in which each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or more machines (DataNodes). Although several DataNodes can run on a single machine, in practice they are spread across many machines.
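The block-splitting idea can be sketched in a few lines of Python. This is purely illustrative, not Hadoop code; the real default block size is 128 MB in Hadoop 2.x (64 MB in 1.x), and a tiny size is used here just so the output is readable.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Divide a byte stream into fixed-size blocks; the last block may be smaller."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 300-byte "file" with a (toy) 128-byte block size yields blocks of
# 128, 128, and 44 bytes -- only the final block is partially filled.
blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # -> [128, 128, 44]
```

Note that HDFS does not pad the last block to full size: a file occupies only as much space as its actual data, plus replication.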

The Apache Hadoop HDFS architecture follows a master/slave design, where a cluster comprises a single NameNode (the master node) and a number of DataNodes (the slave nodes). HDFS is written in Java, so it can be deployed on the broad spectrum of machines that support a JVM.
In other words, an HDFS cluster has two types of nodes operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (the workers).

NameNode:
The NameNode is the master of HDFS; it maintains and manages the blocks present on the DataNodes (slave nodes). Think of the NameNode as enterprise-class hardware: a highly available, and therefore expensive, server that manages the file system namespace and controls access to files by clients. The HDFS architecture is built in such a way that user data is never stored on the NameNode. The NameNode handles activities such as new file creation, block placement assignments, and metadata changes such as file system hierarchy updates and file renaming. In order to scale to the thousands of servers deployed in large Hadoop clusters, the NameNode does not participate in the actual read and write flows; the individual DataNodes interact directly with the requestor for all read and write activity.
Companies such as Yahoo and Facebook store over a hundred petabytes of data, representing hundreds of millions of files, across thousands of DataNodes. A 10,000-node HDFS cluster with a single NameNode server can be expected to handle up to 100,000 concurrent readers and 10,000 concurrent writers.

Although Hadoop 2 introduces concepts such as multiple NameNodes and the Secondary NameNode, we will cover those topics later.
Let’s list the main functions of a NameNode:
1. The NameNode maintains and manages the file system namespace. Any modification to the file system namespace or its properties is tracked by the NameNode.
2. It directs the DataNodes (slave nodes) to execute low-level I/O operations.
3. It keeps a record of how the files in HDFS are divided into blocks and in which nodes those blocks are stored, and it manages the cluster configuration.
4. It maps a file name to a set of blocks, and maps each block to the DataNodes where it is located.
5. It records the metadata of all the files stored in the cluster, e.g. the locations and sizes of files, permissions, hierarchy, etc.
6. With the help of a transaction log, the EditLog, the NameNode records every change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode immediately records this in the EditLog.
7. The NameNode is also responsible for the replication factor of all the blocks. If the replication factor of any block changes, the NameNode records this in the EditLog.
8. The NameNode regularly receives a Heartbeat and a Blockreport from every DataNode in the cluster to verify that the DataNodes are working properly. A Blockreport contains a list of all blocks on a DataNode.
9. In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
Without the NameNode, the HDFS file system cannot be used.
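The two central mappings from the list above (file → blocks, block → DataNodes) and the EditLog can be sketched as a toy in-memory model. Class and variable names here such as `SimpleNameNode` are illustrative inventions, not Hadoop APIs.

```python
# Hypothetical in-memory sketch of NameNode metadata -- NOT Hadoop code.
class SimpleNameNode:
    def __init__(self):
        self.file_to_blocks = {}   # "/path" -> ["blk_1", ...]
        self.block_to_nodes = {}   # "blk_1" -> ["dn1", "dn2", "dn3"]
        self.edit_log = []         # append-only record of metadata changes

    def create_file(self, path, block_ids, replicas_per_block):
        """Register a new file's blocks and their replica placements."""
        self.file_to_blocks[path] = list(block_ids)
        for blk, nodes in zip(block_ids, replicas_per_block):
            self.block_to_nodes[blk] = list(nodes)
        self.edit_log.append(("CREATE", path))

    def delete_file(self, path):
        """Drop the file's metadata and log the change, as in item 6 above."""
        for blk in self.file_to_blocks.pop(path, []):
            self.block_to_nodes.pop(blk, None)
        self.edit_log.append(("DELETE", path))

    def locate(self, path):
        # Clients only ask the NameNode for block locations; the actual
        # data transfer then goes directly to the DataNodes.
        return [(blk, self.block_to_nodes[blk]) for blk in self.file_to_blocks[path]]

nn = SimpleNameNode()
nn.create_file("/logs/a.log", ["blk_1", "blk_2"],
               [["dn1", "dn2", "dn3"], ["dn2", "dn3", "dn4"]])
print(nn.locate("/logs/a.log"))
```

The key design point this models is that the NameNode holds only metadata, never block contents, which is why a single server can coordinate thousands of DataNodes.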

DataNodes:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system, like an average computer with a decent configuration, that is not of high quality or high availability. A DataNode is a block server that stores the data in a local file system such as ext3 or ext4.
The DataNode service is the real workhorse of HDFS, as this service handles all the storage and retrieval of Hadoop data. Once the NameNode reports which DataNode or DataNodes contain a given file’s data blocks, all the actual data communication streams directly between the client application or framework and the DataNode housing that particular block.
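This read path can be sketched end to end. Everything below is a self-contained toy, not Hadoop code: the metadata dict stands in for the NameNode, and the per-node dicts stand in for block files on DataNode disks.

```python
# Toy HDFS read path: metadata lookup at the "NameNode", then data
# fetched straight from the "DataNodes" -- the NameNode never sees data.
name_node_metadata = {"/data/file.txt": [("blk_1", "dn1"), ("blk_2", "dn2")]}
data_node_storage = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_2": b"world"},
}

def read_file(path):
    content = b""
    # Step 1: ask the NameNode which blocks make up the file and where they live.
    for block_id, node in name_node_metadata[path]:
        # Step 2: stream each block directly from the DataNode holding it.
        content += data_node_storage[node][block_id]
    return content

print(read_file("/data/file.txt"))  # -> b'hello world'
```

Separating the metadata hop from the data hop is what keeps the NameNode off the critical bandwidth path, as described above.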

Functions of DataNodes:
1. DataNodes serve the low-level read and write requests from the file system’s clients.
2. They are also responsible for creating, deleting, and replicating blocks based on the decisions taken by the NameNode.
3. They regularly send the NameNode a report on all the blocks they hold.
4. DataNodes also enable pipelining of data.
5. They forward data to other specified DataNodes.
6. DataNodes send heartbeats to the NameNode once every 3 seconds to report the overall health of HDFS.
7. A DataNode stores each block of HDFS data in a separate file in its local file system.
8. When a DataNode starts up, it scans through its local file system, generates a list of all the HDFS data blocks that correspond to these local files, and sends a Blockreport to the NameNode.
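The heartbeat mechanism from item 6 can be sketched from the NameNode's point of view. The 3-second interval comes from the text; the dead-node timeout below is an assumed round figure (Hadoop's actual default works out to roughly ten minutes, derived from its recheck-interval settings).

```python
# Illustrative heartbeat liveness check -- not Hadoop code.
HEARTBEAT_INTERVAL = 3      # seconds between heartbeats, per the text
DEAD_NODE_TIMEOUT = 600     # assumed: seconds of silence before a node is "dead"

def live_nodes(last_heartbeat, now):
    """Return the DataNodes whose last heartbeat arrived within the timeout."""
    return [node for node, ts in sorted(last_heartbeat.items())
            if now - ts <= DEAD_NODE_TIMEOUT]

# dn2 last reported 606 s ago, past the timeout, so it is excluded.
heartbeats = {"dn1": 1000.0, "dn2": 395.0, "dn3": 998.0}
print(live_nodes(heartbeats, now=1001.0))  # -> ['dn1', 'dn3']
```

Once a node is declared dead, the NameNode re-replicates its blocks onto surviving DataNodes, which is the recovery behavior described in item 9 of the NameNode functions.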
