HDFS achieves its scale-out capability through the interaction of two main components: NameNodes, which manage and store the file system metadata, such as access permissions, modification times, and data locations (i.e. data blocks), and DataNodes, which store and retrieve the data itself on local, direct-attached disks.
Apache Hadoop HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and a number of DataNodes (slave nodes). HDFS is implemented in the Java programming language, so it can be deployed on a broad spectrum of machines that support the Java Virtual Machine (JVM). In other words, an HDFS cluster has two types of nodes operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (the workers).
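Since HDFS exposes its functionality through a Java client API, a short sketch can make this division of labor concrete. The example below is a minimal, hypothetical client (the NameNode hostname, port, and path are placeholders) that asks the NameNode a pure metadata question:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnect {
    public static void main(String[] args) throws Exception {
        // The NameNode address below is a placeholder; 8020 is the
        // conventional NameNode RPC port.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

        // exists() is a pure metadata call: only the NameNode is consulted,
        // no DataNode is involved.
        System.out.println(fs.exists(new Path("/user/hadoop")));
        fs.close();
    }
}
```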
NameNode:
The NameNode is the master of HDFS: it maintains and manages the blocks present on the DataNodes (slave nodes). Think of the NameNode as enterprise-class hardware; it is a highly available, and therefore expensive, server that manages the file system namespace and controls access to files by clients. The HDFS architecture is built in such a way that user data is never stored on the NameNode.
The NameNode handles activities such as new file creation and block placement, as well as metadata changes such as file system hierarchy updates and file renaming. In order to scale to the thousands of servers deployed in large Hadoop clusters, the NameNode does not participate in the actual read and write flows; the individual DataNodes interact directly with the requestor for all read and write activity.
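To see why the NameNode can stay out of the data path, note that namespace operations never touch a DataNode at all. The minimal sketch below (the paths are hypothetical, and it assumes fs.defaultFS in the loaded configuration points at the cluster) performs three metadata-only operations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the loaded configuration points at the cluster.
        FileSystem fs = FileSystem.get(new Configuration());

        // Each call below is answered by the NameNode alone: it updates the
        // in-memory namespace and the EditLog, but moves no user data.
        fs.mkdirs(new Path("/user/demo/reports"));               // create a directory
        fs.rename(new Path("/user/demo/reports"),
                  new Path("/user/demo/archive"));               // rename in the namespace
        fs.delete(new Path("/user/demo/archive"), true);         // recursive delete
        fs.close();
    }
}
```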
Companies such as Yahoo and Facebook store over a hundred petabytes of data, representing hundreds of millions of files, across thousands of DataNodes. A 10,000-node HDFS cluster with a single NameNode server can expect to handle up to 100,000 concurrent readers and 10,000 concurrent writers.
Although Hadoop 2 introduces the concepts of multiple NameNodes and the Secondary NameNode, we will cover those topics later on.
Let's list the various functions of the NameNode:
1. The NameNode maintains and manages the file system namespace. Any modification to the file system namespace or its properties is tracked by the NameNode.
2. It directs the DataNodes (slave nodes) to execute the low-level I/O operations.
3. It keeps a record of how the files in HDFS are divided into blocks and on which nodes those blocks are stored, and it manages the cluster configuration.
4. It maps a file name to a set of blocks and maps each block to the DataNodes where it is located (see the sketch after this list).
5. It records the metadata of all the files stored in the cluster, e.g. the location, the size of the files, permissions, hierarchy, etc.
6. With the help of a transactional log, the EditLog, the NameNode records each and every change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode immediately records this in the EditLog.
7. The NameNode is also responsible for the replication factor of all the blocks. If the replication factor of any block changes, the NameNode records this in the EditLog.
8. The NameNode regularly receives a Heartbeat and a Blockreport from every DataNode in the cluster to make sure the DataNodes are working properly. A Blockreport contains a list of all the blocks on a DataNode.
9. In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
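The file-to-blocks-to-DataNodes mapping from point 4 is directly visible through the client API. Here is a minimal sketch, assuming an existing file at a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMapping {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt");  // hypothetical file

        // Ask the NameNode for the file -> blocks -> DataNodes mapping.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each entry names the DataNodes holding a replica of this block.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```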
Without the NameNode, the HDFS filesystem cannot be used.
DataNodes:
DataNodes are the slave nodes in HDFS, much like any average computer with a decent configuration. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive system that is not of especially high quality or high availability. A DataNode is a block server that stores the data in a local file system such as ext3 or ext4.
The DataNode service is the real workhorse of HDFS, as this service handles all the storage and retrieval of Hadoop data. Once the NameNode identifies which DataNode or DataNodes contain a given file's data blocks, all the actual data communication streams directly between the client application or framework and the DataNode housing that particular data block.
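This direct client-to-DataNode streaming is worth seeing in code. In the sketch below (the path is a placeholder), the open() call is a metadata request to the NameNode, while the subsequent byte transfer streams from the DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode for the block locations; the bytes then
        // stream directly from the DataNodes to this client.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/data.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```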
Functions of DataNodes:
1. DataNodes perform the low-level read and write requests from the file system's clients.
2. They are also responsible for creating blocks, deleting blocks, and replicating them based on decisions taken by the NameNode.
3. They regularly send a report on all the blocks present in the cluster to the NameNode.
4. DataNodes also enable pipelining of data (see the sketch after this list).
5. They forward data to other specified DataNodes.
6. DataNodes send heartbeats to the NameNode once every 3 seconds to report the overall health of HDFS.
7. A DataNode stores each block of HDFS data in a separate file in its local file system.
8. When a DataNode starts up, it scans through its local file system, creates a list of all the HDFS data blocks that relate to each of these local files, and sends a Blockreport to the NameNode.
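Points 4 and 5 describe the write pipeline: the client sends data to the first DataNode, which forwards it to the next replica in the chain. Here is a minimal write sketch (the path is a placeholder, and a reachable cluster is assumed):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client streams packets to the first DataNode in the pipeline;
        // that DataNode forwards each packet to the next replica, and so on,
        // until the replication factor set by the NameNode is satisfied.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/pipeline.txt"), true)) {
            out.write("hello, pipeline".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```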