HDFS Anatomy of a File Read
When a client reads a file in HDFS:
1. The client contacts the namenode daemon to get the locations of the data blocks of the file it wants to read.
2. The namenode daemon returns the list of addresses of the datanodes that hold those data blocks.
3. For each block, HDFS tries to return the datanode that is closest to the client. Here, closest refers to network proximity between the datanode daemon and the client.
4. Once the client has the list, it connects to the closest datanode daemon and starts reading the data block over a stream.
5. After the block has been read completely, the connection to that datanode is closed, the datanode daemon hosting the next block in the sequence is identified, and that block is streamed in turn. This continues until the last data block of the file has been read.
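The block-selection loop described above can be sketched as a small, self-contained simulation. The `Replica` record and the `networkDistance` field here are illustrative stand-ins, not real Hadoop classes; the sketch only shows the idea of picking the nearest replica per block and streaming blocks in order:

```java
import java.util.Comparator;
import java.util.List;

public class ReadSequenceSketch {
    // Illustrative stand-in for a datanode hosting one replica of a block.
    public record Replica(String host, int networkDistance, String blockData) {}

    // For each block in order, pick the replica closest to the client
    // and "stream" its contents, mimicking steps 3-5 above.
    public static String readFile(List<List<Replica>> blockLocations) {
        StringBuilder file = new StringBuilder();
        for (List<Replica> replicas : blockLocations) {
            // Step 3: prefer the replica with the smallest network distance.
            Replica closest = replicas.stream()
                    .min(Comparator.comparingInt(Replica::networkDistance))
                    .orElseThrow();
            // Steps 4-5: read this block fully, then move to the next one.
            file.append(closest.blockData());
        }
        return file.toString();
    }

    public static void main(String[] args) {
        List<List<Replica>> locations = List.of(
                List.of(new Replica("dn1", 4, "Hello, "), new Replica("dn2", 0, "Hello, ")),
                List.of(new Replica("dn3", 2, "HDFS!"), new Replica("dn1", 6, "HDFS!")));
        System.out.println(readFile(locations)); // prints "Hello, HDFS!"
    }
}
```

In the real system the "distance" is computed from the cluster's network topology (same node, same rack, different rack), but the per-block nearest-replica choice is the same.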
Anatomy of File Read:
1. First, the client opens the file by calling open() on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
2. DistributedFileSystem calls the namenode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block; the client then interacts with those datanodes to read the file. The namenode also provides a token to the client, which the client presents to each datanode for authentication. DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
3. The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks of the file, connects to the closest datanode holding the first block.
4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
5. When the end of the block is reached, DFSInputStream closes the connection to the datanode and finds the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
6. Blocks are read in order, with DFSInputStream opening new connections to datanodes as the client reads through the stream. It also calls the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
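Steps 1-6 correspond to just a few lines of client code against the Hadoop FileSystem API. The sketch below assumes a reachable HDFS cluster and the Hadoop client libraries on the classpath, so it will not compile or run standalone; the namenode address and file path are placeholders:

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Step 1: get the FileSystem object (a DistributedFileSystem for hdfs:// URIs).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        InputStream in = null;
        try {
            // Step 2: open() triggers the RPC to the namenode for block locations
            // and returns an FSDataInputStream wrapping a DFSInputStream.
            in = fs.open(new Path("/user/example/input.txt"));
            // Steps 3-6: read() streams the blocks from datanodes transparently.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in); // close() on the FSDataInputStream
        }
    }
}
```

All of the block-location and datanode-selection machinery described above happens behind this one open()/read()/close() sequence.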
Note:
Failure: If DFSInputStream encounters an error while communicating with a datanode during a read, it tries the next closest datanode for that block. It also remembers datanodes that have failed, so that it does not needlessly retry them for later blocks.
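This failover behavior can be sketched as a small simulation. The `DataNode` record and the `healthy` flag are illustrative stand-ins, not real Hadoop classes; the point is the two rules above: fall through to the next replica on failure, and skip nodes already known to be bad:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReadFailoverSketch {
    // Illustrative stand-in: a datanode that may fail when read from.
    public record DataNode(String host, boolean healthy) {}

    // Try replicas in preference order, skipping nodes already known dead.
    // Returns the host that served the block; failures are recorded in deadNodes.
    public static String readBlock(List<DataNode> replicas, Set<String> deadNodes) {
        for (DataNode dn : replicas) {
            if (deadNodes.contains(dn.host())) continue; // don't retry known-bad nodes
            if (!dn.healthy()) {                          // simulated communication error
                deadNodes.add(dn.host());                 // remember failure for later blocks
                continue;                                 // fall through to next closest replica
            }
            return dn.host();                             // block served successfully
        }
        throw new IllegalStateException("no live replica for block");
    }

    public static void main(String[] args) {
        Set<String> dead = new HashSet<>();
        List<DataNode> block1 = List.of(new DataNode("dn1", false), new DataNode("dn2", true));
        List<DataNode> block2 = List.of(new DataNode("dn1", false), new DataNode("dn3", true));
        System.out.println(readBlock(block1, dead)); // prints "dn2"; dn1 recorded as dead
        System.out.println(readBlock(block2, dead)); // prints "dn3"; dn1 skipped without a retry
    }
}
```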
DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode before DFSInputStream attempts to read a replica of the block from another datanode.
Data integrity in HDFS relies on this checksum mechanism.
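The verification step can be illustrated with plain `java.util.zip.CRC32` (HDFS really does use CRC-based checksums, computed per fixed-size chunk of 512 bytes by default; for brevity this sketch checksums a whole block at once):

```java
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over a block's bytes.
    public static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Verify data received from a datanode against the stored checksum.
    public static boolean verify(byte[] received, long storedChecksum) {
        return checksum(received) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "block-data".getBytes();
        long stored = checksum(block);
        System.out.println(verify(block, stored));  // true: data is intact
        block[0] ^= 0x01;                           // simulate a corrupted byte
        System.out.println(verify(block, stored));  // false: report to namenode, read another replica
    }
}
```

On a mismatch, the client reports the corrupt replica to the namenode and falls back to another replica, as described above.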
In Hadoop, inter-process communication between nodes in the system is implemented using Remote Procedure Call (RPC).