Monday, 28 March 2016

Anatomy of File Read in HDFS



When a client tries to read a file in HDFS:
1. The client contacts the namenode daemon to get the locations of the data blocks of the file it wants to read.
2. The namenode daemon returns the list of datanode addresses for each data block.
3. For any read operation, HDFS tries to return the datanode holding the block that is closest to the client. Here, "closest" refers to network proximity between the datanode daemon and the client.
4. Once the client has the list, it connects to the closest datanode daemon and starts reading the data block over a stream.
5. After a block is read completely, the connection to that datanode is terminated, the datanode daemon that hosts the next block in the sequence is identified, and that block is streamed in turn. This continues until the last data block of the file is read.
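
This whole sequence is handled by the HDFS client library, so the application only ever sees an ordinary input stream. As a minimal sketch, the Java snippet below reads a file this way; the namenode URI (hdfs://namenode:8020) and the file path are placeholders, and the block lookup and datanode selection described above all happen inside the Hadoop client:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs://namenode:8020 is an assumed cluster address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        InputStream in = null;
        try {
            // open() asks the namenode for block locations; reads are then
            // streamed from the closest datanode for each block.
            in = fs.open(new Path("/user/demo/sample.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}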


Anatomy of a File Read, step by step:
1. First, the client opens the file by calling the open() method on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.

2. DistributedFileSystem calls the namenode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block; the client then interacts with the respective datanodes to read the file. The namenode also provides a token to the client, which the client shows to the datanodes for authentication.
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks; see the sketch after this list) to the client to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

3. The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the closest datanode for the first block in the file.

4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.

5. When the end of the block is reached, DFSInputStream closes the connection to the datanode and then finds the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.

6. Blocks are read in order, with DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
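
Because FSDataInputStream supports seeks, as noted in step 2, a client can reposition within the open file and re-read parts of it; which datanode serves each byte range is again resolved internally. A short sketch, reusing the placeholder URI and path from the earlier example:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsSeekExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // first pass over the file
            in.seek(0);                                     // rewind to the start
            IOUtils.copyBytes(in, System.out, 4096, false); // read the file again
        }
    }
}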
Note:
Failure: During reading, if DFSInputStream encounters an error while communicating with a datanode, it will try the next closest datanode for that block. It also remembers datanodes that have failed so that it does not needlessly retry them for later blocks.
DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode before DFSInputStream attempts to read a replica of the block from another datanode.
Data integrity in HDFS relies on this checksum mechanism.
In Hadoop, inter-process communication between nodes in the system is implemented using Remote Procedure Call (RPC).
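
The checksum mechanism is also visible through the public API: FileSystem.getFileChecksum() returns a file-level checksum that HDFS derives from the per-block checksums maintained by the datanodes. A hedged sketch, with the same placeholder URI and path as before:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        FileChecksum checksum = fs.getFileChecksum(new Path("/user/demo/sample.txt"));
        if (checksum != null) { // some FileSystem implementations return null
            System.out.println("Algorithm: " + checksum.getAlgorithmName());
            System.out.println("Checksum : " + checksum);
        }
    }
}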

