Introduction to Apache HDFS and MapReduce
Apache Hadoop originated from Google's whitepapers:
1. Apache HDFS is derived from GFS (the Google File System).
2. Apache MapReduce is derived from Google MapReduce.
3. Apache HBase is derived from Google BigTable.
Though Google provided only the whitepapers, without any implementation, around 90-95% of the architecture presented in them is applied in these three Java-based Apache projects.
HDFS and MapReduce are the two major components of Hadoop: HDFS provides the infrastructure, while MapReduce covers the programming model. Though HDFS is at present a subproject of Apache Hadoop, it was originally developed as an infrastructure for the Apache Nutch web search engine project.
To understand the magic behind the scalability of Hadoop from a one-node cluster to a thousand-node cluster (Yahoo! has a 4,500-node cluster managing 40 petabytes of enterprise data), we first need to understand Hadoop's file system, that is, HDFS (Hadoop Distributed File System).
What is HDFS (Hadoop Distributed File System)?
HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
"Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Though it has many similarities with existing traditional distributed file systems, there are noticeable differences between them. Let's look into some of the assumptions and goals/objectives behind HDFS, which also form some of the striking features of this incredible file system!
Assumptions and Goals/Objectives behind HDFS:
1. Large Data Sets:
It is assumed that HDFS will always work with large data sets. It would be an underuse of HDFS if it were deployed to process several small data sets of only a few megabytes or gigabytes. The architecture of HDFS is designed so that it is the best fit for storing and retrieving huge amounts of data. What is required is high aggregate data bandwidth and the scalability to spread out from a single-node cluster to a hundred- or thousand-node cluster. The acid test is that HDFS should be able to manage tens of millions of files in a single instance.
2. Write Once, Read Many Model:
HDFS follows a write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified, though it can be accessed any number of times (future versions of Hadoop may support modification too)! At present, HDFS strictly allows one writer at any time. This assumption enables high-throughput data access and also simplifies data coherency issues. A web crawler or a MapReduce application is a good fit for HDFS.
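To make the pattern concrete, here is a minimal sketch (not from the original text) of write-once, read-many access using the HDFS Java API; the /user/demo path is a hypothetical placeholder and the configuration is assumed to be picked up from the cluster's core-site.xml/hdfs-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // reads cluster config from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/events.log");  // hypothetical path

        // Write the file exactly once; HDFS allows a single writer at a time.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("first and only write\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back as many times as needed.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}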
3. Streaming Data Access:
As HDFS works on the principle of 'write once, read many', streaming data access is extremely important in HDFS. HDFS is designed more for batch processing than for interactive use by users (returning a single record at a time). The emphasis is on high throughput of data access rather than low latency of data access. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing data. In HDFS, reading the complete data set is more important than the time taken to fetch a single record from it.
A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
4. Commodity Hardware:
HDFS (Hadoop Distributed File System) assumes that the cluster(s) will run on common hardware, that is, inexpensive, ordinary machines rather than high-availability systems. A great feature of Hadoop is that it can be installed on average commodity hardware; we don't need supercomputers or high-end hardware to work with Hadoop. This leads to an overall cost reduction to a great extent.
5. Data Replication and Fault Tolerance:
HDFS works on the assumption that hardware is bound to fail at some point of time or another, which disrupts the smooth and quick processing of large volumes of data. To overcome this obstacle, files in HDFS are divided into large blocks of data, and each block is stored on three nodes: two on the same rack and one on a different rack for fault tolerance. A block is the unit of data stored on each data node. Though the default block size is 64 MB and the replication factor is three, both are configurable per file. This redundancy enables robustness, fault detection, quick recovery, and scalability, eliminates the need for RAID storage on hosts, and brings the merits of data locality. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failures.
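As a small illustration of the per-file settings mentioned above, here is a sketch (hypothetical path, assumed values) that sets a custom block size and replication factor through the HDFS Java API when creating a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-dataset.bin");  // hypothetical path

        long blockSize = 128L * 1024 * 1024;  // 128 MB instead of the 64 MB default
        short replication = 2;                // 2 replicas instead of the default 3
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("data written with a custom block size and replication factor\n");
        }

        // Replication can also be changed later for an existing file.
        fs.setReplication(file, (short) 3);
    }
}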
6. High Throughput:
Throughput is the amount of work done in unit time. It describes how fast data can be accessed from the system, and it is usually used to measure the performance of the system. In Hadoop HDFS, when we want to perform a task or an action, the work is divided and shared among different systems. All the systems then execute the tasks assigned to them independently and in parallel, so the work is completed in a very short period of time. In this way, Apache HDFS gives good throughput: by reading data in parallel, we decrease the actual time to read data tremendously.
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. HBase is currently a better choice for low-latency access.
7. Moving Computation is Better than Moving Data:
Hadoop HDFS works on the principle that a computation performed by an application near the data it operates on is much more efficient than one performed far away, particularly when the data sets are large. The major advantages are a reduction in network congestion and an increase in the overall throughput of the system. The assumption is that it is often better to locate the computation closer to where the data resides rather than to move the data to the application. To facilitate this, Apache HDFS provides interfaces for applications to relocate themselves nearer to where the data is located.
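One such interface is the ability to ask the namenode where a file's blocks physically live. The sketch below (hypothetical path) uses the HDFS Java API to list the hosts holding each block, which is exactly the information a scheduler needs to place computation near the data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/large-dataset.bin"));

        // Ask the namenode which datanodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
    }
}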
8. File System Namespace:
HDFS follows a traditional hierarchical file organization, where any user or application can create directories and store files inside them. Thus, HDFS's file system namespace hierarchy is similar to most other existing file systems: one can create and delete files, relocate a file from one directory to another, or even rename a file. HDFS does not currently support hard links or soft links, though these could be implemented if the need arises.
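The same namespace operations are available programmatically. Here is a minimal sketch (hypothetical paths) using the Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/raw"));           // create a directory
        fs.mkdirs(new Path("/user/demo/archive"));       // destination must exist before a rename
        fs.rename(new Path("/user/demo/raw/part-00000"), // move/rename a file
                  new Path("/user/demo/archive/part-00000"));
        fs.delete(new Path("/user/demo/tmp"), true);     // delete a directory recursively
    }
}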
Thus, HDFS works on these assumptions and goals in order to help the user access or process large data sets within an incredibly short period of time!
9. What is an HDFS Block, and Why is it So Large?
A disk has a block size, which is the minimum amount of data that it can read or write. Normal filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes. This is generally transparent to the filesystem user, who is simply reading or writing a file.
HDFS also has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units that together make up a file.
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
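A quick back-of-the-envelope sketch makes this concrete. Assuming a seek time of roughly 10 ms and a sustained transfer rate of roughly 100 MB/s (illustrative figures, not measurements from the text), the seek overhead per block shrinks as the block grows:

public class SeekVsTransfer {
    public static void main(String[] args) {
        double seekMs = 10.0;             // assumed average seek time
        double transferMBperSec = 100.0;  // assumed sustained transfer rate

        for (long blockMB : new long[] {1, 64, 128}) {
            double transferMs = blockMB / transferMBperSec * 1000.0;
            double seekOverheadPct = seekMs / (seekMs + transferMs) * 100.0;
            System.out.printf("block=%3d MB  transfer=%6.0f ms  seek overhead=%.1f%%%n",
                    blockMB, transferMs, seekOverheadPct);
        }
    }
}

With a 1 MB block the seek takes as long as the transfer (about 50% overhead), whereas with a 64 MB block the seek is only around 1-2% of the total time.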
10. MapReduce Components (Introduction):
MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in various languages: Java, Ruby, Python, and C++.
To speed up the processing, we need to run parts of the program in parallel. If the work is simply divided by input file, some processes will finish much earlier than others, since the input files may vary from small to large. A better approach is for MapReduce to split the input into fixed-size chunks and assign each chunk to a process.
A MapReduce job is made up of four distinct stages, executed in order:
1. client job submission,
2. map task execution,
3. shuffle and sort,
4. reduce task execution.
A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
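As a sketch of what the two task types look like in code, here is the classic word-count example written against the org.apache.hadoop.mapreduce Java API (the surrounding discussion uses the classic jobtracker/tasktracker terms, but the programming model is the same):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: runs once per input split, emitting (word, 1) for every word in the split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: receives all counts for a word after the shuffle and sort, and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}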
There are two types of nodes that control the job execution process: a jobtracker (master) and a number of tasktrackers (workers). The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
A master process, called the jobtracker in Hadoop MapReduce, is responsible for accepting these job submissions, while the tasktrackers, which run on the datanodes, break the work down into map tasks and reduce tasks and execute them on nodes where the data is located or nearby.
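From the client's point of view, submitting a job means packaging the input data, the MapReduce program, and the configuration into a Job object. Here is a minimal driver sketch for the word-count classes shown above (the input and output paths are hypothetical; the output directory must not already exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // must not exist yet

        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}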
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
If we process the splits in parallel, the processing is better load-balanced when the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. However, if the splits are too small, the overhead of managing the splits and of map task creation begins to dominate (and so lengthen) the total job execution time.
For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this is configurable.
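If you do want to steer the split size, the new-API FileInputFormat exposes minimum and maximum split-size hints. A small sketch (the Job is assumed to be the driver's job from the earlier example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size demo");

        long blockSize = 64L * 1024 * 1024;  // match the default HDFS block size

        // Keep each split at roughly one block so map tasks stay data-local.
        FileInputFormat.setMinInputSplitSize(job, blockSize);
        FileInputFormat.setMaxInputSplitSize(job, blockSize);
    }
}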
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization because it doesn't use valuable cluster bandwidth.
To put it simply: files, which are made of blocks, are divided into input splits. Each input split is processed by a map task, and each mapper's output then goes into the shuffle and sort phase. These intermediate outputs are processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. If the node running a map task fails before its output has been consumed by the reduce task, Hadoop will automatically rerun the map task on another node to re-create the map output.
Reduce tasks don't have the advantage of data locality; the input to a single reduce task is normally the output from all mappers.
We will cover MapReduce in more detail under the title Anatomy of a Map-Reduce Program.