
Hadoop Beginner's Guide

In this post, we will go through some basic concepts and the basic APIs of Hadoop.

Overview of Hadoop

Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of ‘Big Data’ across clusters of computers using simple programming models.

What Is Big Data?

Hadoop is software for handling big data, but what exactly is big data?
Big data is usually characterized by three features:

  • Big volume
  • Big velocity
  • Big variety

Structure of Hadoop

The following are the main components of Hadoop:

  • A distributed file system called Hadoop Distributed File System (HDFS) to store data
  • A framework and API for building and running MapReduce jobs to handle data
  • Hadoop Common: common utilities that support the other modules
  • Hadoop YARN: job scheduling and cluster resource management

Scope and Aim

To learn effectively, we should have a clear aim and a detailed road map.
So we first define the scope of what we will learn:

  • basic concepts and architecture of Hadoop
  • basic API

Aims:

  • understand the concepts well enough to introduce them to others
  • use the API to write a simple example

Resources

Plans and Steps

  • Concepts: HDFS and MapReduce
  • API
  • Examples & Application

Concepts

MapReduce

Map Task and Reduce Task

  • Map: converts input data into key/value pairs – transformation
  • Reduce: merges the values for each key into a smaller set – aggregation

Between map and reduce there are three other stages: partitioning, sorting, and grouping. In the default configuration, the goal of these intermediate steps is to ensure that the values for each key are grouped together, ready for the reduce() function.
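
To make the transformation/aggregation split concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API (the class names are made up for illustration; this is not the example built later in this post):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map: one input line -> a (word, 1) pair per word (transformation)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count) (aggregation)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}

Between map and reduce, the framework's partition/sort/group stages guarantee that SumReducer sees all the counts for a given word in a single call.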

The major advantage of MapReduce is its scalability.

Master JobTracker and Slave TaskTracker

The master is responsible for resource management, tracking resource consumption/availability, scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks.

The TaskTracker slaves execute the tasks as directed by the master and periodically report task-status information back to the master.

HDFS

HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system.

NameNode and DataNode

The NameNode determines the mapping of blocks to the DataNodes.

The DataNodes take care of read and write operations on the file system. They also handle block creation, deletion, and replication based on instructions from the NameNode.

Setup and Example

From the Fedora documentation:

sudo dnf install hadoop-common hadoop-hdfs hadoop-mapreduce hadoop-mapreduce-examples hadoop-yarn
# Initialize the HDFS directories:
hdfs-create-dirs
# Start the cluster by issuing:
systemctl start hadoop-namenode hadoop-datanode hadoop-nodemanager hadoop-resourcemanager
# Create a directory for the user running the tests:
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -mkdir /user/<name>"
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown <name> /user/<name>"

Run Example

# estimate pi using a MapReduce job
hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar pi 10 1000000

We can also check the state of HDFS (the NameNode web UI) in a browser via:

http://localhost:50070/

And the state of our applications (the YARN ResourceManager web UI) via:

http://localhost:8088/

API

HDFS

hadoop fs is the file-system shell that Hadoop provides to interact with HDFS. It includes commands that other file systems also support, such as ls, cat, etc.

You can always use:

hadoop fs -help

to find out whether there is a command that does what you need.
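
Besides the shell, the same file-system operations are available to Java programs through the org.apache.hadoop.fs.FileSystem class; the following is a minimal sketch (the /user/zzt path is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListUserDir {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and the rest of the cluster configuration from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Roughly the same as `hadoop fs -ls /user/zzt`
        for (FileStatus status : fs.listStatus(new Path("/user/zzt"))) {
            System.out.println(status.getPath());
        }
    }
}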

MapReduce

First, we write a mapper and a reducer against the interfaces of the Hadoop library.

//Mapper class
public static class LineToWordMapper extends Mapper
        <LongWritable,        /*Input key Type */
                Text,         /*Input value Type*/
                Text,         /*Output key Type*/
                Text>        /*Output value Type*/ {}

public static class WordsReducer extends Reducer<Text, Text, Text, Text> {}

// More job configuration and input/output format

The complete code can be found in my GitHub repo.
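
As a rough sketch of the job configuration mentioned in the comment above, a driver that wires the two classes together could look like this (written against the newer org.apache.hadoop.mapreduce API; older 0.20.x libraries use new Job(conf, ...) instead of Job.getInstance, and this is an illustration rather than the exact code in the repo):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstJob {
    // LineToWordMapper and WordsReducer are the nested classes shown above

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "first job");
        job.setJarByClass(FirstJob.class);

        job.setMapperClass(LineToWordMapper.class);
        job.setReducerClass(WordsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input and output HDFS paths come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This matches the way the job is launched below with hadoop jar.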

Now that the code is finished, we can compile it.

# compile with hadoop lib
javac -cp path/to/hadoop-core-0.20.2.jar -d . src/FirstJob.java 
# package it
jar cvf test.jar *.class
# upload our input file to HDFS
hadoop fs -put path/to/input /user/zzt/search/
# run it in hadoop
hadoop jar test.jar FirstJob /user/zzt/search/input /user/zzt/search/output

Common Errors

ClassCastException
Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

As this SO question explains, the job uses the default input format (TextInputFormat), so the input key type is LongWritable (the byte offset of each line), not Text.
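
Concretely, the exception appears when the mapper's declared input key type disagrees with what the default input format actually feeds it; a hypothetical BadMapper shows the broken and the fixed declaration:

// Wrong: TextInputFormat (the default) supplies LongWritable keys (byte offsets),
// so a Text input key triggers the ClassCastException at runtime
public static class BadMapper extends Mapper<Text, Text, Text, Text> { /* ... */ }

// Right: declare the input key as LongWritable, as in the skeleton above
public static class LineToWordMapper extends Mapper<LongWritable, Text, Text, Text> { /* ... */ }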

FileAlreadyExistsException
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/user/zzt/search/output already exists

The common solution is to remove it by hand or from Java code, although it feels strange that Hadoop insists on creating the output directory itself every time.

hadoop fs -rmr /path/to/your/output/
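
The "from Java code" option is just a small guard in the driver before the job is submitted; a sketch (note the recursive delete, so only ever point it at the job's own output directory):

// In the FirstJob driver, before submitting the job:
FileSystem fs = FileSystem.get(conf);     // org.apache.hadoop.fs.FileSystem
Path output = new Path(args[1]);          // org.apache.hadoop.fs.Path
if (fs.exists(output)) {
    fs.delete(output, true);              // true = delete recursively
}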

