In this blog, we will learn some basic concepts and the basic API of Hadoop.
Overview of Hadoop
Hadoop is an Apache open source framework written in Java that allows distributed processing of ‘Big Data’ across clusters of computers using simple programming models.
What Is Big Data?
Hadoop is software for handling big data, but what exactly is big data?
Big Data has the following characteristics:
- Big volume
- Big velocity
- Big variety
Structure of Hadoop
The following are some core components of Hadoop:
- A distributed file system called Hadoop Distributed File System (HDFS) to store data
- A framework and API for building and running MapReduce jobs to handle data
- Hadoop Common: common utilities that support the other Hadoop modules
- Hadoop YARN: job scheduling and cluster resource management
Scope and Aim
In order to learn effectively, we should have a clear aim and a detailed road map.
So we first define the scope of what we will learn:
- basic concepts and architecture of Hadoop
- basic API
Aim:
- understand the concepts and be able to explain them to others
- use the API to write a simple example
Resources
- A Beginners Guide
- Hadoop tutorial
- Hadoop tutorial with labs
- Hadoop: The Definitive Guide
Plans and Steps
- Concepts: HDFS and MapReduce
- API
- Examples & Application
Concepts
MapReduce
Map Task and Reduce Task
- Map: converts input data into key/value pairs (transformation)
- Reduce: merges the values for each key into a smaller set (aggregation)
There are three other stages between map and reduce: partitioning, sorting, and grouping. In the default configuration, the goal of these intermediate steps is to ensure that all the values for each key are grouped together, ready for the reduce() function.
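As a rough sketch of what default partitioning does (this mirrors Hadoop's built-in HashPartitioner; the class name KeyHashPartitioner is made up for illustration): the reduce task for a key is derived from the key's hash, so every value for a given key lands on the same reducer, where sorting and grouping then present the values together to reduce().

import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent to Hadoop's default HashPartitioner: pick a reduce task from the
// key's hash, so all records with the same key go to the same reducer.
public class KeyHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit before taking the modulus to keep the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}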
The major advantage of MapReduce is its scalability.
Master JobTracker and Slave TaskTracker
The master (JobTracker) is responsible for resource management, tracking resource consumption and availability, scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks.
The slave TaskTrackers execute the tasks as directed by the master and report task status to the master periodically.
HDFS
Like any other file system, HDFS provides a shell, and a list of commands is available for interacting with it.
NameNode and DataNode
The NameNode determines the mapping of blocks to the DataNodes.
The DataNodes handle read and write operations on the file system. They also take care of block creation, deletion, and replication based on instructions from the NameNode.
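To see this mapping from the client side, here is a minimal sketch using the HDFS Java API (the class name and the command-line path are hypothetical). It asks the NameNode, through FileSystem, which DataNodes hold each block of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);               // e.g. a file under /user/<name>/
        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> " + String.join(", ", block.getHosts()));
        }
    }
}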
Setup and Example
# Install the Hadoop packages (Fedora packaging shown; use your distribution's package manager):
sudo dnf install hadoop-common hadoop-hdfs hadoop-mapreduce hadoop-mapreduce-examples hadoop-yarn
# Initialize the HDFS directories:
hdfs-create-dirs
# Start the cluster by issuing:
systemctl start hadoop-namenode hadoop-datanode hadoop-nodemanager hadoop-resourcemanager
# Create a directory for the user running the tests:
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -mkdir /user/<name>"
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown <name> /user/<name>"
Run Example
# calculate pi in a MapReduce way
hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar pi 10 1000000
We can also check the Hadoop cluster state and our application state in a browser via the NameNode and ResourceManager web UIs (by default on ports 50070 and 8088, respectively).
API
HDFS
hadoop fs is the file system shell that Hadoop provides for interacting with HDFS. It includes commands that other file systems also support, such as ls, cat, etc.
You can always use:
hadoop fs -help
to find out whether there is a command for what you need.
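Besides the shell, HDFS also has a Java API (org.apache.hadoop.fs.FileSystem). Here is a minimal sketch (the path is made up) that writes a file and reads it back, roughly what hadoop fs -put and hadoop fs -cat do:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/zzt/hello.txt");   // hypothetical path

        // Write a small file, similar to `hadoop fs -put`.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read it back, similar to `hadoop fs -cat`.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}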
MapReduce
First, we write a mapper and a reducer that follow the interfaces of the Hadoop library.
//Mapper class
public static class LineToWordMapper
        extends Mapper<LongWritable, /* Input key Type    */
                       Text,         /* Input value Type  */
                       Text,         /* Output key Type   */
                       Text> {       /* Output value Type */
}
public static class WordsReducer extends Reducer<Text, Text, Text, Text> {
}
// More job configuration and input/output format
The complete code can be found in my GitHub repo.
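To give an idea of how the pieces fit together, here is a minimal, hypothetical sketch that matches the class names above (the word-count-style logic is assumed for illustration; the real logic is in the repo). It assumes the Hadoop 2.x org.apache.hadoop.mapreduce API, so with the older hadoop-core-0.20.2 jar used below you would construct the Job with new Job(conf, ...) instead of Job.getInstance:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstJob {

    // Emit (word, "1") for every word in the line; the input key is the
    // byte offset of the line, which we ignore.
    public static class LineToWordMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text("1"));
                }
            }
        }
    }

    // All values for one word arrive together (thanks to partition/sort/group);
    // sum them up.
    public static class WordsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (Text count : counts) {
                sum += Long.parseLong(count.toString());
            }
            context.write(word, new Text(Long.toString(sum)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "first job");
        job.setJarByClass(FirstJob.class);
        job.setMapperClass(LineToWordMapper.class);
        job.setReducerClass(WordsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Since the map output types here match the final output types, setOutputKeyClass/setOutputValueClass cover both; if they differed, we would also call job.setMapOutputKeyClass and job.setMapOutputValueClass.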
Now that the code is finished, we can compile it.
# compile with hadoop lib
javac -cp path/to/hadoop-core-0.20.2.jar -d . src/FirstJob.java
# package it
jar cvf test.jar *.class
# upload our input file into HDFS
hadoop fs -put path/to/input /user/zzt/search/
# run it in hadoop
hadoop jar test.jar FirstJob /user/zzt/search/input /user/zzt/search/output
Common Errors
ClassCastException
Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
As this SO question explains, we are using the default input format (TextInputFormat), so the map input key type is LongWritable (the byte offset of each line) by default.
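In other words, there are two usual fixes, sketched here in the style of the snippets above (the KeyValueTextInputFormat option assumes the Hadoop 2.x mapreduce API, and job is the Job from the driver):

// Fix 1: match the default TextInputFormat, whose keys are byte offsets (LongWritable).
public static class LineToWordMapper
        extends Mapper<LongWritable, Text, Text, Text> { /* ... */ }

// Fix 2: use an input format that produces Text keys, e.g.
// org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
// (each line is split into a key and a value at the first tab).
job.setInputFormatClass(KeyValueTextInputFormat.class);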
FileAlreadyExistsException
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/user/zzt/search/output already exists
The common solution is to remove it by hand or from Java code, although it is a bit strange that Hadoop makes us recreate the output directory every time.
hadoop fs -rmr /path/to/your/output/
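For the "from Java code" option, here is a small sketch (the helper name is made up) to put inside the driver class and call before submitting the job:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Delete the output directory recursively if it exists,
// the programmatic equivalent of `hadoop fs -rmr`.
static void deleteOutputIfExists(Configuration conf, Path output) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(output)) {
        fs.delete(output, true);   // true = recursive
    }
}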
Written with StackEdit.