
Hadoop Beginner's Guide

In this post, we will go through some basic concepts and the basic APIs of Hadoop.

Overview of Hadoop

Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of ‘Big Data’ across clusters of computers using simple programming models.

What Is Big Data?

Hadoop is software for handling big data, but what exactly is big data?
Big data is usually characterized by three features:

  • Big volume
  • Big velocity
  • Big variety

Structure of Hadoop

The following are the main components of Hadoop:

  • A distributed file system called Hadoop Distributed File System (HDFS) to store data
  • A framework and API for building and running MapReduce jobs to handle data
  • Hadoop Common: common utilities that support the other modules
  • Hadoop YARN: job scheduling and cluster resource management

Scope and Aim

To learn effectively, we should have a clear aim and a detailed road map.
So we first define the scope of what we will learn:

  • basic concepts and architecture of Hadoop
  • basic API

Aims:

  • understand the concepts well enough to introduce them to others
  • use the API to write a simple example

Resources

Plans and Steps

  • Concepts: HDFS and MapReduce
  • API
  • Examples & Application

Concepts

MapReduce

Map Task and Reduce Task

  • Map: converts input data into key/value pairs – transformation
  • Reduce: merges the values for each key into a smaller set – aggregation

Between map and reduce there are three other stages: partitioning, sorting, and grouping. In the default configuration, the goal of these intermediate steps is to ensure that the values for each key are grouped together, ready for the reduce() function.
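
To make the transformation/aggregation split concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API (the class names are made up for illustration; this is not the example built later in this post):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map: one input line -> a (word, 1) pair per word (transformation)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count) (aggregation)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}

Between map and reduce, the framework's partition/sort/group stages guarantee that SumReducer sees all the counts for a given word in a single call.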

The major advantage of MapReduce is its scalability.

Master JobTracker and Slave TaskTracker

The master is responsible for resource management, tracking resource consumption/availability, scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks.

The TaskTracker slaves execute the tasks as directed by the master and periodically report task-status information back to the master.

HDFS

HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system.

NameNode and DataNode

The NameNode determines the mapping of blocks to the DataNodes.

The DataNodes take care of read and write operations on the file system. They also handle block creation, deletion, and replication based on instructions from the NameNode.

Setup and Example

From the Fedora documentation:

sudo dnf install hadoop-common hadoop-hdfs hadoop-mapreduce hadoop-mapreduce-examples hadoop-yarn
# Initialize the HDFS directories:
hdfs-create-dirs
# Start the cluster by issuing:
systemctl start hadoop-namenode hadoop-datanode hadoop-nodemanager hadoop-resourcemanager
# Create a directory for the user running the tests:
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -mkdir /user/<name>"
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown <name> /user/<name>"

Run Example

# estimate pi using a MapReduce job
hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar pi 10 1000000

We can also check the state of HDFS (the NameNode web UI) in a browser via:

http://localhost:50070/

And the state of our applications (the YARN ResourceManager web UI) via:

http://localhost:8088/

API

HDFS

hadoop fs is the file-system shell that Hadoop provides to interact with HDFS. It includes commands that other file systems also support, such as ls, cat, etc.

You can always use:

hadoop fs -help

to find out whether there is a command that does what you need.
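
Besides the shell, the same file-system operations are available to Java programs through the org.apache.hadoop.fs.FileSystem class; the following is a minimal sketch (the /user/zzt path is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListUserDir {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and the rest of the cluster configuration from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Roughly the same as `hadoop fs -ls /user/zzt`
        for (FileStatus status : fs.listStatus(new Path("/user/zzt"))) {
            System.out.println(status.getPath());
        }
    }
}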

MapReduce

First, we write a mapper and a reducer against the interfaces of the Hadoop library.

//Mapper class
public static class LineToWordMapper extends Mapper
        <LongWritable,        /*Input key Type */
                Text,         /*Input value Type*/
                Text,         /*Output key Type*/
                Text>        /*Output value Type*/ {}

public static class WordsReducer extends Reducer<Text, Text, Text, Text> {}

// More job configuration and input/output format

The complete code can be found in my GitHub repo.
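
As a rough sketch of the job configuration mentioned in the comment above, a driver that wires the two classes together could look like this (written against the newer org.apache.hadoop.mapreduce API; older 0.20.x libraries use new Job(conf, ...) instead of Job.getInstance, and this is an illustration rather than the exact code in the repo):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstJob {
    // LineToWordMapper and WordsReducer are the nested classes shown above

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "first job");
        job.setJarByClass(FirstJob.class);

        job.setMapperClass(LineToWordMapper.class);
        job.setReducerClass(WordsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input and output HDFS paths come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This matches the way the job is launched below with hadoop jar.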

Now that the code is finished, we can compile it.

# compile with hadoop lib
javac -cp path/to/hadoop-core-0.20.2.jar -d . src/FirstJob.java 
# package it
jar cvf test.jar *.class
# upload our input file to HDFS
hadoop fs -put path/to/input /user/zzt/search/
# run it in hadoop
hadoop jar test.jar FirstJob /user/zzt/search/input /user/zzt/search/output

Common Errors

ClassCastException
Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

As this SO question explains, the job uses the default input format (TextInputFormat), so the input key type is LongWritable (the byte offset of each line), not Text.
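
Concretely, the exception appears when the mapper's declared input key type disagrees with what the default input format actually feeds it; a hypothetical BadMapper shows the broken and the fixed declaration:

// Wrong: TextInputFormat (the default) supplies LongWritable keys (byte offsets),
// so a Text input key triggers the ClassCastException at runtime
public static class BadMapper extends Mapper<Text, Text, Text, Text> { /* ... */ }

// Right: declare the input key as LongWritable, as in the skeleton above
public static class LineToWordMapper extends Mapper<LongWritable, Text, Text, Text> { /* ... */ }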

FileAlreadyExistsException
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/user/zzt/search/output already exists

The common solution is to remove it by hand or from Java code, although it feels strange that Hadoop insists on creating the output directory itself every time.

hadoop fs -rmr /path/to/your/output/
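
The "from Java code" option is just a small guard in the driver before the job is submitted; a sketch (note the recursive delete, so only ever point it at the job's own output directory):

// In the FirstJob driver, before submitting the job:
FileSystem fs = FileSystem.get(conf);     // org.apache.hadoop.fs.FileSystem
Path output = new Path(args[1]);          // org.apache.hadoop.fs.Path
if (fs.exists(output)) {
    fs.delete(output, true);              // true = delete recursively
}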

