Thursday, February 27, 2014

Apache Hadoop - How to Install a Three Nodes Cluster (4)

Previous tutorials:
Apache Hadoop - How to Install a Three Nodes Cluster (1)
Apache Hadoop - How to Install a Three Nodes Cluster (2)
Apache Hadoop - How to Install a Three Nodes Cluster (3)

At this point you should have your three-node Hadoop cluster up and running: hadoop1 is the namenode + resource manager, hadoop2 is the job history server, and hadoop3 is the datanode. We still haven't configured the secondary namenode yet, but since the cluster is up and running, let's run some tests first.

Tests we will run:
Grep - this example extracts matching strings from text files and counts how many times they occurred.

The command works differently from the Unix grep call: it doesn't display the complete matching line, only the matching string, so to display lines matching "foo", use .*foo.* as the regular expression.

The program runs two map/reduce jobs in sequence. The first job counts how many times a matching string occurred and the second job sorts matching strings by their frequency and stores the output in a single output file.

Each mapper of the first job takes a line as input and matches the user-provided regular expression against it. It extracts all matching strings and emits (matching string, 1) pairs. Each reducer sums the frequencies of each matching string. The output is a set of sequence files containing the matching string and its count. The reduce phase is optimized by running a combiner that sums the frequencies of strings from the local map output, which reduces the amount of data that needs to be shipped to a reduce task.

The second job takes the output of the first job as input. The mapper is an inverse map, while the reducer is an identity reducer. The number of reducers is one, so the output is stored in a single file, sorted by count in descending order. The output file is plain text; each line contains a count and a matching string.

The example also demonstrates how to pass a command-line parameter to a mapper or a reducer. This is done by adding (key, value) pairs to the job's configuration before the job is submitted. Map and reduce tasks can then read the value from the job's configuration in their setup method (configure in the old MapReduce API).
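
To make the parameter passing concrete, here is a minimal sketch in the spirit of the first Grep job. This is not the actual org.apache.hadoop.examples.Grep source; the class names and the property key my.search.regex are made up for this illustration. The driver stores the regular expression in the job configuration, and the mapper reads it back in setup:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class RegexParameterSketch {

  public static class RegexCountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {

    private Pattern pattern;
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void setup(Context context) {
      // Read the value that the driver stored in the job configuration.
      pattern = Pattern.compile(context.getConfiguration().get("my.search.regex"));
    }

    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      // Emit a (matching string, 1) pair for every match in the line.
      Matcher matcher = pattern.matcher(line.toString());
      while (matcher.find()) {
        context.write(new Text(matcher.group()), one);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Pass the command-line regular expression to the tasks via the configuration.
    conf.set("my.search.regex", args[2]);
    Job job = Job.getInstance(conf, "regex parameter sketch");
    job.setJarByClass(RegexParameterSketch.class);
    job.setMapperClass(RegexCountMapper.class);
    // Combiner and reducer both sum the per-string counts.
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The real Grep example then chains a second job that swaps keys and values and sorts the result by count, as described above.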

1. First, create the user directory in HDFS:
# su hdfs
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/lxu
$ hadoop fs -chown -R lxu:lxu /user/lxu
$ hadoop fs -ls /user/
drwxr-xr-x   - lxu lxu          0 2014-02-26 14:11 /user/lxu

Switch to the lxu user.

Create the input directory (the output directory will be created by the job itself):
$ hadoop fs -mkdir /user/lxu/grep-input

Upload sample.txt to grep-input. You can download the sample.txt file from:
sample.txt (https://drive.google.com/file/d/0B-r3WDLvcH3iT0U1c3A4TWlVajQ/edit?usp=sharing)
$ hadoop fs -put sample.txt /user/lxu/grep-input

If you see an error like:
put: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): 131072 < 1048576

Check your hdfs-site.xml file and increase the block size to at least 64 MB (67108864 bytes).
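
For reference, assuming the small value in the error above came from a block size setting in your hdfs-site.xml, the Hadoop 2.x property that controls it is dfs.blocksize, and the entry might look like this after the change (64 MB shown):

<property>
  <name>dfs.blocksize</name>
  <value>67108864</value>
</property>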

$ hadoop fs -ls /user/lxu/grep-input
-rw-r--r--   1 lxu lxu     124640 2014-02-26 14:38 /user/lxu/grep-input/sample.txt

Run the Grep example:
$ hadoop org.apache.hadoop.examples.Grep grep-input grep-output 'was'
....
14/02/27 16:23:14 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=20
FILE: Number of bytes written=158799
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=233
HDFS: Number of bytes written=8
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5705
Total time spent by all reduces in occupied slots (ms)=6419
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=12
Map output materialized bytes=20
Input split bytes=127
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=20
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=147
CPU time spent (ms)=1400

Physical memory (bytes) snapshot=280195072
Virtual memory (bytes) snapshot=1793179648
Total committed heap usage (bytes)=136450048

Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0

File Input Format Counters
Bytes Read=106
File Output Format Counters
Bytes Written=8
....

$ hadoop fs -cat /user/lxu/grep-output/part-r-00000
288 was
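
Note that MapReduce will not write into an existing output directory, so if you want to re-run the example, remove grep-output first:

$ hadoop fs -rm -r /user/lxu/grep-output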


