Saturday, February 22, 2014

Apache Hadoop - How to Install a Three Nodes Cluster (2)

Related tutorials in this series:
Apache Hadoop - How to Install a Three Nodes Cluster (1)
Apache Hadoop - How to Install a Three Nodes Cluster (3)
Apache Hadoop - How to Install a Three Nodes Cluster (4)

Install Hadoop

1. Download the stable version of Hadoop from the official Apache website:
http://hadoop.apache.org
I use hadoop-2.2.0.tar.gz; copy it over to all three nodes (see the sketch after this list).

2. Unpack the tarball.
# mv hadoop-2.2.0.tar.gz /opt
# cd /opt
# tar -xzvf hadoop-2.2.0.tar.gz

3. Decide on the Hadoop cluster architecture:
I will use hadoop1 as my NameNode, ResourceManager and web proxy server, hadoop2 as my secondary NameNode and job history server, and hadoop3 as my DataNode.
hadoop1: NameNode, ResourceManager, WebAppProxyServer
hadoop2: Secondary NameNode, JobHistoryServer
hadoop3: DataNode, NodeManager
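
To get the tarball onto all three nodes, I simply copy it from the machine where it was downloaded (a minimal sketch; it assumes the download is on hadoop1 and that root SSH access between the nodes is available):
# scp hadoop-2.2.0.tar.gz root@hadoop2:/opt/
# scp hadoop-2.2.0.tar.gz root@hadoop3:/opt/
Then unpack it on each node exactly as in step 2.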

Configure Hadoop:

Setup JAVA_HOME:
We check the "hadoop-env.sh" file and make sure that at least $JAVA_HOME is set on all three servers.
# vim hadoop-env.sh
# cat hadoop-env.sh | grep JAVA
# The only required environment variable is JAVA_HOME.  All others are
# set JAVA_HOME in this file, so that it is correctly defined on
export JAVA_HOME="/usr/lib/jvm/jdk1.7.0/"
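
Before proceeding, it is worth confirming that the JDK path you just exported actually exists on every node (a quick sketch, assuming the same JDK location as above):
# ls -d /usr/lib/jvm/jdk1.7.0/
# /usr/lib/jvm/jdk1.7.0/bin/java -version
Both commands should succeed on hadoop1, hadoop2 and hadoop3.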

The core configuration files are: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, hadoop-env.sh and yarn-env.sh. I will talk about how to configure each of these files one by one. Depending on the roles you assign to each server, some configuration files are slightly different from others.

Configure core-site.xml file:
core-site.xml contains the configuration information that overrides the default values for core Hadoop properties. The default core-site.xml file looks like:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>

Core parameters include:
  • fs.defaultFS - The NameNode URI; this is the default path prefix used by the Hadoop FS client when none is given.
  • io.file.buffer.size - This parameter specifies the buffer size for sequence files. For optimal performance, set its value to a multiple of the hardware page size (the Intel x86 architecture has a hardware page size of 4096 bytes). This value determines how much data is buffered during read and write operations.
You can use "getconf" to get your pagesize.
# getconf PAGESIZE
4096

Sample core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop1.com:8020</value>
    </property>

    <property>
        <name>io.file.buffer.size</name>
        <value>8192</value>
    </property>
</configuration>
Note: The core-site.xml file is the same on all servers.
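
Once this file is in place on a node, you can ask Hadoop which NameNode URI it actually resolved, which is a handy sanity check (hdfs getconf is a stock tool; the path assumes the install location used in this tutorial):
# /opt/hadoop-2.2.0/bin/hdfs getconf -confKey fs.defaultFS
It should print hdfs://hadoop1.com:8020, matching the value we just set.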

Configure hdfs-site.xml: 
Now we configure the hdfs-site.xml file. Like core-site.xml, it contains the important HDFS filesystem parameters; for the full list see:
https://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

I will not explain every hdfs-site.xml parameter here; the official Apache documentation above is self-explanatory.

Sample hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop1:8020</value>
        <description>The port where the NameNode runs the HDFS protocol.
                     Combined with the NameNode's hostname to build its address.
        </description>
    </property>

    <property>
        <name>dfs.namenode.rpc-address</name>
        <value>hadoop1:8020</value>
        <description>
            RPC address that handles all clients requests. In the case of HA/Federation where multiple namenodes exist, the name service id is added to the name, e.g. dfs.namenode.rpc-address.ns1, dfs.namenode.rpc-address.EXAMPLENAMESERVICE. The value of this property will take the form of nn-host1:rpc-port.
        </description>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/nn1,file:///data/nn2</value>
    </property>

    <property>
        <name>dfs.blocksize</name>
        <value>131072</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/data1</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
   </property>

</configuration>
Note: The hdfs-site.xml file is the same on all servers.
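
The directories referenced above (/data/nn1 and /data/nn2 on the NameNode, /data/data1 on the DataNode) do not create themselves; a minimal sketch of preparing them, assuming an hdfs user and a hadoop group as used elsewhere in this tutorial:
On hadoop1:
# mkdir -p /data/nn1 /data/nn2
# chown -R hdfs:hadoop /data/nn1 /data/nn2
On hadoop3:
# mkdir -p /data/data1
# chown -R hdfs:hadoop /data/data1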

Configure yarn-site.xml: 
For a complete list of YARN parameters, see http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

Since this file is big, I will not paste its content here; you can download it from:
Sample yarn-site.xml
Note: yarn-site.xml file is the same on all servers. 

Configure mapred-site.xml: 
For a complete list of mapred parameters: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
This file does not exist by default; just copy it from mapred-site.xml.template:

# cp mapred-site.xml.template mapred-site.xml
# vi  mapred-site.xml

You can download mapred-site.xml file from:
Sample mapred-site.xml
Note: If you choose a different server for your JobHistoryServer (i.e. not the NameNode), you should update "mapreduce.jobhistory.address" and "mapreduce.jobhistory.webapp.address" to the corresponding server's hostname.
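
For example, with the JobHistoryServer on hadoop2 as in the architecture above, the two properties would look roughly like this (10020 and 19888 are the stock default ports for these two addresses):
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop2:10020</value>
    </property>

    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop2:19888</value>
    </property>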

Note: If you just want to get an MRv2 job running on YARN, you can use the minimal configuration below for yarn-site.xml and mapred-site.xml (I still strongly recommend trying the other configuration parameters if you want to dig deeper; sticking to the minimum does not help your understanding):
  • yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration> 
    <property>   
        <name>yarn.resourcemanager.hostname</name>   
        <value>you.hostname.com</value> 
    </property> 

    <property>   
        <name>yarn.nodemanager.aux-services</name>   
        <value>mapreduce_shuffle</value> 
    </property> 

    <property>   
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>   
        <value>org.apache.hadoop.mapred.ShuffleHandler</value> 
    </property>
</configuration>
  • mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration> 
    <property>   
        <name>mapreduce.framework.name</name>   
        <value>yarn</value> 
    </property>
</configuration>

Configure the slaves file:
Typically we choose one machine in the cluster to act as the NameNode and one machine to act as the ResourceManager, exclusively. The rest of the machines act as both DataNode and NodeManager and are referred to as slaves. List all slave hostnames or IP addresses in your conf/slaves file, one per line. In our case, only hadoop3 goes into the slaves file.
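
In our setup the slaves file therefore contains a single line (the path assumes the install location used in this tutorial):
# cat /opt/hadoop-2.2.0/etc/hadoop/slaves
hadoop3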

Distribute the configuration files to the other servers:
Copy the whole Hadoop conf directory from hadoop1 to the other two servers. On hadoop2 and hadoop3:
# cd /opt/hadoop-2.2.0/etc
# mv hadoop hadoop-org  -- Backup

On hadoop1
# cd /opt/hadoop-2.2.0/etc
# scp -r hadoop/ root@hadoop2:/opt/hadoop-2.2.0/etc/
# scp -r hadoop/ root@hadoop3:/opt/hadoop-2.2.0/etc/

Hadoop Startup:
Before we start our cluster, do yourself a favour and set the "HADOOP_PREFIX" and "HADOOP_YARN_HOME" variables in your environment, so that we don't need to type the full path every time we run a command:
# For each of the hdfs, yarn and mapred users:
# vi ~/.bash_profile
add the following lines to the profile:
export HADOOP_PREFIX="/opt/hadoop-2.2.0"
export HADOOP_YARN_HOME="/opt/hadoop-2.2.0"
export HADOOP_CONF_DIR="/opt/hadoop-2.2.0/etc/hadoop"

then do a

# . ~/.bash_profile
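
A quick way to confirm the variables took effect after sourcing the profile (hadoop version is a standard subcommand):
# echo $HADOOP_PREFIX
/opt/hadoop-2.2.0
# $HADOOP_PREFIX/bin/hadoop version
The second command should print the Hadoop 2.2.0 version and build information.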

Create log directory:
Run the following commands on all three servers:
# mkdir /var/log/hadoop
# chmod 775 /var/log/hadoop/
# chown hdfs:hadoop /var/log/hadoop/
# mkdir /var/log/yarn
# chmod 775 /var/log/yarn/
# chown yarn:hadoop /var/log/yarn/
# mkdir /var/log/mapred
# chmod 775 /var/log/mapred/
# chown mapred:hadoop /var/log/mapred/  -- the JobHistoryServer runs as the mapred user

In hadoop-env.sh:
export HADOOP_LOG_DIR=/var/log/hadoop/

In yarn-env.sh
# default log directory & file
if [ "$YARN_LOG_DIR" = "" ]; then
  YARN_LOG_DIR="/var/log/yarn"
fi
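
The /var/log/mapred directory is only picked up if the job history scripts know about it; the variable read by mr-jobhistory-daemon.sh is HADOOP_MAPRED_LOG_DIR (treat this as an assumption and double-check your mapred-env.sh), so the corresponding line would be:

In mapred-env.sh:
export HADOOP_MAPRED_LOG_DIR=/var/log/mapred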

Start Hadoop cluster: 
HDFS service:
On hadoop1 server:
# su hdfs  -- become hdfs user
# $HADOOP_PREFIX/bin/hdfs namenode -format "Test Cluster"
14/02/19 11:22:24 INFO util.GSet: 0.029999999329447746% max memory = 966.7 MB
14/02/19 11:22:24 INFO util.GSet: capacity          = 2^15 = 32768 entries
14/02/19 11:22:24 INFO common.Storage: Storage directory /data/nn1 has been successfully formatted.
14/02/19 11:22:24 INFO common.Storage: Storage directory /data/nn2 has been successfully formatted.
14/02/19 11:22:24 INFO namenode.FSImage: Saving image file /data/nn2/current/fsimage.ckpt_0000000000000000000 using no compression
14/02/19 11:22:24 INFO namenode.FSImage: Saving image file /data/nn1/current/fsimage.ckpt_0000000000000000000 using no compression
14/02/19 11:22:24 INFO namenode.FSImage: Image file /data/nn2/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
14/02/19 11:22:24 INFO namenode.FSImage: Image file /data/nn1/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
14/02/19 11:22:24 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/02/19 11:22:24 INFO util.ExitUtil: Exiting with status 0
14/02/19 11:22:24 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop1.com/192.168.143.131
************************************************************/
Start the HDFS namenode with the following command:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
starting namenode, logging to /opt/hadoop-2.2.0/logs/hadoop-root-namenode-hadoop1.com.out
$ jps
29322 NameNode
29983 Jps

You should also check the log file to make sure there are no errors.
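
Besides the log file, you can also confirm the NameNode is listening on its RPC and web ports (a quick sketch; 8020 is the RPC port we configured and 50070 is the default NameNode web UI port):
$ netstat -ltn | grep -E '8020|50070'
$ curl -s -o /dev/null -w '%{http_code}\n' http://hadoop1:50070
A healthy NameNode should show both ports in LISTEN state and return 200 for the web UI.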

Start the datanode on hadoop3:
# su hdfs
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
starting datanode, logging to /opt/hadoop-2.2.0/logs/hadoop-root-datanode-hadoop1.com.out

Check that the DataNode process is running:
$ jps
6973 DataNode
7037 Jps

You should also check the log file to make sure there are no errors.
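
To confirm the DataNode has actually registered with the NameNode (and not just started locally), run a dfsadmin report as the hdfs user on hadoop1:
$ $HADOOP_PREFIX/bin/hdfs dfsadmin -report
It should list one live datanode (hadoop3); a report showing zero datanodes usually points to a hostname or firewall problem.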

YARN:
On hadoop1, start the YARN ResourceManager:
# su yarn
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
starting resourcemanager, logging to /opt/hadoop-2.2.0/logs/yarn-root-resourcemanager-hadoop1.com.out
$ jps
29739 ResourceManager
29773 Jps

You should also check the log file to make sure there are no errors.

On hadoop3, start nodemanager:
# su yarn
# $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
starting nodemanager, logging to /opt/hadoop-2.2.0/logs/yarn-root-nodemanager-hadoop1.com.out
$ jps
7603 Jps
7573 NodeManager
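
Back on hadoop1 you can verify that the NodeManager registered with the ResourceManager (yarn node -list is a stock subcommand):
$ $HADOOP_YARN_HOME/bin/yarn node -list
It should list hadoop3 in the RUNNING state once registration completes.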

On hadoop1, start a standalone WebAppProxy server. If multiple servers are used with load balancing, it should be run on each of them:
# su yarn
# $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR
starting proxyserver, logging to /opt/hadoop-2.2.0/logs/yarn-root-proxyserver-hadoop1.com.out
$ jps
29739 ResourceManager
30278 Jps
30245 WebAppProxyServer
Start a job history server:
# su mapred
$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
starting historyserver, logging to /opt/hadoop-2.2.0/logs/mapred-root-historyserver-hadoop1.com.out
At this step, the NameNode, DataNode, ResourceManager, NodeManager, WebAppProxyServer and JobHistoryServer processes should all be up and running.
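
A simple way to exercise the whole stack end to end is the bundled pi example (a smoke-test sketch; the jar path matches the 2.2.0 layout, and the HDFS directory setup shown first is an assumption that depends on your permissions and on which user submits the job):
# su hdfs
$ $HADOOP_PREFIX/bin/hdfs dfs -mkdir -p /tmp /user/yarn   -- staging and home directories; adjust to the submitting user
$ $HADOOP_PREFIX/bin/hdfs dfs -chmod 1777 /tmp
$ $HADOOP_PREFIX/bin/hdfs dfs -chown yarn:hadoop /user/yarn
# su yarn
$ $HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10
If the job completes and prints an estimate of Pi, then HDFS, YARN and the JobHistoryServer are wired together correctly; the finished job should also show up at http://hadoop1:19888/jobhistory.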

Check from web interface:
  • ResourceManager: http://hadoop1:8088 

  • Job History: http://hadoop1:19888/jobhistory
  • Namenode: http://hadoop1:50070
