Friday, August 09, 2013

Hadoop - How to install CDH4 using Cloudera Manager 4.6.2

So this tutorial shows you how to install the Cloudera Distribution of Hadoop (CDH 4.3.0-1.cdh4.3.0.p0.22) using Cloudera Manager Standard.

OS: ubuntu 12.04 LTS
Cloudera Manager: 4.6.2
CDH4: 4.3.0-1.cdh4.3.0.p0.22
Cluster Size: 3 nodes


If you are installing Cloudera Manager for the first time, you will:

  • Install a database application on the Cloudera Manager Server host machine or on a host machine that the Cloudera Manager Server can access, and (depending on the configuration you decide on) on other hosts as well.
  • Install the Cloudera Manager Server on one cluster host machine.
  • Install CDH and the Cloudera Manager Agents on the other cluster host machines.


I chose the Automated Installation by Cloudera Manager, which is probably the easiest path to install CM. Make sure the following requirements are met:
  • Uniform SSH access to cluster hosts on the same port from Cloudera Manager Server host.
  • All hosts must have access to standard package repositories.
  • All hosts must have access either to archive.cloudera.com on the internet or to a local repository with the necessary installation files.
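The SSH requirement above can be pre-flighted from the Cloudera Manager Server host before starting the installer. This is a sketch using the host names from this cluster; substitute your own:

```shell
# Confirm each cluster node is reachable over SSH from the CM Server host
# (hadoop1..hadoop3 are this cluster's host names; adjust to yours)
for h in hadoop1 hadoop2 hadoop3; do
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true 2>/dev/null \
        && echo "$h: ssh OK" \
        || echo "$h: ssh FAILED"
done
```

Any host reported as FAILED should be fixed before running the automated installation, since CM needs uniform SSH access to every node.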

Note: An embedded PostgreSQL database will be installed. 

Installation:
  1. Download cloudera-manager-installer.bin from http://www.cloudera.com/content/support/en/downloads.html
  2. Make it executable: # chmod +x cloudera-manager-installer.bin
  3. Run it: # ./cloudera-manager-installer.bin
  4. Read the Readme, then accept the licenses for CM and the Oracle JDK.
  5. Note the complete URL provided for the Cloudera Manager Admin Console, including the port number, which is 7180 by default. Press Enter to choose OK to continue. 


Note: By default CM will install Oracle jdk1.6.0 for you. I manually installed jdk1.7.0 and configured CM to use it.


Automated CDH Installation and Configuration:

1. Open the Cloudera Manager Admin Console in a browser; by default the port is 7180, for example I use "http://192.168.1.11:7180". The initial username and password are "admin/admin".

2. Select "Cloudera Standard", unless you have an Enterprise license.

3. Find the cluster hosts by specifying hostnames and/or IP-address ranges, then search for them. If the FQDN column in the search results shows "hadoopx.local" and you want to get rid of the ".local" suffix, define the hosts in your /etc/hosts file (if you don't have a DNS server). My hosts file looks like:
=====================
127.0.0.1       localhost
192.168.1.11    hadoop1
192.168.1.12    hadoop2
192.168.1.13    hadoop3
=====================
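After editing /etc/hosts, you can verify on each node that the names resolve as expected (a quick sanity check; run it on every host):

```shell
# Check that the cluster host names resolve to the addresses in /etc/hosts
getent hosts hadoop1 hadoop2 hadoop3
# Check this node's own FQDN -- it should no longer carry a ".local" suffix
hostname -f
```

If `getent` prints nothing for a host, CM's host search will not find it by that name either.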

4. Continue and select a repository. Make sure you choose "Use Parcels": parcels are probably the easiest way for Cloudera Manager to manage the software on your cluster, since they automate the deployment and upgrade of service binaries. Not using parcels will require you to manually upgrade packages on all hosts.

5. Provide SSH login credentials. 

6. Install CDH packages or parcels, optionally including the Cloudera Impala and Cloudera Search Beta packages or parcels.

7. Configure Hadoop automatically and start the Hadoop services.

8. Inspect hosts for correctness and make sure all validations pass.

9. Choose the CDH4 services that you want to install on your cluster. 

10. Database Setup: choose the embedded database, and make sure you write down the usernames and passwords for all the databases.

11. Review the Configuration Changes to be applied.

12. Click Continue to proceed to the Cloudera Manager Services page.

13. Change the default administrator password: from the Administration tab, select Users and click the "change password" button.



 At this point you have finished the CDH and Cloudera Manager installation, and you are ready to test it.

How to run simple test:

To test the newly installed cluster, we run the WordCount example. WordCount is a simple application that counts the number of occurrences of each word in an input set. The jar file is already included in the CDH parcel:

# If you don't have the input file.
$ echo 'Hello World, Bye World!' > file01
$ echo 'Hello Hadoop, Goodbye to hadoop.' > file02
$ hadoop fs -mkdir /user/lxu/input
$ hadoop fs -copyFromLocal file0* /user/lxu/input
$ hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input output

If the job finishes without any errors, you will see output like:

13/08/09 16:07:50 INFO mapred.JobClient: Job complete: job_local223435910_0001
13/08/09 16:07:51 INFO mapred.JobClient: Counters: 25
13/08/09 16:07:51 INFO mapred.JobClient:   File System Counters
13/08/09 16:07:51 INFO mapred.JobClient:     FILE: Number of bytes read=429993
13/08/09 16:07:51 INFO mapred.JobClient:     FILE: Number of bytes written=715549
13/08/09 16:07:51 INFO mapred.JobClient:     FILE: Number of read operations=0
13/08/09 16:07:51 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/08/09 16:07:51 INFO mapred.JobClient:     FILE: Number of write operations=0
13/08/09 16:07:51 INFO mapred.JobClient:     HDFS: Number of bytes read=147
13/08/09 16:07:51 INFO mapred.JobClient:     HDFS: Number of bytes written=67
13/08/09 16:07:51 INFO mapred.JobClient:     HDFS: Number of read operations=23
13/08/09 16:07:51 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/08/09 16:07:51 INFO mapred.JobClient:     HDFS: Number of write operations=4
13/08/09 16:07:51 INFO mapred.JobClient:   Map-Reduce Framework
13/08/09 16:07:51 INFO mapred.JobClient:     Map input records=2
13/08/09 16:07:51 INFO mapred.JobClient:     Map output records=9
13/08/09 16:07:51 INFO mapred.JobClient:     Map output bytes=93
13/08/09 16:07:51 INFO mapred.JobClient:     Input split bytes=212
13/08/09 16:07:51 INFO mapred.JobClient:     Combine input records=9
13/08/09 16:07:51 INFO mapred.JobClient:     Combine output records=9
13/08/09 16:07:51 INFO mapred.JobClient:     Reduce input groups=8
13/08/09 16:07:51 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/08/09 16:07:51 INFO mapred.JobClient:     Reduce input records=9
13/08/09 16:07:51 INFO mapred.JobClient:     Reduce output records=8
13/08/09 16:07:51 INFO mapred.JobClient:     Spilled Records=18
13/08/09 16:07:51 INFO mapred.JobClient:     CPU time spent (ms)=0
13/08/09 16:07:51 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/08/09 16:07:51 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/08/09 16:07:51 INFO mapred.JobClient:     Total committed heap usage (bytes)=482869248
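You can cross-check those counters locally with plain coreutils: the two input lines contain 9 tokens and 8 distinct tokens, matching "Map output records=9" and "Reduce output records=8" above (LC_ALL=C makes the sort order match Hadoop's byte-order Text comparator):

```shell
# Count the words of the two sample input lines locally
printf 'Hello World, Bye World!\nHello Hadoop, Goodbye to hadoop.\n' \
    | tr ' ' '\n' | LC_ALL=C sort | uniq -c
#       1 Bye
#       1 Goodbye
#       1 Hadoop,
#       2 Hello
#       1 World!
#       1 World,
#       1 hadoop.
#       1 to
```

The same eight lines (tab-separated) are what `hadoop fs -cat output/part*` should show for the job.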


If you are interested, you can even try to word count the entire KJV Bible, which is a 4.2 MB text file.
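For a rough single-machine baseline to compare against the cluster timing below, you can time a plain coreutils word count on a similarly sized file (the KJV text itself is not bundled here, so this generates a ~4 MB stand-in):

```shell
# Generate a ~4 MB sample text file as a stand-in for the KJV download
yes 'In the beginning God created the heaven and the earth.' \
    | head -n 75000 > sample.txt
# Time a single-process word count of it ("the" should top the list)
time tr ' ' '\n' < sample.txt | LC_ALL=C sort | uniq -c | sort -rn | head -n 3
```

On a file this small, the single-process pipeline is competitive with the cluster; MapReduce's overhead only pays off on much larger inputs.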

My 3-node cluster took:

real 0m8.874s
user 0m13.073s
sys 0m0.392s

