Set Up Single-Node Hadoop Cluster

Hadoop Environment for your Big Data Journey

Prerequisites:

  • GNU/Linux is supported as a development and production platform (e.g. Ubuntu, Red Hat Enterprise Linux).
  • Java™ must be installed.
  • OpenSSH must be installed and sshd must be running in order to use the Hadoop scripts. You can check whether sshd is running with sudo systemctl status sshd.service, as shown below.
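
For example, on a systemd-based distribution you can verify both prerequisites with:

$ java -version
$ sudo systemctl status sshd.service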

Download the most recent stable release of the Apache Hadoop binaries from the Apache Hadoop releases page (https://hadoop.apache.org/releases.html).
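
As a rough sketch, assuming the 3.3.3 release used in the examples later in this guide (adjust the version and mirror URL to whatever release you actually download), the download and unpack steps could look like this; the last two commands place the distribution in /usr/local/hadoop, which the next section assumes:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
$ sudo tar -xzf hadoop-3.3.3.tar.gz -C /usr/local
$ sudo mv /usr/local/hadoop-3.3.3 /usr/local/hadoop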

Prepare to Start Single Node Hadoop Cluster

Unpack the downloaded Hadoop distribution into the /usr/local/hadoop directory. In the distribution, edit the file etc/hadoop/hadoop-env.sh to define the JAVA_HOME parameter as follows:

# set to the root of your Java installation
export JAVA_HOME=/path/to/java/installation
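
If you are not sure where Java lives on your system, one common way to locate it (assuming java is on your PATH) is:

$ readlink -f $(which java)

Strip the trailing /bin/java from the result to get the value for JAVA_HOME; the exact path depends on your distribution and JDK version.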

Next, create a file in the /etc/profile.d/ directory and add the following environment variables to it:

sudo vi /etc/profile.d/hadoop-env.sh
# Java environment
export JAVA_HOME=/usr/local/java
# Hadoop Environment
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
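
Files in /etc/profile.d/ are sourced on login, so either log out and back in or source the new file manually, then confirm the variables are visible:

$ source /etc/profile.d/hadoop-env.sh
$ echo $HADOOP_HOME
$ hadoop version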

Once these variables are set, you can use Hadoop in standalone mode. Try the following command to check that it is working properly:

$ bin/hadoop
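
This should display the usage documentation for the hadoop script. As an optional standalone-mode smoke test, you can run one of the bundled example jobs against local files; this is the same grep example used on HDFS later, so adjust the jar version to match your download:

$ cd $HADOOP_HOME
$ mkdir input
$ cp etc/hadoop/*.xml input
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*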

Pseudo-Distributed Operation

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

To run Hadoop in pseudo-distributed mode, specify the following configurations in core-site.xml and hdfs-site.xml.

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.proxyuser.[username].groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.[username].hosts</name>
        <value>*</value>
    </property>
</configuration>

Remember to replace [username] in the lines above with the local operating system user (e.g. logan) that will be used in Beeline.
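
For example, with the user logan these two properties would read:

    <property>
        <name>hadoop.proxyuser.logan.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.logan.hosts</name>
        <value>*</value>
    </property>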

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/logan/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/logan/data/dataNode</value>
    </property>
</configuration>

Set Up Passphraseless SSH

Check whether you can run ssh localhost successfully. If you cannot ssh to localhost without a passphrase, execute the following commands:

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
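
Afterwards, ssh localhost should log you in without prompting for a passphrase (type exit to leave the test session):

  $ ssh localhost
  $ exit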

You have now set the required properties; use the following instructions to run a MapReduce job locally. The setup done so far is sufficient for working with Hadoop, including running Hive and Spark on top of it. If you also want to see how YARN works and execute a job on YARN, see the YARN on Single Node section to configure YARN with Hadoop.

Now format the Hadoop filesystem:

$ hdfs namenode -format

If the above command completes successfully, start the NameNode and DataNode daemons:

$ start-dfs.sh
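
If the daemons came up correctly, the jps command that ships with the JDK should list NameNode, DataNode, and SecondaryNameNode processes:

$ jps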

Browse the web interface for the NameNode; by default it is available at:

NameNode - http://localhost:9870/

Now make the HDFS directories required to execute MapReduce jobs:

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/<username>

Copy the input files into the distributed filesystem:

$ hdfs dfs -mkdir input
$ hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml input

Run one of the MapReduce examples provided with the Hadoop distribution:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar grep input output 'dfs[a-z.]+'

View the output files on the distributed filesystem:

$ hdfs dfs -cat output/*
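
Alternatively, copy the output files from HDFS to the local filesystem and examine them there:

$ hdfs dfs -get output output
$ cat output/*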

NOTE: The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs). When you are done, stop the daemons with the following command:

$ stop-dfs.sh
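
If a daemon fails to start or a job misbehaves, those log files are the first place to look; they are named per daemon, roughly hadoop-<user>-<daemon>-<hostname>.log. For example, assuming the logan user from the earlier configuration:

$ tail -n 50 $HADOOP_HOME/logs/hadoop-logan-namenode-*.log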

YARN on Single Node

You can run a MapReduce job on YARN in pseudo-distributed mode by setting the parameters below and additionally running the ResourceManager and NodeManager daemons.

If you have completed the Pseudo-Distributed Operation section, you can proceed further.

To run YARN, configure the required parameters in the mapred-site.xml and yarn-site.xml files as follows:

etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Once the YARN and MapReduce properties are set, we are ready to run the YARN ResourceManager.

Start ResourceManager daemon and NodeManager daemon:

$ start-yarn.sh
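
Running jps again should now additionally show ResourceManager and NodeManager processes:

$ jps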

Now we can browse the web interface for the ResourceManager; by default it is available at:

ResourceManager - http://localhost:8088/

You can now re-run the MapReduce job from the previous section to see how it executes on YARN.
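
Note that the example job will fail if its output directory already exists in HDFS, so remove the previous output (or choose a new output path) before re-running:

$ hdfs dfs -rm -r output
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar grep input output 'dfs[a-z.]+'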

When you’re done, stop the daemons with:

$ stop-yarn.sh