Install Big Data tools (Hadoop, Spark, ...)

First version 14-Oct-2021

Necessary tools:

  • Java 8
  • Apache Hadoop
  • Apache Spark
  • Luigi
  • Apache Hive

In this guide, we will use two machines:

- master

- slave

INSTALL ON ALL MACHINES

These steps have to be done on every machine.

Check the IP address of each machine with ifconfig or ip addr show.

Open the hosts file:

sudo nano /etc/hosts

add these:

<master ip>     master
<slave ip>     slave
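For example, if (hypothetically) the master's IP is 192.168.1.10 and the slave's is 192.168.1.11, the two lines would be:

192.168.1.10     master
192.168.1.11     slave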

I. Java 8

Hadoop and Spark run on Java 8, so we will install Java first.

The easiest way to install Java is with SDKMAN:

$ curl -s "https://get.sdkman.io" | bash

Check that SDKMAN is installed:

$ sdk version

List the Java versions SDKMAN supports:

$ sdk list java

================================================================================
Available Java Versions
================================================================================
     12.ea.20-open                                                              
   * 11.0.1-zulu                                                                
   * 11.0.1-open                                                                
     10.0.2-zulu                                                                
     10.0.2-open                                                                
     9.0.7-zulu                                                                 
     9.0.4-open                                                                 
     8.0.192-zulu                                                               
     8.0.191-oracle                                                             
 > + 8.0.181-zulu

Choose a Java 8 version (I use Zulu):

$ sdk install java 8.0.192-zulu

Check the Java version:

$ java -version

Add the Java environment variables (you can set them in .bashrc, .profile, or /etc/environment; I recommend .bashrc). Adjust the path to the version you actually installed:

export JAVA_HOME=/usr/lib/jvm/8.0.282-zulu
export PATH=$PATH:$JAVA_HOME/bin
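Note that this path depends on how the JDK was installed. If you installed it with SDKMAN as above, the JDK lives under ~/.sdkman/candidates/java/ and SDKMAN keeps a current symlink to the default version, so a sketch assuming that layout would be:

export JAVA_HOME="$HOME/.sdkman/candidates/java/current"
export PATH=$PATH:$JAVA_HOME/bin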

II. Install SSH, PDSH

ref: thachpham.com/linux-webserver/huong-dan-ssh..

We will use SSH and PDSH to connect the machines for Hadoop and Spark.

sudo apt-get install ssh

and then install pdsh

sudo apt-get install pdsh

Add this line to .bashrc:

export PDSH_RCMD_TYPE=ssh

Try to SSH to localhost:

ssh localhost

III. Hadoop

Note: I mostly follow the tutorial below; you can watch the video along with this post.

youtube.com/watch?v=KN6uzfQ42zE&t=117s

INSTALL ON ALL MACHINES

Add this to .bashrc for quick access to the Hadoop folder:

export bh="/usr/local/bigdata/hadoop"

You can then access it with:

cd $bh

Step 0: turn off the firewall on all machines

https://phoenixnap.com/kb/how-to-enable-disable-firewall-ubuntu

check status:

sudo ufw status verbose

disable:

sudo ufw disable

enable:

sudo ufw enable

reset:

sudo ufw reset

Step 1

Create the bigdata folder. We will not put it in the user's home directory, but in a system directory instead:

$ sudo mkdir /usr/local/bigdata

create folder hadoop:

$ sudo mkdir /usr/local/bigdata/hadoop

Step 2

Give your user ownership of the bigdata folder:

sudo chown <username>:root -R /usr/local/bigdata/

Check (for example with ls -l /usr/local); you should see something like:

drwxr-xr-x  4 <username> root 4096 Feb 14 01:00 bigdata

Step 3

Next, add the Hadoop paths: open .bashrc and paste this:

export HADOOP_HOME="/usr/local/bigdata/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar
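After editing .bashrc, reload it so the new variables are available in your current shell:

source ~/.bashrc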

Step 4:

Now go to hadoop.apache.org, download the tar file (I use version 3.2.2), and extract it:

tar -zxvf ~/Downloads/hadoop-3.2.2.tar.gz

Move the extracted files to the bigdata folder:

mv ~/Downloads/hadoop-3.2.2/* /usr/local/bigdata/hadoop
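To confirm the move worked, list the folder; you should see Hadoop's usual layout (bin, etc, sbin, share, ...):

ls /usr/local/bigdata/hadoop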

Step 5: CONFIGURE HADOOP

Note: you can configure Hadoop on the master machine only and then copy its settings to the slaves with:

scp /usr/local/bigdata/hadoop/etc/hadoop/* <user-name>@slave:/usr/local/bigdata/hadoop/etc/hadoop/

cd to /usr/local/bigdata/hadoop/etc/hadoop/; this is Hadoop's configuration folder.

* hadoop-env.sh

Find JAVA_HOME and change it to:

export JAVA_HOME=/usr/lib/jvm/8.0.282-zulu

* core-site.xml

add these:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.proxyuser.dataflair.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.dataflair.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.server.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.server.groups</name>
        <value>*</value>
    </property>
</configuration>

* hdfs-site.xml

add these:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/bigdata/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/bigdata/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

Note: the tmp directory is where HDFS stores its data; it will be created automatically when we format the namenode with <hadoop>/bin/hdfs namenode -format.

* mapred-site.xml (since we use Spark to process the data we don't really need MapReduce, but we configure it anyway)

add these:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

* yarn-site.xml

add these:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME</value>
    </property>
</configuration>

Run this to edit the workers file:

sudo nano /usr/local/bigdata/hadoop/etc/hadoop/workers

(screenshot: the workers file listing the worker hostnames)
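As a rough sketch, for this two-machine setup the workers file could simply list both hostnames (list only slave if you don't want the master to also run a DataNode):

master
slave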

Step 6:

run this:

/usr/local/bigdata/hadoop/bin/hdfs namenode -format

This will create the tmp folder and format the namenode.

INSTALL ON THE MASTER NODE

Create an SSH key:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Add the public key to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Set mode 600 on authorized_keys:

chmod 0600 ~/.ssh/authorized_keys

Note: disable password authentication, since we will use SSH keys to connect the machines.

Open /etc/ssh/sshd_config, then find and set these:

PasswordAuthentication no
UsePAM no
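For the change to take effect, restart the SSH service (on Ubuntu the service is usually called ssh):

sudo systemctl restart ssh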

Then try to connect to the master:

ssh master

Connect to the slave

Run this; it copies the master's public key to the slave:

ssh-copy-id <slave user>@slave

If the user has a password, type it.

Then try to connect to the slave with:

ssh slave

Open the hostname file:

sudo nano /etc/hostname

and set this machine's hostname to:

master

INSTALL ON THE SLAVE NODE

Open the hostname file:

sudo nano /etc/hostname

and set this machine's hostname to:

slave

Lastly, on the MASTER, try to start HDFS.

Go to the Hadoop directory:

cd $bh

You can start the services manually:

HDFS:

sbin/start-dfs.sh

YARN:

sbin/start-yarn.sh

or start everything at once with:

sbin/start-all.sh
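To shut the cluster down later, the matching stop scripts are in the same sbin folder:

sbin/stop-dfs.sh
sbin/stop-yarn.sh

or simply:

sbin/stop-all.sh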

Note: if you run into any problem, look at the logs under $bh/logs/. Each component writes its own log file, so open the one for the failing component and inspect the error.

After all of that, you should see something like this:

(screenshot: terminal output of the start scripts)

Check that everything is OK with jps:

(screenshot: jps output on the master)

Go to the SLAVE with:

ssh slave

and run jps; you should see this:

(screenshot: jps output on the slave, showing the DataNode and NodeManager)

On the MASTER, go to http://master:9870 (the NameNode web UI), where you can check whether all nodes are up:

(screenshot: the NameNode web UI showing the live datanodes)

To summarize: the NameNode is a master daemon and runs only on the MASTER machine; on the SLAVE machine we only see a DataNode (a worker) and a NodeManager.
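As a quick smoke test, you can create a directory in HDFS from the master and list it:

hdfs dfs -mkdir -p /user/test
hdfs dfs -ls /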

Hadoop works like this:

(diagram: Hadoop architecture)

IV. Spark

Next, we will install Spark.

INSTALL ON ALL MACHINES

Step 1:

Go to https://spark.apache.org/downloads.html and download a suitable version.

(screenshot: the Spark download page)

Step 2:

Create the folder /usr/local/bigdata/spark/ and extract the downloaded files into it.

Then open .bashrc and add this:

export SPARK_HOME="/usr/local/bigdata/spark"
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

and this (for quick access to the Spark folder):

export bs="/usr/local/bigdata/spark"

Step 3:

Go to the Spark configuration folder:

cd $bs/conf

Copy the templates to new files so we can edit them (in recent Spark releases the second template is named workers.template instead of slaves.template, so adjust accordingly):

cp spark-env.sh.template spark-env.sh
cp slaves.template slaves

then edit spark-env.sh and add these:

export SPARK_MASTER_HOST='<MASTER-IP>'
export JAVA_HOME=<Path_of_JAVA_installation>
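For example (these values are only placeholders; use your own master IP and JDK path):

export SPARK_MASTER_HOST='192.168.1.10'
export JAVA_HOME=/usr/lib/jvm/8.0.282-zulu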

Then edit slaves (or workers) and add these hostnames (the names we mapped to the machines' IPs above):

master
slave

Step 4:

On the MASTER machine, run this:

$bs/sbin/start-all.sh

Note: you can check $bs/logs if any error occurs.

If everything works fine, you should see this:

(screenshot: output on the master after starting Spark)

SSH to the slave:

(screenshot: jps output on the slave)

You can see that a Worker process appears on the SLAVE machine.

Go to http://master:8080 (the Spark master web UI); you should see something like this:

(screenshot: the Spark master web UI showing the registered workers)
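As a final check, you can open a Spark shell against the cluster (7077 is Spark's default master port):

spark-shell --master spark://master:7077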

~~~~~~~~~~~~~~

References:

Hadoop (set up a multi-node cluster): medium0.com/@jootorres_11979/how-to-set-up-..

scp: linuxize.com/post/how-to-use-scp-command-to..

Spark: medium0.com/ymedialabs-innovation/apache-sp..

SSH: askubuntu.com/questions/4830/easiest-way-to..