
Instructions for building Spark, Hive, and Hadoop Docker images from scratch and setting up virtual clusters on a local machine


John-CYHui/BigData-Docker-Images


Introduction

The purpose of this repository is to set up virtual clusters on a single machine for learning big-data tools.

The cluster design is 1 master + 2 slaves.

1. Build Hadoop Docker Image

1.1 Hadoop Installation

  1. Pull the ubuntu image from Docker Hub

    docker pull ubuntu
  2. Run the ubuntu image and mount a local folder into the container

    docker run -it -v C:\hadoop\build:/root/build --name ubuntu ubuntu
  3. Enter the ubuntu container, update the system and install the necessary components

    # Update system
    apt-get update
    # Install vim
    apt-get install vim
    # Install ssh (needed to connect to the slaves)
    apt-get install ssh
  4. Add the following to ~/.bashrc so the sshd service starts automatically

    vim ~/.bashrc
    # Add the following
    /etc/init.d/ssh start
    # save and quit
    :wq
  5. Setup ssh so the master can connect to the workers without a password

    # Run the following and keep pressing Enter to accept the defaults
    ssh-keygen -t rsa
    # Append the public key to authorized_keys
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  6. Test ssh connection

    ssh localhost

  7. Install Java JDK 8

    # Use JDK 8 because Hive only supports up to JDK 8
    apt-get install openjdk-8-jdk
    # Setup environment variables
    vim ~/.bashrc
    # Add the following
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    export PATH=$PATH:$JAVA_HOME/bin
    # save and quit
    :wq
    # Make .bashrc effective immediately
    source ~/.bashrc
  8. Save this container as a jdkinstalled image for later use

    # Commit the ubuntu container and name the new image ubuntu/jdkinstalled
    docker commit ubuntu ubuntu/jdkinstalled
  9. Run the ubuntu/jdkinstalled image and mount the local folder into the container

    docker run -it -v C:\hadoop\build:/root/build --name ubuntu-jdkinstalled ubuntu/jdkinstalled
  10. Download the Hadoop installation package from Apache (I use Hadoop 3.3.3) and put it in C:\hadoop\build

    https://dlcdn.apache.org/hadoop/common/stable/

  11. Enter the jdkinstalled container

    docker exec -it ubuntu-jdkinstalled bash
    # switch to build and look for the installation package
    cd /root/build
    # extract the package to /usr/local
    tar -zxvf hadoop-3.3.3.tar.gz -C /usr/local
  12. Switch to the Hadoop folder

    # switch to hadoop folder
    cd /usr/local/hadoop-3.3.3/
    # Test if hadoop is running properly
    ./bin/hadoop version
  13. Setup Hadoop environment variables (a quick check follows this list)

    vim /etc/profile
    # Add the following
    export HADOOP_HOME=/usr/local/hadoop-3.3.3
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    # save and quit
    :wq
    
    vim ~/.bashrc
    # Add the following
    source /etc/profile
    # save and quit
    :wq
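
A quick sanity check that the environment variables are picked up (a minimal sketch; run it in a fresh shell inside the container):

    # Reload the profile and confirm hadoop resolves from anywhere
    source ~/.bashrc
    hadoop version
    echo $JAVA_HOME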

1.2 Setup Hadoop clusters

  1. Switch to Hadoop configuration path

    cd /usr/local/hadoop-3.3.3/etc/hadoop/
  2. Setup hadoop-env.sh

    vim hadoop-env.sh
    
    # Add the following
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    export HDFS_NAMENODE_USER="root"
    export HDFS_DATANODE_USER="root"
    export HDFS_SECONDARYNAMENODE_USER="root"
    export YARN_RESOURCEMANAGER_USER="root"
    export YARN_NODEMANAGER_USER="root"
    
    # save and quit
    :wq
  3. Setup core-site.xml

    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/usr/local/hadoop/data</value>
            <description>A base for other temporary directories.</description>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
        </property>
        <property>
            <name>hadoop.http.staticuser.user</name>
            <value>root</value>
        </property>
        <property>
            <name>hadoop.proxyuser.root.hosts</name>
            <value>*</value>
        </property>
        <property>
            <name>hadoop.proxyuser.root.groups</name>
            <value>*</value>
        </property>
        <property>
            <name>fs.trash.interval</name>
            <value>1440</value>
        </property>
    </configuration>
  4. Setup hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/hadoop/namenode_dir</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/hadoop/datanode_dir</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>slave01:9868</value>
        </property>
    </configuration>
  5. Setup mapred-site.xml

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>master:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master:19888</value>
        </property>
        <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
    </configuration> 
  6. Setup yarn-site.xml

    <configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
        </property>
        <property>
            <name>yarn.nodemanager.pmem-check-enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <property>
            <name>yarn.log.server.url</name>
            <value>http://master:19888/jobhistory/logs</value>
        </property>
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
        </property>
     </configuration>
  7. Save this container as a hadoop-installed image for later use

    docker commit ubuntu-jdkinstalled ubuntu/hadoopinstalled
  8. Open 3 terminals and enter the following commands separately

    Terminal 1 (master):

    # 8088 is the default WEB UI port for YARN
    # 9870 is the default WEB UI port for HDFS
    # 9864 is for inspecting file content inside HDFS WEB UI
    docker run -it -h master  -p 8088:8088 -p 9870:9870 -p 9864:9864 --name master ubuntu/hadoopinstalled 

    Terminal 2 (slave01):

    docker run -it -h slave01 --name slave01 ubuntu/hadoopinstalled

    Terminal 3 (slave02):

    docker run -it -h slave02 --name slave02 ubuntu/hadoopinstalled
  9. In every terminal, look up the IP address of each container and add all three IP/hostname pairs to /etc/hosts in every container (see the sketch after this list).

    # Edit the hosts file and add the IP/hostname of master, slave01 and slave02
    vim /etc/hosts

  10. Open the workers file in the master container

    vim /usr/local/hadoop-3.3.3/etc/hadoop/workers
    # Replace localhost with the following (I list master in workers so it also runs a DataNode and a NodeManager)
    master
    slave01
    slave02

  11. Test whether master can ssh into slave01 and slave02 without a password

    # Need to enter yes for first time connection
    ssh slave01
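
A hedged sketch of how to look up each container's IP from the host and what the resulting /etc/hosts entries look like (the 172.17.0.x addresses are only example values; Docker assigns them, so check your own):

    # On the host: print the bridge-network IP of each container
    docker inspect -f '{{.NetworkSettings.IPAddress}}' master
    docker inspect -f '{{.NetworkSettings.IPAddress}}' slave01
    docker inspect -f '{{.NetworkSettings.IPAddress}}' slave02

    # Then, inside every container, /etc/hosts should contain entries like:
    # 172.17.0.2 master
    # 172.17.0.3 slave01
    # 172.17.0.4 slave02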

1.3 Start Hadoop clusters

  1. Format the HDFS filesystem when starting for the first time

    hdfs namenode -format
    
  2. Start the HDFS and YARN clusters

    start-all.sh
    
  3. Run jps in each container to check that the corresponding Java processes have started successfully (see the expected output after this list)


  4. To properly use the WEB UI from the host machine, modify the local hosts file.

    For Windows:

    Open C:\Windows\System32\drivers\etc\hosts
    
    # Add the following
    127.0.0.1 master
    127.0.0.1 slave01
    127.0.0.1 slave02
    
    # Save the changes
  5. Open the YARN WEB UI

    http://localhost:8088/

  6. Open the HDFS WEB UI

    http://localhost:9870/
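
For reference, based on the configuration above (all three hosts are listed in workers, and the secondary NameNode is placed on slave01), jps should list roughly the following daemons; exact process ids will differ:

    jps
    # Expected on master:   NameNode, ResourceManager, DataNode, NodeManager
    # Expected on slave01:  SecondaryNameNode, DataNode, NodeManager
    # Expected on slave02:  DataNode, NodeManager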

1.4 TODO

  1. HDFS WEB UI file upload has a problem


  2. HDFS WEB UI folder creation has a problem


1.5 Test Hadoop grep example

# Switch to the Hadoop installation folder
cd /usr/local/hadoop-3.3.3

# Make input folder in HDFS
hdfs dfs -mkdir -p /user/hadoop/input

# Upload the local hadoop xml files to HDFS
hdfs dfs -put ./etc/hadoop/*.xml /user/hadoop/input

# Use ls to check if the files were uploaded to HDFS properly
hdfs dfs -ls /user/hadoop/input

# Run the grep example: read every file in the HDFS input folder, find words matching dfs[a-z.]+ and write their counts to the output folder
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/input output 'dfs[a-z.]+'

# Check results
hdfs dfs -cat output/*

2. Build Hive Docker Image

2.1 Hive Installation

  1. Run Hadoop image

    docker run -it -h master -p 8088:8088 -p 9870:9870 -p 9864:9864 --name master ubuntu/hadoopinstalled
  2. Download Apache Hive installation package

    https://dlcdn.apache.org/hive/hive-3.1.3/

  3. Copy Hive installation package to master container

    docker cp apache-hive-3.1.3-bin.tar.gz master:/usr/local/
  4. Install sudo

    apt-get install sudo -y
  5. Install net-tools

    sudo apt install net-tools
    
  6. Install mysql-server for the metastore

    # Install mysql-server 
    sudo apt install mysql-server -y
    
    # Set the mysql user's home directory (avoids a warning when starting mysql)
    usermod -d /var/lib/mysql mysql
    
    # Check mysql version
    mysql --version


  7. Extract Hive package

    # Enter master container
    docker exec -it master bash
    
    # Change to package location
    cd /usr/local/
    
    # Extract Hive package
    tar -zxvf apache-hive-3.1.3-bin.tar.gz
    
    # Rename to hive
    mv apache-hive-3.1.3-bin hive
    
    # Remove Hive installation package
    rm apache-hive-3.1.3-bin.tar.gz
    
    # Give root ownership of the hive folder
    sudo chown -R root:root hive
  8. Setup Hive environment variables

    # Open ~/.bashrc
    vim ~/.bashrc
    
    # Add the following:
    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:$HIVE_HOME/bin
    export HADOOP_HOME=/usr/local/hadoop-3.3.3
    
    # save and quit
    :wq
    
    # make .bashrc effective immediately
    source ~/.bashrc
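
With the environment variables in place, a quick check that Hive resolves from the PATH (it should report Hive 3.1.3; the metastore is not needed for this):

    hive --version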

2.2 Setup Hive clusters

  1. Change to Hive configuration folder

    cd /usr/local/hive/conf
  2. Use hive-default.xml template

    mv hive-default.xml.template hive-default.xml
    
  3. Setup hive-site.xml

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;allowPublicKeyRetrieval=true</value>
            <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.cj.jdbc.Driver</value>
            <description>Driver class name for a JDBC metastore</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>hive</value>
            <description>username to use against metastore database</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>hive</value>
            <description>password to use against metastore database</description>
        </property>
        <property>
            <name>hive.server2.thrift.bind.host</name>
            <value>master</value>
            <description>HiveServer2 bind host</description>
        </property>
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://master:9083</value>
            <description>metastore URI</description>
        </property>
        <property>
            <name>hive.metastore.event.db.notification.api.auth</name>
            <value>false</value>
            <description>disable metastore event DB notification API authorization</description>
        </property>
    </configuration>
  4. Start the mysql service and open a mysql shell

    # Start the mysql service
    service mysql start
    
    # Enter mysql as the root user (no password is set by default; press Enter at the prompt)
    mysql -u root -p
  5. Enter the following commands

    create database hive;
    
    create user hive@localhost identified by 'hive';
    
    GRANT ALL PRIVILEGES ON *.* TO hive@localhost with grant option;
    
    FLUSH PRIVILEGES;
  6. Press Ctrl + D to quit mysql

  7. Remove the log4j-slf4j jar from Hive, as it conflicts with the copy shipped with Hadoop

    rm /usr/local/hive/lib/log4j-slf4j-impl-2.17.1.jar
  8. Download the MySQL Connector/J driver (my mysql version is 8.0.29)

    https://dev.mysql.com/downloads/connector/j/

  9. Copy to the master container and install

    # Copy to master container
    docker cp mysql-connector-java_8.0.29-1ubuntu21.10_all.deb master:/usr/local/
    
    # Switch to /usr/local
    cd /usr/local
    
    # Install the deb package (it places the jar under /usr/share/java)
    dpkg -i mysql-connector-java_8.0.29-1ubuntu21.10_all.deb
    
    # Copy jar file to Hive library
    cp /usr/share/java/mysql-connector-java-8.0.29.jar /usr/local/hive/lib/
  10. Commit the container to a new image

    docker commit master ubuntu/hiveinstalled
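
Note: this walkthrough appears to rely on Hive creating its metastore tables on first use. If starting the metastore in section 2.3 fails because the schema is missing, Hive's schematool (bundled under /usr/local/hive/bin) can initialize it explicitly against the MySQL database configured above; a hedged sketch, only needed if the tables were not created automatically:

    # Run once inside the master container, with mysql running;
    # uses the connection settings from hive-site.xml
    schematool -dbType mysql -initSchema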

2.3 Start Hive clusters

  1. Open 3 terminals and enter the following commands separately

    Terminal 1 (master):

    docker run -it -h master  -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 10000:10000 -p 8080:8080 --name master ubuntu/hiveinstalled

    Terminal 2 (slave01):

    docker run -it -h slave01 --name slave01 ubuntu/hadoopinstalled

    Terminal 3 (slave02):

    docker run -it -h slave02 --name slave02 ubuntu/hadoopinstalled
  2. In every terminal, look up the IP address of each container and add all three IP/hostname pairs to /etc/hosts in every container (same as section 1.2).

    # Edit the hosts file and add the IP/hostname of master, slave01 and slave02
    vim /etc/hosts

  3. Open the workers file in the master container

    vim /usr/local/hadoop-3.3.3/etc/hadoop/workers
    # Replace localhost with the following (I list master in workers so it also runs a DataNode and a NodeManager)
    master
    slave01
    slave02

  4. Enter the master container and start mysql

    # Enter master container
    docker exec -it master bash
    
    # Start mysql
    service mysql start
  5. Start HDFS service

    # Start HDFS
    start-dfs.sh
    
    # Create the following folders
    hadoop fs -mkdir /tmp
    hadoop fs -mkdir -p /user/hive/warehouse
    
    # Grant group write permission
    hadoop fs -chmod g+w /tmp
    hadoop fs -chmod g+w /user/hive/warehouse
  6. Start metastore service and hiveserver2

    # Start metastore
    nohup hive --service metastore > nohup.out &
    # Use net-tools to check if metastore has started properly
    netstat -nlpt |grep 9083
    
    # Start hiveserver2
    nohup hiveserver2 > nohup2.out &
    # Use net-tools to check if hiveserver2 has started properly
    netstat -nlpt|grep 10000
    

    If both have started properly, netstat should show a process listening on port 9083 (metastore) and on port 10000 (hiveserver2).

  7. Test hiveserver2 with beeline client

    # Enter beeline in terminal
    beeline
    
    # Enter the following to connect to hiveserver2
    !connect jdbc:hive2://master:10000
    
    # Enter root as username; no need to enter password

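
Beyond just connecting, a small smoke test can be run in the same beeline session. test_tbl is an arbitrary example table name; the INSERT launches a MapReduce job on YARN, so it can take a while:

    CREATE TABLE test_tbl (id INT, name STRING);
    INSERT INTO test_tbl VALUES (1, 'hadoop'), (2, 'hive');
    SELECT * FROM test_tbl;
    DROP TABLE test_tbl;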

3. Build Spark Docker Image

3.1 Spark Installation

  1. Download Spark from Apache Spark (I use spark-3.3.0 for this setup)

    https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

  2. Install the Spark package in the master container

    # Copy Spark installation package from local machine to master container
    docker cp spark-3.3.0-bin-hadoop3.tgz master:/usr/local/
    
    # Enter master container
    docker exec -it master bash
    
    # Change to /usr/local
    cd /usr/local
    
    # Extract package
    sudo tar -zxf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/
    
    # rm package
    rm spark-3.3.0-bin-hadoop3.tgz
    
    # Rename Spark folder
    sudo mv ./spark-3.3.0-bin-hadoop3/ ./spark
  3. Download Anaconda3 Linux distribution

    https://www.anaconda.com/products/distribution#Downloads

  4. Install Anaconda3 in master container (I use Anaconda3-2022.05 for this setup)

    # Copy Anaconda3 installation package from local machine to master container
    docker cp Anaconda3-2022.05-Linux-x86_64.sh master:/usr/local/
    
    # Enter master container
    docker exec -it master bash
    
    # Change to /usr/local
    cd /usr/local
    
    # Install Anaconda3 package
    sh ./Anaconda3-2022.05-Linux-x86_64.sh
    # Type yes
    
    # Enter installation path
    /usr/local/anaconda3
    # Type yes
    
    # Remove Anaconda3 package
    rm Anaconda3-2022.05-Linux-x86_64.sh
  5. Setup a Python virtual environment named "pyspark"

    conda create -n pyspark python=3.8
    # Type y
    
    # Activate pyspark virtual env
    conda activate pyspark


  6. Install necessary python packages

    conda install pyspark
    conda install pyhive
  7. Add the following to /etc/profile

    # Open /etc/profile
    vim /etc/profile
    
    # Make sure the following lines are in the file
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    
    export HADOOP_HOME=/usr/local/hadoop-3.3.3
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    
    export SPARK_HOME=/usr/local/spark
    export PYSPARK_PYTHON=/usr/local/anaconda3/envs/pyspark/bin/python
    
    # Save and quit
    :wq
  8. Configure workers file

    # Switch to spark configuration folder
    cd /usr/local/spark/conf
    
    # cp workers template
    cp workers.template workers
    
    # Open workers file
    vim workers
    
    # Change localhost to the following:
    master
    slave01
    slave02
    
    # Save and quit
    :wq
  9. Configure spark-env.sh file

    # Open spark-env.sh file
    vim spark-env.sh
    
    # Enter the following
    HADOOP_CONF_DIR=/usr/local/hadoop-3.3.3/etc/hadoop
    YARN_CONF_DIR=/usr/local/hadoop-3.3.3/etc/hadoop
    export SPARK_MASTER_HOST=master
    export SPARK_MASTER_PORT=7077
    SPARK_MASTER_WEBUI_PORT=8080
    SPARK_WORKER_PORT=7078
    SPARK_WORKER_WEBUI_PORT=8081
    SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://master:9000/sparklog/ -Dspark.history.fs.cleaner.enabled=true -Dspark.history.ui.port=18080"
    
    # Save and quit
    :wq
  10. Create sparklog folder in HDFS

    hadoop fs -mkdir /sparklog
    hadoop fs -chmod 777 /sparklog
  11. Configure spark-defaults.conf file

    # Copy spark-defaults.conf template
    cp spark-defaults.conf.template spark-defaults.conf
    
    # Open the file
    vim spark-defaults.conf
    
    # Enter the following
    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://master:9000/sparklog/
    spark.eventLog.compress true
    
    # Save and quit
    :wq
  12. Configure log4j2.properties file

    # Copy log4j2.properties template
    cp log4j2.properties.template log4j2.properties
    
    # Open the file
    vim log4j2.properties
    
    # Change rootLogger.level to warn
    rootLogger.level = warn
    
    # Save and quit
    :wq
  13. Commit the container to a new image

    docker commit master ubuntu/sparkinstalled
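
Before (or after) committing, a quick sanity check inside the master container that Spark and the conda environment are wired up (a minimal sketch, assuming the paths used above):

    # Spark itself
    /usr/local/spark/bin/spark-submit --version

    # The pyspark conda environment
    conda activate pyspark
    python -c "import pyspark; print(pyspark.__version__)"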
    

3.2 Start Spark clusters

  1. Open 3 terminals and enter the following commands separately

    Terminal 1 (master):

    docker run -it -h master  -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 10000:10000 -p 8080:8080 -p 18080:18080 -p 4040:4040 -p 19888:19888 --name master ubuntu/sparkinstalled
  2. Terminal 2 (slave01)

    docker run -it -h slave01 --name slave01 ubuntu/sparkinstalled
  3. Terminal 3 (slave02)

    docker run -it -h slave02 --name slave02 ubuntu/sparkinstalled
  4. Enter the master container and start the Spark cluster

    # Enter master container
    docker exec -it master bash
    
    # Make sure to start the Hadoop cluster first
    start-all.sh
    
    # Follow the 2.3 instructions to start the Hive services (for Spark on Hive)
    
    # Switch to spark sbin folder
    cd /usr/local/spark/sbin
    
    # Start spark clusters
    ./start-all.sh
    
    # Start history server
    ./start-history-server.sh
    
    # Start mr-jobhistory server
    cd /usr/local/hadoop-3.3.3/sbin
    mr-jobhistory-daemon.sh start historyserver
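
Once everything is up, a quick way to confirm the Spark daemons are running and reachable (ports 8080 and 18080 are mapped in the docker run command above):

    # Master, Worker and HistoryServer should appear alongside the Hadoop daemons
    jps

    # Spark master Web UI
    # http://localhost:8080/
    # Spark history server Web UI
    # http://localhost:18080/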

3.3 Test Spark Standalone mode

cd /usr/local/spark
# Start pyspark in Standalone mode
bin/pyspark --master spark://master:7077
# Type the following
sc.parallelize([1,2,3,4,5]).map(lambda x: x + 1).collect()

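
The same standalone cluster can also run a batch job. The Pi example ships with the Spark distribution, so this sketch should work as-is; once it finishes, the run should show up in the history server at http://localhost:18080/:

    cd /usr/local/spark
    # Submit the bundled Pi example to the standalone master
    bin/spark-submit --master spark://master:7077 examples/src/main/python/pi.py 10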

4. Docker image links

  1. Hadoop ready image

    https://hub.docker.com/repository/docker/johnhui/hadoopinstalled

  2. Hive ready image

    https://hub.docker.com/repository/docker/johnhui/hiveinstalled

  3. Spark ready image

    https://hub.docker.com/repository/docker/johnhui/sparkinstalled
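
Assuming these images are public on Docker Hub under the names above (with the default latest tag), they can be pulled directly instead of building from scratch, e.g.:

    docker pull johnhui/sparkinstalled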
