
Report of the Practical Course on High-Performance Computing

1 Project notes

1.1 YouTube link

Executing the project on a local computer: https://www.youtube.com/watch?v=2siZQBvRPuY&t=6s

git clone https://github.com/scofild429/go_mpi_network.git
cd go_mpi_network/goai
# uncomment one of the cases in the main function of myai.go
go build
./goai

2 Datasets

All training data are divided equally among the training networks, which is especially important for the MPI versions.
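A minimal sketch of this even split; the names are hypothetical, with rank and size coming from the MPI communicator and data being the full training set:

// shard returns the part of the training data owned by this rank,
// splitting the samples as evenly as possible across all ranks.
func shard(data [][]float64, rank, size int) [][]float64 {
        per := len(data) / size
        lo := rank * per
        hi := lo + per
        if rank == size-1 {
                hi = len(data) // the last rank takes any remainder
        }
        return data[lo:hi]
}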

3 Configuration example

  • ./goai/.irisenv
  • ./goai/.imgenv

For example, .irisenv (4 input features, 3 classes):
inputdataDims=4
inputLayerNeurons=30
hiddenLayerNeurons=20
outputLayerNeurons=3
labelOnehotDims=3
numEpochs=100
learningRate=0.01
batchSize=4
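A sketch of reading such an env file in Go; this assumes the github.com/joho/godotenv package and is not necessarily how the repository loads its configuration:

package main

import (
        "fmt"
        "os"
        "strconv"

        "github.com/joho/godotenv"
)

func main() {
        // Load the key=value pairs from .irisenv into the process environment.
        if err := godotenv.Load(".irisenv"); err != nil {
                panic(err)
        }
        numEpochs, _ := strconv.Atoi(os.Getenv("numEpochs"))
        learningRate, _ := strconv.ParseFloat(os.Getenv("learningRate"), 64)
        fmt.Println("epochs:", numEpochs, "learning rate:", learningRate)
}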

4 Submitting the job on the cluster

Without Singularity, installing Go 1.18 on the cluster always failed.

Instead, the project is built locally with go build and the resulting goai binary is transferred to the cluster.

#!/bin/bash
#SBATCH --job-name mpi-go-neural-network
#SBATCH -N 1
#SBATCH -p fat
#SBATCH -n 20
#SBATCH --time=01:30:00

module purge
module load openmpi

mpirun -n 20 ./goai
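Assuming the script is saved as job.sh (the filename is hypothetical), it is submitted and monitored with the usual Slurm commands:

sbatch job.sh        # submit the job
squeue -u $USER      # check its queue status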

5 Deep learning’s problem

As AI moves to deep learning, computing resources become increasingly critical for the training process.

Applications:

  • Image Classification
  • NLP
  • Semantic segmentation

Solutions:

  • GPU
  • TPU
  • Distributed learning

6 Single network architecture

Training data -> inputLayer(w1, b1) -> dinputLayer
Normalization
dinputLayer -> hiddenLayer(w2, b2) -> dhiddenLayer
Normalization
dhiddenLayer -> outputLayer(w3, b3) -> doutputLayer

Loss (L2) = (doutputLayer - onehotLabel)^2

Backpropagation from the loss  of the output layer to w3, b3
Backpropagation from the error of the hidden layer to w2, b2
Backpropagation from the error of the input layer  to w1, b1

Uses the derivative of the sigmoid, normalization, and standardization (see the sketch after the following list).

  • Stochastic Gradient Descent (SGD)
  • Mini-batch Gradient Descent (MBGD)
  • Batch Gradient Descent (BGD)
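A minimal sketch of one dense layer with a sigmoid activation, plus the sigmoid derivative used in backpropagation. This is plain Go with hypothetical names, not the repository's actual implementation:

package main

import (
        "fmt"
        "math"
)

// sigmoid squashes x into (0, 1).
func sigmoid(x float64) float64 {
        return 1.0 / (1.0 + math.Exp(-x))
}

// sigmoidPrime is the derivative of the sigmoid, expressed
// through its own output a = sigmoid(x): a * (1 - a).
func sigmoidPrime(a float64) float64 {
        return a * (1.0 - a)
}

// forward computes sigmoid(W*x + b) for one dense layer.
func forward(w [][]float64, b, x []float64) []float64 {
        out := make([]float64, len(w))
        for i, row := range w {
                sum := b[i]
                for j, wij := range row {
                        sum += wij * x[j]
                }
                out[i] = sigmoid(sum)
        }
        return out
}

func main() {
        w := [][]float64{{0.1, 0.2}, {0.3, 0.4}}
        b := []float64{0.0, 0.1}
        x := []float64{1.0, 2.0}
        a := forward(w, b, x)
        fmt.Println("activations:", a)
        fmt.Println("derivative at a[0]:", sigmoidPrime(a[0]))
}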

7 Illustration of weight updating

./png/NeuralNetwork.png

8 Code implementation

func main() {
        // Choose exactly one of the following cases; comment out the rest.
        singlenode.Single_node_iris(true)
        mpicode.Mpi_iris_Allreduce()
        mpicode.Mpi_iris_SendRecv()
        mpicode.Mpi_images_Allreduce()
        mpicode.Mpi_images_SendRecv()
}

Review the code and choose one of these cases to execute in the main function of ./goai/myai.go.

For comparison with Python:

  • ./pytorchDemo/irisfromscratch.py
  • ./pytorchDemo/iriswithpytorch.py
  • ./pytorchDemo/logisticRcuda.py

9 Network performance (iris dataset)

9.1 Loss

./png/single_node_loss.png

9.2 Accuracy

./png/single_node_acc.png

10 MPI communication

The MPI bindings come from github.com/sbromberger/gompi, which wraps the C MPI API via cgo (import "C"). The mapping used here (a usage sketch follows this list):

  • Collective
    • gompi.BcastFloat64s() -> C.MPI_Bcast()
    • gompi.AllreduceFloat64s() -> C.MPI_Allreduce()
  • Non-collective
    • gompi.SendFloat64s() -> C.MPI_Send()
    • gompi.SendFloat64() -> C.MPI_Send()
    • gompi.RecvFloat64s() -> C.MPI_Recv()
    • gompi.RecvFloat64() -> C.MPI_Recv()
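A minimal usage sketch. gompi.Start/Stop and NewCommunicator follow the gompi README, but the exact signatures of the typed send/receive calls here are assumptions based on the mapping above, so consult the gompi documentation:

package main

import (
        "fmt"

        "github.com/sbromberger/gompi"
)

func main() {
        gompi.Start(false) // initialize MPI without thread support
        defer gompi.Stop()

        comm := gompi.NewCommunicator(nil) // nil -> MPI_COMM_WORLD
        rank := comm.Rank()

        if rank == 0 {
                weights := []float64{0.1, 0.2, 0.3}
                comm.SendFloat64s(weights, 1, 0) // to rank 1, tag 0
        } else if rank == 1 {
                weights, _ := comm.RecvFloat64s(0, 0) // from rank 0, tag 0
                fmt.Println("rank 1 received", weights)
        }
}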

11 Non-collective architecture

./png/MPINetworkSendRecv.png

12 Non-collective design

12.1 rank = 0

  • The main network only initializes the weights; it does not train.
  • The initial weights are broadcast to all training networks.

12.2 rank != 0

  • Each training network receives the initial weights from the main network.
  • After each training batch, it sends its weight updates to the main network.

12.3 rank = 0

  • The main network receives the updates from all training networks,
  • accumulates them, and sends the result back to the training networks.

12.4 rank != 0

  • Each training network then starts its next batch (a sketch of the rank-0 step follows).
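A sketch of the rank-0 accumulation step under the same assumed gompi calls; nWorkers and nWeights are hypothetical names:

// Rank 0: receive the weight updates of every training network,
// accumulate them, and send the accumulated result back.
acc := make([]float64, nWeights)
for src := 1; src <= nWorkers; src++ {
        delta, _ := comm.RecvFloat64s(src, 0)
        for i, d := range delta {
                acc[i] += d
        }
}
for dst := 1; dst <= nWorkers; dst++ {
        comm.SendFloat64s(acc, dst, 0)
}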

13 Collective architecture

./png/MPINetworkAllreduce.png

14 Collective design

  • Each network trains on its own data shard.
  • After each training batch, all weights are packed into one array.
  • MPI_Allreduce sums this array across all ranks.
  • The weights are updated from the reduced array (see the sketch below).
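A sketch of that update. packWeights/unpackWeights are hypothetical helpers, nRanks is the world size, and the AllreduceFloat64s signature and gompi.OpSum constant are assumptions, so check the gompi documentation:

// Pack all layer weights into one flat array, sum it across all ranks,
// and average by the number of ranks before unpacking.
packed := packWeights(w1, b1, w2, b2, w3, b3)
summed := make([]float64, len(packed))
comm.AllreduceFloat64s(summed, packed, gompi.OpSum)
for i := range summed {
        summed[i] /= float64(nRanks)
}
unpackWeights(summed, w1, b1, w2, b2, w3, b3)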

15 Iris dataset performance for non-collective

15.1 Send&Recv loss

./png/iris_sendrecv_loss.png

15.2 Send&Recv accuracy

./png/iris_sendrecv_accuracy.png

16 Iris dataset performance for collective

16.1 Allreduce loss

./png/iris_allreduce_loss.png

16.2 Allreduce accuracy

./png/iris_allreduce_accuracy.png

17 Intel image classification performance

17.1 Send&Recv loss (220 images)

./png/intelImage_subset_sendrecving_loss.png

Send&Recv loss (14000 images)

./png/intelImage_sendrecv_loss.png

17.2 Allreduce loss (220 images)

./png/intelImage_subset_allreduce_loss.png

Allreduce loss (14000 images)

./png/intelImage_allreduce_loss.png

18 Speedup Diagrams

18.1 Iris, Allreduce and Send&Recv with different numbers of nodes

./png/irisSpendup.png

18.2 Intel image classification, Allreduce and Send&Recv with different numbers of nodes

./png/intelImageSpendup.png

19 Discussion

The neural network implementation is not perfect, so the accuracy is not very good.

For each epoch:

  • Allreduce: about 2 minutes
  • Send&Recv: about 3.6 minutes, due to the synchronization after each training batch

Scaling behavior when changing the number of nodes, such as further speedup diagrams, is still missing.

Increasing the batch size would reduce the MPI communication.
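For example, with the 14000-image run and batchSize=4 there are 3500 batch synchronizations per epoch; raising batchSize to 32 would cut this to about 437.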

20 Conclusion

  • Go can also be used for parallel computing.
  • The neural network implementation in Go can be improved.
  • Distributed learning on an HPC cluster brings significant benefits for large datasets.
