
Report of the Practical Course on High-Performance Computing

1 Project notes

1.1 YouTube link

Executing the project on a local computer: https://www.youtube.com/watch?v=2siZQBvRPuY&t=6s

git clone https://github.com/scofild429/go_mpi_network.git
cd go_mpi_network/goai
# uncomment one of the cases in the main function of myai.go
go build
./goai

2 Datasets

All training data are divided equally among the training networks, which is especially important for the MPI versions.
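A minimal sketch of this even split; the names are hypothetical, with rank and size coming from the MPI communicator and data being the full training set:

// shard returns the part of the training data owned by this rank,
// splitting the samples as evenly as possible across all ranks.
func shard(data [][]float64, rank, size int) [][]float64 {
        per := len(data) / size
        lo := rank * per
        hi := lo + per
        if rank == size-1 {
                hi = len(data) // the last rank takes any remainder
        }
        return data[lo:hi]
}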

3 Configuration example

  • ./goai/.irisenv
  • ./goai/.imgenv

For example, .irisenv (4 input features, 3 classes):
inputdataDims=4
inputLayerNeurons=30
hiddenLayerNeurons=20
outputLayerNeurons=3
labelOnehotDims=3
numEpochs=100
learningRate=0.01
batchSize=4
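A sketch of reading such an env file in Go; this assumes the github.com/joho/godotenv package and is not necessarily how the repository loads its configuration:

package main

import (
        "fmt"
        "os"
        "strconv"

        "github.com/joho/godotenv"
)

func main() {
        // Load the key=value pairs from .irisenv into the process environment.
        if err := godotenv.Load(".irisenv"); err != nil {
                panic(err)
        }
        numEpochs, _ := strconv.Atoi(os.Getenv("numEpochs"))
        learningRate, _ := strconv.ParseFloat(os.Getenv("learningRate"), 64)
        fmt.Println("epochs:", numEpochs, "learning rate:", learningRate)
}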

4 Submitting the job on the cluster

Without Singularity, installing Go 1.18 on the cluster always failed.

Instead, the project is built locally with go build and the resulting goai binary is transferred to the cluster.

#!/bin/bash
#SBATCH --job-name mpi-go-neural-network
#SBATCH -N 1
#SBATCH -p fat
#SBATCH -n 20
#SBATCH --time=01:30:00

module purge
module load openmpi

mpirun -n 20 ./goai
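Assuming the script is saved as job.sh (the filename is hypothetical), it is submitted and monitored with the usual Slurm commands:

sbatch job.sh        # submit the job
squeue -u $USER      # check its queue status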

5 Deep learning’s problem

As AI moves to deep learning, computing resources become increasingly critical for the training process.

Applications:

  • Image Classification
  • NLP
  • Semantic segmentation

Solutions:

  • GPU
  • TPU
  • Distributed learning

6 Single network architecture

Training data -> inputLayer(w1, b1) -> dinputLayer
Normalization
dinputLayer -> hiddenLayer(w2, b2) -> dhiddenLayer
Normalization
dhiddenLayer -> outputLayer(w3, b3) -> doutputLayer

Loss (L2) = (doutputLayer - onehotLabel)^2

Backpropagation from the loss  of the output layer to w3, b3
Backpropagation from the error of the hidden layer to w2, b2
Backpropagation from the error of the input layer  to w1, b1

Uses the derivative of the sigmoid, normalization, and standardization (see the sketch after the following list).

  • Stochastic Gradient Descent (SGD)
  • Mini-batch Gradient Descent (MBGD)
  • Batch Gradient Descent (BGD)
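A minimal sketch of one dense layer with a sigmoid activation, plus the sigmoid derivative used in backpropagation. This is plain Go with hypothetical names, not the repository's actual implementation:

package main

import (
        "fmt"
        "math"
)

// sigmoid squashes x into (0, 1).
func sigmoid(x float64) float64 {
        return 1.0 / (1.0 + math.Exp(-x))
}

// sigmoidPrime is the derivative of the sigmoid, expressed
// through its own output a = sigmoid(x): a * (1 - a).
func sigmoidPrime(a float64) float64 {
        return a * (1.0 - a)
}

// forward computes sigmoid(W*x + b) for one dense layer.
func forward(w [][]float64, b, x []float64) []float64 {
        out := make([]float64, len(w))
        for i, row := range w {
                sum := b[i]
                for j, wij := range row {
                        sum += wij * x[j]
                }
                out[i] = sigmoid(sum)
        }
        return out
}

func main() {
        w := [][]float64{{0.1, 0.2}, {0.3, 0.4}}
        b := []float64{0.0, 0.1}
        x := []float64{1.0, 2.0}
        a := forward(w, b, x)
        fmt.Println("activations:", a)
        fmt.Println("derivative at a[0]:", sigmoidPrime(a[0]))
}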

7 Illustration of weight updating

./png/NeuralNetwork.png

8 Code implementation

func main() {
        // Choose exactly one of the following cases; comment out the rest.
        singlenode.Single_node_iris(true)
        mpicode.Mpi_iris_Allreduce()
        mpicode.Mpi_iris_SendRecv()
        mpicode.Mpi_images_Allreduce()
        mpicode.Mpi_images_SendRecv()
}

Review the code and choose one of these cases to execute in the main function of ./goai/myai.go.

For comparison with Python:

  • ./pytorchDemo/irisfromscratch.py
  • ./pytorchDemo/iriswithpytorch.py
  • ./pytorchDemo/logisticRcuda.py

9 Network performance (iris dataset)

9.1 Loss

./png/single_node_loss.png

9.2 Accuracy

./png/single_node_acc.png

10 MPI communication

The MPI bindings come from github.com/sbromberger/gompi, which wraps the C MPI API via cgo (import "C"). The mapping used here (a usage sketch follows this list):

  • Collective
    • gompi.BcastFloat64s() -> C.MPI_Bcast()
    • gompi.AllreduceFloat64s() -> C.MPI_Allreduce()
  • Non-collective
    • gompi.SendFloat64s() -> C.MPI_Send()
    • gompi.SendFloat64() -> C.MPI_Send()
    • gompi.RecvFloat64s() -> C.MPI_Recv()
    • gompi.RecvFloat64() -> C.MPI_Recv()
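A minimal usage sketch. gompi.Start/Stop and NewCommunicator follow the gompi README, but the exact signatures of the typed send/receive calls here are assumptions based on the mapping above, so consult the gompi documentation:

package main

import (
        "fmt"

        "github.com/sbromberger/gompi"
)

func main() {
        gompi.Start(false) // initialize MPI without thread support
        defer gompi.Stop()

        comm := gompi.NewCommunicator(nil) // nil -> MPI_COMM_WORLD
        rank := comm.Rank()

        if rank == 0 {
                weights := []float64{0.1, 0.2, 0.3}
                comm.SendFloat64s(weights, 1, 0) // to rank 1, tag 0
        } else if rank == 1 {
                weights, _ := comm.RecvFloat64s(0, 0) // from rank 0, tag 0
                fmt.Println("rank 1 received", weights)
        }
}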

11 Non-collective architecture

./png/MPINetworkSendRecv.png

12 Non-collective design

12.1 rank = 0

  • The main network only initializes the weights; it does not train.
  • The initial weights are broadcast to all training networks.

12.2 rank != 0

  • Each training network receives the initial weights from the main network.
  • After each training batch, it sends its weight updates to the main network.

12.3 rank = 0

  • The main network receives the updates from all training networks,
  • accumulates them, and sends the result back to the training networks.

12.4 rank != 0

  • Each training network then starts its next batch (a sketch of the rank-0 step follows).
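A sketch of the rank-0 accumulation step under the same assumed gompi calls; nWorkers and nWeights are hypothetical names:

// Rank 0: receive the weight updates of every training network,
// accumulate them, and send the accumulated result back.
acc := make([]float64, nWeights)
for src := 1; src <= nWorkers; src++ {
        delta, _ := comm.RecvFloat64s(src, 0)
        for i, d := range delta {
                acc[i] += d
        }
}
for dst := 1; dst <= nWorkers; dst++ {
        comm.SendFloat64s(acc, dst, 0)
}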

13 Collective architecture

./png/MPINetworkAllreduce.png

14 Collective design

  • Each network trains on its own data shard.
  • After each training batch, all weights are packed into one array.
  • MPI_Allreduce sums this array across all ranks.
  • The weights are updated from the reduced array (see the sketch below).
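A sketch of that update. packWeights/unpackWeights are hypothetical helpers, nRanks is the world size, and the AllreduceFloat64s signature and gompi.OpSum constant are assumptions, so check the gompi documentation:

// Pack all layer weights into one flat array, sum it across all ranks,
// and average by the number of ranks before unpacking.
packed := packWeights(w1, b1, w2, b2, w3, b3)
summed := make([]float64, len(packed))
comm.AllreduceFloat64s(summed, packed, gompi.OpSum)
for i := range summed {
        summed[i] /= float64(nRanks)
}
unpackWeights(summed, w1, b1, w2, b2, w3, b3)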

15 Iris dataset performance for non-collective

15.1 Send&Recv loss

./png/iris_sendrecv_loss.png

15.2 Send&Recv accuracy

./png/iris_sendrecv_accuracy.png

16 Iris dataset performance for collective

16.1 Allreduce loss

./png/iris_allreduce_loss.png

16.2 Allreduce accuracy

./png/iris_allreduce_accuracy.png

17 Intel image classification performance

17.1 Send&Recv loss (220 images)

./png/intelImage_subset_sendrecving_loss.png

Send&Recv loss (14000 images)

./png/intelImage_sendrecv_loss.png

17.2 Allreduce loss (220 images)

./png/intelImage_subset_allreduce_loss.png

Allreduce loss (14000 images)

./png/intelImage_allreduce_loss.png

18 Speedup Diagrams

18.1 Iris, Allreduce and Send&Recv with different numbers of nodes

./png/irisSpendup.png

18.2 Intel image classification, Allreduce and Send&Recv with different numbers of nodes

./png/intelImageSpendup.png

19 Discussion

The neural network implementation is not perfect, so the accuracy is not very good.

For each epoch:

  • Allreduce: about 2 minutes
  • Send&Recv: about 3.6 minutes, due to the synchronization after each training batch

Scaling behavior when changing the number of nodes, such as further speedup diagrams, is still missing.

Increasing the batch size would reduce the MPI communication.
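For example, with the 14000-image run and batchSize=4 there are 3500 batch synchronizations per epoch; raising batchSize to 32 would cut this to about 437.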

20 Conclusion

  • Go can also be used for parallel computing.
  • The neural network implementation in Go can be improved.
  • Distributed learning on an HPC cluster brings significant benefits for large datasets.
