
MixNet (Mix_Conv) - 0.360 (0.5) BFlops - 77.0% (71.5%) Top1 #4203

Closed

CuongNguyen218 opened this issue Nov 1, 2019 · 51 comments

Comments

@CuongNguyen218 commented Nov 1, 2019

Hi @AlexeyAB,
Mix_conv: Mixed Depthwise Convolutional Kernels.
Arxiv
Github
Top1 Acc: 78.9% on ImageNet with 0.56 BFlops. I think this idea is good.

AlexeyAB added the "want enhancement" (Want to improve accuracy, speed or functionality) label Nov 1, 2019
@AlexeyAB (Owner) commented Nov 1, 2019

MixNet-L and -M have the same network architecture: we simply apply depth_multiplier 1.3 on MixNet-M to get MixNet-L, as shown in this code: https://github.com/tensorflow/tpu/blob/56e1058cba2b7b5ca233a4c9bfd7331a69082188/models/official/mnasnet/mixnet/mixnet_builder.py#L217
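
For illustration, here is a minimal sketch of how such a depth multiplier is typically applied to per-layer channel counts (plain Python; the rounding-to-a-multiple-of-8 rule and the example widths are assumptions for illustration, not values taken from the Darknet cfg files):

# Hypothetical round_filters helper: scale channels by the depth multiplier,
# then round to the nearest multiple of `divisor`, never dropping more than ~10%.
def round_filters(filters, multiplier=1.3, divisor=8):
    scaled = filters * multiplier
    new_filters = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * scaled:
        new_filters += divisor
    return int(new_filters)

for c in (16, 24, 40, 80, 120, 200):   # illustrative MixNet-M-like widths
    print(c, "->", round_filters(c))   # e.g. 16 -> 24, 24 -> 32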

These models are trained:

(attachments)

Explanation:

  • MixNet-M-GPU is a slightly optimized version of MixNet-M for GPU; it has higher BFlops but is also faster on GPU

  • MixNet-M achieves 77.0% Top1 and EfficientNet-B0 achieves 76.3% Top1 only when they are trained with a large mini_batch_size on a large cluster (DGX-2 ~400k$, or a GPU/TPU cluster ~1M$). Otherwise the official EfficientNet-B0 achieves only 70.0% Top1, which is lower than our EfficientNet-B0 at 71.3% Top1: https://github.com/WongKinYiu/CrossStagePartialNetworks#small-models (for example, GhostNet-1.0 should be trained with Batch-norm synchronization on 8 GPUs with mini_batch_size=1024)
    To achieve 77.0% Top1 with MixNet-M, use Darknet GPU-processing with CPU-RAM: Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128 #4386

  • MixNet-M-GPU has 0.532 BFlops, while Darknet shows 1.065 BFlops, i.e. 2x more. In all papers BFlops is actually FMA_BFlops (1 FMA = 2 operations: MUL + ADD) https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation - a small counting sketch follows this list.

  • Why do models with a low amount of BFLOPS still have low speed? In these models the low BFLOPS count is achieved by using grouped/depthwise convolutions, which are very slow on GPU, TPU-Edge and other devices.
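
For reference, a minimal counting sketch of the FMA vs. BFlops difference mentioned in the list above (plain Python; the example layer shape is illustrative and not taken from any cfg file):

# MACs = multiply-accumulate ops (what the EfficientNet/MixNet papers report as "FLOPS")
# FLOPs = 2 * MACs (MUL and ADD counted separately, which is what Darknet prints as BFLOPS)
def conv_ops(h_out, w_out, c_in, c_out, k, groups=1):
    macs = h_out * w_out * c_out * (c_in // groups) * k * k
    return macs, 2 * macs

macs, flops = conv_ops(h_out=112, w_out=112, c_in=32, c_out=32, k=3, groups=32)  # depthwise 3x3
print(macs / 1e9, flops / 1e9)  # FMA-BFlops vs. Darknet-style BFlops (exactly 2x)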


We replace one of the 15 layers with either (1) vanilla DepthwiseConv9x9 with kernel size 9x9; or (2) MixConv3579 with 4 groups of kernels: {3x3, 5x5, 7x7, 9x9}.
As shown in the figure, large kernel size has different impact on different layers: for most of the layers, the accuracy doesn’t change much, but for certain layers with stride 2, a larger kernel can significantly improve the accuracy. Notably, although MixConv3579 uses only half the parameters and FLOPS of the vanilla DepthwiseConv9x9, our MixConv achieves similar or slightly better performance for most of the layers.

Depthwise convolution is becoming increasingly popular in modern efficient ConvNets, but its kernel size is often overlooked. In this paper, we systematically study the impact of different kernel sizes, and observe that combining the benefits of multiple kernel sizes can lead to better accuracy and efficiency.

(figure: mixnet-flops)


For comparison with EfficientNet

(comparison figures)

@CuongNguyen218 (Author)

@AlexeyAB
As I understand it, the input tensor is split across the filters in Mix_Conv. From the cfg above, I think you assume the input has 16 channels and split it by 4, getting 4 tensors with 4 input channels each, right? But I can't understand why you used route layers -2, -4, -6. Can you ensure that the input of each conv layer follows the order [0:3] for 3x3, [4:8] for 5x5 and so on?

@gnefihs commented Nov 5, 2019

@CuongNguyen218 thanks for sharing this.

And yeah, it seems like AlexeyAB's cfg will apply the filters to the entire input tensor (like InceptionNet).

@beHappy666

Maybe the slice implementation should be called, not split @AlexeyAB

@AlexeyAB (Owner) commented Nov 7, 2019

@beHappy666

[route]
layers = -1
group_id=0
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=3
stride=2
pad=1
activation=leaky

[route]
layers = -3
group_id=1
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=5
stride=2
pad=1
activation=leaky

[route]
layers = -5
group_id=2
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=7
stride=2
pad=1
activation=leaky

[route]
layers = -7
group_id=3
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=9
stride=2
pad=1
activation=leaky

[route]
layers = -1,-3,-5,-7
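
Here each [route] with groups=4 and group_id=N takes the N-th quarter of the channels from the referenced layer, a depthwise [convolutional] with a different kernel size (3/5/7/9) processes that quarter, and the final [route] concatenates the four outputs. A rough plain-NumPy sketch of the same idea, only for illustration (weights are random stand-ins, shapes are not taken from any cfg):

import numpy as np

def depthwise_conv(x, kernels, stride=1):
    # x: (C, H, W); kernels: (C, k, k) - one kernel per channel, "same"-style padding
    C, H, W = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    ho = (H + 2 * pad - k) // stride + 1
    wo = (W + 2 * pad - k) // stride + 1
    out = np.zeros((C, ho, wo))
    for c in range(C):
        for i in range(ho):
            for j in range(wo):
                patch = xp[c, i * stride:i * stride + k, j * stride:j * stride + k]
                out[c, i, j] = np.sum(patch * kernels[c])
    return out

def mixconv(x, kernel_sizes=(3, 5, 7, 9), stride=2):
    # Split channels into one group per kernel size, run a depthwise conv on each group
    # with its own kernel size, then concatenate the outputs (like the final [route]).
    groups = np.array_split(x, len(kernel_sizes), axis=0)
    outs = []
    for g, ks in zip(groups, kernel_sizes):
        kernels = np.random.randn(g.shape[0], ks, ks) * 0.01  # stand-in weights
        outs.append(depthwise_conv(g, kernels, stride))
    return np.concatenate(outs, axis=0)

y = mixconv(np.random.randn(16, 32, 32))  # 16 channels -> 4 groups of 4; output shape (16, 16, 16)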

@AlexeyAB (Owner) commented Nov 7, 2019

I added groups= and group_id= params to the [route] layer, so you can try to implement MixNet by using such blocks: #4203 (comment)

But I didn't test it.

Commit: 0fa9c8f

@CuongNguyen218 (Author)

@AlexeyAB, how can I know that it's correct?

@dexception

@AlexeyAB
Since it is using depthwise convolutions, it is better to run it on CPU.
This must be converted to OpenVINO. We have to think about operator fusion.

@AlexeyAB (Owner)

@dexception

We have to think about operator fusion.

What is operator fusion?

@AlexeyAB (Owner) commented Nov 12, 2019

@CuongNguyen218 @dexception @beHappy666 @gnefihs @WongKinYiu @LukeAI

I implemented the MixNet-M classification network, so you can try to train it on ImageNet (see the example command after the list below).
It seems it can be fast only on CPU.

GPU nVidia RTX 2070

  • MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference

  • MixNet-M-XNOR (partially BIT-1 inference): mixnet_m_xnor.cfg.txt - 0.237 BFlops (0.118 FMA) - 5.3 sec per iteration training - 45ms inference (32 BIT-1 ops = 1 Flops)

  • MixNet-M-GPU (minor modification for GPU): mixnet_m_gpu.cfg.txt - 1.0 BFlops (0.500 FMA) - 2.7 sec per iteration training - 45 ms inference
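
If you want to try training one of these on ImageNet, the usual Darknet classifier command should work (the imagenet1k.data path is only an assumption here - point it at your own .data/.names files and the downloaded cfg):

./darknet classifier train cfg/imagenet1k.data mixnet_m_gpu.cfg

(the same applies to mixnet_m.cfg and mixnet_m_xnor.cfg; append a .weights file to resume training)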

@WongKinYiu (Collaborator) commented Nov 12, 2019

@AlexeyAB Hello,

#4203 (comment)

  • MixNet-S - 4.1M params - 0.256 BFlops - 75.8% Top1 - 92.8% Top5
  • MixNet-M - 5.0M params - 0.360 BFlops - 77.0% Top1 - 93.3% Top5

#4203 (comment)

  • MixNet-M - 0.256 BFlops - 4.6 sec per iteration training - 45ms inference
  • MixNet-M-GPU (minor modification for GPU) - 1.0 BFlops - 2.7 sec per iteration training - 45 ms inference

I'd like to know the differences between these two comments, thanks.

@AlexeyAB (Owner) commented Nov 12, 2019

@WongKinYiu

I'd like to know the differences between these two comments, thanks.

The 1st is taken from the paper.
The 2nd is the actual implementation.

Or what do you mean?

MixNet is just a more efficient (Top1/Flops) modification of EfficientNet.

@WongKinYiu (Collaborator)

Just to make sure I understand correctly:

the implemented MixNet-M is 0.256 BFLOPs, but the GPU version is 1.0 BFLOPs,
and the BFLOPs of the implemented MixNet-M is the same as MixNet-S in the paper.

I'll take a look at the cfg files after I finish my breakfast, thank you.

@AlexeyAB (Owner)

Yes, I just made some changes in MixNet-M (mixnet_m_gpu.cfg.txt) so it can be trained ~2x faster - 2.7 sec instead of 4.6 sec per training iteration with the same inference speed on GPU.
I just decreased groups= in depthwise-MixConv-layers, so it should be more accurate and faster on GPU.

Maybe we should look at Diagonalwise Refactorization (15x speedup of Depthwise Convolutions) to speed up EfficientNet and MixNet: #3908

@WongKinYiu (Collaborator) commented Nov 13, 2019

Now training mixnet_m.cfg.txt - 0.256 BFlops - 4.6 sec per iteration training - 45ms inference.
But it shows: Total BFLOPS 0.759.

(screenshot)

Update: I get cuDNN Error: CUDNN_STATUS_INTERNAL_ERROR

@AlexeyAB (Owner) commented Nov 13, 2019

@WongKinYiu Yes, I fixed it - BFLOPS 0.759 is 0.379 FMA (the EfficientNet and MixNet authors use FMA counting).

I successfully trained mixnet_m_gpu.cfg.txt for 10 000 iterations on Windows 7 x64.

@WongKinYiu (Collaborator)

@AlexeyAB thanks,

I do not know why, but on every one of my Windows computers, training models with grouped convolutions crashes.
On Ubuntu, everything works.

@AlexeyAB (Owner)

@WongKinYiu

  • How many iterations did you train before this error occurred?
  • Can you show a screenshot of this error?
  • Try to increase subdivisions.
  • What CUDA and cuDNN versions do you use?
  • Show output of
nvcc --version
nvidia-smi

@WongKinYiu (Collaborator)

@AlexeyAB

100~900 iterations.
CUDA 10
(screenshot)

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018
Cuda compilation tools, release 10.0, V10.0.130

Windows does not have nvidia-smi.

@AlexeyAB (Owner)

@WongKinYiu

This is a very strange error: why is it trying to create another cuDNN handle when one has already been created?

Windows does not have nvidia-smi.

It should be in C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi
nvidia-smi.zip

Do you use the latest version of Darknet?
If you set subdivisions=8 does it help?

@WongKinYiu (Collaborator)

Yes, I use the latest version.

(screenshot)

@AlexeyAB (Owner)

@WongKinYiu

  • Your nvcc --version shows CUDA 10.0, while nvidia-smi shows CUDA 10.1 - maybe this is the reason.
  • Also, some users encountered errors when using CUDA 10.1.

@WongKinYiu (Collaborator)

Yes, I noticed that nvidia-smi shows CUDA version 10.1.
It is really strange.
When I installed CUDA, CUDA 10.1 had not been released yet.

@AlexeyAB (Owner)

Or just try to use a newer cuDNN version.

@CuongNguyen218 (Author)

@WongKinYiu,
Can you give me a link to the CIoU and DIoU papers?

@WongKinYiu (Collaborator)

@CuongNguyen218

Here you are: #4360

@CuongNguyen218 (Author)

@AlexeyAB,
Did you provide an EfficientNet model, or did you convert an EfficientNet model pretrained on ImageNet to Darknet?

@WongKinYiu (Collaborator)

@CuongNguyen218

ImageNet and COCO models of EfficientNet-B0: #3874 (comment)

@CuongNguyen218 (Author)

@AlexeyAB, what result did you get?

@WongKinYiu (Collaborator)

@AlexeyAB

mixnet-m-gpu, top-1 = 71.5%, top-5 = 90.5%.

@AlexeyAB (Owner)

@WongKinYiu Nice! Can you share the weights file?

@CuongNguyen218 (Author)

Why are your results so different from the paper?

@WongKinYiu (Collaborator)

Because mixnet-m-gpu is designed by @AlexeyAB; it does not appear in the paper.

@CuongNguyen218 (Author) commented Dec 11, 2019 via email

@WongKinYiu (Collaborator)

@AlexeyAB https://drive.google.com/open?id=1SOLd3eXHwcLkvwFgdiui6uL3-_rWWB1E

@AlexeyAB (Owner) commented Dec 11, 2019

@CuongNguyen218

Why are your results so different from the paper?

Because in the paper MixNet and EfficientNet are trained with a very large mini_batch_size on a DGX-2 / cluster (~400k$ - 1M$).
You can achieve the same accuracy, 77.0% Top1, by using Darknet with #4386

If we train with the same mini_batch_size, then EfficientNet-B0 (official) has even lower Top1/5 accuracy than my EfficientNet-B0: https://github.com/WongKinYiu/CrossStagePartialNetworks#small-models

Also, I slightly optimized MixNet on GPU so that it can be trained in 1 month instead of 2 months.

AlexeyAB removed the "want enhancement" (Want to improve accuracy, speed or functionality) label Dec 11, 2019
AlexeyAB changed the title from "Mix_Conv" to "Mix_Conv - 0.360 BFlops - 77.0% (71.5%) Top1" Dec 11, 2019
AlexeyAB changed the title from "Mix_Conv - 0.360 BFlops - 77.0% (71.5%) Top1" to "Mix_Conv - 0.360 (0.5) BFlops - 77.0% (71.5%) Top1" Dec 11, 2019
AlexeyAB changed the title from "Mix_Conv - 0.360 (0.5) BFlops - 77.0% (71.5%) Top1" to "MixNet (Mix_Conv) - 0.360 (0.5) BFlops - 77.0% (71.5%) Top1" Dec 11, 2019
@AlexeyAB (Owner)

@CuongNguyen218 If you want, you can train the original MixNet-M on ImageNet: #4203 (comment)

MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference

https://github.com/AlexeyAB/darknet/files/3838329/mixnet_m.cfg.txt

@glenn-jocher commented Mar 12, 2020

@AlexeyAB I just started looking into MixConvs. They seem very interesting! Do you know of anywhere they are applied to object detection, or are they only used for classification?

EfficientDet was published in November 2019, while MixConv was published in July 2019, so the EfficientDet authors must have been aware of this type of convolution but, I'm thinking, neglected to use it for some reason.

@AlexeyAB (Owner)

@glenn-jocher

The same authors wrote all three articles: MixNet, EfficientNet, EfficientDet.

  • EfficientNet uses Grouped-Conv
  • MixNet uses Grouped-Conv with different kernel_size

Neither EfficientNet nor MixNet is optimal for the current CPUs/GPUs/neuro-chips (MyriadX, Coral TPU-Edge).

So they make such networks as reference networks to help create new neurochips (a new version of TPU-Edge).

So maybe the reason they don't use MixNet for the detector is that creating a neurochip for EfficientNet (grouped conv) is much easier than for MixNet (grouped conv with different kernel_size).

Also, MixNet may have lower BFlops but still be slower.

@glenn-jocher

@AlexeyAB Ah I see, that's an interesting approach. Yes, it seems like all of these new grouped-convolution techniques run quite slowly on hardware, despite the lower parameter count.

@minhaj3 commented May 3, 2020

Hi @AlexeyAB, I am trying to run inference on the MixNet model using your config and the pretrained weights mentioned at the start of the thread, but I am getting the error: "Error: in the file data/coco.names number of names 80 that isn't equal to classes=0 in the file cfg/mixnet_m_gpu.cfg". The number of classes is not mentioned in the config file, but this error says so. And even if it implies that it was trained on a different number of classes, it still does not make sense to have 0 classes in a config file. Am I missing something here? Can someone help me out?

I tried running it on Ubuntu 18.04 with the command: "./darknet detector test cfg/coco.data cfg/mixnet_m_gpu.cfg mixnet_m_gpu_final.weights -ext_output data/dog.jpg"
