different training results #22

Closed
xiao1228 opened this issue Oct 4, 2018 · 37 comments

@xiao1228 commented Oct 4, 2018

Hi,
I started training YOLOv3 on 1 GPU without changing your code, and I got the graphs below, which are all slightly different from your results. The shapes are roughly the same, but the values are in a different range. I am a bit confused. It would be great if you could point me in the right direction, thank you!

[training results plot]

@xiao1228 (author) commented Oct 5, 2018

After training for 41 epochs, I still only get a mean average precision (mAP) of 0.1177 on the COCO test set.
Am I missing anything in the code?

@JegernOUTT

I get the same result with the latest commits: very low recall, while precision is fine.
Maybe the reason is the unimplemented loss feature:
"TODO: Additional works needs to be done to ignore non-best anchors > 0.50 iou."
or the objectness loss term has become too small compared to the other loss terms?

After 200 epochs on a small test dataset with 3 classes I got 0.07 recall and nearly 0.97 precision.
Predictions are almost fine, but with very low confidence.

PS: Thank you for your work 👍

@xiao1228 (author) commented Oct 5, 2018

Also, during training TP is always 0 (or occasionally 1), while FP starts at some nonzero value, see below:
[screenshot]
After a while it becomes 0 as well:
[screenshot]

@glenn-jocher (member) commented Oct 5, 2018

@xiao1228 @JegernOUTT sorry guys, I'm still trying to figure out the exact loss terms to use. Small changes have huge effects. Some of the options that need testing are:

  1. CE vs BCE for the classification loss.
  2. Size averaging vs not size averaging all loss terms.
  3. Splitting confidence into obj and noobj, or not splitting (this is what's causing the very low TPs). A sketch of the split version follows the code below.

The main region of the code affected is small. My main strategy is to resume training from the official yolov3.pt weights and look for the loss terms that produce the best mAP after 1 epoch. In the latest commit b7d0397 the mAP after 1 epoch of resumed training is about 0.50, down from 0.57 with the official weights, so something is probably still not right.

yolov3/models.py

Lines 160 to 182 in b7d0397

nT = sum([len(x) for x in targets])  # number of targets
nM = mask.sum().float()  # number of anchors (assigned to targets)
nB = len(targets)  # batch size
k = nM / nB
if nM > 0:
    lx = k * MSELoss(x[mask], tx[mask])
    ly = k * MSELoss(y[mask], ty[mask])
    lw = k * MSELoss(w[mask], tw[mask])
    lh = k * MSELoss(h[mask], th[mask])
    # lconf = k * BCEWithLogitsLoss(pred_conf[mask], mask[mask].float())
    lconf = k * BCEWithLogitsLoss(pred_conf, mask.float())
    # lcls = k * CrossEntropyLoss(pred_cls[mask], torch.argmax(tcls, 1))
    lcls = k * BCEWithLogitsLoss(pred_cls[mask], tcls.float())
else:
    lx, ly, lw, lh, lcls, lconf = FT([0]), FT([0]), FT([0]), FT([0]), FT([0]), FT([0])

# Add confidence loss for background anchors (noobj)
# lconf += k * BCEWithLogitsLoss(pred_conf[~mask], mask[~mask].float())

# Sum loss components
loss = lx + ly + lw + lh + lconf + lcls
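
For reference, option 3 (splitting the confidence loss) could look roughly like the sketch below. This is only an illustration assembled from the commented-out lines in the snippet above, reusing its variable names (pred_conf, mask, k, BCEWithLogitsLoss); lambda_noobj is a hypothetical weighting constant, not something currently in the repo.

lambda_noobj = 0.5  # hypothetical down-weighting of background anchors
lconf_obj = k * BCEWithLogitsLoss(pred_conf[mask], mask[mask].float())      # anchors assigned to targets
lconf_noobj = k * BCEWithLogitsLoss(pred_conf[~mask], mask[~mask].float())  # background anchors
lconf = lconf_obj + lambda_noobj * lconf_noobj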

@xiao1228 (author) commented Oct 5, 2018

@glenn-jocher Thank you Glenn. Yes, I resumed from yolov3.pt, and after 3 epochs the precision and recall are 0.381 and 0.458.

@glenn-jocher (member) commented Oct 6, 2018

Ok I'm going to document my test results here. These are the mAPs after 1 epoch of yolov3.pt resumed training. All mAPs are as produced by test.py:

  • Official yolov3.pt mAP: 0.57
  • b7d0397 baseline mAP: 0.50
  • b7d0397 + CE lconf: 0.51
  • UPDATE: b7d0397 + CE lconf + batch_size 16: 0.5191
  • UPDATE: b7d0397 + weighted CE lconf: 0.4980
  • b7d0397 + size_average = False: 0.47
  • b7d0397 + lconf split into obj and noobj @ 0.50 conf_thres: 0.23
  • b7d0397 + lconf split into obj and noobj @ 0.99 conf_thres (I've noticed splitting usually requires much higher thresholds here): 0.45
  • UPDATE: b7d0397 + augmentation disabled: 0.50
  • UPDATE: b7d0397 + batch_size 4: 0.4588
  • UPDATE: b7d0397 + batch_size 16: 0.5126
  • UPDATE: b7d0397 + batch_size 16x4: 0.5118
  • UPDATE: b7d0397 + MSE sigmoid w and h loss: 4785
  • UPDATE: b7d0397 + MSE exp w and h loss: 0.5073
  • UPDATE: b7d0397 + trained for 2 epochs: ?

... further tests?

@xiao1228 (author) commented Oct 6, 2018

@glenn-jocher thanks for the update! So that means the CE classification loss gives a slightly better mAP compared to the rest of the changes?

@glenn-jocher (member)

CE means nn.CrossEntropyLoss() for lcls, and BCE means nn.BCEWithLogitsLoss() for lcls. These two options are here:

yolov3/models.py

Lines 173 to 174 in b7d0397

# lcls = k * CrossEntropyLoss(pred_cls[mask], torch.argmax(tcls, 1))
lcls = k * BCEWithLogitsLoss(pred_cls[mask], tcls.float())

The problem, though, is that none of the changes I've tried retains the 0.57 mAP from the start of the resumed epoch. I'm not sure what to do; any ideas are welcome.
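
To make the CE vs BCE difference concrete, here is a minimal, self-contained example (generic PyTorch, nothing repo-specific) showing the target format each loss expects:

import torch
import torch.nn as nn
import torch.nn.functional as F

pred_cls = torch.randn(8, 80)                    # logits for 8 masked anchors, 80 classes
tcls_idx = torch.randint(0, 80, (8,))            # integer class indices
tcls_onehot = F.one_hot(tcls_idx, 80).float()    # one-hot float targets

lcls_ce = nn.CrossEntropyLoss()(pred_cls, tcls_idx)       # CE: expects class indices
lcls_bce = nn.BCEWithLogitsLoss()(pred_cls, tcls_onehot)  # BCE: expects one-hot floats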

@ecr23xx commented Oct 7, 2018

Sorry, I was testing on yolov3.pt instead of latest.pt. On latest.pt I also get very low mAP 😞

@glenn-jocher I pulled your latest commit and got 56.67 mAP after 1 epoch of resumed training from yolov3.pt.

Here is the Tensorboard information I recorded.

@ydixon commented Oct 7, 2018

@glenn-jocher No luck here either. So far I can get the model to overfit the training set without augmentations, but as soon as I use augmentations the loss gets stuck at some point.

Btw, I also found that the augmentations you use seem to be different from the C version, so it's very possible the model converges differently when trained from the original YOLOv3 weights.

@glenn-jocher (member)

@ydixon that's interesting, I'll try disabling the HSV augmentation and then disabling the spatial augmentation as two additional tests. You are right, this could have a significant impact. I'm also going to try a larger batch size (16 vs 12). Darknet may be training with a much larger batch size (64?).

@ECer23 thanks a lot for the plots! I'm going to plot the same on my end for one resumed epoch. I wish your results were correct, but I think you might accidentally be testing the same yolov3.pt to get that 56.67 mAP. After you resume for one epoch the new weights are saved in latest.pt (not yolov3.pt). The code I've been using to do these tests is here:

sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
cd yolov3/checkpoints
wget https://storage.googleapis.com/ultralytics/yolov3.pt
cp yolov3.pt latest.pt
cd ..
python3 train.py -img_size 416 -batch_size 12 -epochs 1 -resume 1
python3 test.py -img_size 416 -weights_path checkpoints/latest.pt -conf_thres 0.5

@ydixon commented Oct 7, 2018

@glenn-jocher From this cfg, batch=64, subdivision=16. Therefore, real batch size = 64 / 16 = 4.

@xiao1228 (author) commented Oct 8, 2018

Hi @glenn-jocher, I tried using batch size 16 and trained for 40+ epochs; the mAP is still 0.1083.
The training plot is shown below; it is very low compared to what you get.
[training results plot]

@glenn-jocher (member)

@ydixon so I'm assuming they accumulate the gradient for the 64 images and then update the optimizer only once at the end of the 64 images (with the subdivisions only serving to reduce the memory requirements)?

@xiao1228 the shape of those plots looks good but the rate of change of P and R is painfully slow. If we didn't divide k by nB that would make your gradient 16x steeper (same as multiplying the learning rate by 16), could you try that for a few epochs?

k = nM / nB

I resumed b7d0397 again for 1 epoch and get these loss plots over the epoch. If the model were perfectly aligned with darknet I think we'd expect to see the losses stay pretty consistent, but instead they drop over the epoch, especially the width and height terms. mAP at end is the same 0.5015 I saw before.
[loss plots over the resumed epoch]

@xiao1228 (author) commented Oct 8, 2018

@glenn-jocher the loss becomes NaN in the first epoch when using k = nM:
[screenshot]

@ydixon commented Oct 9, 2018

@glenn-jocher That's what I thought too; it seems to make the most sense. However, after asking the authors, they insist that it's updated per minibatch.

@glenn-jocher (member) commented Oct 9, 2018

@ydixon that's strange. Wouldn't that be equivalent to batch size 4 with no minibatches? I'll test it out both ways. I tested batch size 16 and the resumed mAP increases from 0.50 (bs 12) to 0.5126 (bs 16). It's possible an effective batch size of 64 would make a big difference.

@xiao1228 ok yes I was afraid that might happen. The parameter updates become too large and the training becomes unstable.

I've been updating #22 (comment) with my test results. Positive results are that switching lconf from BCE to CE and increasing batch size from 12 to 16 increase resumed mAP by 1 point each. I'll keep trying new tests and then implement all the changes that show increased mAP into a commit.

@xiao1228 (author)

@glenn-jocher after 70+ epochs I can see the precision and recall flattening out, but the mAP is still only 0.1961.
[training results plot]

@glenn-jocher (member) commented Oct 10, 2018

@xiao1228 thanks for the update! Yes, your plots make sense: the LR scheduler multiplies the initial 1e-3 lr by 0.1 at epochs 54 and 61 (to match the yolov3.cfg settings), which assumes a total training time of 68 epochs. From your plots, though, it seems far too soon to drop the LR at epoch 54, as P and R are still increasing linearly ...

yolov3/train.py

Lines 104 to 112 in d748bed

# Update scheduler (manual)
if epoch < 54:
    lr = 1e-3
elif epoch < 61:
    lr = 1e-4
else:
    lr = 1e-5
for g in optimizer.param_groups:
    g['lr'] = lr

You could also try varying -conf_thres and -nms_thres during testing. In the past I've seen -conf_thres be very sensitive to the exact loss equation. Perhaps vary -conf_thres from 0.4 to 0.9 in steps of 0.1, and -nms_thres from 0.2 to 0.6, also in steps of 0.1 (i.e. two 1-D sweeps about the default point); see the sketch below.
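
A quick way to run that sweep (illustrative only; it assumes test.py accepts the -img_size, -weights_path, -conf_thres and -nms_thres flags shown earlier in this thread and prints its mAP to stdout):

import subprocess

for conf in [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    subprocess.run(['python3', 'test.py', '-img_size', '416',
                    '-weights_path', 'checkpoints/latest.pt',
                    '-conf_thres', str(conf)], check=True)

for nms in [0.2, 0.3, 0.4, 0.5, 0.6]:
    subprocess.run(['python3', 'test.py', '-img_size', '416',
                    '-weights_path', 'checkpoints/latest.pt',
                    '-nms_thres', str(nms)], check=True)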

My tests are not revealing any breakthroughs, unfortunately. All I have so far is what I mentioned before: increasing batch size to 16 and using CE for lconf each show a 1-point mAP increase. Batch size 16x4 = 64 did not improve mAP over bs 16.

I have one big change to test also, which is ignoring non-best anchors with >0.5 IOU. This is a little tricky to implement but I should have it soon. It is explicitly stated in the paper, so it could have an impact.

@glenn-jocher (member) commented Oct 10, 2018

@xiao1228 @ydixon I just realized something important. When I used this repo for the xView challenge (https://github.com/ultralytics/xview-yolov3) I saw a vast improvement in performance when using weighted CE for lcls. So I harvested the COCO train2014 class counts and placed them into the latest commit to test this option. I think this will "break" resumed training since darknet does not use weights (I think), but it may create a big boost in performance when training from scratch, if my xView experience serves me well. The class counts are here; you can see they differ by 3 orders of magnitude. If the low mAPs we see are primarily due to the rare classes, this may boost our mAP.

yolov3/utils/utils.py

Lines 33 to 41 in f79e7ff

def class_weights():  # frequency of each class in coco train2014
    weights = 1 / torch.FloatTensor(
        [187437, 4955, 30920, 6033, 3838, 4332, 3160, 7051, 7677, 9167, 1316, 1372, 833, 6757, 7355, 3302, 3776, 4671,
         6769, 5706, 3908, 903, 3686, 3596, 6200, 7920, 8779, 4505, 4272, 1862, 4698, 1962, 4403, 6659, 2402, 2689,
         4012, 4175, 3411, 17048, 5637, 14553, 3923, 5539, 4289, 10084, 7018, 4314, 3099, 4638, 4939, 5543, 2038, 4004,
         5053, 4578, 27292, 4113, 5931, 2905, 11174, 2873, 4036, 3415, 1517, 4122, 1980, 4464, 1190, 2302, 156, 3933,
         1877, 17630, 4337, 4624, 1075, 3468, 135, 1380])
    weights /= weights.sum()
    return weights
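
For anyone who wants to try it locally, wiring those weights into the classification loss would look roughly like this (a sketch only; exactly where and how the commit applies it may differ):

import torch.nn as nn
from utils.utils import class_weights  # the helper quoted above

weights = class_weights()  # normalized inverse-frequency weights, 80 classes
CrossEntropyLoss = nn.CrossEntropyLoss(weight=weights.cuda())  # move weights to the same device as the logits
# lcls = k * CrossEntropyLoss(pred_cls[mask], torch.argmax(tcls, 1))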

@xiao1228 (author)

@glenn-jocher thank you very much for this update. Have you tried training with the new weighted CE? Also, I will keep the lr at 1e-3 for the entire training then.
One thing I also noticed is that a lot of the computation is done on the CPU when training the model, so CPU usage is very high during training as well.

@glenn-jocher (member)

@xiao1228 yes, most of build_targets() runs on the CPU, as does 100% of utils/datasets.py. I used the Spyder line profiler to speed up the code as much as I could, but moving variables between CPU and GPU in models.py consumes a lot of time, as does loading JPEGs and doing the image augmentation in datasets.py. If you find faster workarounds let me know and we'll definitely update the code.

I resume-trained one epoch with the weighted CE, and the mAP came out lower :( Training from scratch with weighted CE may still be better though, as I see that the two lowest-count categories each have 0.0 mAP (the latest test.py additionally produces a per-class mAP).

@okanlv commented Oct 10, 2018

@glenn-jocher @ydixon @xiao1228 Training depends on batch size and subdivisions as follows:
The gradient is calculated for every batch (64) and accumulated subdivisions (16) times. After that, the parameters are updated. In practice, you either need to accumulate the gradients (lower resource requirement) or take the batch size as 64x16 and do the gradient update afterwards.
Reference : https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/network.c#L289-L298

@glenn-jocher (member)

@okanlv I see your link. Ok I'll try and follow your logic here. If I use the following values

yolov3/cfg/yolov3.cfg

Lines 6 to 7 in d336e00

batch=16
subdivisions=1

with the darknet equation
if(((*net->seen)/net->batch)%net->subdivisions == 0) update_network(net);

then this would be: if (seen/16) % 1 == 0 then update, i.e. update every 16 images. OK, so this should be equivalent to -batch_size 16 in train.py. I'll set this as the new default. Only people with 16 GB cards will likely be able to do this though (a P100 can do it). Setting torch.backends.cudnn.benchmark = False in train.py will free up GPU memory, but slow things down by about 15%.

@xiao1228 (author)

@glenn-jocher I tried training from scratch again with the new weighted CE, but after 10 epochs the trend looks very similar to the previous run :(

@glenn-jocher (member) commented Oct 11, 2018

@xiao1228 I just made a new commit 24a4197 which switches lconf to CE (unweighted) and increases the default batch size to 16. These two changes produce a resumed mAP of 0.5191 with conf_thres = 0.5.

BUT I noticed that dropping conf_thres has a huge impact on mAP; as I said before, it is worth varying the tunable parameters to see which produce the best mAP. For example, if I set conf_thres = 0.3 I get 0.5522 resumed mAP, and with conf_thres = 0.2 I get 0.5548.

If you have time, I would explore conf_thres and nms_thres variations in test.py to find the best balance. It could be that, since we are not perfectly in sync with darknet, other settings perform better here.

@ydixon commented Oct 15, 2018

@okanlv @glenn-jocher
Thanks for clarifying. I probably misunderstood the authors' responses.
So basically:
net.batch = batch_cfg / subdivisions_cfg

https://github.com/pjreddie/darknet/blob/b13f67bfdd87434e141af532cdb5dc1b8369aa3b/src/parser.c#L657-L667

For example:

batch_cfg=64, subdivision_cfg=16
net.batch = batch_cfg / subdivision_cfg = 4
net.subdivisions = 16

Case 1: net.seen = 124

(net.seen/net.batch) % net.subdivisions = (124/4) % 16 = 15
Do not update

Case 2: net.seen = 128

(net.seen/net.batch) % net.subdivisions = (128/4) % 16 = 0
Update

Also refer to AlexeyAB/darknet#1736 and pjreddie/darknet#224
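
In Python terms, the update condition works out to this small (hypothetical) helper:

def should_update(seen, batch_cfg=64, subdivisions_cfg=16):
    # Mirrors darknet: net.batch = batch_cfg / subdivisions_cfg, and the network
    # is updated when (seen / net.batch) % net.subdivisions == 0
    net_batch = batch_cfg // subdivisions_cfg  # 4
    return (seen // net_batch) % subdivisions_cfg == 0

print(should_update(124))  # False -> keep accumulating
print(should_update(128))  # True  -> update_network()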

@okanlv commented Oct 16, 2018

@ydixon You are right. I had totally forgotten the modification in the parser. So, to correct my previous answer:
Training depends on batch size and subdivisions as follows:
The gradient is calculated for every batch (64/16 = 4 images) and accumulated subdivisions (16) times. After that, the parameters are updated. In practice, you either need to accumulate the gradients (lower resource requirement) or take the batch size as 64 and do the gradient update afterwards.

@glenn-jocher
For batch=16 and subdivisions=1 in the config file, you should take batch_size as 16 and do the parameter update afterwards. However, the batch size is taken as 64 in the original code. Btw, you should accumulate the gradients and then update the parameters in order to fit the larger effective batch size into memory. Let's say you want to use batch_size=64 during training but you can only fit 16 images into GPU memory: define batch_size=16 for the dataloader, then call optimizer.step() and optimizer.zero_grad() only when i % 4 == 0 in the following for loop.

yolov3/train.py

Line 118 in d336e00

for i, (imgs, targets) in enumerate(dataloader):

You have to make one more modification: divide the loss by 4 before calling loss.backward(), to average the loss over the accumulated minibatches.
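
A minimal sketch of that recipe around the loop above (illustrative; it assumes the model returns the per-batch training loss when called with targets):

accumulate = 4  # 16-image minibatches x 4 = effective batch size 64

optimizer.zero_grad()
for i, (imgs, targets) in enumerate(dataloader):
    loss = model(imgs.cuda(), targets)   # assumed training-mode call returning the summed loss
    (loss / accumulate).backward()       # divide so the accumulated gradient averages over 64 images
    if (i + 1) % accumulate == 0 or i == len(dataloader) - 1:
        optimizer.step()
        optimizer.zero_grad()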

@glenn-jocher (member) commented Oct 18, 2018

@ydixon @okanlv the latest commit from a few days ago actually already includes code for accumulating gradients. I have not tested it from scratch, but resuming training with batch size 64 vs 16 did not show a big effect after one epoch :(

This might not be the most elegant implementation (any better ideas are welcome), but if you uncomment this if statement and place optimizer.step() and optimizer.zero_grad() inside it, then the model will only update once every accumulated_batches. With the default settings (batch_size = 16 and accumulated_batches = 4), the effective batch size of each update is 64. I'm not sure you should divide by 4 though; the learning rate already seems painfully small, and this would make each update even smaller.

yolov3/train.py

Lines 132 to 135 in 05f28ab

# accumulated_batches = 4 # accumulate gradient for 4 batches before stepping optimizer
# if ((i+1) % accumulated_batches == 0) or (i == len(dataloader) - 1):
optimizer.step()
optimizer.zero_grad()

@okanlv are you saying that the latest darknet code does not load batch=16 and subdivisions=1 from yolov3.cfg?

@ydixon commented Oct 19, 2018

@glenn-jocher I tested a batch size of 64 on my own too. Not much improvement either :( The loss still seems stuck in some local minimum, I think.

As for accumulated gradients, if you don't divide by 4, it is like summing 4 average losses over 16 images instead of taking the average loss over 64 images. Either way, I don't see any improvement yet.

@glenn-jocher (member)

All, I trained to 60 epochs using the current setup. I used batch size 16 for the first 24 hours, then accidentally reverted to batch size 12 for the rest (hence the nonlinearity at epoch 10). A strange hiccup happened at epoch 40, then the learning rate dropped from 1e-3 to 1e-4 at epoch 51 as part of the LR scheduler. This seemed to produce much faster improvements in recall during the last ten epochs. The test mAP at epoch 55 was 0.40 with conf_thres = 0.10, so I feel that if I continued training until perhaps epoch 100 we might get a very good mAP, especially seeing recall improve so well during epochs 51-60.

The strange thing is that I had to lower conf_thres to 0.10 to get this good (0.40) mAP; otherwise I see 0.20 mAP at the default conf_thres = 0.50. I am going to restart training with a constant batch size of 16, and hopefully the epoch 40 hiccup does not repeat.

[training curves plot]

@hxy1051653358

Hello @xiao1228, I have the same problem as you: during training, TP and FN are always 0 (or occasionally 1). How did you solve it? Thank you so much!
[screenshots]

@xiao1228 (author)

Hi @glenn-jocher,
I am able to achieve results similar to yours now, with mAP 52.2.
After training I would like to convert the model to ONNX, and I see that in models.py you have set ONNX_EXPORT to False.
If I understand correctly, I only need to change ONNX_EXPORT to True and then run torch.onnx._export(model, img, 'weights/model.onnx', verbose=True), WITHOUT retraining, right?
But why do you need a flag in models.py that changes some of the layer structure for ONNX?
Thank you
[training results plot]

@fourth-archive

@xiao1228 you should set ONNX_EXPORT = True and then run detect.py with an example image you want your exported model to process. If you see any errors, perhaps download the latest repo and try again.

Are your plots for training COCO or a custom dataset? They look pretty good!

@xiao1228 (author)

@fourth-archive thank you for the information. I tried setting ONNX_EXPORT to both True and False and I don't see any errors, but I wonder what the difference is.

The graph is plotted from COCO training.

@fourth-archive

@xiao1228 if you don't see any error then it worked. Do you see a new model.onnx file now in your directory?

If ONNX_EXPORT=False, detect.py runs as normal. If ONNX_EXPORT=True, then instead of running inference an ONNX model is created and saved. You should see all of this printed to the screen.
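
For reference, ONNX export is just a traced forward pass saved to disk. A minimal, generic sketch of that mechanism (not the repo's detect.py; the model and shapes here are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()
dummy = torch.zeros(1, 3, 416, 416)  # input shape the exported graph is traced with
torch.onnx.export(model, dummy, 'model.onnx', verbose=True)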

@xiao1228 (author)

@fourth-archive Thank you, but I mean the ONNX_EXPORT=True/False flag in models.py.
I tried both, and they both generate a model.onnx in the end, but in models.py there seem to be some architecture changes when ONNX_EXPORT is set to True.
