
Poor finetuning results #9

Closed
wj-son opened this issue Feb 13, 2021 · 23 comments

@wj-son

wj-son commented Feb 13, 2021

Thanks for your great work.
I fine-tuned the pretrained model on UCF101 train split 1, but evaluation shows only about 6.5% accuracy.
I suspected this was caused by multi-GPU training or the checkpoint-loading procedure, but the result stayed the same after changing those. The only changes I made to the original code were the dataset path and wrapping the model with torch.nn.DataParallel().

@BestJuly
Owner

Hi, @wj-son
I have not checked whether the code works correctly with multiple GPUs, because fine-tuning occupies little GPU memory and is already fast on a single GPU.

I would like to ask a few questions for more detail:
Does the poor performance occur only with multi-GPU (torch.nn.DataParallel()) in your case?
Have you checked the results with the default settings?

@wj-son
Author

wj-son commented Feb 13, 2021

Both the single-GPU and multi-GPU configurations gave the same poor fine-tuning performance. All the code is identical except for the dataset path. In my case, I pretrained the model with shuffled residual frames as view_2 and then fine-tuned it with residual frames.
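
For context: a residual frame here is just the difference between consecutive RGB frames. A minimal sketch, assuming a (C, T, H, W) clip tensor; the repo's actual preprocessing may scale or shift the result differently:

import torch

def residual_frames(clip):
    # clip: (C, T, H, W) float tensor; residual frames are differences of
    # consecutive frames along the temporal axis.
    return clip[:, 1:] - clip[:, :-1]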

@BestJuly
Owner

BestJuly commented Feb 14, 2021

Then I would like to ask:

  1. Is the performance normal for video retrieval?
  2. Is the model loaded correctly? You can set model.load_state_dict(pretrained_weights['model'], strict=True); errors will be raised for mismatched layer names. If the model is loaded correctly, the only mismatched weights should come from the linear classification layers. (See the sketch at the end of this comment.)
  3. Are the fine-tuning logs normal? During fine-tuning, the accuracy should at least be high on the training set.

If all of these are normal, we may need to look for other explanations.
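
For question 2, a minimal sketch of that check (the checkpoint path and the `model` variable are placeholders; the checkpoint stores the weights under the 'model' key, as in this repo):

import torch

# `model` stands for the fine-tuning network built in ft_classify.py.
ckpt = torch.load('ckpt/ckpt_epoch_240.pth', map_location='cpu')

# strict=False reports mismatches instead of raising.
result = model.load_state_dict(ckpt['model'], strict=False)
print('missing keys   :', result.missing_keys)     # should only be the new classifier
print('unexpected keys:', result.unexpected_keys)  # should only be SSL-specific layers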

@wj-son
Author

wj-son commented Feb 14, 2021

  1. I didn't evaluate video retrieval performance yet. I'll check that soon.
  2. Yes, I checked with strict=True and saw the expected mismatch errors from the classification layers, so I kept the option at False for loading.
  3. Yes, the fine-tuning accuracy on the validation set saturates around 90% by epoch 150.

Retrieval result:

(xmar) prml@ai02:~/wj/source_code/IIC$ python retrieve_clips.py --ckpt=./ckpt/intraneg_shuffle_r3d_res_0211/ckpt_epoch_240.pth --dataset=ucf101 --merge=True
{'cl': 16, 'model': 'r3d', 'id': 'r3d', 'dataset': 'ucf101', 'feature_dir': 'features/ucf101/r3d', 'gpu': 0, 'ckpt': './ckpt/intraneg_shuffle_r3d_res_0211/ckpt_epoch_240.pth', 'bs': 8, 'workers': 8, 'extract': True, 'modality': 'res', 'merge': True}
[Warning] The testing modality is res.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1192/1192 [19:53<00:00, 1.00s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 472/472 [08:17<00:00, 1.05s/it]
Saving features to ... features/ucf101/r3d
Load local .npy files. from ... features/ucf101/r3d
Top-1, correct = 899.00, total = 3776, acc = 0.238
Top-5, correct = 1419.00, total = 3776, acc = 0.376
Top-10, correct = 1737.00, total = 3776, acc = 0.460
Top-20, correct = 2071.00, total = 3776, acc = 0.548
Top-50, correct = 2564.00, total = 3776, acc = 0.679

This performance is somewhat lower than what the paper reports.

@BestJuly
Owner

Oh, that is weird.

For the fine-tuning part: if the accuracy is high on both the training and validation sets, it should not be only 6% on the test set. If you use the test mode of ft_classify, you should confirm that the right model is loaded. Testing your trained model on the training set is another way to check.

For the video retrieval part, the performance is also low, which is strange. I have tried different random seeds and my results stay close to the reported ones; 23.8% is about 10 points lower than normal. If the code, the data, and the training settings are all unchanged, I cannot think of a reason for such poor performance. I am now running an experiment with --neg=shuffle on a different server to see what the performance will be.

@BestJuly
Owner

Hi, @wj-son

I recently trained the model again with --neg=shuffle; for video retrieval, the results are

Top-1, correct = 1357.00, total = 3776, acc = 0.359
Top-5, correct = 2041.00, total = 3776, acc = 0.541
Top-10, correct = 2368.00, total = 3776, acc = 0.627
Top-20, correct = 2690.00, total = 3776, acc = 0.712
Top-50, correct = 3123.00, total = 3776, acc = 0.827

which are higher than those reported in our paper.

And for fine-tuning on the UCF101 dataset, the result is [TEST] loss: 1.061, acc: 0.720, which is normal.

I still do not know why the performance is worse in your case.
Maybe you can try a different experimental environment (Docker or another server).

@wj-son
Author

wj-son commented Feb 18, 2021

Thank you for your meaningful and laborious work, and sorry for the late reply.
In my case, I'm using a server with 4 GPUs.
I ran the default scripts from README.md with the original code and got the following results.

python train_ssl.py --dataset=ucf101 --model=r3d --modality=res --neg=repeat

Train: [240/240][590/596] BT 0.809 (0.816) DT 0.000 (0.004) loss 1.284 (1.066) 1_p 54.761 (53.873) 2_p 54.624 (53.557)
epoch 240, total time 486.15
==> Saving...
intraneg_repeat_r3d_res_0216

python retrieve_clips.py --ckpt=/path/to/your/model --dataset=ucf101 --merge=True

{'cl': 16, 'model': 'r3d', 'id': 'r3d', 'dataset': 'ucf101', 'feature_dir': 'features/ucf101/r3d', 'gpu': 0, 'ckpt': 'ckpt/intraneg_repeat_r3d_res_0216/ckpt_epoch_240.pth', 'bs': 8, 'workers': 8, 'extract': True, 'modality': 'res', 'merge': True}
[Warning] The testing modality is res.
100%|████████████████████████████████████████████████████████████████████████| 1192/1192 [20:00<00:00, 1.01s/it]
100%|██████████████████████████████████████████████████████████████████████████| 472/472 [08:25<00:00, 1.07s/it]
Saving features to ... features/ucf101/r3d
Load local .npy files. from ... features/ucf101/r3d
Top-1, correct = 1049.00, total = 3776, acc = 0.278
Top-5, correct = 1596.00, total = 3776, acc = 0.423
Top-10, correct = 1894.00, total = 3776, acc = 0.502
Top-20, correct = 2214.00, total = 3776, acc = 0.586
Top-50, correct = 2681.00, total = 3776, acc = 0.710

python ft_classify.py --ckpt=/path/to/your/model --dataset=ucf101

Train epoch: [150/150][ 548/ 547] Loss 4.0295 (0.1402) Acc 0.000 (0.974) lr: 0.001
Val epoch: [150/150][ 50/ 50] Loss 0.1769 (0.4682) Acc 0.938 (0.890) lr: 0.001
Epoch time: 229.36 s.
start testing ...
[Warning]: using residual frames as input
Test: [237/237], 1.000 (0.053)
[TEST] loss: 6.517, acc: 0.053

Thank you for your help, and sorry to bother you.

@BestJuly
Owner

BestJuly commented Feb 18, 2021

This time your reported results are better than before, but the performance is still very low.

Since you mentioned that you used the same code as the repo, I would like to ask:

  1. For the video frames, is your data the same as ours? I used the data directly from this repo. (This is to confirm that the data part is identical.)
  2. Have you tried training video classification without loading pre-trained weights? (This is to confirm that the fine-tuning code works in your experimental setting.)
  3. If 1 and 2 are fine: what kind of experimental environment do you use? Here are the detailed settings in my case; if possible, try them.

Python 3.7.4 compiled with GCC 8.4.0.
Python packages: torch==1.2.0, numpy==1.16.2, accimage==0.1.1, torchvision==0.4.0.
Other packages which should not affect performance: tqdm==4.42.1, scikit-learn==0.21.3, pandas==0.25.3, Pillow==6.2.0, tensorboardX==1.9, tensorboard-logger==0.1.0.
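
For reference, most of these can be pinned with one pip command (accimage is omitted since it is normally built from source or installed via conda):

pip install torch==1.2.0 torchvision==0.4.0 numpy==1.16.2 tqdm==4.42.1 \
    scikit-learn==0.21.3 pandas==0.25.3 Pillow==6.2.0 \
    tensorboardX==1.9 tensorboard-logger==0.1.0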

@wj-son
Author

wj-son commented Feb 20, 2021

  1. Yes, I did. The dataset I am using is from this repo.
  2. I tried training video classification without loading pre-trained weights, and the results were the same as with pre-trained weights.
    That suggests the procedure for loading the pre-trained weights is the main source of the problem.

Train epoch: [150/150][ 548/ 547] Loss 4.6004 (0.0370) Acc 0.000 (0.996) lr: 0.001
Val epoch: [150/150][ 50/ 50] Loss 0.1112 (0.2578) Acc 0.938 (0.927) lr: 0.001
Epoch time: 229.75 s.
start testing ...
[Warning]: using residual frames as input
Test: [237/237], 0.000 (0.081)
[TEST] loss: 5.781, acc: 0.081

@BestJuly
Owner

That is weird, because you said you had already tried model.load_state_dict(pretrained_weights['model'], strict=True) to confirm that the model loads correctly.

You mentioned

I tried to train video classification without loading pre-trained weights, but the results were the same as with pre-trained weights.

Because I have fixed all possible random seeds, the performance should be identical in the same experimental environment. Therefore, if "the same" means exactly the same results, the model is not being loaded correctly, so please check the model-loading part again carefully.

To rule out a problem with the SSL pre-trained models themselves, you can also try fine-tuning from my provided weights.
The README.md of this repo links model weights trained with frame repeating as the intra-negative generation. Here I also provide a model (intraneg_shuffle_r3d_res_0216.pth) trained several days ago with frame shuffling.

> (quoting my --neg=shuffle retrieval results from the earlier comment)
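
For reference, the two intra-negative generators discussed here amount to roughly the following (a sketch; the repo's implementation may differ in details such as clip layout):

import torch

def shuffle_negative(clip):
    # Frame shuffling: permute the temporal order of a (C, T, H, W) clip.
    perm = torch.randperm(clip.shape[1])
    return clip[:, perm]

def repeat_negative(clip):
    # Frame repeating: repeat one randomly chosen frame across the clip.
    idx = torch.randint(clip.shape[1], (1,)).item()
    return clip[:, idx:idx + 1].expand_as(clip).contiguous()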

@wj-son
Author

wj-son commented Feb 23, 2021

# 1: direct loading of the whole state dict
model.load_state_dict(pretrained_weights['model'])

# 2: copy matching tensors by name
pretrained_weights = torch.load(args.ckpt)['model']
model_state_dict = model.state_dict()
for name, param in pretrained_weights.items():
    # name = '.'.join(name.split('.')[1:])  # would strip a 'module.' prefix
    if name in model_state_dict:
        model_state_dict[name].copy_(param)

I use approach #2 to load the pretrained checkpoint. With it, the fine-tuning run was slightly different: the loss decreased a little more slowly at the beginning of training, but the final loss and accuracy were almost the same as before.

Train epoch: [150/150][ 548/ 547] Loss 4.0295 (0.1402) Acc 0.000 (0.974) lr: 0.001
Val epoch: [150/150][ 50/ 50] Loss 0.1769 (0.4682) Acc 0.938 (0.890) lr: 0.001
Epoch time: 227.11 s.
start testing ...
[Warning]: using residual frames as input
Test: [237/237], 1.000 (0.053)
[TEST] loss: 6.517, acc: 0.053
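
A quick way to rule out a silent loading failure is to fingerprint the weights before and after loading. A rough sketch, reusing the names from the snippet above (model, args.ckpt):

import torch

def weight_checksum(model):
    # Cheap fingerprint: sum of absolute values over all parameters.
    return sum(p.detach().abs().sum().item() for p in model.parameters())

before = weight_checksum(model)

pretrained_weights = torch.load(args.ckpt, map_location='cpu')['model']
model_state_dict = model.state_dict()
matched = 0
for name, param in pretrained_weights.items():
    if name in model_state_dict:
        model_state_dict[name].copy_(param)  # state_dict tensors alias the model
        matched += 1

print('matched %d / %d tensors' % (matched, len(pretrained_weights)))
print('checksum before=%.2f after=%.2f' % (before, weight_checksum(model)))
# If `matched` is 0 or the checksum barely moves, the checkpoint keys probably
# carry a 'module.' prefix (from DataParallel) and nothing is being copied.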

In addition, I suddenly wondered which epoch of the pretrained (SSL) weights you have been using.
I evaluated with the pretrained weights saved at the last epoch (240).

@wuchlei

wuchlei commented Mar 13, 2021

> (quoting @wj-son's full results from Feb 18 above)

This is weird. Last time I ran your code, I got the same results as the paper. However, this time I'm getting the same poor results. I've tried rolling your code back a few commits, and it is still not working.

The retrieval results are as below. All default settings.
Top-1, correct = 1043.00, total = 3776, acc = 0.276
Top-5, correct = 1565.00, total = 3776, acc = 0.414
Top-10, correct = 1904.00, total = 3776, acc = 0.504
Top-20, correct = 2228.00, total = 3776, acc = 0.590
Top-50, correct = 2743.00, total = 3776, acc = 0.726

@BestJuly
Owner

@wj-son Hi, sorry for the late reply.
For fine-tuning, I always use the last-epoch checkpoint from self-supervised training.

According to the logs, training and validation look fine. Have you tried testing on the training set to check whether the poor performance is caused by the data? Even if the model overfits, the test set should not be that much worse than the training set.

@BestJuly
Owner

@wuchlei Thank you for your report.
I will run the code again to check the performance in my experimental environment.

If you did not change the code and all settings are the same, it is really strange to get such different results.
If you use the original code, the main difference between the newest version and the previous one is that all possible random seeds are now fixed and residual inputs are no longer rescaled during fine-tuning.

For retrieval, 27.6% top-1 is not that bad; it might be explained by using RGB as the retrieval modality.

Anyway, I will run the experiment again to check, and I will report my newest results here later.

@wuchlei

wuchlei commented Mar 13, 2021

@BestJuly I used RGB and res for retrieval (the default settings), so I don't think that is the reason.
The problem may be caused by the PyTorch version (I'm running these experiments on a new machine).
I'll rerun the experiments with your PyTorch version and see if it works.

@wuchlei

wuchlei commented Mar 16, 2021

I've confirmed that the problem is caused by the PyTorch version: everything is fine with PyTorch 1.3.

@BestJuly
Owner

Hi @wuchlei, could you tell me the PyTorch version where the performance is poor? Then I can test it in my experimental environment and add a note about this to the README.md. Thank you in advance.

@wuchlei

wuchlei commented Mar 21, 2021

@BestJuly 1.7.0

@wj-son
Author

wj-son commented Mar 21, 2021

@BestJuly 1.7.0, the same version as in my environment.

I've confirmed that the problem is caused by the PyTorch version: everything is fine with PyTorch 1.3.

Following this finding, I am setting up a matching environment (CUDA, cuDNN, and so on) with Docker and rerunning the experiments.

@BestJuly
Owner

@wuchlei @wj-son Thank you for the information. I plan to use a Docker image with PyTorch 1.7.0 and check whether I can solve the compatibility problem.
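
For anyone trying the same route, something along these lines should work; the exact image tag is an assumption, so check Docker Hub:

docker pull pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
docker run --gpus all -it -v /path/to/IIC:/workspace/IIC \
    pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel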

@wuchlei

wuchlei commented Apr 11, 2021

Hi @BestJuly, have you found the problem?

@BestJuly
Owner

BestJuly commented Apr 14, 2021

Hi, @wuchlei @wj-son
Sorry for the late reply.
Here are my findings.

I used the same code and trained with torch 1.3 and torch 1.7.

Task          torch 1.3   torch 1.7
Retrieval     31.8        25.3
Recognition   70.2        6.9

Note                              Recognition (torch 1.7)
Fine-tune from SSL model (1.3)    4.4
Only test recog. model (1.3)      70.2

These results are similar to what you found.

I wanted to use torch 1.3 to fine-tune models trained with torch 1.7, but a version error is reported, meaning the checkpoint-saving format differs between PyTorch versions. When I use torch 1.7 to test models trained with torch 1.3, everything is OK.
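
The version error is expected: since PyTorch 1.6, torch.save defaults to a zip-based file format that older releases cannot read. If needed, a checkpoint can be written from torch 1.7 in the legacy format so that torch 1.3 can load it (`state` is a placeholder for the usual checkpoint dict):

import torch

# Under torch >= 1.6, fall back to the legacy serialization format so that
# older versions such as 1.3 can still read the file.
torch.save(state, 'ckpt_legacy.pth', _use_new_zipfile_serialization=False)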

Therefore, I think there is no problem with the data and testing parts; the problem lies in the training part.

I used vimdiff to compare some functions and modules such as SGD, CrossEntropy, and Conv3d, and it seems the only differences are interface changes; for example, torch.add(input, other, alpha=1, out=None) in torch 1.3 becomes torch.add(input, other, *, alpha=1, out=None) in torch 1.7.

So far, I have not explored deeper differences.

Given these findings, I will keep using torch 1.3 for all of my experiments, in case there are unclear bugs in torch 1.7+.

I also ran four from-scratch training experiments (torch 1.3 vs. torch 1.7, SGD and Adam) for video recognition. The results are shown in the table below:

Scratch   torch 1.3   torch 1.7
SGD       28.9        6.6
Adam      61.8        6.4

* Note that the learning rate is the same (initial = 0.001, MultiStepLR milestones = [40, 80]), but we did not tune the training settings for the best performance.

@BestJuly
Owner

BestJuly commented Apr 15, 2021

I have fixed all the random seeds in the code, and I also checked numpy, random, and the model initialization weights: with the same random seed, all of them are identical. However, after data loading, the input data for 1.3 and 1.7 are different, even though I manually set the random seed again in dataset.py.
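
For reference, fixing "all possible random seeds" usually looks like the sketch below. Note that PyTorch does not guarantee identical random streams across releases, so the same seed can still produce different batches under 1.3 and 1.7:

import random
import numpy as np
import torch

def set_seed(seed):
    # Seed every RNG the pipeline touches; DataLoader workers additionally
    # need a worker_init_fn when num_workers > 0.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels (at some speed cost).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False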

Another finding is that I used

inputs = np.random.rand(16, 3, 16, 112, 112).astype(np.float32)
inputs = torch.from_numpy(inputs).cuda()

to generate the input data manually. Interestingly, for both PyTorch 1.3 and 1.7, the input data come out identical:

(Pdb) p inputs[0][0][0][0][:10]                                                                                                                                       
tensor([0.8316, 0.4461, 0.8744, 0.9900, 0.3078, 0.5676, 0.6671, 0.2020, 0.6063,
        0.1934], device='cuda:0')

However, after several epochs the performance starts to differ. I guess there are some differences deep in the optimizer.
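
One way to test the optimizer hypothesis is to cut out the data pipeline entirely and compare a single training step across versions. A rough sketch (layer sizes and hyperparameters are arbitrary):

import numpy as np
import torch

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(0)  # fixes the init within one version; if inits still
                      # differ across versions, save/load one state_dict instead

model = torch.nn.Conv3d(3, 8, kernel_size=3).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Version-independent input, generated with numpy as in the snippet above.
inputs = torch.from_numpy(
    np.random.RandomState(0).rand(2, 3, 16, 112, 112).astype(np.float32)).cuda()

loss = model(inputs).mean()
loss.backward()
opt.step()

# Run this same script under torch 1.3 and 1.7: if the fingerprint already
# diverges after one step, the difference is in backward/optimizer, not data.
print(sum(p.detach().abs().sum().item() for p in model.parameters()))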
