Wandb logger doesn't upload saved model checkpoint for final epoch #1116

Closed
bilal2vec opened this issue Mar 11, 2020 · 15 comments · Fixed by #1193
Labels
bug Something isn't working help wanted Open to be worked on logger Related to the Loggers

Comments

@bilal2vec
Contributor

🐛 Bug

When training a model on a TPU with the wandb logger, the checkpoint for the last trained epoch doesn't get uploaded to wandb.

To Reproduce

Colab notebook: https://colab.research.google.com/drive/1oPaRWGZcz6YEol012xFADN42LV-jowtT
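
For reference, the setup in the notebook is roughly the following minimal sketch; LM and args stand in for the module and argparse namespace from the notebook, and the exact Trainer/checkpoint arguments are assumptions based on the 0.7.x API:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger

# LM is the LightningModule defined in the notebook (finetunes distilgpt2);
# args is the parsed argparse namespace. Both are placeholders here.
model = LM(args)

wandb_logger = WandbLogger(project='lm-finetuning')
# save checkpoints into the wandb run directory so they get synced
checkpoint_callback = ModelCheckpoint(filepath=wandb_logger.experiment.dir)

trainer = pl.Trainer(
    num_tpu_cores=8,                           # 8 Colab TPU cores
    logger=wandb_logger,
    checkpoint_callback=checkpoint_callback,
)
trainer.fit(model)  # the checkpoint of the final epoch never shows up in W&B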

@bilal2vec bilal2vec added bug Something isn't working help wanted Open to be worked on labels Mar 11, 2020
@Borda
Member

Borda commented Mar 11, 2020

Hello, could you be more specific about what is not working and how you are using it?
I see in the notebook, for example, if __name__ == "__main__": parser = argparse.ArgumentParser(), which is not correct

@bilal2vec
Contributor Author

I'm finetuning distilgpt2 on 8 cores of a Colab TPU and using the wandb logger. The main problem seems to be that, at the end of training, each TPU process and the wandb logger shut down before the built-in ModelCheckpoint callback finishes saving the checkpoint for the epoch to disk.

Relevant part of stack trace:

Epoch 1: 100%|████| 327/327 [01:39<00:00,  1.65it/s, loss=3.468, v_num=2211mbhk]
Epoch 2:  91%|███▌| 296/327 [01:15<01:51,  3.59s/it, loss=3.529, v_num=2211mbhk]
Validating:   0%|                                        | 0/31 [00:00<?, ?it/s]
Epoch 2:  91%|███▋| 297/327 [01:24<02:31,  5.05s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  92%|███▋| 301/327 [01:24<01:32,  3.54s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  94%|███▊| 307/327 [01:24<00:49,  2.48s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  96%|███▊| 314/327 [01:24<00:22,  1.74s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  98%|███▉| 321/327 [01:24<00:07,  1.23s/it, loss=3.529, v_num=2211mbhk]
wandb: Waiting for W&B process to finish, PID 2784
wandb: Program ended successfully.

wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784
Epoch 2: 100%|████| 327/327 [01:24<00:00,  1.23s/it, loss=3.529, v_num=2211mbhk]
                                                                                
wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784
wandb: Run summary:
wandb:        global_step 591
wandb:           _runtime 242.3862087726593
wandb:      learning_rate 3.7664783427495294e-07
wandb:         train_loss 3.7114081382751465
wandb:              _step 61
wandb:         _timestamp 1583893453.517989
wandb:           val_loss 3.4930191040039062
wandb:            val_ppl 32.88510513305664
wandb:   adjusted_val_ppl 55.14464569091797
wandb: Syncing files in wandb/run-20200311_022014-2211mbhk:
wandb:   code/finetune.py
wandb:   epoch=0.ckpt
wandb:   epoch=1.ckpt.part
wandb: plus 7 W&B file(s) and 2 media file(s)

@Borda
Member

Borda commented Mar 14, 2020

@borisdayma @calclavia pls ^^

@borisdayma
Contributor

@bkkaggle does it also happen if you use a Jupyter notebook, or is it specific to Colab?

wandb also has a way to let you run experiments offline and sync them afterwards.
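
(For the offline route, with the wandb version shown in the logs (0.8.x) the rough idea is the following; treat it as a sketch rather than exact guidance:)

import os

# run wandb in "dryrun" mode so nothing is uploaded while training
os.environ['WANDB_MODE'] = 'dryrun'

# train as usual; afterwards upload the stored run from a shell with:
#     wandb sync wandb/dryrun-<run-directory>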

@vanpelt
Contributor

vanpelt commented Mar 14, 2020

@bkkaggle I just took a look at this; a couple of things to note.

  1. Never put your API key in a notebook. Instead use:
import wandb
wandb.login()

It will prompt you for your API key if it isn't already in the notebook environment.

  2. Could this be related to the early stopping callback? I'm not familiar with the mechanics of lightning and TPUs, but this seems to be caused by triggering the finalize action of the WandbLogger before the model has actually been finalized.

@bilal2vec
Contributor Author

I've set the early stopping callback to have a patience of 10 epochs with

early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10)

so I don't think that should be the problem. I've also updated the code to use wandb.login() so the API key isn't visible anymore.
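
(For context, the callback is wired into the Trainer roughly like this; the argument name follows the 0.7.x API and the rest of the Trainer arguments are omitted:)

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10)

# with a patience of 10 epochs the callback should never fire in this short run
trainer = Trainer(num_tpu_cores=8, early_stop_callback=early_stopping_callback)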

@borisdayma
Contributor

borisdayma commented Mar 15, 2020

@Borda

I modified the Colab to override the finalize method, removing its body and just tracking the calls:

class WandbLogger(WandbLogger):  # shadows the imported WandbLogger
    @rank_zero_only
    def finalize(self, status: str = 'success') -> None:
        print('Calling finalize')
        # original body disabled so the wandb process is never joined here:
        # try:
        #     exit_code = 0 if status == 'success' else 1
        #     wandb.join(exit_code)
        # except TypeError:
        #     wandb.join()
        print('Called finalize')
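
The subclass shadows the imported WandbLogger, so the rest of the notebook can build the logger as before, roughly (the Trainer arguments here are placeholders):

# continues the snippet above: WandbLogger now refers to the subclass;
# pl and model are the pytorch_lightning import and the notebook's LightningModule
wandb_logger = WandbLogger(project='lm-finetuning')
trainer = pl.Trainer(num_tpu_cores=8, logger=wandb_logger)
trainer.fit(model)  # prints "Calling finalize" instead of joining the wandb process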

I get the following error stack:

wandb: Tracking run with wandb version 0.8.29
wandb: Run data is saved locally in wandb/run-20200315_175429-1ig7n4xi
wandb: Syncing run smooth-leaf-2
wandb: ⭐️ View project at https://app.wandb.ai/borisd13/lm-finetuning
wandb: 🚀 View run at https://app.wandb.ai/borisd13/lm-finetuning/runs/1ig7n4xi
wandb: Run `wandb off` to turn off syncing.

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:82: UserWarning: Checkpoint directory /content/wandb/run-20200315_175429-1ig7n4xi exists and is not empty with save_top_k != 0.All files in this directory will be deleted when a checkpoint is saved!
  f"Checkpoint directory {filepath} exists and is not empty with save_top_k != 0."
100%|█████████████████████████████████████████████| 1/1 [00:09<00:00,  9.95s/it]
Dataset created in 10 seconds
Dataset length: 9444
Num tokens: 2417664 | Num original tokens: 2088674
2020-03-15 17:54:45.061374: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.080550: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.099582: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.119609: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.139724: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.159427: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.179451: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
E0315 17:54:52.681799302    8753 server_chttp2.cc:40]        {"created":"@1584294892.681777447","description":"Only 1 addresses added out of total 2 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":404,"referenced_errors":[{"created":"@1584294892.681774375","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::1]:50549"}]}
2020-03-15 17:54:53.917624: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:53.918498: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.150331: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.281601: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.655386: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.830783: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:55.214524: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:59.626295: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
0it [00:00, ?it/s]
  0%|                                                     | 0/1 [00:00<?, ?it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.41it/s]Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.36it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.35it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
Epoch 1:   0%|                                          | 0/327 [00:00<?, ?it/s]/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
Epoch 1:  91%|███▌| 296/327 [01:13<00:21,  1.47it/s, loss=3.468, v_num=1ig7n4xi]
Validating:   0%|                                        | 0/31 [00:00<?, ?it/s]
Epoch 1:  91%|███▋| 297/327 [01:21<01:24,  2.82s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  91%|███▋| 298/327 [01:23<01:11,  2.47s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  93%|███▋| 305/327 [01:23<00:38,  1.73s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  95%|███▊| 312/327 [01:23<00:18,  1.22s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  98%|███▉| 319/327 [01:23<00:06,  1.17it/s, loss=3.468, v_num=1ig7n4xi]
Epoch 1: 100%|████| 327/327 [01:24<00:00,  1.65it/s, loss=3.468, v_num=1ig7n4xi]
                                                                                /usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
tcmalloc: large alloc 1704353792 bytes == 0x151728000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x4e307c 0x4e3120 0x4e3024 0x4e34a4 0x4e307c 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9
Epoch 1: 100%|████| 327/327 [01:41<00:00,  1.65it/s, loss=3.468, v_num=1ig7n4xi]tcmalloc: large alloc 2568052736 bytes == 0x1d72fe000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x5ec012 0x4e1683 0x4e2c5b 0x4e3120 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48
Epoch 2:  91%|███▌| 296/327 [01:13<01:47,  3.47s/it, loss=3.529, v_num=1ig7n4xi]
Validating:   0%|                                        | 0/31 [00:00<?, ?it/s]
Epoch 2:  91%|███▋| 297/327 [01:21<02:26,  4.88s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  92%|███▋| 300/327 [01:21<01:32,  3.43s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  94%|███▊| 307/327 [01:21<00:48,  2.40s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  95%|███▊| 312/327 [01:21<00:25,  1.69s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  98%|███▉| 319/327 [01:21<00:09,  1.19s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2: 100%|███▉| 326/327 [01:21<00:00,  1.20it/s, loss=3.529, v_num=1ig7n4xi]Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Epoch 2: 100%|████| 327/327 [01:21<00:00,  1.20it/s, loss=3.529, v_num=1ig7n4xi]
                                                                                tcmalloc: large alloc 1704116224 bytes == 0x151728000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x4e307c 0x4e3120 0x4e3024 0x4e34a4 0x4e307c 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9
tcmalloc: large alloc 2567823360 bytes == 0x1e993e000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x5ec012 0x4e1683 0x4e2c5b 0x4e3120 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48
Epoch 2: 100%|████| 327/327 [01:50<00:00,  2.95it/s, loss=3.529, v_num=1ig7n4xi]
Calling finalize
Called finalize
tcmalloc: large alloc 2567823360 bytes == 0x1e993e000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x5ec012 0x4e1683 0x4e2c5b 0x4e3120 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48
Traceback (most recent call last):
  File "finetune.py", line 387, in <module>
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 615, in fit
    self.load_spawn_weights(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 366, in load_spawn_weights
    loaded_model = original_model.__class__.load_from_checkpoint(path)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/lightning.py", line 1389, in load_from_checkpoint
    model = cls._load_model_state(checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/lightning.py", line 1419, in _load_model_state
    model = cls(*model_args)
TypeError: __init__() missing 1 required positional argument: 'args'

wandb: Waiting for W&B process to finish, PID 8665
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Run summary:
wandb:         train_loss 3.7114081382751465
wandb:      learning_rate 3.7664783427495294e-07
wandb:         _timestamp 1584295103.1529658
wandb:        global_step 591
wandb:              _step 61
wandb:           _runtime 236.35386848449707
wandb:            val_ppl 32.88510513305664
wandb:           val_loss 3.4930191040039062
wandb:   adjusted_val_ppl 55.14464569091797
wandb: Syncing files in wandb/run-20200315_175429-1ig7n4xi:
wandb:   code/finetune.py
wandb:   epoch=0.ckpt
wandb:   epoch=1.ckpt
wandb: plus 7 W&B file(s) and 2 media file(s)
wandb:                                                                                
wandb: Synced smooth-leaf-2: https://app.wandb.ai/borisd13/lm-finetuning/runs/1ig7n4xi

We can see two issues (which I believe are unrelated to wandb):

  • The WandbLogger.finalize method is called before "Epoch 2" is finished -> could it be related to reaching the end of the dataloaders?
  • Another error appears with load_spawn_weights.

Note that the checkpoints now get synchronized correctly with W&B once we ignore the finalize call: https://app.wandb.ai/borisd13/lm-finetuning/runs/1ig7n4xi

You are probably more familiar with the internals of pytorch_lightning if this is related to it, but let me know if I can be of any further help.

@bilal2vec
Contributor Author

Thanks for the fix.

The error message TypeError: __init__() missing 1 required positional argument: 'args' is probably happening because I pass a custom parameter to my LightningModule:

class LM(pl.LightningModule):
    def __init__(self, args):
        super(LM, self).__init__()

        self.args = args

I'll change my code to follow pytorch-lightning's recommended way of setting hyperparameters (https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html) so this error message should go away.
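
(Per that docs page, the idea is to take a single hparams argument and store it on the module so load_from_checkpoint can rebuild it; a rough sketch:)

import pytorch_lightning as pl

class LM(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # Lightning stores self.hparams in the checkpoint and passes it back
        # to __init__ when load_from_checkpoint reconstructs the model
        self.hparams = hparams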

Should I close the issue now or wait until a fix is pushed up?

@Borda
Member

Borda commented Mar 15, 2020

@borisdayma thanks for your help and for tracing back the issues. 🤖
Regarding load_spawn_weights, I think we fixed it in the last release.
finalize() is explicitly called in the cleanup phase after the training is killed:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3ad6169f187ea41aa1534a1d9a3b978d053dca2b/pytorch_lightning/trainer/training_loop.py#L624

@borisdayma
Contributor

@Borda What is strange in this instance is that finalize seems to be called while Epoch 2 is still running (since we get a few more prints after it).
However, it may be related to some issue with the implementation of this specific example.

@Borda
Member

Borda commented Mar 16, 2020

@jeremyjordan @jeffling any thoughts?

@jeremyjordan
Contributor

Should we only call self.logger.finalize("success") on the main process (and after all other nodes are finished)?
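
A conceptual sketch of that ordering (not actual Lightning code): spawn the workers, wait for all of them, and only then let the parent process finalize the logger. train_fn, model, and logger are placeholders.

import torch_xla.distributed.xla_multiprocessing as xmp

# conceptual only: train_fn runs the training loop on one TPU core
xmp.spawn(train_fn, args=(model,), nprocs=8)  # blocks until every TPU process exits

# only the parent process finalizes, after all workers have finished
logger.finalize('success')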

@bilal2vec
Contributor Author

Hi, I've updated the Colab notebook with @borisdayma's override of WandbLogger's finalize() method and updated the LightningModule to take the command-line args as hparams, so it now runs without errors.

WandbLogger already has the @rank_zero_only decorator applied to its finalize method, so shouldn't it already only be called on the main process?

    @rank_zero_only
    def finalize(self, status: str = 'success') -> None:
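
A simplified sketch of the idea behind the decorator (not the exact Lightning implementation):

from functools import wraps

def rank_zero_only(fn):
    """Run the wrapped logger method only in the process that reports rank 0 (simplified)."""
    @wraps(fn)
    def wrapped_fn(self, *args, **kwargs):
        if getattr(self, 'rank', 0) == 0:
            return fn(self, *args, **kwargs)
    return wrapped_fn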

@Borda
Member

Borda commented Mar 18, 2020

@bkkaggle mind sending a PR?

@bilal2vec
Contributor Author

Sure
