Wandb logger doesn't upload saved model checkpoint for final epoch #1116

Closed
bilal2vec opened this issue Mar 11, 2020 · 15 comments · Fixed by #1193
Labels
bug Something isn't working help wanted Open to be worked on logger Related to the Loggers

Comments

@bilal2vec
Contributor

🐛 Bug

When training a model on a TPU with the wandb logger, the checkpoint for the last trained epoch doesn't get uploaded to wandb.

To Reproduce

Colab notebook: https://colab.research.google.com/drive/1oPaRWGZcz6YEol012xFADN42LV-jowtT
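
For reference, the setup in the notebook is roughly the following minimal sketch; LM and args stand in for the module and argparse namespace from the notebook, and the exact Trainer/checkpoint arguments are assumptions based on the 0.7.x API:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger

# LM is the LightningModule defined in the notebook (finetunes distilgpt2);
# args is the parsed argparse namespace. Both are placeholders here.
model = LM(args)

wandb_logger = WandbLogger(project='lm-finetuning')
# save checkpoints into the wandb run directory so they get synced
checkpoint_callback = ModelCheckpoint(filepath=wandb_logger.experiment.dir)

trainer = pl.Trainer(
    num_tpu_cores=8,                           # 8 Colab TPU cores
    logger=wandb_logger,
    checkpoint_callback=checkpoint_callback,
)
trainer.fit(model)  # the checkpoint of the final epoch never shows up in W&B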

@bilal2vec bilal2vec added bug Something isn't working help wanted Open to be worked on labels Mar 11, 2020
@Borda
Member

Borda commented Mar 11, 2020

Hello, could you be more specific about what is not working and how you are using it?
I see in the notebook, for example, if __name__ == "__main__": parser = argparse.ArgumentParser(), which is not correct

@bilal2vec
Contributor Author

I'm finetuning distilgpt2 on 8 cores of a Colab TPU and using the wandb logger. The main problem seems to be that, at the end of training, each TPU process and the wandb logger shut down before the built-in ModelCheckpoint callback finishes saving the checkpoint for the epoch to disk.

Relevant part of stack trace:

Epoch 1: 100%|████| 327/327 [01:39<00:00,  1.65it/s, loss=3.468, v_num=2211mbhk]
Epoch 2:  91%|███▌| 296/327 [01:15<01:51,  3.59s/it, loss=3.529, v_num=2211mbhk]
Validating:   0%|                                        | 0/31 [00:00<?, ?it/s]
Epoch 2:  91%|███▋| 297/327 [01:24<02:31,  5.05s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  92%|███▋| 301/327 [01:24<01:32,  3.54s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  94%|███▊| 307/327 [01:24<00:49,  2.48s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  96%|███▊| 314/327 [01:24<00:22,  1.74s/it, loss=3.529, v_num=2211mbhk]
Epoch 2:  98%|███▉| 321/327 [01:24<00:07,  1.23s/it, loss=3.529, v_num=2211mbhk]
wandb: Waiting for W&B process to finish, PID 2784
wandb: Program ended successfully.

wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784
Epoch 2: 100%|████| 327/327 [01:24<00:00,  1.23s/it, loss=3.529, v_num=2211mbhk]
                                                                                
wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784

wandb: Waiting for W&B process to finish, PID 2784
wandb: Run summary:
wandb:        global_step 591
wandb:           _runtime 242.3862087726593
wandb:      learning_rate 3.7664783427495294e-07
wandb:         train_loss 3.7114081382751465
wandb:              _step 61
wandb:         _timestamp 1583893453.517989
wandb:           val_loss 3.4930191040039062
wandb:            val_ppl 32.88510513305664
wandb:   adjusted_val_ppl 55.14464569091797
wandb: Syncing files in wandb/run-20200311_022014-2211mbhk:
wandb:   code/finetune.py
wandb:   epoch=0.ckpt
wandb:   epoch=1.ckpt.part
wandb: plus 7 W&B file(s) and 2 media file(s)

@Borda
Member

Borda commented Mar 14, 2020

@borisdayma @calclavia pls ^^

@borisdayma
Contributor

@bkkaggle does it also happen if you use a Jupyter notebook, or is it specific to Colab?

wandb also has a way to let you run experiments offline and sync them afterwards.
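
(For the offline route, with the wandb version shown in the logs (0.8.x) the rough idea is the following; treat it as a sketch rather than exact guidance:)

import os

# run wandb in "dryrun" mode so nothing is uploaded while training
os.environ['WANDB_MODE'] = 'dryrun'

# train as usual; afterwards upload the stored run from a shell with:
#     wandb sync wandb/dryrun-<run-directory>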

@vanpelt
Contributor

vanpelt commented Mar 14, 2020

@bkkaggle I just took a look at this; a couple of things to note.

  1. Never put your API key in a notebook. Instead use:
import wandb
wandb.login()

It will prompt you for your API key if it isn't already in the notebook environment.

  2. Could this be related to the early stopping callback? I'm not familiar with the mechanics of lightning and TPUs, but this seems to be caused by triggering the finalize action of the WandbLogger before the model has actually been finalized.

@bilal2vec
Contributor Author

I've set the early stopping callback to have a patience of 10 epochs with

early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10)

so I don't think that should be the problem. I've also updated the code to use wandb.login() so the API key isn't visible anymore.
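
(For context, the callback is wired into the Trainer roughly like this; the argument name follows the 0.7.x API and the rest of the Trainer arguments are omitted:)

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10)

# with a patience of 10 epochs the callback should never fire in this short run
trainer = Trainer(num_tpu_cores=8, early_stop_callback=early_stopping_callback)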

@borisdayma
Contributor

borisdayma commented Mar 15, 2020

@Borda

I modified the Colab to override the finalize method, removing its body and just tracking the calls:

class WandbLogger(WandbLogger):  # shadows the imported WandbLogger
    @rank_zero_only
    def finalize(self, status: str = 'success') -> None:
        print('Calling finalize')
        # original body disabled so the wandb process is never joined here:
        # try:
        #     exit_code = 0 if status == 'success' else 1
        #     wandb.join(exit_code)
        # except TypeError:
        #     wandb.join()
        print('Called finalize')
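
The subclass shadows the imported WandbLogger, so the rest of the notebook can build the logger as before, roughly (the Trainer arguments here are placeholders):

# continues the snippet above: WandbLogger now refers to the subclass;
# pl and model are the pytorch_lightning import and the notebook's LightningModule
wandb_logger = WandbLogger(project='lm-finetuning')
trainer = pl.Trainer(num_tpu_cores=8, logger=wandb_logger)
trainer.fit(model)  # prints "Calling finalize" instead of joining the wandb process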

I get the following error stack:

wandb: Tracking run with wandb version 0.8.29
wandb: Run data is saved locally in wandb/run-20200315_175429-1ig7n4xi
wandb: Syncing run smooth-leaf-2
wandb: ⭐️ View project at https://app.wandb.ai/borisd13/lm-finetuning
wandb: 🚀 View run at https://app.wandb.ai/borisd13/lm-finetuning/runs/1ig7n4xi
wandb: Run `wandb off` to turn off syncing.

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:82: UserWarning: Checkpoint directory /content/wandb/run-20200315_175429-1ig7n4xi exists and is not empty with save_top_k != 0.All files in this directory will be deleted when a checkpoint is saved!
  f"Checkpoint directory {filepath} exists and is not empty with save_top_k != 0."
100%|█████████████████████████████████████████████| 1/1 [00:09<00:00,  9.95s/it]
Dataset created in 10 seconds
Dataset length: 9444
Num tokens: 2417664 | Num original tokens: 2088674
2020-03-15 17:54:45.061374: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.080550: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.099582: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.119609: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.139724: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.159427: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
2020-03-15 17:54:45.179451: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
E0315 17:54:52.681799302    8753 server_chttp2.cc:40]        {"created":"@1584294892.681777447","description":"Only 1 addresses added out of total 2 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":404,"referenced_errors":[{"created":"@1584294892.681774375","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::1]:50549"}]}
2020-03-15 17:54:53.917624: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:53.918498: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.150331: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.281601: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.655386: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:54.830783: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:55.214524: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:50549
2020-03-15 17:54:59.626295: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:50549
0it [00:00, ?it/s]
  0%|                                                     | 0/1 [00:00<?, ?it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.41it/s]Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.36it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.35it/s]
Dataset created in 0 seconds
Dataset length: 976
Num tokens: 249856 | Num original tokens: 217646
Epoch 1:   0%|                                          | 0/327 [00:00<?, ?it/s]/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
/pytorch/torch/csrc/utils/python_arg_parser.cpp:739: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
Epoch 1:  91%|███▌| 296/327 [01:13<00:21,  1.47it/s, loss=3.468, v_num=1ig7n4xi]
Validating:   0%|                                        | 0/31 [00:00<?, ?it/s]
Epoch 1:  91%|███▋| 297/327 [01:21<01:24,  2.82s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  91%|███▋| 298/327 [01:23<01:11,  2.47s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  93%|███▋| 305/327 [01:23<00:38,  1.73s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  95%|███▊| 312/327 [01:23<00:18,  1.22s/it, loss=3.468, v_num=1ig7n4xi]
Epoch 1:  98%|███▉| 319/327 [01:23<00:06,  1.17it/s, loss=3.468, v_num=1ig7n4xi]
Epoch 1: 100%|████| 327/327 [01:24<00:00,  1.65it/s, loss=3.468, v_num=1ig7n4xi]
                                                                                /usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
tcmalloc: large alloc 1704353792 bytes == 0x151728000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x4e307c 0x4e3120 0x4e3024 0x4e34a4 0x4e307c 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9
Epoch 1: 100%|████| 327/327 [01:41<00:00,  1.65it/s, loss=3.468, v_num=1ig7n4xi]tcmalloc: large alloc 2568052736 bytes == 0x1d72fe000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x5ec012 0x4e1683 0x4e2c5b 0x4e3120 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48
Epoch 2:  91%|███▌| 296/327 [01:13<01:47,  3.47s/it, loss=3.529, v_num=1ig7n4xi]
Validating:   0%|                                        | 0/31 [00:00<?, ?it/s]
Epoch 2:  91%|███▋| 297/327 [01:21<02:26,  4.88s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  92%|███▋| 300/327 [01:21<01:32,  3.43s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  94%|███▊| 307/327 [01:21<00:48,  2.40s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  95%|███▊| 312/327 [01:21<00:25,  1.69s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2:  98%|███▉| 319/327 [01:21<00:09,  1.19s/it, loss=3.529, v_num=1ig7n4xi]
Epoch 2: 100%|███▉| 326/327 [01:21<00:00,  1.20it/s, loss=3.529, v_num=1ig7n4xi]Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Calling finalize
Called finalize
Epoch 2: 100%|████| 327/327 [01:21<00:00,  1.20it/s, loss=3.529, v_num=1ig7n4xi]
                                                                                tcmalloc: large alloc 1704116224 bytes == 0x151728000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x4e307c 0x4e3120 0x4e3024 0x4e34a4 0x4e307c 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9
tcmalloc: large alloc 2567823360 bytes == 0x1e993e000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x5ec012 0x4e1683 0x4e2c5b 0x4e3120 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48
Epoch 2: 100%|████| 327/327 [01:50<00:00,  2.95it/s, loss=3.529, v_num=1ig7n4xi]
Calling finalize
Called finalize
tcmalloc: large alloc 2567823360 bytes == 0x1e993e000 @  0x7fd8343af2a4 0x592727 0x4dddf7 0x4ddece 0x4e24ad 0x4e25eb 0x4e1430 0x4e2edb 0x4e284a 0x4e14d8 0x4e2c5b 0x4e27b2 0x4e1430 0x4e2c5b 0x5ec012 0x4e1683 0x4e2c5b 0x4e3120 0x4e3ac6 0x5ebdc2 0x50a94c 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48
Traceback (most recent call last):
  File "finetune.py", line 387, in <module>
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 615, in fit
    self.load_spawn_weights(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 366, in load_spawn_weights
    loaded_model = original_model.__class__.load_from_checkpoint(path)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/lightning.py", line 1389, in load_from_checkpoint
    model = cls._load_model_state(checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/lightning.py", line 1419, in _load_model_state
    model = cls(*model_args)
TypeError: __init__() missing 1 required positional argument: 'args'

wandb: Waiting for W&B process to finish, PID 8665
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Run summary:
wandb:         train_loss 3.7114081382751465
wandb:      learning_rate 3.7664783427495294e-07
wandb:         _timestamp 1584295103.1529658
wandb:        global_step 591
wandb:              _step 61
wandb:           _runtime 236.35386848449707
wandb:            val_ppl 32.88510513305664
wandb:           val_loss 3.4930191040039062
wandb:   adjusted_val_ppl 55.14464569091797
wandb: Syncing files in wandb/run-20200315_175429-1ig7n4xi:
wandb:   code/finetune.py
wandb:   epoch=0.ckpt
wandb:   epoch=1.ckpt
wandb: plus 7 W&B file(s) and 2 media file(s)
wandb:                                                                                
wandb: Synced smooth-leaf-2: https://app.wandb.ai/borisd13/lm-finetuning/runs/1ig7n4xi

We can see two issues (which I believe are unrelated to wandb):

  • The WandbLogger.finalize method is called before "Epoch 2" is finished -> could it be related to reaching the end of the dataloaders?
  • Another error appears with load_spawn_weights.

Note that the checkpoints now get synchronized correctly with W&B once we ignore the finalize call: https://app.wandb.ai/borisd13/lm-finetuning/runs/1ig7n4xi

You are probably more familiar with the internals of pytorch_lightning if this is related to it, but let me know if I can be of any further help.

@bilal2vec
Contributor Author

Thanks for the fix.

The error message TypeError: __init__() missing 1 required positional argument: 'args' is probably happening because I pass a custom parameter to my LightningModule:

class LM(pl.LightningModule):
    def __init__(self, args):
        super(LM, self).__init__()

        self.args = args

I'll change my code to follow pytorch-lightning's recommended way of setting hyperparameters (https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html) so this error message should go away.
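
(Per that docs page, the idea is to take a single hparams argument and store it on the module so load_from_checkpoint can rebuild it; a rough sketch:)

import pytorch_lightning as pl

class LM(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # Lightning stores self.hparams in the checkpoint and passes it back
        # to __init__ when load_from_checkpoint reconstructs the model
        self.hparams = hparams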

Should I close the issue now or wait until a fix is pushed up?

@Borda
Member

Borda commented Mar 15, 2020

@borisdayma thanks for your help and for tracing back the issues. 🤖
Regarding load_spawn_weights, I think we fixed it in the last release.
finalize() is explicitly called in the cleanup phase after the training is killed:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3ad6169f187ea41aa1534a1d9a3b978d053dca2b/pytorch_lightning/trainer/training_loop.py#L624

@borisdayma
Contributor

@Borda What is strange in this instance is that finalize seems to be called while Epoch 2 is still running (since we get a few more prints after it).
However, it may be related to some issue with the implementation of this specific example.

@Borda
Member

Borda commented Mar 16, 2020

@jeremyjordan @jeffling any thoughts?

@jeremyjordan
Contributor

Should we only call self.logger.finalize("success") on the main process (and after all other nodes are finished)?
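
A conceptual sketch of that ordering (not actual Lightning code): spawn the workers, wait for all of them, and only then let the parent process finalize the logger. train_fn, model, and logger are placeholders.

import torch_xla.distributed.xla_multiprocessing as xmp

# conceptual only: train_fn runs the training loop on one TPU core
xmp.spawn(train_fn, args=(model,), nprocs=8)  # blocks until every TPU process exits

# only the parent process finalizes, after all workers have finished
logger.finalize('success')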

@bilal2vec
Contributor Author

Hi, I've updated the Colab notebook with @borisdayma's override of WandbLogger's finalize() method and updated the LightningModule to take the command-line args as hparams, so it now runs without errors.

WandbLogger already has the @rank_zero_only decorator applied to its finalize method, so shouldn't it already only be called on the main process?

    @rank_zero_only
    def finalize(self, status: str = 'success') -> None:
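
A simplified sketch of the idea behind the decorator (not the exact Lightning implementation):

from functools import wraps

def rank_zero_only(fn):
    """Run the wrapped logger method only in the process that reports rank 0 (simplified)."""
    @wraps(fn)
    def wrapped_fn(self, *args, **kwargs):
        if getattr(self, 'rank', 0) == 0:
            return fn(self, *args, **kwargs)
    return wrapped_fn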

@Borda
Member

Borda commented Mar 18, 2020

@bkkaggle mind sending a PR?

@bilal2vec
Contributor Author

Sure
