Fix GUI resume training #1314

roomrys · 2023-05-18T19:49:11Z

Description

After attempting a few rounds of training via the GUI, we were a bit surprised to find that the results when resuming training did not perform better than the results for training from scratch. When digging into this, we noticed that the base_checkpoint parameter in the training_config.json file was always null even when directly specifying a file path to the base_checkpoint. It turns out that the CLI arg --base_checkpoint was always overwriting the parameter set in the training config (making the GUI resume training useless).

This PR ensure that the base_checkpoint is only overwritten if a non-default value is specified for --base_checkpoint.

Related: #1130

Types of changes

Does this address any currently open issues?

[list open issues here]

Outside contributors checklist

Review the guidelines for contributing to this repository
Read and sign the CLA and add yourself to the authors list
Make sure you are making a pull request against the develop branch (not main). Also you should start your branch off develop
Add tests that prove your fix is effective or that your feature works
Add necessary documentation (if appropriate)

Thank you for contributing to SLEAP!

❤️

codecov · 2023-05-18T20:37:38Z

Codecov Report

Merging #1314 (3ed2701) into develop (1e9026d) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           develop    #1314   +/-   ##
========================================
  Coverage    72.38%   72.38%           
========================================
  Files          133      133           
  Lines        23663    23665    +2     
========================================
+ Hits         17128    17130    +2     
  Misses        6535     6535

Impacted Files	Coverage Δ
sleap/gui/learning/runners.py	`32.02% <100.00%> (+0.19%)`	⬆️
sleap/nn/training.py	`75.42% <100.00%> (+0.02%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

talmo · 2023-05-19T18:53:42Z

I think we need to add one more check here on this fix to address item 4 here: #1291 (reply in thread)

Iin the centered_instance.json, I notice that the "training_inds" list shows only the prior trained model's list of frame numbers and does not include any newly annotated frame numbers. # Is this what I should be seeing in the json file when retraining with new training data?

Thanks for pointing that out! I'm not sure we had noticed that previously. The new centered_instance.json probably shouldn't have the list of frame numbers at all and that may indeed be leading to the trainer not using the new labels! We'll look into it as part of the bug fix investigation mentioned above.

Namely, I'm concerned that this part of the DataReaders is kicking in:

sleap/sleap/nn/training.py

Lines 191 to 199 in d05e2bf

    
           if ( 
        
               labels_config.training_inds is not None 
        
               and len(labels_config.training_inds) > 0 
        
           ): 
        
               logger.info( 
        
                   "Creating training split from explicit indices " 
        
                   f"(n = {len(labels_config.training_inds)})." 
        
               ) 
        
               training = training.extract(labels_config.training_inds, copy=False)

which results in any new labeled frames to be not used in training when resuming from a previous checkpoint.

This would also lead to issues when resuming from a checkpoint used in another labels project entirely which may have different numbers of labeled frames.

I know the old config isn't used in the Trainer (only the best_model.h5 during model initialization), but it will carry over when doing this from the GUI training config editor.

roomrys · 2023-05-24T22:18:23Z

The above would only apply if split_by_inds == True.

sleap/sleap/nn/training.py

Lines 173 to 178 in d05e2bf

    
           if labels_config is not None and labels_config.split_by_inds: 
        
               # First try to split by indices if specified in config. 
        
               if ( 
        
                   labels_config.validation_inds is not None 
        
                   and len(labels_config.validation_inds) > 0 
        
               ):

Thus, I created another PR (#1324) which allows users the option to "Split by Indices" if they would like to rerun the exact same experiment in the GUI (to measure any noise, or exited early previously). However, this checkbox is likely most useful for de-selecting "Split by Indices" if selected in the original config (while being transparent).

* Disable data caching by default for SingleImageVideos (#1243) * Disable data caching by default for SingleImageVideos * Remove a couple things that shouldn't be in this PR :) * Centralize video extensions (#1244) * Add video extension support lists to io.video module * Use centralized extension definitions * Remove some unused code * Add some coverage for indirect coverage reduction * Fix single frame GUI increment (#1254) * Fix frame increment for single frame videos * Add test and lint * Add video search to "robustify" (excessively) fragile test (#1262) Add video search to fix fragile test * Organize docks (#1265) * Add `DockWidget` class and subclasses * Create docks in `MainWindow` using `DockWidget` classes * Remove unused imports * Fix references in existing tests * Fix intermittent bug (file-corruption) that has snuck into the tests (#1267) * Remove corrupted test file * Add non-corrupted copy of test file * Rename the file so tests run * Layout docks in a tab configuration (instead of stacked) (#1289) * Add tab with to docks (and use `self.model` in table) * Use `self.model_type` in `create_models` * Increase GUI crop size range from 512 to 832 (#1295) * Fix conversion to numpy array when last frame(s) do not have labels (#1307) * Ensure frames to predict list is unique (#1293) * Ensure frames to predict list is unique * Ensure frames to predict on are ordered correctly * Better frame sorting * Fix GUI resume training (#1314) * Do not choose `top_k` instances if `max_instances` < num centroids (#1313) * Use max_instances as a max without any minimum requirement * Create test (that fails) * Fix test by re-init inference model each call * Do not compute top k if max is greater * Add `--max_instances` to `sleap-track` and GUI (#1305) * Expose --max_instances to sleap-track and in GUI * Lint * Change shorthand to `-n` for both `sleap-export` and `sleap-track` * Add test for creating predictor from cli * Add max instances support to bottom up model (#1306) * Add max instances support to bottom up model * Remove unnecessary attribute setter in CLI parser * Edge case * Expose max_instances for BU to train/infer GUI --------- Co-authored-by: roomrys <38435167+roomrys@users.noreply.github.com> * Update docs for `--max_instances` command * Add test for BU `--max_instances` --------- Co-authored-by: Talmo Pereira <talmo@salk.edu> * Remove `--labels` and redundant `data_path` (#1326) * Create copy of config info to modify (gui) (#1325) * Fixes GPU memory polling using environment variable filtering (#1272) * add cuda visible gpus to nvidia smi CLI * tests masked none, single and multiple GPUs * deal with comma separated lists and test 2 gpus * clean up system script and better test that actually works * add a case for cuda_visible_devices = [] * add test for gpu order and length in nvidia and tf * test nvidia smi indices and checking gpu memory * fix spaces * fix linting with Black * put skips in tests for git not having nvidia-smi * add doc strings to get_gpu_memory * set CUDA device order to PCI BUS ID * test with no nvidia smi + invalid visible devices * fixed linting again * remove pci env variable, move to other issue --------- Co-authored-by: Eric Leonardis <eleonardis@salk.edu> * Set `split_by_inds`, `test_labels`, and `validation_labels` to default (GUI) (#1331) * Allow returning PAF graph during low level inference (#1329) * allow returning the paf graph during low level inference * reformatted with black * Fix `SingleImageVideo` caching (#1330) * Set `SingleImageVideo.caching` as a class attribute * Modify tests for `SingleImageVideo.caching` * Add caching as a preference and menu checkbox * Test `SingleImageVideo.toggle_caching` * Remove GUI elements for `SingleImageVideo.CACHING` * Remove remaining prefs for `SingleImageVideo.CACHING` * Add depreciated comment * Clean-up * Update comments * Bump to 1.3.1 (#1335) * Bump to 1.3.1 * Test build to dev label * Update environment creation (#1366) * First 'working' env for 1.3.1, no GPU... * First working 1.3.1 w/gpu * Sorry, this is the first gpu working env * Fix `imgstore` version-dependent bug * Build and entry points work, but pip packages not installed * Add default channels in condarc.yaml * Rename environment.yaml to .yml * Working build no gpu (documented) * Env working w/gpu using tensorflow from pypi * Working build with GPU! And pip dep * Attempt to fix cannot find wheel ~= 0.35 issue * Run constrain everything * Use minimal conda req for install_requires, inc build number to get it working * Ubuntu build and environments working * Env creation working on M2 * Build and env working on windows (again) * Apple silicon (M2) build working * Pip package working (M2) * Exclude tests and docs from pypi wheel * Add comments to requirements * Get ready for manual build * Update os images, trigger dev build * Retry manual build win and mac * Retry mac manual build (to upload) * Require latest pynwb, remove setuptools * Remove old mac environment * Add twine as a build dependency * Add comments to manual build * Update build.yml * Rename "apple_silicon" to "mac" * Update installation docs (#1365) * Update installation docs * Trigger website build * Update quick install and build docs * Add Mambaforge links * Remove comment about experimental apple silicon support Co-authored-by: Talmo Pereira <talmo@salk.edu> * Fix minor typo * Change installation docs to reference universal mac env --------- Co-authored-by: Talmo Pereira <talmo@salk.edu> * More flexible with python version ~=3.7 * Pin py3.7.12 for host and run (good on win) * Add comments for why python ~=3.7 * Update ci workflow (#1371) * Update CI workflow to include mac * Trigger CI workflow * Remove verbose? python version * Migrate to micromamba * Explicitly state environment name * Remove environment chaching (for now) * Use different environment file for mac * Use correct syntax * Add `env_hash` and `bash -l {0}` * Remove env_file * Try different nested variable call via format * Use correct syntax * Update env file reference AS -> mac * Different path to environment file * Checkout repo * Use default shells * Remove unused comments * Fix caching attempt by changing caching key * Remove environment caching * Increase python in no_cuda to 3.7.17 * Less restrictive with python ~3.7 version * More direct installation docs 🤞 * Increase build numbers for develop branch build --------- Co-authored-by: Talmo Pereira <talmo@salk.edu> --------- Co-authored-by: Talmo Pereira <talmo@salk.edu> Co-authored-by: Eric Leonardis <leonardiseric@gmail.com> Co-authored-by: Eric Leonardis <eleonardis@salk.edu> Co-authored-by: Caleb Weinreb <calebsw@gmail.com>

Fix GUI resume training

3ed2701

roomrys marked this pull request as ready for review May 18, 2023 20:51

roomrys requested a review from talmo May 18, 2023 20:52

talmo approved these changes May 24, 2023

View reviewed changes

roomrys merged commit 8334d88 into develop May 24, 2023

roomrys deleted the liezl/fix-gui-resume-training branch May 24, 2023 22:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GUI resume training #1314

Fix GUI resume training #1314

roomrys commented May 18, 2023

codecov bot commented May 18, 2023

talmo commented May 19, 2023

roomrys commented May 24, 2023

Fix GUI resume training #1314

Fix GUI resume training #1314

Conversation

roomrys commented May 18, 2023

Description

Types of changes

Does this address any currently open issues?

Outside contributors checklist

Thank you for contributing to SLEAP!

codecov bot commented May 18, 2023

Codecov Report

talmo commented May 19, 2023

roomrys commented May 24, 2023