Failing to train my model #138

tsotop · 2023-09-06T18:00:15Z

tsotop
Sep 6, 2023

Hello there,

I've been trying to use segmentation_gym, but I've encountered some challenges. Currently, I'm facing an issue with training a model, and I suspect it's related to my data. I was able to make_dataset.py and train_model.py successfully using the dataset example (capehatteras) provided within the guide. In any case, I'd like to share my experiences with the problems I've encountered and how I managed to solve them.

1. Installation:
I'm using WSL2 (Ubuntu 22.x.x.) on Windows 11. The installation process was relatively straightforward following the provided guide. I had no issues with GPU installation. However, I did notice that some library paths (e.g., cuda) were not included, causing crashes and errors. I had to add them manually, like this: export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH.

2. Data (images):
I'm working with .jpg clips extracted from an orthophoto (.tif). Initially, I faced problems with make_dataset.py, which was giving me an error message saying, "unable to fit an X size into a [Y, Z] array." Upon closer inspection, I realized that the jpg clips had a 32-bit depth. Examining the orthophoto, I found it had 4 bands (R, G, B, alpha). To resolve this issue, I removed the alpha band and reclipped the images, resulting in 24-bit jpgs that allowed me to create the dataset successfully.

3. make_dataset.py:
I suspect there might be a minimum requirement for the number of images (or classes within images) for make_dataset.py to run successfully. Can someone confirm this? Initially, I attempted with 10 images without success, but when I used 20 images, it worked as expected (or at least I believe so).

4. train_model.py:
My current roadblock is with the training process, and I suspect it's related to my data. I'm encountering the following error message:

ValueError: A Concatenate layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 284, 260, 64), (None, 283, 259, 16)]

My data includes black (blank) areas that are not segmented into any classes within DashDoodler. In fact, Doodler won't even detect these areas. These black regions result from the image clipping process, where I enforce a specific image size for practicality. You can view the images and labels I used for running make_dataset.py in this link. I segmented them into three classes: "wet," "dry," and "other."

Any insights on how to handle this situation would be greatly appreciated.

Cheers,
Tulio

ebgoldstein · 2023-09-06T18:57:14Z

ebgoldstein
Sep 6, 2023
Maintainer

Hi Tulio,
thanks for all the detail on your setup and workflow. Is it possible for you to upload a zip file of ~ 10 images & labels?

(One thing to check is the example overlays created by make_dataset - They are a great way to see if the code is workign as expected on your data..)

(I have successfully worked with this sort of 'no data' are before w/ Gym, so I bet we can make it work for you)

take care,
Evan

1 reply

tsotop Sep 6, 2023
Author

Hey Evan, thanks for replying. Attached you will find a zip file containing the following:

images
labels from doodler
overlays from doodler
labels for gym (after running doodler's util script 'gen_images_and_labels')

Regarding the overlays, I could not find any created by make_dataset. Do you refer to the ones created by doodler after running gen_images_and_labels? By checking these overlays I do note something weird, it is definitely classifying 'no data', even when it is not shown during doodler's compute/show segmentation (you will see them in the attachment).

data.zip

Thanks for the help,
Tulio

ebgoldstein · 2023-09-07T12:48:33Z

ebgoldstein
Sep 7, 2023
Maintainer

Tulio,
I totally forgot -
can you upload the config file too?

thx,
Evan

1 reply

tsotop Sep 7, 2023
Author

Sure, here it is:
spoel_resunet.zip

Tulio

ebgoldstein · 2023-09-07T14:45:19Z

ebgoldstein
Sep 7, 2023
Maintainer

OK:
0) the first thing i notice in the config file is the size. Gym models are very sensitive to target size. I tend to choose 512, 768, or 1024. I changed your config to "TARGET_SIZE": [512,512],. Now that i look at the tensor shape error you are reporting in the concat layer, i am 99% sure that this is the issue.

the example files to see if make_dataset is worked well are in the directory you specify ias output.. in my case, this is npzForModel. So i look in npzForModel/train_npzs/noaug_sample and then look to see if thhe script is working well. so far they look good:

With the modfiied target size, i can get the model to start training on my machine..

Can you try to adjust that target size and see if it works for you?

0 replies

tsotop · 2023-09-09T23:21:02Z

tsotop
Sep 9, 2023
Author

Hey Evan,

The model started training, thank you! unfortunately, at epoch 13 the 'slowed down' warning popped and kept repeating every couple of minutes while the training seemed completely halted.

Here's the output:

`
Training model ...

...

Epoch 12: LearningRateScheduler setting learning rate to 5.5045e-06.
Epoch 12/100
12/12 [==============================] - 1663s 140s/step - loss: 0.6915 - mean_iou: 0.2654 - dice_coef: 0.3085 - val_loss: 0.7095 - val_mean_iou: 0.2683 - val_dice_coef: 0.2905 - lr: 5.5045e-06

Epoch 13: LearningRateScheduler setting learning rate to 6.0040000000000005e-06.
Epoch 13/100
8/12 [===================>..........] - ETA: 12:09 - loss: 0.6790 - mean_iou: 0.2668 - dice_coef: 0.32102023-09-07 19:49:15.635361: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]

[Compiling module a_inference__update_step_xla_735845__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56] Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.

2023-09-07 19:50:58.621014: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 3m42.957210043s
`

After this, based on a post in the Issues section, I made a HOT_START from Epoch 13, specifically, I added the following lines to my config file:
"HOT_START": "/mnt/c/Users/tsoto/segmentation_gym/spoel/weights/spoel_resunet.h5", "INITIAL_EPOCH": 13, "LOAD_DATA_WITH_CPU" : true, "CLEAR_MEMORY":true

This trained the model up to Epoch 24 before showing the same issue. I repeated this process from Epoch23 to Epoch36, 35-47, 46-58, 57-70, 70-82, 82-95... and it basically arrived at my max epoch of 100. At this point, I am not even sure whether to continue or not...strangely, it's slowing doing every ~12 epochs. I am using the first resunet model.

This is the final output after the 100 epochs:
Mean of mean IoUs (train subset)=0.414 Mean of mean IoUs, confusion matrix (train subset)=0.414 Mean of mean frequency weighted IoUs, confusion matrix (train subset)=0.767 Mean of Matthews Correlation Coefficients (train subset)=0.406 Mean of mean Dice scores (train subset)=0.432 Mean of mean KLD scores (train subset)=2.444

Apparently from the output plot, it just trained on the 6 last epochs (95-100). Maybe I wrote something wrong in the config file?

I then tried with the segformer model. It trained really fast and the validation results are good enough:

Mean of mean IoUs (train subset)=0.709 Mean of mean IoUs, confusion matrix (train subset)=0.709 Mean of mean frequency weighted IoUs, confusion matrix (train subset)=0.905 Mean of Matthews Correlation Coefficients (train subset)=0.765 Mean of mean Dice scores (train subset)=0.777 Mean of mean KLD scores (train subset)=0.840

HOWEVER, when using seg_images_in_folder.py the results are not accurate at all... it also displays the following warning (which I am not sure what it means):

Some layers from the model checkpoint at nvidia/mit-b0 were not used when initializing TFSegformerForSemanticSegmentation: ['classifier'] This IS expected if you are initializing TFSegformerForSemanticSegmentation from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). This IS NOT expected if you are initializing TFSegformerForSemanticSegmentation from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some layers of TFSegformerForSemanticSegmentation were not initialized from the model checkpoint at nvidia/mit-b0 and are newly initialized: ['decode_head'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. Number of samples: 55

I attached my segformer model (config, modelOut, weights, toPredict). You probably have a better idea of what is going on.

segformer_weights.zip
segformer_weights_fullmodel.zip

segformer_config_modelOut_toPredict.zip

I'll be also uploading the results of the other models if I manage to run them.

Thanks again for the help!
Tulio

0 replies

tsotop · 2023-09-11T11:50:25Z

tsotop
Sep 11, 2023
Author

I tried running the unet model. After each iteration, it was properly reporting loss, mean_iou and dice coeff, however, it also had this msg:

Epoch 38: LearningRateScheduler setting learning rate to 1.676050451796691e-05.
Epoch 38/100
3/3 [==============================] - ETA: 0s - loss: 0.4268 - mean_iou: 0.5068 - dice_coef: 0.5732 **WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,mean_iou,dice_coef**
3/3 [==============================] - 76s 25s/step - loss: 0.4268 - mean_iou: 0.5068 - dice_coef: 0.5732 - lr: 1.6761e-05

I believe the variable loss refers to val_loss but it is not defined like so within the script?

Also, the model did not stop until epoch 100, and displayed this error msg at the end, which is probably the same issue:

Traceback (most recent call last):
  File "/mnt/c/Users/tsoto/segmentation_gym/train_model.py", line 858, in <module>
    plot_seg_history_iou(history, hist_fig, MODEL)
  File "/home/tsotop/miniconda3/envs/gym/lib/python3.11/site-packages/doodleverse_utils/imports.py", line 189, in plot_seg_history_iou
    n = len(history.history["val_loss"])
            ~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'val_loss'

I made sure I installed the latest versions of the packages.

Tulio

0 replies

ebgoldstein · 2023-09-14T15:00:46Z

ebgoldstein
Sep 14, 2023
Maintainer

Sorry for the delay @tsotop - @dbuscombe-usgs and I have discussed this, and one/both of use will train up a model using your data and the config file you sent. It might take a week for use to get to it.. but we will figure it out!

1 reply

tsotop Sep 14, 2023
Author

No worries, on the contrary, thank you both for the help. If you need more data or anything let me know.

ebgoldstein · 2023-09-15T15:17:58Z

ebgoldstein
Sep 15, 2023
Maintainer

ok, with the config that @tsotop provided, I was getting the val_loss error - i have not encountered this before. I modified/simplified the config file a bit, and eventually I got the model to train. I am not exactly sure why it wasn't working, and why i was able to make it work.
My guesses are:

cleaning up older datafiles that are in the data folder, including the resized images and labels. I recommend keeping your data directory very clean, and essentially deleting all the data aside from your config and the original images & labels
making sure there are enough val files and the val split is large enough... i went with 50/50 split in this config
making aug loops and aug copies a bit smaller

I still think there is a bug here, but i am hoping its a corner case and that these modifications will prohibt it from happening again...

so, @tsotop , can you try trian a model using the attacehd config and report back? (i had to change it to a .txt file to upload.. you should be able to just change the suffix to .json and use it..

spoel_resunet_ebg.txt

2 replies

tsotop Sep 17, 2023
Author

Hey Evan,

I confirm I was cleaning up the folder every time I was making a new run, keeping only the folder structure, config, and original images and labels.

I used the config file you provided and the model trained just fine until epoch 15. The 'slowed down' warning popped up and it seems it doesn't train anymore.

I could potentially keep training using a hot start, but the last time I tried the end result just seemed like it had just trained the last hot start epochs, and also it went up to max_epoch before stopping.

Were you able to check the segformer model? it trains relatively fast and the training output seems quite decent. However, the predictions are way off.

Thanks again!

ebgoldstein Sep 18, 2023
Maintainer

I have not tried the segformer model - I will try to do that in the next few days. My guess is that the code is somehow using the pretrained weights and not the model that you have trained... i will try to figure it out...
I am not sure why you get the slowdown. I can't reproduce it (i can train for the full number of epochs). I wonder if this is a memory issue? Can you monitor nvidia-smi and watch your memory usage as you go from epoch 1 to 12? You could also try to make the batch size smaller, and/or make the target size smaller, like (256,256)... sorry, I haven;t seen this error before.

dbuscombe-usgs · 2023-09-19T01:40:02Z

dbuscombe-usgs
Sep 19, 2023
Maintainer

I'm sorry I've been out the loop. Tulio, thanks for reporting, and for your patience. The segformer tends to be more accurate than the resunet for all of the segmentations I've done recently, so I'm surprised by the large errors

Evan, thanks for helping address this issue.

On the slowdown issue, I'm not sure what could be causing it. I also recommend monitoring nvidia-smi and if you identify a memory leak, there are two (largely undocumented) config options that may help if you've exhausted other options. First, I would try

CLEAR_MEMORY = true, which does a garbage collection at the end of every epoch

and finally try

LOAD_DATA_WITH_CPU = true which will force TensorFlow to use the CPU to load the data onto the GPU

2 replies

tsotop Sep 20, 2023
Author

Hey Daniel,

Right, I am not sure what is going on with the segformer model when predicting, also because the training output suggests really high precision (>90%).

Thanks for the tips. Based on another post, I have tested both config lines (CLEAR_MEMORY = True and LOAD_DATA_WITH_CPU = true), however I was still getting the slow-down error. I did not monitor nvidia-smi, which I'll do this time and report back.

Thanks again,
Tulio

tsotop Sep 27, 2023
Author

I must add that I tested the segformer model using the provided dataset (capehatteras). Same result, it trains fast but predictions are way off.

Cheers,
Tulio

ebgoldstein · 2023-09-20T16:49:07Z

ebgoldstein
Sep 20, 2023
Maintainer

OK, I can confirm that the segformer model provided above is not able to be used for predictions using seg_images_in_folder. I made a new issue to discuss this (#139 ). My thinking is this is related to saving the trained segformer model.. I linked to a related issue in the HF/transformers discussion that discusses some aspects of saving the segformer model

@tsotop - I think this will take us some time to sit down and fix. Both @dbuscombe-usgs and I are working hard on various other projects for the next few weeks.. I think you have 3 options:

you wait for us to fix it
you write a quick fix, some code to fix it that works for you. I recommend adding a new line in train_model ~L873 that saves your model (something like model.saved_pretrained(filepath) ). This will save your trained model.. Then you could either modify seg_images_in_folder to load your trained model OR just write a basic prediction script to loop through the images in your folder and predict them (to avoid relying on seg_images_in_folder)... A basic prediction notebook is here in Zoo, and it gives some ideas on how to use a doodleverse model in prediction mode... This notebook uses a resunet, and a segformer might have some different tensorshapes to deal with (Channels first vs channels last, etc..).. I think i have a notebook for a segformer that i can dig up if you are interested in going this route..
you write some code to fix it that can be submitted to us as a PR to SegmentationGym (we would be grateful and acknowledge your contribution!)

1 reply

tsotop Sep 21, 2023
Author

Thanks, I might be able to write some code but it will take a while until I can start working on it. I will try some quick fixes and report back.

dbuscombe-usgs · 2023-09-20T17:36:04Z

dbuscombe-usgs
Sep 20, 2023
Maintainer

Thanks @ebgoldstein for confirming this. If the model trained well, but is hopeless in prediction, it could be

the model is (somehow) not assigning the trained weights to the model architecture
the input imagery is very different to imagery used during training
the model didn't in fact train well, and the reported validation statistics are bogus

Not sure what else could cause it.... at this point, 1. seems most likely

It's been probably almost 2 months since I used the seg_images_in_folder script, but I can confirm that segformer and resunets were working then. We could dig into TF versions to see if anything has changed. I agree with Evan that model saving could be the culprit. We tend to save using the H5 format because that is stable, and last time we tried to use the new .keras format, we ran into lots of issues. However, it may be time to revisit that new format.

If it is a saved model format issue, there are two utilities that may help:

https://github.com/Doodleverse/segmentation_gym/blob/main/utils/gen_fullmodel_from_h5.py will 'rebuild' your model from h5

https://github.com/Doodleverse/segmentation_gym/blob/main/utils/gen_saved_model.py takes the h5 format and converts it to a compiled .pb model format

I probably can't help until at least Monday, and then it may take some time to implement any changes to Gym, because they need to be tested with downstream applications (zoo, coastseg, and seg2map).

2 replies

ebgoldstein Sep 20, 2023
Maintainer

(I think gen_fullmodel will not work b/c the segformer is a subclassed model)
(I had tried gen_saved but it did not work for me...)

dbuscombe-usgs Sep 20, 2023
Maintainer

Ok, thanks. Yeah, it was a long shot!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failing to train my model #138

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 10 comments 10 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Failing to train my model #138

tsotop Sep 6, 2023

Replies: 10 comments · 10 replies

ebgoldstein Sep 6, 2023 Maintainer

tsotop Sep 6, 2023 Author

ebgoldstein Sep 7, 2023 Maintainer

tsotop Sep 7, 2023 Author

ebgoldstein Sep 7, 2023 Maintainer

tsotop Sep 9, 2023 Author

tsotop Sep 11, 2023 Author

ebgoldstein Sep 14, 2023 Maintainer

tsotop Sep 14, 2023 Author

ebgoldstein Sep 15, 2023 Maintainer

tsotop Sep 17, 2023 Author

ebgoldstein Sep 18, 2023 Maintainer

dbuscombe-usgs Sep 19, 2023 Maintainer

tsotop Sep 20, 2023 Author

tsotop Sep 27, 2023 Author

ebgoldstein Sep 20, 2023 Maintainer

tsotop Sep 21, 2023 Author

dbuscombe-usgs Sep 20, 2023 Maintainer

ebgoldstein Sep 20, 2023 Maintainer

dbuscombe-usgs Sep 20, 2023 Maintainer

tsotop
Sep 6, 2023

Replies: 10 comments 10 replies

ebgoldstein
Sep 6, 2023
Maintainer

tsotop Sep 6, 2023
Author

ebgoldstein
Sep 7, 2023
Maintainer

tsotop Sep 7, 2023
Author

ebgoldstein
Sep 7, 2023
Maintainer

tsotop
Sep 9, 2023
Author

tsotop
Sep 11, 2023
Author

ebgoldstein
Sep 14, 2023
Maintainer

tsotop Sep 14, 2023
Author

ebgoldstein
Sep 15, 2023
Maintainer

tsotop Sep 17, 2023
Author

ebgoldstein Sep 18, 2023
Maintainer

dbuscombe-usgs
Sep 19, 2023
Maintainer

tsotop Sep 20, 2023
Author

tsotop Sep 27, 2023
Author

ebgoldstein
Sep 20, 2023
Maintainer

tsotop Sep 21, 2023
Author

dbuscombe-usgs
Sep 20, 2023
Maintainer

ebgoldstein Sep 20, 2023
Maintainer

dbuscombe-usgs Sep 20, 2023
Maintainer