Replies: 10 comments 10 replies
-
Hi Tulio, (One thing to check is the example overlays created by make_dataset - they are a great way to see if the code is working as expected on your data.) (I have successfully worked with this sort of 'no data' area before with Gym, so I bet we can make it work for you) take care,
-
Tulio, thanks,
-
OK:
Can you try to adjust that target size and see if it works for you?
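For reference, here's a rough sketch (an illustration only, not Gym's actual logic) of snapping a target-size dimension to the nearest multiple of 2**depth, since U-Net-style encoder/decoder stacks generally require divisibility by the total downsampling factor:

```python
# Rough sketch (illustration only, not Gym code): snap a TARGET_SIZE
# dimension to the nearest multiple of 2**depth, the total downsampling
# factor of a U-Net-style encoder/decoder.
def nearest_valid_size(n, depth=4):
    factor = 2 ** depth
    return max(factor, round(n / factor) * factor)

print(nearest_valid_size(284))  # -> 288
print(nearest_valid_size(260))  # -> 256
```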
-
Hey Evan, the model started training, thank you! Unfortunately, at epoch 13 the 'slowed down' warning popped up and kept repeating every couple of minutes while training seemed completely halted. Here's the output:

```
...
Epoch 12: LearningRateScheduler setting learning rate to 5.5045e-06.
Epoch 13: LearningRateScheduler setting learning rate to 6.0040000000000005e-06.
[Compiling module a_inference__update_step_xla_735845__XlaMustCompile_true_config_proto_8589078909834744431_executor_type_11160318154034397263_.56]
Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-09-07 19:50:58.621014: E tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 3m42.957210043s
```

After this, based on a post in the Issues section, I did a HOT_START from epoch 13; specifically, I added the following lines to my config file:

This trained the model up to epoch 24 before showing the same issue. I repeated the process from epoch 23 to 36, then 35-47, 46-58, 57-70, 70-82, 82-95... until it basically arrived at my max epoch of 100. At this point, I am not even sure whether to continue or not. Strangely, it slows down every ~12 epochs. I am using the first resunet model. This is the final output after the 100 epochs:

Judging from the output plot, it only trained on the last 6 epochs (95-100). Maybe I wrote something wrong in the config file? I then tried the segformer model. It trained really fast and the validation results are good enough:
HOWEVER, when using seg_images_in_folder.py, the results are not accurate at all... It also displays the following warning (which I am not sure what it means):
I attached my segformer model (config, modelOut, weights, toPredict). You probably have a better idea of what is going on. segformer_weights.zip segformer_config_modelOut_toPredict.zip I'll also upload the results of the other models if I manage to run them. Thanks again for the help!
-
I tried running the unet model. After each iteration, it properly reported loss, mean_iou, and dice coeff; however, it also had this message:
I believe the variable `loss` refers to `val_loss`, but it is not defined like that within the script? Also, the model did not stop until epoch 100, and displayed this error message at the end, which is probably the same issue:
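My guess at what's happening (a sketch of my own, not the actual Gym code): a callback configured to monitor `loss` while the logged metrics only contain validation metrics would warn like this every epoch:

```python
# Hypothetical illustration (not the actual Gym code): a callback
# monitoring "loss" when the epoch logs only expose validation metrics
# would produce a "metric not found" warning every epoch.
history_logs = {"val_loss": 0.12, "val_mean_iou": 0.80}  # hypothetical epoch logs
monitor = "loss"
if monitor not in history_logs:
    print(f"WARNING: metric '{monitor}' not found in logs: {sorted(history_logs)}")
```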
I made sure I installed the latest versions of the packages.

Tulio
-
Sorry for the delay @tsotop - @dbuscombe-usgs and I have discussed this, and one or both of us will train up a model using your data and the config file you sent. It might take a week for us to get to it... but we will figure it out!
-
OK, with the config that @tsotop provided, I was getting the
I still think there is a bug here, but I am hoping it's a corner case and that these modifications will prohibit it from happening again... so, @tsotop, can you try training a model using the attached config and report back? (I had to change it to a
-
I'm sorry I've been out of the loop. Tulio, thanks for reporting, and for your patience. The segformer tends to be more accurate than the resunet for all of the segmentations I've done recently, so I'm surprised by the large errors. Evan, thanks for helping address this issue.

On the slowdown issue, I'm not sure what could be causing it. I also recommend monitoring
and finally try
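One simple thing worth monitoring, for example, is per-epoch wall time; a rough sketch (an illustration only, not Gym code) of flagging epochs that take far longer than the baseline, like the ~12-epoch slowdowns reported above:

```python
# Rough sketch (illustration only, not Gym code): record seconds per
# epoch and flag epochs that take far longer than the median baseline,
# e.g. a slow XLA recompile.
def flag_slow_epochs(durations, factor=3.0):
    base = sorted(durations)[len(durations) // 2]  # median-ish baseline
    return [i for i, d in enumerate(durations) if d > factor * base]

# seconds per epoch; epoch index 4 simulates a slow recompile
print(flag_slow_epochs([10, 11, 10, 12, 240, 11]))  # -> [4]
```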
-
OK, I can confirm that the segformer model provided above is not able to be used for predictions. @tsotop - I think this will take us some time to sit down and fix. Both @dbuscombe-usgs and I are working hard on various other projects for the next few weeks... I think you have 3 options:
-
Thanks @ebgoldstein for confirming this. If the model trained well, but is hopeless in prediction, it could be
Not sure what else could cause it... at this point, 1. seems most likely. It's been probably almost 2 months since I used the

If it is a saved-model format issue, there are two utilities that may help:

https://github.com/Doodleverse/segmentation_gym/blob/main/utils/gen_fullmodel_from_h5.py will 'rebuild' your model from h5

https://github.com/Doodleverse/segmentation_gym/blob/main/utils/gen_saved_model.py takes the h5 format and converts it to a compiled .pb model format

I probably can't help until at least Monday, and then it may take some time to implement any changes to Gym, because they need to be tested with downstream applications (zoo, coastseg, and seg2map).
-
Hello there,
I've been trying to use segmentation_gym, but I've encountered some challenges. Currently, I'm facing an issue with training a model, and I suspect it's related to my data. I was able to run make_dataset.py and train_model.py successfully using the example dataset (capehatteras) provided with the guide. In any case, I'd like to share the problems I've encountered and how I managed to solve them.
1. Installation:
I'm using WSL2 (Ubuntu 22.x.x.) on Windows 11. The installation process was relatively straightforward following the provided guide, and I had no issues with the GPU installation. However, I did notice that some library paths (e.g., cuda) were not included, causing crashes and errors. I had to add them manually, like this: `export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH`.
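The same fix can also be applied from Python before any GPU libraries are loaded (a hypothetical illustration; the cuda path is assumed for a default WSL2 install):

```python
import os

# Hypothetical illustration of the manual fix: prepend the CUDA library
# directory (path assumed for a default WSL2 CUDA install) to
# LD_LIBRARY_PATH before GPU libraries are imported.
cuda_lib = "/usr/local/cuda/lib"
existing = os.environ.get("LD_LIBRARY_PATH", "")
os.environ["LD_LIBRARY_PATH"] = cuda_lib + ((":" + existing) if existing else "")
print(os.environ["LD_LIBRARY_PATH"].split(":")[0])  # -> /usr/local/cuda/lib
```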
2. Data (images):
I'm working with .jpg clips extracted from an orthophoto (.tif). Initially, I faced problems with make_dataset.py, which was giving me an error message saying, "unable to fit an X size into a [Y, Z] array." Upon closer inspection, I realized that the jpg clips had a 32-bit depth. Examining the orthophoto, I found it had 4 bands (R, G, B, alpha). To resolve this issue, I removed the alpha band and reclipped the images, resulting in 24-bit jpgs that allowed me to create the dataset successfully.
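The alpha-band fix above can be sketched in a few lines (an illustration with synthetic data, not the exact clipping workflow I used):

```python
import numpy as np

# Illustration with synthetic data (not the exact clipping workflow):
# a 4-band RGBA clip is 32-bit; dropping the alpha band leaves a
# 3-band, 24-bit RGB image that make_dataset.py accepts.
rgba = np.zeros((8, 8, 4), dtype=np.uint8)  # simulated RGBA clip
rgba[..., 3] = 255                          # opaque alpha band
rgb = rgba[..., :3]                         # keep only R, G, B
print(rgb.shape)  # -> (8, 8, 3)
```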
3. make_dataset.py:
I suspect there might be a minimum requirement for the number of images (or classes within images) for make_dataset.py to run successfully. Can someone confirm this? Initially, I attempted with 10 images without success, but when I used 20 images, it worked as expected (or at least I believe so).
4. train_model.py:
My current roadblock is with the training process, and I suspect it's related to my data. I'm encountering the following error message:
```
ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 284, 260, 64), (None, 283, 259, 16)]
```

My data includes black (blank) areas that are not segmented into any classes within DashDoodler. In fact, Doodler won't even detect these areas. These black regions result from the image clipping process, where I enforce a specific image size for practicality. You can view the images and labels I used for running make_dataset.py in this link. I segmented them into three classes: "wet," "dry," and "other."
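For context, mismatched shapes like (284, 260) vs (283, 259) are typical of U-Net-style skip connections when an input dimension is not divisible by 2**depth: repeated downsampling of an odd dimension loses a pixel that the upsampling path cannot restore. A minimal sketch of checking a size (my own illustration, not Gym code):

```python
# My own illustration (not Gym code): a U-Net-style network with
# `depth` downsampling stages needs both image dimensions divisible
# by 2**depth, or the Concatenate skip connections see mismatched
# shapes like 284 vs 283.
def valid_unet_size(h, w, depth=4):
    factor = 2 ** depth
    return h % factor == 0 and w % factor == 0

print(valid_unet_size(284, 260))  # -> False (shapes drift, as in the error)
print(valid_unet_size(512, 512))  # -> True
```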
Any insights on how to handle this situation would be greatly appreciated.
Cheers,
Tulio