I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with batch size 28 ! #5311

Open
eklahari opened this issue Jun 19, 2024 · 2 comments

@eklahari

eklahari commented Jun 19, 2024

from register_dataset import *  # register custom dataset
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor, DefaultTrainer
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set as an environment variable before CUDA is initialized; a bare Python assignment has no effect
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.MASK_ON = False
cfg.DATASETS.TRAIN = ("football_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 28
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5 # Number of classes in the dataset

cfg.OUTPUT_DIR = "/output1"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
with open(os.path.join(cfg.OUTPUT_DIR, "config.yaml"), "w") as f:
    f.write(cfg.dump())

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
When I run this code with batch size 28, I get a CUDA error:

[Screenshot 2024-06-19 at 9 00 44 PM: the CUDA error]

But I am able to run this file on Windows, which has the same configuration as the Linux machine. What is the issue, and how can I overcome it? Could you please provide some code that trains well with an increased batch size in a Linux environment?

@github-actions github-actions bot commented Jun 19, 2024

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

@github-actions github-actions bot added the needs-more-info (More info is needed to complete the issue) label Jun 19, 2024
@eklahari eklahari changed the title from "I can't train the model with batchsize:28 in linux environment but i can get the training results in windows with same configuration ?" to "I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with same configuration ?" Jun 19, 2024
@github-actions github-actions bot removed the needs-more-info (More info is needed to complete the issue) label Jun 19, 2024
@eklahari eklahari changed the title from "I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with same configuration ?" to "I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with batch size 28 !" Jun 19, 2024
@Programmer-RD-AI
Contributor

Hi,
This is usually because of the different ways CUDA memory is managed in different environments.

There isn't any specific method to resolve this, but in a Linux environment where you are unable to train with a batch size of 28, you could try to:

  1. Reduce the Batch Size
  2. Go for a Smaller Model
  3. Use something like torch.cuda.memory_allocated() and torch.cuda.memory_cached() to check up on GPU memory allocation (a minimal sketch follows below)

These aren't guaranteed solutions, but they are other ways in which you can still train your model in a Linux environment.
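
A rough, minimal sketch of the three points above, not a drop-in fix: it assumes the same custom dataset registration ("football_train" via register_dataset) and 5-class head as the script in the issue, and the lower IMS_PER_BATCH value and the smaller COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml zoo config are only example choices.

from register_dataset import *  # same custom dataset registration as in your script
import os

import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer


def log_gpu_memory(tag=""):
    # memory_allocated(): bytes currently held by live tensors.
    # memory_reserved(): bytes held by PyTorch's caching allocator
    # (torch.cuda.memory_cached() is the older, deprecated name for the same value).
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    total = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, total={total:.2f} GiB")


cfg = get_cfg()
# Point 2: a lighter model-zoo config (Faster R-CNN, no mask head, which matches
# MASK_ON = False in your script) -- pick whatever smaller model suits your task.
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("football_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
# Point 1: start below 28 and raise it gradually while watching the printouts below.
cfg.SOLVER.IMS_PER_BATCH = 8
cfg.OUTPUT_DIR = "./output1"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

log_gpu_memory("before building trainer")  # point 3: inspect GPU memory as you scale up
trainer = DefaultTrainer(cfg)
log_gpu_memory("after building trainer")
trainer.resume_or_load(resume=False)
trainer.train()
log_gpu_memory("after training")
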
Hope that explains the issue.
If there are any more questions, please let me know.

Thank you
