I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with batch size 28 ! #5311

Open
eklahari opened this issue Jun 19, 2024 · 2 comments

@eklahari

eklahari commented Jun 19, 2024

from register_dataset import *  # register custom dataset
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor, DefaultTrainer
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set as an environment variable before CUDA is initialized; a bare Python assignment has no effect
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.MASK_ON = False
cfg.DATASETS.TRAIN = ("football_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 28
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5 # Number of classes in the dataset

cfg.OUTPUT_DIR = "/output1"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
with open(os.path.join(cfg.OUTPUT_DIR, "config.yaml"), "w") as f:
    f.write(cfg.dump())

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
When I run this code with batch size 28, I get a CUDA error:

[Screenshot 2024-06-19 at 9 00 44 PM: the CUDA error]

But I am able to run this file on Windows, which has the same configuration as the Linux machine. What is the issue, and how can I overcome it? Could you please provide some code that trains well with an increased batch size in a Linux environment?

@github-actions github-actions bot commented Jun 19, 2024

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

@github-actions github-actions bot added the needs-more-info (More info is needed to complete the issue) label Jun 19, 2024
@eklahari eklahari changed the title from "I can't train the model with batchsize:28 in linux environment but i can get the training results in windows with same configuration ?" to "I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with same configuration ?" Jun 19, 2024
@github-actions github-actions bot removed the needs-more-info (More info is needed to complete the issue) label Jun 19, 2024
@eklahari eklahari changed the title from "I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with same configuration ?" to "I can't train the model with batch size : 28 in linux environment but I can get the training results in windows with batch size 28 !" Jun 19, 2024
@Programmer-RD-AI
Contributor

Hi,
This is usually because of the different ways CUDA memory is managed in different environments.

There isn't any specific method to resolve this, but in a Linux environment where you are unable to train with a batch size of 28, you could try to:

  1. Reduce the Batch Size
  2. Go for a Smaller Model
  3. Use something like torch.cuda.memory_allocated() and torch.cuda.memory_cached() to check up on GPU memory allocation (a minimal sketch follows below)

These aren't guaranteed solutions, but they are other ways in which you can still train your model in a Linux environment.
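
A rough, minimal sketch of the three points above, not a drop-in fix: it assumes the same custom dataset registration ("football_train" via register_dataset) and 5-class head as the script in the issue, and the lower IMS_PER_BATCH value and the smaller COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml zoo config are only example choices.

from register_dataset import *  # same custom dataset registration as in your script
import os

import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer


def log_gpu_memory(tag=""):
    # memory_allocated(): bytes currently held by live tensors.
    # memory_reserved(): bytes held by PyTorch's caching allocator
    # (torch.cuda.memory_cached() is the older, deprecated name for the same value).
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    total = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, total={total:.2f} GiB")


cfg = get_cfg()
# Point 2: a lighter model-zoo config (Faster R-CNN, no mask head, which matches
# MASK_ON = False in your script) -- pick whatever smaller model suits your task.
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("football_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
# Point 1: start below 28 and raise it gradually while watching the printouts below.
cfg.SOLVER.IMS_PER_BATCH = 8
cfg.OUTPUT_DIR = "./output1"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

log_gpu_memory("before building trainer")  # point 3: inspect GPU memory as you scale up
trainer = DefaultTrainer(cfg)
log_gpu_memory("after building trainer")
trainer.resume_or_load(resume=False)
trainer.train()
log_gpu_memory("after training")
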
Hope that explains the issue.
If there are any more questions, please let me know.

Thank you
