When I trained 11G S3DIS, there was an error #139

Closed
Avril-Dragon opened this issue Jul 16, 2024 · 6 comments

Comments

@Avril-Dragon

Avril-Dragon commented Jul 16, 2024

Traceback (most recent call last):
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/utils/utils.py", line 45, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "src/train.py", line 115, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1028, in _run_stage
    self._run_sanity_check()
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1057, in _run_sanity_check
    val_loop.run()
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 370, in _evaluation_step
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=dataloader_idx)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 311, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 277, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 359, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("on_after_batch_transfer", batch, dataloader_idx)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 347, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 181, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/datamodules/base.py", line 333, in on_after_batch_transfer
    return on_device_transform(nag)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/torch_geometric/transforms/compose.py", line 24, in __call__
    data = transform(data)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/transforms/transforms.py", line 23, in __call__
    return self._process(x)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/transforms/graph.py", line 1359, in _process
    nag[i_level].node_size = nag.get_sub_size(i_level, low=self.low)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/data/nag.py", line 58, in get_sub_size
    sub_sizes = self[low + 1].sub.sizes
AttributeError: 'list' object has no attribute 'sizes'

I printed the relevant values in nag.py like this:

print(self[low+1])
print(self[low + 1].sub)
print(type(self[low + 1].sub))
log_size=[19731, 1], log_surface=[19731, 1], log_volume=[19731, 1], normal=[19731, 3], super_index=[19731],
sub=[1], batch=[19731], ptr=[2])
[Cluster(num_clusters=19731, num_points=660524, device=cuda:0)]
<class 'list'>

Thanks for any help you can provide.

@drprojects
Owner

drprojects commented Jul 16, 2024

It seems self[low + 1].sub is a List(Cluster) instead of simply being a Cluster. This is the first time I have seen this issue, and I am not sure yet how it appeared. Have you made any modification to the code, even a minor one? Can you please share the exact bash command you are running?

If you ❤️ or use this project, don't forget to give it a ⭐, it means a lot to us !

@Avril-Dragon
Author

To make sure I had not made any modifications, I re-extracted the ZIP and only copied the dataset in. While running, I often hit a "PIPE LINE ERROR" because of insufficient memory, but after re-running I successfully generated the S3DIS data, and then I encountered the error above.
By the way, I noticed a warning during processing:
image
Is there a problem with my method?

@va-kiet

va-kiet commented Jul 17, 2024

I have the same problem when training on the DALES dataset without any modification to the code. The only difference is that I ran in a Python venv instead of a conda environment (but I don't think it really matters). Here are the logs I got:

output.log

When I print self[low + 1].sub, the output is: [Cluster(num_clusters=42880, num_points=1324840, device=cuda:0), Cluster(num_clusters=35140, num_points=1080737, device=cuda:0), Cluster(num_clusters=30092, num_points=959261, device=cuda:0), Cluster(num_clusters=36576, num_points=1147082, device=cuda:0)]

I guess the issue lies in the process of packing data into batches. In my case, the batch_size was 4, which resulted in a batch of 4 Clusters stored as a List. The whole List was then passed into the transform instead of a single Cluster, which may be the cause of this problem.
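To make the guess above concrete, here is a minimal sketch of the failure mode in plain Python. It does not use the project's code or PyG: `Cluster`, `Sample`, and `naive_collate` are hypothetical stand-ins, and the only assumption is that a collate step gathers objects it cannot merge into a plain list.

```python
class Cluster:
    """Hypothetical stand-in for the project's Cluster: stores per-cluster sizes."""

    def __init__(self, sizes):
        self.sizes = list(sizes)


class Sample:
    """Hypothetical stand-in for one data item holding a `sub` Cluster."""

    def __init__(self, sub):
        self.sub = sub


def naive_collate(samples, key):
    """Toy collate: objects it cannot merge are simply gathered in a list."""
    return [getattr(sample, key) for sample in samples]


# Two items in the batch, so `sub` ends up as a list of two Clusters
samples = [Sample(Cluster([3, 2])), Sample(Cluster([4]))]
sub = naive_collate(samples, "sub")

# Downstream code expects a single Cluster-like object with `.sizes`
try:
    sub.sizes
    error = None
except AttributeError as exc:
    error = str(exc)

print(type(sub).__name__, "->", error)
```

The list itself is perfectly valid; the error only appears later, when a transform accesses `.sizes` on it, which matches the last frame of the traceback.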

@va-kiet

va-kiet commented Jul 17, 2024

I solved this issue by editing line 933 of src/data/data.py: deleting "and isinstance(batch.sub, Cluster)" makes it work. After checking the previous version of this file, I found that isinstance(batch.sub, Cluster) always returns False in this condition, so the batch stays stuck as a List instead of being converted to a ClusterBatch.

image
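As a rough illustration of why that extra check blocks the conversion, here is a sketch in plain Python. It is not the code at line 933: `Cluster`, `ClusterBatch.from_list`, `Batch`, and the `is_super` flag are hypothetical stand-ins chosen for the example, and the only thing assumed from the thread is that `batch.sub` is a list of Clusters at this point.

```python
class Cluster:
    """Hypothetical stand-in: stores per-cluster sizes."""

    def __init__(self, sizes):
        self.sizes = list(sizes)


class ClusterBatch(Cluster):
    """Hypothetical stand-in for a batch of Clusters merged into one object."""

    @classmethod
    def from_list(cls, clusters):
        # Concatenate the sizes of every Cluster in the list
        return cls([size for c in clusters for size in c.sizes])


class Batch:
    """Hypothetical stand-in for a collated batch whose `sub` is a list."""

    def __init__(self, sub, is_super=True):
        self.sub = sub
        self.is_super = is_super


def convert_sub_with_check(batch):
    # A list is never an instance of Cluster, so this branch is skipped
    if batch.is_super and isinstance(batch.sub, Cluster):
        batch.sub = ClusterBatch.from_list(batch.sub)
    return batch


def convert_sub_without_check(batch):
    # Without the isinstance test, the list is always converted
    if batch.is_super:
        batch.sub = ClusterBatch.from_list(batch.sub)
    return batch


stuck = convert_sub_with_check(Batch([Cluster([3, 2]), Cluster([4])]))
fixed = convert_sub_without_check(Batch([Cluster([3, 2]), Cluster([4])]))

print(type(stuck.sub).__name__)  # list
print(type(fixed.sub).__name__, fixed.sub.sizes)  # ClusterBatch [3, 2, 4]
```

In this sketch the guarded version leaves `sub` as a list, so any later access to `.sizes` fails, while the unguarded version yields a single object that exposes `.sizes`.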

@Avril-Dragon
Author

It works! Thanks!

@drprojects
Owner

Good catch @va-kiet! There was indeed an error there, since the PyG behavior of Batch.from_data_list() would return a List(Cluster) by default. Your fix was the correct one, and I have integrated it in the latest commit.

drprojects added a commit to vschelbi/superpoint_transformer_vschelbi that referenced this issue Aug 7, 2024