Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when iterating over multiple structures #317

Open
jackdent opened this issue May 29, 2024 · 4 comments
Open

Segfault when iterating over multiple structures #317

jackdent opened this issue May 29, 2024 · 4 comments

Comments

@jackdent
Copy link

jackdent commented May 29, 2024

Suppose I'm working with a randomly chosen PDB file:

import urllib.request

pdb_id = "5O3U"
local_path = f"/tmp/{pdb_id}.cif.gz"
urllib.request.urlretrieve(f"https://files.rcsb.org/download/{pdb_id}.cif.gz", local_path)

The following code runs perfectly fine:

structure = gemmi.read_structure(local_path)
chains = [chain for chain in structure[0]]
subchains = [chain.subchains() for chain in chains]
print(subchains)

However, the following code results in a segfault:

all_subchains = []

for _ in range(5):
    structure = gemmi.read_structure(local_path)
    chains = [chain for chain in structure[0]]
    subchains = [chain.subchains() for chain in chains]
    all_subchains.append(subchains)

print(all_subchains)

The error is:

Fatal Python error: Segmentation fault

Current thread 0x00007f36ed1541c0 (most recent call first):
  File "/workspaces/models/notebooks/tmp.py", line 25 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, _cffi_backend, google.protobuf.pyext._message, grpc._cython.cygrpc (total: 17)
Segmentation fault (core dumped)

This issue strikes me as similar to #86

@jackdent jackdent changed the title Segault when iterating over multiple structures Segfault when iterating over multiple structures May 29, 2024
@jackdent
Copy link
Author

This prints the residue fine:

print(all_subchains[-1])

However, this results in a segfault, which indicates that gemmi might be garbage collecting the memory when the variable gets reassigned within each iteration of the loop:

print(all_subchains[0])

@jackdent
Copy link
Author

Interestingly, I can resolve this issue by storing an explicit reference to the structure in a separate list. This implies each structure gets garbage collected after each iteration of the loop, and I assume that results in the memory for the residues also getting garbage collected. In other words, holding a reference to the residue is insufficient to prevent that memory from getting deallocated–it seems as if you need to hold a reference to the entire structure to keep the memory for the residue around.

all_structures = []
all_subchains = []

for _ in range(5):
    structure = gemmi.read_structure(local_path)
    all_structures.append(structure)
    chains = [chain for chain in structure[0]]
    subchains = [chain.subchains() for chain in chains]
    all_subchains.append(subchains)

print(all_subchains)

@wojdyr
Copy link
Member

wojdyr commented May 29, 2024

Yes, more or less.
Chains and residues are deleted when C++ Structure is deleted. To keep using chains or residues, unless you make a copy of them, it's necessary to keep the structure alive. This is almost always done automatically: python Chain object contains a reference to Model, Model contains a reference to Structure, and that prevents the Python interpreter's GC from discarding Structure. These references are a Python thing, they are added as part of Python bindings and they only control the lifespan of objects. It's done by simple annotations in pybind11, no extra coding is required.

The subchains() function is an exception, because it returns a Python list, not a pybind11 object. It'd be possible to make each subchain in the list to keep the parent chain alive, but that's more involved, like in this answer. I'll think about it after switching to nanobind.

@jackdent
Copy link
Author

Nice--I saw that nanobind v2.0.0 just got released last week.

For the benefit of anyone else who finds this thread: I found a better solution than passing around a reference to the original structure that works for some use cases. In our case, we're extracting the Residue data from the subchains and then passing those around. We can call residue.clone() to make a copy of the data with a lifecycle that's independent of the Structure object lifecycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants