read_pdb: Trailing whitespace is not removed in column residue_name #32

Open
kamurani opened this issue Apr 22, 2024 · 2 comments

@kamurani

When reading a protein PDB file with read_pdb, trailing whitespace is left in the values of the residue_name column.

Source: Q05655

Using the following code:

from moldf import read_pdb

# '_atom_site' mirrors the mmCIF category name and is the default
pdb = read_pdb(pdb_file="AF-Q05655-F1-model_v2.pdb", category_names=['_atom_site'])
atoms_df = pdb['_atom_site']

# Get values for residue_name
list(atoms_df.residue_name.unique())

This yields:

['MET ',
 'ALA ',
 'PRO ',
 'PHE ',
 'LEU ',
 'ARG ',
 'ILE ',
 'ASN ',
 'SER ',
 'TYR ',
 'GLU ',
 'GLY ',
 'GLN ',
 'ASP ',
 'CYS ',
 'VAL ',
 'LYS ',
 'THR ',
 'TRP ',
 'HIS ']

This whitespace should be trimmed so that filtering can take place properly.
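Until this is handled in the reader, the columns can be cleaned after loading. The following is a minimal sketch of that workaround using plain pandas on a toy stand-in for the real table; `strip_columns` is a hypothetical helper, not part of moldf.

```python
import pandas as pd


def strip_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with surrounding whitespace removed from the given columns.

    Hypothetical helper for illustration; not a moldf function.
    """
    out = df.copy()
    for col in columns:
        # The isinstance guard lets non-string values (e.g. NaN) pass through unchanged.
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


# Toy stand-in for pdb['_atom_site'], using values from the output above.
atoms_df = pd.DataFrame({"residue_name": ["MET ", "ALA ", "PRO "]})

clean_df = strip_columns(atoms_df, ["residue_name"])
print(list(clean_df.residue_name.unique()))  # ['MET', 'ALA', 'PRO']
```

The helper returns a copy, so the original padded values remain available if anything downstream relies on the fixed-width form.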

Happy to submit a PR for this.

@Ruibin-Liu
Owner

Ruibin-Liu commented Jul 29, 2024

Yes, it is annoying when we try to do filtering using residue_name.

I implemented it that way purely for performance: skipping the trim means we don't have to strip the extra space from every residue name while reading lines. Filtering with the PDBDataFrame class instead of the raw DataFrame is more convenient and less confusing. Its API is documented here: https://moldf.readthedocs.io/en/latest/api.html#moldf.pdb_dataframe.PDBDataFrame.residue_names

To avoid the confusion, we can either sacrifice a little reading performance by adding a keyword to the read_pdb function so that the residue_name column holds the trimmed values by default, or directly return a PDBDataFrame object instead of the base class. The latter requires more polishing of the code before we can use it confidently.
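One way the trimming cost might be kept small is to strip each distinct value once and then map the rows, since a column such as residue_name holds only a handful of distinct strings. This is a sketch under that assumption, done after reading rather than inside read_pdb; `strip_via_unique` is a hypothetical name, and whether it actually helps would need benchmarking.

```python
import pandas as pd


def strip_via_unique(column: pd.Series) -> pd.Series:
    """Strip whitespace by cleaning each distinct value once, then mapping rows.

    Hypothetical helper; assumes the column contains only strings (or NaN).
    """
    # One strip() call per distinct value instead of one per row.
    mapping = {v: v.strip() for v in column.unique() if isinstance(v, str)}
    # Values absent from the mapping (such as NaN) come back as NaN.
    return column.map(mapping)


residue_name = pd.Series(["MET ", "ALA ", "MET ", "PRO "])
print(list(strip_via_unique(residue_name)))  # ['MET', 'ALA', 'MET', 'PRO']
```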

I am happy to review your PR!

@kamurani
Author

I've just realised the same occurs in other columns, for example:

>>> atom_df = pdb["_atom_site"]
>>> atom_df.record_name.unique()
array(['ATOM  ', 'TER   ', 'HETATM'], dtype=object)

This can be quite confusing, especially when filtering like this:

dff = atom_df[atom_df.record_name == "ATOM"]  # matches no rows: the stored value is 'ATOM  ' (padded)
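As a stopgap, the comparison can strip the values inside the mask. The sketch below uses a toy DataFrame with the record_name values shown above rather than a real moldf result.

```python
import pandas as pd

# Toy stand-in for pdb["_atom_site"], with the padded values reported above.
atom_df = pd.DataFrame({"record_name": ["ATOM  ", "ATOM  ", "TER   ", "HETATM"]})

# Exact comparison against the unpadded string matches nothing.
print(len(atom_df[atom_df.record_name == "ATOM"]))  # 0

# Stripping inside the mask selects the rows and leaves stored values untouched.
atoms_only = atom_df[atom_df.record_name.str.strip() == "ATOM"]
print(len(atoms_only))  # 2
```

Comparing against the padded form, `"ATOM  "`, also works, since every value shown above is six characters wide, but it is easy to get the padding wrong.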
