`read_pdb`: Trailing whitespace is not removed in column `residue_name` #32

kamurani · 2024-04-22T02:35:51Z

When reading a protein PDB file with `read_pdb`, trailing whitespace is left in the values of the `residue_name` column.

Using the following code:

from moldf import read_pdb

# '_atom_site' mirrors the mmCIF category name and is the default
pdb = read_pdb(pdb_file="AF-Q05655-F1-model_v2.pdb", category_names=['_atom_site'])
atoms_df = pdb['_atom_site']

# Get the distinct values of residue_name
list(atoms_df.residue_name.unique())

This yields:

['MET ',
 'ALA ',
 'PRO ',
 'PHE ',
 'LEU ',
 'ARG ',
 'ILE ',
 'ASN ',
 'SER ',
 'TYR ',
 'GLU ',
 'GLY ',
 'GLN ',
 'ASP ',
 'CYS ',
 'VAL ',
 'LYS ',
 'THR ',
 'TRP ',
 'HIS ']

This whitespace should be trimmed so that filtering can take place properly.

Happy to submit a PR for this.

Ruibin-Liu · 2024-07-29T14:01:13Z

Yes, it is annoying when filtering on residue_name.

I implemented it that way purely for performance: not trimming the extra space from every residue name saves work while reading lines. Filtering through the PDBDataFrame class instead of the raw DataFrame is more convenient and less confusing. Its API is documented here: https://moldf.readthedocs.io/en/latest/api.html#moldf.pdb_dataframe.PDBDataFrame.residue_names

To avoid the confusion, we can either sacrifice a little reading performance by adding a keyword to the read_pdb function so that the residue_name column is trimmed by default, or return a PDBDataFrame object directly instead of the base class. The latter needs more polishing before we can use it confidently.
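In the meantime, the trimming can be done after reading, outside the line-parsing loop. Below is a minimal sketch using plain pandas. The helper name `strip_string_columns` is hypothetical and not part of moldf, and the DataFrame is a hand-built stand-in for `pdb['_atom_site']` rather than real `read_pdb` output.

```python
import pandas as pd


def strip_string_columns(df, columns=None):
    """Return a copy of df with surrounding whitespace removed from string columns.

    Hypothetical helper for illustration; it is not part of moldf.
    """
    out = df.copy()
    if columns is None:
        # Only touch columns that hold strings; numeric columns are left alone.
        columns = [c for c in out.columns if pd.api.types.is_string_dtype(out[c])]
    for col in columns:
        out[col] = out[col].str.strip()
    return out


# Hand-built stand-in for pdb['_atom_site'] (assumed column names from this thread).
atoms_df = pd.DataFrame(
    {
        "record_name": ["ATOM  ", "ATOM  ", "TER   ", "HETATM"],
        "residue_name": ["MET ", "ALA ", "ALA ", "HOH "],
        "atom_number": [1, 2, 3, 4],
    }
)

clean_df = strip_string_columns(atoms_df)
print(list(clean_df.residue_name.unique()))  # ['MET', 'ALA', 'HOH']
```

Doing the strip once per column on the finished DataFrame is vectorised, so it may cost less than trimming each value during parsing, though that has not been measured here.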

I am happy to review your PR!

kamurani · 2024-08-16T01:28:31Z

I've just realised the same occurs in other columns, for example:

>>> atom_df = pdb["_atom_site"]
>>> atom_df.record_name.unique()
array(['ATOM  ', 'TER   ', 'HETATM'], dtype=object)

This can be quite confusing, especially when filtering like this:

dff = atom_df[atom_df.record_name == "ATOM"]  # returns no rows, because the stored value is 'ATOM  '
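For a one-off filter, comparing against the stripped values works around this. The sketch below uses a hand-built frame whose `record_name` values mimic the output above; the `atom_name` column and its values are made up for illustration.

```python
import pandas as pd

# Stand-in for pdb["_atom_site"], with the padded values shown above.
atom_df = pd.DataFrame(
    {
        "record_name": ["ATOM  ", "ATOM  ", "TER   ", "HETATM"],
        "atom_name": ["N", "CA", "", "O"],  # illustrative values only
    }
)

# Exact comparison against the unpadded name matches nothing.
naive = atom_df[atom_df.record_name == "ATOM"]
print(len(naive))  # 0

# Stripping before comparing selects the ATOM records.
fixed = atom_df[atom_df.record_name.str.strip() == "ATOM"]
print(len(fixed))  # 2
```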
