Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding pdb_addter #112

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

andrewsb8
Copy link

Hi,

Wrote a script to add TER lines in user defined ways similar to pdb_delres.py. This would be a good functionality for multiple reasons:

  • PDB files with multiple proteins which are aggregates typically need termini changed before simulating and adding TER lines at specific intervals will help remove manual editing of PDB files.
  • It would work as a good preparation step for large PDB files before pdb_chainbows.py.
  • Large system pdb files (which I uploaded as a test file for this script, 692 residues) which require multiple TER entries (potentially at different intervals) prior to simulation can be made much more easily with a script.

I include pdbtools/pdb_addter.py, tests/test_pdb_addter.py, and tests/data/addter_test.pdb where all 13 unit tests in test_pdb_addter.py pass.

Let me know what you think,

Brian

Copy link
Member

@joaomcteixeira joaomcteixeira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andrewsb8

First of all thanks for coming interacting with pdb-tools ☺️
I really enjoyed the fact that you followed very well the structure of our code and project. You really got into it 😉

I am adding some review comments. Cheers!

pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
prev_res = res_id
res_counter += 1
if res_counter - min(residue_range) != 0 and (res_counter - min(residue_range)) % step == 0 and int(line[22:26]) in residue_range:
yield "TER\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use os.linesep. Likely we don't have that in all tools but it is something we need to change. so let's start here ☺️

pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
tests/test_pdb_addter.py Outdated Show resolved Hide resolved


if __name__ == '__main__':
main()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to chain pdb_addter with pdb_chainbow using your example, test/data/addter_test.pdb and I have this error:

· python pdbtools/pdb_addter.py -1::10 tests/data/addter_test.pdb | python pdbtools/pdb_chainbows.py
Traceback (most recent call last):
  File "pdbtools/pdb_chainbows.py", line 159, in <module>
    main()
  File "pdbtools/pdb_chainbows.py", line 138, in main
    for lineno, line in enumerate(new_pdb):
  File "pdbtools/pdb_chainbows.py", line 106, in run
    chain_map[line[21]] = curchain
IndexError: string index out of range

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've figured out the reason for this. Even without using pdb_add_manual_ter here, the test pdb has a TER line (but not a full TER entry). This is common in pdb files output from Gromacs. These files will look like this at the end:

.
.
ATOM ...
ATOM ...
TER
ENDMDL

So pdb files output from Gromacs will break pdb_chainbows just like pdb_add_manual_ter does (before the changes of course I am making of course 😄).

Copy link
Member

@JoaoRodrigues JoaoRodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks for the contribution and effort in writing this tool! It's very consistent with our style.

Besides a few minor points in the code, my main concern is the name of the tool and its purpose. I'd understand addter as a tool that would add TER statements at the right places. We've had discussions about such a tool in the past, but it wasn't very productive because it's not a simple task. The PDB format spec doesn't really tell you when to add TERs appropriately. I'm afraid that this tool would produce "non-standard" PDBs, which is fine to be honest, but then we should change the name to make it a little more obvious that that is the case, particularly, since the default mode of action of the tool is to add a TER after every residue.

What about pdb_add_manual_ter? A bit longer, but at least more specific.

pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
pdbtools/pdb_addter.py Outdated Show resolved Hide resolved
tests/test_pdb_addter.py Outdated Show resolved Hide resolved
@joaomcteixeira
Copy link
Member

Another example, adds TER before the start of the PDB.

python pdbtools/pdb_addter.py -4::  tests/data/dummy.pdb

REMARK    CHAIN C DOES NOT HAVE ELEMENTS FOR LAST RESIDUE
REMARK    CHAIN C CONTAINS RESIDUES OUT OF ORDER
REMARK    CHAIN D IS NUCLEIC ACID AND LACKS A TER STATEMENT
TER
ATOM      1  N   ARG B   4      36.898  42.175  -2.688  1.00  0.00           N  
ATOM      2  H   ARG B   4      37.673  41.570  -2.916  1.00  0.00           H  
ATOM      3  H2  ARG B   4      35.929  41.470  -2.758  1.00  0.00           H  
ATOM      3  H3  ARG B   4      37.099  42.392  -1.524  1.00  0.00           H  
ATOM      3  CA  ARG B   4      37.080  43.455  -3.421  1.00  0.00           C  
ATOM      3  HA  ARG B   4      38.102  43.960  -3.065  1.00  0.00           H  

@joaomcteixeira
Copy link
Member

Another comment,
The PDB you propose as a test case has about 47k lines. That is too big for the scope of this project. Please look for a minimum case scenario needed to test the new tool. If this PR were accepted, the 47k lines from your PDB would collapse the whole statistics of this repository and that wouldn't be fair. For example, dummy PDB has about 200 lines.
Cheers!

@joaomcteixeira
Copy link
Member

What about pdb_add_manual_ter? A bit longer, but at least more specific.

That could be a proper name. As @JoaoRodrigues said, addter might be confusing because of pdb_tidy which already does correct TER addition.

@andrewsb8
Copy link
Author

Thanks for the feedback!

The name change makes sense. Having ability to manually place TER records will be helpful ultimately given the other scripts' purposes (based on my understanding). Ex: chainbows requires TER records, tidy requires Chain identifiers. Some tools don't always give Chain IDs (CHARMM-GUI, Gromacs if Chain IDs aren't provided and no TER records already in place, VMD in some cases, etc). So this tool could be complicated but would help complete the suite as some of the tools need one or the other to already exist.

I have some questions and comments as well:

  • Should I rename my branch? Or is just renaming the script and test sufficient?
  • I get an error using chainbows as well (on a normal pdb where I did not use pdb_addter.py first). I don't have the error message, but the last commit message for that script is "half way done". So I assumed that was not complete yet and didn't submit an issue.
  • For tests with nonsequential residue numbers, it might be best to throw an error when that's the case here. Redirect the user to pdb_reres to avoid undesired results and potential confusion.

Otherwise, I will get to those changes, write tests for a smaller pdb file and dummy.pdb (based on your feedback to the above suggestion), and make the names appropriate. Thanks for your time.

Brian

@joaomcteixeira
Copy link
Member

Hi,

you can stay on the same branch, no need to do a new branch,

I get an error using chainbows as well (on a normal pdb where I did not use pdb_addter.py first). I don't have the error message, but the last commit message for that script is "half way done". So I assumed that was not complete yet and didn't submit an issue.

i don't understand, can you rephrase? Can it be that chainbow has a bug?

For tests with nonsequential residue numbers, it might be best to throw an error when that's the case here. Redirect the user to pdb_reres to avoid undesired results and potential confusion.

pdb-tools should not raise errors (I think). these are very simple tools that should do the job when the PDB follows (more or less) the right formatting rules. if the PDB is "crazy" or has "errors incompatible with the tool", I am not sure if it should be the responsibility of pdbtools to warn about that. I think we don't do that in any other tool. @JoaoRodrigues ?

Yes please, use the smallest possible reproducible case for PDB tests.

Cheers!

@andrewsb8
Copy link
Author

I'll have to get back to you about pdb_chainbows error as I cannot check today.

If not throwing an error, the function could mimic that of pdb_delres which asserts that the residues are deleted regardless of sequence number. pdb_add_manual_ter is written similarly so having it behave this way would be consistent with previous practices in the repository. Based on the test with dummy.pdb, there would still have to be changes made which shouldn't be too difficult (i.e. not considering the value of the residue number in the pdb file when determining if in the residue range to add a TER).

@joaomcteixeira
Copy link
Member

Sounds good so far. Give it a try on the review comments and your considerations. We will review it again after that. Best, and we are very happy to have you interacting with pdb-tools ☺️

@andrewsb8
Copy link
Author

Hi all,

I've made some fairly large changes to the branch pdb_addter. I'll outline some of the largest here:

  • is_integer is changed to return_integer, which now also handles the error message for checking that inputs are correct as mentioned in a comment above. Also saves some lines in check_input.
  • The script now counts residues without the context of the residue number in the pdb file.
  • Starting residue range is now required to be 1 or greater because of previous point.
  • Added make_TER from pdb_tidy to add full TER entries.
  • Added a block of code that stops adding TER entries when records such as END and CONECT are reached.
  • Existing TER lines or entries are ignored and TER entries are no longer duplicated.
  • The test case provided by me is now much shorter lol.
  • I wrote tests for dummy.pdb to confirm that TER are not added after END/CONECT records and out of order residue numbers do not affect the placement of TER entries.

Let me know what you think or if you run into any issues!

  • Brian

@andrewsb8
Copy link
Author

Hey all,

I know that it's end of term and holiday season. Just wanted to follow up because I had actually forgotten about this until a minute ago 😅. Hope the end of the term goes well for you both and that you enjoy your holiday break!

Brian

@joaomcteixeira
Copy link
Member

Hi, @andrewsb8 sorry for the late reply. These have been busy days on all fronts. My apologies. I will review this PR today/tomorrow and let you know ☺️ stay tuned. 👍 Thanks very much for all your efforts so far 💪

Copy link
Member

@joaomcteixeira joaomcteixeira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andrewsb8

I have been reviewing your PR. The idea remains very valuable. However still some minor issues.

  1. I noticed the residue range the tool expects is the natural order of the residues (1, 2, 3, ...) regardless of the residue numbering in the PDB. This is awkward because would force the users to calculate the residues they see in the PDB and the ones they want. Or use pdb_reres twice to correct for that, for example: pdb_reres -1 | pdb_add_manual_ter | pdb_reres -X again. I think it is not difficult to implement such that the residue range pdb_add_manual_ter expects is the residues of the actual PDB. I found it more natural.
  2. I think the tests are very loose still. Please add assert statements to confirm the actual number and position of the TER lines in the output files. The best would be to assert for the actual TER line.

Regarding number 1. I think other members of the team should give their opinion as well.

Apart from that, everything looks working. Once this is set, before merging I will clean a bit your implementation, so to homogenize with the other tools.

Sorry for these demands, but we need to keep it so 😉
Thanks for all your contributions!

@andrewsb8
Copy link
Author

andrewsb8 commented Jan 4, 2022

Happy new year! Thanks for the feedback. Have a couple of questions based on it.

  1. The main reason I switched to just counting residues, instead of considering the residue numbering in the file, was largely because of the dummy.pdb file. Say, for example, I wanted to run add_manual_ter -4:7:1 dummy.pdb, I would run into a potential issue because the residue numbering is 4,6,7,1,2,3,5,2. So, I would get a TER entry after the fifth residue but that feels like weird behavior sinces it's not next to 4, 6, and 7 and out of order. Similarly, the option -1:3 would produce TER entries after both residues identified as 2 in that pdb. The latter case could be particularly relevant if you are "combining" pdbs of two or more structures for whatever purpose as it would be likely that residue numbers would repeat (and would be a great use case in tandem with pdb_reres). Both methods definitely have their downsides. What are your thoughts?
  2. Makes perfect sense. I'm working on that right now but that actually lead to noticing something else worth mentioning.
  3. The 'TER' entries added include an entry number but ends up duplicating:

ATOM 51 HB3 ALA X 5 ...... TER 52 ALA X 5 ...... ATOM 52 N ALA X 6

So the 52 is duplicated. I can either leave that number out (so only ATOM, HETATM, etc get numbers) or work to make sure the entry numbers are incremented for the number of added TER entries preceding it. What do you all think? Edited to add: This is actually an interesting issue for dummy.pdb. As the method generating the TER entry increments the atom number and there are many duplicate atom numbers already. An example from dummy.pdb is:

ATOM 3 O ARG B 4 ..... TER 4 ARG B 4 ...... ATOM 3 N GLU B 6

There the second entry is changed but everything after the TER entry does not follow it. Let me know what you think! Thanks again for looking over my work and hope you enjoyed the holiday break!

Brian

@joaomcteixeira
Copy link
Member

Hi Brian, thanks for your cheerful words. We had great holidays, we hope you too :-)

All the points you are rising are definitively very tricky and I feel they are getting out of the PDB standard format because of the added complexity. The point you raise for 1) is a good one. Maybe it is best to leave it as "per residue" instead of the actual residue numbers of what you want. Sometimes I think what we if the tool should be add ters at specific residue/chains, for example -A:100. But the question raises: what is the result of passing pdb_tidy after pdb_add_manual_ter? I am starting to have concerns if tools do things outside the PDB format. Maybe what you were looking for is some addition to pdb_tidy? Or some pdb_split* combined with pdb_chain and pdb_tidy that maybe we don't have?

Regarding 3), pdb_tidy corrects for atom numbers. I think this tool should also.

I was rereading again the whole conversation here, definitively @JoaoRodrigues raise a good point. Could you point us to a real-world example (PDBID) and would be your perfect outcome so we can investigate further?

Large system pdb files (which I uploaded as a test file for this script, 692 residues) which require multiple TER entries (potentially at different intervals) prior to simulation can be made much more easily with a script.

Could so send us again these examples? (transfer.sh)

Cheers,

@andrewsb8
Copy link
Author

I will make it so the atom numbers are fixed, no problem!

As for examples of what I'm looking for, I'm rarely working with a single pdb id and sometimes not one at all. I have three examples of some current projects with large systems:

Before finding this project, I've been modifying these files largely manually which has been very time intensive.
Being able to organize these pdb files via some combination of pdb_reres, pdb_merge and/or pdb_tidy has been greatly useful. TER entries are particularly important in many simulation use cases for simulation of these systems. Each use case above requires a different number of TER entries at different intervals (3 in first case, 161 in second, and 7 in the third) and chainbows won't help without preexisting TER entries in most cases listed above. This seemed to be the only step I had to still performed manually when creating systems for simulation after beginning to use this package and motivated me to suggest the addition.

@andrewsb8
Copy link
Author

I have updated pdb_add_manual_ter according to the suggestions, namely:

  • Atom numbers are corrected
  • Existing TER entries are renumbered, or replaced with correctly formatted TER entries
  • The tests now have thorough checks for where the TER entries are (whether they existed in the file or were added by pdb_add_manual_ter)

Let me know what you think of my changes or of my prior comment.

Best,

Brian

@joaomcteixeira
Copy link
Member

Hi @andrewsb8 ,
Thanks so much for your clear explanatory messages. I will check the examples you gave us and your code, and let you know. Cheers,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants