
Drop chains that are missing (structure) data in training #210

Merged · 4 commits · Aug 30, 2022

Conversation

timodonnell
Contributor

This PR moves chain_data_cache_paths from OpenFoldDataset (where it was a list, one per dataset) to a single string chain_data_cache_path in OpenFoldSingleDataset. This makes it convenient to drop entries (with a warning) where we have alignments but no structure data, instead of crashing.
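The filtering described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code; the function name and the exact cache format are assumptions (OpenFold's chain data cache is a JSON mapping of chain IDs to per-chain metadata).

```python
import json
import logging

def filter_chains(alignment_chain_ids, chain_data_cache_path):
    """Drop chains that have alignment data but no entry in the chain
    data cache, warning for each dropped chain instead of crashing later.

    Hypothetical sketch; OpenFoldSingleDataset's constructor differs in detail.
    """
    with open(chain_data_cache_path) as f:
        chain_data_cache = json.load(f)

    kept = []
    for chain_id in alignment_chain_ids:
        if chain_id not in chain_data_cache:
            logging.warning(
                "Dropping chain %s: alignments present but no entry in the "
                "chain data cache",
                chain_id,
            )
            continue
        kept.append(chain_id)
    return kept
```

Extra entries in the cache are simply never looked up, which is why a cache that is a superset of the training data keeps working (as discussed below in the thread).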

The motivation here was that while trying to start a small training run, I was hitting an issue where a few chains were present in my alignment data (downloaded from RODA) but absent from my pdb_mmcif dir (downloaded from PDB via download_pdb_mmcif.sh). I see there is now scripts/download_roda_pdbs.sh to address this, but I thought it might be good to be able to handle this error a bit better anyway.

Curious if you think this makes sense @gahdritz

I also note a few tests are failing for me here, but the same tests appear to be failing for me on main (basic inference and training seem to work). If you approve the PR, though, I'd rather you wait to merge it so I can check more carefully that it's not breaking anything.

@gahdritz
Collaborator

gahdritz commented Aug 30, 2022

Thanks Tim! I like the concept, but I think it might be a little bit more natural to look for matching structure files in the OpenFoldSingleDataset's data_dir directly during the filtering process rather than using the chain_data_cache as a proxy. First, it's a little confusing this way, and it's not immediately obvious to me from the code what's happening here. Also, ATM it's technically possible to use a chain_data_cache that's a superset of the current training data, e.g. if you're training on a subset of your full dataset but aren't regenerating a new chain_data_cache for each new subset. This would interfere with that functionality.

To do that, I'd factor the structure file name resolution out of the __getitem__ function of the dataset object and run it for each of the proteins with alignment directories during the filtering process in the constructor.
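The alternative suggested above might look roughly like this. The function names and the structure-file naming convention (lowercased PDB ID plus `.cif` in `data_dir`) are assumptions for illustration only.

```python
import logging
import os

def resolve_structure_file(data_dir, chain_id):
    """Return the path to the .cif file backing a chain, or None if absent.

    Assumes chain IDs look like "<pdb_id>_<chain>" and structure files are
    named "<pdb_id>.cif"; the real resolution logic may differ.
    """
    pdb_id = chain_id.split("_")[0].lower()
    path = os.path.join(data_dir, f"{pdb_id}.cif")
    return path if os.path.exists(path) else None

def filter_by_structure_files(data_dir, alignment_chain_ids):
    """Keep only chains whose structure file exists on disk."""
    kept = []
    for chain_id in alignment_chain_ids:
        if resolve_structure_file(data_dir, chain_id) is None:
            logging.warning("No structure file for %s; dropping", chain_id)
            continue
        kept.append(chain_id)
    return kept
```

Note that this checks only whether a structure file exists, not whether the specific chain is present inside it, which is the limitation raised later in the thread.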

@timodonnell
Contributor Author

Thanks @gahdritz, that makes sense. I'll try the implementation you suggest. Separately, do you think chain_data_cache (or _caches) should live in OpenFoldDataset or OpenFoldSingleDataset? It seemed cleaner to move it to OpenFoldSingleDataset, but I wonder if I'm missing something.

@gahdritz
Collaborator

Wait hold on a second. This doesn't even break the use case I mentioned, right? If the chain_data_cache is a strict superset of the current training data, no chains are removed?

@timodonnell
Contributor Author

Yeah, I think anything that runs already would continue to work. Extra chains in the cache wouldn't cause a problem.

The only difference is that when chain_data_cache is missing chains that are in the alignment data, they will be dropped instead of causing a crash eventually in the training loop.

I think your suggestion to look at the actual cif files on disk also makes sense, since anything missing there would also cause a crash. This could be done in addition to the change here that just drops chains not present in the cache.

@gahdritz
Collaborator

I actually prefer your original solution to what I suggested. Looking at the .cif files wouldn't give you anything about the chains in particular without re-parsing the .cif files, which defeats the point of having the chain_data_cache. I'll merge this as-is.

@gahdritz gahdritz merged commit 9dd9cea into aqlaboratory:main Aug 30, 2022