
Drop chains that are missing (structure) data in training #210

Merged · 4 commits · Aug 30, 2022

Conversation

timodonnell
Contributor

This PR moves chain_data_cache_paths from OpenFoldDataset (where it was a list, one per dataset) to a single string chain_data_cache_path in OpenFoldSingleDataset. This makes it convenient to drop entries (with a warning) where we have alignments but no structure data, instead of crashing.
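The filtering described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code; the function name and the exact cache format are assumptions (OpenFold's chain data cache is a JSON mapping of chain IDs to per-chain metadata).

```python
import json
import logging

def filter_chains(alignment_chain_ids, chain_data_cache_path):
    """Drop chains that have alignment data but no entry in the chain
    data cache, warning for each dropped chain instead of crashing later.

    Hypothetical sketch; OpenFoldSingleDataset's constructor differs in detail.
    """
    with open(chain_data_cache_path) as f:
        chain_data_cache = json.load(f)

    kept = []
    for chain_id in alignment_chain_ids:
        if chain_id not in chain_data_cache:
            logging.warning(
                "Dropping chain %s: alignments present but no entry in the "
                "chain data cache",
                chain_id,
            )
            continue
        kept.append(chain_id)
    return kept
```

Extra entries in the cache are simply never looked up, which is why a cache that is a superset of the training data keeps working (as discussed below in the thread).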

The motivation here was that while trying to start a small training run, I was hitting an issue where a few chains were present in my alignment data (downloaded from RODA) but absent from my pdb_mmcif dir (downloaded from PDB via download_pdb_mmcif.sh). I see there is now scripts/download_roda_pdbs.sh to address this, but I thought it might be good to be able to handle this error a bit better anyway.

Curious if you think this makes sense @gahdritz

I also note a few tests are failing for me here, but the same tests appear to be failing for me on main (basic inference and training seem to work). If you approve the PR, though, I'd rather you wait to merge it so I can check more carefully that it's not breaking anything.

@gahdritz
Collaborator

gahdritz commented Aug 30, 2022

Thanks Tim! I like the concept, but I think it might be a little bit more natural to look for matching structure files in the OpenFoldSingleDataset's data_dir directly during the filtering process rather than using the chain_data_cache as a proxy. First, it's a little confusing this way, and it's not immediately obvious to me from the code what's happening here. Also, ATM it's technically possible to use a chain_data_cache that's a superset of the current training data, e.g. if you're training on a subset of your full dataset but aren't regenerating a new chain_data_cache for each new subset. This would interfere with that functionality.

To do that, I'd factor the structure file name resolution out of the __getitem__ function of the dataset object and run it for each of the proteins with alignment directories during the filtering process in the constructor.
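The alternative suggested above might look roughly like this. The function names and the structure-file naming convention (lowercased PDB ID plus `.cif` in `data_dir`) are assumptions for illustration only.

```python
import logging
import os

def resolve_structure_file(data_dir, chain_id):
    """Return the path to the .cif file backing a chain, or None if absent.

    Assumes chain IDs look like "<pdb_id>_<chain>" and structure files are
    named "<pdb_id>.cif"; the real resolution logic may differ.
    """
    pdb_id = chain_id.split("_")[0].lower()
    path = os.path.join(data_dir, f"{pdb_id}.cif")
    return path if os.path.exists(path) else None

def filter_by_structure_files(data_dir, alignment_chain_ids):
    """Keep only chains whose structure file exists on disk."""
    kept = []
    for chain_id in alignment_chain_ids:
        if resolve_structure_file(data_dir, chain_id) is None:
            logging.warning("No structure file for %s; dropping", chain_id)
            continue
        kept.append(chain_id)
    return kept
```

Note that this checks only whether a structure file exists, not whether the specific chain is present inside it, which is the limitation raised later in the thread.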

@timodonnell
Contributor Author

Thanks @gahdritz, that makes sense. I'll try the implementation you suggest. Separately, do you think chain_data_cache (or _caches) should live in OpenFoldDataset or OpenFoldSingleDataset? It seemed cleaner to move it to OpenFoldSingleDataset, but I wonder if I'm missing something.

@gahdritz
Collaborator

Wait hold on a second. This doesn't even break the use case I mentioned, right? If the chain_data_cache is a strict superset of the current training data, no chains are removed?

@timodonnell
Contributor Author

Yeah, I think anything that runs already would continue to work. Extra chains in the cache wouldn't cause a problem.

The only difference is that when chain_data_cache is missing chains that are in the alignment data, they will be dropped instead of causing a crash eventually in the training loop.

I think your suggestion to look at the actual cif files on disk also makes sense, since anything missing there would also cause a crash. This could be done in addition to the change here that just drops chains not present in the cache.

@gahdritz
Collaborator

I actually prefer your original solution to what I suggested. Looking at the .cif files wouldn't give you anything about the chains in particular without re-parsing the .cif files, which defeats the point of having the chain_data_cache. I'll merge this as-is.

@gahdritz gahdritz merged commit 9dd9cea into aqlaboratory:main Aug 30, 2022