
Merging bids_validate into validate, stop validating nested BIDS. #1203

Merged · 47 commits into master from validate_merge · Mar 1, 2023

Conversation

@TheChymera (Contributor) commented Feb 6, 2023

Long time coming; a few issues remain to even out. This will affect both the command-line interface and internal validation as part of other commands.

(This also includes some minor fixes for issues I came across, including addressing the lack of testing for "vanilla" NWB validation.)

codecov bot commented Feb 6, 2023

Codecov Report

Patch coverage: 84.61%; project coverage change: +0.32% 🎉

Comparison is base (9b6dea6) 88.85% compared to head (8767452) 89.18%.

❗ Current head 8767452 differs from the pull request's most recent head 2cd050e. Consider uploading reports for commit 2cd050e to get more accurate results.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #1203      +/-   ##
==========================================
+ Coverage   88.85%   89.18%   +0.32%
==========================================
  Files          74       74
  Lines        9443     9573     +130
==========================================
+ Hits         8391     8538     +147
+ Misses       1052     1035      -17
```
| Flag | Coverage Δ |
| --- | --- |
| unittests | 89.18% <84.61%> (+0.32%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| dandi/validate.py | 97.61% <ø> (-0.26%) ⬇️ |
| dandi/cli/cmd_validate.py | 78.02% <40.00%> (+4.98%) ⬆️ |
| dandi/tests/test_validate.py | 82.97% <79.16%> (-17.03%) ⬇️ |
| dandi/files/__init__.py | 94.04% <92.30%> (+0.29%) ⬆️ |
| dandi/tests/fixtures.py | 97.35% <93.33%> (-0.24%) ⬇️ |
| dandi/cli/tests/test_cmd_validate.py | 100.00% <100.00%> (ø) |
| dandi/files/_private.py | 100.00% <100.00%> (+1.88%) ⬆️ |
| dandi/files/bids.py | 99.15% <100.00%> (+1.68%) ⬆️ |
| dandi/tests/test_files.py | 100.00% <100.00%> (ø) |
| dandi/due.py | 48.00% <0.00%> (-20.00%) ⬇️ |
| ... and 11 more | |


@TheChymera (Contributor, Author):

@jwodder having validation done solely from within the BIDS* objects obscures the parameter space of the upstream function. That's not necessarily an issue, except that the report function is very useful for debugging. Does DANDI have some sort of global debug flag which I could somehow query inside the object, passing a default-value report parameter if it's true?

@TheChymera (Contributor, Author) commented Feb 6, 2023

@jwodder could you have a look at

dandi-cli/dandi/validate.py, lines 171 to 184 in d409962:

```python
# This is pretty awkward: as it turns out, if validation is called once on
# just the base files, the results are somehow cached, and subsequent
# validation attempts in the same BIDS dataset will just return the errors
# of the first run, which is to say, none.
# This also won't fix all instances of missed validation, since results seem
# sensitive to the order in which files are presented...
if os.path.basename(str(df.filepath)) in [
    "dataset_description.json",
    "README",
    "README.md",
    "README.txt",
    "README.rst",
]:
    print("🤔🤔🤔🤔🤔")
    continue
```
Any idea why this happens? Is there some way to make it not rely on the first pass? As it stands, it fails with one of the three broken reference datasets; if you remove the if statement, it fails with all three.

@jwodder (Member) commented Feb 6, 2023

@TheChymera

> Does DANDI have some sort of global debug flag

No. Global state is evil.

@jwodder (Member) commented Feb 6, 2023

@TheChymera I believe the caching you're seeing is due to the fact that calling get_validation_errors() on a BIDS object causes the validation code in _validate() to be run, which stores its results on the BIDSDatasetDescriptionAsset and will not be rerun if any validation methods are called for the dataset a second time. Does that answer your question as to what's happening? Would it solve things if I added a method for clearing validation data?
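(For illustration, a minimal sketch of the run-once caching pattern described above; the class layout and names here are assumptions for exposition, not the actual dandi-cli code:)

```python
# Sketch only: illustrates why a first validation pass "sticks".
class CachedBIDSDataset:
    def __init__(self, dataset_path: str) -> None:
        self.dataset_path = dataset_path
        self._validation_errors: list | None = None

    def _validate(self) -> None:
        # The expensive whole-dataset validation runs only on the first call;
        # every later call returns the stored first-run result, which is why
        # validating a "clean" subset first can mask errors found afterwards.
        if self._validation_errors is None:
            self._validation_errors = self._run_validator()

    def _run_validator(self) -> list:
        return []  # stand-in for the real call into bidsschematools

    def get_validation_errors(self) -> list:
        self._validate()
        return self._validation_errors
```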

@TheChymera (Contributor, Author):

> Would it solve things if I added a method for clearing validation data?

@jwodder yes, it certainly would. However, now I'm wondering whether that's the right thing to do. The issue I see with it is that the validator would then be re-run for each file. Unless there are some very nifty caching features in Python, or you'd be willing to significantly redesign the already quite complex BIDS* object infrastructure, the consequence would be that for each file, three files (the file itself + dataset_description.json + README) are validated, so it would take three times as long as it needs to, in addition to the overhead. Currently that's not an issue, since the upstream validator consists of fairly simple Python code on top of blazing-fast regex, but we might end up adding more features for all the intricate ways in which BIDS files are interconnected, and things might be a lot harder to fix once we have a bunch more features on top... Is this a valid concern from your perspective?

What I was thinking is that maybe I could instead add some auto-detection logic in validate.py so that if BIDS directories are passed, we don't iterate over individual files. This would be pretty ad hoc, however, and much more inelegant and potentially less robust than the very nice find_dandi_files() logic you've constructed in files/__init__.py 🤔

@jwodder (Member) commented Feb 6, 2023

@TheChymera Let's back up a bit. Why exactly is the caching a problem? Are you trying to validate the same dataset more than once, and, if so, why?

@TheChymera (Contributor, Author):

@jwodder I am trying to adapt bids_validate() to the logic of the vanilla validate() function, which seems to run find_dandi_files() inside the provided directory (or directories) and then get the errors for the files individually. As I understand it, it's the iteration in the existing function that might require running the validator for each file separately.

@jwodder (Member) commented Feb 6, 2023

@TheChymera

> As I understand it, it's the iteration in the existing function that might require running the validator for each file separately.

Can you elaborate on this? Are you referring to this code?

dandi-cli/dandi/validate.py, lines 168 to 171 in 67f015c:

```python
for df in find_dandi_files(
    p, dandiset_path=dandiset_path, allow_all=allow_any_path
):
    yield from df.get_validation_errors(
```

I don't see how this sort of thing would cause a problem with caching.

@TheChymera (Contributor, Author) commented Feb 7, 2023

@jwodder yes, that is the code I am referring to. The test function that runs the validator on all our example datasets and makes sure we detect errors can be invoked with pytest -vvs tests/test_validate.py::test_validate_bids_errors.

If you run it on the current state of the PR, it will succeed for all example error datasets save for one; without the code that skips explicit dataset_description.json and README validation, it will fail on all datasets, i.e. it will not detect any errors in any dataset.

I don't understand exactly why this happens either, but I assume it has to be due to the iteration over files: it seems that as soon as df.get_validation_errors() is called on some files individually, the results of subsequent df.get_validation_errors() calls on files from the same dataset become unreliable.

@jwodder (Member) commented Feb 7, 2023

@TheChymera I'm positive the problem has nothing to do with the caching. If I test only invalid_asl003, it fails to find any validation errors, both with and without your code block from this comment.

@TheChymera (Contributor, Author) commented Feb 15, 2023

> If I test only invalid_asl003, it fails to find any validation errors, both with and without your code block from #1203 (comment).

@jwodder well, yes; my previous fix attempt solved the issue for some, but not all, of the error datasets, as would be expected if the validation were order-sensitive.

I continue to be extremely sure that somehow (if you say there's no way it could be caching, then maybe it's not caching, but this is what's happening) running the validation on a subset of valid files of the dataset interferes with the subsequent detection of errors in a subset containing invalid files.

I tried a few more things, but it seems the only way to make sure this doesn't happen is to always run the validation on the BIDS root rather than on subsets of files (ec3308e).

I'm also wondering whether we might want to deprecate BIDSDatasetDescriptionAsset.dataset_files, since there is no way for us to determine a list of files without leveraging the bidsschematools validator: we'd either need to reimplement the logic (which is slated to become more complex as upstream adds support for .bidsignore) or directly call the private upstream function which does the file selection.

I actually tried to deprecate BIDSDatasetDescriptionAsset.dataset_files, which on the surface is easy, since it's only used in a few places; but it's used a whole lot more in the tests, and is more difficult to clean up there. I thought it might be best to ask.

@TheChymera (Contributor, Author) commented Feb 16, 2023

@jwodder @yarikoptic ok, I think this is my best shot. It passes all the example-based validation tests, but fails the tests for the object constructor. I'm unsure whether I broke something or whether the test suite should be modified as well.

@jwodder (Member) commented Feb 16, 2023

@TheChymera

> I'm also wondering whether we might want to deprecate BIDSDatasetDescriptionAsset.dataset_files, since there is no way for us to determine a list of files without leveraging the bidsschematools validator: we'd either need to reimplement the logic (which is slated to become more complex as upstream adds support for .bidsignore) or directly call the private upstream function which does the file selection.

  • Are we planning on excluding bidsignored files when uploading a Dandiset? (cc @yarikoptic) If so, that upstream function is going to have to go public.
  • Exactly what steps do you have in mind for "deprecating" dataset_files? My definition of that word does not entail removing any tests until the feature is removed. If you're planning on tearing it out entirely in this PR, you should also remove the weakref wrapper around BIDSDatasetDescriptionAsset in BIDSAsset.

@yarikoptic (Member):

> Are we planning on excluding bidsignored files when uploading a Dandiset? (cc @yarikoptic) If so, that upstream function is going to have to go public.

Good question. I think we could indeed avoid picking them up automatically when uploading, if those paths are not explicitly specified and --allow-any-path is given -- similar to how we ignore non-NWB/video files by default, AFAIK.

Rationale: I think it is OK to put a little burden on the user to go the extra "mile" if they want to upload some data which is not "valid". Otherwise (if it is too easy) we end up with cases like those found in the OpenNeuro archive, where people simply ignore entire subject folders to pacify the bids-validator upon upload -- that must be really discouraged.

WDYT @satra?

@TheChymera (Contributor, Author):

@jwodder @yarikoptic

For this PR

> If so, that upstream function is going to have to go public.

> Exactly what steps do you have in mind for "deprecating" dataset_files?

Not necessarily; my idea was to have dataset_files populated automatically after the validation is run on the BIDS root. That way we get the list of what counts as a BIDS dataset file straight from the horse's mouth: no need to first run bst.validator._get_paths and then bst.validator.validate_bids, which runs the former function again. (A sketch of this idea follows below.)
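(A hedged sketch of that idea, assuming the validator's result dict exposes the list of walked paths; the "path_listing" key is an assumption from memory, not a verified bidsschematools API detail:)

```python
from bidsschematools.validator import validate_bids


def populate_dataset_files(asset, bids_root: str) -> None:
    # Run the upstream validator once on the BIDS root...
    result = validate_bids(bids_root)
    # ...and take the set of files it actually considered as the definitive
    # list of what counts as part of the BIDS dataset ("path_listing" is an
    # assumed key name).
    asset.dataset_files = sorted(result.get("path_listing", []))
```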

Ideally not for this PR

> --allow-any-path

There's no flag upstream for turning off .bidsignore support, and I don't think there should be one: if you want it off, you rm the file. The purpose of .bidsignore is to tell the validator to stop nagging you about things you are aware of; the most common use case is developing a new BEP, but indeed it can be arbitrarily abused. I think we should have a separate flag, e.g. --allow-any-path, to bypass validation entirely, and keep that separate from the .bidsignore logic. But let's maybe discuss this in another PR or issue; I just brought it up as an example of why BIDS-agnostic file detection is reaching its limits.

@jwodder (Member) commented Feb 16, 2023

@TheChymera Only populating dataset_files via validation doesn't seem like a good idea to me. At the very least, it'd cause problems when uploading with validation disabled.

@TheChymera TheChymera marked this pull request as ready for review February 20, 2023 11:33
```python
)
log_dir = appdirs.user_log_dir("dandi-cli", "dandi")
report_path = "{log_dir}/bids-validator-report_{{datetime}}-{{pid}}.log"
report_path = report_path.format(
```
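(A side note on the doubled braces above: they survive the first .format() call, leaving {datetime} and {pid} as placeholders, presumably to be filled in later when the report is written:)

```python
>>> "{log_dir}/bids-validator-report_{{datetime}}-{{pid}}.log".format(log_dir="/tmp")
'/tmp/bids-validator-report_{datetime}-{pid}.log'
```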
Member:
Do we ever notify the user about the existence of this report?

@yarikoptic (Member) commented Feb 20, 2023:

@TheChymera as before, I would also argue for making it optional. FWIW we already store the log file for the run, so if desired we can log there whatever needs to be said.

@TheChymera (Contributor, Author):

@yarikoptic it's a bit more difficult now, since the validation is run by a method of the object, and I'm not sure whether we should burden the object with this method parameter. The default DANDI logging does not log the validator report, as that's just a dictionary produced, and optionally written to a file, by an external function. We could disable it entirely, but that would mean that if we have issues, we need to ask people to run the bst validate command directly to get the actual report listing all the comparisons.

Member:

> The default DANDI logging does not log the validator report, as that's just a dictionary produced, and optionally written to a file, by an external function.

we get to validation in 2 cases:

am I missing some other use case where we should log bids-validator output in a custom/separate file?

@TheChymera (Contributor, Author):

@jwodder @yarikoptic looking good now?

@yarikoptic (Member) left a comment:

Looks monumental, and it is great to see tests passing; well done! A few overall points:

  • even though AFAIK we do not yet have "nested BIDS" datasets, the changes to the logic relating to them are not clear to me
  • duplication and unconditional changes to fixture behavior should be avoided
  • I remain of the opinion that we do not need an extra log file for BIDS validation
  • we had better make the transition smoother for users, thus retaining validate-bids for a bit longer with a DeprecationWarning while invoking the validate logic

Comment on lines 108 to 109
```python
bidsdd = None
path_queue.append((Path(p), bidsdd))
```
Member:

Suggested change:

```diff
- bidsdd = None
- path_queue.append((Path(p), bidsdd))
+ path_queue.append((Path(p), None))
```

i.e. this diff seems to have no purpose, since bidsdd is anyway reassigned below within the while loop and not used outside of it, or am I missing something?

@TheChymera (Contributor, Author):

Yes, you are right; fixed in d962e7d.

```diff
@@ -127,10 +130,13 @@ def find_dandi_files(
     # (ie., it's not a Zarr or any other directory asset type
     # we may add later), so traverse through it as a regular
     # directory.
-    if (p / BIDS_DATASET_DESCRIPTION).exists():
+    if (p / BIDS_DATASET_DESCRIPTION).exists() and not any(
+        i in p.parents for i in bids_roots
```
Member:

  1. Is there some ordering guarantee within bids_roots or the original path_queue? If not, then it feels like the check here would depend on the order of items within path_queue, which would determine whether or not we get a hit here, since we are populating bids_roots right below. So I feel that some ordering should be introduced in path_queue so that we traverse top -> bottom, or am I wrong?
  2. Could you please elaborate a little more on why you added the and not any here, i.e. why we do not add nested BIDS subdatasets?

@TheChymera (Contributor, Author):

  1. The path queue is extended as part of the while loop; it will always be traversed top-down (sketched below).
  2. Nested BIDS datasets are currently either invalid or derivatives. Support for derivatives is limited in the schema and absent in the upstream validator. Even if we at some point support derivatives, that will require an object structure redesign. Currently, if we validate them, they will be validated as normal BIDS and will fail.
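(A minimal, self-contained sketch of the traversal being discussed, with assumed names: a FIFO queue yields paths top-down, and a directory whose ancestors include an already-recorded BIDS root is traversed as a plain directory rather than registered as a new dataset:)

```python
from collections import deque
from pathlib import Path


def find_bids_roots(root: Path) -> list[Path]:
    bids_roots: list[Path] = []
    path_queue: deque[Path] = deque([root])
    while path_queue:
        p = path_queue.popleft()  # FIFO: parents are seen before children
        nested = any(r in p.parents for r in bids_roots)
        if (p / "dataset_description.json").exists() and not nested:
            bids_roots.append(p)  # record top-level BIDS datasets only
        for sub in sorted(p.iterdir()):
            if sub.is_dir():
                path_queue.append(sub)
    return bids_roots
```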

Member:

IMHO it is OK to fail them if they do not follow the common-derivatives schema, which is part of BIDS.

  • Do we have a use case which somehow demands that we pretend that nested BIDS is not BIDS?
  • If not -- I would prefer for us to stay consistent and treat BIDS datasets as BIDS datasets, nested or not, because otherwise we would need to treat every file under derivatives as either "legit, I don't care" or "illegal" -- a situation that is no better, IMHO.

```python
bids_dataset_path, dandiset_metadata_file
)
with open(dandiset_metadata_file_path, "w") as f:
    f.write(" \n")
```
Member:

This is the generic get_gitrepo_fixture. I really would prefer it to do exactly that, and not change the cloned git repo unconditionally! It feels like you want to create some tuned-up fixture based on this; give it a descriptive name and a docstring stating what side effect it creates on top of the clone. It is also not clear to me what the goal here is -- to make every subfolder into a dandiset?

Or maybe add it as an option to this fixture, e.g. make_subfolders_into_dandisets=False (or whatever fits the purpose here), and enable this code block conditionally on that.

@TheChymera (Contributor, Author):

Done in 6bfb520.

```python
bids_dataset_path, dandiset_metadata_file
)
with open(dandiset_metadata_file_path, "w") as f:
    f.write(" \n")
```
Member:

  1. See the comment above on the other, similar fixture.
  2. I have a severe allergy to code duplication. Please make a helper function (e.g. _make_each_folder_into_dandiset(path) or similar) and use it in those two spots.

@TheChymera (Contributor, Author):

OK, same as above; done in 6bfb520.
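(For reference, a minimal sketch of the suggested helper; "dandiset.yaml" is DANDI's dandiset metadata filename, and the placeholder content mirrors what the fixture code above writes:)

```python
from pathlib import Path


def _make_each_folder_into_dandiset(path: Path) -> None:
    """Write a placeholder dandiset metadata file into each subfolder so
    that every example dataset in the clone is treated as a Dandiset."""
    for folder in path.iterdir():
        if folder.is_dir():
            (folder / "dandiset.yaml").write_text(" \n")
```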

```diff
@@ -236,11 +236,11 @@ def test_find_dandi_files_with_bids(tmp_path: Path) -> None:
         dandiset_path=tmp_path,
         bids_dataset_description_ref=ANY,
     ),
-    BIDSDatasetDescriptionAsset(
+    GenericBIDSAsset(
```
Member:

So a nested BIDS dataset would not be considered a BIDS dataset? (Nothing in the PR description hints at the change in behavior in this respect.)

@TheChymera (Contributor, Author):

True, though a lot of stuff has happened that's not in the PR description. I added that one, though, since it is one of the bigger changes.


```diff
-def test_report_path(bids_examples, tmp_path):
+def test_validate_bids_onefile(bids_error_examples, tmp_path):
+    """
+    Dedicated test using single-file validation.
```
Member:

Note to self: unlike nose, pytest doesn't use the docstring as the "name" of the test while running, so there is no obfuscation and we are indeed free to provide docstrings giving details on tests. Good.

@TheChymera changed the title from "Merging bids_validate into validate" to "Merging bids_validate into validate, deprecating nested BIDS." on Feb 23, 2023
@TheChymera (Contributor, Author):

@yarikoptic looking good?

@yarikoptic (Member) left a comment:

A minor code tune-up, plus the bigger question again of why there is a change in the treatment of (possibly) nested BIDS datasets. Is it somehow demanded by the use case, or by some other cause? If not, we had better keep them as BIDS datasets and validate them, even though they might indeed fail now.

```python
report=report,
report_path=report_path,
schema_version=schema,
warnings.filterwarnings("default")
```
Member:

Interesting -- without this I was shown no DeprecationWarning. Odd... @jwodder, do you know why?

@TheChymera (Contributor, Author):

Yep, took me a while to figure it out as well...
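(For context: CPython's default warning filters ignore DeprecationWarning unless it is triggered directly in __main__, so a deprecation warning raised inside an imported module such as a CLI command is silently dropped; resetting the filter makes it visible. A standalone sketch with a hypothetical message, not the PR's exact code:)

```python
import warnings


def deprecated_command() -> None:
    # Re-enable DeprecationWarning, which the default filters suppress for
    # code outside __main__; "default" prints each unique warning once.
    warnings.filterwarnings("default", category=DeprecationWarning)
    warnings.warn(
        "dandi validate-bids is deprecated; use dandi validate instead",
        DeprecationWarning,
        stacklevel=2,
    )
```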


@TheChymera (Contributor, Author) commented Feb 28, 2023

@yarikoptic The upstream bst validator does not currently validate derivatives; it just ignores them. If someone has derivatives (some of our test data does), DANDI as it currently stands will pick them up as separate non-derivative datasets and fail. I can make this the next feature we work on, but it will have to be done on the bst end, and the nestedness will be handled there.

@yarikoptic changed the title from "Merging bids_validate into validate, deprecating nested BIDS." to "Merging bids_validate into validate, stop validating nested BIDS." on Mar 1, 2023
@yarikoptic merged commit 1818b2f into master on Mar 1, 2023
@yarikoptic deleted the validate_merge branch on March 1, 2023 02:10
@yarikoptic (Member):

OK, let's proceed! Thank you @TheChymera and @jwodder!

@jwodder added labels BIDS, cmd-validate, and minor (increment the minor version when merged) on Mar 1, 2023