Merging `bids_validate` into `validate`, stop validating nested BIDS #1203
Conversation
Codecov Report (patch coverage):

```
@@           Coverage Diff           @@
##           master    #1203   +/-  ##
======================================
+ Coverage   88.85%   89.18%   +0.32%
======================================
  Files          74       74
  Lines        9443     9573     +130
======================================
+ Hits         8391     8538     +147
+ Misses       1052     1035      -17
```
@jwodder having validation solely be done from within the BIDS* objects obscures the parameter space of the upstream function. That's not necessarily an issue, except that the …
@jwodder could you have a look at Lines 171 to 184 in d409962 -- … if statement, it will fail with all 3.
No. Global state is evil.
@TheChymera I believe the caching you're seeing is due to the fact that calling …
@jwodder yes, it certainly would. However, now I'm wondering whether that's the right thing to do. The issue I see with that is that the validator would be re-run for each file, unless there are some very nifty caching features in Python or you'd be willing to significantly redesign the already quite complex … What I was thinking is maybe I could instead add some auto-detection logic in …
@TheChymera Let's back up a bit. Why exactly is the caching a problem? Are you trying to validate the same dataset more than once, and, if so, why?
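To make the caching question above concrete, here is a minimal sketch. It assumes the memoization comes from something like `functools.lru_cache` on a per-path loader; the actual dandi-cli mechanism is not shown in this thread, and `load_validator_state` is a hypothetical name:

```python
# Hypothetical sketch: if expensive validator state is memoized per path,
# a second validation of the same path silently reuses the first result.
from functools import lru_cache

calls = []  # track how many times the loader body actually runs

@lru_cache(maxsize=None)
def load_validator_state(dataset_path: str) -> str:
    # stand-in for expensive schema/context construction
    calls.append(dataset_path)
    return f"state-for-{dataset_path}"

load_validator_state("/data/ds1")
load_validator_state("/data/ds1")  # cache hit: loader body not re-run
assert calls == ["/data/ds1"]
```

This is why re-validating the same dataset in one process can behave differently from two separate runs.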
@jwodder I am trying to adapt …
Can you elaborate on this? Are you referring to this code? Lines 168 to 171 in 67f015c. I don't see how this sort of thing would cause a problem with caching.
@jwodder yes, that is the code I am referring to. The test function for running the validator on all our example datasets and making sure we detect errors would be called with … If you run it on the current status of the PR, it will succeed for all example error datasets save for one; without the code that skips explicit … I don't understand why exactly this happens either, but I assume it has to be due to the iteration by files; it seems as soon as …
@TheChymera I'm positive the problem has nothing to do with the caching. If I test only …
@jwodder well, yes, my previous fix attempt solved the issue for some, but not all, error datasets, as would be expected if the validation were order-specific. I continue to be extremely sure that somehow (if you say there's no way, then maybe it's not caching, but this is what's happening) running the validation on a subset of valid files of the dataset interferes with subsequent detection of errors in a subset with invalid files. I tried a few more things, but it seems the only way to make sure this doesn't happen is to always run the validation on the BIDS root rather than on subsets of files (ec3308e). I'm also wondering whether we might want to deprecate … I actually tried to deprecate …
Force-pushed from 7716f4d to a348845.
@jwodder @yarikoptic ok, I think this is my best shot. It passes all the example-based validation tests, but fails the tests for the object constructor. Unsure whether I broke something or whether the test suite should be modified as well.
Good question. I think we could indeed ignore them, i.e. not pick them up automatically when uploading, if those paths are not explicitly specified and … Rationale: I think it is OK to add a little burden on the user to go the extra "mile" if they want to upload some data which is not "valid". Otherwise (if it is just easy) we end up with cases like can be found in the OpenNeuro archive, where people simply ignore entire subject folders to pacify the bids-validator upon upload -- that must be really discouraged. WDYT @satra?
For this PR: not necessarily, my idea was to maybe have …

Ideally not for this PR: there's no flag upstream for turning off …
@TheChymera Only populating …
dandi/validate.py (outdated):

```python
)
log_dir = appdirs.user_log_dir("dandi-cli", "dandi")
report_path = "{log_dir}/bids-validator-report_{{datetime}}-{{pid}}.log"
report_path = report_path.format(
```
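As an aside on the quoted snippet: the doubled braces survive the first `.format()` call as literal single braces, so `{datetime}` and `{pid}` can be filled in by a later formatting pass. A small self-contained illustration (the example values are made up):

```python
# Two-stage formatting: {{...}} escapes a placeholder through the first pass.
template = "{log_dir}/bids-validator-report_{{datetime}}-{{pid}}.log"

first_pass = template.format(log_dir="/var/log/dandi")
# {{datetime}} and {{pid}} collapsed to single-brace placeholders:
assert first_pass == "/var/log/dandi/bids-validator-report_{datetime}-{pid}.log"

second_pass = first_pass.format(datetime="20230101T120000", pid=1234)
assert second_pass == "/var/log/dandi/bids-validator-report_20230101T120000-1234.log"
```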
Do we ever notify the user about the existence of this report?
@TheChymera as before, I would also argue to make it optional. FWIW we already store the log file for the run, so if desired we can log whatever needs to be said there.
@yarikoptic it's a bit more difficult now, since the validation is run by a method of the object. Not sure whether we should be burdening the object with this method parameter. The default DANDI logging does not log the validator report, as that's just a dictionary produced and optionally written to a file by an external function. We could disable it entirely, but that means if we have issues, we need to ask people to run the `bst validate` command directly to get the actual report listing all the comparisons.
> The default DANDI logging does not log the validator report as that's just a dictionary produced and optionally written to a file by an external function

we get to validation in 2 cases:

- direct use of `dandi validate` -- there we render/output results AFAIK
- optional validation during `upload` -- there I believe we do logging at https://github.com/dandi/dandi-cli/blob/master/dandi/upload.py#L161

am I missing some other use case where we should log bids-validator output in a custom/separate file?
Force-pushed from df1d919 to b848c35.
@jwodder @yarikoptic looking good now?
Looks monumental, and great to see tests passing, well done! A few overall points:

- even though we AFAIK do not yet have "nested BIDS" datasets, the changes to the logic relating to those are not clear to me
- duplication and unconditional changes to fixture behavior should be avoided
- I remain of the opinion that we do not need an extra log file for validation of BIDS
- we better make the transition smoother for the users, thus retain `validate-bids` for a bit longer with a DeprecationWarning but invoking the `validate` logic
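The deprecation shim suggested in the last point could look roughly like the following sketch. The function names here are stand-ins for the real CLI commands, not the actual dandi-cli code:

```python
# Hypothetical sketch: keep the old entry point alive for a release,
# warning and delegating to the new logic.
import warnings

def validate(paths):
    # stand-in for the unified validation logic
    return [f"validated {p}" for p in paths]

def validate_bids(paths):
    warnings.warn(
        "`validate-bids` is deprecated; use `validate` instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return validate(paths)
```

Callers of the old name keep working for a transition period while being nudged toward the new command.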
dandi/files/__init__.py (outdated):

```python
bidsdd = None
path_queue.append((Path(p), bidsdd))
```
Suggested change:

```python
path_queue.append((Path(p), None))
```

i.e. this diff seems to have no purpose, since `bidsdd` is anyway overwritten below within the `while` loop and not used outside of it, or am I missing something?
yes, you are right, fixed in d962e7d
```diff
@@ -127,10 +130,13 @@ def find_dandi_files(
     # (ie., it's not a Zarr or any other directory asset type
     # we may add later), so traverse through it as a regular
     # directory.
-    if (p / BIDS_DATASET_DESCRIPTION).exists():
+    if (p / BIDS_DATASET_DESCRIPTION).exists() and not any(
+        i in p.parents for i in bids_roots
```
- is there some ordering guarantee within `bids_roots` or the original `path_queue`? If not, then it feels that the check here would depend on the order of items within `path_queue`, which would then determine whether we get a hit here or not, since we are populating `bids_roots` right here below. So I feel that some ordering should be introduced among `path_queue` so we traverse top -> bottom, or am I wrong?
- could you please elaborate a little more on why you added that `and not any` here, i.e. why we do not add nested BIDS subdatasets?
- The path queue is extended as part of the while loop; it will always be traversed top-down.
- Nested BIDS datasets are currently either invalid or derivatives. Support for derivatives is limited in the schema, and absent in the upstream validator. Even if we at some point support derivatives, that will require an object-structure redesign. Currently, if we validate them, they will be validated as normal BIDS and will fail.
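The top-down traversal and the `not any(... in p.parents ...)` check discussed above can be sketched as a standalone toy (names hypothetical; the real `find_dandi_files` does much more than collect roots):

```python
# Toy version of the traversal: a FIFO queue walked top-down, recording BIDS
# roots and skipping dataset_description.json found under a recorded root.
from collections import deque
from pathlib import PurePosixPath

def find_bids_roots(paths_with_description):
    """`paths_with_description`: dirs containing dataset_description.json."""
    bids_roots = []
    # sorting guarantees parents are visited before their children
    queue = deque(sorted(PurePosixPath(p) for p in paths_with_description))
    while queue:
        p = queue.popleft()
        # skip nested datasets: some already-known root is among p's parents
        if any(root in p.parents for root in bids_roots):
            continue
        bids_roots.append(p)
    return bids_roots

roots = find_bids_roots(["/ds", "/ds/derivatives/fmriprep", "/other"])
assert roots == [PurePosixPath("/ds"), PurePosixPath("/other")]
```

Because parents sort before children, the membership test is order-safe, which is the ordering guarantee the review asks about.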
IMHO it is OK to fail them if they do not follow the schema of common derivatives, which is part of BIDS.

- do we have a use case which somehow demands us to pretend that nested BIDS is not BIDS?
- if not, I would prefer for us to stay consistent and treat BIDS datasets as BIDS datasets, nested or not, because otherwise we would then need to treat every file under derivatives as either "I don't care, legit" or "illegal" -- that situation is in no way better IMHO.
dandi/tests/fixtures.py (outdated):

```python
    bids_dataset_path, dandiset_metadata_file
)
with open(dandiset_metadata_file_path, "w") as f:
    f.write(" \n")
```
this is a generic `get_gitrepo_fixture`. I really would prefer it to do exactly that and not change the cloned git repo unconditionally! It feels like you want to create some tuned-up fixture based on this; give it some descriptive name/docstring on what side effect it creates on top of a clone. But it is also not clear to me what the goal is here -- make every subfolder into a dandiset? Or maybe add it as an option to this fixture, e.g. `make_subfolders_into_dandisets=False` (or whatever would fit the purpose here), and enable this code block conditionally on that.
done in 6bfb520
dandi/tests/fixtures.py (outdated):

```python
    bids_dataset_path, dandiset_metadata_file
)
with open(dandiset_metadata_file_path, "w") as f:
    f.write(" \n")
```
- see the comment above in another similar fixture
- I have a severe allergy to code duplication. Please make a helper function (e.g. `_make_each_folder_into_dandiset(path)` or alike) and use it in those two spots.
ok, same as above done in 6bfb520
```diff
@@ -236,11 +236,11 @@ def test_find_dandi_files_with_bids(tmp_path: Path) -> None:
         dandiset_path=tmp_path,
         bids_dataset_description_ref=ANY,
     ),
-    BIDSDatasetDescriptionAsset(
+    GenericBIDSAsset(
```
so a nested BIDS dataset would not be considered a BIDS dataset? (nothing in the PR description hints at the change in behavior for this aspect)
True, though a lot of stuff has happened that's not in the PR description. I added that one, though, since it is one of the bigger things.
dandi/tests/test_validate.py (outdated):

```diff
-def test_report_path(bids_examples, tmp_path):
+def test_validate_bids_onefile(bids_error_examples, tmp_path):
+    """
+    Dedicated test using to single-file validation.
```
note to myself: unlike `nose`, `pytest` doesn't use the docstring as the "name" of the test while running, so there is no obfuscation and we are indeed free to provide docstrings giving details on tests. Good.
Changed the title from "`bids_validate` into `validate`" to "`bids_validate` into `validate`, deprecating nested BIDS".
Force-pushed from 60f7f97 to c6d6678.
Co-authored-by: John T. Wodder II <jwodder@users.noreply.github.com>
Co-authored-by: John T. Wodder II <jwodder@users.noreply.github.com>
…c assets that don't start new BIDS datasets
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Force-pushed from 92a908c to 8d7068b.
@yarikoptic looking good?
minor code tune-up + bigger question again on why there is a change in the treatment of (possibly) nested BIDS datasets. Is it somehow demanded by the use case or some other cause? If not, we better keep them as BIDS datasets and validate them, even though they might indeed fail now.
```python
report=report,
report_path=report_path,
schema_version=schema,
warnings.filterwarnings("default")
```
interesting -- without this I was shown no `DeprecationWarning`, odd... @jwodder, do you know why?
yep. took me a while to figure it out as well...
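For context on why the warning was invisible: by default, modern CPython's warning filters ignore `DeprecationWarning` raised outside `__main__`, and `warnings.filterwarnings("default")` resets the filter so each unique warning is shown once. A small demonstration:

```python
# Demonstrate DeprecationWarning visibility under different filters.
import warnings

def old_api():
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)

# Mimic the default behavior: DeprecationWarning suppressed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", DeprecationWarning)
    old_api()
    assert caught == []  # nothing shown

# With filterwarnings("default") the warning becomes visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.filterwarnings("default")
    old_api()
    assert len(caught) == 1
    assert issubclass(caught[0].category, DeprecationWarning)
```

This matches the behavior observed in the review: without the `filterwarnings("default")` call, the deprecation message never reaches the user.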
Force-pushed from 8767452 to 1c35017.
@yarikoptic The upstream …
Force-pushed from 3a3cf14 to 2cd050e.
Force-pushed from 2cd050e to b4e451f.
Changed the title from "`bids_validate` into `validate`, deprecating nested BIDS" to "`bids_validate` into `validate`, stop validating nested BIDS".
ok, let's proceed! Thank you @TheChymera and @jwodder!
Long time coming; a few issues to even out. This will affect both the command-line interface and internal validation as part of other commands. (This also includes some minor fixes for issues I came across, including addressing the lack of testing for "vanilla" NWB validation.)