-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
updating parse age to cover more formats #633
Conversation
Codecov Report
@@ Coverage Diff @@
## master #633 +/- ##
==========================================
+ Coverage 83.62% 83.90% +0.28%
==========================================
Files 59 59
Lines 5379 5474 +95
==========================================
+ Hits 4498 4593 +95
Misses 881 881
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
@satra @yarikoptic - i reviewed the "age" field from raw data and tried to cover all "reasonable" examples, you can see tests I added to |
@@ -95,9 +95,100 @@ def get_metadata(path): | |||
return meta | |||
|
|||
|
|||
def _parse_iso8601(age): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here are the regex patterns we use in reproschema. for at least iso8601, we should cover the formal patterns for duration definition.
https://github.com/ReproNim/reproschema/blob/master/validation/reproschema-shacl.ttl#L327
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm checking the regex as it seems to fail for something specific, but in general the following age patterns should be supported:
https://en.wikipedia.org/wiki/ISO_8601#Durations
<duration> # This would be pure age
<start>/<duration> # This is for encoding date of birth/gestation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry the error was my mistake. this should help for the duration pattern:
In [20]: pat = "^P(?!$)(\\d+(?:\\.\\d+)?Y)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?W)?(\\d+(?:\\.\\d
...: +)?D)?(T(?=\\d)(\\d+(?:\\.\\d+)?H)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?S)?)?$"
In [21]: re.match(pat, "P149D")
Out[21]: <re.Match object; span=(0, 5), match='P149D'>
In [22]: re.findall(pat, "P149D")
Out[22]: [('', '', '', '149D', '', '', '', '')]
In [26]: re.findall(pat, "P0Y2M1W149DT23H3M6S")
Out[26]: [('0Y', '2M', '1W', '149D', 'T23H3M6S', '23H', '3M', '6S')]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, can move to this, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't cover "P149.D" - is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it also doesn't change "P149.0D" to "P149D", but that's perhaps is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not valid iso8601 so that should be ok. i think anything that is close but not ok, could be converted and then verified with the regex. i think my above comment was mostly for ensuring that the age format we use corresponds to one of those two iso8601 cases.
- if input is valid for either case:
add to age as is - try converting and repeat step 1.
- if all conversions fail, raise an error and ask user to fix.
i also suspect that we need to somehow indicate in nwb the age reference (right now we are using gestat* to determine alternate or assigning birth reference). the ideal will be when we don't need to modify at all, just validate iso8601 + reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would disagree that the pattern you posted covers all valid ISO8601 duration values and only valid ones. At least not how I understand the wikipedia description, e.g. "P0,5Y" is a valid value that is not covered, and "P0.6Y2D" is not a valid value that would work with the pattern. But I understand that we try to cover as many format as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just not clear to me should I be fixing values like this "P0.6Y2D" or we just keep them
we should include this. @djarecka - where does this stand? at least this should merge master now. |
This pull request introduces 1 alert when merging 397c53a into 02ee314 - view on LGTM.com new alerts:
|
I think I was checked this on some of the dandisets, but not all of the dandiset. I could still do it, but general comments are welcome. I included tests that can explain what and how is converted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a testing zealot me just added a few comments, otherwise, given already a nice tests coverage seems to be LGTM
if "." in matchstr.group(1): | ||
qty = float(matchstr.group(1)) | ||
if int(qty) == qty: | ||
qty = int(qty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code block of if
isn't tests covered. I've not read code to detail yet -- do you @djarecka see a quick test case to add to tests to catch this one as well, just to be on a safe side? (code looks logical to me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
dandi/metadata.py
Outdated
raise ValueError( | ||
f"ISO 8601 expected, but {age} was received," | ||
f"no rules to parse {age_rem}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you could add a test to cover this exception raise so whenever we change code and rename some variables exception raising doesn't raise some formatting exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the end I'm not sure what case I was trying to cover here... I guess all wrong values go to the next exception (that is tested)
dandi/metadata.py
Outdated
age_rem = age.replace(matchstr.group(0), "") | ||
age_rem = age_rem.strip() | ||
if age_rem and age_rem[0] in [",", ";", "."]: | ||
age_rem = age_rem[1:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there an easy test case to add to cover this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is already covered somewhere else, so no if needed here
@yarikoptic - I will improve the test coverage. But you (or others) can also see what "age format" should be convertible, I tried to take my cases from the current dandisets. |
@yarikoptic - i think this also relates to the #101 . in the ideal world we should not have to extract metadata, simply map directly and make them fix at source. but i think this kind of validation is a bigger issue for existing datasets. so we should perhaps identify issues and work with providers to fix. also age can be computed from date of birth. @djarecka - is there a notion of this could not convert and will raise an exception? |
@satra - yes, I believe everything what is not converted will rasie an exception, but there are multiple exceptions that I added, so it won't be always the same, see here btw. I'm still a bit confused about the usage of floats in the standard. What was the document you were reading during one of the meeting, can't find it, and don't see an answer in the summaries |
here is the doc: https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0038_iso_wd_8601-1_2016-02-16.pdf relevant section is 4.4.3.2 |
@satra - thanks! this document says that the decimal part can be in the lowest order component (without specifying that it has to be in the Time part), so it is consistent with wikipedia. I've added an extra check, so should be fine |
are you allowing the "comma is preferred" mode in addition to decimals? |
@satra - I allow for comma in the iso format (test "P2DT10,5H" added), but I do convert it to the format with dot, so would return "P2DT10.5H". |
i would say for now that is fine. |
No description provided.