Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updating parse age to cover more formats #633

Merged
merged 9 commits into from
Jun 11, 2021

Conversation

djarecka
Copy link
Member

No description provided.

@djarecka djarecka marked this pull request as draft May 11, 2021 04:32
@codecov
Copy link

codecov bot commented May 11, 2021

Codecov Report

Merging #633 (f1f70b1) into master (02ee314) will increase coverage by 0.28%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #633      +/-   ##
==========================================
+ Coverage   83.62%   83.90%   +0.28%     
==========================================
  Files          59       59              
  Lines        5379     5474      +95     
==========================================
+ Hits         4498     4593      +95     
  Misses        881      881              
Flag Coverage Δ
unittests 83.90% <100.00%> (+0.28%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
dandi/metadata.py 78.50% <100.00%> (+8.50%) ⬆️
dandi/tests/test_metadata.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02ee314...f1f70b1. Read the comment docs.

@djarecka
Copy link
Member Author

@satra @yarikoptic - i reviewed the "age" field from raw data and tried to cover all "reasonable" examples, you can see tests I added to test_parse_age and I added examples that would fail

@@ -95,9 +95,100 @@ def get_metadata(path):
return meta


def _parse_iso8601(age):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here are the regex patterns we use in reproschema. for at least iso8601, we should cover the formal patterns for duration definition.

https://github.com/ReproNim/reproschema/blob/master/validation/reproschema-shacl.ttl#L327

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm checking the regex as it seems to fail for something specific, but in general the following age patterns should be supported:

https://en.wikipedia.org/wiki/ISO_8601#Durations

<duration>  # This would be pure age
<start>/<duration>  # This is for encoding date of birth/gestation

Copy link
Member

@satra satra May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry the error was my mistake. this should help for the duration pattern:

In [20]: pat = "^P(?!$)(\\d+(?:\\.\\d+)?Y)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?W)?(\\d+(?:\\.\\d
    ...: +)?D)?(T(?=\\d)(\\d+(?:\\.\\d+)?H)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?S)?)?$"

In [21]: re.match(pat, "P149D")
Out[21]: <re.Match object; span=(0, 5), match='P149D'>

In [22]: re.findall(pat, "P149D")
Out[22]: [('', '', '', '149D', '', '', '', '')]

In [26]: re.findall(pat, "P0Y2M1W149DT23H3M6S")
Out[26]: [('0Y', '2M', '1W', '149D', 'T23H3M6S', '23H', '3M', '6S')]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can move to this, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't cover "P149.D" - is that ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also doesn't change "P149.0D" to "P149D", but that's perhaps is better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not valid iso8601 so that should be ok. i think anything that is close but not ok, could be converted and then verified with the regex. i think my above comment was mostly for ensuring that the age format we use corresponds to one of those two iso8601 cases.

  1. if input is valid for either case:
    add to age as is
  2. try converting and repeat step 1.
  3. if all conversions fail, raise an error and ask user to fix.

i also suspect that we need to somehow indicate in nwb the age reference (right now we are using gestat* to determine alternate or assigning birth reference). the ideal will be when we don't need to modify at all, just validate iso8601 + reference.

Copy link
Member Author

@djarecka djarecka May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would disagree that the pattern you posted covers all valid ISO8601 duration values and only valid ones. At least not how I understand the wikipedia description, e.g. "P0,5Y" is a valid value that is not covered, and "P0.6Y2D" is not a valid value that would work with the pattern. But I understand that we try to cover as many format as possible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just not clear to me should I be fixing values like this "P0.6Y2D" or we just keep them

@yarikoptic
Copy link
Member

what is the plan for this one @satra and @djarecka ?

@satra
Copy link
Member

satra commented Jun 9, 2021

we should include this. @djarecka - where does this stand? at least this should merge master now.

@satra satra marked this pull request as ready for review June 9, 2021 17:50
@lgtm-com
Copy link

lgtm-com bot commented Jun 9, 2021

This pull request introduces 1 alert when merging 397c53a into 02ee314 - view on LGTM.com

new alerts:

  • 1 for Module is imported more than once

@djarecka
Copy link
Member Author

djarecka commented Jun 9, 2021

I think I was checked this on some of the dandisets, but not all of the dandiset. I could still do it, but general comments are welcome. I included tests that can explain what and how is converted

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a testing zealot me just added a few comments, otherwise, given already a nice tests coverage seems to be LGTM

if "." in matchstr.group(1):
qty = float(matchstr.group(1))
if int(qty) == qty:
qty = int(qty)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code block of if isn't tests covered. I've not read code to detail yet -- do you @djarecka see a quick test case to add to tests to catch this one as well, just to be on a safe side? (code looks logical to me)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

raise ValueError(
f"ISO 8601 expected, but {age} was received,"
f"no rules to parse {age_rem}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you could add a test to cover this exception raise so whenever we change code and rename some variables exception raising doesn't raise some formatting exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the end I'm not sure what case I was trying to cover here... I guess all wrong values go to the next exception (that is tested)

age_rem = age.replace(matchstr.group(0), "")
age_rem = age_rem.strip()
if age_rem and age_rem[0] in [",", ";", "."]:
age_rem = age_rem[1:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an easy test case to add to cover this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is already covered somewhere else, so no if needed here

@djarecka
Copy link
Member Author

djarecka commented Jun 9, 2021

@yarikoptic - I will improve the test coverage. But you (or others) can also see what "age format" should be convertible, I tried to take my cases from the current dandisets.

@satra
Copy link
Member

satra commented Jun 9, 2021

@yarikoptic - i think this also relates to the #101 . in the ideal world we should not have to extract metadata, simply map directly and make them fix at source. but i think this kind of validation is a bigger issue for existing datasets. so we should perhaps identify issues and work with providers to fix. also age can be computed from date of birth.

@djarecka - is there a notion of this could not convert and will raise an exception?

@djarecka
Copy link
Member Author

djarecka commented Jun 9, 2021

@satra - yes, I believe everything what is not converted will rasie an exception, but there are multiple exceptions that I added, so it won't be always the same, see here

btw. I'm still a bit confused about the usage of floats in the standard. What was the document you were reading during one of the meeting, can't find it, and don't see an answer in the summaries

@satra
Copy link
Member

satra commented Jun 10, 2021

here is the doc: https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0038_iso_wd_8601-1_2016-02-16.pdf

relevant section is 4.4.3.2

@djarecka
Copy link
Member Author

@satra - thanks! this document says that the decimal part can be in the lowest order component (without specifying that it has to be in the Time part), so it is consistent with wikipedia. I've added an extra check, so should be fine

@satra
Copy link
Member

satra commented Jun 10, 2021

are you allowing the "comma is preferred" mode in addition to decimals?

@djarecka
Copy link
Member Author

@satra - I allow for comma in the iso format (test "P2DT10,5H" added), but I do convert it to the format with dot, so would return "P2DT10.5H".
I do not allow comma in the more natural human format, e.g. "3,5 weeks" is not allowed. Let me know if you want me to change it

@satra
Copy link
Member

satra commented Jun 10, 2021

i would say for now that is fine.

@satra satra merged commit 121ac29 into dandi:master Jun 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants