Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes behaviour for incomplete lines when recover_with_nulls is enabled #14252

Conversation

elstehle
Copy link
Contributor

@elstehle elstehle commented Oct 5, 2023

Description

Closes #14227. Adapts the behaviour of the JSON finite-state transducer (FST) when recover_with_nulls is true to be more strict and reject lines that contain incomplete JSON objects (aka records) or JSON arrays (aka lists).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested a review from a team as a code owner October 5, 2023 14:53
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 5, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 5, 2023
@elstehle elstehle added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Oct 5, 2023
@elstehle elstehle changed the title Fix/recovering json lines incomplete lines Fixes behaviour for incomplete lines when recover_with_nulls is enabled Oct 5, 2023
@vuule
Copy link
Contributor

vuule commented Oct 5, 2023

/ok to test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you leave a comment on the changes in logic in the FST (how does this translate into the intended behavior change)? I can glean some logic from the code, but it's possibly made up :D

Copy link
Contributor Author

@elstehle elstehle Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added comments to the code (get_translation_table and the nl_tokens lambda).

Essentially, we just changed the behaviour when we're in recover_from_error, such that we fail if the JSON line is an incomplete JSON value. E.g.,

  • when we see a newline while we are still parsing a LIST or a STRUCT, we emit an ErrorBegin token that will mark this incomplete JSON line as invalid.
  • when we see a newline while we are still parsing a string or field name (e.g., {"a":"123\n) , we emit an ErrorBegin token that will mark this incomplete JSON line as invalid.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comments.
It's still challenging to understand case-to-case why regular_tokens and recovering_tokens might be different, but I think I got it.

@karthikeyann
Copy link
Contributor

/ok to test

@elstehle
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 15baa00 into rapidsai:branch-23.12 Oct 11, 2023
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] JSON reader with recover_with_null fails to detect incomplete JSON as invalid
4 participants