Result File Dataclasses #166

Open · wants to merge 87 commits into main
Conversation

@mawelborn (Contributor) commented Sep 21, 2023

This PR adds a new set of dataclasses for working with result files as native Python types rather than nested dicts and lists.

  • Python 3.8+ compatible, fully type checked, with high unit test coverage
  • Supports result file versions 1 and 3 with a consistent API
  • Supports loading results from a dict, JSON str, or Path
  • Provides an API for sorting, filtering, grouping, and modifying predictions, including:
    • predictions.where()
    • predictions.groupby()
    • predictions.orderby()
    • predictions.apply()
  • Provides functionality to accept/reject predictions and produce a changes dict suitable for SubmitReview.

See the example script for sample use cases and attribute references.
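For orientation, here is a minimal sketch of how these pieces might fit together. The `results.load()` entry point, the keyword arguments to `where()` and `orderby()`, the `accepted` attribute, and the placement of `to_changes()` on the result object are assumptions for illustration only; the example scripts in this PR are the authoritative reference.

```python
from pathlib import Path

from indico_toolkit import results  # new module added by this PR

# Load a result file. Per the PR description, a dict, JSON str, or Path is
# accepted; `results.load()` is a hypothetical entry point name.
result = results.load(Path("results/submission_123.json"))

document = result.document
predictions = result.predictions

# Filter, group, and sort predictions with the fluent API described above.
# The argument names here are illustrative, not confirmed.
invoice_numbers = predictions.where(label="Invoice Number", min_confidence=0.95)
by_label = predictions.groupby("label")
ordered = predictions.orderby("confidence", reverse=True)

# Modify predictions in bulk; the `accepted` attribute is assumed.
predictions.apply(lambda prediction: setattr(prediction, "accepted", True))

# Produce a changes dict suitable for SubmitReview. Whether to_changes()
# lives on the result object or elsewhere is an assumption here.
changes = result.to_changes()
```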

The dataclasses have been added as a new indico_toolkit.results module rather than replacing indico_toolkit.types as originally planned. The existing types module is used in many places, and updating its dependents would make this already-large PR much bigger. As a standalone PR this will be easier to review, and types can be replaced piecemeal as needed in future PRs.

mawelborn and others added 30 commits on September 20, 2023 at 18:36
@mawelborn (Contributor, Author)

Looks like there's an issue with one of the datasets used in a fixture (it may have been deleted in a recent try migration?), as well as a failing fileprocessing test.

Looks like the tests had been updated on main, and merging that in seems to have fixed it! I also refactored the example script and split it into three scripts to make each more individually useful.

@mawelborn (Contributor, Author)

A couple of quick updates:

  • Added document.full_text_url alongside the existing document.etl_output_url.
  • Fixed the calculation for result.rejected to account for admin reviews.
  • Made extraction.text work with 6.6+ Typed Entities changes.

For the latter, I designed the API to use the text attribute consistently across versions. It does what you want it to do 99% of the time while making Typed Entities data available if needed.

  • Getting extraction.text returns the OCR text from the predicted span.
  • Setting extraction.text updates the following keys in the output of to_changes():
    • prediction["text"] for 5.X-6.5 compatibility
    • prediction["normalized"]["text"] for consistency
    • prediction["normalized"]["formatted"] for 6.6+ compatibility
  • If needed, Typed Entities data is available in the extras attribute, alongside all other data that is not explicitly parsed:
    • extraction.extras["normalized"]["formatted"]
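A companion sketch of that text behavior. The loading call and the indexing into the filtered collection are assumed access patterns, and the changes keys in the comments simply restate the bullets above.

```python
from pathlib import Path

from indico_toolkit import results

# Hypothetical entry point and access pattern; see the example scripts
# in this PR for the real API.
result = results.load(Path("results/submission_123.json"))
extraction = result.predictions.where(label="Invoice Number")[0]

# Getting `text` returns the OCR text from the predicted span.
ocr_text = extraction.text

# Setting `text` is reflected in the output of to_changes() as:
#   prediction["text"]                     -> 5.X-6.5 compatibility
#   prediction["normalized"]["text"]       -> consistency
#   prediction["normalized"]["formatted"]  -> 6.6+ compatibility
extraction.text = ocr_text.strip()

# Typed Entities data remains available in `extras`, alongside all other
# data that is not explicitly parsed.
formatted = extraction.extras["normalized"]["formatted"]
```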

@mawelborn (Contributor, Author) commented Jul 2, 2024

I've done a significant rewrite of this PR in light of recent changes to the platform. It now aims to provide an API consistent with our v4 result file proposal. It uses IPA 6.8+ version 3 as the new baseline for result files, with partial support for IPA 6.8+ version 1 (some attributes are not present in v1). It does not attempt to support version 2 result files or the various inconsistencies of IPA 6.7 and prior, due to their limited use.

Overall the code and the API are simpler and cleaner. I've dogfooded it in a couple projects and think it's ready for broader use and feedback.
