Result File Dataclasses #166

Open · wants to merge 87 commits into main
Conversation

@mawelborn (Contributor) commented Sep 21, 2023

This PR adds a new set of dataclasses for working with result files as native Python types rather than nested dicts and lists.

  • Python 3.8+ compatible, fully type checked, with high unit test coverage
  • Supports result file versions 1 and 3 with a consistent API
  • Supports loading results from a dict, JSON str, or Path
  • Provides an API for sorting, filtering, grouping, and modifying predictions, including:
    • predictions.where()
    • predictions.groupby()
    • predictions.orderby()
    • predictions.apply()
  • Provides functionality to accept/reject predictions and produce a changes dict suitable for SubmitReview.

See the example script for sample use cases and attribute references.
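For orientation, here is a minimal sketch of how these pieces might fit together. The `results.load()` entry point, the keyword arguments to `where()` and `orderby()`, the `accepted` attribute, and the placement of `to_changes()` on the result object are assumptions for illustration only; the example scripts in this PR are the authoritative reference.

```python
from pathlib import Path

from indico_toolkit import results  # new module added by this PR

# Load a result file. Per the PR description, a dict, JSON str, or Path is
# accepted; `results.load()` is a hypothetical entry point name.
result = results.load(Path("results/submission_123.json"))

document = result.document
predictions = result.predictions

# Filter, group, and sort predictions with the fluent API described above.
# The argument names here are illustrative, not confirmed.
invoice_numbers = predictions.where(label="Invoice Number", min_confidence=0.95)
by_label = predictions.groupby("label")
ordered = predictions.orderby("confidence", reverse=True)

# Modify predictions in bulk; the `accepted` attribute is assumed.
predictions.apply(lambda prediction: setattr(prediction, "accepted", True))

# Produce a changes dict suitable for SubmitReview. Whether to_changes()
# lives on the result object or elsewhere is an assumption here.
changes = result.to_changes()
```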

The dataclasses have been added as a new indico_toolkit.results module rather than replacing indico_toolkit.types as originally planned. The existing types module is used in many places, and updating its dependents would make this already-large PR much bigger. As a standalone PR this will be easier to review, and types can be replaced piecemeal as needed in future PRs.

mawelborn and others added 30 commits on September 20, 2023 at 18:36
@mawelborn (Contributor, Author)

Looks like there's an issue with one of the datasets used in a fixture (it may have been deleted in a recent try migration?), as well as a failing fileprocessing test.

Looks like the tests had been updated on main, and merging that in seems to have fixed it! I also refactored the example script and split it into three scripts to make each more individually useful.

@mawelborn (Contributor, Author)

A couple of quick updates:

  • Added document.full_text_url alongside the existing document.etl_output_url.
  • Fixed the calculation for result.rejected to account for admin reviews.
  • Made extraction.text work with 6.6+ Typed Entities changes.

For the latter, I designed the API to use the text attribute consistently across versions. It does what you want it to do 99% of the time while making Typed Entities data available if needed.

  • Getting extraction.text returns the OCR text from the predicted span.
  • Setting extraction.text updates the following keys in the output of to_changes():
    • prediction["text"] for 5.X-6.5 compatibility
    • prediction["normalized"]["text"] for consistency
    • prediction["normalized"]["formatted"] for 6.6+ compatibility
  • If needed, Typed Entities data is available in the extras attribute, alongside all other data that is not explicitly parsed:
    • extraction.extras["normalized"]["formatted"]
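A companion sketch of that text behavior. The loading call and the indexing into the filtered collection are assumed access patterns, and the changes keys in the comments simply restate the bullets above.

```python
from pathlib import Path

from indico_toolkit import results

# Hypothetical entry point and access pattern; see the example scripts
# in this PR for the real API.
result = results.load(Path("results/submission_123.json"))
extraction = result.predictions.where(label="Invoice Number")[0]

# Getting `text` returns the OCR text from the predicted span.
ocr_text = extraction.text

# Setting `text` is reflected in the output of to_changes() as:
#   prediction["text"]                     -> 5.X-6.5 compatibility
#   prediction["normalized"]["text"]       -> consistency
#   prediction["normalized"]["formatted"]  -> 6.6+ compatibility
extraction.text = ocr_text.strip()

# Typed Entities data remains available in `extras`, alongside all other
# data that is not explicitly parsed.
formatted = extraction.extras["normalized"]["formatted"]
```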

@mawelborn (Contributor, Author) commented Jul 2, 2024

I've done a significant rewrite of this PR in light of recent changes to the platform. It now aims to provide an API consistent with our v4 result file proposal. It uses IPA 6.8+ version 3 as the new baseline for result files, with partial support for IPA 6.8+ version 1 (some attributes are not present in v1). It does not attempt to support version 2 result files or the various inconsistencies of IPA 6.7 and prior, due to their limited use.

Overall the code and the API are simpler and cleaner. I've dogfooded it in a couple projects and think it's ready for broader use and feedback.
