This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

feat: improve parsing of unlabeled CSVs #678

Merged
bwanglzu merged 9 commits into main from feat-binary-relevance-judgement on Mar 1, 2023

Conversation

LMMilliken
Contributor

@LMMilliken LMMilliken commented Feb 24, 2023

This PR adjusts the load_finetuning_dataset method so that unlabeled CSV files are parsed without producing duplicate documents.

This PR does not implement support for binary relevance judgements directly; instead, it changes the way unlabeled data is parsed to achieve the same result. The formats in which users can provide their data have not changed, only the DocumentArrays that are returned.

For unlabeled data provided in this format:

d1,d2
d1,d3
d2,d4
d3,d5
d6,d7
d7,d8
d9,d10

Previously, we would have returned a DocumentArray with 14 elements (4 of them duplicates) and 7 classes. Now we return a DocumentArray with 10 elements and 3 classes.
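
To make the numbers above easier to check, here is a minimal sketch of one way to arrive at them. It is not the code in finetuner/data.py: the helper name group_pairs and the union-find approach are assumptions chosen for illustration. The sketch treats each row as a link between two documents and puts documents that are linked, directly or through shared documents, into the same class.

```python
import csv
import io
from collections import defaultdict


def group_pairs(pairs):
    """Group documents so that linked documents share one class.

    Hypothetical helper for illustration only; not part of Finetuner.
    """
    parent = {}

    def find(doc):
        # Register unseen documents as their own root, then walk up to the root.
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path halving
            doc = parent[doc]
        return doc

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    groups = defaultdict(list)
    for doc in list(parent):
        groups[find(doc)].append(doc)
    return list(groups.values())


# The example CSV from the description above.
raw = "d1,d2\nd1,d3\nd2,d4\nd3,d5\nd6,d7\nd7,d8\nd9,d10\n"
pairs = [tuple(row) for row in csv.reader(io.StringIO(raw))]

classes = group_pairs(pairs)
n_docs = sum(len(group) for group in classes)
print(n_docs, len(classes))  # 10 3
```

Each document appears once in the result, so the four repeated entries (d1, d2, d3 and d7) disappear, and the seven rows collapse into the three groups {d1, d2, d3, d4, d5}, {d6, d7, d8} and {d9, d10}.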


  • This PR references an open issue
  • I have added a line about this change to CHANGELOG

@LMMilliken LMMilliken marked this pull request as draft February 24, 2023 10:12
@LMMilliken LMMilliken linked an issue Feb 24, 2023 that may be closed by this pull request
@LMMilliken LMMilliken self-assigned this Feb 24, 2023

@matousek-martin matousek-martin left a comment

Tried it on the modified xmarket data, lgtm

@LMMilliken LMMilliken force-pushed the feat-binary-relevance-judgement branch from 90d1f1f to feef2c1 Compare February 24, 2023 14:08
Member

@guenthermi guenthermi left a comment

Nice work, added some comments

docs/walkthrough/create-training-data.md (outdated, resolved)
docs/walkthrough/create-training-data.md (outdated, resolved)
`This is an English sentence`, `Das ist ein englischer Satz` and `Dit is een Engelse zin` will all be given the same label.

Member

Maybe it is better to roll out the tabs, since they do not just show the same thing with small modifications. The content is rather different in each tab.

Contributor Author

I'm not sure I understand. Do you mean use tabs for all the example CSVs?

Member

I mean rather remove all tabs

finetuner/data.py (outdated, resolved)
@CatStark CatStark mentioned this pull request Feb 28, 2023
5 tasks
Member

@bwanglzu bwanglzu left a comment

lgtm!

@github-actions github-actions bot added size/m and removed size/s labels Mar 1, 2023
github-actions bot commented Mar 1, 2023

📝 Docs are deployed on https://ft-feat-binary-relevance-judgement--jina-docs.netlify.app 🎉

Member

@guenthermi guenthermi left a comment

LGTM

@bwanglzu bwanglzu merged commit a208fcd into main Mar 1, 2023
@bwanglzu bwanglzu deleted the feat-binary-relevance-judgement branch March 1, 2023 16:57
Successfully merging this pull request may close these issues.

Add support for CSVs based on binary relevance judgement
4 participants