feat: improve parsing of unlabeled CSVs #678

LMMilliken · 2023-02-24T10:12:30Z

This PR adjusts the load_finetuning_dataset method to parse unlabeled CSV files in a way that removes duplicate documents.

This PR does not implement support for binary relevance judgement directly; instead, it changes the way unlabeled data is parsed to achieve the same result. The formats in which users can provide their data have not changed, only the DocumentArrays that are returned.

For unlabeled data provided in this format:

d1,d2
d1,d3
d2,d4
d3,d5
d6,d7
d7,d8
d9,d10

Previously, we would have returned a DocumentArray with 14 elements (4 of them duplicates) and 7 classes. Now we return a DocumentArray with 10 elements and 3 classes.
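To make the counts above concrete, here is a minimal sketch that reproduces them for the example CSV. It is not the code in finetuner/data.py: the helper `group_pairs` is hypothetical, and grouping transitively linked documents with a union-find structure is only an assumption about one way to arrive at 10 elements and 3 classes.

```python
import csv
import io
from collections import defaultdict


def group_pairs(rows):
    """Hypothetical helper: put documents linked by any row into one class."""
    parent = {}

    def find(doc):
        # Register unseen documents as their own root, then walk up to the root.
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path halving
            doc = parent[doc]
        return doc

    for left, right in rows:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two classes

    groups = defaultdict(list)
    for doc in list(parent):
        groups[find(doc)].append(doc)
    return list(groups.values())


# The example data from the PR description.
CSV_TEXT = """d1,d2
d1,d3
d2,d4
d3,d5
d6,d7
d7,d8
d9,d10
"""

rows = list(csv.reader(io.StringIO(CSV_TEXT)))
groups = group_pairs(rows)

# Old behaviour: every cell becomes an element, every row its own class.
old_elements = sum(len(row) for row in rows)
old_classes = len(rows)

# New behaviour (under the assumption above): unique documents, merged classes.
new_elements = sum(len(group) for group in groups)
new_classes = len(groups)

print(old_elements, old_classes)  # 14 7
print(new_elements, new_classes)  # 10 3
```

The three resulting classes are {d1, d2, d3, d4, d5}, {d6, d7, d8} and {d9, d10}; the four duplicates in the old output are the repeated occurrences of d1, d2, d3 and d7.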

This PR references an open issue
I have added a line about this change to CHANGELOG

matousek-martin

Tried it on the modified xmarket data, lgtm

guenthermi

Nice work, added some comments

docs/walkthrough/create-training-data.md

guenthermi · 2023-02-28T08:35:50Z

+`This is an English sentence`, `Das ist ein englischer Satz` and `Dit is een Engelse zin` will all be given the same label. 
+
+````
+


Maybe it is better to roll out the tabs, since they are not just the same thing with small modifications. The content is rather different in all tabs.

I'm not sure I understand. Do you mean use tabs for all the example CSVs?

I mean rather remove all tabs.

finetuner/data.py

bwanglzu

lgtm!

github-actions · 2023-03-01T15:29:40Z

📝 Docs are deployed on https://ft-feat-binary-relevance-judgement--jina-docs.netlify.app 🎉

guenthermi

LGTM

feat: improve parsing of unlabeled CSVs

98e7d5e

LMMilliken marked this pull request as draft February 24, 2023 10:12

LMMilliken linked an issue Feb 24, 2023 that may be closed by this pull request

Add support for CSVs based on binary relevance judgement #632

Closed

github-actions bot added size/s area/core labels Feb 24, 2023

LMMilliken self-assigned this Feb 24, 2023

lmmilliken added 2 commits February 24, 2023 11:38

docs: update documentation on creating training data

36fc7a3

chore: update changelog

407e548

LMMilliken marked this pull request as ready for review February 24, 2023 10:41

github-actions bot added the area/docs label Feb 24, 2023

LMMilliken requested review from gmastrapas, guenthermi, bwanglzu and matousek-martin February 24, 2023 10:44

matousek-martin approved these changes Feb 24, 2023

docs: try to build docs

feef2c1

LMMilliken force-pushed the feat-binary-relevance-judgement branch from 90d1f1f to feef2c1 Compare February 24, 2023 14:08

docs: fix typo

88303bf

guenthermi suggested changes Feb 28, 2023

CatStark mentioned this pull request Feb 28, 2023

(Release 0.7.2) To-do list #682

Closed

5 tasks

feat: only assign new labels to the first colunm

07d2c37

LMMilliken requested a review from guenthermi February 28, 2023 13:04

Merge branch 'main' into feat-binary-relevance-judgement

efc5e84

bwanglzu approved these changes Mar 1, 2023

docs: move content out of tabs

e0cd70a

github-actions bot added size/m and removed size/s labels Mar 1, 2023

docs: update example

3a4034d

guenthermi approved these changes Mar 1, 2023

bwanglzu merged commit a208fcd into main Mar 1, 2023

bwanglzu deleted the feat-binary-relevance-judgement branch March 1, 2023 16:57
