
load_with_spacy() in wiki_ann.py returns a total of 10 sentences #13

Closed
Lasmali opened this issue Oct 16, 2019 · 3 comments

@Lasmali

Lasmali commented Oct 16, 2019

Is there a reason why most of the data from WikiAnn is excluded here?

all_sents = file_as_json[0]['paragraphs'][0]['sentences']

Instead of the expected 95,930 sentences, it returns a GoldCorpus with only 10 sentences.

@hvingelby self-assigned this Oct 18, 2019
@hvingelby
Contributor

Hi @Lasmali,

I tried to replicate this problem by removing the downloaded dataset folder ~/.danlp/wikiann and then setting a breakpoint on line 50 to see the length of all_sents. I found the length to be 95924.

Have you tried clearing the wikiann dataset folder?

@Lasmali
Author

Lasmali commented Oct 24, 2019

I believe the error is caused by an update to spaCy's conll_ner2json() function, specifically explosion/spaCy#4186. It is quite recent, which explains why we are getting different results. The change makes conll_ner2json() group sentences into documents of 10 by default via the n_sents parameter.

The newer spaCy version of conll_ner2json returns a list of length 9593, each element holding 10 sentences, so indexing into file_as_json effectively gives you only 10 sentences. I suspect your file_as_json is a singleton list whose single entry holds all 95,930 sentences, hence the file_as_json[0] line.
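As a quick sanity check (my own snippet, not part of danlp, and assuming the converter output follows spaCy's JSON training format of documents containing paragraphs containing sentences), counting across all documents gives the full number back:

    n_docs = len(file_as_json)                    # 9593 documents of up to 10 sentences each
    n_sents = sum(len(p['sentences'])
                  for doc in file_as_json
                  for p in doc['paragraphs'])     # roughly 95930 sentences in total
    print(n_docs, n_sents)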

I think a good fix would be to stop relying on sklearn for splitting the dataset and just return the spaCy GoldCorpus as is. That gives users a reliable experience that follows whatever standard spaCy sets, and it also removes the dependency on sklearn. It would be quite trivial for users to split their data into training and validation themselves. What do you think?

def load_with_spacy(self):
    ...
    ...
    file_as_string = file.read()
    file_as_json = conll_ner2json(file_as_string)  # has n_sents=10 as default in update

    # file_as_json is of length 9593 with each entry having 10 sentences
    all_sents = file_as_json[0]['paragraphs'][0]['sentences']  # only 10 sentences are extracted
    train_sents, dev_sents = train_test_split(all_sents, test_size=0.3, random_state=42)
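
For reference, the change I have in mind would look roughly like this (my own sketch, assuming the same spaCy JSON training format of documents, paragraphs and sentences as above):

    # collect sentences from every document and paragraph instead of only file_as_json[0]
    all_sents = [sent
                 for doc in file_as_json
                 for paragraph in doc['paragraphs']
                 for sent in paragraph['sentences']]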

@hvingelby
Contributor

I tried it out with spaCy v2.1.4, and after upgrading my spaCy version I also get the 10 sentences you mention.
To catch this kind of issue earlier, I think our requirements.txt should fetch the newest versions of the libraries so that CI would detect it.

I agree that we should not rely on sklearn for this. However, I believe we would still have to perform the dev/train split, because otherwise we would not be able to instantiate the GoldCorpus. One solution would be to let the user pass a splitting function and provide a simple default that does not depend on any other framework. What do you think of this solution?
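
Something like this is what I have in mind for the default (just a sketch; the name and signature are made up, and it only uses the standard library):

    import random

    def simple_train_dev_split(sentences, dev_fraction=0.3, seed=42):
        # shuffle a copy deterministically and split it, no sklearn needed
        shuffled = list(sentences)
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * (1 - dev_fraction))
        return shuffled[:cut], shuffled[cut:]

    # e.g. train_sents, dev_sents = simple_train_dev_split(all_sents)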
