
load_with_spacy() in wiki_ann.py returns a total of 10 sentences #13

Closed
Lasmali opened this issue Oct 16, 2019 · 3 comments

@Lasmali

Lasmali commented Oct 16, 2019

Is there a reason why most of the data from WikiAnn is excluded here?

all_sents = file_as_json[0]['paragraphs'][0]['sentences']

Instead of the expected 95,930 sentences, it returns a GoldCorpus with only 10 sentences.

@hvingelby self-assigned this Oct 18, 2019
@hvingelby
Contributor

Hi @Lasmali,

I tried to replicate this problem by removing the downloaded dataset folder ~/.danlp/wikiann and then setting a breakpoint on line 50 to see the length of all_sents. I found the length to be 95924.

Have you tried clearing the wikiann dataset folder?

@Lasmali
Author

Lasmali commented Oct 24, 2019

I believe the error is caused by an update to spaCy's conll_ner2json() function, specifically explosion/spaCy#4186. It is quite recent, which explains why we are getting different results. The change makes conll_ner2json() group sentences into documents of 10 by default via the n_sents parameter.

The newer spaCy version of conll_ner2json returns a list of length 9593, each element holding 10 sentences, so indexing into file_as_json effectively gives you only 10 sentences. I suspect your file_as_json is a singleton list whose single entry holds all 95,930 sentences, hence the file_as_json[0] line.
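As a quick sanity check (my own snippet, not part of danlp, and assuming the converter output follows spaCy's JSON training format of documents containing paragraphs containing sentences), counting across all documents gives the full number back:

    n_docs = len(file_as_json)                    # 9593 documents of up to 10 sentences each
    n_sents = sum(len(p['sentences'])
                  for doc in file_as_json
                  for p in doc['paragraphs'])     # roughly 95930 sentences in total
    print(n_docs, n_sents)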

I think a good fix would be to stop relying on sklearn for splitting the dataset and just return the spaCy GoldCorpus as is. That gives users a reliable experience that follows whatever standard spaCy sets, and it also removes the dependency on sklearn. It would be quite trivial for users to split their data into training and validation themselves. What do you think?

def load_with_spacy(self):
    ...
    ...
    file_as_string = file.read()
    file_as_json = conll_ner2json(file_as_string)  # has n_sents=10 as default in update

    # file_as_json is of length 9593 with each entry having 10 sentences
    all_sents = file_as_json[0]['paragraphs'][0]['sentences']  # only 10 sentences are extracted
    train_sents, dev_sents = train_test_split(all_sents, test_size=0.3, random_state=42)
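
For reference, the change I have in mind would look roughly like this (my own sketch, assuming the same spaCy JSON training format of documents, paragraphs and sentences as above):

    # collect sentences from every document and paragraph instead of only file_as_json[0]
    all_sents = [sent
                 for doc in file_as_json
                 for paragraph in doc['paragraphs']
                 for sent in paragraph['sentences']]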

@hvingelby
Contributor

I tried it out with spaCy v2.1.4, and after upgrading my spaCy version I also get the 10 sentences you mention.
To catch this kind of issue earlier, I think our requirements.txt should fetch the newest versions of the libraries so that CI would detect it.

I agree that we should not rely on sklearn for this. However, I believe we would still have to perform the dev/train split, because otherwise we would not be able to instantiate the GoldCorpus. One solution would be to let the user pass a splitting function and provide a simple default that does not depend on any other framework. What do you think of this solution?
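
Something like this is what I have in mind for the default (just a sketch; the name and signature are made up, and it only uses the standard library):

    import random

    def simple_train_dev_split(sentences, dev_fraction=0.3, seed=42):
        # shuffle a copy deterministically and split it, no sklearn needed
        shuffled = list(sentences)
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * (1 - dev_fraction))
        return shuffled[:cut], shuffled[cut:]

    # e.g. train_sents, dev_sents = simple_train_dev_split(all_sents)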
