This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Question about experiment #5

Open
xuchaoUCAS opened this issue Mar 5, 2020 · 1 comment

@xuchaoUCAS

I don't understand the purpose of this experiment.
There are too many errors in the gold set.
For example, in europarl-v7.de-en.en.sentences.test.gold:
line 73: I am happy to try and answer, Mr Wijsenbeek. As you will certainly know,……. Here "I am happy to try and answer, Mr Wijsenbeek." is obviously a single sentence, but the gold set doesn't mark it as one.
Similar cases:
lines 130, 175, ... and too many others to list.
So I don't understand what "sentence boundary detection" means for this dataset.
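
To make the kind of error described above concrete, here is a minimal sketch (not part of the repository) that flags lines in a one-sentence-per-line gold file that probably contain more than one sentence. The function name `suspicious_lines` and the regular expression are assumptions for illustration only. The heuristic is crude and will also flag abbreviations such as "Mr. Smith".

```python
import re

# Heuristic (an assumption, not the repository's method): sentence-final
# punctuation, optional closing quotes or brackets, whitespace, then an
# uppercase letter somewhere inside the line.
_INTERNAL_BOUNDARY = re.compile(r'[.!?]["\')\]]*\s+(?=[A-Z])')


def suspicious_lines(lines):
    """Yield (line_number, text) for lines that may hold more than one sentence."""
    for number, text in enumerate(lines, start=1):
        stripped = text.strip()
        if _INTERNAL_BOUNDARY.search(stripped):
            yield number, stripped


# Hypothetical sample; the first entry mirrors the line quoted above, with
# the elided part left elided.
sample = [
    "I am happy to try and answer, Mr Wijsenbeek. As you will certainly know, ...",
    "The debate is closed.",
]
flagged = list(suspicious_lines(sample))
```

Running the same function over the lines of the actual gold file would give a rough count of affected lines, which each need a manual check because of the false positives noted above.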

@stefan-it
Collaborator

stefan-it commented Mar 11, 2020

Hi @xuchaoUCAS,

we use this kind of dataset because no 100% gold-labeled datasets are available for this task. That's why we refer to them as "quasi-segmented" datasets.

However, in preliminary experiments we used Universal Dependencies (normally used for e.g. PoS tagging). These datasets are more reliably segmented into sentences. But: they contain fewer sentences than e.g. the Europarl corpora!
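
As an illustration only (the thread does not say how the Universal Dependencies data was prepared), here is a minimal sketch of pulling one sentence per line out of a CoNLL-U file. It relies on the `# text = ...` comment that UD treebanks place before each sentence. The function name `conllu_sentences` and the sample data are hypothetical.

```python
def conllu_sentences(lines):
    """Return the raw text of each sentence found in CoNLL-U formatted lines."""
    prefix = "# text = "
    return [
        line[len(prefix):].rstrip("\n")
        for line in lines
        if line.startswith(prefix)
    ]


# Hypothetical two-sentence CoNLL-U fragment (token columns shortened).
sample_conllu = [
    "# sent_id = 1\n",
    "# text = The debate is closed.\n",
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n",
    "\n",
    "# sent_id = 2\n",
    "# text = Thank you.\n",
    "1\tThank\tthank\tVERB\t_\t_\t0\troot\t_\t_\n",
    "\n",
]
sentences = conllu_sentences(sample_conllu)
```

Taking `len(sentences)` on a whole treebank file gives the sentence count that can be compared against the number of lines in a Europarl gold file.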
