Issue training on ud-treebanks #16

Open
neetle opened this issue Mar 2, 2019 · 4 comments

neetle commented Mar 2, 2019

I'm trying to train against `ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu`, and I've noticed that every row whose head field has the value `_` causes a panic.

Any ideas on how to deal with this data within the library? Happy to submit a PR on any advice given.

```
# sent_id = answers-20111108072305AAPJTjj_ans-0005
# text = It's more compact, ISO 6400 capability (SX40 only 3200), faster lens at f/2 and the SX40 only f/2.7.
1	It	it	PRON	PRP	Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs	4	nsubj	4:nsubj	SpaceAfter=No
2	's	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	cop	4:cop	_
3	more	more	ADV	RBR	_	4	advmod	4:advmod	_
4	compact	compact	ADJ	JJ	Degree=Pos	0	root	0:root	SpaceAfter=No
5	,	,	PUNCT	,	_	8	punct	8:punct	_
6	ISO	iso	NOUN	NN	Number=Sing	8	compound	8:compound	_
7	6400	6400	NUM	CD	NumType=Card	6	nummod	6:nummod	_
8	capability	capability	NOUN	NN	Number=Sing	4	list	4:list	_
9	(	(	PUNCT	-LRB-	_	10	punct	10:punct|10.1:punct	SpaceAfter=No
10	SX40	SX40	PROPN	NNP	Number=Sing	8	parataxis	8:parataxis|10.1:nsubj	_
#dies on the following line
10.1	has	have	VERB	VBZ	_	_	_	8:parataxis	CopyOf=-1
11	only	only	ADV	RB	_	12	advmod	12:advmod	_
12	3200	3200	NUM	CD	NumType=Card	10	orphan	10.1:obj	SpaceAfter=No
13	)	)	PUNCT	-RRB-	_	10	punct	10:punct|10.1:punct	SpaceAfter=No
14	,	,	PUNCT	,	_	8	punct	8:punct	_
15	faster	faster	ADJ	JJR	Degree=Cmp	16	amod	16:amod	_
16	lens	lens	NOUN	NN	Number=Sing	4	list	4:list	_
17	at	at	ADP	IN	_	18	case	18:case	_
18	f/2	f/2	NOUN	NN	Number=Sing	16	nmod	16:nmod:at	_
19	and	and	CCONJ	CC	_	21	cc	21:cc|21.1:cc	_
20	the	the	DET	DT	Definite=Def|PronType=Art	21	det	21:det	_
21	SX40	SX40	PROPN	NNP	Number=Sing	16	conj	16:conj:and|21.1:nsubj	_
21.1	has	have	VERB	VBZ	_	_	_	16:conj:and	CopyOf=-1
22	only	only	ADJ	JJ	Degree=Pos	23	amod	23:amod	_
23	f	f	NOUN	NN	Number=Sing	21	orphan	21.1:obj	SpaceAfter=No
24	/	/	PUNCT	,	_	23	punct	23:punct	SpaceAfter=No
25	2.7	2.7	NUM	CD	NumType=Card	23	nummod	23:nummod	SpaceAfter=No
26	.	.	PUNCT	.	_	4	punct	4:punct	_
```

neetle commented Mar 4, 2019

I think this is relevant, but I'm at work so I can't verify that it's the case. If it is, I'll see if I can extend lingo to deal with gapping entries.

durp commented Jun 7, 2019

Since these are "empty nodes", isn't the expedient thing to simply skip word lines with an unspecified (_) head?
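
For illustration, the skip could look roughly like the sketch below. It is written in Go on the assumption that this is the language in use here, and `Token` and `ReadBasicTokens` are hypothetical names, not part of lingo. It relies on the CoNLL-U convention that empty nodes carry a decimal ID (`10.1`) and multiword-token ranges a hyphenated ID (`3-4`), so it filters on the ID column; both kinds of line also have `_` as their head.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Token holds the basic-dependency columns of one CoNLL-U word line.
// Hypothetical type for illustration only; not lingo's representation.
type Token struct {
	ID     int
	Form   string
	Head   int
	DepRel string
}

// ReadBasicTokens parses CoNLL-U text and returns only ordinary words.
// Comment lines, blank lines, multiword-token ranges ("3-4") and
// empty nodes ("10.1") are skipped instead of being parsed.
// Sentence boundaries are ignored to keep the sketch short.
func ReadBasicTokens(conllu string) ([]Token, error) {
	var tokens []Token
	for _, line := range strings.Split(conllu, "\n") {
		line = strings.TrimRight(line, "\r")
		if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#") {
			continue
		}
		cols := strings.Split(line, "\t")
		if len(cols) != 10 {
			return nil, fmt.Errorf("expected 10 columns, got %d: %q", len(cols), line)
		}
		// An ID containing '.' is an empty node; one containing '-' is a range.
		if strings.ContainsAny(cols[0], ".-") {
			continue
		}
		id, err := strconv.Atoi(cols[0])
		if err != nil {
			return nil, fmt.Errorf("bad ID %q: %w", cols[0], err)
		}
		head, err := strconv.Atoi(cols[6])
		if err != nil {
			return nil, fmt.Errorf("bad HEAD %q on token %d: %w", cols[6], id, err)
		}
		tokens = append(tokens, Token{ID: id, Form: cols[1], Head: head, DepRel: cols[7]})
	}
	return tokens, nil
}
```

Returning an error for anything else that fails to parse, rather than panicking, keeps an unexpected row from taking down a whole training run.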

neetle commented Jun 7, 2019

Possibly. I believe (from what I can remember) that they're actually implied in the sentence. Not sure if giving up that fidelity is worth it.
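
If the fidelity is worth keeping, one option is to parse the decimal IDs and the enhanced-dependency column (DEPS) instead of dropping them. The sketch below is only an assumption about how that might be modelled in Go; `NodeID`, `EnhancedArc`, `ParseNodeID` and `ParseDeps` are hypothetical names, not lingo's API. It treats `10.1` as major 10, minor 1, and splits a DEPS value such as `8:parataxis|10.1:nsubj` into head/relation pairs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// NodeID identifies an ordinary word (Minor == 0) or an empty node
// (Minor > 0), e.g. "10.1" is Major 10, Minor 1.
type NodeID struct {
	Major, Minor int
}

// IsEmpty reports whether the ID belongs to an empty node.
func (n NodeID) IsEmpty() bool { return n.Minor > 0 }

// ParseNodeID parses "10" or "10.1". Ranges such as "3-4" are rejected.
func ParseNodeID(s string) (NodeID, error) {
	majorStr, minorStr, hasMinor := strings.Cut(s, ".")
	major, err := strconv.Atoi(majorStr)
	if err != nil || major < 0 {
		return NodeID{}, fmt.Errorf("bad node ID %q", s)
	}
	if !hasMinor {
		return NodeID{Major: major}, nil
	}
	minor, err := strconv.Atoi(minorStr)
	if err != nil || minor < 1 {
		return NodeID{}, fmt.Errorf("bad empty-node ID %q", s)
	}
	return NodeID{Major: major, Minor: minor}, nil
}

// EnhancedArc is one head:relation pair from the DEPS column.
type EnhancedArc struct {
	Head NodeID
	Rel  string
}

// ParseDeps parses a DEPS value; "_" yields no arcs.
func ParseDeps(deps string) ([]EnhancedArc, error) {
	if deps == "_" {
		return nil, nil
	}
	var arcs []EnhancedArc
	for _, part := range strings.Split(deps, "|") {
		// Cut at the first ':' so relations like "nmod:at" stay intact.
		headStr, rel, ok := strings.Cut(part, ":")
		if !ok {
			return nil, fmt.Errorf("bad DEPS entry %q", part)
		}
		head, err := ParseNodeID(headStr)
		if err != nil {
			return nil, err
		}
		arcs = append(arcs, EnhancedArc{Head: head, Rel: rel})
	}
	return arcs, nil
}
```

With something like this, the basic tree can still be built from words with an integer head, while the empty nodes and their arcs are kept alongside it for anything that needs the enhanced graph.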

chewxy (Owner) commented Jan 14, 2020

I am so sorry for not responding. Apparently this library escaped my notifications, so I wasn't notified of incoming PRs and issues.
