Some edge cases for nld rules #46

asharkinasuit · 2018-04-03T07:06:06Z

I think I may have found a few edge cases where the rules for Dutch split words incorrectly:

*a's has the 's chopped off, which seems inconsistent at least (oma's -> oma + 's, drama's -> drama + 's, but opa's -> opa's); seems the pattern for this to occur is -a's*, with * some non-word, non-whitespace character.
Iraki's does have its 's cut off for some reason, just like neonazi's. Seems the closest pattern for this might be -i's*, with * some non-word, non-whitespace character.
37.501ste has the period treated like a regular one to get 37 + . + 501ste, so ordinals greater than 999 are probably not dealt with correctly.
16- en 17-jarige has the - separated from 16 while with words, such a hyphen is left attached
SP.A is a terrible one, but it does have to stick together and is split up right now (maybe keep a list of such exceptions hardcoded somewhere?)
' after s is unconditionally seen as possessive, but may be simply because a word is in quotes, e.g. " 'Chaos' is het woord dat het vaakst voorkomt. "
several abbreviations from nld-afk are also common nouns: fa (F note), pers (the press), var (an animal) and a verb, verg (~require).
colons (:) seem to stick to any preceding punctuation, not sure that's supposed to happen.

kosloot · 2018-04-03T12:08:48Z

OK, I tried to reproduce this, but that proved to be hard...
e.g.
The sentence:
oma`s en opa's, dat zijn drama's.
Is tokenized as:
oma`s en opa's , dat zijn drama's . <utt>

Or with the -v option:

oma`s	WORD-WITHSUFFIX	BEGINOFSENTENCE NEWPARAGRAPH 
en	WORD	
opa's	WORD-WITHSUFFIX	NOSPACE 
,	PUNCTUATION	
dat	WORD	
zijn	WORD	
drama's	WORD-WITHSUFFIX	NOSPACE 
.	PUNCTUATION	ENDOFSENTENCE

This looks OK imho, so the question arises: do you run ucto with the correct language selected?
ucto -Lnld is required.

If so, then please collect a few examples in a file and attach it here, together with the exact command line you use to test it.

thanx!

asharkinasuit · 2018-04-03T12:27:13Z

The sentences with the *ma examples that seem so strange are as follows:
En opa's en oma's.
Tokenized as:
En opa's en oma 's . <utt>
Two different fragments with drama:
diepmenselijke drama's, over kinderen --> diepmenselijke drama 's , over kinderen <utt>
and
om toekomstige drama's op de --> om toekomstige drama's op de <utt>
It seems the presence or absence of punctuation after the 's is a factor.

These outputs were gathered directly using ucto -Lnld. I first found them in Frog output, which I hope uses the nld model by default, and that was the same as what's seen here.

kosloot · 2018-04-03T13:14:48Z

I still cannot reproduce this:
Input file:

En opa's en oma's.

diepmenselijke drama's, over kinderen

om toekomstige drama's op de

Output:

En opa's en oma's . <utt> 

diepmenselijke drama's , over kinderen <utt> 

om toekomstige drama's op de <utt>

I also tried frog on this file:

1	En			VG(neven)	0.992995						
2	opa's			N(soort,mv,basis)	0.583333						
3	en			VG(neven)	0.999194						
4	oma's			N(soort,mv,basis)	0.826923						
5	.			LET()	1.000000						

1	diepmenselijke			ADJ(prenom,basis,met-e,stan)	0.999790						
2	drama's			N(soort,mv,basis)	0.998847						
3	,			LET()	1.000000						
4	over			VZ(init)	0.996802						
5	kinderen			N(soort,mv,basis)	0.997633						

1	om			VZ(init)	0.957341						
2	toekomstige			ADJ(prenom,basis,met-e,stan)	0.999813						
3	drama's			N(soort,mv,basis)	0.995266						
4	op			VZ(init)	0.996568						
5	de			LID(bep,stan,rest)	0.981886

puzzling

kosloot · 2018-04-03T13:25:38Z

I think I know what the problem is here. You probably use the released uctodata version 0.5 from Oct 18.
A week later a fix was committed to git:

Date:   Tue Oct 24 14:23:33 2017 +0200
         small change in 'nld' WORD-WITHSUFFIX rule. \P{Po} also terminates it

This might explain this drama :)
I first will have to look into the other sub-issues before releasing a new uctodata version.
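
To illustrate why the character after the 's matters, here is a toy check in Python. It is not the uctodata rule itself. It assumes the old rule only accepted a word with an 's suffix when whitespace or end of input followed, and that the fix additionally accepts a following character of Unicode category Po. The helper name and the `punct_terminates` flag are made up for this sketch.

```python
import unicodedata


def suffix_token_ok(text: str, token: str, punct_terminates: bool) -> bool:
    """Toy check: may `token` (e.g. "drama's") stay one WORD-WITHSUFFIX token?

    Hypothetical helper, not ucto code. The token is accepted when it is
    followed by whitespace or end of input; with punct_terminates=True a
    following character of Unicode category Po (such as , . ! ?) is accepted too.
    """
    end = text.index(token) + len(token)
    if end == len(text) or text[end].isspace():
        return True
    return punct_terminates and unicodedata.category(text[end]) == "Po"


for sentence in ("diepmenselijke drama's, over kinderen",
                 "om toekomstige drama's op de"):
    old = suffix_token_ok(sentence, "drama's", punct_terminates=False)
    new = suffix_token_ok(sentence, "drama's", punct_terminates=True)
    print(f"{sentence!r}: old={old} new={new}")
```

Under these assumptions only the sentence with a comma directly after drama's behaves differently between the two settings, which matches the outputs reported earlier in the thread.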

kosloot · 2018-04-03T14:00:02Z

Iraki's and neonazi's shouldn't be a problem with the 24 oct fix.
37.501ste CONFIRMED
de 16- en 17-jarigen. CONFIRMED.
SP.A is not split. SP.A. isn't either, but SP.A? IS. CONFIRMED
detecting quotes is very difficult and error-prone. 'Chaos' might be doable, but 'the Chaos' is already much harder. CONFIRMED
Leaving out 'ambiguous' abbreviations like fa and pers is quite easy. We could adapt the list a bit (by adding a . to the expression).
I wonder about 'var': which animal (or verb?) is that?
colon is stuck to punctuation: Not true for . and ! but indeed for , and ? CONFIRMED

I will try to fix some of these sub-issues. Then release a new uctodata and make new separate issues for the unresolved ones.
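
For the 37.501ste case confirmed above, a rough Python sketch of an ordinal pattern that tolerates periods as thousands separators. The regex and the helper name are hypothetical and not taken from uctodata; they only show the shape such a rule could have.

```python
import re

# Hypothetical pattern, not the uctodata rule: digits, optionally followed by
# period-separated groups of three digits, then a Dutch ordinal ending.
ORDINAL = re.compile(r"\d+(?:\.\d{3})*(?:ste|de|e)")


def is_ordinal(token: str) -> bool:
    """Return True when the whole token looks like a Dutch ordinal number."""
    return ORDINAL.fullmatch(token) is not None


for token in ("37.501ste", "501ste", "2de", "37.501", "37."):
    print(token, is_ordinal(token))
```

A plain number such as 37.501 or a sentence-final 37. is not matched, so the period there can still be treated as ordinary punctuation.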

asharkinasuit · 2018-04-03T14:14:45Z

It looks like I obtained LaMachine at the beginning of November, but the uctodata is indeed v0.5 in the VERSION file.
A 'var' is a young bull, cf Wiki :) I guess a trailing period for things like that is a good compromise, because you're not likely to run into 'var' as bull outside the Old Testament these days...
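
A small Python sketch of that compromise, using the four ambiguous entries named in this thread. The regex is only an illustration, not the expression used in nld-afk: an entry counts as an abbreviation only when a period follows it directly.

```python
import re

# The four ambiguous entries mentioned in this thread. The regex itself is a
# hypothetical illustration, not the expression used in uctodata.
AMBIGUOUS = ("fa", "pers", "var", "verg")
ABBREVIATION = re.compile(r"\b(?:" + "|".join(AMBIGUOUS) + r")\.")


def find_abbreviations(text: str) -> list:
    """Return every ambiguous entry that is directly followed by a period."""
    return ABBREVIATION.findall(text)


print(find_abbreviations("De pers schreef over de var"))
print(find_abbreviations("verg. de fa. Jansen"))
print(find_abbreviations("Dat schreef de pers."))
```

The last call shows the remaining trade-off: a common noun at the end of a sentence still looks like an abbreviation.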

kosloot · 2018-04-03T16:01:41Z

LaMachine is normally built on the stable releases of our software.
A development version is available without warranty. @proycon can explain this better.
I managed to fix most of the issues already. But some more testing is still needed.
The colon problem is quite hard, as this interferes with the REVERSE-SMILEY rules.
As you can see in this output:

Een	WORD	NOSPACE BEGINOFSENTENCE NEWPARAGRAPH 
.	PUNCTUATION	NOSPACE ENDOFSENTENCE 
:	PUNCTUATION	BEGINOFSENTENCE 
/	PUNCTUATION	
en	WORD	
,:	REVERSE-SMILEY	
/	PUNCTUATION	
toch	WORD	NOSPACE 
?:	REVERSE-SMILEY	
/	PUNCTUATION	
scary	WORD	NOSPACE 
!	PUNCTUATION	NOSPACE ENDOFSENTENCE 
:	PUNCTUATION	BEGINOFSENTENCE 
/	PUNCTUATION	NOSPACE 
:	PUNCTUATION	
ook	WORD	ENDOFSENTENCE

You could argue about the use of smileys in a lot of texts. ?:
You could create your own copy of the tokconfig-nld file and remove the SMILEY rules from the [RULE-ORDER] section on top.
That would affect both ucto and frog, as it uses ucto.
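
A minimal Python sketch of that edit on a private copy of the config. It assumes the file uses bracketed section headers, that [RULE-ORDER] holds whitespace-separated rule names, and that comment lines start with #. The function name is made up; check the result against your own tokconfig-nld before using it.

```python
def drop_smiley_rules(config_text: str) -> str:
    """Remove every rule name containing SMILEY from the [RULE-ORDER] section.

    Assumed layout (not verified against uctodata): bracketed section headers,
    whitespace-separated rule names in [RULE-ORDER], '#' starts a comment line.
    All other sections are copied unchanged.
    """
    out = []
    in_rule_order = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new section starts; remember whether it is the rule order.
            in_rule_order = stripped == "[RULE-ORDER]"
            out.append(line)
        elif in_rule_order and stripped and not stripped.startswith("#"):
            kept = [name for name in stripped.split() if "SMILEY" not in name]
            if kept:
                out.append(" ".join(kept))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Only the ordering list is touched. The rule definitions stay in the file, which keeps the change easy to revert.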

kosloot · 2018-04-04T20:17:22Z

uctodata v0.6 is released now.
All issues except for the REVERSE-SMILEY case are solved (though more abbreviations might need an explicit .).
You have to switch to the new v2 LaMachine to get this new release, as v1 is no longer updated.

asharkinasuit · 2018-04-05T06:47:57Z

Thanks for your efforts. I'm updating to v2 now, will let you know if there's more trouble.

asharkinasuit · 2018-04-05T12:22:27Z

Pay close attention to Flemish political parties, they like messing with symbols in their abbreviations: N-VA is another like SP.A ;) There's also CD&V, and a minor one called Red! (which reminds me, make sure Yahoo! also keeps its exclamation point... I bet that's frustrated so many programmers over the years...)
As for Dutch parties, 50+ causes trouble as well.

kosloot · 2018-04-05T13:55:03Z

Well, we can't make everybody happy...
You could always create your own set of rules and use those.

Maybe we can add an 'exceptions' list to ucto. As an extra parameter.
Such a list could contain words that should never be split.

Maybe that is feasible. Need some thinking.
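
A toy Python sketch of the idea, not ucto code: a naive tokenizer that splits punctuation off words unless the whole whitespace-delimited chunk is on an exceptions list. The function and parameter names are hypothetical.

```python
import re


def tokenize(text: str, exceptions: frozenset = frozenset()) -> list:
    """Naive tokenizer: split on whitespace, then separate punctuation,
    unless the whole chunk is listed in `exceptions`."""
    tokens = []
    for chunk in text.split():
        if chunk in exceptions:
            # Protected token: keep it exactly as written.
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


protected = frozenset({"SP.A", "Yahoo!", "50+"})
print(tokenize("SP.A en Yahoo! doen mee"))
print(tokenize("SP.A en Yahoo! doen mee", protected))
```

This version only protects a chunk that matches an exception exactly, so SP.A followed directly by a comma would still be split. A real implementation would have to handle adjacent punctuation as well.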

asharkinasuit · 2018-04-06T08:59:31Z

Yea, I guess it's a stretch to account for all cases like that, because there's no end to that path... The fun part is that I'm also running ucto on text containing XML tags, which seems to confuse it somewhat, e.g. plural 's is chopped off when the token is enclosed in a tag: <a>baby's</a> --> < a > baby ' s < / a >. I bet that wasn't supposed to be supported anyway, though.

kosloot · 2018-04-09T07:22:47Z

No, regarding parsing XML with regexp's I gladly refer to this link.

asharkinasuit · 2018-04-09T07:33:09Z

I understand XML documents are not "regular" in the sense of regular expressions... I wasn't trying to parse XML using regular expressions, I was trying to tokenize text that contains XML tags :)

kosloot · 2018-04-09T12:45:55Z

I added an experimental option to ucto: --add-tokens

This option provides ucto with a file containing words/tokens that should stay untokenized, with every token on a separate line.

The example file is:

AR#$.3
Bij.Elkaar...Houden!
Yahoo!

When these words/tokens appear in a text, they stay untouched.

When running ucto on this file like this: ucto -Lnld --add-tokens=tokens -v tokens
we get:

AR#$.3	WORD-TOKEN	BEGINOFSENTENCE NEWPARAGRAPH 
Bij.Elkaar...Houden!	WORD-TOKEN	
Yahoo!	WORD-TOKEN	ENDOFSENTENCE
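
A small Python sketch for preparing such a file and assembling the command line quoted above. The helper names are made up, and the command is only built, not executed, so nothing here depends on ucto being installed.

```python
import tempfile
from pathlib import Path


def write_token_file(path, tokens):
    """Write one protected token per line, the layout described above."""
    Path(path).write_text("\n".join(tokens) + "\n", encoding="utf-8")


def ucto_command(token_file, input_file):
    """Assemble the command line quoted in this thread (not executed here)."""
    return ["ucto", "-Lnld", f"--add-tokens={token_file}", "-v", str(input_file)]


with tempfile.TemporaryDirectory() as tmp:
    token_path = Path(tmp) / "tokens"
    write_token_file(token_path, ["AR#$.3", "Bij.Elkaar...Houden!", "Yahoo!"])
    print(token_path.read_text(encoding="utf-8"), end="")
    print(" ".join(ucto_command(token_path, token_path)))
```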

kosloot · 2018-04-10T08:19:14Z

Closing this issue. It's too general anyway.

kosloot assigned proycon and kosloot Apr 3, 2018

kosloot pushed a commit to LanguageMachines/uctodata that referenced this issue Apr 3, 2018

fixed several issues for dutch: LanguageMachines/ucto#46

7f0d58a

kosloot pushed a commit to LanguageMachines/uctodata that referenced this issue Apr 3, 2018

fixed another issue from: LanguageMachines/ucto#46

8856082

kosloot closed this as completed Apr 10, 2018
