Some edge cases for nld rules #46

Closed
8 tasks
asharkinasuit opened this issue Apr 3, 2018 · 16 comments
@asharkinasuit

I think I may have found a few edge cases where the rules for Dutch split words incorrectly:

  • *a's has the 's chopped off, which seems inconsistent at least (oma's -> oma + 's, drama's -> drama + 's, but opa's -> opa's); seems the pattern for this to occur is -a's*, with * some non-word, non-whitespace character.
  • Iraki's does have its 's cut off for some reason, just like neonazi's. Seems the closest pattern for this might be -i's*, with * some non-word, non-whitespace character.
  • 37.501ste has the period treated like a regular one to get 37 + . + 501ste, so ordinals greater than 999 are probably not dealt with correctly.
  • 16- en 17-jarige has the - separated from 16 while with words, such a hyphen is left attached
  • SP.A is a terrible one, but it does have to stick together and is split up right now (maybe keep a list of such exceptions hardcoded somewhere?)
  • ' after s is unconditionally seen as possessive, but may be simply because a word is in quotes, e.g. " 'Chaos' is het woord dat het vaakst voorkomt. "
  • several abbreviations from nld-afk are also common nouns: fa (F note), pers (the press), var (an animal) and a verb, verg (~require).
  • colons (:) seem to stick to any preceding punctuation, not sure that's supposed to happen.
@kosloot
Contributor

kosloot commented Apr 3, 2018

OK, I tried to reproduce this, but that proved to be hard.
e.g.
The sentence:
oma`s en opa's, dat zijn drama's.
Is tokenized as:
oma`s en opa's , dat zijn drama's . <utt>

Or with the -v option:

oma`s	WORD-WITHSUFFIX	BEGINOFSENTENCE NEWPARAGRAPH 
en	WORD	
opa's	WORD-WITHSUFFIX	NOSPACE 
,	PUNCTUATION	
dat	WORD	
zijn	WORD	
drama's	WORD-WITHSUFFIX	NOSPACE 
.	PUNCTUATION	ENDOFSENTENCE 

This looks OK imho, so the question arises: do you run ucto with the correct language selected?
ucto -Lnld is required.

If so, then please collect a few examples in a file and attach it here, together with the exact command line you use to test it.

thanx!

@asharkinasuit
Author

The sentences with the *ma examples that seem so strange are as follows:
En opa's en oma's.
Tokenized as:
En opa's en oma 's . <utt>
Two different fragments with drama:
diepmenselijke drama's, over kinderen --> diepmenselijke drama 's , over kinderen <utt>
and
om toekomstige drama's op de --> om toekomstige drama's op de <utt>
It seems the presence or absence of punctuation after the 's is a factor.

These outputs were gathered directly using ucto -Lnld. I first found them in Frog output, which I hope uses the nld model by default, and that was the same as what's seen here.

@kosloot
Contributor

kosloot commented Apr 3, 2018

I still cannot reproduce this:
Input file:

En opa's en oma's.

diepmenselijke drama's, over kinderen

om toekomstige drama's op de

Output:

En opa's en oma's . <utt> 

diepmenselijke drama's , over kinderen <utt> 

om toekomstige drama's op de <utt> 

I also tried frog on this file:

1	En			VG(neven)	0.992995						
2	opa's			N(soort,mv,basis)	0.583333						
3	en			VG(neven)	0.999194						
4	oma's			N(soort,mv,basis)	0.826923						
5	.			LET()	1.000000						

1	diepmenselijke			ADJ(prenom,basis,met-e,stan)	0.999790						
2	drama's			N(soort,mv,basis)	0.998847						
3	,			LET()	1.000000						
4	over			VZ(init)	0.996802						
5	kinderen			N(soort,mv,basis)	0.997633						

1	om			VZ(init)	0.957341						
2	toekomstige			ADJ(prenom,basis,met-e,stan)	0.999813						
3	drama's			N(soort,mv,basis)	0.995266						
4	op			VZ(init)	0.996568						
5	de			LID(bep,stan,rest)	0.981886	

puzzling

@kosloot
Contributor

kosloot commented Apr 3, 2018

I think I know what the problem is here. You probably use the released uctodata version 0.5 from Oct 18.
A week later a fix was committed to Git:

Date:   Tue Oct 24 14:23:33 2017 +0200
         small change in 'nld' WORD-WITHSUFFIX rule. \P{Po} also terminates it

This might explain this drama :)
I first will have to look into the other sub-issues before releasing a new uctodata version.
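
The difference between the two behaviours reported above can be sketched with two toy patterns. This is only an illustration in Python, not the actual ucto rule (ucto rules use ICU regular expressions, and the commit itself is not shown in this thread); the names `STRICT` and `RELAXED` and the punctuation set are assumptions made for the example.

```python
import re

# Hypothetical patterns, NOT the real 'nld' WORD-WITHSUFFIX rule.
# STRICT: a word ending in 's only counts when whitespace or end of input follows.
STRICT = re.compile(r"\w+'s(?=\s|$)")
# RELAXED: common punctuation may also terminate the word.
RELAXED = re.compile(r"\w+'s(?=\s|$|[.,;:!?])")

def words_with_suffix(pattern, text):
    """Return every word that the given pattern keeps together with its 's."""
    return pattern.findall(text)

sentence = "En opa's en oma's."
strict_hits = words_with_suffix(STRICT, sentence)    # misses the word before the period
relaxed_hits = words_with_suffix(RELAXED, sentence)  # keeps both words whole
```

With the strict pattern only the word followed by a space is recognised, which matches the symptom that punctuation after the 's changes the outcome.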

@kosloot
Contributor

kosloot commented Apr 3, 2018

  • Iraki's and neonazi's shouldn't be a problem with the 24 oct fix.
  • 37.501ste CONFIRMED
  • de 16- en 17-jarigen. CONFIRMED.
  • SP.A is not split. SP.A. isn't either, but SP.A? IS. CONFIRMED
  • detecting quotes is very difficult and error-prone. 'Chaos' might be doable, but 'the Chaos' is already much harder. CONFIRMED
  • leaving out 'ambiguous' abbreviations like fa and pers is quite easy. We could adapt the list a bit (by adding a . to the expression).
    I wonder about 'var' which animal (or verb?) is that?
  • colon is stuck to punctuation: Not true for . and ! but indeed for , and ? CONFIRMED

I will try to fix some of these sub-issues. Then release a new uctodata and make new separate issues for the unresolved ones.
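
For the 37.501ste case, the idea of treating the period as a thousands separator inside an ordinal can be sketched as follows. This is a toy Python pattern under stated assumptions, not the rule that ended up in uctodata: the name `ORDINAL` is made up, and only the suffixes `ste` and `de` are covered.

```python
import re

# Hypothetical pattern: 1-3 digits, optional groups of ".ddd" (thousands
# separators), then a Dutch ordinal suffix. Not taken from uctodata.
ORDINAL = re.compile(r"\d{1,3}(?:\.\d{3})*(?:ste|de)\b")

def find_ordinals(text):
    """Return all ordinals, keeping thousands separators attached."""
    return ORDINAL.findall(text)

big = find_ordinals("de 37.501ste bezoeker")
small = find_ordinals("Het was de 2de keer.")
```

A sentence-final period followed by a number is not swallowed, because the separator must be followed by exactly three digits directly.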

@asharkinasuit
Author

It looks like I obtained LaMachine at the beginning of November, but the uctodata is indeed v0.5 according to the VERSION file.
A 'var' is a young bull, cf Wiki :) I guess a trailing period for things like that is a good compromise, because you're not likely to run into 'var' as bull outside the Old Testament these days...

kosloot pushed a commit to LanguageMachines/uctodata that referenced this issue Apr 3, 2018
kosloot pushed a commit to LanguageMachines/uctodata that referenced this issue Apr 3, 2018
@kosloot
Contributor

kosloot commented Apr 3, 2018

LaMachine is normally built on the stable releases of our software.
A development version is available without warranty. @proycon can explain this better.
I managed to fix most of the issues already. But some more testing is still needed.
The colon problem is quite hard, as this interferes with the REVERSE-SMILEY rules.
As you can see in this output:

Een	WORD	NOSPACE BEGINOFSENTENCE NEWPARAGRAPH 
.	PUNCTUATION	NOSPACE ENDOFSENTENCE 
:	PUNCTUATION	BEGINOFSENTENCE 
/	PUNCTUATION	
en	WORD	
,:	REVERSE-SMILEY	
/	PUNCTUATION	
toch	WORD	NOSPACE 
?:	REVERSE-SMILEY	
/	PUNCTUATION	
scary	WORD	NOSPACE 
!	PUNCTUATION	NOSPACE ENDOFSENTENCE 
:	PUNCTUATION	BEGINOFSENTENCE 
/	PUNCTUATION	NOSPACE 
:	PUNCTUATION	
ook	WORD	ENDOFSENTENCE 

You could argue about the use of smileys in a lot of texts. ?:
You could create your own copy of the tokconfig-nld file and remove the SMILEY rules from the [RULE-ORDER] section at the top.
That would affect both ucto and frog, as frog uses ucto.
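
Editing a private copy of the config could be automated along these lines. This is a rough Python sketch that assumes the [RULE-ORDER] section lists rule names separated by whitespace, possibly over several lines; check that assumption against your own tokconfig-nld before using it. The function name `drop_smiley_rules` and the sample text are made up for illustration.

```python
def drop_smiley_rules(config_text):
    """Return config_text with every rule name containing 'SMILEY' removed
    from the [RULE-ORDER] section; all other sections are left untouched."""
    out = []
    in_rule_order = False
    for line in config_text.splitlines():
        stripped = line.strip()
        # A "[NAME]" line starts a new section.
        if stripped.startswith("[") and stripped.endswith("]"):
            in_rule_order = stripped == "[RULE-ORDER]"
            out.append(line)
            continue
        if in_rule_order and stripped and not stripped.startswith("#"):
            kept = [name for name in stripped.split() if "SMILEY" not in name]
            if kept:  # drop lines that held only SMILEY rules
                out.append(" ".join(kept))
            continue
        out.append(line)
    return "\n".join(out)

# Made-up sample, only loosely shaped like a tokenizer config.
sample = "[RULE-ORDER]\nURL SMILEY REVERSE-SMILEY WORD\n\n[RULES]\nSMILEY=foo\n"
cleaned = drop_smiley_rules(sample)
```

The rule definitions themselves stay in the file; only their mention in the ordering section is removed, which is what the suggestion above describes.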

@kosloot
Contributor

kosloot commented Apr 4, 2018

uctodata v0.6 is released now.
All issues except for the REVERSE-SMILEY case are solved (though more abbreviations might need an explicit .).
You have to switch to the new v2 LaMachine to get this new release, as v1 is no longer updated.

@asharkinasuit
Author

Thanks for your efforts. I'm updating to v2 now, will let you know if there's more trouble.

@asharkinasuit
Author

Pay close attention to Flemish political parties, they like messing with symbols in their abbreviations: N-VA is another like SP.A ;) There's also CD&V, and a minor one called Red! (which reminds me, make sure Yahoo! also keeps its exclamation point... I bet that's frustrated so many programmers over the years...)
As for Dutch parties, 50+ causes trouble as well.

@kosloot
Contributor

kosloot commented Apr 5, 2018

Well, we can't make everybody happy...
You could always create your own set of rules and use those.

Maybe we can add an 'exceptions' list to ucto. As an extra parameter.
Such a list could contain words that should never be split.

Maybe that is feasible. Need some thinking.

@asharkinasuit
Author

Yea, I guess it's a stretch to account for all cases like that, because there's no end to that path... The fun part is that I'm also running ucto on text containing XML tags, which seems to confuse it somewhat, e.g. plural 's is chopped off when the token is enclosed in a tag: <a>baby's</a> --> < a > baby ' s < / a >. I bet that wasn't supposed to be supported anyway, though.

@kosloot
Contributor

kosloot commented Apr 9, 2018

No. Regarding parsing XML with regexps, I gladly refer to this link.

@asharkinasuit
Author

I understand XML documents are not "regular" in the sense of regular expressions... I wasn't trying to parse XML using regular expressions, I was trying to tokenize text that contains XML tags :)
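
One workaround for text with inline tags is to set the tags aside before tokenizing and pass only the text in between to the tokenizer. The Python sketch below shows the idea; it is not something ucto does, the tag pattern is deliberately naive (it ignores comments, CDATA and attributes containing ">"), and `tokenize` is a stand-in for whatever tokenizer is actually called.

```python
import re

# Naive tag pattern: "<", anything except angle brackets, ">".
TAG = re.compile(r"(<[^<>]+>)")

def tokenize_outside_tags(text, tokenize):
    """Keep tags as single tokens and run `tokenize` only on the text between them."""
    tokens = []
    for piece in TAG.split(text):
        if not piece:
            continue
        if TAG.fullmatch(piece):
            tokens.append(piece)             # a tag: pass through untouched
        else:
            tokens.extend(tokenize(piece))   # plain text: tokenize as usual
    return tokens

# str.split is only a placeholder tokenizer for the example.
result = tokenize_outside_tags("<a>baby's</a> slapen", str.split)
```

For real XML input a proper XML parser is the safer way to get at the text nodes; the sketch only covers simple inline markup.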

@kosloot
Contributor

kosloot commented Apr 9, 2018

I added an experimental option to ucto: --add-tokens

This option should provide ucto with a file containing words/tokens that should stay untokenized, with every token on a separate line.

The example file is:

AR#$.3
Bij.Elkaar...Houden!
Yahoo!

When these words/tokens appear in a text, they stay untouched.

When running ucto on this file like this: ucto -Lnld --add-tokens=tokens -v tokens
we get:

AR#$.3	WORD-TOKEN	BEGINOFSENTENCE NEWPARAGRAPH 
Bij.Elkaar...Houden!	WORD-TOKEN	
Yahoo!	WORD-TOKEN	ENDOFSENTENCE 
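
The concept behind such a list can be sketched in a few lines of Python. This is not how ucto implements --add-tokens; it is a toy tokenizer in which the function names `load_protected` and `tokenize_with_exceptions` are made up, and a chunk is only protected when it matches a list entry exactly.

```python
import re

def load_protected(lines):
    """Build the set of protected tokens from lines, one token per line."""
    return {line.strip() for line in lines if line.strip()}

def tokenize_with_exceptions(text, protected):
    """Split punctuation off words, except for chunks found in `protected`."""
    tokens = []
    for chunk in text.split():
        if chunk in protected:
            tokens.append(chunk)  # listed token: keep as is
        else:
            # Toy fallback: runs of word characters, or single punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

protected = load_protected(["AR#$.3", "Bij.Elkaar...Houden!", "Yahoo!"])
tokens = tokenize_with_exceptions("Ik zoek op Yahoo! en Google!", protected)
```

Because the lookup is an exact match on whitespace-delimited chunks, something like `Yahoo!,` would still be split in this sketch; a real implementation has to decide how listed tokens interact with adjacent punctuation.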

@kosloot
Contributor

kosloot commented Apr 10, 2018

Closing this issue. It's too general anyway.

@kosloot kosloot closed this as completed Apr 10, 2018