Apostrophe "s" disambiguation issue with search query style sentences #1074

Closed
calebmer opened this issue Jan 3, 2024 · 7 comments

Comments

@calebmer

calebmer commented Jan 3, 2024

I’m using compromise to parse search queries. Search queries are interesting in that they’re not complete sentences. I’m having an issue with this query:

john's closed tasks

Here I want to interpret “john's” as possessive, not as “john has”. However, compromise parses this as “john has”. I’ve traced the code to here:

//a gerund suggests 'is walking'
if (nextTerm.tags.has('Verb')) {
  //fix 'jamie's bite'
  if (nextTerm.tags.has('Infinitive')) {
    return true
  }
  //fix 'spencer's runs'
  if (nextTerm.tags.has('PresentTense')) {
    return true
  }
  return false
}

“closed” is tagged as a verb so “john’s” is interpreted as “john has”.
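To make the effect of that branch concrete, here is a minimal standalone sketch. It does not call compromise: `term` and `verbBranch` are hypothetical helpers, terms are modelled as plain objects with a `Set` of tags, and the tags given to each word (for example `PastTense` on "closed") are assumptions for illustration. Only the branch quoted above is mirrored; the real classifier has more rules.

```javascript
// Hypothetical stand-in for a term: text plus a Set of tag names.
const term = (text, ...tags) => ({ text, tags: new Set(tags) })

// Mirrors only the quoted branch.
// Returns true for "possessive", false for "is/has contraction",
// and null when the next term is not a verb (branch does not apply).
function verbBranch(nextTerm) {
  if (!nextTerm.tags.has('Verb')) {
    return null
  }
  // 'jamie's bite'
  if (nextTerm.tags.has('Infinitive')) {
    return true
  }
  // 'spencer's runs'
  if (nextTerm.tags.has('PresentTense')) {
    return true
  }
  return false
}

// Assumed tags, for illustration only.
const closedTerm = term('closed', 'Verb', 'PastTense')
const runsTerm = term('runs', 'Verb', 'PresentTense')
const tasksTerm = term('tasks', 'Noun', 'Plural')

console.log(verbBranch(closedTerm)) // false
console.log(verbBranch(runsTerm)) // true
console.log(verbBranch(tasksTerm)) // null
```

A verb that is neither an infinitive nor present tense falls through to `return false`, which is why "john's closed tasks" ends up as a contraction rather than a possessive.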

A similar query:

my closed tasks

…correctly tags the phrase.

While I suspect I could add “closed” to the lexicon as a noun to fix this specific case, I need to support parsing arbitrary words between the user name and entity type.

Is there a workaround on my end I could write? Is this a bug in compromise? (I’d guess not, the phrase is truly ambiguous and you gotta pick somehow.) Could there be a configuration option for this? Is it safe to patch compromise and always return true from the code branch I linked?


Another query where the 's is not treated as possessive, although I want it to be, is:

john’s neat documents about georgia

…but I’m not sure whether this is the same root cause.

@ryancasburn-KAI
Contributor

In my opinion, "john's closed tasks" is ambiguous because "john has closed tasks" is a perfectly reasonable phrase, as is the possessive form. The confusion here comes from the word "closed", which can be an adjective or a past tense verb. I'm not an expert on plugins in this package yet, but I'd guess there would be a way for a plugin to lean to and prioritize the possessive tag.

However, I think that there is a bug with john’s neat documents about georgia

It's being parsed as "john is neat documents about georgia" which is almost certainly going to be an incorrect tagging in all cases. It also would be unlikely for it to be "john has neat documents about georgia" as (from what I've come to understand as a non-linguist) the has contraction generally is only used before a past participle verb. So, this one at least should be getting recognized as a possessive. I'd think by:

let twoTerm = terms[i + 2]
if (twoTerm && twoTerm.tags.has('Noun') && !twoTerm.tags.has('Pronoun')) {
  return true
}

This one ultimately comes down to switches and is related to #1070 that I opened a couple of days ago. The tagger initially thinks that "documents" is a plural noun (which is correct), but changes it to a present tense verb (which is incorrect) because the next word is about, which locks in the word documents to be a verb (like "John documents Georgia wildlife"). I'll keep this in mind as I continue thinking through an improvement to the switches. Happy for your insight and Spencer's too.

An example which may be key in improving this:

John talks about Georgia

Talks should be a verb. This is currently handled because of the "about" lock.

John's talks about Georgia

Here, talks should be a plural noun. We know this because "John is talks about Georgia" doesn't make sense and "John has talks about Georgia" is an improper use of the contraction, so John's must be possessive. Are those all good assumptions? A possessive also needs to be followed by a noun chunk (i.e. a noun alone, or adjective + noun, etc.).

One more to make it more confusing:

John's nuts about Georgia

Should be "John is nuts about Georgia." How does this fit in with everything else? What makes talks and nuts different?
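The "possessive must be followed by a noun chunk" idea can be sketched as a small lookahead. This is an illustration only, not compromise's code: `term` and `followedByNounChunk` are hypothetical names, terms are plain objects with a `Set` of tags, and the tags assigned to each word are assumptions.

```javascript
// Hypothetical stand-in for a term: text plus a Set of tag names.
const term = (text, ...tags) => ({ text, tags: new Set(tags) })

// True when the terms after index i are zero or more adjectives
// followed by a noun that is not a pronoun.
function followedByNounChunk(terms, i) {
  let j = i + 1
  while (j < terms.length && terms[j].tags.has('Adjective')) {
    j += 1
  }
  const head = terms[j]
  return Boolean(head) && head.tags.has('Noun') && !head.tags.has('Pronoun')
}

// Assumed taggings, for illustration only.
const neatDocs = [
  term("john's", 'Person'),
  term('neat', 'Adjective'),
  term('documents', 'Noun', 'Plural'),
  term('about'),
  term('georgia', 'Noun'),
]
const nutsAsNoun = [term("John's", 'Person'), term('nuts', 'Noun', 'Plural'), term('about'), term('Georgia', 'Noun')]
const nutsAsAdjective = [term("John's", 'Person'), term('nuts', 'Adjective'), term('about'), term('Georgia', 'Noun')]

console.log(followedByNounChunk(neatDocs, 0)) // true
console.log(followedByNounChunk(nutsAsNoun, 0)) // true
console.log(followedByNounChunk(nutsAsAdjective, 0)) // false
```

The last two calls show why "talks" and "nuts" are hard to separate with this check alone: the answer depends entirely on how the following word has already been tagged.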

@spencermountain
Owner

Yep, Ryan you've got it dead-on. Well done.
There's an is/has classifier, and it does an okay job, but it runs really early in the tagger.
These ambiguous cases like john's talks about are (often) shaken-out further downstream.

I'm open to suggestions about how to improve this, as it produces pretty-bad outcomes when it's wrong.

I've always wanted to keep things one-pass. Changing the contraction back, after the tagger had made various decisions, seems like a difficult solution.

The good news is that many of these problem words like 'talks' are flagged, and we can add careful rules about them to the is/has classifier. We could add some extra look-arounds there to mitigate this.

Happy to help plug away at this. Thank you Caleb and Ryan.

spencermountain added a commit that referenced this issue Jan 5, 2024
@spencermountain
Owner

Added some tests to dev for the is-has and did-would contractions. About 30 of 200 are failing. Seems like a fun one.

I think the john’s neat documents example is tripping up on the Unicode apostrophe, which I will take a look at next week.
cheers
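If the curly apostrophe is indeed the trigger, one possible caller-side workaround is to normalize it before handing the query to the parser. This is a sketch under that assumption; `normalizeApostrophes` is a hypothetical helper, not part of compromise.

```javascript
// Replace the typographic apostrophe (U+2019) with the ASCII apostrophe (U+0027).
function normalizeApostrophes(text) {
  return text.replace(/\u2019/g, "'")
}

const query = 'john\u2019s neat documents about georgia'
console.log(normalizeApostrophes(query)) // john's neat documents about georgia
```

Note that U+2019 is also used as a closing single quotation mark, so this replacement affects quoted text as well.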

@calebmer
Author

calebmer commented Jan 5, 2024

Thank y’all for looking into this! @ryancasburn-KAI you mentioned:

…but I'd guess there would be a way for a plugin to lean to and prioritize the possessive tag…

From poking around the source code, I can’t see an obvious way to use a plugin to prioritize the possessive tag. Is there some plugin capability I’m missing? Can I write a plugin that pre-emptively tags a 's contraction as possessive? (Unclear if that would be respected from a quick read of the source code.)

@ryancasburn-KAI
Contributor

ryancasburn-KAI commented Jan 5, 2024

@calebmer - you can try this

const plugin = {
  compute: {
    custPossessives: doc => doc.match("(#Person &&/'s$/)").tag('Possessive'),
  },
}
nlp.plugin(plugin)
nlp._world.hooks.splice(7, 0, 'custPossessives')


console.log(nlp("john's closed tasks").json()[0].terms)

The only thing this seems to get wrong is that "closed" is still labeled as a verb, even though "John's" is a possessive.

This comes down to

// rough sort, so 'Noun' is after ProperNoun, etc
let tags = Array.from(term.tags).sort((a, b) => {
  let numA = tagSet[a] ? tagSet[a].parents.length : 0
  let numB = tagSet[b] ? tagSet[b].parents.length : 0
  return numA > numB ? -1 : 1
})

This puts the sort order as:

  1. MaleName
  2. FirstName
  3. Person
  4. Singular
  5. ProperNoun
  6. Possessive
  7. Noun

An Adj|Past word preceded by a ProperNoun is classified as a verb ("John documented things").
An Adj|Past word preceded by a Possessive is classified as an adjective ("John's documented things").

I don't know if this is fixable via a plugin. @spencermountain thoughts on a smarter sorter? Maybe Possessive is handled specially, since it is an add-on tag (i.e. it can go on any noun, but if it applies, its rules should be considered first)? For example, by changing:

// rough sort, so 'Noun' is after ProperNoun, etc
let tags = Array.from(term.tags).sort((a, b) => {
  let numA = tagSet[a] ? tagSet[a].parents.length : 0
  let numB = tagSet[b] ? tagSet[b].parents.length : 0
  return numA > numB ? -1 : 1
})

to:

  let tags = Array.from(term.tags).sort((a, b) => {
    let numA = tagSet[a] ? tagSet[a].parents.length : 0
    let numB = tagSet[b] ? tagSet[b].parents.length : 0
    if (a == 'Possessive') {
      return -1
    }
    if (b == 'Possessive') {
      return 1
    }
    return numA > numB ? -1 : 1
  })
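The difference between the two comparators can be seen in a standalone sketch. Everything here is an assumption made for illustration: the parent counts in `tagSet` are invented so that the current comparator reproduces the order listed above, and `adjPastRules` and `resolveAdjPast` are hypothetical stand-ins for however the tagger consults the previous term's sorted tags (modelled here as "the first tag that has a rule wins").

```javascript
// Hypothetical parent counts, chosen only to reproduce the listed order.
// compromise's real tagSet differs.
const withDepth = n => ({ parents: new Array(n).fill('placeholder') })
const tagSet = {
  MaleName: withDepth(6),
  FirstName: withDepth(5),
  Person: withDepth(4),
  Singular: withDepth(3),
  ProperNoun: withDepth(2),
  Possessive: withDepth(1),
  Noun: withDepth(0),
}
const depthOf = tag => (tagSet[tag] ? tagSet[tag].parents.length : 0)

// Comparator quoted in the thread: deeper tags first.
const currentSort = (a, b) => (depthOf(a) > depthOf(b) ? -1 : 1)

// Proposed comparator: 'Possessive' always sorts first.
const proposedSort = (a, b) => {
  if (a === 'Possessive') return -1
  if (b === 'Possessive') return 1
  return depthOf(a) > depthOf(b) ? -1 : 1
}

// Assumed resolution step: the first sorted tag that has a rule decides
// how a following Adj|Past word is read.
const adjPastRules = new Map([
  ['ProperNoun', 'Verb'],
  ['Possessive', 'Adjective'],
])
function resolveAdjPast(prevTags, comparator) {
  const sorted = Array.from(prevTags).sort(comparator)
  const deciding = sorted.find(tag => adjPastRules.has(tag))
  return deciding ? adjPastRules.get(deciding) : null
}

const johns = new Set(['Noun', 'Possessive', 'ProperNoun', 'Person', 'FirstName', 'MaleName', 'Singular'])
console.log(resolveAdjPast(johns, currentSort)) // Verb
console.log(resolveAdjPast(johns, proposedSort)) // Adjective
```

Under these assumptions, the current order reaches ProperNoun before Possessive, so the following word is read as a verb; moving Possessive to the front flips the result to an adjective.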

@spencermountain
Owner

thanks Ryan, hope to have a fix for this in the next day or two.
cheers

@spencermountain
Owner

released as 14.11.1 - please check it out, and see if you can find more examples where it is misunderstanding an apostrophe s, either as Possessive, or through is/has.
thanks!


3 participants