Apostrophe in query breaks pyterrier: Lexical error: Encountered: "\'" #253

Closed
2 tasks done
jjdelvalle opened this issue Dec 16, 2021 · 5 comments · Fixed by #340
Labels
bug Something isn't working

Comments

@jjdelvalle
Contributor

jjdelvalle commented Dec 16, 2021

Describe the bug

If a query contains an apostrophe, pyterrier doesn't seem to know how to handle it. For instance, if the query
is "Queen 's Statements on Brexit", you get the following error upon running retriever.search(query):

jnius.JavaException: JVM exception occurred: Failed to process qid 1 'Queen 's Statements on Brexit' -- Lexical error at line 1, column 7. Encountered: "'" (39), after : "" org.terrier.querying.parser.QueryParserException

To Reproduce
Steps to reproduce the behavior:

  1. Index was built using pd_indexer.setProperty("termpipelines", "NoOp"), pd_indexer.setProperty("tokeniser", "EnglishTokeniser") and indexref = pd_indexer.index(df["text"], df["docno"])
  2. Retriever was built using this: retrieval_props = {"termpipelines": "NoOp", "tokeniser": "EnglishTokeniser"} and retriever = pt.BatchRetrieve(indexref, properties = retrieval_props)
  3. N/A
  4. No output because of the error.
  5. Error: jnius.JavaException: JVM exception occurred: Failed to process qid 1 'Queen 's Statements on Brexit' -- Lexical error at line 1, column 7. Encountered: "\'" (39), after : "" org.terrier.querying.parser.QueryParserException

Expected behavior

I would expect pyterrier to be able to handle a single quote/apostrophe.

Attempts at escaping the ' character result in pyterrier complaining about ' or \ being present in the query.
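
One way to see in advance which queries would hit this error is to scan the topics for suspicious characters. The sketch below is illustrative only: `find_risky_queries` is a hypothetical helper, and treating every character other than ASCII letters, digits, and whitespace as risky is a deliberately conservative assumption, not Terrier's actual list of reserved characters.

```python
import re

# Assumption: anything that is not an ASCII letter, digit, or whitespace
# might be interpreted by the query parser. This is conservative and is
# NOT Terrier's actual list of reserved characters.
_RISKY = re.compile(r"[^A-Za-z0-9\s]")


def find_risky_queries(topics):
    """Return (qid, query, offending_chars) for each query with risky characters.

    `topics` is an iterable of (qid, query) pairs.
    """
    flagged = []
    for qid, query in topics:
        chars = sorted(set(_RISKY.findall(query)))
        if chars:
            flagged.append((qid, query, chars))
    return flagged


topics = [("1", "Queen 's Statements on Brexit"), ("2", "brexit statements")]
flagged = find_risky_queries(topics)
```

Here only qid 1 is flagged, with the apostrophe reported as the offending character.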

Documentation and Issues

Additional context
I'm trying to build a very basic baseline system and then run a few experiments to improve performance. I have not tried other pipelines or other termpipelines arguments besides "" (for which the error remains). Additionally, #62 seems to have a relevant comment at the end, but surely "removing the '" shouldn't be the final solution?

@jjdelvalle jjdelvalle added the bug Something isn't working label Dec 16, 2021
@seanmacavaney
Collaborator

Hi @jjdelvalle,

Thanks for reporting. The problem is that those characters have a special meaning in the Terrier query language. You can strip them out using the Tokeniser, for example:

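# The tokeniser configured in Terrier (requires PyTerrier to be initialised first)
tokeniser = pt.autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()
def strip_markup(text):
    # Tokenise, then re-join the tokens with single spaces
    return " ".join(tokeniser.getTokens(text))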

# Example:
strip_markup("Queen 's Statements on Brexit")
# 'queen s statements on brexit'

If you have a topics dataframe, you could use a pt.apply to do this to all rows.

topics = pt.apply.query(lambda r: strip_markup(r.query))(topics)
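
If calling into the JVM tokeniser is inconvenient, the same clean-up can be approximated in plain Python. This is a sketch under assumptions: `strip_query_markup` is a hypothetical name, and lowercasing plus splitting on anything other than ASCII letters and digits only approximates what the Terrier tokeniser does. It would also break up words containing non-ASCII characters.

```python
import re

# Assumption: any run of characters other than ASCII letters and digits
# is treated as a token separator. This only approximates Terrier's tokeniser.
_SEPARATORS = re.compile(r"[^A-Za-z0-9]+")


def strip_query_markup(query):
    """Lowercase the query and keep only alphanumeric tokens, space-joined."""
    tokens = _SEPARATORS.split(query.lower())
    # split() yields empty strings at the edges; drop them
    return " ".join(t for t in tokens if t)


cleaned = strip_query_markup("Queen 's Statements on Brexit")
```

For the query from this issue, `cleaned` matches the tokeniser output shown above, but the two approaches may differ on other inputs.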

@cmacdonald -- what do you think about adding this function (or something similar) to pt.Utils?

@jjdelvalle
Contributor Author

Really appreciate the response, @seanmacavaney. That does seem to work nicely. I was hoping I wouldn't have to do any preprocessing at all for my baseline model but I guess this is minimal enough.

My mistake was assuming that the tokenizer was run for every query. Thanks for the tip for the dataframe, that's exactly my use case.

@seanmacavaney
Collaborator

Great!

@cmacdonald, maybe we should consider having a flag to indicate that the query contains markup, similar to applypipeline:off (which can default to false when not present)? It seems like the normal cases are probably ones like this.

@cmacdonald
Contributor

what do you think about adding this function (or something similar) to pt.Utils?

This is pretty much the same request as in #252. Will have a think about how to solve more generically.

@cmacdonald
Contributor

what do you think about adding this function (or something similar) to pt.Utils?

Yes, although I would propose pt.rewrite.tokenise(). I'll post a new issue with proposals.
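
The proposal amounts to a query-rewriting step composed in front of the retriever. As a rough illustration of that shape (not of any actual PyTerrier API), the sketch below wraps a search callable so that each query is tokenised first. All names here are hypothetical: `tokenise_before`, `simple_tokenise`, and `fake_search` are stand-ins, and the regex tokeniser is an assumption rather than Terrier's tokeniser.

```python
import re


def simple_tokenise(query):
    """Hypothetical stand-in tokeniser: lowercase, keep ASCII letter/digit runs."""
    return " ".join(re.findall(r"[a-z0-9]+", query.lower()))


def tokenise_before(search_fn, tokenise=simple_tokenise):
    """Wrap `search_fn` so every query is tokenised before it is searched."""
    def wrapped(query):
        return search_fn(tokenise(query))
    return wrapped


def fake_search(query):
    """Stand-in for retriever.search that rejects apostrophes, as the parser does."""
    if "'" in query:
        raise ValueError("Lexical error: encountered an apostrophe")
    return [query]


safe_search = tokenise_before(fake_search)
results = safe_search("Queen 's Statements on Brexit")
```

The raw query would raise in `fake_search`, while the wrapped version passes through the cleaned text. Keeping the rewrite as a separate step means it can be left out for queries that intentionally use the query language.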

@cmacdonald cmacdonald linked a pull request Nov 2, 2022 that will close this issue