Expose regex token_pattern #20

Open
raj-shah opened this issue Dec 19, 2022 · 1 comment
Labels: enhancement (New feature or request)

Comments

@raj-shah

Hello!

Would it be possible to expose a regex token_pattern param, like the one in scikit-learn's CountVectorizer? This would help with filtering (un)wanted characters during tokenization, e.g. hyphens, ampersands, apostrophes.
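To illustrate the kind of filtering meant here, a minimal sketch using only Python's re module. DEFAULT_PATTERN is the documented default token_pattern of scikit-learn's CountVectorizer; CUSTOM_PATTERN and the sample sentence are assumptions made up for this example, not something KeyphraseVectorizers exposes today.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# runs of two or more word characters, so punctuation always splits tokens.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical custom pattern (an assumption for illustration): a token still
# starts and ends with a word character, but may contain hyphens, ampersands
# and apostrophes in between.
CUSTOM_PATTERN = r"(?u)\b\w[\w&'-]*\w\b"

text = "r&d teams don't ship half-baked code"

default_tokens = re.findall(DEFAULT_PATTERN, text)
custom_tokens = re.findall(CUSTOM_PATTERN, text)

print(default_tokens)
# ['teams', 'don', 'ship', 'half', 'baked', 'code']
print(custom_tokens)
# ['r&d', 'teams', "don't", 'ship', 'half-baked', 'code']
```

With such a parameter, dropping or keeping a character class would only require changing the bracketed set in the pattern.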

The workaround I have found so far is to pass a custom POS tagger (the custom_pos_tagger param of KeyphraseVectorizer) that leaves the POS patterns/behaviour unchanged but recompiles the prefix, suffix, and infix rules of the underlying spaCy tokenizer object. Is there a simpler way to expose this behaviour? Keen to hear your thoughts!

Thanks

@TimSchopf added the enhancement (New feature or request) label on Jan 5, 2023
@andreivintila10

I'm running into the same issue: I need to keep hyphenated compound words within keyphrases.
Using the English model in spaCy, I managed to remove the infix hyphen-splitting rule from the tokenizer before passing the model to the KeyphraseVectorizer. I then tracked the problem down to the transform step on the CountVectorizer, where the compound words are discarded because they do not match the default token_pattern.
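A rough sketch of why the compound words disappear at that step. It is a simplified stand-in for vocabulary-based n-gram counting written with plain Python, not the library's actual code; the helper names, HYPHEN_PATTERN, and the sample document are assumptions for illustration. Only DEFAULT_PATTERN is taken from scikit-learn's documented default.

```python
import re

# Documented default token_pattern of scikit-learn's CountVectorizer.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical hyphen-aware alternative (assumption, not a library default).
HYPHEN_PATTERN = r"(?u)\b\w[\w-]*\w\b"


def ngrams(tokens, n_min, n_max):
    """Return all space-joined word n-grams with n_min <= n <= n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def count_keyphrases(doc, keyphrases, pattern):
    """Count how often each keyphrase occurs among the document's n-grams."""
    tokens = re.findall(pattern, doc.lower())
    max_len = max(len(k.split()) for k in keyphrases)
    grams = ngrams(tokens, 1, max_len)
    return {k: grams.count(k) for k in keyphrases}


doc = "A state-of-the-art model for keyphrase extraction"
keyphrases = ["state-of-the-art model", "keyphrase extraction"]

default_counts = count_keyphrases(doc, keyphrases, DEFAULT_PATTERN)
hyphen_counts = count_keyphrases(doc, keyphrases, HYPHEN_PATTERN)

print(default_counts)
# {'state-of-the-art model': 0, 'keyphrase extraction': 1}
print(hyphen_counts)
# {'state-of-the-art model': 1, 'keyphrase extraction': 1}
```

Under the default pattern the document is split into "state", "of", "the", "art", so no n-gram can ever equal the hyphenated keyphrase, even when the spaCy tokenizer kept it intact.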

I've worked on this here: https://github.com/andreivintila10/KeyphraseVectorizers

3 participants