Multi-word segmentation #220

Closed
akshatdewan opened this issue Oct 23, 2018 · 2 comments

@akshatdewan

Hi there! I was wondering if I can utilize sentencepiece for segmenting text into tokens where tokens would be "multi-word" instead of "sub-word". I want to do this to reduce the number of tokens in the segmentation.

I was thinking that if whitespace were treated as a regular symbol, it could appear in the middle of tokens too (unlike now, where whitespace can only be at the beginning or end of a token), and this would allow "multi-word" segmentation. To that end, I thought of trying --control_symbols=" ", but I don't think that is a good idea because I would lose all the whitespace information in the encoded output.
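
To make the idea concrete, here is a toy sketch of what "whitespace as a regular symbol" would buy. It is not how sentencepiece segments text: the greedy longest-match rule and both vocabularies are hypothetical and hand-picked for illustration. The only thing borrowed from sentencepiece is the "▁" meta symbol that stands in for whitespace.

```python
# Toy illustration only: NOT sentencepiece's algorithm.
# The vocabularies below are hypothetical and chosen by hand.
WS = "\u2581"  # the "▁" meta symbol that marks whitespace


def greedy_segment(text, vocab):
    """Greedy longest-match segmentation over whitespace-escaped text."""
    s = WS + text.replace(" ", WS)
    max_len = max([len(p) for p in vocab] + [1])
    pieces, i = [], 0
    while i < len(s):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(s), i + max_len), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces


def decode(pieces):
    """Undo the whitespace escaping; the first symbol is always the added "▁"."""
    return "".join(pieces).replace(WS, " ")[1:]


text = "new york is in new york state"

# Word-level vocabulary: "▁" only ever appears at the start of a piece.
word_vocab = {WS + w for w in ["new", "york", "is", "in", "state"]}
# Same vocabulary plus one piece with "▁" in the middle (a "multi-word" piece).
multi_vocab = word_vocab | {WS + "new" + WS + "york"}

word_pieces = greedy_segment(text, word_vocab)
multi_pieces = greedy_segment(text, multi_vocab)

print(len(word_pieces), word_pieces)
print(len(multi_pieces), multi_pieces)
```

Because the whitespace marker stays inside the pieces, `decode(multi_pieces)` still recovers the original text, which is the information that would be lost by declaring the space a control symbol.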

I hope I am clear about what I intend to do. Look forward to your suggestions.

Thanks!

@taku910
Collaborator

taku910 commented Oct 24, 2018

Does 'multi-word' mean extracting pieces like "Hello_world"?
If so, you might want to try spm_train --split_by_whitespace=false. This flag allows pieces that contain whitespace in the middle to be extracted.
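
For readers using the Python wrapper rather than the spm_train binary, the same flag can be passed as part of the training argument string. The following is a minimal sketch, not a recommended configuration: the helper names, the corpus path, the model prefix, and the vocabulary size are all placeholders introduced here, and it assumes the sentencepiece package is installed.

```python
def build_train_args(corpus_path, model_prefix, vocab_size):
    """Assemble spm_train-style flags; all values are caller-supplied placeholders."""
    flags = {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        # Let pieces span word boundaries (whitespace in the middle of a piece).
        "split_by_whitespace": "false",
    }
    return " ".join(f"--{name}={value}" for name, value in flags.items())


def train_and_encode(corpus_path, model_prefix, vocab_size, sample_text):
    """Hypothetical helper: train a model, then segment one sample sentence."""
    # Imported lazily so the flag-building helper works without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.Train(
        build_train_args(corpus_path, model_prefix, vocab_size)
    )
    sp = spm.SentencePieceProcessor()
    sp.Load(model_prefix + ".model")
    return sp.EncodeAsPieces(sample_text)


# Placeholder values; replace with a real corpus and a suitable vocabulary size.
print(build_train_args("corpus.txt", "multiword", 8000))
```

Whether multi-word pieces actually show up in the learned vocabulary depends on the corpus and the vocabulary size, so it is worth inspecting the generated .vocab file after training.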

However, in my preliminary experiments, allowing multi-word pieces brought no quality improvement, at least for MT.

Thank you.

@akshatdewan
Author

Many thanks! This is exactly what I was looking for. I want to do this because my target sequences are very long and contain a lot of redundant information, and I am hoping this will help. Thanks again!
