
A recent EMNLP work to share about task-adaptive tokenization with variable segmentation #924

Open
lsy641 opened this issue Oct 24, 2023 · 4 comments

lsy641 commented Oct 24, 2023

I found the earlier discussion in Multi-word segmentation #220 really interesting, and I learned that your project members have experimented with segmentation beyond the word level on MT datasets without seeing significant improvement.

I think this is because the sub-word vocabulary was already trained on MT data, so there is little room to improve effectiveness by changing granularity, although increasing granularity can still boost efficiency. In the era of pretrained models, however, I have been rethinking the granularity and compositionality of generation in downstream domains.


Recently, our work (https://arxiv.org/abs/2310.05317) provides a way to let a pretrained model adopt a task-adaptive tokenizer that supports variable segmentation optimized on downstream data. This allows multiple coarser-grained segmentations (still at the sub-word level) to be sampled. It brings significant improvements in both generation effectiveness and efficiency on tasks where task-specific terminology shows up often (e.g., medical and mental health).
The improvement comes from two sources: 1. the gap between the pretraining vocabulary (for example, the BERT vocabulary was optimized on the GNMT benchmark, which may suit MT but not other tasks) and the downstream language style; 2. the efficiency potential of variable segmentation.

To build a task-adaptive tokenizer, I currently stitch the pretraining vocabulary and the downstream vocabulary together manually, using the protobuf APIs provided by sentencepiece_model_pb2.py and sentencepiece_pb2.py, and then build a new tokenizer compatible with HuggingFace. I was wondering whether your project would be interested in providing a function that lets researchers easily build a task-adaptive tokenizer.
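
In case it helps others reading this issue, here is a minimal sketch of the vocabulary-stitching step described above. It only touches sentencepiece_model_pb2.py (not sentencepiece_pb2.py), and the file names (base.model, downstream_vocab.tsv), the tab-separated piece/score format, and the USER_DEFINED piece type are my own illustrative assumptions, not the exact recipe from the paper:

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

# Load the pretrained SentencePiece model as an editable protobuf.
m = sp_model.ModelProto()
with open("base.model", "rb") as f:
    m.ParseFromString(f.read())

existing = {p.piece for p in m.pieces}

# Append downstream pieces (one "piece<TAB>score" per line) that the
# pretrained vocabulary does not already contain.
with open("downstream_vocab.tsv", encoding="utf-8") as f:
    for line in f:
        piece, score = line.rstrip("\n").split("\t")
        if piece not in existing:
            new_piece = m.pieces.add()
            new_piece.piece = piece
            new_piece.score = float(score)  # assumption: keep downstream scores as-is
            new_piece.type = sp_model.ModelProto.SentencePiece.USER_DEFINED

# Serialize the merged vocabulary back into a .model file that a
# SentencePiece-backed HuggingFace tokenizer class can load.
with open("merged.model", "wb") as f:
    f.write(m.SerializeToString())

# Quick sanity check with the plain SentencePiece processor.
sp = spm.SentencePieceProcessor(model_file="merged.model")
print(sp.encode("task-adaptive tokenization", out_type=str))
```

How the added pieces' scores should be calibrated against the pretrained ones is a separate modeling question; the sketch just copies them verbatim.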

RubyBit commented Feb 5, 2024

I read the paper. Is there any code available that showcases the algorithm?

lsy641 commented Feb 21, 2024

> I read the paper. Is there any code available that showcases the algorithm?

@RubyBit Hello. I am currently organizing the code, but I can add you to our repository in advance. If you would like to join, please give me your GitHub account.

RubyBit commented Feb 25, 2024

> > I read the paper. Is there any code available that showcases the algorithm?
>
> @RubyBit Hello. I am currently organizing the code, but I can add you to our repository in advance. If you would like to join, please give me your GitHub account.

Yes, that would be great (this is my GitHub account: RubyBit).

RubyBit commented Mar 20, 2024

@lsy641 I am so sorry, can you resend the invite? I didn't check my mail in time.
