This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

Tokenizing large corpus #80

Open
quetz opened this issue Nov 1, 2020 · 2 comments


quetz commented Nov 1, 2020

Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files.

Is it possible to read the corpus file line by line, or to split it up in some other way (while still training on it as a whole)?

xbelonogov (Contributor) commented

No, there is no easy way to do this.

If the training data is too large to fit into memory, you can most likely subsample random sentences, and this shouldn't significantly affect quality.
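For anyone who wants to try this: a minimal sketch of such subsampling that never holds more than `k` lines in memory, using single-pass reservoir sampling. The file names and sample size are placeholders, not part of YouTokenToMe:

```python
import random

def subsample_lines(path, k, seed=0):
    """Reservoir-sample k lines from a large file in one pass,
    keeping at most k lines in memory at any time."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                # Keep line i with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line
    return reservoir

# Hypothetical usage: write a 1M-sentence sample to train on.
with open("train_subsample.txt", "w", encoding="utf-8") as out:
    out.writelines(subsample_lines("full_corpus.txt", k=1_000_000))
```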


rrrepsac commented Jun 29, 2021

Are you going to add encoding directly from a file (a file dataset)? Right now, bpe.encode on an in-memory list takes longer than bpe.train on a file; isn't that odd? And bpe.train uses less memory than bpe.encode with the full list loaded.
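Until file-based encoding exists, one workaround is to stream the file and encode it batch by batch, which bounds memory even if it doesn't close the speed gap. A sketch under the assumption of the standard YouTokenToMe Python API (yttm.BPE, encode, OutputType.ID); the model path, file name, and batch size are placeholders:

```python
from itertools import islice

import youtokentome as yttm

bpe = yttm.BPE(model="model.bin")  # hypothetical trained model path

def encode_file(path, batch_size=10_000):
    """Encode a large file one batch at a time, so only one batch
    of sentences (and its token IDs) is in memory at once."""
    with open(path, encoding="utf-8") as f:
        while True:
            batch = [line.rstrip("\n") for line in islice(f, batch_size)]
            if not batch:
                break
            yield from bpe.encode(batch, output_type=yttm.OutputType.ID)

for ids in encode_file("full_corpus.txt"):
    ...  # write the IDs to disk, feed a model, etc.
```

Batching also gives encode a full list to work on per call rather than one sentence at a time, which should help throughput somewhat.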
