This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

Tokenizing large corpus #80

Open
quetz opened this issue Nov 1, 2020 · 2 comments


quetz commented Nov 1, 2020

Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files.

Is it possible to read the corpus file line by line, or to split it up in some other way (while still training on it as a whole)?

xbelonogov (Contributor) commented

No, there is no easy way to do this.

If the training data is too large to fit into memory, you can most likely subsample random sentences, and this shouldn't significantly affect quality.
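For anyone who wants to try this: a minimal sketch of such subsampling that never holds more than `k` lines in memory, using single-pass reservoir sampling. The file names and sample size are placeholders, not part of YouTokenToMe:

```python
import random

def subsample_lines(path, k, seed=0):
    """Reservoir-sample k lines from a large file in one pass,
    keeping at most k lines in memory at any time."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                # Keep line i with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line
    return reservoir

# Hypothetical usage: write a 1M-sentence sample to train on.
with open("train_subsample.txt", "w", encoding="utf-8") as out:
    out.writelines(subsample_lines("full_corpus.txt", k=1_000_000))
```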


rrrepsac commented Jun 29, 2021

Are you going to add encoding directly from a file (a file dataset)? Right now, bpe.encode on an in-memory list takes longer than bpe.train on a file; isn't that odd? And bpe.train uses less memory than bpe.encode with the full list loaded.
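Until file-based encoding exists, one workaround is to stream the file and encode it batch by batch, which bounds memory even if it doesn't close the speed gap. A sketch under the assumption of the standard YouTokenToMe Python API (yttm.BPE, encode, OutputType.ID); the model path, file name, and batch size are placeholders:

```python
from itertools import islice

import youtokentome as yttm

bpe = yttm.BPE(model="model.bin")  # hypothetical trained model path

def encode_file(path, batch_size=10_000):
    """Encode a large file one batch at a time, so only one batch
    of sentences (and its token IDs) is in memory at once."""
    with open(path, encoding="utf-8") as f:
        while True:
            batch = [line.rstrip("\n") for line in islice(f, batch_size)]
            if not batch:
                break
            yield from bpe.encode(batch, output_type=yttm.OutputType.ID)

for ids in encode_file("full_corpus.txt"):
    ...  # write the IDs to disk, feed a model, etc.
```

Batching also gives encode a full list to work on per call rather than one sentence at a time, which should help throughput somewhat.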
