Add unit tests for large input strings and a large corpus #20

Open
jcbrockschmidt opened this issue Nov 20, 2020 · 3 comments
Labels: enhancement (New feature or request)

jcbrockschmidt (Collaborator) commented Nov 20, 2020

In response to issue #19, we should add unit tests that run on large strings, as well as on a large corpus of strings. This should help us catch performance regressions down the road.
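
A rough sketch of what one of these tests could look like, assuming unittest; the input size and the time budget below are placeholders, not agreed-upon numbers:

```python
import time
import unittest

from better_profanity import profanity


class TestLargeString(unittest.TestCase):
    def test_large_string_speed(self):
        # Use a small custom word list so the test doesn't depend on
        # the contents of the default wordlist.
        profanity.load_censor_words(["badword"])
        # Build a large input (roughly 1 MB) by repeating one paragraph.
        paragraph = "An ordinary sentence that contains one badword here. "
        text = paragraph * 20_000
        start = time.monotonic()
        censored = profanity.censor(text)
        elapsed = time.monotonic() - start
        self.assertIn("****", censored)
        # Placeholder budget; tune once we have real timings.
        self.assertLess(elapsed, 10.0)
```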

snguyenthanh (Owner) commented:
I think this is a good idea. One problem is where to store the test dataset, as I would prefer not to keep a very big text file in the package that is only used for testing.

I will come up with a way to download the dataset only for running tests and remove it afterwards. @jcbrockschmidt Can you help benchmark and improve the current algorithm for long texts?
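
Something like a session-scoped pytest fixture could handle the download-then-delete step; the URL is a placeholder, and if the tests stay on plain unittest, setUpModule/tearDownModule could do the same job:

```python
import os
import tempfile
import urllib.request

import pytest

DATASET_URL = "https://example.com/large_corpus.txt"  # placeholder URL


@pytest.fixture(scope="session")
def large_corpus():
    # Download the corpus once per test session into a temp file...
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    urllib.request.urlretrieve(DATASET_URL, path)
    with open(path, encoding="utf-8") as f:
        yield f.read()
    # ...and delete it after the last test finishes.
    os.remove(path)
```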

jcbrockschmidt (Collaborator, Author) commented:
I am definitely willing to help. For the large unit tests, I was thinking it may be enough to include a dozen or so paragraphs (and their censored counterparts) and run each of them 10 or 100 times. As long as the set of repeated paragraphs includes an even mix of paragraphs with (1) no censored words, (2) some censored words, and (3) many censored words, that should be enough to catch large slowdowns. This dataset shouldn't take up more than a few MB.
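
As a sketch, assuming we check the paragraph pairs into the repo as plain data:

```python
import unittest

from better_profanity import profanity

# (raw paragraph, expected censored output) pairs; the real list would
# be a dozen or so paragraphs with an even mix of no / some / many
# censored words.
PARAGRAPH_PAIRS = [
    ("A clean paragraph with nothing to censor.",
     "A clean paragraph with nothing to censor."),
    # ...
]

REPEATS = 100  # 10-ish or 100-ish, as discussed


class TestParagraphMix(unittest.TestCase):
    def test_repeated_paragraphs(self):
        profanity.load_censor_words()
        for _ in range(REPEATS):
            for raw, expected in PARAGRAPH_PAIRS:
                self.assertEqual(profanity.censor(raw), expected)
```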

It may, however, be a good idea to have a more extensive benchmarking script kept separate from these new unit tests. For that script, yes: I think downloading the dataset would be wiser. I have a rough benchmarking script already written. The biggest challenge will be finding a reliable download link for our dataset. The dataset I'm currently using is hosted on a lot of different websites of questionable reliability, so I'd need to track down its origin.
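
Roughly, something along these lines; the corpus path and repeat count are placeholders:

```python
import time

from better_profanity import profanity


def benchmark(corpus_path, repeats=10):
    profanity.load_censor_words()
    with open(corpus_path, encoding="utf-8") as f:
        texts = f.read().splitlines()
    start = time.monotonic()
    for _ in range(repeats):
        for text in texts:
            profanity.censor(text)
    elapsed = time.monotonic() - start
    total = repeats * len(texts)
    print(f"Censored {total} strings in {elapsed:.2f}s "
          f"({total / elapsed:.0f} strings/s)")


if __name__ == "__main__":
    benchmark("amazon_reviews.txt")  # placeholder corpus file
```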

jcbrockschmidt (Collaborator, Author) commented:
This link might be reliable enough for the Amazon reviews dataset I was looking at. We probably want to throw some extra datasets into the mix, though, such as very long documents (e.g., short stories or books) with some profanity included.
