
Version 0.7.0 significantly slower than 0.6.1 #19

Open
virtustate opened this issue Nov 9, 2020 · 10 comments
@virtustate

We've been using this great module in a larger analysis for months: thank you for making it available! Here's the relevant code:

corpus = reviews['review_text']
profanity.add_censor_words([x.lower() for x in other_stop_words])
corpus = corpus.apply(profanity.censor, args=(' ',))

The statement corpus.apply(profanity.censor, args=(' ',)) is taking a couple of orders of magnitude longer under version 0.7.0 than under 0.6.1. Here are some timings with everything the same except the better_profanity version. "Time to apply profanity" covers just corpus = corpus.apply(profanity.censor, args=(' ',)).

better_profanity=0.7.0

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 95.60132622718811
Time it takes to run this LDA model: 97.7805495262146
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 29.865559816360474
Time it takes to run this LDA model: 30.556731939315796
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 710.3287241458893
Time it takes to run this LDA model: 726.3966491222382
{'pocket', 'comfortable', 'feel', 'waist', 'fit', 'see', 'color', 'soft'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 10.014711618423462
Time it takes to run this LDA model: 10.347350835800171
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 121.31187510490417
Time it takes to run this LDA model: 123.6646056175232
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

better_profanity=0.6.1

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 0.19291996955871582
Time it takes to run this LDA model: 4.058649063110352
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 0.05718731880187988
Time it takes to run this LDA model: 0.7385601997375488
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 1.264852523803711
Time it takes to run this LDA model: 17.08655619621277
{'pocket', 'size', 'comfortable', 'waist', 'fit', 'around', 'color', 'amazing'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 0.018624067306518555
Time it takes to run this LDA model: 0.34430885314941406
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 0.2005002498626709
Time it takes to run this LDA model: 2.5792129039764404
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

@snguyenthanh
Owner

Hi @virtustate, thank you for using the library and reporting the issue.

Could you please share some of the texts and censor_words that the .censor function runs on?

@virtustate
Author

import pandas
import better_profanity
from better_profanity import profanity
import time

df = pandas.read_csv('datasetSentences.txt', sep='\t')
df = df[1:100]
start = time.time()
corpus = df['sentence']
corpus = corpus.apply(better_profanity.profanity.censor, args=(' ',))
print(time.time() - start, better_profanity.__version__)

Attachment: datasetSplit.txt

@virtustate
Author

virtustate commented Nov 11, 2020

With this script I'm getting about 30 s with 0.7.0 versus 0.04 s with 0.6.1 when run under the debugger, and 4 s with 0.7.0 versus 0.013 s with 0.6.1 in a normal run.

@snguyenthanh
Owner

Interesting. 0.7.0 reduces the extra memory consumption for the word variants, but somehow increases the runtime.
I will look into it by the weekend and get back to you.

@snguyenthanh
Owner

I suggest using version 0.6.1 in the meantime, though the memory usage could be a bit high.

@virtustate
Author

Thanks, no hurry. 0.6.1 is working well for us without any memory issues.

@jcbrockschmidt
Collaborator

jcbrockschmidt commented Nov 20, 2020

For some context, we made some fairly substantial changes to the underlying data structure used to compare strings. The changes and their justification can be found in pull request #17.

The benchmarking I did between 0.6.1 and 0.7.0 used only the unit tests, and there I found 0.7.0 to be substantially faster. I assumed their overall runtime was representative of most use cases. That assumption was incorrect: evidently the unit tests were faster only because we cut word-list load times so drastically (we were originally having issues with exponential memory usage on large word lists, hence the changes to pre-processing). I've opened issue #20 to add unit tests that will help us catch these slowdowns in the future.

A quick-and-dirty fix could be to use the original algorithm when a word list is sufficiently small, and switch to the new memory-efficient algorithm only when necessary (a rough sketch of that dispatch follows). In the meantime, I'm going to explore faster solutions for comparing strings with character variants.
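To illustrate the idea, the dispatch could look something like the sketch below. All the names, the toy variant map, and the threshold are invented for this sketch; they are not better_profanity internals.

```python
from itertools import product

# Hypothetical values for illustration only.
CHAR_VARIANTS = {"a": "a@4", "e": "e3", "i": "i1!", "o": "o0", "s": "s$5"}
SMALL_LIST_THRESHOLD = 1000


def expand_variants(word):
    # 0.6.1-style pre-expansion: enumerate every spelling of `word`
    # under the variant map. A word of length N can expand to ~4^N forms.
    choices = [CHAR_VARIANTS.get(ch, ch) for ch in word.lower()]
    return {"".join(combo) for combo in product(*choices)}


def build_lookup(word_list):
    if len(word_list) <= SMALL_LIST_THRESHOLD:
        # Small list: the exponential memory is affordable and every
        # lookup is a single O(1) set membership test.
        banned = set()
        for word in word_list:
            banned |= expand_variants(word)
        return lambda token: token.lower() in banned
    # Large list: fall back to the compact 0.7.0-style structure
    # (elided here; see the trie sketch later in this thread).
    raise NotImplementedError("compact structure not shown in this sketch")
```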

@jcbrockschmidt
Collaborator

jcbrockschmidt commented Nov 20, 2020

A slice of this dataset should be good for benchmarking on our end. It contains 278,677 user reviews for clothing products. Quick preliminary tests on a slice of 1,000 reviews show 0.7.0 to be roughly 5x slower than 0.6.1 (excluding the load time for the word list).
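For anyone following along, a benchmark in that spirit might look like the sketch below; the filename clothing_reviews.csv and the review_text column are assumptions about the dataset layout, not the exact script I ran.

```python
import time

import pandas as pd
from better_profanity import profanity

# Assumed file/column names for the clothing-review dataset.
reviews = pd.read_csv("clothing_reviews.csv")["review_text"].dropna()
sample = reviews.iloc[:1000]

profanity.load_censor_words()  # load the default word list up front

start = time.time()
sample.apply(profanity.censor, args=(" ",))
print(f"Censored {len(sample)} reviews in {time.time() - start:.2f}s")
```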

@jcbrockschmidt
Collaborator

jcbrockschmidt commented Nov 25, 2020

Did some rough math and came up with the big-O bounds for our algorithm for 0.6.1 vs 0.7.0. Let n be the number of letters in a word, w be the number of words in a word list, and N be the maximum number of characters for a word in a word list.

| Version | Memory | Lookup |
| --- | --- | --- |
| 0.6.1 | O(w * 4^N) | O(1) |
| 0.7.0 | O(w * N) | O(n) |
| New algorithm | O(w * N) | O(n) |

The word lookup in 0.6.1 was basically just a hash lookup in a plain Python set, so it's going to be hard to beat. That said, the lookup for the algorithm I'm working on should be essentially O(1) in practice. This holds for documents with few or no curse words, or with lots of curse words that are relatively short. Regardless, there will be some extra overhead versus 0.6.1; whether that overhead is significant is yet to be determined.
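To make the contrast concrete, here is a toy version of the two lookup styles. This is illustrative code only, not the library's implementation; the trie layout is heavily simplified.

```python
# 0.6.1-style: every variant is stored explicitly, so membership is a
# single O(1) hash probe.
banned_expanded = {"darn", "d@rn", "d4rn"}


def is_banned_set(word):
    return word in banned_expanded


# 0.7.0-style: memory stays O(w * N) because each stored character keeps
# one small variant set, but lookup now walks the word in O(n).
def build_trie(words, variants):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(frozenset(variants.get(ch, ch)), {})
        node["$"] = True  # end-of-word marker
    return root


def is_banned_trie(word, root):
    node = root
    for ch in word:
        for edge, child in node.items():
            if edge != "$" and ch in edge:
                node = child
                break
        else:
            return False
    return "$" in node


trie = build_trie(["darn"], {"a": "a@4"})
assert is_banned_trie("d4rn", trie) and not is_banned_trie("dorn", trie)
```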

As a side note, it looks like we can get marginal speedups for dictionary look-ups (a reduction of about 17%) by interning strings, as demonstrated here.
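The mechanism behind that trick: a CPython dict lookup checks object identity before falling back to a character-by-character string compare, so looking up with an interned key can short-circuit. A minimal, self-contained demonstration (numbers will differ from the linked one):

```python
import sys
import timeit

key = "a_fairly_long_dictionary_key_used_for_lookups" * 4
table = {sys.intern(key): 1}

distinct = "".join(list(key))     # equal value, but a different object
identical = sys.intern(distinct)  # the same object as the stored key

print("plain key:   ", timeit.timeit(lambda: table[distinct]))
print("interned key:", timeit.timeit(lambda: table[identical]))
```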

@jcbrockschmidt
Collaborator

jcbrockschmidt commented Nov 27, 2020

Seeing some promising results with a new algorithm, which you can see in this branch. For benchmarking, I generated lorem ipsum paragraphs with 5% of the words replaced with profanity. The filters all use the same default word list. I then ran each relevant version's censor against different sets of paragraphs and timed it. Here are the runtimes in seconds, averaged over 4 runs:

| Version | 10 paragraphs | 100 paragraphs | 1000 paragraphs |
| --- | --- | --- | --- |
| 0.6.1 | 0.46 | 4.82 | 48.43 |
| 0.7.0 | 2.02 | 20.72 | 212.85 |
| New algorithm | 0.01 | 0.10 | 1.01 |

Now, I expected this new algorithm to be slower than 0.6.1, so these times are a bit surprising. The trials did include assertion checks against known censored versions of the text, and my algorithm passes all the unit tests. But given our past mistake of not testing 0.7.0 thoroughly enough, we should resolve #20 first so we have something better verified to test with.
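For anyone who wants to reproduce this style of benchmark, a sketch of the setup follows. The filler words, the stand-in profanities, and the seed are arbitrary choices for the sketch; only the 5% substitution rate matches the description above.

```python
import random
import time

from better_profanity import profanity

random.seed(0)
FILLER = ("lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
          "eiusmod tempor incididunt ut labore et dolore magna aliqua").split()
CURSES = ["damn", "crap", "hell"]  # stand-ins; any words from the list work


def make_paragraph(n_words=100, profane_frac=0.05):
    # Lorem ipsum paragraph with ~5% of words swapped for profanity.
    words = [random.choice(FILLER) for _ in range(n_words)]
    for i in random.sample(range(n_words), int(n_words * profane_frac)):
        words[i] = random.choice(CURSES)
    return " ".join(words)


paragraphs = [make_paragraph() for _ in range(100)]

profanity.load_censor_words()  # same default word list for every version

start = time.time()
for paragraph in paragraphs:
    profanity.censor(paragraph)
print(f"Censored {len(paragraphs)} paragraphs in {time.time() - start:.2f}s")
```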
