Identical topics #416

Closed
ghost opened this issue Aug 2, 2015 · 30 comments
Labels: bug, difficulty medium

Comments

ghost commented Aug 2, 2015

This doesn't seem right: LDA training on enwiki with 1000 topics (gensim unmodified) is producing several identical, all-zero topics.

2015-08-02 12:09:07,550 : INFO : merging changes from 3750 documents into a model of 3831719 documents
2015-08-02 12:09:35,378 : INFO : topic #938 (0.001): 0.037*census + 0.034*population + 0.027*unincorporated + 0.020*community + 0.017*households + 0.016*landmarks + 0.016*$
2015-08-02 12:09:35,522 : INFO : topic #986 (0.001): 0.015*festival + 0.014*films + 0.013*documentary + 0.010*director + 0.009*award + 0.008*directed + 0.008*producer + 0.$
2015-08-02 12:09:35,666 : INFO : topic #492 (0.001): 0.066*kaunas + 0.048*davidson + 0.037*rosenberg + 0.034*kalamazoo + 0.026*blood + 0.024*sha + 0.023*thorpe + 0.022*vei$
2015-08-02 12:09:35,811 : INFO : topic #392 (0.001): 0.018*laser + 0.016*tucker + 0.015*optical + 0.014*forensic + 0.012*imaging + 0.011*pulse + 0.011*lab + 0.009*sample +$
2015-08-02 12:09:35,954 : INFO : topic #890 (0.001): 0.126*dutch + 0.116*van + 0.071*netherlands + 0.069*amsterdam + 0.034*holland + 0.027*hague + 0.022*der + 0.021*willem$
2015-08-02 12:09:36,098 : INFO : topic #769 (0.001): 0.064*icf + 0.053*cove + 0.050*newfoundland + 0.043*vancouver + 0.041*nunataks + 0.036*columbia + 0.030*labrador + 0.0$
2015-08-02 12:09:36,242 : INFO : topic #75 (0.001): 0.043*dong + 0.042*xu + 0.042*yi + 0.025*narayana + 0.024*tao + 0.023*bingham + 0.023*fei + 0.020*parr + 0.020*ren + 0.$
2015-08-02 12:09:36,386 : INFO : topic #742 (0.001): 0.040*peters + 0.031*leith + 0.030*kahn + 0.028*levy + 0.028*bart + 0.022*hedley + 0.019*bandit + 0.018*robyn + 0.017*$
2015-08-02 12:09:36,529 : INFO : topic #438 (0.001): 0.035*editor + 0.035*newspaper + 0.034*magazine + 0.021*published + 0.018*news + 0.016*daily + 0.014*journalism + 0.01$
2015-08-02 12:09:36,673 : INFO : topic #410 (0.001): 0.046*forest + 0.030*reserve + 0.028*forests + 0.024*species + 0.023*conservation + 0.020*habitat + 0.016*moist + 0.01$
2015-08-02 12:09:36,816 : INFO : topic #322 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:36,960 : INFO : topic #407 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,103 : INFO : topic #808 (0.001): 0.091*sf + 0.067*jensen + 0.066*isaac + 0.056*slater + 0.047*informatics + 0.045*hospice + 0.045*rot + 0.042*koblenz +$
2015-08-02 12:09:37,248 : INFO : topic #282 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,391 : INFO : topic #894 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,606 : INFO : topic diff=inf, rho=0.008998
2015-08-02 12:09:37,902 : INFO : PROGRESS: pass 0, dispatched chunk #12366 = documents up to #3091750/3831719, outstanding queue size 3
2015-08-02 12:09:55,582 : INFO : PROGRESS: pass 0, dispatched chunk #12367 = documents up to #3092000/3831719, outstanding queue size 2
2015-08-02 12:10:03,008 : INFO : PROGRESS: pass 0, dispatched chunk #12368 = documents up to #3092250/3831719, outstanding queue size 3
2015-08-02 12:10:17,426 : INFO : PROGRESS: pass 0, dispatched chunk #12369 = documents up to #3092500/3831719, outstanding queue size 3
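
For context, the setup being reported looks roughly like this (a sketch, not the reporter's exact script; the file names are placeholders assuming gensim's standard make_wiki preprocessing):

from gensim import corpora, models

# Assumed inputs: a bag-of-words Wikipedia corpus and word-id mapping, e.g. as
# produced by `python -m gensim.scripts.make_wiki` (file names are placeholders).
id2word = corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = corpora.MmCorpus('wiki_en_bow.mm')

# With num_topics=1000, this reportedly ends up with many identical, all-zero topics.
lda = models.LdaMulticore(corpus=mm, id2word=id2word, num_topics=1000)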

ghost commented Aug 3, 2015

Well, the sampler is not guaranteed to converge :) And the perplexity was high and oscillating a lot. I'll post back if it works next time.

ghost commented Aug 3, 2015

This seems to be due to the previous divide-by-zero error. It's not limited to ldamulticore; it also occurs in ldamodel, when simply trying to model Wikipedia with 1000 topics.

ghost commented Aug 4, 2015

I have been running further tests, and it occurs with 750 topics, but not with 500, when using 100,000 words in the vocab on the English Wikipedia.

piskvorky (Owner) commented:

I received your log, I'm on it.

Sorry this is taking so long, Brian. We're moving countries and I've only had time for "trivial" open source fixes lately. Debugging this one looks more substantial :)

ghost commented Aug 5, 2015

Oh no worries, I am not trying to rush or anything. I didn't even realize they were the same bug at first.

huihuifan commented:

Experiencing the same issue, but only when adjusting the eta prior.

tmylk commented Jan 10, 2016

@brianmingus Is this resolved? If not, could you please post the link to the log gist? Thanks

ghost commented Jan 11, 2016

I doubt this is resolved - it won't be resolved by accident.

tmylk commented Jan 11, 2016

@brianmingus OK, could you please turn this into a more tractable bug report?
Upload the log to a gist, provide code to reproduce, etc.

ghost commented Jan 11, 2016

This is a serious bug in gensim where it fails to converge when there are a certain number of topics. I think this bug is sufficiently spec'd out - @piskvorky seems to grok it.


ocsponge commented Jun 12, 2017

I got the same bug when I set topics=1000, and I solved the problem by setting the parameters alpha=50/topic_num, eta=0.1, iterations=500.
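
Spelled out as code, that workaround looks something like the following (a sketch with a toy corpus; ocsponge's actual data and remaining settings are unknown):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-in corpus, just to make the snippet self-contained.
texts = [["human", "computer", "interaction"], ["graph", "trees", "minors"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

num_topics = 1000
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    alpha=50.0 / num_topics,  # 50/k, the classic Griffiths & Steyvers heuristic
    eta=0.1,
    iterations=500,
)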

menshikh-iv (Contributor) commented:

@brianmingus @ocsponge please attach concrete code & a dataset to reproduce your problem

menshikh-iv added the bug, difficulty medium, and need info labels on Oct 3, 2017

ghost commented Oct 3, 2017

You do not "need info" for this bug. It is sufficiently spec'd out. Please stop asking for more info.

menshikh-iv (Contributor) commented:

@brianmingus I don't agree with you, because I can't reproduce it now; for this reason I asked for additional information (code and dataset).

ghost commented Oct 3, 2017

I provided enough info to replicate; @piskvorky did not ask for more info.

If you are interested in working on this ticket, the appropriate steps are to check out gensim from the date the ticket was posted, and a current version. If you can replicate on the old one but not the new one, it's fixed.

TC-Rudel commented:

@menshikh-iv, @tmylk, @piskvorky, I'm having the same issue and am including my dataset, dictionary, and code. This is a corpus pulled from Project Gutenberg, split into 3.5M documents, using a rather clipped vocabulary of ~66,000 words. I did not have problems when trying a 400-topic version, but did run into issues with 1000 topics.

The dataset is 2GB zipped and can be downloaded from Google Drive.

Dictionary and repro code are attached as zips.
Create_LDA_Model_repro.zip
dictionary.zip

When I run the code, I get a numerical value for topic diff for the first tranche of documents, but later I get topic diff=inf.

Here is the logging information:

C:\Winpython\WinPython-64bit-3.5.4.0Qt5\python-3.5.4.amd64\lib\site-packages\gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2017-10-30 17:30:34,223 : INFO : loading Dictionary object from clean_vcompact_dictionary.pickle
2017-10-30 17:30:34,253 : INFO : loaded clean_vcompact_dictionary.pickle
2017-10-30 17:30:34,699 : INFO : loaded corpus index from E:/Clean_Corpus/Full_Corpus_LD.mm.index
2017-10-30 17:30:34,700 : INFO : initializing corpus reader from E:/Clean_Corpus/Full_Corpus_LD.mm
2017-10-30 17:30:34,700 : INFO : accepted corpus with 3443509 documents, 66457 features, 612903679 non-zero entries
2017-10-30 17:30:34,703 : INFO : using symmetric alpha at 0.001
2017-10-30 17:30:34,703 : INFO : using symmetric eta at 1.5047323833456219e-05
2017-10-30 17:30:34,712 : INFO : using serial LDA version on this node
2017-10-30 17:36:51,844 : INFO : running online LDA training, 1000 topics, 1 passes over the supplied corpus of 3443509 documents, updating every 4000 documents, evaluating every ~40000 documents, iterating 500x with a convergence threshold of 0.001000
2017-10-30 17:36:51,849 : INFO : training LDA model using 2 processes
C:\Winpython\WinPython-64bit-3.5.4.0Qt5\python-3.5.4.amd64\lib\site-packages\gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
C:\Winpython\WinPython-64bit-3.5.4.0Qt5\python-3.5.4.amd64\lib\site-packages\gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2017-10-30 17:36:52,407 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/3443509, outstanding queue size 1
2017-10-30 17:36:52,718 : INFO : loading Dictionary object from clean_vcompact_dictionary.pickle
2017-10-30 17:36:52,718 : INFO : loading Dictionary object from clean_vcompact_dictionary.pickle
2017-10-30 17:36:52,760 : INFO : loaded clean_vcompact_dictionary.pickle
2017-10-30 17:36:52,762 : INFO : loaded clean_vcompact_dictionary.pickle
2017-10-30 17:36:53,317 : INFO : loaded corpus index from E:/Clean_Corpus/Full_Corpus_LD.mm.index
2017-10-30 17:36:53,317 : INFO : initializing corpus reader from E:/Clean_Corpus/Full_Corpus_LD.mm
2017-10-30 17:36:53,318 : INFO : accepted corpus with 3443509 documents, 66457 features, 612903679 non-zero entries
2017-10-30 17:36:53,330 : INFO : loaded corpus index from E:/Clean_Corpus/Full_Corpus_LD.mm.index
2017-10-30 17:36:53,330 : INFO : initializing corpus reader from E:/Clean_Corpus/Full_Corpus_LD.mm
2017-10-30 17:36:53,330 : INFO : accepted corpus with 3443509 documents, 66457 features, 612903679 non-zero entries
2017-10-30 17:36:54,732 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/3443509, outstanding queue size 2
2017-10-30 17:36:57,326 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #6000/3443509, outstanding queue size 3
2017-10-30 17:36:59,845 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #8000/3443509, outstanding queue size 4
2017-10-30 17:37:00,760 : INFO : PROGRESS: pass 0, dispatched chunk #4 = documents up to #10000/3443509, outstanding queue size 5
2017-10-30 17:37:01,560 : INFO : PROGRESS: pass 0, dispatched chunk #5 = documents up to #12000/3443509, outstanding queue size 6
2017-10-30 17:38:09,246 : INFO : PROGRESS: pass 0, dispatched chunk #6 = documents up to #14000/3443509, outstanding queue size 6
2017-10-30 17:38:26,611 : INFO : merging changes from 4000 documents into a model of 3443509 documents
2017-10-30 17:38:37,241 : INFO : topic #124 (0.001): 0.009*"like" + 0.007*"alexandra" + 0.007*"came" + 0.007*"went" + 0.006*"boys" + 0.005*"mother" + 0.005*"oak" + 0.005*"mrs" + 0.005*"looked" + 0.005*"read"
2017-10-30 17:38:37,242 : INFO : topic #64 (0.001): 0.037*"shall" + 0.029*"thy" + 0.022*"lord" + 0.020*"god" + 0.018*"king" + 0.016*"unto" + 0.015*"things" + 0.015*"precious" + 0.014*"forth" + 0.014*"nephi"
2017-10-30 17:38:37,243 : INFO : topic #287 (0.001): 0.022*"lord" + 0.014*"man" + 0.010*"unto" + 0.010*"came" + 0.009*"shall" + 0.009*"god" + 0.007*"power" + 0.006*"thee" + 0.006*"gold" + 0.006*"mormon"
2017-10-30 17:38:37,244 : INFO : topic #188 (0.001): 0.048*"god" + 0.032*"unto" + 0.031*"hezekiah" + 0.026*"hand" + 0.023*"people" + 0.022*"deliver" + 0.019*"saying" + 0.014*"lord" + 0.013*"fathers" + 0.013*"king"
2017-10-30 17:38:37,244 : INFO : topic #173 (0.001): 0.043*"unto" + 0.028*"shall" + 0.019*"god" + 0.014*"things" + 0.013*"jesus" + 0.012*"came" + 0.012*"lord" + 0.012*"come" + 0.012*"people" + 0.010*"hath"
2017-10-30 17:38:37,593 : INFO : topic diff=985.619743, rho=1.000000
2017-10-30 17:38:37,656 : INFO : PROGRESS: pass 0, dispatched chunk #7 = documents up to #16000/3443509, outstanding queue size 6
2017-10-30 17:39:29,406 : INFO : PROGRESS: pass 0, dispatched chunk #8 = documents up to #18000/3443509, outstanding queue size 6
C:\Winpython\WinPython-64bit-3.5.4.0Qt5\python-3.5.4.amd64\lib\site-packages\gensim\models\ldamodel.py:728: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)
2017-10-30 17:39:50,225 : INFO : merging changes from 4000 documents into a model of 3443509 documents
2017-10-30 17:40:00,745 : INFO : topic #153 (0.001): 0.010*"letters" + 0.009*"caps" + 0.007*"small" + 0.007*"mother" + 0.007*"book" + 0.006*"word" + 0.005*"little" + 0.005*"long" + 0.005*"anne" + 0.004*"father"
2017-10-30 17:40:00,746 : INFO : topic #398 (0.001): 0.017*"shall" + 0.012*"come" + 0.010*"man" + 0.009*"father" + 0.009*"unto" + 0.009*"like" + 0.008*"thee" + 0.008*"know" + 0.008*"think" + 0.007*"world"
2017-10-30 17:40:00,747 : INFO : topic #264 (0.001): 0.015*"father" + 0.014*"unto" + 0.010*"girl" + 0.008*"went" + 0.008*"let" + 0.008*"away" + 0.008*"came" + 0.008*"jesus" + 0.007*"tarzan" + 0.006*"shall"
2017-10-30 17:40:00,748 : INFO : topic #134 (0.001): 0.016*"shall" + 0.014*"things" + 0.012*"unto" + 0.009*"know" + 0.008*"come" + 0.007*"let" + 0.007*"thy" + 0.007*"man" + 0.006*"hath" + 0.006*"alma"
2017-10-30 17:40:00,749 : INFO : topic #917 (0.001): 0.070*"sir" + 0.018*"gareth" + 0.013*"knight" + 0.012*"smote" + 0.011*"encountered" + 0.010*"came" + 0.009*"spear" + 0.009*"king" + 0.009*"lord" + 0.008*"unto"
2017-10-30 17:40:01,201 : INFO : topic diff=inf, rho=0.333333
2017-10-30 17:40:01,292 : INFO : PROGRESS: pass 0, dispatched chunk #9 = documents up to #20000/3443509, outstanding queue size 6
2017-10-30 17:40:33,389 : INFO : PROGRESS: pass 0, dispatched chunk #10 = documents up to #22000/3443509, outstanding queue size 6
2017-10-30 17:41:06,384 : INFO : merging changes from 4000 documents into a model of 3443509 documents
2017-10-30 17:41:17,228 : INFO : topic #35 (0.001): 0.016*"shall" + 0.009*"lord" + 0.008*"came" + 0.007*"alice" + 0.007*"anne" + 0.007*"went" + 0.007*"thee" + 0.007*"hath" + 0.006*"know" + 0.005*"little"
2017-10-30 17:41:17,230 : INFO : topic #802 (0.001): 0.027*"unto" + 0.017*"lord" + 0.017*"god" + 0.012*"shall" + 0.011*"men" + 0.010*"came" + 0.008*"hand" + 0.007*"jacob" + 0.006*"david" + 0.006*"man"
2017-10-30 17:41:17,232 : INFO : topic #713 (0.001): 0.013*"thy" + 0.008*"shall" + 0.006*"power" + 0.006*"people" + 0.005*"time" + 0.005*"lord" + 0.005*"ether" + 0.005*"thou" + 0.005*"account" + 0.004*"unto"
2017-10-30 17:41:17,234 : INFO : topic #958 (0.001): 0.013*"companions" + 0.010*"dog" + 0.009*"ape" + 0.009*"men" + 0.009*"man" + 0.008*"ancestors" + 0.008*"great" + 0.008*"preparations" + 0.008*"traveling" + 0.008*"selfish"
2017-10-30 17:41:17,235 : INFO : topic #819 (0.001): 0.012*"tom" + 0.011*"jim" + 0.009*"let" + 0.009*"cor" + 0.008*"warmed" + 0.008*"life" + 0.007*"time" + 0.006*"got" + 0.006*"come" + 0.006*"says"
2017-10-30 17:41:19,478 : INFO : topic diff=inf, rho=0.200000
2017-10-30 17:41:19,555 : INFO : PROGRESS: pass 0, dispatched chunk #11 = documents up to #24000/3443509, outstanding queue size 6
2017-10-30 17:41:21,343 : INFO : PROGRESS: pass 0, dispatched chunk #12 = documents up to #26000/3443509, outstanding queue size 6
2017-10-30 17:41:53,421 : INFO : merging changes from 4000 documents into a model of 3443509 documents
2017-10-30 17:42:04,382 : INFO : topic #395 (0.001): 0.040*"shall" + 0.012*"unto" + 0.010*"man" + 0.007*"days" + 0.007*"king" + 0.006*"came" + 0.006*"vision" + 0.006*"lord" + 0.005*"thee" + 0.005*"hand"
2017-10-30 17:42:04,383 : INFO : topic #965 (0.001): 0.013*"unto" + 0.012*"shall" + 0.012*"god" + 0.011*"man" + 0.009*"alma" + 0.008*"came" + 0.007*"went" + 0.006*"great" + 0.006*"mosiah" + 0.006*"yea"
2017-10-30 17:42:04,384 : INFO : topic #234 (0.001): 0.028*"shall" + 0.011*"lord" + 0.009*"thy" + 0.009*"unto" + 0.008*"god" + 0.007*"nephi" + 0.007*"day" + 0.006*"hath" + 0.006*"behold" + 0.006*"know"
2017-10-30 17:42:04,385 : INFO : topic #856 (0.001): 0.046*"shall" + 0.008*"president" + 0.008*"lord" + 0.007*"king" + 0.007*"day" + 0.006*"priest" + 0.005*"like" + 0.005*"chuse" + 0.005*"man" + 0.005*"let"
2017-10-30 17:42:04,386 : INFO : topic #440 (0.001): 0.007*"heaven" + 0.005*"great" + 0.004*"far" + 0.004*"time" + 0.004*"like" + 0.004*"old" + 0.004*"place" + 0.003*"feet" + 0.003*"looked" + 0.003*"deep"
2017-10-30 17:42:06,639 : INFO : topic diff=inf, rho=0.142857
2017-10-30 17:42:06,716 : INFO : PROGRESS: pass 0, dispatched chunk #13 = documents up to #28000/3443509, outstanding queue size 6
2017-10-30 17:42:08,490 : INFO : PROGRESS: pass 0, dispatched chunk #14 = documents up to #30000/3443509, outstanding queue size 6
2017-10-30 17:42:40,719 : INFO : merging changes from 4000 documents into a model of 3443509 documents
2017-10-30 17:42:53,718 : INFO : topic #466 (0.001): 0.008*"half" + 0.007*"state" + 0.007*"man" + 0.006*"came" + 0.006*"like" + 0.006*"apology" + 0.006*"lord" + 0.006*"thousand" + 0.006*"owes" + 0.005*"place"
2017-10-30 17:42:53,719 : INFO : topic #12 (0.001): 0.014*"lord" + 0.007*"unto" + 0.007*"people" + 0.007*"know" + 0.006*"day" + 0.005*"jacob" + 0.005*"time" + 0.005*"shall" + 0.005*"like" + 0.005*"god"
2017-10-30 17:42:53,721 : INFO : topic #460 (0.001): 0.029*"cook" + 0.017*"thee" + 0.013*"come" + 0.013*"blood" + 0.012*"place" + 0.012*"bid" + 0.011*"god" + 0.010*"emma" + 0.009*"shall" + 0.009*"things"
2017-10-30 17:42:53,722 : INFO : topic #6 (0.001): 0.095*"captain" + 0.038*"shall" + 0.015*"camp" + 0.012*"house" + 0.012*"pitch" + 0.012*"children" + 0.010*"man" + 0.010*"sanctuary" + 0.009*"lord" + 0.008*"unto"
2017-10-30 17:42:53,723 : INFO : topic #165 (0.001): 0.014*"shall" + 0.008*"day" + 0.008*"like" + 0.007*"man" + 0.007*"lord" + 0.006*"hath" + 0.006*"great" + 0.006*"come" + 0.005*"know" + 0.005*"way"
2017-10-30 17:42:54,143 : INFO : topic diff=inf, rho=0.111111
2017-10-30 17:42:54,228 : INFO : PROGRESS: pass 0, dispatched chunk #15 = documents up to #32000/3443509, outstanding queue size 6
2017-10-30 17:42:55,937 : INFO : PROGRESS: pass 0, dispatched chunk #16 = documents up to #34000/3443509, outstanding queue size 6
2017-10-30 17:43:27,559 : INFO : merging changes from 4000 documents into a model of 3443509 documents
2017-10-30 17:43:38,648 : INFO : topic #773 (0.001): 0.009*"god" + 0.007*"suburbs" + 0.006*"unto" + 0.006*"little" + 0.005*"lord" + 0.005*"man" + 0.005*"thy" + 0.005*"like" + 0.005*"shall" + 0.004*"know"
2017-10-30 17:43:38,649 : INFO : topic #834 (0.001): 0.018*"party" + 0.012*"man" + 0.010*"person" + 0.010*"property" + 0.009*"common" + 0.009*"point" + 0.009*"officer" + 0.008*"duty" + 0.008*"object" + 0.008*"case"
2017-10-30 17:43:38,650 : INFO : topic #124 (0.001): 0.010*"like" + 0.008*"got" + 0.007*"went" + 0.007*"mother" + 0.006*"come" + 0.006*"room" + 0.005*"mrs" + 0.005*"night" + 0.005*"long" + 0.005*"came"
2017-10-30 17:43:38,651 : INFO : topic #16 (0.001): 0.019*"king" + 0.010*"lord" + 0.010*"shall" + 0.008*"unto" + 0.008*"come" + 0.007*"man" + 0.007*"men" + 0.007*"let" + 0.006*"know" + 0.005*"thy"
2017-10-30 17:43:38,652 : INFO : topic #463 (0.001): 0.013*"lord" + 0.011*"house" + 0.009*"chest" + 0.008*"man" + 0.007*"otter" + 0.007*"went" + 0.006*"came" + 0.006*"money" + 0.005*"day" + 0.005*"shall"
2017-10-30 17:43:41,012 : INFO : topic diff=inf, rho=0.090909

TC-Rudel commented:

Based on @ocsponge's post, I tried modifying alpha and eta. I have found that in my case the problem goes away if I set eta=0.01 but persists if I set eta=0.001. With 2000 topics, default alpha, and eta=0.01, my topics were converging fine.
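
In code, the configuration that converged would look roughly like this (a sketch with a toy stand-in corpus; the real run used the 3.4M-document Gutenberg corpus from the attached repro script):

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Toy stand-in corpus, just to make the snippet self-contained.
texts = [["whale", "sea"], ["ship", "captain", "sea"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# eta=0.01 reportedly converged with 2000 topics; eta=0.001 still gave topic diff=inf.
lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2000, eta=0.01)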

menshikh-iv (Contributor) commented:

Thank you very much @TC-Rudel for the additional information; now this problem can be reproduced.

menshikh-iv removed the need info label on Oct 31, 2017

stevemarin commented:

Are there any updates on this issue?

menshikh-iv (Contributor) commented:

@stevemarin not yet


johann-petrak commented Nov 10, 2018

Same here, I am getting "topic diff=inf" in the log after the second merge (running multicore). The topic diff is 25.4 after the first merge.

What does "topic diff=inf" actually mean, and what are the potential causes? It would be good to understand this better in order to come up with strategies for avoiding it. Previous comments mentioned changing the number of topics, eta, or maybe alpha or the number of iterations, but I do not understand how those settings are related to the topic diff. Could the vocabulary size have an influence?

menshikh-iv (Contributor) commented:

What does "topic diff=inf" actually mean and what are potential causes?

This means that an overflow happens somewhere (typically a division by an "almost-zero" value) -> the model breaks (produces infs/NaNs). Related issue: #2115
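
A toy float32 demonstration of that failure mode (illustrative only, not gensim code):

import numpy as np

tiny = np.float32(1e-45)         # rounds to the smallest float32 subnormal, ~1.4e-45
print(np.float32(1.0) / tiny)    # inf: dividing by an almost-zero value overflows
print(np.log(np.float32(0.0)))   # -inf, with "RuntimeWarning: divide by zero encountered in log"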

menshikh-iv (Contributor) commented:

@johann-petrak we applied a "workaround" for this; see #2308, hope that helps


horpto commented Jan 17, 2019

@menshikh-iv I don't think that "workaround" will solve this problem. I've had the same problem even after my patch. I can try to explore this a bit later.

menshikh-iv (Contributor) commented:

@horpto I still hope that #2308 at least reduces the number of "overflow-related" errors.

I can try to explore this a bit later.

Sounds pretty useful and nice; please go ahead when you have time!


horpto commented Jan 18, 2019

This issue is caused by the width of the dtype. First of all, I got a warning on diff = np.log(self.expElogbeta) in the second M-step: RuntimeWarning: divide by zero encountered in log. That's why infs appear in the output (self.expElogbeta contained zeros). get_Elogbeta() after the first blend returned something like this:

[[ -11.186146   -13.545639   -11.4461155 ... -112.541405  -112.541405
  -112.541405 ]
 [ -11.8831415  -11.548369    -9.9233265 ... -112.556595  -112.556595
  -112.556595 ]
 [ -11.561755   -10.991329   -11.953122  ... -112.475     -112.475
  -112.475    ]
 ...
 [ -11.4945545  -11.350912    -9.209938  ... -112.61384   -112.61384
  -112.61384  ]
 [ -11.081068   -12.508811   -10.777531  ... -112.40563   -112.40563
  -112.40563  ]
 [ -11.711579   -13.1611     -13.570866  ... -112.315475  -112.315475
  -112.315475 ]]

It's not obvious at first glance (of course, everyone thinks that log(exp(x)) == x), but there is a surprise:

>>> np.exp(-123)
3.817497188671175e-54
>>> np.exp(-123, dtype=np.float32)
0.0

The default dtype of LDA is np.float32. After I changed it to np.float64, the problem disappeared.
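
For anyone wanting to apply the same fix: LdaModel (and LdaMulticore) accept a dtype argument in gensim 3.x+, so the change is a single keyword (toy corpus below just for self-containment):

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["graph", "trees"], ["human", "computer"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# float64 keeps np.exp(Elogbeta) from underflowing to exact zero for very
# negative log-probabilities, at the cost of roughly double the RAM.
lda = LdaModel(corpus, id2word=dictionary, num_topics=10, dtype=np.float64)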


piskvorky commented Jan 18, 2019

@horpto do you see a way to use float64 precision only where needed (internal calculations), but keep the big parameter matrices in float32 (less RAM)?

IIRC the only reason for the float32 default was to save memory.


horpto commented Jan 18, 2019

@piskvorky I guess we can change the diff = np.log(self.expElogbeta) line to diff = self.state.get_Elogbeta(), due to the invariant self.expElogbeta == np.exp(self.state.get_Elogbeta()), but that does not solve the underlying problem of zeros appearing in self.expElogbeta where small values should be.
I can open a PR if this suggestion is good enough and you agree with it.
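
A toy version of the exp/log round trip in question (illustrative):

import numpy as np

Elogbeta = np.float32(-123.0)    # a legitimate, very negative log-probability
expElogbeta = np.exp(Elogbeta)   # 0.0: underflows in float32
print(np.log(expElogbeta))       # -inf (divide-by-zero warning) -> topic diff=inf
# Using Elogbeta directly skips the round trip and keeps the finite value.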

gauravkoradiya commented:

Can you suggest what the value of topic_diff should be in general?

gauravkoradiya commented:

@horpto do you see a way to use float64 precision only where needed (internal calculations), but keep the big parameter matrices in float32 (less RAM)?

IIRC the only reason for the float32 default was to save memory.

Then is there any issue with np.float16? What happens when I change to np.float16? I got the same thing as with np.float32.
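
For what it's worth, np.float16 would make the underflow strictly worse, not better: its smallest positive value is about 6e-8, so np.exp hits exact zero far sooner than in float32:

import numpy as np

print(np.exp(np.float16(-10.0)))  # ~4.5e-5, still representable in float16
print(np.exp(np.float16(-18.0)))  # 0.0: float16 already underflows here
print(np.exp(np.float32(-18.0)))  # ~1.5e-8, still fine in float32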
