[WIP] Add topic coherence pipeline to gensim #710

devashishd12 · 2016-05-27T17:08:26Z

Addresses #278.
@tmylk @piskvorky is this alright?

piskvorky · 2016-05-28T02:20:41Z

What is "segmentation"?

Can you add some background info / motivation?

devashishd12 · 2016-05-29T09:01:45Z

@piskvorky sorry I forgot to tag the issue before. I've tagged it now.

piskvorky · 2016-05-29T13:20:34Z

Ah, nice! :-)

@tmylk please review.

devashishd12 · 2016-05-29T18:40:25Z

@piskvorky @tmylk I've added the API I have in mind for the topic coherence pipeline. So far I've implemented only the U_mass topic coherence with it, and that still needs a lot of work. Sorry for not adding tests yet; I just wanted to show you the kind of API I was thinking of. Is this API fine?

piskvorky · 2016-05-30T01:15:09Z

Looks ok to me conceptually (but have a look at LdaModel.get_topic_terms() method).

piskvorky · 2016-05-30T01:17:57Z

gensim/confirmation_measure.py

+
+EPSILON = 1e-12 # Should be small. Value as suggested in paper.
+
+def Log_Conditional_Probability(segmented_topics, per_topic_probability):


PEP8: function names are lowercase.
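A minimal sketch of what a lowercase-named version of this confirmation measure might look like, assuming the measure is log((P(w', w*) + EPSILON) / P(w*)) as described in the paper linked later in this thread. The `doc_sets` and `num_docs` parameters are hypothetical inputs chosen for illustration; they are not the `per_topic_probability` argument from the diff, and every w* is assumed to occur in at least one document.

```python
import math

EPSILON = 1e-12  # small smoothing constant, same value as in the diff under review


def log_conditional_probability(segmented_topics, doc_sets, num_docs):
    """Return log((P(w', w*) + EPSILON) / P(w*)) for every (w', w*) pair.

    segmented_topics: list of topics, each a list of (w_prime, w_star) pairs.
    doc_sets: dict mapping a word to the set of document ids containing it
              (hypothetical input format, assumed for this sketch).
    num_docs: total number of documents in the corpus.
    """
    scores = []
    for topic in segmented_topics:
        for w_prime, w_star in topic:
            # Probabilities are estimated as fractions of documents.
            p_star = len(doc_sets[w_star]) / num_docs
            p_joint = len(doc_sets[w_prime] & doc_sets[w_star]) / num_docs
            scores.append(math.log((p_joint + EPSILON) / p_star))
    return scores
```

EPSILON keeps the logarithm defined when two words never co-occur.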

devashishd12 · 2016-05-31T19:08:26Z

I've added initial tests for the segmentation module. I'm still thinking about what kinds of input the segmentation module should accept (that is, which LDA outputs users might feed into the pipeline).

devashishd12 · 2016-05-31T19:39:38Z

gensim/segmentation.py

+
+logger = logging.getLogger(__name__)
+
+def s_one_pre(topics):


Add a model argument here. Handle it inside
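For readers unfamiliar with the term, "one-pre" segmentation can be sketched as follows, assuming the definition from the paper linked later in this thread: each top word of a topic is paired with every word that precedes it in the ranking. This is an illustrative sketch only; it takes plain lists of words and does not include the model argument mentioned above.

```python
def s_one_pre(topics):
    """Segment each topic into (w_i, w_j) pairs where w_j is ranked before w_i.

    topics: list of topics, each a list of top words ordered by rank.
    Returns one list of pairs per topic.
    """
    segmented = []
    for top_words in topics:
        pairs = []
        for i in range(1, len(top_words)):
            for j in range(i):
                # Pair the word at rank i with each earlier-ranked word.
                pairs.append((top_words[i], top_words[j]))
        segmented.append(pairs)
    return segmented
```

For a topic `["a", "b", "c"]` this yields the pairs `("b", "a")`, `("c", "a")` and `("c", "b")`.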

devashishd12 · 2016-06-07T16:05:31Z

docs/notebooks/u_mass_tutorial.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Demonstration for the `u_mass` topic coherence using topic coherence pipeline"


@tmylk I've added an example notebook for u_mass topic coherence. Is this alright?

devashishd12 · 2016-06-09T21:24:38Z

@tmylk @piskvorky I have updated this PR with the latest version. I have also added a U_mass tutorial notebook. Could you please have a look at this intermediate version briefly?

devashishd12 · 2016-06-14T19:12:32Z

Added an initial draft of the c_v topic coherence (the best one from the paper). I will now check it for mathematical correctness and look at other optimizations.

tmylk · 2016-06-15T16:11:53Z

gensim/models/coherencemodel.py

+            self.conf = direct_confirmation_measure.log_conditional_probability
+            self.aggr = aggregation.arithmetic_mean
+
+        elif self.coherence == 'c_v':


'c_v' should be a constant to avoid duplication

devashishd12 · 2016-06-16

Sorry, can you please elaborate on this a bit more?
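One possible reading of the suggestion, as a hedged sketch: define each measure name once at module level and compare against the constant, so the string literal is not repeated across the code. The names `U_MASS`, `C_V`, `SUPPORTED_MEASURES` and `describe_measure` are hypothetical and are not taken from the PR.

```python
# Hypothetical module-level constants; the names are illustrative only.
U_MASS = "u_mass"
C_V = "c_v"
SUPPORTED_MEASURES = (U_MASS, C_V)


def describe_measure(coherence):
    """Dispatch on the constants instead of repeating string literals."""
    if coherence == U_MASS:
        return "selected measure: " + U_MASS
    elif coherence == C_V:
        return "selected measure: " + C_V
    raise ValueError(
        "unknown coherence %r, expected one of %r" % (coherence, SUPPORTED_MEASURES)
    )
```

A typo in a constant name fails immediately with a `NameError`, whereas a typo in a repeated string literal silently falls through to the wrong branch.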

devashishd12 · 2016-06-15T21:38:14Z

I've added c_v coherence to the existing notebook.

tmylk · 2016-06-16T02:36:01Z

Thanks! Why doesn't pyldavis show on github?

devashishd12 · 2016-06-16T16:19:05Z

Added tests for checking the mathematical correctness of the direct_confirmation_measure module.

devashishd12 · 2016-06-16T18:38:13Z

Added a mathematical correctness test for the indirect_confirmation_measure module.

devashishd12 · 2016-06-17T15:02:02Z

I've modified the ipython notebook a bit. Hopefully it's more human-interpretable now 😄

devashishd12 · 2016-06-17T20:58:46Z

Added introduction to the notebook. @tmylk can you please review?

tmylk · 2016-06-21T22:53:12Z

@dsquareindia Could you move the new files you created from gensim/*.py to gensim/topic_coherence/*.py? These modules are only needed for topic coherence and shouldn't be in root.

devashishd12 · 2016-06-22T05:34:12Z

@tmylk done. Could you please check topic_coherence/__init__.py? Not quite sure about that.

devashishd12 · 2016-06-22T16:57:33Z

@tmylk tests pass now. Just used HashDictionary instead of Dictionary.

devashishd12 · 2016-06-22T20:00:22Z

@tmylk I have added support for LdaVowpalWabbit wrapper. You can check out an example here

guntherzhao · 2018-09-24T16:28:08Z

I am trying to use the Umass measure to pick the best number of topics, but I do not know exactly what Umass means. Is a bigger coherence score better, or is it the opposite? Below is the output of my test with the Umass measure. How many topics should I pick?

menshikh-iv · 2018-09-28T07:15:24Z

@guntherzhao simple rule: more - better, here - 7 topics.

Our coherence is a port of https://github.com/dice-group/Palmetto ; you can read more about it on the Palmetto wiki page or at http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

In future, please use the mailing list for questions (GitHub is only for feature requests and bug reports).

guntherzhao · 2018-09-28T07:48:12Z

@menshikh-iv Thanks for your help!

dave-jacques · 2023-02-08T12:04:39Z

Hi, sorry to resurrect this post but my question relates directly to one of the answers.

@menshikh-iv, you said above:

simple rule: more - better, here - 7 topics.

which I read as "for Umass, the more negative value the better."

Looking at the example graph, 5 topics has the most negative Umass coherence score which I therefore take to be best.

However your answer states 7 topics (which just so happens to be the least negative value).

I'm hoping you could clarify this point, especially as there seems to be a lot of confusion on this topic out there in the world (e.g. here)

Thanks!
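To make the sign question concrete, here is a toy sketch. It assumes that u_mass is the arithmetic mean of log((P(w_i, w_j) + EPSILON) / P(w_j)) over one-pre word pairs, with probabilities estimated from document co-occurrence. It is a standalone illustration on a made-up corpus, not gensim's implementation; the `u_mass` helper and the example documents are hypothetical.

```python
import math

EPSILON = 1e-12  # smoothing constant so that log(0) never occurs


def u_mass(top_words, documents):
    """Mean of log((P(w_i, w_j) + EPSILON) / P(w_j)) over pairs with j < i."""
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = [
        math.log((prob(top_words[i], top_words[j]) + EPSILON) / prob(top_words[j]))
        for i in range(1, len(top_words))
        for j in range(i)
    ]
    return sum(scores) / len(scores)


# Made-up corpus: three documents about pets, two about finance.
documents = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["dog", "pet"],
    ["stock", "market"],
    ["stock", "bank"],
]

coherent = u_mass(["dog", "cat", "pet"], documents)   # words that co-occur
mixed = u_mass(["dog", "stock", "bank"], documents)   # words that rarely co-occur

print(coherent, mixed)
```

Both scores come out negative, and the topic whose words actually co-occur gets the score closer to zero. Under these assumptions "more is better" means the least negative value, which matches the 7-topic answer above.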

devashishd12 added 2 commits May 27, 2016 13:05

Created segmentation module. Added S_One_Pre segmentation.

9eb98cc

Made minor changes.

9245a4d

Returns a list now.

cfda773

piskvorky assigned tmylk May 29, 2016

piskvorky added the feature label May 29, 2016

devashishd12 changed the title ~~Created segmentation module. Added S_One_Pre segmentation.~~ [WIP] Add topic coherence pipeline to gensim May 29, 2016

Added example topic coherence API.

f4445e4

piskvorky reviewed May 30, 2016

Addressed pep8 comments, added test for segmentation.

5913d39

devashishd12 reviewed May 31, 2016

Added boolean sliding window, tests for prob estimation

ab85652

devashishd12 reviewed Jun 7, 2016

Added u_mass tutorial notebook, log ratio confirmation measure.

7a04893

devashishd12 force-pushed the topic_coherence branch from 8749621 to 7a04893 Compare June 9, 2016 21:22

[ci skip]added initial draft of c_v

ddfaf09

tmylk reviewed Jun 15, 2016

[ci skip]Added c_v tutorial to notebook.

b45c00e

devashishd12 force-pushed the topic_coherence branch from 213fcff to b45c00e Compare June 15, 2016 21:37

added coherence model documentation.

62cd0d7

Added tests for checking direct confirmation measures.

d2c4a19

Added test for indirect_confirmation_measure

a4b2629

Made changes to ipython notebook.

b41e01b

Added introduction to notebook.

7c0c495

Added documentation for indirect confirmation module.

c296624

Made topic coherence package.

f7b9d7b

Use HashDictionary to pass tests

738235d

devashishd12 force-pushed the topic_coherence branch from 4cbd3ae to 738235d Compare June 22, 2016 16:29

Added support for LdaVowpalWabbit

280375f

tmylk merged commit 6151747 into piskvorky:develop Jun 22, 2016

devashishd12 deleted the topic_coherence branch June 23, 2016 15:09

This was referenced Oct 3, 2016

[MRG] Topic Coherence #750

Merged

[MRG] Topic coherence update 3 #793

Closed

guntherzhao unassigned tmylk Sep 24, 2018

menshikh-iv mentioned this pull request Sep 28, 2018

How to pick the number of topics with Umass measure? #2199

Closed

