Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add topic coherence pipeline to gensim #710

Merged
merged 18 commits into from
Jun 22, 2016

Conversation

devashishd12
Copy link
Contributor

@devashishd12 devashishd12 commented May 27, 2016

Addresses #278.
@tmylk @piskvorky is this alright?

@piskvorky
Copy link
Owner

What is "segmentation"?

Can you add some background info / motivation?

@devashishd12
Copy link
Contributor Author

@piskvorky sorry I forgot to tag the issue before. I've tagged it now.

@piskvorky
Copy link
Owner

piskvorky commented May 29, 2016

Ah, nice! :-)

@tmylk please review.

@piskvorky piskvorky added the feature Issue described a new feature label May 29, 2016
@devashishd12 devashishd12 changed the title Created segmentation module. Added S_One_Pre segmentation. [WIP] Add topic coherence pipeline to gensim May 29, 2016
@devashishd12
Copy link
Contributor Author

@piskvorky @tmylk I've added the API I was thinking of for adding the topic coherence pipeline. I've currently implemented only the U_mass topic coherence (still have to work a lot on it though) using this pipeline. Sorry for not adding the tests yet. Just wanted to show you what kind of an API I was thinking of. Is this API fine?

@piskvorky
Copy link
Owner

Looks ok to me conceptually (but have a look at LdaModel.get_topic_terms() method).


EPSILON = 1e-12 # Should be small. Value as suggested in paper.

def Log_Conditional_Probability(segmented_topics, per_topic_probability):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: function names are lowercase.

@devashishd12
Copy link
Contributor Author

I've added initial tests for the segmentation module. Although I'm still thinking about what kind of inputs can I get in the segmentation module (or what different outputs can the users use from LDA to input into the pipeline).


logger = logging.getLogger(__name__)

def s_one_pre(topics):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a model argument here. Handle it inside

"cell_type": "markdown",
"metadata": {},
"source": [
"## Demonstration for the `u_mass` topic coherence using topic coherence pipeline"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk I've added an example notebook for u_mass topic coherence. Is this alright?

@devashishd12
Copy link
Contributor Author

@tmylk @piskvorky I have updated this PR with the latest version. I have also added a U_mass tutorial notebook. Could you please have a look at this intermediate version briefly?

@devashishd12
Copy link
Contributor Author

devashishd12 commented Jun 14, 2016

Added initial draft of c_v topic coherence (the best one from the paper). Will now check for mathematical correctness and other optimizations

self.conf = direct_confirmation_measure.log_conditional_probability
self.aggr = aggregation.arithmetic_mean

elif self.coherence == 'c_v':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'c_v' should be a constant to avoid duplication

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry can you please elaborate on this a bit more?

@devashishd12
Copy link
Contributor Author

I've added c_v coherence to the existing notebook.

@tmylk
Copy link
Contributor

tmylk commented Jun 16, 2016

Thanks! Why doesn't pyldavis show on github?

@devashishd12
Copy link
Contributor Author

Added tests for checking mathematical correctness of the direct_confirmation_measures module.

@devashishd12
Copy link
Contributor Author

Added mathematical correctness test for indirect_confirmation_measure module.

@devashishd12
Copy link
Contributor Author

I've modified the ipython notebook a bit. Hopefully it's more human-interpretable now 😄

@devashishd12
Copy link
Contributor Author

Added introduction to the notebook. @tmylk can you please review?

@tmylk
Copy link
Contributor

tmylk commented Jun 21, 2016

@dsquareindia Could you move the new files you created from gensim/.py to gensim/topic_coherence/.py ? These modules are only needed for topic coherence and shouldn't be in root.

@devashishd12
Copy link
Contributor Author

@tmylk done. Could you please check topic_coherence/__init__.py? Not quite sure about that.

@devashishd12
Copy link
Contributor Author

@tmylk tests pass now. Just used HashDictionary instead of Dictionary.

@devashishd12
Copy link
Contributor Author

@tmylk I have added support for LdaVowpalWabbit wrapper. You can check out an example here

@tmylk tmylk merged commit 6151747 into piskvorky:develop Jun 22, 2016
@devashishd12 devashishd12 deleted the topic_coherence branch June 23, 2016 15:09
This was referenced Oct 3, 2016
@guntherzhao
Copy link

I am trying to use Umass measure to pick the best number of topics, but I do not know what Umass exactly means? About the coherence score, is it the bigger, the better, or just the opposite? Below is the output of my test with Umass measure. How many topics should I pick?
image

@menshikh-iv
Copy link
Contributor

@guntherzhao simple rule: more - better, here - 7 topics.

Our coherence is port of https://github.com/dice-group/Palmetto , you can read more about it on Palmetto wiki page or http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

For future - please use mailing list for questions (GitHub only for feature requests & bug reports)

@guntherzhao
Copy link

@menshikh-iv Thanks for your help!

@VictorSuarezL

This comment was marked as spam.

@ZeviAltus

This comment was marked as spam.

@dave-jacques
Copy link

dave-jacques commented Feb 8, 2023

Hi, sorry to resurrect this post but my question relates directly to one of the answers.

@menshikh-iv, you said above:

simple rule: more - better, here - 7 topics.

which I read as "for Umass, the more negative value the better."

Looking at the example graph, 5 topics has the most negative Umass coherence score which I therefore take to be best.

However your answer states 7 topics (which just so happens to be the least negative value).

I'm hoping you could clarify this point, especially as there seems to be a lot of confusion on this topic out there in the world (e.g. here)

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Issue described a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants