Redundant get_Elogbeta calls in LdaModel #2051

Closed
darindf opened this issue May 15, 2018 · 3 comments
Labels
difficulty medium - Medium issue: required good gensim understanding & python skills
performance - Issue related to performance (in HW meaning)

Comments

darindf (Contributor) commented May 15, 2018

In reviewing performance profiling, I'm finding that get_Elogbeta is taking a noticeable amount of time.

I notice there are two calls to this function, one right after the other, in LdaModel.do_mstep.

The first call is obvious; in LdaModel.do_mstep we have:

logger.debug("updating topics")
# update self with the new blend; also keep track of how much did
# the topics change through this update, to assess convergence
diff = np.log(self.expElogbeta)
self.state.blend(rho, other)
diff -= self.state.get_Elogbeta()
self.sync_state()

But do you see the second call? In sync_state we have:

def sync_state(self):
    self.expElogbeta = np.exp(self.state.get_Elogbeta())
    assert self.expElogbeta.dtype == self.dtype

The question then becomes: does get_Elogbeta perform some side effect, such that the second call would return a different result from the first?

Looking at get_Elogbeta:

def get_Elogbeta(self):
    return dirichlet_expectation(self.get_lambda())

This function doesn't modify state directly.

Looking at the Python implementation of dirichlet_expectation (easier to read than the Cython version):

def dirichlet_expectation(alpha):
    """Expected value of log(theta) where theta is drawn from a Dirichlet distribution.

    Parameters
    ----------
    alpha : numpy.ndarray
        Dirichlet parameter 2d matrix or 1d vector, if 2d - each row is treated as a separate parameter vector.

    Returns
    -------
    numpy.ndarray
        Log of expected values, dimension same as `alpha.ndim`.

    """
    if len(alpha.shape) == 1:
        result = psi(alpha) - psi(np.sum(alpha))
    else:
        result = psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]
    return result.astype(alpha.dtype, copy=False)  # keep the same precision as input

This doesn't modify the input alpha, and the function psi is from scipy.special.
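
As a quick sanity check, get_Elogbeta can be called twice on the same state and the results compared; it just applies the standard Dirichlet identity E[log theta_k] = psi(alpha_k) - psi(sum_j alpha_j) to each row of lambda, so two back-to-back calls should agree exactly. The toy corpus below is purely illustrative, not the real data.

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tiny toy corpus, for illustration only.
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "minors", "trees"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=1)

# If get_Elogbeta has no side effects, consecutive calls must be identical.
first = lda.state.get_Elogbeta()
second = lda.state.get_Elogbeta()
assert np.array_equal(first, second)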

Thus it appears that the second get_Elogbeta call duplicates the work of the first and can be eliminated.

Is this analysis correct?

Here is a profile chart showing the timing.

[profile chart image]

This is taken from the overall LdaModel init & update, where get_Elogbeta is the two top boxes on the far right, from running LDA over a 1.3M-document corpus.

[profile chart image]
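
For anyone who wants to reproduce this kind of measurement, a profile like the one above can be captured with cProfile; the sketch below reuses the toy corpus and dictionary from the earlier snippet rather than the real 1.3M-document corpus, so the absolute numbers will differ.

import cProfile
import pstats

from gensim.models import LdaModel

# Profile a full LdaModel training run (toy data, illustrative only).
profiler = cProfile.Profile()
profiler.enable()
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=16, passes=1)
profiler.disable()

# Restrict the printout to frames whose name matches get_Elogbeta.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("get_Elogbeta")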

arlenk (Contributor) commented Jun 19, 2018

From a quick look, I think you are right.

Out of curiosity, what version of gensim are you using, and what parameters are you passing to LdaModel? I ask because in my use cases get_Elogbeta takes about 5% of the total run time, which isn't negligible, but it seems to be taking a much larger share of the total run time for you (if I am reading your profile chart correctly).

darindf (Contributor, Author) commented Jun 19, 2018

I forget which version of gensim I was using when I did the profiling, but it was a 3.x release. I'm currently using 3.4. I believe the cost will depend on the number of topics and the number of terms (vocabulary size); this run was over my test case of 1.3 million documents with 55k unique terms and 16 topics, so each get_Elogbeta call evaluates psi over a 16 x 55,000 lambda matrix (roughly 880,000 entries).

Here is a patch that monkey-patches the LdaModel object. I have verified that self.state.get_Elogbeta() is stateless, so the second call in the original code returns the same value as the first. With this patch, the corpus needs to be passed via update() rather than the constructor.

import logging
import types

import numpy as np

from gensim.models import LdaModel

logger = logging.getLogger(__name__)  # the patched do_mstep logs through this

lda = LdaModel(id2word=master_dictionary, num_topics=num_of_topics, passes=1)

def sync_state(self, elogbeta=None):
    # change start
    if elogbeta is None:
        elogbeta = self.state.get_Elogbeta()
    # change end
    self.expElogbeta = np.exp(elogbeta)
    assert self.expElogbeta.dtype == self.dtype

def do_mstep(self, rho, other, extra_pass=False):
    """
    M step: use linear interpolation between the existing topics and
    collected sufficient statistics in `other` to update the topics.

    """
    logger.debug("updating topics")
    # update self with the new blend; also keep track of how much did
    # the topics change through this update, to assess convergence
    diff = np.log(self.expElogbeta)
    self.state.blend(rho, other)
    elogbeta = self.state.get_Elogbeta()
    # change start: reuse the freshly computed Elogbeta instead of recomputing it
    diff -= elogbeta
    self.sync_state(elogbeta)
    # change end

    # print out some debug info at the end of each EM iteration
    self.print_topics(5)
    logger.info("topic diff=%f, rho=%f",
                np.mean(np.abs(diff)), rho)

    if self.optimize_eta:
        self.update_eta(self.state.get_lambda(), rho)

    if not extra_pass:
        # only update if this isn't an additional pass
        self.num_updates += other.numdocs

lda.sync_state = types.MethodType(sync_state, lda)
lda.do_mstep = types.MethodType(do_mstep, lda)
lda.update(corpus=corpus)
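
To double-check that the patch doesn't change the learned model, one hypothetical sanity check (not from the original report) is to train a patched and an unpatched model with the same random_state on the same corpus and compare the topic-word matrices; make_patched below is a helper introduced only for this sketch.

import types

import numpy as np
from gensim.models import LdaModel

def make_patched(model):
    # Apply the monkey-patched methods defined above to a fresh model.
    model.sync_state = types.MethodType(sync_state, model)
    model.do_mstep = types.MethodType(do_mstep, model)
    return model

baseline = LdaModel(id2word=master_dictionary, num_topics=num_of_topics,
                    passes=1, random_state=0)
patched = make_patched(LdaModel(id2word=master_dictionary,
                                num_topics=num_of_topics,
                                passes=1, random_state=0))

baseline.update(corpus=corpus)
patched.update(corpus=corpus)

# With the same seed, data and parameters, both models should learn the
# same topic-word distributions.
assert np.allclose(baseline.get_topics(), patched.get_topics())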

piskvorky added the "performance" label on Jun 19, 2018
menshikh-iv added the "difficulty medium" label on Jun 22, 2018
menshikh-iv pushed a commit that referenced this issue Jan 28, 2019
* Fix #416, #2051: Infinite diff in LdaModel.do_mstep

* fix build
menshikh-iv (Contributor) commented:
Fixed in #2344 by @horpto
