Redundant get_Elogbeta calls in LdaModel #2051

Closed
darindf opened this issue May 15, 2018 · 3 comments
Labels
difficulty medium - Medium issue: required good gensim understanding & python skills
performance - Issue related to performance (in HW meaning)

Comments

darindf (Contributor) commented May 15, 2018

In reviewing performance profiling, I'm finding that get_Elogbeta is taking a noticeable amount of time.

I notice there are two calls to this function, one right after the other, in LdaModel.do_mstep.

The first call is obvious; in LdaModel.do_mstep we have:

logger.debug("updating topics")
# update self with the new blend; also keep track of how much did
# the topics change through this update, to assess convergence
diff = np.log(self.expElogbeta)
self.state.blend(rho, other)
diff -= self.state.get_Elogbeta()
self.sync_state()

But do you see the second call? In sync_state we have:

def sync_state(self):
    self.expElogbeta = np.exp(self.state.get_Elogbeta())
    assert self.expElogbeta.dtype == self.dtype

The question then becomes: does get_Elogbeta perform some side effect, such that the second call would return a different result from the first?

Looking at get_Elogbeta:

def get_Elogbeta(self):
    return dirichlet_expectation(self.get_lambda())

This function doesn't modify state directly.

Looking at the Python implementation of dirichlet_expectation (easier to read than the Cython version):

def dirichlet_expectation(alpha):
    """Expected value of log(theta) where theta is drawn from a Dirichlet distribution.

    Parameters
    ----------
    alpha : numpy.ndarray
        Dirichlet parameter 2d matrix or 1d vector, if 2d - each row is treated as a separate parameter vector.

    Returns
    -------
    numpy.ndarray
        Log of expected values, dimension same as `alpha.ndim`.

    """
    if len(alpha.shape) == 1:
        result = psi(alpha) - psi(np.sum(alpha))
    else:
        result = psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]
    return result.astype(alpha.dtype, copy=False)  # keep the same precision as input

This doesn't modify the input alpha, and the function psi is from scipy.special.
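
As a quick sanity check, get_Elogbeta can be called twice on the same state and the results compared; it just applies the standard Dirichlet identity E[log theta_k] = psi(alpha_k) - psi(sum_j alpha_j) to each row of lambda, so two back-to-back calls should agree exactly. The toy corpus below is purely illustrative, not the real data.

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tiny toy corpus, for illustration only.
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "minors", "trees"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=1)

# If get_Elogbeta has no side effects, consecutive calls must be identical.
first = lda.state.get_Elogbeta()
second = lda.state.get_Elogbeta()
assert np.array_equal(first, second)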

Thus it appears that the second get_Elogbeta call duplicates the work of the first and can be eliminated.

Is this analysis correct?

Here is a profile chart showing the timing.

[profile chart image]

This is taken from the overall LdaModel init & update, where get_Elogbeta is the two top boxes on the far right, from running LDA over a 1.3M-document corpus.

[profile chart image]
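
For anyone who wants to reproduce this kind of measurement, a profile like the one above can be captured with cProfile; the sketch below reuses the toy corpus and dictionary from the earlier snippet rather than the real 1.3M-document corpus, so the absolute numbers will differ.

import cProfile
import pstats

from gensim.models import LdaModel

# Profile a full LdaModel training run (toy data, illustrative only).
profiler = cProfile.Profile()
profiler.enable()
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=16, passes=1)
profiler.disable()

# Restrict the printout to frames whose name matches get_Elogbeta.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("get_Elogbeta")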

arlenk (Contributor) commented Jun 19, 2018

From a quick look, I think you are right.

Out of curiosity, what version of gensim are you using, and what parameters are you passing to LdaModel? I ask because in my use cases get_Elogbeta takes about 5% of the total run time, which isn't negligible, but it seems to be taking a much larger share of the total run time for you (if I am reading your profile chart correctly).

darindf (Contributor, Author) commented Jun 19, 2018

I forget which version of gensim I was using when I did the profiling, but it was a 3.x release. I'm currently using 3.4. I believe the cost will depend on the number of topics and the number of terms (vocabulary size); this run was over my test case of 1.3 million documents with 55k unique terms and 16 topics, so each get_Elogbeta call evaluates psi over a 16 x 55,000 lambda matrix (roughly 880,000 entries).

Here is a patch that monkey-patches the LdaModel object. I have verified that self.state.get_Elogbeta() is stateless, so the second call in the original code returns the same value as the first. With this patch, the corpus needs to be passed via update() rather than the constructor.

import logging
import types

import numpy as np

from gensim.models import LdaModel

logger = logging.getLogger(__name__)  # the patched do_mstep logs through this

lda = LdaModel(id2word=master_dictionary, num_topics=num_of_topics, passes=1)

def sync_state(self, elogbeta=None):
    # change start
    if elogbeta is None:
        elogbeta = self.state.get_Elogbeta()
    # change end
    self.expElogbeta = np.exp(elogbeta)
    assert self.expElogbeta.dtype == self.dtype

def do_mstep(self, rho, other, extra_pass=False):
    """
    M step: use linear interpolation between the existing topics and
    collected sufficient statistics in `other` to update the topics.

    """
    logger.debug("updating topics")
    # update self with the new blend; also keep track of how much did
    # the topics change through this update, to assess convergence
    diff = np.log(self.expElogbeta)
    self.state.blend(rho, other)
    elogbeta = self.state.get_Elogbeta()
    # change start: reuse the freshly computed Elogbeta instead of recomputing it
    diff -= elogbeta
    self.sync_state(elogbeta)
    # change end

    # print out some debug info at the end of each EM iteration
    self.print_topics(5)
    logger.info("topic diff=%f, rho=%f",
                np.mean(np.abs(diff)), rho)

    if self.optimize_eta:
        self.update_eta(self.state.get_lambda(), rho)

    if not extra_pass:
        # only update if this isn't an additional pass
        self.num_updates += other.numdocs

lda.sync_state = types.MethodType(sync_state, lda)
lda.do_mstep = types.MethodType(do_mstep, lda)
lda.update(corpus=corpus)
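
To double-check that the patch doesn't change the learned model, one hypothetical sanity check (not from the original report) is to train a patched and an unpatched model with the same random_state on the same corpus and compare the topic-word matrices; make_patched below is a helper introduced only for this sketch.

import types

import numpy as np
from gensim.models import LdaModel

def make_patched(model):
    # Apply the monkey-patched methods defined above to a fresh model.
    model.sync_state = types.MethodType(sync_state, model)
    model.do_mstep = types.MethodType(do_mstep, model)
    return model

baseline = LdaModel(id2word=master_dictionary, num_topics=num_of_topics,
                    passes=1, random_state=0)
patched = make_patched(LdaModel(id2word=master_dictionary,
                                num_topics=num_of_topics,
                                passes=1, random_state=0))

baseline.update(corpus=corpus)
patched.update(corpus=corpus)

# With the same seed, data and parameters, both models should learn the
# same topic-word distributions.
assert np.allclose(baseline.get_topics(), patched.get_topics())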

piskvorky added the "performance" label on Jun 19, 2018
menshikh-iv added the "difficulty medium" label on Jun 22, 2018
menshikh-iv pushed a commit that referenced this issue Jan 28, 2019
* Fix #416, #2051: Infinite diff in LdaModel.do_mstep

* fix build
menshikh-iv (Contributor) commented:
Fixed in #2344 by @horpto
