
BUG: LKJCorr breaks when used as covariance with MvNormal #7101

Open
velochy opened this issue Jan 14, 2024 · 10 comments · May be fixed by #7380
Labels: bug, enhancements, hackathon (Suitable for hackathon)

Comments

@velochy (Contributor) commented Jan 14, 2024

Describe the issue:

LKJCorr backward sampling is currently broken: it does not constrain values to a positive semidefinite matrix early enough. This leads to sampling randomly failing at the initial point check when the result is provided as a covariance matrix to MvNormal and other similar multivariate distributions. The failure is inherently random, but it becomes increasingly likely with larger n: it occurs almost always for n >= 10, and can in theory be seen with n as low as 3.

The error message is informative: it is easy to verify that the matrix reported in the error has negative eigenvalues, proving it is not PSD.

This only affects backward sampling (i.e. logp computations); .eval() and predictive sampling work as they should.
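The negative-eigenvalue check is easy to do with plain numpy. The matrix below is illustrative (not taken from the error output): every pairwise entry is a valid correlation in [-1, 1], yet the matrix as a whole is not PSD.

```python
import numpy as np

def is_psd(mat, tol=1e-10):
    # A symmetric matrix is PSD iff all of its eigenvalues are non-negative
    return bool(np.all(np.linalg.eigvalsh(mat) >= -tol))

# Entrywise-valid "correlations" that do not form a valid correlation matrix
bad = np.array([[ 1.0,  0.9, -0.9],
                [ 0.9,  1.0,  0.9],
                [-0.9,  0.9,  1.0]])

print(is_psd(bad))        # False
print(is_psd(np.eye(3)))  # True
```

This is exactly why the pairwise interval transform is insufficient: constraining each off-diagonal entry to (-1, 1) does not constrain the joint matrix to the PSD cone.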

Reproducible code example:

import pymc as pm
import pytensor.tensor as pt
import numpy as np

n, eta = 10, 1

# Turn LKJCorr output to a proper square corr matrix
def corr_to_mat(corr_values, n):
    corr = pt.zeros((n, n))
    corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
    return corr + corr.T + pt.identity_like(corr)

# Generate Data
corr = corr_to_mat(pm.LKJCorr.dist(n=n, eta=eta),n)
y = pm.draw(pm.MvNormal.dist(mu=0, cov=corr), 100)

# PyMC Model
with pm.Model() as m:
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta)
    corr = corr_to_mat(corr_values,n)
    y_hat = pm.MvNormal('y_hat', mu=0, cov=corr, observed=y)
    idata = pm.sample()

Error message:

Traceback (most recent call last):
  File "/home/velochy/pymc/test3.py", line 22, in <module>
    idata = pm.sample()
            ^^^^^^^^^^^
  File "/home/velochy/pymc/pymc/sampling/mcmc.py", line 744, in sample
    model.check_start_vals(ip)
  File "/home/velochy/pymc/pymc/model/core.py", line 1660, in check_start_vals
    raise SamplingError(
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'corr_values_interval__': array([ 0.96546568,  0.55627918, -0.52092415, -0.79551073, -0.04416735,
       -0.61208579, -0.04889871, -0.62181685,  0.13621748,  0.32908589,
       -0.38471896, -0.89088229, -0.1326561 , -0.18828583,  0.26747631,
       -0.47067211,  0.51480876, -0.56323033, -0.3513053 ,  0.30480191,
       -0.36865109,  0.1479414 , -0.66890036,  0.98484518, -0.85993115,
        0.48094929, -0.04547197, -0.07743477, -0.3381161 ,  0.12565052,
       -0.06450642, -0.69698417,  0.80246982, -0.67958155, -0.29175309,
       -0.02445876, -0.7858127 ,  0.17738997, -0.87779485, -0.27196011,
       -0.52231265,  0.46271978,  0.0810375 ,  0.84458683, -0.72285369])}

Logp initial evaluation results:
{'corr_values': -inf, 'y_hat': -inf}
You can call `model.debug()` for more details.

PyMC version information:

pymc 5.10.3

Context for the issue:

This arose in discussion with @ricardoV94 and @jessegrabowski in https://discourse.pymc.io/t/using-lkjcorr-together-with-mvnormal/13606/29. The thread has more useful context.

@ricardoV94 (Member) commented Jan 14, 2024

The initial point problem happens because of jittering and the insufficient IntervalTransform that is being used.

I assume model.point_logps() still comes out finite for every variable? Otherwise it would indicate we have a problem with the moment function that is used to provide initial valid values for sampling from a distribution.

@ricardoV94 (Member)

See also discussion in pymc-devs/pymc-experimental#13

@jessegrabowski (Member)

@velochy We just merged pymc-devs/pytensor#614, which now lets us use pt.linalg.norm inside of PyMC models. We can now fix this bug by defining a new transformation for LKJCorr: basically we just need to copy the tensorflow-probability bijector here, then set the newly defined transformation as the default transformation for LKJCorr. Is that something you'd be interested in helping with?
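For intuition, the forward map of that bijector (TFP's `CorrelationCholesky`) can be sketched in plain numpy; the function name below is illustrative, not PyMC or TFP API:

```python
import numpy as np

def unconstrained_to_corr_chol(x, n):
    # Fill the strictly lower triangle with the unconstrained values,
    # put 1 on the diagonal, then normalize each row to unit L2 norm.
    L = np.zeros((n, n))
    L[np.tril_indices(n, k=-1)] = x
    L[np.diag_indices(n)] = 1.0
    L /= np.linalg.norm(L, axis=1, keepdims=True)
    return L

x = np.array([0.5, -1.2, 0.8])    # n=3 -> 3 strictly-lower entries
L = unconstrained_to_corr_chol(x, 3)
C = L @ L.T                       # Gram matrix of unit-norm rows
print(np.diag(C))                 # all ones: a unit diagonal
print(np.linalg.eigvalsh(C))      # all non-negative: PSD by construction
```

Because C is a Gram matrix of unit-norm vectors, it is PSD with unit diagonal for any unconstrained input, which is exactly the property the current interval transform fails to guarantee.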

@velochy (Contributor, Author) commented May 3, 2024

Depends on how urgent it is :) I'm currently pretty swamped, but I should be able to get to it within a month or so.

@fonnesbeck added the hackathon label Jun 14, 2024
@johncant

Hi - I'm at PyData London and will take a look at this.

@johncant

For anyone in the pymc hack, I'm in the back middle of the room

@jessegrabowski (Member)

Hi @johncant , let me know if anything isn't clear about how to proceed

@johncant

@jessegrabowski thanks a lot! Nothing is clear and I am out of my depth. However, it's great learning and I still intend to fix this issue.

@twiecki and @fonnesbeck were there and helped me understand more about pymc than I would have learned from days of hacking on it myself.

Here's what we found:

  • Issue can be replicated with an even smaller example [1]
  • LKJCholeskyCov does not have this problem [2]
  • LKJCholeskyCov did appear to have this problem when I passed sd_dist=pm.Uniform.dist(1.8, 2.2). Chris indicated that this might cause convergence problems, and we changed it to pm.HalfNormal.dist(5, shape=10)
  • Return type section of LKJCorr is missing from the documentation
  • Typo in the code where Cholesky is spelled Choleksy
  • LKJ paper is paywalled

I think it's obvious to you that the problem is the default transformation for LKJCorr and it needs to be replaced with this. For me to do this, the next steps are:

  • Figure out how to draw a single sample directly from jupyter that replicates this problem
  • Figure out how to get the transform from an LKJCorr object
  • Understand the tensorflow bijector, why it should work, and why the current impl does not work
  • Read the LKJ paper
  • Port the tensorflow bijector (the easy part!)
  • Test
  • Update docs
  • Submit PR

[1] Replicates issue as long as n is high enough

import pymc as pm
import pytensor.tensor as pt

n, eta = 10, 3

# PyMC Model
with pm.Model() as m:
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta, return_matrix=True)
    idata = pm.sample()

[2] Replacement with LKJCholeskyCov - it samples without error (but with divergences)

import pymc as pm
import pytensor.tensor as pt
import numpy as np

n, eta = 10, 1

# Turn LKJCorr output to a proper square corr matrix
def corr_to_mat(corr_values, n):
    corr = pt.zeros((n, n))
    corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
    return corr + corr.T + pt.identity_like(corr)

# Generate Data
corr = corr_to_mat(pm.LKJCorr.dist(n=n, eta=eta),n)
y = pm.draw(pm.MvNormal.dist(mu=0, cov=corr), 100)

# PyMC Model
with pm.Model() as m:
    #sigma = pm.Uniform('sigma', 1.9, 2.1)
    #corr_values = pm.LKJCorr('corr_values', n=n, eta=eta)
    # LKJCholeskyCov (with the default compute_corr=True) returns a tuple
    # (chol, corr, stds); the Cholesky factor is the first element
    chol, corr_mat, stds = pm.LKJCholeskyCov('cov_values', n=n, eta=eta, sd_dist=pm.HalfNormal.dist(5, shape=10))
    #corr = corr_to_mat(corr_values,n)
    y_hat = pm.MvNormal('y_hat', mu=0, chol=chol, observed=y)
    idata = pm.sample()

@johncant commented Jun 18, 2024

Progress tracking comment

  • Figure out how to draw a single sample directly from jupyter that replicates this problem
  • Figure out how to get the transform from an LKJCorr object
  • Understand the tensorflow bijector, why it should work, and why the current impl does not work
  • Read the LKJ paper
  • Port the tensorflow bijector (the easy part!)
  • Test
  • Update docs
  • Submit PR

@johncant

I decided to post my progress from yesterday as a Jupyter notebook: https://colab.research.google.com/drive/1UNkgxEuy6j_Eww0aMFnsfzQfQmlxG2-d?usp=sharing . Basically, I now also understand why the transform on LKJCorr is the problem, and can start understanding and porting the TF transform.

@johncant linked a pull request Jun 21, 2024 that will close this issue