Implement logcdf method for discrete distributions #4387

ricardoV94 · 2020-12-27T16:38:49Z

This PR adds logcdf methods to all univariate discrete distributions. Formulas were based on either scipy or Wikipedia entries. For some distributions, I could not find specialized formulas and had to rely on summing all the logps over tt.arange(0, value+1).
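As an illustration of the fallback described above, here is a minimal standard-library sketch (not the PR's Theano code) that computes a log-CDF by combining the individual log-pmf values from 0 to `value` with a log-sum-exp. The BetaBinomial is used as the example; the function names `betabinom_logp`, `logsumexp`, and `betabinom_logcdf` are hypothetical and chosen only for this sketch.

```python
import math


def betabinom_logp(k, n, alpha, beta):
    """Log-pmf of a beta-binomial, written with log-gamma for stability."""
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_binom + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def betabinom_logcdf(value, n, alpha, beta):
    """Log-CDF obtained by summing the log-pmf over 0..value (scalar value only)."""
    if value < 0:
        return -math.inf  # below the support: CDF is 0
    if value >= n:
        return 0.0  # at or above the upper bound: CDF is 1
    return logsumexp([betabinom_logp(k, n, alpha, beta) for k in range(0, value + 1)])
```

With alpha = beta = 1 the beta-binomial reduces to a uniform distribution on 0..n, which gives a simple sanity check. The cost grows linearly with `value`, and the range depends on a single scalar, which is one way to see why this approach does not extend directly to arrays of values.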

Closes #4331

Unittests:

1. Using check_logcdf in all distributions with a scipy counterpart.
2. Added a new type of unit test, check_selfconsistency_discrete_logcdf, which tests that the logcdf is equivalent to adding all the logps starting from ~~zero~~ domain.lower. This provides coverage for the distributions that are not in scipy (not covered by 1.). I added it to all distributions, but maybe this is too redundant for those that are covered by 1.
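The idea behind the self-consistency test can be sketched with the standard library alone. The snippet below is not the actual PyMC3 test helper; `selfconsistent_logcdf` and the DiscreteUniform functions are hypothetical names, and the example uses a domain with a negative lower bound to show why the sum has to start at the lower end of the domain rather than at zero.

```python
import math


def discrete_uniform_logp(value, lower, upper):
    """Log-pmf of a discrete uniform on the integers lower..upper."""
    if lower <= value <= upper:
        return -math.log(upper - lower + 1)
    return -math.inf


def discrete_uniform_logcdf(value, lower, upper):
    """Closed-form log-CDF of the same distribution."""
    if value < lower:
        return -math.inf
    if value >= upper:
        return 0.0
    return math.log(value - lower + 1) - math.log(upper - lower + 1)


def selfconsistent_logcdf(logp, logcdf, lower, upper, tol=1e-9):
    """Check that logcdf(v) matches the log of the summed pmf from `lower` to v."""
    for value in range(lower, upper + 1):
        terms = [logp(k) for k in range(lower, value + 1)]
        m = max(terms)
        summed = m + math.log(sum(math.exp(t - m) for t in terms))
        if abs(summed - logcdf(value)) > tol:
            return False
    return True
```

For example, `selfconsistent_logcdf(lambda v: discrete_uniform_logp(v, -3, 4), lambda v: discrete_uniform_logcdf(v, -3, 4), -3, 4)` evaluates every point of the domain, including the negative ones.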

Some remaining issues / questions:

Should all bound calls in CDF methods be replaced by tt.switch? Most logcdf methods for continuous distributions seem to use tt.switch, except for the Gamma and InverseGamma distributions. I used bound whenever at least two conditions leading to -np.inf needed checking, and tt.switch otherwise. I am also not completely sure whether this interferes with the recent PR Add flag to disable bounds check for speed-up #4377.
Several logcdf methods do not work with multiple values (or even with an array containing a single value such as np.array([1])). In these cases, I excluded information in the docstrings about the possibility of computing the logcdf for multiple values, but this seems like a suboptimal solution. We should either fix it or raise a ValueError to explain the limitation to the user, as this seems to go against the expected behavior in most distributions. This occurs for two reasons:
1. Computing logcdf by summing over the individual logps with logsumexp(tt.arange(0, value+1), keepdims=False) for the BetaBinomial and HyperGeometric. Any alternatives?
2. Using the incomplete_beta function in the Binomial and NegativeBinomial (and ZeroInflated counterparts). This is also an issue in the current implementation of the Beta and StudentT distributions (see Beta logcdf method fails with array of values #4342). In addition, the unit tests for these functions seem to be very slow compared to those of other distributions. Is the incomplete_beta particularly slow?
On a slightly different note, the logcdf method of the Poisson distribution (and its ZeroInflated counterpart) fails with a C assertion (exiting the Python process altogether) when asked to evaluate multiple invalid values. This is also a problem in the InverseGamma distribution (see InverseGamma logcdf method fails with invalid parameters when array is used #4340). It will be solved once this Theano issue is fixed (see Return NaN in C implementations of SciPy Ops aesara-devs/aesara#224). Update: I have found a temporary hack to "hide" this problem (which can be removed once those other issues are solved).
Unittests: check_logcdf tests only for values in the output domain, ignoring the edge values, but we might want to test also for values at the edges as well as below or beyond the domain (which can be evaluated to either -np.inf or 0). For example pm.Bernoulli(p=.1).logcdf([-1, 2]) -> [-np.inf, 0]. Should we use more comprehensive extra-domain checks? Update: This is now implemented in Increase unittest check_logcdf coverage and fix issues with some distribution methods #4393.
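The extra-domain behaviour mentioned in the last item can be illustrated with a plain-Python stand-in for the Bernoulli case. This is only a sketch of the expected values, not the PyMC3 implementation; `bernoulli_logcdf` is a hypothetical helper that handles one scalar at a time.

```python
import math


def bernoulli_logcdf(value, p):
    """Log-CDF of a Bernoulli(p), including values outside the {0, 1} domain."""
    if value < 0:
        return -math.inf  # below the domain: CDF is 0, so the log is -inf
    if value < 1:
        return math.log1p(-p)  # P(X <= 0) = 1 - p
    return 0.0  # at or beyond the upper edge: CDF is 1, so the log is 0


# Mirrors the example in the text: values below and beyond the domain.
edge_results = [bernoulli_logcdf(v, 0.1) for v in (-1, 2)]
```

A more comprehensive check would evaluate each distribution at its edge values and at points just outside the domain, and compare against these two limiting results.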

Other changes not directly related to this PR:

Added a missing pymc3_matches_scipy test to the BetaBinomial. Was there a reason why this was missing?
Changed the order of the logp and random methods in the DiscreteWeibull distribution to be in line with the rest of the library.
Removed unused local variables in the init methods of DiscreteWeibull and ZeroInflatedPoisson distributions.

codecov · 2020-12-27T17:18:57Z

Codecov Report

Merging #4387 (dc4ce4a) into master (3cfee77) will increase coverage by 0.05%.
The diff coverage is 91.54%.

@@            Coverage Diff             @@
##           master    #4387      +/-   ##
==========================================
+ Coverage   88.04%   88.09%   +0.05%     
==========================================
  Files          88       88              
  Lines       14482    14538      +56     
==========================================
+ Hits        12750    12807      +57     
+ Misses       1732     1731       -1

Impacted Files	Coverage Δ
pymc3/distributions/discrete.py	`95.00% <91.54%> (-0.85%)`	⬇️
pymc3/sampling_jax.py	`0.00% <0.00%> (ø)`
pymc3/distributions/multivariate.py	`83.64% <0.00%> (+0.72%)`	⬆️

ricardoV94 · 2020-12-29T13:16:19Z

I am sorry for all the changes after opening the PR. I think it is now ready for review.

AlexAndorra

This is great, thanks @ricardoV94 !
This is a pretty big PR, so a second review by another core dev would be beneficial.

I agree that the logcdf methods that do not work with multiple values should raise a ValueError to explain the limitation to the user. Related to this, I left some comments suggesting that the types accepted for multiple values (numpy array or theano tensor) be added to the docstrings.

AlexAndorra · 2020-12-29T19:44:10Z

pymc3/distributions/discrete.py

+        at the specified value.
+        Parameters
+        ----------
+        value: numeric


Should probably add the types for multiple values then. Something like the following?

Suggested change

value: numeric

value: numeric or np.ndarray or theano.tensor

AlexAndorra · 2020-12-29T19:44:39Z

pymc3/distributions/discrete.py

+
+        Parameters
+        ----------
+        value: numeric


Suggested change

value: numeric

value: numeric or np.ndarray or theano.tensor

AlexAndorra · 2020-12-29T19:45:20Z

pymc3/distributions/discrete.py

+        at the specified value.
+        Parameters
+        ----------
+        value: numeric


Suggested change

value: numeric

value: numeric or np.ndarray or theano.tensor

AlexAndorra · 2020-12-29T19:46:13Z

pymc3/distributions/discrete.py

+        at the specified value.
+        Parameters
+        ----------
+        value: numeric


Suggested change

value: numeric

value: numeric or np.ndarray or theano.tensor

AlexAndorra · 2020-12-29T19:47:07Z

pymc3/distributions/discrete.py

+        at the specified value.
+        Parameters
+        ----------
+        value: numeric


Suggested change

value: numeric

value: numeric or np.ndarray or theano.tensor

AlexAndorra · 2020-12-29T19:47:34Z

pymc3/distributions/discrete.py

+        at the specified value.
+        Parameters
+        ----------
+        value: numeric


Suggested change

value: numeric

value: numeric or np.ndarray or theano.tensor

AlexAndorra · 2020-12-29T19:47:46Z

pymc3/distributions/discrete.py

+        at the specified value.
+        Parameters
+        ----------
+        value: numeric


Suggested change

value: numeric

value: numeric or np.ndarray or theano.tensor

ricardoV94 · 2020-12-29T22:19:18Z

Thanks for the review @AlexAndorra. I agree about the ValueError, although I still have to figure out how to test for that with Theano variables.

I was hoping someone would suggest a fix for those that use tt.arange. Scipy actually does a loop to return results for multiple values.

About the type hints in the docstring. What if we do a separate PR just for that, so that all logcdfs (discrete and continuous) are homogeneous? Right now, none of the other docstrings have it and they should as well.
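For reference, the looping strategy mentioned in this comment can be sketched in plain Python: a scalar-only logcdf is applied to each value in turn. The Poisson is used here purely as a stand-in because its log-pmf is short to write; `poisson_logcdf_scalar` and `logcdf_many` are hypothetical names, and this is not how scipy or PyMC3 implement it.

```python
import math


def poisson_logcdf_scalar(value, mu):
    """Scalar log-CDF built by summing the log-pmf over 0..value."""
    if value < 0:
        return -math.inf
    terms = [k * math.log(mu) - mu - math.lgamma(k + 1) for k in range(0, int(value) + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def logcdf_many(values, mu):
    """Evaluate the scalar log-CDF once per value, one loop iteration each."""
    return [poisson_logcdf_scalar(v, mu) for v in values]
```

An explicit loop like this is simple but does not translate directly into a single symbolic graph when the values are tensors, which is the difficulty raised in the thread.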

twiecki

A lot of the doc-strings miss new-lines before Parameters and Returns, otherwise this looks great to me!

ricardoV94 · 2020-12-30T09:20:41Z

A lot of the doc-strings miss new-lines before Parameters and Returns, otherwise this looks great to me!

Thanks for reviewing it. I will fix those.

Unrelated to that, does anyone know a good way to check that value is not a scalar in order to raise an informative ValueError in those methods that fail with arrays/tensors?

twiecki · 2020-12-30T09:29:41Z

You can try whether np.isscalar works on numpy and theano arrays.

ricardoV94 · 2020-12-31T10:37:04Z

New commit fixes missing newlines in docstrings and raises informative TypeError in logcdf methods that only accept scalars.
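A guard of the kind described in this comment might look like the following dependency-free sketch. It is an assumption about the general shape, not the code from the commit: `require_scalar` and `geometric_logcdf` are hypothetical names, and the check uses `numbers.Real` from the standard library instead of a NumPy or Theano test.

```python
import math
import numbers


def require_scalar(value, method_name):
    """Raise an informative TypeError unless `value` is a plain real number."""
    if not isinstance(value, numbers.Real):
        raise TypeError(
            f"{method_name} expects a scalar value but received {type(value).__name__}."
        )
    return value


def geometric_logcdf(value, p):
    """Scalar-only log-CDF of a geometric distribution with support 1, 2, ..."""
    require_scalar(value, "geometric_logcdf")
    if value < 1:
        return -math.inf
    return math.log1p(-((1 - p) ** math.floor(value)))
```

Calling `geometric_logcdf([1], 0.5)` raises the TypeError immediately, so the user sees the limitation instead of an obscure failure further down.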

I would prefer to address @AlexAndorra's suggestion of including more informative docstring type hints, as well as expanding the check_logcdf unit test to make sure that logcdf methods that fail with non-scalar values raise a TypeError, in #4393. The reason is that the logcdf methods of some continuous distributions also need to be changed, and I think it would be easier to keep track of and review those in that separate PR. Do you agree?

twiecki · 2020-12-31T11:50:01Z

This is a great contribution, thanks @ricardoV94!

Implement logcdf method for discrete distributions

47021d4

Add release note

0583d8e

ricardoV94 force-pushed the discrete_cdf branch from 9aab229 to 0583d8e Compare December 28, 2020 11:27

ricardoV94 added 5 commits December 28, 2020 13:30

Small reformatting

a09d93b

Add more comprehensive bound check for impossible parameters

831afba

Fix bounds for logcdf of DiscreteUniform

450a6f7

Add safe values/ parameters for BetaBinomial, Poisson, and `Hyper…

2c3c2a0

…Geometric` to avoid errors when evaluating negative logcdfs.

Change check_selfconsistency_discrete_logcdf to work with negative …

670a652

…domains, such as the DiscreteUniform.

ricardoV94 force-pushed the discrete_cdf branch from dd54634 to 670a652 Compare December 29, 2020 13:14

This was referenced Dec 29, 2020

Increase unittest check_logcdf coverage and fix issues with some distribution methods #4393

Merged

Update log1mexp and remove redundant local reimplementations in the library #4394

Merged

AlexAndorra requested changes Dec 29, 2020

AlexAndorra added this to the vNext (3.11.0) milestone Dec 29, 2020

AlexAndorra added the enhancements label Dec 29, 2020

twiecki reviewed Dec 30, 2020

twiecki reviewed Dec 30, 2020

twiecki requested changes Dec 30, 2020

Raise TypeError in logcdf methods that only accept scalar values.

bcf87a2

Fix docstring formatting.

ricardoV94 requested review from twiecki and AlexAndorra December 31, 2020 10:37

Small change to logcdf methods of DiscreteWeibull and `ZeroInflated…

dc4ce4a

…` making use of `tt.log1p` and `logaddexp`. More informative comment on workaround for `Poisson.logcdf`.

twiecki approved these changes Dec 31, 2020

twiecki merged commit d871b80 into pymc-devs:master Dec 31, 2020

ricardoV94 deleted the discrete_cdf branch January 2, 2021 11:09

ricardoV94 mentioned this pull request Jan 2, 2021

Logcdf methods of several distributions do not check for invalid parameters #4399

Closed
