Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Represent Model._coords values as 1-dim numpy arrays #6589

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

covertg
Copy link

@covertg covertg commented Mar 11, 2023

What is this PR about?
Seeks to address #6496: buggy behavior could occur in InferenceData creation (and perhaps elsewhere, see diff) when a user provides a coordinate value sequence that numpy interprets as ndim>1.

We prevent this by changing the internal representation of Model._coords to be Dict[str, Optional[np.ndarray]] rather than Dict[str, Optional[Tuple]], and we ensure that the internal ndarray is 1dim.

Currently this PR does not change the publicly-named Model.coords property. For backwards-compatibility, that property still returns a dict of tuples. Instead we expose a new model property Model.coords_typed that gives the model coord values in the ndarray format.

As it turns out, nearly all of the times that pymc library would access Model.coords, it would seek to convert the (tupled) coordinate values to numpy array. This PR instead updates the whole library (not currently tests) to use the new coords_typed property. Ultimately this is how #6496 is resolved.

This PR is currently a draft to request discussion on if this is the best approach. One alternative would be to update the Model.coords property itself—this would prevent cluttering Model with another property but would be less backwards-compatible. Another question is if np.ndarray is in fact the best way to represent coordinate values. E.g. one existing test case in the library gives an example of passing pd.MultiIndex to coords when converting a trace to InferenceData, and allowing pandas indexes has been discussed in #6081.

Once an approach is given the 👍 I'll add some tests (especially regression test).

Checklist

Major / Breaking Changes

  • n/a

New features

  • The new Model.coords_typed property ensures that coordinate values are 1dim (if not none)

Bugfixes

Documentation

  • n/a

Maintenance

  • n/a

@codecov
Copy link

codecov bot commented Mar 11, 2023

Codecov Report

Merging #6589 (7677f0c) into main (cf6d4ce) will decrease coverage by 11.80%.
The diff coverage is 56.81%.

❗ Current head 7677f0c differs from pull request most recent head c535718. Consider uploading reports for the commit c535718 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #6589       +/-   ##
===========================================
- Coverage   92.03%   80.23%   -11.80%     
===========================================
  Files          92       92               
  Lines       15535    15551       +16     
===========================================
- Hits        14297    12478     -1819     
- Misses       1238     3073     +1835     
Impacted Files Coverage Δ
pymc/backends/mcbackend.py 100.00% <ø> (ø)
pymc/distributions/bound.py 46.15% <0.00%> (-53.85%) ⬇️
pymc/distributions/discrete.py 60.92% <ø> (-38.31%) ⬇️
pymc/sampling/forward.py 70.00% <0.00%> (-25.46%) ⬇️
pymc/sampling/jax.py 98.26% <ø> (ø)
pymc/stats/log_likelihood.py 89.13% <ø> (-10.87%) ⬇️
pymc/util.py 75.10% <54.54%> (-4.90%) ⬇️
pymc/model.py 70.94% <58.33%> (-18.58%) ⬇️
pymc/backends/arviz.py 78.46% <80.00%> (-17.94%) ⬇️
pymc/variational/opvi.py 87.02% <100.00%> (-0.46%) ⬇️

... and 35 files with indirect coverage changes

@covertg covertg changed the title Represent Model._coords as a 1-dim numpy array Represent Model._coords values as 1-dim numpy arrays Mar 11, 2023
@covertg
Copy link
Author

covertg commented Mar 12, 2023

CI tests found a bug in the original PR; now corrected! I also added an initial regression test for #6496.

},
)
assert isinstance(converter.coords["city"], pd.MultiIndex)
# TODO: we now ensure that everything passed to arviz is converted to
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this for discussion. This PR currently updates InferenceDataConverter to always convert coordinate values to ndarray. (Previously they were only converted if the sequence was a tuple.) Is there a reason we would want to maintain the original behavior...?

This test was originally made for #5043

@michaelosthege michaelosthege added this to the v5.2.0 milestone Mar 16, 2023
@covertg
Copy link
Author

covertg commented Mar 16, 2023

The latest run revealed one other case in the test suite that is incompatible with this PR as it is. One test (link) for pm.Bound uses a model with,

pm.Model(coords={"sample": np.ones((2, 5))})

Then when creating a Bound variable with dims="sample", it expects that variable to have shape (2,5). This is another case beyond my confident assessment of what should be done, but I'm pretty sure that we don't want users to do this, right? (i.e. each dim/dimension in xarray is definitionally 1-D?)

Also looks like the pre-commit run failed due to a different commit touching mcmc.py.

...
Anyways, before changing anything up it would be helpful to have some input on how to best internally represent coordinate values. (cc @michaelosthege @ricardoV94 @aseyboldt ?) I'll be rather unavailable for the next ~7 days but will be glad to help on this however I can after that time!

@ricardoV94
Copy link
Member

ricardoV94 commented Mar 17, 2023

I didn't have time to read on this issue about the internal representation. That Bound test is incorrect, not a design choice, feel free to fix it.

@ricardoV94 ricardoV94 removed this from the v5.2.0 milestone Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants