Represent `Model._coords` values as 1-dim numpy arrays #6589

covertg · 2023-03-11T21:18:39Z

What is this PR about?
Seeks to address #6496: buggy behavior could occur in InferenceData creation (and perhaps elsewhere, see diff) when a user provides a coordinate value sequence that numpy interprets as ndim>1.

We prevent this by changing the internal representation of Model._coords to be Dict[str, Optional[np.ndarray]] rather than Dict[str, Optional[Tuple]], and we ensure that the internal ndarray is 1dim.

Currently this PR does not change the publicly-named Model.coords property. For backwards-compatibility, that property still returns a dict of tuples. Instead we expose a new model property Model.coords_typed that gives the model coord values in the ndarray format.

As it turns out, nearly all of the times that pymc library would access Model.coords, it would seek to convert the (tupled) coordinate values to numpy array. This PR instead updates the whole library (not currently tests) to use the new coords_typed property. Ultimately this is how #6496 is resolved.

This PR is currently a draft to request discussion on if this is the best approach. One alternative would be to update the Model.coords property itself—this would prevent cluttering Model with another property but would be less backwards-compatible. Another question is if np.ndarray is in fact the best way to represent coordinate values. E.g. one existing test case in the library gives an example of passing pd.MultiIndex to coords when converting a trace to InferenceData, and allowing pandas indexes has been discussed in #6081.

Once an approach is given the 👍 I'll add some tests (especially regression test).

Checklist

Explain important implementation details 👆
Make sure that the pre-commit linting/style checks pass.
Link relevant issues (preferably in nice commit messages)
Are the changes covered by tests and docstrings?
Fill out the short summary sections 👇

Major / Breaking Changes

n/a

New features

The new Model.coords_typed property ensures that coordinate values are 1dim (if not none)

Bugfixes

BUG: Issues creating InferenceData if there is a coord of type Sequence[Sequence] #6496
Also as discussed, both coords properties return coordinates/coordinate values as copies

Documentation

n/a

Maintenance

n/a

codecov · 2023-03-11T21:36:52Z

Codecov Report

Merging #6589 (7677f0c) into main (cf6d4ce) will decrease coverage by 11.80%.
The diff coverage is 56.81%.

❗ Current head 7677f0c differs from pull request most recent head c535718. Consider uploading reports for the commit c535718 to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #6589       +/-   ##
===========================================
- Coverage   92.03%   80.23%   -11.80%     
===========================================
  Files          92       92               
  Lines       15535    15551       +16     
===========================================
- Hits        14297    12478     -1819     
- Misses       1238     3073     +1835

Impacted Files	Coverage Δ
pymc/backends/mcbackend.py	`100.00% <ø> (ø)`
pymc/distributions/bound.py	`46.15% <0.00%> (-53.85%)`	⬇️
pymc/distributions/discrete.py	`60.92% <ø> (-38.31%)`	⬇️
pymc/sampling/forward.py	`70.00% <0.00%> (-25.46%)`	⬇️
pymc/sampling/jax.py	`98.26% <ø> (ø)`
pymc/stats/log_likelihood.py	`89.13% <ø> (-10.87%)`	⬇️
pymc/util.py	`75.10% <54.54%> (-4.90%)`	⬇️
pymc/model.py	`70.94% <58.33%> (-18.58%)`	⬇️
pymc/backends/arviz.py	`78.46% <80.00%> (-17.94%)`	⬇️
pymc/variational/opvi.py	`87.02% <100.00%> (-0.46%)`	⬇️

... and 35 files with indirect coverage changes

…6496

covertg · 2023-03-12T22:23:45Z

CI tests found a bug in the original PR; now corrected! I also added an initial regression test for #6496.

covertg · 2023-03-12T22:27:45Z

tests/backends/test_arviz.py

-                },
-            )
-            assert isinstance(converter.coords["city"], pd.MultiIndex)
+            # TODO: we now ensure that everything passed to arviz is converted to


Flagging this for discussion. This PR currently updates InferenceDataConverter to always convert coordinate values to ndarray. (Previously they were only converted if the sequence was a tuple.) Is there a reason we would want to maintain the original behavior...?

This test was originally made for #5043

covertg · 2023-03-16T23:09:17Z

The latest run revealed one other case in the test suite that is incompatible with this PR as it is. One test (link) for pm.Bound uses a model with,

pm.Model(coords={"sample": np.ones((2, 5))})

Then when creating a Bound variable with dims="sample", it expects that variable to have shape (2,5). This is another case beyond my confident assessment of what should be done, but I'm pretty sure that we don't want users to do this, right? (i.e. each dim/dimension in xarray is definitionally 1-D?)

Also looks like the pre-commit run failed due to a different commit touching mcmc.py.

...
Anyways, before changing anything up it would be helpful to have some input on how to best internally represent coordinate values. (cc @michaelosthege @ricardoV94 @aseyboldt ?) I'll be rather unavailable for the next ~7 days but will be glad to help on this however I can after that time!

ricardoV94 · 2023-03-17T06:09:08Z

I didn't have time to read on this issue about the internal representation. That Bound test is incorrect, not a design choice, feel free to fix it.

covertg added 2 commits March 11, 2023 13:53

Change underlying representation of Model._coords to ndarray

47c62d9

Change use of Model.coords -> Model.coords_typed throughout pymc

7677f0c

covertg changed the title ~~Represent Model._coords as a 1-dim numpy array~~ Represent Model._coords values as 1-dim numpy arrays Mar 11, 2023

Fix bug introduced in arviz.py and add regression test for pymc-devs#…

c535718

…6496

covertg commented Mar 12, 2023

View reviewed changes

michaelosthege added this to the v5.2.0 milestone Mar 16, 2023

ricardoV94 removed this from the v5.2.0 milestone Mar 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Represent `Model._coords` values as 1-dim numpy arrays #6589

Represent `Model._coords` values as 1-dim numpy arrays #6589

covertg commented Mar 11, 2023 •

edited

Loading

codecov bot commented Mar 11, 2023 •

edited

Loading

covertg commented Mar 12, 2023

covertg Mar 12, 2023

covertg commented Mar 16, 2023 •

edited

Loading

ricardoV94 commented Mar 17, 2023 •

edited

Loading

Represent Model._coords values as 1-dim numpy arrays #6589

Are you sure you want to change the base?

Represent Model._coords values as 1-dim numpy arrays #6589

Conversation

covertg commented Mar 11, 2023 • edited Loading

Major / Breaking Changes

New features

Bugfixes

Documentation

Maintenance

codecov bot commented Mar 11, 2023 • edited Loading

Codecov Report

covertg commented Mar 12, 2023

covertg Mar 12, 2023

Choose a reason for hiding this comment

covertg commented Mar 16, 2023 • edited Loading

ricardoV94 commented Mar 17, 2023 • edited Loading

Represent `Model._coords` values as 1-dim numpy arrays #6589

Represent `Model._coords` values as 1-dim numpy arrays #6589

covertg commented Mar 11, 2023 •

edited

Loading

codecov bot commented Mar 11, 2023 •

edited

Loading

covertg commented Mar 16, 2023 •

edited

Loading

ricardoV94 commented Mar 17, 2023 •

edited

Loading