Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify check_point_clouds to allow for sparse input #424

Merged
merged 11 commits into from
Jul 14, 2020

Conversation

ulupo
Copy link
Collaborator

@ulupo ulupo commented Jun 19, 2020

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description
Following the addition of FlagserPersistence in #339 and the added support for list input in the simplicial homology transformers, it is a shame to keep excluding list of sparse matrix input, especially since the performance gains can be considerable for situations in which several edges are infinitely-valued (as @lewtun found out recently when using ripser). This PR modifies the check_point_clouds validation function to accept list of sparse matrix input, while still performing important checks. Furthermore, some docstrings have been edited for clarity/consistency.

Any other comments?
I tried to extend the changes to SparseRipsPersistence but got a mysterious error thrown when trying it on a single 2x2 sparse matrix with ones in the off-diagonal entries. @MonkeyBreaker @gtauzin, do you know why this might be happening? UPDATE: It is now clear that SparseRipsPersistence is not meant to receive sparse input yet.

Checklist

  • I have read the guidelines for contributing.
  • My code follows the code style of this project. I used flake8 to check my Python changes.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed. I used pytest to check this on Python tests.

Umberto added 2 commits June 19, 2020 11:14
- Allow for sparse input in VietorisRipsPersistence and FlagserPersistence when metric is precomputed.
- Fix docstrings to reflect the changes.
- Fix some typos and wording issues.
@MonkeyBreaker
Copy link
Collaborator

Could you provide MWE code producing the error to be easier to profile ?

Thank you !

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

@MonkeyBreaker sure!

when trying it on a single 2x2 sparse matrix with ones in the off-diagonal entries.

By this I meant:

from scipy.sparse import csr_matrix
from gtda.homology import SparseRipsPersistence

sparse_1 = csr_matrix(([1, 1], ([0, 1], [1, 0])))
SparseRipsPersistence(metric='precomputed')._gudhi_diagram(sparse_1)

which returns

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-edb7b0d0cfa2> in <module>
      3 
      4 sparse_1 = csr_matrix(([1, 1], ([0, 1], [1, 0])))
----> 5 SparseRipsPersistence(metric='precomputed')._gudhi_diagram(sparse_1)

~/Workspace/giotto-tda_ulupo/gtda/homology/simplicial.py in _gudhi_diagram(self, X)
    373         sparse_rips_complex = SparseRipsComplex(
    374             distance_matrix=Xdgms, max_edge_length=self.max_edge_length,
--> 375             sparse=self.epsilon)
    376         simplex_tree = sparse_rips_complex.create_simplex_tree(
    377             max_dimension=max(self._homology_dimensions) + 1)

~/Workspace/giotto-tda_ulupo/gtda/externals/python/rips_complex_interface.py in __init__(self, points, distance_matrix, max_edge_length, sparse)
     80             self.thisref.init_matrix_sparse(distance_matrix,
     81                                             max_edge_length,
---> 82                                             sparse)
     83         else:
     84             if points is None:

ValueError: allocator<T>::allocate(size_t n) 'n' exceeds maximum supported size

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

@MonkeyBreaker just to be clear, I am not aware whether SparseRipsPersistence should be supporting sparse matrix input, even in principle!

@MonkeyBreaker
Copy link
Collaborator

@ulupo I just give it a look, self.thisref.init_matrix_sparse(distance_matrix, max_edge_length, sparse) expects this type of arguments:

const std::vector<std::vector<double>>& matrix, double threshold, double epsilon

But looking the values passed to the arguments:

distance matrix   (0, 1)        1.0
                  (1, 0)        1.0
max_edge_length inf
sparse 0.1

distance matrix isn't a vector of vector, hence the error that occurs.

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

@MonkeyBreaker OK. So, quite simply, we just don't support sparse matrices in SparseRipsPersistence, correct?

@MonkeyBreaker
Copy link
Collaborator

If sparse format means this:

(0, 1)        1.0
(1, 0)        1.0

Then no. But what we could support is:

((0, 1), (1.0))
((1, 0), (1.0))

@MonkeyBreaker
Copy link
Collaborator

Also, the documentation of the Class SparseRipsComplex says:

"""
:param distance_matrix: A distance matrix (full square or lower
              triangular).
"""

I just don't understand the 1.0 at the end of the lines. Because ((0, 1), (1,0)) could also be a valid distance matrix.

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

I just mean a sparse data structure in general. The ripser bindings support many scipy sparse formats (CRS, COO, ..., see https://docs.scipy.org/doc/scipy/reference/sparse.html), while for SparseRipsPersistence we seem to restrict the Python user to dense numpy arrays.

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

I just don't understand the 1.0 at the end of the lines. Because ((0, 1), (1,0)) could also be a valid distance matrix.

I'm interpreting your question as being about the meaning of

(0, 1)        1.0
(1, 0)        1.0

What you see there is just a way of printing a CSR matrix in python. I means that entry [0, 1] (row 0, column 1) has value 1.0, and entry [1, 0] has value 1.0. All other entries ([0, 0] and [1, 1] in this case) are not set (this is what being sparse allows you to do) and hence are automatically assumed to be zero (or infinity or missing, according to the context).

@MonkeyBreaker
Copy link
Collaborator

Correct me if I'm wrong, but before calling ripser there is some preprocessing. If I modify the example like this:

from scipy.sparse import csr_matrix
from gtda.homology import SparseRipsPersistence
from sklearn.metrics.pairwise import pairwise_distances

sparse_1 = csr_matrix(([1, 1], ([0, 1], [1, 0])))
sparse_2 = pairwise_distances(sparse_1, metric='euclidean')
SparseRipsPersistence(metric='precomputed')._gudhi_diagram(sparse_2)

I get a new error related to the python code in simplicial.py:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    SparseRipsPersistence(metric='precomputed')._gudhi_diagram(sparse_2)
  File "/home/julian/workspace/TWS/giotto_learn_github/gtda/homology/simplicial.py", line 375, in _gudhi_diagram
    max_dimension=max(self._homology_dimensions) + 1)
AttributeError: 'SparseRipsPersistence' object has no attribute '_homology_dimensions'

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

@MonkeyBreaker with your example, you are testing a different mathematical situation. You are interpreting sparse_1 as a list of 2D vectors, and then computing all pairwise distances between them, which will be a numpy array indeed. This was not the intended usage in my example. I intended to pass sparse_1 directly as the matrix of pairwise distances, and wanted this to be sparse. My initial example works with no problems when SparseRipsPersistence is exchanged with VietorisRipsPersistence or FlagserPersistence .

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

I'm not sure how we handle sparse matrices in the ripser interface, we'd have to check. I don't know if we do something very trivial like converting them to dense numpy array format and feeding the resulting arrays to the ripser bindings, or we exploit the sparse structure in a more clever way to avoid expensive conversions.

@MonkeyBreaker
Copy link
Collaborator

In ripser.py, before calling the c++ code, the input is formatted to a specific format.

The same with FlagserPersistence, maybe we should update rips_complex_interface.py to allow more input formats and then convert them to the format expected by the implementation ?
Maybe an input from @gtauzin could help us in the right direction ?

@MonkeyBreaker
Copy link
Collaborator

@MonkeyBreaker with your example, you are testing a different mathematical situation. You are interpreting sparse_1 as a list of 2D vectors, and then computing all pairwise distances between them, which will be a numpy array indeed. This was not the intended usage in my example. I intended to pass sparse_1 directly as the matrix of pairwise distances, and wanted this to be sparse. My initial example works with no problems when SparseRipsPersistence is exchanged with VietorisRipsPersistence or FlagserPersistence .

Sorry for that, I just "made it work", I was looking at the reason why it didn't work.

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

I believe this

def DRFDMSparse(I, J, V, N, maxHomDim, thresh=-1, coeff=2, do_cocycles=1):
is the function responsible for passing sparse input to C++ ripser. The input must be in COO format, so any other sparse format is converted to COO before calling this function.

@MonkeyBreaker
Copy link
Collaborator

You're right, from the C++ code:

//Initialize from COO format
	sparse_distance_matrix(int* I, int* J, float* V, int NEdges, int N, float threshold) :
							neighbors(N), vertex_births(N, 0) {
		int i, j;
		value_t val;
		for (int idx = 0; idx < NEdges; idx++) {
			i = I[idx];
			j = J[idx];
			val = V[idx];
			if (i < j && val <= threshold) {
				neighbors[i].push_back(std::make_pair(val, j));
				neighbors[j].push_back(std::make_pair(val, i));
			}
			else if (i == j) {
				vertex_births[i] = val;
			}
		}
	}

But in this case we can conclude that at the moment, SparseRipsPersistence cannot handle sparse data structure input.

We could before calling the c++ code in SparseRipsPersistence transform the data ?

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 19, 2020

@MonkeyBreaker I need to think about this. At the moment, I think that if using sparse formats only gives a performance penalty because of conversions secretly performed by us, we should not fool the user into thinking that it might be a good idea to use these formats. The situation is different for VietorisRipsPersistence and FlagserPersistence where there can be large rewards for doing so.

@ulupo ulupo removed the request for review from MonkeyBreaker June 19, 2020 12:34
@ulupo ulupo changed the title Modify check_point_cloud to allow for sparse input Modify check_point_clouds to allow for sparse input Jun 19, 2020
@@ -278,7 +280,7 @@ def check_point_clouds(X, distance_matrices=False, **kwargs):
for i, x in enumerate(X):
try:
xnew = _check_array_mod(x, **kwargs_, ensure_2d=True)
if distance_matrices:
if distance_matrices and not issparse(xnew):
Copy link
Collaborator Author

@ulupo ulupo Jun 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for this: Sparse matrices can easily end up being non-square, but that is OK in all current applications for which distance_matrices might be set to True, i.e. the persistent homology transformers in gtda.homology.simplicial. Missing non-diagonal entries are consistently interpreted as np.inf in those cases.

@gtauzin
Copy link
Collaborator

gtauzin commented Jun 22, 2020

@ulupo As I mentioned earlier, I believe the check_point_clouds function should not be the one checking adjacency matrices of graphs. A long time ago a check_graphs function was started but never finished. I suggest the logic for checking adjacency matrices is offloaded there. Also, all preprocessing transformers in gtda.graphs could make use of it.

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 22, 2020

@gtauzin in principle, I agree with your suggestion. In practice, it is a slightly complicated one to implement: check_point_clouds is already being used to check adjacency matrices of graphs, because it is already being used by VietorisRipsPersistence and SparseRipsPersistence for instance, when metric='precomputed'. Presumably, the function would have to be rewritten substantially (and split into two) even in the current state. The current changes are simply minimal with respect to the existing (debatable) decisions.

What do you think about making this the subject of a separate PR?

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 22, 2020

Another smaller note: if #422 is merged, we won't need a check function for the graph transformers. This is because in fact only one existing transformer in gtda.graphs receives graphs as inputs, namely GraphGeodesicDistance. But, as per #422, a thorough scipy check_graph-type function is run automatically on each graph before computing geodesic distances, so it seems to me that any batch-level check_graph function would just end up doing duplicate work in that specific case.

For the future, there remains of course a general case for having check_graph.

Copy link
Collaborator

@gtauzin gtauzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ulupo I think it is quite good! I have also a general comment. In check_array, if the input is a list but the shape of its element allow it to be formatted as a numpy array, it returns a numpy array. I think this would be good to have as well to enhance performance.

gtda/utils/validation.py Show resolved Hide resolved
gtda/utils/validation.py Outdated Show resolved Hide resolved
@CLAassistant
Copy link

CLAassistant commented Jun 24, 2020

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ ulupo
❌ Umberto


Umberto seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 24, 2020

@gtauzin I have now addressed two of the points you raised (conversion to an overall array when possible, removal of dimensionality warnings). Concerning copies, this is the current situation:

When copy=False and the input is a "regular" list – of, say, float arrays – the return value of check_point_clouds is a new list "typically" pointing to the same objects as the input list. So e.g. doing

X_list = [np.ones((3,4)), np.zeros((3,3))]
X_arr = check_point_clouds(X_list, copy=False)
X_arr[0][0, 0] = 2

would also modify the top-left entry of the first array in X_list. Notice that this is not true (it cannot be, as far as I can tell) when the arrays in X_list have the same shapes because then the addition made in 1b110ea basically triggers a copy regardless of whether copy was set to True or False.

So in a sense we are only "half" copying in these scenarios (and the copying we do is a "light" one, i.e. a new list of pointers, with little new memory allocation).

The reason I think it's not so straightforward to do better is as follows: One could think that the new behaviour should be something like this:

  1. When copy is set to True, keep the current behaviour;
  2. When copy is set to False, the loop could be as follows:
        has_check_failed = False
        messages = []
        for x in X:
            try:
                _check_array_mod(x, **kwargs_, ensure_2d=True)
                if distance_matrices:
                    if not x.shape[0] == x.shape[1]:
                        raise ValueError(
                            f"All arrays must be square: {x.shape[0]} rows "
                            f"and {x.shape[1]} columns found in this array.")
            except ValueError as e:
                has_check_failed = True
                messages.append(f"Entry {i}:\n{e}")
        if has_check_failed:
            raise ValueError(
                "The following errors were raised by the inputs:\n\n" +
                "\n\n".join(messages))
        else:
            Xnew = X

i.e. we just run _check_array_mod (which is virtually the same as check_array) on each entry, and if no errors are thrown we simply make Xnew an alias of the original data.

The problem with 2 I think is that not always does passing copy=False in sklearn's check_array mean that no copies are triggered. Sometimes they can be triggered by data type conversions. In these cases, we would ignore the copies which are presumably triggered by check_array for a good reason. What do you think?

@ulupo
Copy link
Collaborator Author

ulupo commented Jun 25, 2020

@gtauzin a particular case of my point above is the following: suppose (God forbid!) that a user passes a nested 3-level list instead of a list of 2D arrays. Then, in the current design, thanks to check_array's behaviour we would return a new list of 2D arrays which is important for downstream tasks. However, if the alternative in the above code block is used instead, we would just return the original nested list.

@ulupo ulupo requested a review from gtauzin June 25, 2020 22:00
@ulupo ulupo merged commit 6b32773 into giotto-ai:master Jul 14, 2020
@ulupo ulupo deleted the simplicial_homology/allow_sparse_input branch July 14, 2020 16:34
ulupo added a commit that referenced this pull request Oct 9, 2020
* Make path to giotto-tda logo absolute for display on PyPI (#415)

* Fix formatting in CONTRIBUTING.rst (#416)

* Fix typo in Mapper docs for n_sig_figs (#417)

* Adds FlagserPersistence (#339)

* Add pyflagser >= 0.4.0 to requirements, GH pages installation instructions, and README.rst

* Add FlagserPersistence

* Add tests for FlagserPersistence

* Add FlagserPersistence to doc index

* Add See Also entries in simplicial

* Linting in ripser_interface

* Update See also for Cubical

* Fix max_edge_length docs in other simplicial transformers

* Add tests to check max_edge and infinity_values

* Fix low infinity value bug and put dimension padding in _utils

* Add test for low infinity values

* Simplify code in homology/_utils.py

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Umberto <umberto.lupo@gmail.com>
Co-authored-by: Wojciech Reise <reisewojtus@gmail.com>

* Introduce `nightly_release` variable in azure-pipelines.yml (#421)

* Link Giotto-tda logo to raw.githubusercontent.com

* Add FlagserPersistence to the documentation (#425)

* Refactor GraphGeodesicDistance (#422)

* Refactor GraphGeodesicDistance

- Change return type to list for compatibility with homology transformers
- Use scipy's shortest_path instead of sklearn's graph_shortest_path
- Make clearer rules on the role of zero entries, infinity entries, and non-stored values
- Add parameters directed, unweighted, and method
- Support masked arrays
- Rewrite docstring

* Fix See Also in KNeighborsGraph and TransitionGraph

* Prevent using FW algorithms when some edges are zero

- Work around scipy/scipy#12424
- Introduce user warnings when algorithm cannot be chosen automatically or set to FW
- Fix test ground truth
- Improves array conversion in GraphGeodesicDistance.transform

* Add GraphGeodesicDistance unit tests on sparse/masked array input and more dtypes

* Fix Azure CI for nightly releases (#429)

* Add *.pyo and *.pyd to .gitignore (#430)

Extend list of ignored extensions to other Python bytecode-type files

* Modify check_point_clouds to allow for sparse input (#424)

* Modify check_point_cloud to allow for sparse input

- Allow for sparse input in VietorisRipsPersistence and FlagserPersistence when metric is precomputed.
- Fix docstrings to reflect the changes.
- Fix some typos and wording issues.

* Make docstrings more consistent across simplicial transformers

* Fix reference for FlagserPersistence

* Remove warning for different point cloud embedding dimensions

* Eliminate tests for different embedding dimension warnings

* Convert list input to 3D ndarray when possible

* Update my email address

* Accept sparse inputs even in non-precomputed case for VietorisRips and SparseRips

* Cover sparse cases and more metrics in simplicial unit tests

Co-authored-by: Umberto <u.lupo@l2f.ch>

* Update collaborator email addresses (#435)

* Update @lewtun's and @ulupo's emails in CODE_AUTHORS.rst

* Update @lewtun's and @ulupo's emails in GOVERNANCE.rst

* Refactor _subdiagrams to take all homology dimensions into account (#439)

* Fix mmap settings used by joblib.Parallel (#428)

* Fix mmap settings used by joblib.Parallel in HeatKernel and PersistenceImage

* Add tests

* Slice X first inside parallel calls to _subdiagrams

* Minor simplifications in plot_diagram

* Add @NickSale to CODE_AUTHORS.rst

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix axis ranges in `plot_diagram` (#437)

* Remove **input_layout kwargs, refactor layouts in plot_diagram

- Fix #409
- Fix calculation of the maximum filtration parameter which incorrectly included the homology dimensions
- Simplify code

* Transpose the output of PairwiseDistance to match scikit-learn convention (#420)

* Transpose output shape in PairwiseDistance.transform

* Update tests

* Simplify _parallel_pairwise

Co-authored-by: Umberto Lupo <46537483+ulupo@users.noreply.github.com>

* Replace pip install with python -m pip install in docs and Readme (#440)

* Remove incorrect assumptions in Filtering (#436)

* Remove `_sort` and refactor `filter` in gtda/diagrams/_utils.py

Arrays are no longer sorted by lifetime before filtering.

* Fix test for Filtering

* Improve documentation, begin addressing #233

* Small improvement to `_subdiagrams`

* Return figures instead of showing them (#441)

* Refactor of plotting API

- Make all functions in gtda/plotting return figures (or tuples of figures for betti_surfaces) instead of showing them
- `plot` class methods also return figures instead of showing them
- `transform_plot` and `fit_transform_plot` still show figures and only return transformed data
- Add `plot_params` kwarg throughout to allow user customisability of output figures (subtlety: one of the key can be either "trace" when the output figure only has one trace, or "traces" when it has several)

* Suppress user warnings on graph geodesic distance algorithms in tests

* Resolve overflow warnings in mapper filter tests

* Resolve numpy DeprecationWarning in test_validation

* [DOCS] Clarify expected diagram properties to fix #233 (#443)

* Allow plot_heatmap to take boolean arguments (#444)

* Add pullback cover set labels and partial cluster labels to Mapper hovertext, improve docs (#445)

* Add pullback set ID to mapper hovertext

* Add partial cluster label to Mapper hovertext

* Improve docs for plot_static_mapper_graph and plot_interactive_mapper_graph

* Fix tests after changes

* Slight wording improvement

* Further wording clarification

* Make wording on edges even clearer

* Refactor igraph.Graph output of Nerve, modify `fit` behaviour of Nerve, add `store_edge_elements` kwarg to Nerve and make_mapper_pipeline, add Nerve and ParallelClustering to docs (#447)

* Refactor igraph.Graph output, add Nerve and ParallelClustering to docs, add store_edge_elements kwarg to Nerve and make_mapper_pipeline

- Store node metadata not as a graph-level dictionary, but as vertex attributes accessible by graph.vs[attr_name][node_id] or graph.vs[node_id][attr_name] for attr_name in ["pullback_set_label", "partial_cluster_label", "node_elements"]
- Remove "node_id" from node attributes as it always coincided with the igraph.Graph node number anyway.
- Automatically store sizes of intersections as edge weights, accessible by graph.es["weight"].
- Add "store_intersections" kwarg to Nerve and make_mapper_pipeline to allow storing indices of node intersections as edge attributes, accessible via graph.es["edge_elements"].
- Simplify logic of Nerve.fit_transform code
- Change the attributes stored by Nerve.fit. Now the entire graph is stored as graph_ instead.
- Improve documentation of make_mapper_pipeline
- Expose ParallelClustering and Nerve in __init__ and docs.
- Adapt tests, mapper quickstart notebook, and mapper plotting functions.

* Add two tests for the behaviour of store_edge_elements and min_intersection

* Remove check for shape of `layout` in _calculate_graph_data

`layout` can only be a string or a callable, not an ndarray

* Improve test coverage of mapper visualization modules

* Create tests for plot_betti_surfaces and plot_betti_curves

* Add plotly_params to remaining plot methods in diagrams/representations, missed out in #441 

* Fix some linting and docstrings

* Minor improvements

* Avoid shadowing range function in plot_diagram

* Fix minimum birth bug in `plot_diagram` (#449)

* Fix bug introduced in 4bc90b2

np.max should have been np.min in plot_diagram for minimum birth and death calculation

* Clean/simplify plot_diagram further to be more ready for extended persistence and better behave under plotly HTML autoscaling

* Make `store_edge_elements` work with MapperPipeline.get_mapper_params and MapperPipeline.set_params, missed out in 4bc90b2

* Remove displayed SegmentLocal from gif hyperlink in classifying_shapes example notebook (#448)

* Remove displayed SegmentLocal from classifying_shapes example notebook

* Fix doc typos

* Pybind11 quick fix (#451)

* Fix pybind11 broken master

pybind11 checkout version v2.5.0, some issues are observed on master

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>

* Fix bug in use of homology_dimension_ix with samplings_, rename homology_dimension_ix (#452)

* Rename homology_dimension_ix to homology_dimension_idx, fix bug

* Add normalization of persistence entropy (#450)

* Add normalization of persistence entropy via `normalize` kwarg in gtda.diagrams.PersistenceEntropy

* Use entropy from scipy.stats in gtda.diagrams.PersistenceEntropy, gtda.time_series.PermutationEntropy and gtda.mapper.Entropy

* Rename _entropy hidden method in gtda.diagrams.PersistenceEntropy

* Add tests for normalize=True

* Add `fill_nan_value` kwarg to gtda.diagrams.PersistenceEntropy

- See #450 (comment)
- Adapt pipeline tests to use this kwarg

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix y-axis in HeatKernel plots, add default titles to plot methods (#453)

* Fix y-axis in HeatKernel.plot

* Add titles to figures generated via plot_heatmap in plot class methods

* PEP8 E133 improvements

* Refactor gtda/diagrams (#454)

* First attempt at fix of #438

* Fix y-axis in `PersistenceImage` plots, extending #453

* Represent multiplicity of persistence pairs in hovertext in `plot_diagram`

* Change reflect mode of `gaussian_filter` to "constant" from "reflect"

Affects `HeatKernel` and `PersistenceImage`

* Fix `PersistenceLandscape` plot method

- Only the figure corresponding to the first seen homology dimension was returned
- Output is now a figure with subplots, a main title, and one subtitle per plot (homology dimension)

* Improve tests for plot methods in gtda.diagrams.representations

Cover use of `plotly_params`

* Minor docstring linting

* Miscellaneous docstring improvements in gtda/diagrams

* Fix validation dictionary for `metric_params` in the case of `PersistenceImage`

* Change default value of `order` in Amplitude, from 2. to None (vector features)

* Change meaning of default None for `weight_function` in `PersistenceImage`

- ``None`` corresponding the identity means that there really is a non-trivial weighting in that case. Semantically, this does not seem correct ("None" should mean no weighting at all)

* Improve code style and clarity in plot methods in gtda.diagrams.representations

* Refactor gtda/diagrams/_metrics.py to fix several bugs

- Change computation of heat/persistence image distances and amplitudes to yield the continuum limit when `n_bins` tends to infinity.
- Make `sigma` in persistence-image-- and heat-kernel--based representations/distances/amplitudes measured in the same units as the filtration parameter (not in pixels), thus decoupling its effect from the effect of `n_bins`. Also change the default value in PairwiseDistance, Amplitude, HeatKernel and PersistenceImage from 1. to 0.1.
- Remove trivial points from diagrams before creating a sampled image in `heats` and `persistence_images`. This ensures in particular that the trivial points really give no contribution regardless of the weight function used for persistence images.
- Finish fixing #438 and similar issues with PairwiseDistance when the metric is persistence-image--based.
- Ensure `silhouettes` does not create NaNs when a subdiagram is trivial.
- Change default value of `power` in `silhouette_amplitudes`
 and `silhouette_distances` to 1., to agree with Amplitude and PairwiseDistance docs.
- Fix use of np.array_equal in `persistence_image_distances` and `heat_distances` -- so far the boolean check for equal inputs always failed due to the in-place modifications of `diagrams_1`.
- No longer store weights in `effective_metric_params_` attribute of PairwiseDistance and Amplitude when the metric is persistence-image--based.
- Remove _heat from gtda.diagrams._metrics.
- Remove identity from gtda.diagrams._utils and make default `weight_function` kwargs equal to np.ones_like everywhere to agree with default in `PersistenceImage`.
- Other style improvements (variable names, linting)

* Fix trace name when homology dimension is np.inf in `BettiCurve` and `Silhouette`

* Adapt `test_all_pts_the_same` to new behaviour of `heats_` in gtda.diagrams._metrics

* Improve test coverage of `Amplitude` and `PairwiseDistance`

- Make sure silhouettes and persistence images are covered throughout
- Cover `order` parameter throughout

* Add test of zero `weight_function` for `PersistenceImage`

* Make behaviour of `Scaler.fit` when the metric is persistence image the same as `Amplitude`

Accordingly add more combinations of metrics and metric_params in tests for Scaler

* Delete never-used `_matrix_wrapper` and `_arrays_wrapper` functions

* Remove `_pad` from gtda.diagrams._utils as it is never used

* Make `copy=True` in calls to check_diagrams in Scaler.transform and Scaler.inverse_transform

* Make `homology_dimensions_` attributes tuples instead of lists, with integers when possible

* Improve code style

* Hard-code zero array outputs by `heats` and `persistence_images` when step sizes are zero

* Add `homology_dimensions` kwarg to `_bin`

Achieves beautification of self._samplings and self.samplings_ (ints shown instead of floats) in several transformers (and saves some computation)

* Adapt choices of `min_values`, `max_values` and `sigma` in hypothesis-based tests

New meaning of sigma led to overflow issues in existing tests.

* Make all homology dimensions equal in `test_hk_big_sigma`

Also extend this test to `PersistenceImage`, and rename it accordingly

* Cover use of `plotly_params` kwarg in diagram preprocessing classes plot methods

* Extract some common logic from plot methods in gtda.diagrams.representations

* Silence expected warnings from image transformers in test_common

* Implement @wreise's suggestion to abstract away sorting and integer conversion of fit hom dims

* Add tests for `Amplitude` and `PairwiseDistance` to check that a zero weight function yields identically zero amplitudes/distances.

* Refactor `_subdiagrams` to throw informative errors on unfulfilled input properties

* Bump package dependencies to latest available in conda (#457)

* Bump dependencies to latest available in conda (or PyPI if conda unavailable)

* [CI] Bump pip version in .azure-ci/docker_scripts.sh

* Plot infinite deaths in plot_diagrams (#461)

* Plot infinite deaths in plot_diagrams

- Dynamically establish a certain value greater than the maximum observed death as the y-coordinate of infinity
- Plot points with infinite death as having this death value
- Plot a horizontal dashed line to show the position of infinity

* Add first draft of FlagserPersistence notebook

* Fix errors with use of homology_dimensions kwarg in FlagserPersistence

- Implementation was incorrect in cases where minimum in homology_dimensions is greater than 1
- Implementation would crash whenever `flagser_weighted` does not return a list of size self._max_homology_dimension + 1 - self._min_homology_dimension, as can happen with small graphs

* Fix #92

Document removal of one infinite bar in simplicial transformers

* Polish up classifying_shapes notebook

* Remove null cell from vietoris_rips_quickstart

* Upload clique complex filtration images

* Crop image

* Crop and colour images

* New versions of clique complex

* New notebook version

* Test a small version of a png

* Update notebook and pngs

* Remove plotly code

* Add SVGs for directed flag complex illustration

* Correct one SVG

* Add color fill

* Fix color fill

* Reorient one triangle

* Finish up notebook

* Remove some PNGs

* Delete legacy image

* More "correct" name for a PNG

* Reflect name change on notebook

* LaTeX formatting

* Add reference to ripser.py tutorials

* Fix handling of COO matrices by ripser mirroring scikit-tda/ripser.py#104 (#465)

* Update gudhi bindings to latest changes on August 2020 (#468)

* Pull latest changes on Gudhi-devel:giotto

* Add Eigen as a submodule

Eigen is now required for latest Gudhi version

* Add Eigen for gudhi modules that require it

* Update get_filtration and get_skeleton with the new Gudhi API

* Update ccache in CI

* Fix type issue for get_persistence in simplex_tree

* Fix persistence in periodic cubical complex

* Fix simplex tree

* Fix cubical complex

* Beautify setup.py

* Move towards fixing #307 in azure_pipelines.yml (still not addressing docker_scripts.sh)

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Change DISTNAME back form "giotto-tda" to 'giotto-tda' in setup.py (#470)

Avoids uploading nightly distributions accidentally to PyPI from CI as happened on 28.08.2020

* Change title as suggested by @lewtun, fix GitHub capitalization in all notebooks

* Unify postprocessing of diagrams between simplicial and cubical, remove one H0 inf bar in cubical

* Fix imports, reinstate reshape

* Fix slicing and tests

* Document removal of infinite bar in 0D

* Improve readability

* Place _replace_infinity_values and _pad_diagram in scope of _postprocess_diagrams

* Forfeit use of n_jobs in _postprocess_diagrams

Favour use of purely in-place operations by a logic replacing _pad_diagram

* Linting

* Remove unused joblib imports from homology/_utils.py

* Ignore SparseRipsPersistence in code cells as suggested by @lewtun

* LaTeXify indices as suggested by @lewtun

* Fix typos

* Fix typo

* Convert pycairo note from inline comments to markdown text

* Implement some of @lewtun's suggestions for classifying shapes notebook

* Migrate general persistence diagram discussion to vr notebook, create "General API" section

In partial fulfillment of @lewtun's review suggestions

* Fix VR notebook and add explanation of homology_dimensions parameter

* Make PyPI nightly uploads more robust to changes in quote types (#471)

* Make PyPI nightly uploads more robust to changes in quote types

Look for both single and double quotes when modifying DISTNAME in setup.py

* Reinstate double quotes in setup.py

* Change sed -i to sed -i.bak in macOS Azure jobs (#472)

* Adds WeakAlphaPersistence (#464)

* Add WeakAlphaPersistence (gtda.homology.simplicial)

* Change relative order of simplicial transformers

* Add tests of correctness of VietorisRipsPersistence and WeakAlphaPersistence (circle PH)

* Adapt to sklearn 0.23 API in test_common

Silences warnings

* Improve docstrings in gtda.homogy (type of X in fit and transform, and type of Xt in plot)

* Avoid dtype search and always make a float array

* Adapt WeakAlphaPersistence to changes

* Some style consistency

* Fix small bug with plot_diagrams

Trivial points were not excluded from computation of plot ranges

* Bump pyflagser version in requirements

* Improve docs for VietorisRipsPersistence by explaining how precomputed input is handled

* Improve writing following @lewtun's comments, fix hyperlinks in preparation for merge

* Add a regression test for FlagserPersistence

* Introduce a `reduced_homology` kwarg to toggle or not removal of infinite H0 bar

* Change default value of nan_fill_value in PersistenceEntropy to -1 (#474)

* Change default value of nan_fill_value in PersistenceEntropy to -1.

* Remove false URL in README (#476)

* Fix read-only error analogous to #427 in Amplitude and PairwiseDistance (#481)

* Improve naming of parallelism tests for PersistenceImage and HeatKernel

* Extend #428 to Amplitude and PairwiseDistance, add regression test, improve edge case handling in _metrics.py

* Cover n_jobs > 1 in test_features_representations.py

* Fix formatting issues in notebooks (#482)

* Add persistent homology of graphs notebook to docs

* Add Vietoris-Rips gif to examples/images and use it instead of Medium links notebooks

* Change single backticks for monospace to double backticks, recommended for Markdown

* Fix a parenthesis in classifying_shapes.ipynb

* Add bindings for collapser, refactor ripser_interface (#469)

* Add bindings for collapser

* Add collapse_edges kwarg to ripser to enable edge collapser

* Add tests for the collapser bindings

* Add tests for ripser with edge collapse

* Add regression test for #465

* Refactor ripser_interface to simplify code, improve lexsort logic, and improve detection of sparse matrix shape

* Disable computation of cocycles by default

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Enable edge collapse in VietorisRipsPersistence (#483)

* Enable edge collapse in VietorisRipsPersistence

Also cover coefficient different from 2 in tests of collapser

* Cover collapse_edges kwarg in simple VietorisRipsPersistence tests

* Rename TakensEmbedding to SingleTakensEmbedding, introduce new TakensEmbeding transformer on collections and a function for time-delay embedding parameter optimization, replace "width" parameter with "size" in Labeller and SlidingWindow, fix bug with Labeller.resampler when n_steps_future >= size - 1 (#460)

* Extract time delay embedding logic into a _time_delay_embedding function in the new file gtda.time_series._utils

* Rename TakensEmbedding to SingleTakensEmbedding

* Extract and expose code for finding optimal Takens embedding parameters, as a new takens_embedding_optimal_parameters function

* Simplify logic of resample method of SingleTakensEmbedding and SlidingWindow

* Improve tests for Stationarizer

* Fix bug in Labeller.resample when n_steps_future >= width, add test to cover this scenario

* Allow zero width (meaning 1 point) in SlidingWindow and Labeller

* Add simple test for invalid width in Labeller

* Faster implementation of _slice_windows in SlidingWindow and of SlidingWindow.transform

* Make SlidingWindow.slice_windows public, add test

* Rename "width" parameter of Labeller to "size", addressing #94

* Rename "width" parameter of SlidingWindow to "size", addressing #94 

* Create new TakensEmbedding function to deal with collections of univariate or multivariate time series

* Support list of time series input in gtda.time_series.embedding by means of a new check_collection function in gtda.utils.validation

* Fix typos in glossary.tex

* Remove plot method from SlidingWindow and place it in new TakensEmbedding

* Refactor Lorenz attractor notebook following API changes involving TakensEmbedding and SlidingWindow

* Use check_collection in CubicalPersistence

* TransformerResamplerMixin notebook, make Labeller.transform return 1D arrays (#475)

* Add notebook on TransformerResamplerMixins and time series forecasting with sliding windows + topology

* Change return shape or Labeller.transform to 1D

* Fix Labeller docstring following change in #475

* Modify `PlotterMixin.transform_plot` to give a dictionary to the call to `plot` (#484)

* Fix sample numbers in automatically-generated plot titles when using `transform_plot`

* Add test for Takens to use fit_transform_plot

* Remove papermill version from azure-pipelines

Co-authored-by: wreise <wojciech.reise@epfl.ch>

* removed check_collection from docs

* Fix references in takens and simplicial

* Change todo in weak alpha filtration reference to the section name, remove empty line from release

* Add `mode` kwarg to KNeighborsGraph, refactor input-output formats in gtda.graphs, turn outputs of KNeighborsGraph and TransitionGraph into asymmetric adjacency matrices (#478)

* Add mode kwarg to KNeighborsGraph, allow for list input in gtda.graphs, return lists instead of ragged arrays

* Turn output of TransitionGraph into a directed 0-1 adjacency matrix

* Docstring improvements in gtda.graphs

* Add check_collection again

* Reference ripser [1]

* Move the references before the full stops

* Replace missing glossry entries with TODOs, following Umberto's comment

* Docstring improvements following @gtauzin's review

* Tiny wording fix

* Improve docstring following @gtauzin's comment

* Add missing references

* Flag weak_alpha_complex as missing entry

* Reference in the documentation

* Add a target for testing notebooks in the docs

* Make html does not fail on missing versions

* Capitalize "Euclidean" in glossary.tex

* Remove spaces after references, check that publication names are in italic

* Add check_collection to gtda/utils/__init__.py (#491)

* Add check_collection to gtda/utils/__init__.py

* Improve paths in some tests

* Add option to contract Mapper vertices when pairs are in a subset relation, change layout in interactive mode (#456)

* Add contract_nodes kwarg to Nerve

- When True, igraph.Graph's contract_vertices is employed together with the limit flow by a mapping with fixed points to execute the contraction
- Add a test for this option

* Filter out nodes from same pullback set when checking intersections, to increase performance

* Ensure that setting min_intersection < 1 does not lead to fully connected graphs

* Remove unused variable name in ParallelClustering

* Make nerve parameters interactive in plot_interactive_mapper_graph, refactor get_widgets_per_params, change layouts

* Improve method_to_transform code and docstring

* Small improvements to transformer_from_callable_on_rows

* Linting throughout gtda.mapper

* Remove effective_metric_params_ from Eccentricity

* Remove unused import and sort the remaining ones

* Provide an explicit name for the substituition

* Copy images from examples to notebooks

* Remove the use of imgonverter, change the logo back to svg and add supported image types list

* Name for substitution did not work

* Wording improvements in persistent_homology_graphs.ipynb

* Typo fix in notebook

* Fix variable name error in GraphGeodesicDistance.transform

* Extend Padder and Inverter to greyscale images (#489)

* Extend Padder to greyscale images

* Extend Inverter to greyscale images

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>

Co-authored-by: wreise <wojciech.reise@epfl.ch>

* Add notebook for topological time series classification to examples (#458)

* Add notebook for topological time series classification example

* Add Takens embedding GIF to images directory

* Add gravitational wave dataset

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix links to time series classification notebook by @lewtun

* Move datasets.py and gravitational-wave-signals.npy to new data subfolder, and rename them

* Fix errors in check_point_clouds docstrings

* Try easier paths in See also

Test for @wreise

* Create captions across all notebooks

* Small wording improvement in persistent_homology_graphs.ipynb

* Change the path (sub)package in the index

* Implement double backticks consistently

* Simplify some captions

* Slightly reduce size of plots by plot_point_cloud

* Fix some errors in make_mapper_pipeline docs

* Fix typo in Nerve docs

* Change citation style for consistency

* Remove new lines in array printouts in docs

* Make citation style more consistent throughout

* Reorder notebooks, modularize Makefile

* Reformat the references

* More link and backtick fixes in mapper_quickstart

* More RST fixes in notebooks

* Fix linting

* Fix an error in WeakAlphaPersistence Notes

* Fix URLs for GUDHI

* Fix TakensEmbedding docstrings

* Linting

* Try changing some SVG attributes for better display

* Further SVG improvements

* Update CODE_AUTHORS.rst (#497)

* More Mapper tests, small change to `max_fraction` in FirstSimpleGap and FirstHistogramGap (#412)

* Add first tests for plot_interactive_mapper_graph

* Change deprecated 'overflow_y' to 'overflow' property in_logging, and remove unnecessary warning catching

* Avoid name clashes with Python built-ins throughout mapper/tests

* Slightly change meaning of `max_fraction` in FirstSimpleGap and FirstHistogramGap

Make the default 1. instead of None, and give it a simpler interpretation: (the floor of) max_fraction * n_samples is the maximum number of clusters the algorithm can return

* Implement a looser criterion for the hk_pi_big_sigma test

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix some warnings seen in tests (#498)

- `continuous_updates` is deprecated from ToggleButton
- `np.stack` should not take generators
- should explicitly pass dtype=object for ragged arrays

* Add the giotto-tda icon in the browser tab (#500) (favicon)

* Add NumberOfPoints (#496)

* Add NumberOfPoints

* Make See alsos more consistent throughout diagrams.features

Sined-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix and improve validate_params (#502)

* Fix validate_params bug and improve behaviour when both numeric and list-like types are allowed

* Add a test

* Fix mistake in validate_params docs

* Pybind11 as a git submodule (#459)

* Add MODULE keyword in CMakeLists.txt to explicitly show expectation

* Update collapser to latest standard in the CMake file

* Add pybind11 as a submodule

* Remove downloading pybind11 from setup.py

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix metric_params documentation in KNeighborsGraph

* Improve ripser bindings performance (#501)

* Update ripser & collapser bindings to allow pass by reference with numpy

* Update making the distance matrix triangular more memory friendly

* Remove unnecessary dict inherited from ripser.py cython

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>

* Add MNIST image classification example notebook (#477)

* Add MNIST image classification/full blown ML example notebook

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Add ComplexPolynomial (#479)

* Add ComplexPolynomial

Signed-off-by: Guillaume Tauzin <guillaume.tauzin@inait.ai>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>

* Fix precomputed behaviour in KNeighborsGraph (#506)

* Fix precomputed behaviour in KNeighborsGraph

- Avoid warnings when metric is passed as 'precomputed'
- Fix errors in docstrings and improve wording
- Improve tests

* Fix #499 by covering case of list input in CubicalPersistence.fit (#503)

* Improve memory use in ripser_interface (#507)

Implement suggestions in #501 (comment)
- Use scipy's squareform function for fast extraction of the upper diagonal part of dm

* Add DensityFiltration (#473)

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Add metaestimators subpackage with CollectionTransformer meta-estimator (#495)

* Add gtda/metaestimators with CollectionTransformer meta-estimator 

* Fix docstring for make_mapper_pipeline

* Cchange parallel_backend_prefer default to None in ParallelClustering and make_mapper_pipeline

* Cross-reference two time series notebooks with See also

* Improve time series classification notebook by making use of CollectionTransformer

* Temp fix for macOS CI failures by preventing automatic brew cleanup (#513)

Also make mapper tests more lenient to reduce number of flaky tests

Co-authored-by: ulupo <umberto.lupo@googlemail.com>

* Add a curves subpackage and a StandardFeatures transformer to extract features from them (#480)

* Add curves subpackage with StandardFeatures

* Reshape output of PersistenceLandscape so that it's a multi channel curve

* Add curves to main init and to rst docs

* Add Feature extraction subtitle in curves.rst

* Split simplicial homology into undirected and directed subsections

* Fix typo in validate_params

* Raise deadlines for some mapper tests

* Remove notebook tests even in macOS jobs

Signed-off-by: Guillaume Tauzin <guillaume.tauzin@inait.ai>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Add curves.Derivative (#492)

* Add curves.Derivative

* Fix curves.StandardFeatures docs

Signed-off-by: Guillaume Tauzin <guillaume.tauzin@inait.ai>

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>

* Raise test coverage and make ParallelClustering clusterer parameter required (#508)

* Improve test coverage of mapper visualization tools

* Make pandas part of test requirements in setup.py

* Add pandas installation in Linux Azure jobs

* Avoid notebooks checks even in macOS unless notebooks_checks is true

* Make clusterer a required parameter in ParallelClustering, add ParallelClustering tests

* Change kind of error for bad transformers in CollectionTransformer

* Improve check_diagrams and its tests

* Use validate_params in plot_point_cloud

* Cover plotting of diagrams with infinite deaths in simplicial tests

* Prepare for v0.3.0 release (#516)

* Bump version to 0.3.0

* Add release notes

* Minor corrections in notebooks

Co-authored-by: wreise <wojciech.reise@epfl.ch>

Co-authored-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Wojciech Reise <reisewojtus@gmail.com>
Co-authored-by: Wojciech Reise <32167173+wreise@users.noreply.github.com>
Co-authored-by: Umberto <u.lupo@l2f.ch>
Co-authored-by: Nicholas Sale <nicholas.j.sale@gmail.com>
Co-authored-by: MonkeyBreaker <julian.burellaperez@heig-vd.ch>
Co-authored-by: REDS institute <reds-heig@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>
Co-authored-by: ulupo <umberto.lupo@googlemail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants