Add datasets module to load and generate toy datasets #345

Open
lewtun opened this issue Mar 2, 2020 · 4 comments
Comments

@lewtun
Collaborator

lewtun commented Mar 2, 2020

Description

scikit-learn has a datasets module that provides handy utility functions to load and generate toy datasets. These functions feature prominently in the scikit-learn examples and it would be nice to have a similar functionality in giotto-tda.

Suggestions for synthetic datasets include:

  • make_point_clouds: Generate an array of spheres and tori in 3 dimensions with corresponding labels (useful for showing persistent homology + shape classification).
  • make_time_series: Generate an array of periodic and non-periodic time series with corresponding labels (useful for showing sliding window embeddings and time series classification).
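
As a rough illustration of the second suggestion, here is a minimal sketch of what make_time_series could look like. The signature, the parameter names, and the choice of sine waves versus random walks are all assumptions made for this example; none of it is an existing giotto-tda API.

```python
import numpy as np


def make_time_series(n_samples=10, n_points=200, noise=0.1, random_state=None):
    """Hypothetical helper: periodic series get label 1, non-periodic label 0."""
    rng = np.random.default_rng(random_state)
    t = np.linspace(0.0, 1.0, n_points)
    n_periodic = n_samples // 2
    n_aperiodic = n_samples - n_periodic

    # Periodic class: sine waves with random frequency and phase.
    freqs = rng.uniform(2.0, 8.0, size=(n_periodic, 1))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_periodic, 1))
    periodic = np.sin(2.0 * np.pi * freqs * t + phases)

    # Non-periodic class: random walks rescaled to unit maximum amplitude.
    walks = np.cumsum(rng.normal(size=(n_aperiodic, n_points)), axis=1)
    walks = walks / np.abs(walks).max(axis=1, keepdims=True)

    # Add Gaussian observation noise to every series.
    X = np.vstack([periodic, walks])
    X = X + noise * rng.normal(size=X.shape)
    y = np.concatenate([np.ones(n_periodic, dtype=int),
                        np.zeros(n_aperiodic, dtype=int)])
    return X, y
```

Each row of X is one series, so the output can be passed to a sliding window embedding one row at a time.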

Suggestions for point cloud and graph datasets could take inspiration from PyTorch Geometric's datasets module.

@ammedmar
Collaborator

ammedmar commented Mar 3, 2020

This would be good. @gtauzin and I started working on something along the lines of the make_point_clouds method you are envisioning, and managed to get a few nice spaces and constructions on spaces. The reason this was not completed was the lack of uniformity of the sampling. To get this done well, the probability function has to be modified by a Hessian term associated with the parametrization of the curved space. Maybe we can revisit this point sometime.
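
To make the uniformity issue concrete, here is a minimal sketch for the torus, one of the shapes mentioned in this issue. It assumes the standard angle parametrization, for which the correction factor is the area element of the parametrization, proportional to R + r cos(theta). Rejection sampling is used as one simple way to apply that correction; the function name and parameters are hypothetical and this is not the code referred to above.

```python
import numpy as np


def sample_torus_uniform(n_points, R=2.0, r=1.0, random_state=None):
    """Hypothetical helper: area-uniform samples on a torus (requires R > r)."""
    rng = np.random.default_rng(random_state)
    accepted = []
    n_accepted = 0
    while n_accepted < n_points:
        # Propose the tube angle uniformly, then accept with probability
        # proportional to the area element R + r * cos(theta).
        theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
        u = rng.uniform(0.0, 1.0, n_points)
        keep = theta[u < (R + r * np.cos(theta)) / (R + r)]
        accepted.append(keep)
        n_accepted += keep.size
    theta = np.concatenate(accepted)[:n_points]

    # The area element does not depend on phi, so uniform phi is correct.
    phi = rng.uniform(0.0, 2.0 * np.pi, n_points)
    x = (R + r * np.cos(theta)) * np.cos(phi)
    y = (R + r * np.cos(theta)) * np.sin(phi)
    z = r * np.sin(theta)
    return np.column_stack([x, y, z])
```

Sampling theta uniformly without the acceptance step would put too many points on the inner side of the torus, where the surface has less area per unit angle.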

@lewtun
Collaborator Author

lewtun commented Mar 3, 2020

Cool, it seems you guys went for the hardcore version :) All I had in mind were spheres and tori with Gaussian noise added, but perhaps this is too limiting.
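
For reference, the simple version could look roughly like the sketch below. The signature, default values, and label convention are assumptions for illustration only, and the torus angles are drawn uniformly, so the torus samples are not uniform with respect to surface area.

```python
import numpy as np


def make_point_clouds(n_samples_per_shape=5, n_points=100, noise=0.05,
                      random_state=None):
    """Hypothetical helper: noisy spheres (label 0) and tori (label 1) in 3D."""
    rng = np.random.default_rng(random_state)

    def sphere():
        # Normalized Gaussian vectors are uniform on the unit sphere.
        v = rng.normal(size=(n_points, 3))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    def torus(R=2.0, r=1.0):
        # Uniform angles: simple, but not uniform in surface area.
        theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
        phi = rng.uniform(0.0, 2.0 * np.pi, n_points)
        x = (R + r * np.cos(theta)) * np.cos(phi)
        y = (R + r * np.cos(theta)) * np.sin(phi)
        z = r * np.sin(theta)
        return np.column_stack([x, y, z])

    clouds = [sphere() for _ in range(n_samples_per_shape)]
    clouds += [torus() for _ in range(n_samples_per_shape)]

    # Stack into shape (n_clouds, n_points, 3) and add Gaussian noise.
    X = np.stack(clouds)
    X = X + noise * rng.normal(size=X.shape)
    y = np.repeat([0, 1], n_samples_per_shape)
    return X, y
```

The output has shape (n_clouds, n_points, 3), one point cloud per entry along the first axis.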

If you have some Python code lying around, you could make a GitHub gist and link it in these comments.

@ammedmar
Collaborator

ammedmar commented Mar 3, 2020

The code is not so important, especially since it doesn't do what one would really like it to do, but since you asked, I am sending code that samples a point cloud near the real projective plane embedded in R^4.

To get this done properly, what we need is a method that can sample an interval according to a custom, not necessarily uniform, probability distribution function. Any leads on something like this?

The first part of this notebook has the sampling functions for S^2 and RP^2. I just ran it and the plotting still works.

@wreise
Collaborator

wreise commented Mar 4, 2020

I wanted to have a look at the notebook, but I do not have access rights; you should receive an email requesting them.

For sampling from arbitrary densities, something like Metropolis-Hastings? Or, if the density is represented as a discretized array, maybe inverse transform sampling?
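
A minimal sketch of the inverse transform option, assuming the density is given as non-negative values on a sorted 1D grid over the interval. The function name and arguments are hypothetical. The cumulative distribution is approximated with the trapezoidal rule and inverted by linear interpolation, so the samples are only approximately distributed according to the density.

```python
import numpy as np


def sample_from_discretized_density(grid, density, n_samples, random_state=None):
    """Hypothetical helper: approximate inverse transform sampling on a grid."""
    rng = np.random.default_rng(random_state)
    grid = np.asarray(grid, dtype=float)
    density = np.asarray(density, dtype=float)

    # Cumulative integral of the density (trapezoidal rule), normalized so
    # that it ends at 1. The density itself need not be normalized.
    increments = 0.5 * (density[1:] + density[:-1]) * np.diff(grid)
    cdf = np.concatenate([[0.0], np.cumsum(increments)])
    cdf = cdf / cdf[-1]

    # Invert the CDF: map uniform draws through the interpolated inverse.
    # Works best when the density is positive, so the CDF strictly increases.
    u = rng.uniform(0.0, 1.0, n_samples)
    return np.interp(u, cdf, grid)
```

For example, passing grid = np.linspace(0, np.pi, 200) and density = np.sin(grid) concentrates the samples around the middle of the interval. Metropolis-Hastings would be the alternative when the density can only be evaluated pointwise or lives in higher dimensions.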
