Test all metrics against sklearn (with many input trials) #3230
Comments
I don't think a seed is required here.
I would agree that all metrics should be deterministic, so using different seeds only increases the coverage of cases; as long as you pass the same values to the two functions, it should always give the same result, right?
Agreed, this is a good point.
@awaelchli I agree that we need such a test. We already do that for many metrics:
Hi! I am not a contributor, but I am a user of a library called hypothesis, which may be well suited to this specific case. The library lets the user write parameterized tests and then chooses the cases that are most likely to make the program fail; that is, it is very thorough about edge cases and can really help find the ones that are problematic for the implementation.
Hi, thank you for your recommendation, I'm just not sure I follow it. Would you mind writing a bit more about how you would use https://hypothesis.works in PL...
The idea would be to write a hypothesis test that checks the sklearn metrics against the PL metrics and let the library explore the corner cases (as well as the "common" ones), asserting that both implementations agree, without the need to design the cases by hand or search for complicated patterns.
@CamiVasz that sounds cool, would you mind drafting a small example of how to use it? Eventually we can extend it to more PL cases... 🐰
https://colab.research.google.com/drive/1Dprqr1nbtgCFwsyUyb6UbXe9FE7X73Q5?usp=sharing |
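The notebook above is external, so here is a minimal inline sketch of the same idea (an illustration, not code from the thread): `my_accuracy` is a hypothetical stand-in for a PL metric, checked against sklearn's `accuracy_score`, with Hypothesis generating the label/prediction pairs.

```python
# Hedged sketch: `my_accuracy` is a hypothetical stand-in for the PL metric
# under test; sklearn's accuracy_score is the reference implementation.
import numpy as np
from hypothesis import given, settings
from hypothesis import strategies as st
from sklearn.metrics import accuracy_score


def my_accuracy(preds, target):
    # Fraction of positions where the prediction equals the target.
    return float((np.asarray(preds) == np.asarray(target)).mean())


# Each drawn list is a sequence of (target, prediction) pairs, so the two
# label vectors always have equal length; Hypothesis steers generation
# toward tricky cases (single element, all-equal labels, ...).
@settings(max_examples=100)
@given(st.lists(st.tuples(st.integers(0, 4), st.integers(0, 4)),
                min_size=1, max_size=50))
def test_accuracy_matches_sklearn(pairs):
    target, preds = (np.array(v) for v in zip(*pairs))
    assert my_accuracy(preds, target) == accuracy_score(target, preds)
```

Calling `test_accuracy_matches_sklearn()` directly (or letting pytest collect it) runs the property; on failure Hypothesis shrinks the input to a minimal counterexample.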
@CamiVasz looks cool, but could you tell me the difference between hypothesis and just creating two random tensors myself (using
Hypothesis generation is biased towards edge cases, maximizing the probability of failure. When you generate random numbers, the edge cases you want to find have the same probability of appearing as the easy cases.
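To illustrate the difference (an illustrative example, not from the thread): Hypothesis can actively search its strategy space for a pathological input such as `nan` or `inf`, while uniform random sampling essentially never produces one.

```python
import math
import random

from hypothesis import find
from hypothesis import strategies as st

# find() searches the strategy for a minimal input satisfying the predicate;
# st.floats() includes special values such as nan and inf by default.
nan_case = find(st.floats(), math.isnan)
inf_case = find(st.floats(), math.isinf)
assert math.isnan(nan_case)
assert math.isinf(inf_case)

# Plain uniform sampling over [0, 1) will never produce nan or inf.
random.seed(0)  # fixed seed, purely for reproducibility of the demo
samples = [random.random() for _ in range(10_000)]
assert not any(math.isnan(x) or math.isinf(x) for x in samples)
```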
Just found that pytorch is also using hypothesis |
shall we open a new issue just for hypothesis testing? :] |
Yeah, maybe @CamiVasz can do that and I think it would also be great to see an example of it using one of our tests, to show the motivation. |
it would be great to have it as a HackOctober issue :]
It would be great to work on that! Is this still on board? |
@justusschock @SkafteNicki @ananyahjha93 @teddykoker Do you guys need help with testing the new metrics? @CamiVasz wants to help. |
Yeah sure. I think the whole functional API would be a good place to start. And we could then later extend it to the revamped class interface |
Hand-chosen values are not enough, we need to test with a large batch of inputs where possible.
Something in this style, maybe with a fixed seed:
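The snippet that followed is not shown above; as a hedged sketch of what such a seeded many-trial comparison could look like (using NumPy arrays rather than torch tensors for brevity, and a hypothetical `my_accuracy` in place of the actual PL metric):

```python
import numpy as np
from sklearn.metrics import accuracy_score


def my_accuracy(preds, target):
    # Hypothetical stand-in for the PL metric under test.
    return float((np.asarray(preds) == np.asarray(target)).mean())


rng = np.random.default_rng(1234)  # fixed seed: any failure is reproducible
for _ in range(100):
    # Vary both the batch size and the label values across trials.
    n = int(rng.integers(1, 100))
    target = rng.integers(0, 10, size=n)
    preds = rng.integers(0, 10, size=n)
    assert my_accuracy(preds, target) == accuracy_score(target, preds)
```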