Add Wilcoxon's signed rank test #237

Merged
merged 3 commits into from
Aug 11, 2022
2 changes: 1 addition & 1 deletion comparisons/mcnemar/README.md
@@ -68,7 +68,7 @@ print(results)

## Limitations and bias

-The McNemar test is a non-parametric test, so it has relatively few assumptions (basically only that the observations are independent). It should be used used to analyze paired nominal data only.
+The McNemar test is a non-parametric test, so it has relatively few assumptions (basically only that the observations are independent). It should be used to analyze paired nominal data only.

## Citations

70 changes: 70 additions & 0 deletions comparisons/wilcoxon/README.md
@@ -0,0 +1,70 @@
---
title: Wilcoxon
emoji: 🤗
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- comparison
description: >-
  Wilcoxon's test is a signed-rank test for comparing paired samples.
---


# Comparison Card for Wilcoxon

## Comparison description

Wilcoxon's test is a non-parametric signed-rank test that tests whether the distribution of the differences is symmetric about zero. It can be used to compare the predictions of two models.
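
To make the description concrete, here is a minimal pure-Python sketch of how the signed-rank sums can be computed by hand. The helper name `signed_rank_statistic` is hypothetical, and the sketch assumes there are no zero differences and no tied absolute differences. It illustrates the textbook procedure and is not the code this comparison runs.

```python
def signed_rank_statistic(predictions1, predictions2):
    """Return (w_plus, w_minus) for paired samples.

    Sketch only: assumes no zero differences and no tied absolute differences.
    """
    if len(predictions1) != len(predictions2):
        raise ValueError("paired samples must have the same length")
    # Paired differences between the two models.
    diffs = [a - b for a, b in zip(predictions1, predictions2)]
    # Order the indices by absolute difference (rank 1 = smallest).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = w_minus = 0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            w_plus += rank   # rank belongs to a positive difference
        else:
            w_minus += rank  # rank belongs to a negative difference
    return w_plus, w_minus


w_plus, w_minus = signed_rank_statistic(
    [-7, 123.45, 43, 4.91, 5], [1337.12, -9.74, 1, 2, 3.21]
)
print(w_plus, w_minus, min(w_plus, w_minus))  # 10 5 5
```

The smaller of the two rank sums (5 here) matches the `stat` value shown in the example further down.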

## How to use

The Wilcoxon comparison is used to analyze paired ordinal data.

## Inputs

Its arguments are:

`predictions1`: a list of predictions from the first model.

`predictions2`: a list of predictions from the second model.

## Output values

The Wilcoxon comparison returns two values:

`stat`: The Wilcoxon statistic.

`p`: The p value.
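
For small samples, the p value can be understood by enumerating every possible assignment of signs to the ranks. The following sketch (the function name `exact_two_sided_p` is hypothetical) computes an exact two-sided p value under the assumption of `n` nonzero, untied differences. It illustrates the textbook exact method and is not claimed to mirror the underlying library's implementation.

```python
from itertools import product


def exact_two_sided_p(statistic, n):
    """Exact two-sided p value for the smaller rank sum, given n untied, nonzero pairs."""
    # Under the null hypothesis each rank 1..n is positive or negative with
    # probability 1/2, so all 2**n sign patterns are equally likely.
    count = 0
    for signs in product((0, 1), repeat=n):
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if w_plus <= statistic:
            count += 1
    # Double the one-sided tail probability and cap the result at 1.
    return min(1.0, 2 * count / 2**n)


print(exact_two_sided_p(5, 5))  # 0.625
```

With five pairs and a statistic of 5, ten of the 32 sign patterns give a rank sum of at most 5, so the two-sided p value is 2 × 10 / 32 = 0.625.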

## Examples

Example comparison:

```python
import evaluate

wilcoxon = evaluate.load("wilcoxon")
results = wilcoxon.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
print(results)
# {'stat': 5.0, 'p': 0.625}
```

## Limitations and bias

The Wilcoxon test is a non-parametric test, so it has relatively few assumptions: chiefly that the pairs are independent of one another. It should be used only for paired data that is at least ordinal, so that the differences can be ranked.
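
Ordinal predictions often produce zero differences and tied absolute differences, which the ranking step has to handle somehow. The sketch below shows one common convention: drop zero differences and give tied values the average of the ranks they span. The helper name `signed_ranks` is hypothetical. SciPy's default `zero_method="wilcox"` also discards zero differences, but this code is an illustration, not SciPy's implementation.

```python
def signed_ranks(diffs):
    """Signed average ranks after dropping zero differences (hypothetical helper)."""
    nonzero = [d for d in diffs if d != 0]
    ordered = sorted(nonzero, key=abs)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute values.
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Positions i..j-1 (0-based) share the average of ranks i+1..j.
        ranks[abs(ordered[i])] = (i + 1 + j) / 2
        i = j
    return [ranks[abs(d)] if d > 0 else -ranks[abs(d)] for d in nonzero]


print(signed_ranks([0, 2, -2, 5, -1]))  # [2.5, -2.5, 4.0, -1.0]
```

Here the zero is discarded and the two differences of magnitude 2 share ranks 2 and 3, so each receives 2.5.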

## Citations

```bibtex
@incollection{wilcoxon1992individual,
title={Individual comparisons by ranking methods},
author={Wilcoxon, Frank},
booktitle={Breakthroughs in statistics},
pages={196--202},
year={1992},
publisher={Springer}
}
```
6 changes: 6 additions & 0 deletions comparisons/wilcoxon/app.py
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("wilcoxon", module_type="comparison")
launch_gradio_widget(module)
3 changes: 3 additions & 0 deletions comparisons/wilcoxon/requirements.txt
@@ -0,0 +1,3 @@
git+https://github.com/huggingface/evaluate@a45df1eb9996eec64ec3282ebe554061cb366388
datasets~=2.0
scipy
78 changes: 78 additions & 0 deletions comparisons/wilcoxon/wilcoxon.py
@@ -0,0 +1,78 @@
# Copyright 2022 The HuggingFace Evaluate Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Wilcoxon test for model comparison."""

import datasets
from scipy.stats import wilcoxon

import evaluate


_DESCRIPTION = """
Wilcoxon's test is a non-parametric signed-rank test that tests whether the distribution of the differences is symmetric about zero. It can be used to compare the predictions of two models.
"""


_KWARGS_DESCRIPTION = """
Args:
    predictions1 (`list` of `float`): Predictions for model 1.
    predictions2 (`list` of `float`): Predictions for model 2.

Returns:
    stat (`float`): Wilcoxon test score.
    p (`float`): The p value. Minimum possible value is 0. Maximum possible value is 1.0. A lower p value means a more significant difference.

Examples:
    >>> wilcoxon = evaluate.load("wilcoxon")
    >>> results = wilcoxon.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
    >>> print(results)
    {'stat': 5.0, 'p': 0.625}
"""


_CITATION = """
@incollection{wilcoxon1992individual,
title={Individual comparisons by ranking methods},
author={Wilcoxon, Frank},
booktitle={Breakthroughs in statistics},
pages={196--202},
year={1992},
publisher={Springer}
}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Wilcoxon(evaluate.Comparison):
    def _info(self):
        return evaluate.ComparisonInfo(
            module_type="comparison",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions1": datasets.Value("float"),
                    "predictions2": datasets.Value("float"),
                }
            ),
        )

    def _compute(self, predictions1, predictions2):
        # calculate difference
        d = [p1 - p2 for (p1, p2) in zip(predictions1, predictions2)]

        # compute statistic
        res = wilcoxon(d)
        return {"stat": res.statistic, "p": res.pvalue}