Added metric Brier Score #275

Merged · 13 commits · Aug 30, 2022
96 changes: 96 additions & 0 deletions metrics/brier_score/README.md
@@ -0,0 +1,96 @@
---
title: Brier Score
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  The Brier score measures the mean squared error between predicted probabilities and the actual binary outcomes.
---

# Metric Card for Brier Score


## Metric Description
Brier score is an evaluation metric for classification tasks in which you predict binary outcomes such as win/lose, spam/ham, or click/no-click:

`BrierScore = 1/N * sum( (p_i - o_i)^2 )`

where `p_i` is the predicted probability that the event occurs and `o_i` equals 1 if the event occurred and 0 otherwise. The lower the score, the better the predictions.
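Spelled out in NumPy, the formula is a one-liner; this minimal sketch (independent of the metric module) reproduces the value used in the examples below:

```python
import numpy as np

# Ground-truth outcomes (1 = event occurred) and predicted probabilities.
outcomes = np.array([0, 0, 1, 1])
probabilities = np.array([0.1, 0.9, 0.8, 0.3])

# BrierScore = 1/N * sum((p_i - o_i)^2)
brier = np.mean((probabilities - outcomes) ** 2)
print(brier)  # 0.3375
```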
## How to Use

At minimum, this metric requires predictions and references as inputs.

```python
>>> import evaluate
>>> import numpy as np
>>> brier_score = evaluate.load("brier_score")
>>> predictions = np.array([0.1, 0.9, 0.8, 0.3])
>>> references = np.array([0, 0, 1, 1])
>>> results = brier_score.compute(predictions=predictions, references=references)
```

### Inputs

Mandatory inputs:
- `predictions`: numeric array-like of shape (`n_samples`,) or (`n_samples`, `n_outputs`), representing the predicted probabilities (of the positive class, in the binary case).

- `references`: array-like of shape (`n_samples`,) or (`n_samples`, `n_outputs`), representing the ground truth (correct) target values.

Optional arguments (both illustrated in the sketch below):
- `sample_weight`: numeric array-like of shape (`n_samples`,) representing sample weights. The default is `None`.
- `pos_label`: the label of the positive class. The default is `1`.
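A sketch combining both optional arguments; the weights are hypothetical values chosen for illustration, not from the original examples:

```python
>>> import evaluate
>>> import numpy as np
>>> brier_score = evaluate.load("brier_score")
>>> predictions = np.array([0.1, 0.9, 0.8, 0.3])
>>> references = np.array(["spam", "ham", "ham", "spam"])
>>> results = brier_score.compute(
...     predictions=predictions,
...     references=references,
...     sample_weight=[1.0, 1.0, 2.0, 2.0],  # hypothetical per-sample weights
...     pos_label="ham",  # required because the references are strings
... )
```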


### Output Values
This metric returns a dictionary with the following keys:
- `brier_score` (`float`): the computed Brier score.


Output Example(s):
```python
{'brier_score': 0.5}
```

#### Values from Popular Papers


### Examples
```python
>>> import evaluate
>>> import numpy as np
>>> brier_score = evaluate.load("brier_score")
>>> predictions = np.array([0.1, 0.9, 0.8, 0.3])
>>> references = np.array([0, 0, 1, 1])
>>> results = brier_score.compute(predictions=predictions, references=references)
>>> print(results)
{'brier_score': 0.3375}
```
If `references` contains strings, an error will be raised unless `pos_label` is explicitly specified.
```python
>>> import evaluate
>>> import numpy as np
>>> brier_score = evaluate.load("brier_score")
>>> predictions = np.array([0.1, 0.9, 0.8, 0.3])
>>> references = np.array(["spam", "ham", "ham", "spam"])
>>> results = brier_score.compute(predictions=predictions, references=references, pos_label="ham")
>>> print(results)
{'brier_score': 0.0375}
```
## Limitations and Bias
The [Brier score](https://huggingface.co/metrics/brier_score) is appropriate for binary and categorical outcomes that can be structured as true or false, but it is inappropriate for ordinal variables, which can take on three or more values.
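Concretely, scikit-learn's `brier_score_loss` (which this metric wraps) rejects targets with more than two classes. A minimal sketch of the failure mode, using illustrative data:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 2, 1])  # three classes: not a binary outcome
y_prob = np.array([0.2, 0.7, 0.6, 0.8])

try:
    brier_score_loss(y_true, y_prob)
except ValueError as err:
    print(err)  # sklearn refuses: only binary classification is supported
```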
## Citation(s)
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
kadirnar marked this conversation as resolved.
Show resolved Hide resolved
```
## Further References
- [Brier Score - Wikipedia](https://en.wikipedia.org/wiki/Brier_score)
6 changes: 6 additions & 0 deletions metrics/brier_score/app.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("brier_score")
launch_gradio_widget(module)
97 changes: 97 additions & 0 deletions metrics/brier_score/brier_score.py
@@ -0,0 +1,97 @@
# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Brier Score Metric"""

import datasets
from sklearn.metrics import brier_score_loss

import evaluate


_CITATION = """\
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""

_DESCRIPTION = """\
Brier score is an evaluation metric for classification tasks in which you predict binary outcomes such as win/lose, spam/ham, or click/no-click:
`BrierScore = 1/N * sum( (p_i - o_i)^2 )`
where p_i is the predicted probability of the event and o_i is 1 if the event occurred and 0 otherwise. Lower is better.
"""


_KWARGS_DESCRIPTION = """
Args:
    predictions: Estimated probabilities (of the positive class, in the binary case). shape = [n_samples]
    references: Ground truth (correct) target values. shape = [n_samples]
    sample_weight: Sample weights. Defaults to None.
    pos_label: The label of the positive class. Defaults to 1.
Returns:
    The Brier score.
Examples:
    Example-1: if the references are in {-1, 1} or {0, 1}, pos_label defaults to 1.
        >>> import numpy as np
        >>> brier_score = evaluate.load("brier_score")
        >>> predictions = np.array([0.1, 0.9, 0.8, 0.3])
        >>> references = np.array([0, 0, 1, 1])
        >>> results = brier_score.compute(predictions=predictions, references=references)
        >>> print(results)
        {'brier_score': 0.3375}
    Example-2: if the references contain strings, an error will be raised unless pos_label is explicitly specified.
        >>> import numpy as np
        >>> brier_score = evaluate.load("brier_score")
        >>> predictions = np.array([0.1, 0.9, 0.8, 0.3])
        >>> references = np.array(["spam", "ham", "ham", "spam"])
        >>> results = brier_score.compute(predictions=predictions, references=references, pos_label="ham")
        >>> print(results)
        {'brier_score': 0.0375}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class BrierScore(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(self._get_feature_types()),
            reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.brier_score_loss.html"],
        )

    def _get_feature_types(self):
        if self.config_name == "multilist":
            # Each example is a sequence of values (one per output).
            return {
                "predictions": datasets.Sequence(datasets.Value("float")),
                "references": datasets.Sequence(datasets.Value("float")),
            }
        else:
            # Each example is a single value.
            return {
                "predictions": datasets.Value("float"),
                "references": datasets.Value("float"),
            }

    def _compute(self, predictions, references, sample_weight=None, pos_label=1):
        # sklearn expects (y_true, y_prob): the references are the ground-truth
        # labels and the predictions are the estimated probabilities.
        brier_score = brier_score_loss(references, predictions, sample_weight=sample_weight, pos_label=pos_label)
        return {"brier_score": brier_score}
2 changes: 2 additions & 0 deletions metrics/brier_score/requirements.txt
@@ -0,0 +1,2 @@
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
scikit-learn