Commit
Grouped instance metric inherit from InstanceMetrics (#452)
* add tests for grouped instance metrics

* modify InstanceMetric to accept grouped_mean reduction

* initial commit

* apply ruff formatting

* apply ruff formatting, reduce complexity

* merge with main

* initial commit

* rename grouped instance metrics so artifact type and name correspond

* remove newline formatting

* remove (catalog from removed metric)

* fix some variation in expected values

* add catching of nanmean warning; fix InstanceMetric verification function

* InstanceMetric needs to specify ci_scores for fields that have calculated CIs. score_based_confidence_interval accepts a list of score fields without defining a bootstrap function

* add ci_scores to several InstanceMetrics
move aggregate_instance_scores as static method to MetricWithConfidenceInterval so can be used in score_based_confidence_interval

* ruff formatting

* add test_grouped_instance_metric_errors for code coverage

* add grouped instance metrics with normalized Cohen's h aggregation function

* add normalized Cohen's h

* merge with main

* change description of group_instance_metrics test since it no longer inherits from GroupMetric

* checkout from main

* slight difference in results for confidence interval between Travis and local for Cohen's H

* add note for grouped instance CI for Cohen + StringContainment

* add documentation to InstanceMetric group_mean reduction validation
rename aggregate_instance_scores to average_instance_scores
add _ directly to ci prefix

* rename field as group_aggregation_func;
use resample_from_non_nan in globalmetric confidence interval to ensure scores are not NaN

* use same predictions and references for tokenoverlap as the other metrics

* add additional comments to resample_from_non_nan from original version

* use same references and predictions for tokenoverlap as for other grouped instance metrics; remove np.nan conditions on CIs since doesn't happen in our examples.

* return global result to CI test for grouped instance because of tokenoverlap

* add interpretation option and comment to cohen's h

* add group_mean_subgroup_comparison reduction to InstanceMetric; update CIs for Cohen's h and PDR classes that were incorrectly specified before

* modify test_grouped_instance_metric_errors to take into account boolean third field in reduction.
Modify confidence intervals according to fixed grouping or not.

* class InstanceMetric can have group reductions done either taking the groups as fixed, or not

* add FixedGroupMeanAccuracy.  Modify expected global results to take into account grouping.

* add notes to cohen's h

* add other_mean and baseline_mean functions.  Combine the subgroup_comparison reduction into the group_mean.  Any metric that uses fixed group sampling renamed FixedGroup...

* import statistics.mean at the top

* remove __name__

* Delete src/unitxt/catalog/metrics/group_mean_accuracy.json

replace with file in robustness directory

* remove from catalog

* move to own directory

* return class name

* write metrics to robustness directory in catalog

* rename others to paraphrase; use variant_score_dict rather than is_baseline boolean indicator, to accommodate cases where there are >2 variant types and we want to run two or more metrics on them

* initial commit

* fix type hint in validate_variant_types

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_mean_others_accuracy.json

metric was renamed

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_mean_others_string_containment.json

metric was renamed

* initial commit

* implement PR changes; rename variant to subgroup; add Cohen's d metric

* correct condition on cohen's d sample sizes

* adapt PDR, Cohen's d and Cohen's h to accept a list of lists of labels (so that a comparison group can consist of multiple sub-groups)

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_accuracy.json

rename metric

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_string_containment.json

rename metric

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_accuracy.json

rename metric

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_string_containment.json

rename metric

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_pdr_accuracy.json

rename metric

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_pdr_string_containment.json

rename metric

* rename to include string 'paraphrase' to distinguish from 'all variants'

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_paraphrase_accuracy.json

rename to Hedges' g

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_paraphrase_string_containment.json

rename to Hedges' g

* redefine Cohen's d as Hedges' g, with correction.
for grouped comparison aggregations, use two list arguments rather than a single list argument with two sub-lists

* rename Cohen's d

* add ZeroDivisionError in Hedges' g

* rename Hedges' g to Norm Hedges' g, and divide by maximum to rescale to [-1, 1]

* initial commit, rename from hedges_g

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_hedges_g_paraphrase_string_containment.json

rename to norm hedges g

* Delete src/unitxt/catalog/metrics/robustness/fixed_group_hedges_g_paraphrase_accuracy.json

rename to norm hedges g

* fix PDR so if both means are 0, return 0 rather than NaN

* final PR changes, remove agg_func definition

* remove checks on instances in get_group_scores that were already validated

* remove deepcopy

* fix some comments and parameter names.  Make TokenOverlap do conversion of prediction and reference to strings internally.

* initial commit

* add absolute value version of Hedges' g / Cohen's h

* add absolute value version of Hedges' g / Cohen's h to tests

* changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN

* Revert "changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN"

This reverts commit 6249538.

* changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN

---------

Co-authored-by: Samuel Ackerman <samuel.ackerman@ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
3 people committed Feb 22, 2024
1 parent 3637f8e commit a0443c2
Showing 23 changed files with 2,077 additions and 79 deletions.
659 changes: 659 additions & 0 deletions prepare/metrics/grouped_instance_metrics.py

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions prepare/metrics/roc_auc.py
@@ -12,11 +12,11 @@
 instance_targets = [{"roc_auc": np.nan, "score": np.nan, "score_name": "roc_auc"}] * 3
 global_targets = {
     "roc_auc": 0.5,
-    "roc_auc_ci_high": np.nan,
-    "roc_auc_ci_low": np.nan,
+    "roc_auc_ci_high": 0.9,
+    "roc_auc_ci_low": 0.5,
     "score": 0.5,
-    "score_ci_high": np.nan,
-    "score_ci_low": np.nan,
+    "score_ci_high": 0.9,
+    "score_ci_low": 0.5,
     "score_name": "roc_auc",
 }
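
The change in the expected roc_auc confidence-interval targets is consistent with the commit note that the global metric confidence interval now resamples non-NaN values. The following is a minimal sketch of that idea, not the unitxt code: `fill_nan_by_resampling` and `percentile_ci` are hypothetical names, and it assumes that NaN bootstrap statistics are replaced by random draws from the non-NaN ones before the percentiles are taken.

```python
import numpy as np


def fill_nan_by_resampling(values, rng):
    """Replace NaN entries with draws (with replacement) from the non-NaN entries.

    Hypothetical helper: an all-NaN or NaN-free input is returned unchanged.
    """
    values = np.array(values, dtype=float)  # copy so the caller's data is untouched
    nan_mask = np.isnan(values)
    n_nan = int(nan_mask.sum())
    if 0 < n_nan < values.size:
        values[nan_mask] = rng.choice(values[~nan_mask], size=n_nan, replace=True)
    return values


def percentile_ci(stats, level=0.95):
    """Percentile interval over a set of bootstrap statistics."""
    alpha = (1.0 - level) / 2.0
    low, high = np.percentile(stats, [100 * alpha, 100 * (1 - alpha)])
    return float(low), float(high)


rng = np.random.default_rng(0)
# Bootstrap statistics where the metric could not be computed on some resamples.
boot_stats = [0.5, np.nan, 0.75, 1.0, np.nan, 0.5]
filled = fill_nan_by_resampling(boot_stats, rng)
ci_low, ci_high = percentile_ci(filled)
```

Without the fill step, `np.percentile` would propagate the NaN entries and both interval bounds would be NaN.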

@@ -0,0 +1,3 @@
{
"type": "fixed_group_absval_norm_cohens_h_paraphrase_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_absval_norm_cohens_h_paraphrase_string_containment"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_absval_norm_hedges_g_paraphrase_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_absval_norm_hedges_g_paraphrase_string_containment"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_mean_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_mean_baseline_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_mean_baseline_string_containment"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_mean_paraphrase_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_mean_paraphrase_string_containment"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_mean_string_containment"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_norm_cohens_h_paraphrase_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_norm_cohens_h_paraphrase_string_containment"
}
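
The commit message mentions grouped instance metrics with a normalized Cohen's h aggregation, plus an absolute-value variant. As a hedged illustration, Cohen's h for two proportions is the difference of their arcsine-transformed values, which lies in [-pi, pi], so dividing by pi rescales it to [-1, 1]. The function names and the sign convention (comparison minus baseline) below are assumptions, not taken from the diff.

```python
import math


def normalized_cohens_h(p_baseline, p_other):
    """Cohen's h between two proportions, divided by pi to land in [-1, 1].

    Assumed sign convention: positive when p_other exceeds p_baseline.
    """
    for p in (p_baseline, p_other):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    h = 2 * (math.asin(math.sqrt(p_other)) - math.asin(math.sqrt(p_baseline)))
    return h / math.pi


def absval_normalized_cohens_h(p_baseline, p_other):
    """Magnitude-only variant, in [0, 1]."""
    return abs(normalized_cohens_h(p_baseline, p_other))


same = normalized_cohens_h(0.5, 0.5)          # no difference between groups
extreme = normalized_cohens_h(0.0, 1.0)       # largest possible positive effect
drop = normalized_cohens_h(0.75, 0.25)        # accuracy falls from 0.75 to 0.25
```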
@@ -0,0 +1,3 @@
{
"type": "fixed_group_norm_hedges_g_paraphrase_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_norm_hedges_g_paraphrase_string_containment"
}
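
The commit message describes redefining Cohen's d as Hedges' g "with correction", handling a ZeroDivisionError, and dividing by a maximum to rescale to [-1, 1]. The sketch below shows one way those pieces can fit together. It uses the standard small-sample correction factor 1 - 3 / (4(n0 + n1) - 9). The clipping bound `max_abs=5.0`, the NaN return on zero pooled variance, and the function names are assumptions for illustration only.

```python
import math
from statistics import mean, variance


def hedges_g(control, comparison):
    """Standardized mean difference with the small-sample correction."""
    n0, n1 = len(control), len(comparison)
    if n0 < 2 or n1 < 2:
        raise ValueError("each group needs at least two scores")
    pooled_var = (
        (n0 - 1) * variance(control) + (n1 - 1) * variance(comparison)
    ) / (n0 + n1 - 2)
    try:
        d = (mean(comparison) - mean(control)) / math.sqrt(pooled_var)
    except ZeroDivisionError:
        # Assumption: no spread in either group, so the effect size is undefined.
        return float("nan")
    correction = 1 - 3 / (4 * (n0 + n1) - 9)
    return d * correction


def normalized_hedges_g(control, comparison, max_abs=5.0):
    """Clip g to [-max_abs, max_abs] and divide by max_abs (hypothetical bound)."""
    g = hedges_g(control, comparison)
    if math.isnan(g):
        return g
    return max(-1.0, min(g / max_abs, 1.0))


g = hedges_g([1, 2, 3], [2, 3, 4])
g_norm = normalized_hedges_g([1, 2, 3], [2, 3, 4])
```

In this example the uncorrected difference is one pooled standard deviation and the correction factor is 0.8, so `g` is 0.8 and `g_norm` is 0.16.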
@@ -0,0 +1,3 @@
{
"type": "fixed_group_pdr_paraphrase_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "fixed_group_pdr_paraphrase_string_containment"
}
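
The commit message notes a fix to PDR so that it returns 0 rather than NaN when both means are 0. A minimal sketch of that edge case follows, assuming PDR stands for performance drop rate, defined as 1 minus the ratio of the comparison mean to the baseline mean. The function name and the NaN result when only the baseline mean is 0 are assumptions, not confirmed by the diff.

```python
from statistics import mean


def performance_drop_rate(baseline_scores, comparison_scores):
    """Relative drop of the comparison mean with respect to the baseline mean."""
    baseline_mean = mean(baseline_scores)
    comparison_mean = mean(comparison_scores)
    if baseline_mean == 0:
        # Both means 0: no drop is possible, so report 0 instead of 0/0 = NaN.
        # Assumption: a zero baseline with a nonzero comparison stays undefined.
        return 0.0 if comparison_mean == 0 else float("nan")
    return 1 - comparison_mean / baseline_mean


pdr = performance_drop_rate([1, 1, 0, 1], [1, 0, 0, 1])   # means 0.75 and 0.5
pdr_zero = performance_drop_rate([0, 0], [0, 0])
pdr_undefined = performance_drop_rate([0, 0], [1, 0])
```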
@@ -0,0 +1,3 @@
{
"type": "group_mean_accuracy"
}
@@ -0,0 +1,3 @@
{
"type": "group_mean_string_containment"
}
@@ -0,0 +1,3 @@
{
"type": "group_mean_token_overlap"
}
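
The catalog entries above come in `group_mean_*` and `fixed_group_mean_*` variants, and the commit message says group reductions can be done "either taking the groups as fixed, or not". The sketch below illustrates one reading of that distinction: scores are averaged within each group and then across groups, and a bootstrap resample draws either whole groups or individual instances. The field names `group_id` and `score`, the function names, and the mapping of "fixed" to whole-group resampling are all assumptions.

```python
import random
from collections import defaultdict
from statistics import mean


def group_mean(instances):
    """Average the scores within each group, then average the group means."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["group_id"]].append(inst["score"])
    return mean(mean(scores) for scores in groups.values())


def resample(instances, by_group, rng):
    """Draw one bootstrap resample, by whole group or by single instance."""
    if not by_group:
        return rng.choices(instances, k=len(instances))
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["group_id"]].append(inst)
    drawn = rng.choices(list(groups.values()), k=len(groups))
    # Relabel so that a group drawn twice counts as two separate groups.
    return [
        {"group_id": i, "score": inst["score"]}
        for i, group in enumerate(drawn)
        for inst in group
    ]


instances = [
    {"group_id": "a", "score": 1.0},
    {"group_id": "a", "score": 0.0},
    {"group_id": "b", "score": 1.0},
    {"group_id": "b", "score": 1.0},
    {"group_id": "b", "score": 1.0},
]
grouped = group_mean(instances)                        # (0.5 + 1.0) / 2
ungrouped = mean(i["score"] for i in instances)        # 4 / 5
rng = random.Random(0)
boot = [group_mean(resample(instances, True, rng)) for _ in range(50)]
```

With whole-group resampling each group's internal composition is preserved, so every bootstrap value in this example is one of 0.5, 0.75, or 1.0.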
1,053 changes: 978 additions & 75 deletions src/unitxt/metrics.py

Large diffs are not rendered by default.
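
Since the metrics.py diff is not shown, here is a hedged sketch of one item from the commit message, "add catching of nanmean warning". `np.nanmean` emits a RuntimeWarning when every value is NaN, and it still returns NaN in that case. The wrapper name `nan_mean` is an assumption, not necessarily the function in the diff.

```python
import warnings

import numpy as np


def nan_mean(scores):
    """Mean that ignores NaN entries and stays silent on an all-NaN input."""
    with warnings.catch_warnings():
        # np.nanmean warns ("Mean of empty slice") when nothing is left to average.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return float(np.nanmean(scores))


partial = nan_mean([1.0, np.nan, 3.0])
empty = nan_mean([np.nan, np.nan])
```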
