Grouped instance metric inherit from InstanceMetrics (#452)

* add tests for grouped instance metrics
* modify InstanceMetric to accept grouped_mean reduction
* initial commit
* apply ruff formatting
* apply ruff formatting, reduce complexity
* merge with main
* initial commit
* rename grouped instance metrics so artifact type and name correspond
* rename grouped instance metrics so artifact type and name correspond
* rename grouped instance metrics so artifact type and name correspond
* remove newline formatting
* remove (catalog from removed metric)
* fix some variation in expected values
* add catching of nanmean warning; fix InstanceMetric verification function
* InstanceMetric need to specify ci_scores for fields that have calculated CIs. score_based_confidence_interval accepts list of score fields without defining bootstrap function
* add ci_scores to several InstanceMetrics move aggregate_instance_scores as static method to MetricWithConfidenceInterval so can be used in score_based_confidence_interval
* ruff formatting
* add test_grouped_instance_metric_errors for code coverage
* add grouped instance metrics with normalized Cohen's h aggregation function
* add normalized Cohen's h
* merge with main
* change description of group_instance_metrics test since is no longer inheriting from GroupMetric
* checkout from main
* slight difference in results for confidence interval between Travis and local for Cohen's H
* add note for grouped instance CI for Cohen + StringContainment
* add documentation to InstanceMetric group_mean reduction validation rename aggregate_instance_scores with average_instance_scores add _ directly to ci prefix
* rename field as group_aggregation_func; use resample_from_non_nan in globalmetric confidence interval to ensure scores are not NaN
* rename field as group_aggregation_func; use resample_from_non_nan in globalmetric confidence interval to ensure scores are not NaN
* use same predictions and references for tokenoverlap as the other metrics
* add additional comments to resample_from_non_nan from original version
* use same references and predictions for tokenoverlap as for other grouped instance metrics; remove np.nan conditions on CIs since doesn't happen in our examples.
* return global result to CI test for grouped instance because of tokenoverlap
* add interpretation option and comment to cohen's h
* add group_mean_subgroup_comparison reduction to InstanceMetric; update CIs for Cohen's h and PDR classes that were incorrectly specified before
* modify test_grouped_instance_metric_errors to take into account boolean third field in reduction. Modify confidence intervals according to fixed grouping or not.
* class InstanceMetric can have group reductions done either taking the groups as fixed, or not
* add FixedGroupMeanAccuracy. Modify expected global results to take into account grouping.
* add notes to cohen's h
* add other_mean and baseline_mean functions. Combine the subgroup_comparison reduction into the group_mean. Any metric that uses fixed group sampling renamed FixedGroup...
* import statistics.mean at the top
* remove __name__
* Delete src/unitxt/catalog/metrics/group_mean_accuracy.json replace with file in robustness directory
* remove from catalog
* move to own directory
* return class name
* write metrics to robustness directory in catalog
* rename others to paraphrase; use variant_score_dict rather than is_baseline boolean indicator, to accommodate cases where there are >2 variant types and we want to run two or more metrics on them
* initial commit
* fix type hint in validate_variant_types
* fix type hint in validate_variant_types
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_mean_others_accuracy.json metric was renamed
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_mean_others_string_containment.json metric was renamed
* initial commit
* implement PR changes; rename variant to subgroup; add Cohen's d metric
* correct condition on cohen's d sample sizes
* adapt PDR, Cohens' D and H to accept a list of list of labels (so that a comparison group can consist of multiple sub-groups)
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_accuracy.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_string_containment.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_accuracy.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_string_containment.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_pdr_accuracy.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_pdr_string_containment.json rename metric
* rename to include string 'paraphrase' to distinguish from 'all variants'
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_paraphrase_accuracy.json rename to Hedges' g
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_paraphrase_string_containment.json rename to Hedges' g
* redefine Cohen's d as Hedge's g, with correction. for grouped comparison aggregations, use two list arguments rather than a single list argument with two sub-lists
* rename Cohen's d
* add ZeroDivisionError in Hedge's g
* rename Hedges g to Norm Hedges g, and divide by maximum to rescale to -1, 1
* initial commit, rename from hedges_g
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_hedges_g_paraphrase_string_containment.json rename to norm hedges g
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_hedges_g_paraphrase_accuracy.json rename to norm hedges g
* fix PDR so if both means are 0, return 0 rather than NaN
* final PR changes, remove agg_func definition
* remove checks on instances in get_group_scores that were already validated
* remove deepcopy
* fix some comments and parameter names. Make TokenOverlap do conversion of prediction and reference to strings internally.
* initial commit
* add absolute value version of Hedges G / Cohens H
* add absolute value version of Hedges G / Cohens H to tests
* changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN
* Revert "changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN" This reverts commit 6249538.
* changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN

---------

Co-authored-by: Samuel Ackerman <samuel.ackerman@ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
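
The commit message mentions a normalized Cohen's h aggregation and a PDR fix that returns 0 rather than NaN when both means are 0. The sketch below illustrates what such group aggregation functions might look like. It is not the code from `src/unitxt/metrics.py`: the function names, argument order, sign convention, reading PDR as a performance drop rate (1 minus the ratio of means), and dividing h by pi (its maximum magnitude) are all assumptions made for illustration.

```python
import math
from statistics import mean


def normalized_cohens_h(baseline_rate, other_rate):
    """Cohen's h between two proportions, divided by pi so it lies in [-1, 1].

    Hypothetical helper: the sign convention (positive when other_rate
    exceeds baseline_rate) is an assumption, not taken from the commit.
    """
    for rate in (baseline_rate, other_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be proportions in [0, 1]")
    # Cohen's h is the difference of arcsine-transformed proportions.
    h = 2 * (math.asin(math.sqrt(other_rate)) - math.asin(math.sqrt(baseline_rate)))
    # |h| is at most pi, so dividing by pi rescales to [-1, 1].
    return h / math.pi


def performance_drop_rate(baseline_scores, other_scores):
    """Relative drop of the comparison mean versus the baseline mean.

    Hypothetical helper. Mirrors the behaviour described in the commit
    message: if both means are 0, return 0 rather than NaN.
    """
    baseline_mean = mean(baseline_scores)
    other_mean = mean(other_scores)
    if baseline_mean == 0:
        # No drop is defined relative to a zero baseline unless both are 0.
        return 0.0 if other_mean == 0 else math.nan
    return 1 - other_mean / baseline_mean


# One group: scores on the original prompt versus its paraphrases.
print(normalized_cohens_h(0.5, 0.5))            # 0.0, identical rates
print(performance_drop_rate([0, 0], [0, 0]))    # 0.0, both means are 0
```

Rescaling to a bounded range makes effect sizes from different groups comparable before they are averaged, which is presumably why the commit also divides Hedges' g by a maximum.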
IBM · Feb 22, 2024 · a0443c2
1 parent 3637f8e
commit a0443c2
Showing 23 changed files with 2,077 additions and 79 deletions.
diff --git a/prepare/metrics/grouped_instance_metrics.py b/prepare/metrics/grouped_instance_metrics.py
diff --git a/prepare/metrics/roc_auc.py b/prepare/metrics/roc_auc.py
@@ -12,11 +12,11 @@
 instance_targets = [{"roc_auc": np.nan, "score": np.nan, "score_name": "roc_auc"}] * 3
 global_targets = {
     "roc_auc": 0.5,
-    "roc_auc_ci_high": np.nan,
-    "roc_auc_ci_low": np.nan,
+    "roc_auc_ci_high": 0.9,
+    "roc_auc_ci_low": 0.5,
     "score": 0.5,
-    "score_ci_high": np.nan,
-    "score_ci_low": np.nan,
+    "score_ci_high": 0.9,
+    "score_ci_low": 0.5,
     "score_name": "roc_auc",
 }
 

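The expected confidence-interval bounds in the hunk above change from `np.nan` to finite values, which matches the commit message's note that the global-metric confidence interval now resamples non-NaN values. The sketch below shows the general idea with the standard library only. The helper names are hypothetical, and the simple percentile interval is an assumption chosen for brevity; the library may compute its interval differently.

```python
import math
import random


def fill_nan_by_resampling(values, rng):
    """Replace NaN entries with draws (with replacement) from the non-NaN ones.

    Hypothetical helper illustrating the idea behind resample_from_non_nan;
    if every value is NaN, or none is, the input is returned unchanged.
    """
    valid = [v for v in values if not math.isnan(v)]
    if not valid or len(valid) == len(values):
        return list(values)
    return [rng.choice(valid) if math.isnan(v) else v for v in values]


def percentile_interval(values, confidence=0.95):
    """Simple percentile interval over bootstrap scores (an assumption here)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low_index = math.floor((1 - confidence) / 2 * last)
    high_index = math.ceil((1 + confidence) / 2 * last)
    return ordered[low_index], ordered[high_index]


# Bootstrap replicates of a global score; some replicates failed and gave NaN.
bootstrap_scores = [0.5, math.nan, 0.7, math.nan, 0.9]
filled = fill_nan_by_resampling(bootstrap_scores, random.Random(0))
low, high = percentile_interval(filled)
print(low, high)  # both bounds are finite because no NaN remains
```

Without the fill step, a single NaN replicate can propagate into both bounds, which is how the earlier expected values ended up as `np.nan`.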
diff --git a/...itxt/catalog/metrics/robustness/fixed_group_absval_norm_cohens_h_paraphrase_accuracy.json b/...itxt/catalog/metrics/robustness/fixed_group_absval_norm_cohens_h_paraphrase_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_absval_norm_cohens_h_paraphrase_accuracy"
+}
diff --git a/...og/metrics/robustness/fixed_group_absval_norm_cohens_h_paraphrase_string_containment.json b/...og/metrics/robustness/fixed_group_absval_norm_cohens_h_paraphrase_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_absval_norm_cohens_h_paraphrase_string_containment"
+}
diff --git a/...itxt/catalog/metrics/robustness/fixed_group_absval_norm_hedges_g_paraphrase_accuracy.json b/...itxt/catalog/metrics/robustness/fixed_group_absval_norm_hedges_g_paraphrase_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_absval_norm_hedges_g_paraphrase_accuracy"
+}
diff --git a/...og/metrics/robustness/fixed_group_absval_norm_hedges_g_paraphrase_string_containment.json b/...og/metrics/robustness/fixed_group_absval_norm_hedges_g_paraphrase_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_absval_norm_hedges_g_paraphrase_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_mean_accuracy.json b/src/unitxt/catalog/metrics/robustness/fixed_group_mean_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_mean_accuracy"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_mean_baseline_accuracy.json b/src/unitxt/catalog/metrics/robustness/fixed_group_mean_baseline_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_mean_baseline_accuracy"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_mean_baseline_string_containment.json b/src/unitxt/catalog/metrics/robustness/fixed_group_mean_baseline_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_mean_baseline_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_mean_paraphrase_accuracy.json b/src/unitxt/catalog/metrics/robustness/fixed_group_mean_paraphrase_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_mean_paraphrase_accuracy"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_mean_paraphrase_string_containment.json b/src/unitxt/catalog/metrics/robustness/fixed_group_mean_paraphrase_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_mean_paraphrase_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_mean_string_containment.json b/src/unitxt/catalog/metrics/robustness/fixed_group_mean_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_mean_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_paraphrase_accuracy.json b/src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_paraphrase_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_norm_cohens_h_paraphrase_accuracy"
+}
diff --git a/...t/catalog/metrics/robustness/fixed_group_norm_cohens_h_paraphrase_string_containment.json b/...t/catalog/metrics/robustness/fixed_group_norm_cohens_h_paraphrase_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_norm_cohens_h_paraphrase_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_norm_hedges_g_paraphrase_accuracy.json b/src/unitxt/catalog/metrics/robustness/fixed_group_norm_hedges_g_paraphrase_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_norm_hedges_g_paraphrase_accuracy"
+}
diff --git a/...t/catalog/metrics/robustness/fixed_group_norm_hedges_g_paraphrase_string_containment.json b/...t/catalog/metrics/robustness/fixed_group_norm_hedges_g_paraphrase_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_norm_hedges_g_paraphrase_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_pdr_paraphrase_accuracy.json b/src/unitxt/catalog/metrics/robustness/fixed_group_pdr_paraphrase_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_pdr_paraphrase_accuracy"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/fixed_group_pdr_paraphrase_string_containment.json b/src/unitxt/catalog/metrics/robustness/fixed_group_pdr_paraphrase_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "fixed_group_pdr_paraphrase_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/group_mean_accuracy.json b/src/unitxt/catalog/metrics/robustness/group_mean_accuracy.json
@@ -0,0 +1,3 @@
+{
+    "type": "group_mean_accuracy"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/group_mean_string_containment.json b/src/unitxt/catalog/metrics/robustness/group_mean_string_containment.json
@@ -0,0 +1,3 @@
+{
+    "type": "group_mean_string_containment"
+}
diff --git a/src/unitxt/catalog/metrics/robustness/group_mean_token_overlap.json b/src/unitxt/catalog/metrics/robustness/group_mean_token_overlap.json
@@ -0,0 +1,3 @@
+{
+    "type": "group_mean_token_overlap"
+}
diff --git a/src/unitxt/metrics.py b/src/unitxt/metrics.py