Commit
Grouped instance metric inherit from InstanceMetrics (#452)
* add tests for grouped instance metrics
* modify InstanceMetric to accept grouped_mean reduction
* initial commit
* apply ruff formatting
* apply ruff formatting, reduce complexity
* merge with main
* initial commit
* rename grouped instance metrics so artifact type and name correspond
* rename grouped instance metrics so artifact type and name correspond
* rename grouped instance metrics so artifact type and name correspond
* remove newline formatting
* remove (catalog from removed metric)
* fix some variation in expected values
* add catching of nanmean warning; fix InstanceMetric verification function
* InstanceMetric needs to specify ci_scores for fields that have calculated CIs. score_based_confidence_interval accepts list of score fields without defining bootstrap function
* add ci_scores to several InstanceMetrics; move aggregate_instance_scores as static method to MetricWithConfidenceInterval so it can be used in score_based_confidence_interval
* ruff formatting
* add test_grouped_instance_metric_errors for code coverage
* add grouped instance metrics with normalized Cohen's h aggregation function
* add normalized Cohen's h
* merge with main
* change description of group_instance_metrics test since it is no longer inheriting from GroupMetric
* checkout from main
* slight difference in results for confidence interval between Travis and local for Cohen's h
* add note for grouped instance CI for Cohen + StringContainment
* add documentation to InstanceMetric group_mean reduction validation; rename aggregate_instance_scores to average_instance_scores; add _ directly to ci prefix
* rename field as group_aggregation_func; use resample_from_non_nan in globalmetric confidence interval to ensure scores are not NaN
* rename field as group_aggregation_func; use resample_from_non_nan in globalmetric confidence interval to ensure scores are not NaN
* use same predictions and references for tokenoverlap as the other metrics
* add additional comments to resample_from_non_nan from original version
* use same references and predictions for tokenoverlap as for other grouped instance metrics; remove np.nan conditions on CIs since this doesn't happen in our examples.
* return global result to CI test for grouped instance because of tokenoverlap
* add interpretation option and comment to Cohen's h
* add group_mean_subgroup_comparison reduction to InstanceMetric; update CIs for Cohen's h and PDR classes that were incorrectly specified before
* modify test_grouped_instance_metric_errors to take into account boolean third field in reduction. Modify confidence intervals according to fixed grouping or not.
* class InstanceMetric can have group reductions done either taking the groups as fixed, or not
* add FixedGroupMeanAccuracy. Modify expected global results to take into account grouping.
* add notes to Cohen's h
* add other_mean and baseline_mean functions. Combine the subgroup_comparison reduction into the group_mean. Any metric that uses fixed group sampling renamed FixedGroup...
* import statistics.mean at the top
* remove __name__
* Delete src/unitxt/catalog/metrics/group_mean_accuracy.json replace with file in robustness directory
* remove from catalog
* move to own directory
* return class name
* write metrics to robustness directory in catalog
* rename others to paraphrase; use variant_score_dict rather than is_baseline boolean indicator, to accommodate cases where there are >2 variant types and we want to run two or more metrics on them
* initial commit
* fix type hint in validate_variant_types
* fix type hint in validate_variant_types
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_mean_others_accuracy.json metric was renamed
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_mean_others_string_containment.json metric was renamed
* initial commit
* implement PR changes; rename variant to subgroup; add Cohen's d metric
* correct condition on Cohen's d sample sizes
* adapt PDR, Cohen's d and h to accept a list of lists of labels (so that a comparison group can consist of multiple sub-groups)
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_accuracy.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_string_containment.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_accuracy.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_norm_cohens_h_string_containment.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_pdr_accuracy.json rename metric
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_pdr_string_containment.json rename metric
* rename to include string 'paraphrase' to distinguish from 'all variants'
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_paraphrase_accuracy.json rename to Hedges' g
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_cohens_d_paraphrase_string_containment.json rename to Hedges' g
* redefine Cohen's d as Hedges' g, with correction. For grouped comparison aggregations, use two list arguments rather than a single list argument with two sub-lists
* rename Cohen's d
* add ZeroDivisionError in Hedges' g
* rename Hedges' g to Norm Hedges' g, and divide by maximum to rescale to -1, 1
* initial commit, rename from hedges_g
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_hedges_g_paraphrase_string_containment.json rename to norm hedges g
* Delete src/unitxt/catalog/metrics/robustness/fixed_group_hedges_g_paraphrase_accuracy.json rename to norm hedges g
* fix PDR so if both means are 0, return 0 rather than NaN
* final PR changes, remove agg_func definition
* remove checks on instances in get_group_scores that were already validated
* remove deepcopy
* fix some comments and parameter names. Make TokenOverlap do conversion of prediction and reference to strings internally.
* initial commit
* add absolute value version of Hedges' g / Cohen's h
* add absolute value version of Hedges' g / Cohen's h to tests
* changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN
* Revert "changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN" This reverts commit 6249538.
* changes to global metric confidence interval now resample non-NaN values, so CI will not be NaN

---------

Co-authored-by: Samuel Ackerman <samuel.ackerman@ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
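For readers unfamiliar with the subgroup-comparison aggregations named in the message above, the following is a minimal sketch of two of them: a performance drop rate (PDR) that returns 0 when both means are 0, and a Cohen's h rescaled to [-1, 1]. It is not the code from this commit. The function names, argument names, sign convention, and the handling of a zero baseline with a nonzero comparison mean are all assumptions made for illustration.

```python
import math
from statistics import mean


def performance_drop_rate(baseline_scores, other_scores):
    """Hypothetical PDR sketch: relative drop of the comparison mean vs. the baseline mean."""
    baseline_mean = mean(baseline_scores)
    other_mean = mean(other_scores)
    if baseline_mean == 0:
        # The commit message says PDR returns 0 (not NaN) when both means are 0.
        # Returning NaN when only the baseline is 0 is an assumption of this sketch.
        return 0.0 if other_mean == 0 else float("nan")
    return 1 - other_mean / baseline_mean


def normalized_cohens_h(p_baseline, p_other):
    """Hypothetical sketch: Cohen's h between two proportions, divided by pi.

    Cohen's h is 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)), which lies in [-pi, pi],
    so dividing by pi rescales it to [-1, 1]. The sign convention
    (other minus baseline) is an assumption.
    """
    h = 2 * (math.asin(math.sqrt(p_other)) - math.asin(math.sqrt(p_baseline)))
    return h / math.pi


if __name__ == "__main__":
    # Per-instance accuracy scores for one group: original prompts vs. paraphrases.
    print(performance_drop_rate([1, 1, 0, 0], [1, 0, 0, 0]))
    print(performance_drop_rate([0, 0], [0, 0]))
    print(normalized_cohens_h(0.5, 0.5))
```

In the grouped setting described by the commit, functions like these would be applied to the instance scores within each group, and the per-group results would then be averaged.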