
Releases: Lightning-AI/torchmetrics

Adding Multimodal and nominal domain

30 Nov 16:19

We are happy to announce that TorchMetrics v0.11 is now publicly available. In v0.11 we have primarily focused on cleaning up after the large classification refactor from v0.10 and on adding new metrics. With v0.11 we are crossing 90+ metrics in TorchMetrics, nearing the milestone of 100+ metrics.

New domains

In TorchMetrics we are not only looking to expand established metric domains such as classification or regression with new metrics, but also to add entirely new domains. We are therefore happy to report that v0.11 includes two new domains: Multimodal and Nominal.

Multimodal

If there is one topic within machine learning that is hot right now, it is generative models, and in particular image-to-text generative models. Just recently Stable Diffusion v2 was released, able to create even more photorealistic images from a single text prompt than ever before.

In TorchMetrics v0.11 we are adding a new domain called Multimodal to support the evaluation of such models. For now, we are starting out with a single metric, the CLIPScore from this paper, which can be used to evaluate such image-to-text models. CLIPScore currently achieves the highest correlation with human judgment, so a high CLIPScore for an image-text pair means that it is highly plausible that the caption and the image are related to each other.
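Below is a minimal sketch of using the metric, assuming torchmetrics v0.11 with the HuggingFace transformers dependency installed so the CLIP checkpoint can be downloaded; the image tensor is random example data.

```python
# A minimal sketch of CLIPScore usage (torchmetrics v0.11 assumed).
import torch
from torchmetrics.multimodal import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

image = torch.randint(0, 255, (3, 224, 224), dtype=torch.uint8)  # dummy image
caption = "a photo of a cat"

score = metric(image, caption)  # higher score = caption and image match better
print(score)
```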

Nominal

If you have ever taken a course in statistics or an introduction to machine learning, you have hopefully heard that data attributes can be of different types: nominal, ordinal, interval, and ratio. This essentially refers to how values can be compared. For example, nominal data cannot be ordered and cannot be measured numerically. An example would be data that describes the color of your car: blue, red, or green. It does not make sense to compare the different values. Ordinal data can be ordered, but the values carry no relative numerical meaning. An example would be the safety rating of a car: 1, 2, 3. We can say that 3 is better than 1, but the actual numerical values do not mean anything.

In v0.11 of TorchMetrics, we are adding support for classic metrics on nominal data. In fact, 4 new metrics have already been added to this domain:

  • CramersV
  • PearsonsContingencyCoefficient
  • TschuprowsT
  • TheilsU

All metrics are measures of association between two nominal variables, giving a value between 0 and 1, with 1 meaning that there is a perfect association between the variables.
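As a quick illustration, here is a minimal sketch of computing one of these association measures; the integer-encoded categorical tensors are made-up example data.

```python
# A minimal sketch of measuring association between two nominal variables
# with Cramer's V (torchmetrics v0.11 assumed).
import torch
from torchmetrics.nominal import CramersV

_ = torch.manual_seed(42)
preds = torch.randint(0, 5, (100,))   # first categorical variable (5 categories)
target = torch.randint(0, 5, (100,))  # second categorical variable (5 categories)

cramers_v = CramersV(num_classes=5)
print(cramers_v(preds, target))  # value in [0, 1]; 1 means perfect association
```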

Small improvements

In addition to the metrics in the two new domains, v0.11 of TorchMetrics contains a number of smaller changes and fixes:

  • TotalVariation metric has been added to the image package, which measures the complexity of an image with respect to its spatial variation.

  • MulticlassExactMatch metric has been added to the classification package. It can, for example, be used to measure sentence-level accuracy, where all tokens need to match for a sentence to be counted as correct.

  • KendallRankCorrCoef has been added to the regression package for measuring the overall correlation between two variables.

  • LogCoshError has been added to the regression package for measuring the residual error between two variables. It behaves like the mean squared error close to 0 but like the mean absolute error away from 0 (see the sketch after this list).
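A minimal sketch of the two new regression metrics, assuming the modular interface from v0.11; the prediction and target tensors are made-up example values.

```python
import torch
from torchmetrics.regression import KendallRankCorrCoef, LogCoshError

preds = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

print(KendallRankCorrCoef()(preds, target))  # rank correlation in [-1, 1]
print(LogCoshError()(preds, target))         # like MSE near 0, like MAE far from 0
```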


Finally, TorchMetrics now only supports PyTorch v1.8 and higher. Raising the minimum version from v1.3 was necessary because we were running into compatibility issues with older versions of PyTorch. We strive to support as many versions of PyTorch as possible, but for the best experience, we always recommend keeping PyTorch and TorchMetrics up to date.


[0.11.0] - 2022-11-30

Added

  • Added MulticlassExactMatch to classification metrics (#1343)
  • Added TotalVariation to image package (#978)
  • Added CLIPScore to new multimodal package (#1314)
  • Added regression metrics:
    • KendallRankCorrCoef (#1271)
    • LogCoshError (#1316)
  • Added new nominal metrics:
    • CramersV
    • PearsonsContingencyCoefficient
    • TschuprowsT
    • TheilsU
  • Added option to pass distributed_available_fn to metrics to allow checks for custom communication backend for making dist_sync_fn actually useful (#1301)
  • Added normalize argument to Inception, FID, KID metrics (#1246)

Changed

  • Changed minimum Pytorch version to be 1.8 (#1263)
  • Changed interface for all functional and modular classification metrics after refactor (#1252)

Removed

  • Removed deprecated BinnedAveragePrecision, BinnedPrecisionRecallCurve, RecallAtFixedPrecision (#1251)
  • Removed deprecated LabelRankingAveragePrecision, LabelRankingLoss and CoverageError (#1251)
  • Removed deprecated KLDivergence and AUC (#1251)

Fixed

  • Fixed precision bug in pairwise_euclidean_distance (#1352)

Contributors

@Borda, @justusschock, @ragavvenkatesan, @shenoynikhil, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

16 Nov 18:09

[0.10.3] - 2022-11-16

Fixed

  • Fixed bug in Metrictracker.best_metric when return_step=False (#1306)
  • Fixed bug to prevent users from going into an infinite loop if trying to iterate over a single metric (#1320)

Contributors

@SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Fixed Performance

31 Oct 21:29

[0.10.2] - 2022-10-31

Changed

  • Changed in-place operation to out-of-place operation in pairwise_cosine_similarity (#1288)

Fixed

  • Fixed high memory usage for certain classification metrics when average='micro' (#1286)
  • Fixed precision problems when structural_similarity_index_measure was used with autocast (#1291)
  • Fixed slow performance for confusion matrix-based metrics (#1302)
  • Fixed restrictive dtype checking in spearman_corrcoef when used with autocast (#1303)

Contributors

@SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

21 Oct 14:21

[0.10.1] - 2022-10-21

Fixed

  • Fixed broken clone method for classification metrics (#1250)
  • Fixed unintentional downloading of nltk.punkt when lsum not in rouge_keys (#1258)
  • Fixed type casting in MAP metric between bool and float32 (#1150)

Contributors

@dreaquil, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Large changes to classifications

04 Oct 19:54

TorchMetrics v0.10 is now out, significantly changing the whole classification package. This blog post goes over the reasons why the classification package needed to be refactored, what it means for our end users, and finally, what benefits it brings. A guide on how to upgrade your code to the recent changes can be found near the bottom.

Why the classification metrics need to change

We have known for a long time that there were some underlying problems with how we initially structured the classification package. Essentially, classification tasks can be divided into binary, multiclass, or multilabel, and determining which task a user is trying to run a given metric on is hard based on the input alone. The reason a package such as sklearn can do this is that it only supports input in very specific formats (no multi-dimensional arrays and no support for both integer and probability/logit formats).

This meant that some metrics, especially for binary tasks, could end up calculating something different than expected if the user provided input of a different shape than the expected one. This goes against a core value of TorchMetrics: our users should, of course, be able to trust that the metric they are evaluating gives the expected result.

Additionally, the classification metrics lacked consistency: for some metrics num_classes=2 meant binary, and for others num_classes=1 meant binary. You can read more about the underlying reasons for this refactor in this and this issue.

The solution

The solution we went with was to split every classification metric into three separate metrics with the prefixes binary_*, multiclass_*, and multilabel_*. This solves a number of the above problems out of the box because it becomes easier for us to match our users' expectations for any given input shape. It additionally has some other benefits, both for us as developers and for end users:

  • Maintainability: by splitting the code into three distinctive functions, we are (hopefully) lowering the code complexity, making the codebase easier to maintain in the long term.
  • Speed: by completely removing the auto-detection of task at runtime, we can significantly increase computational speed (more on this later).
  • Task-specific arguments: by splitting into three functions, we also make it clearer which input arguments affect the computed result. Take Accuracy as an example: num_classes, top_k, and average are arguments that influence the result if you are doing multiclass classification but do nothing for binary classification, and vice versa for the threshold argument. The task-specific versions only contain the arguments that influence the given task.
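To make the split concrete, here is a minimal sketch of the three task-specific versions of Accuracy, assuming the functional interface from v0.10 and random example tensors:

```python
import torch
from torchmetrics.functional.classification import (
    binary_accuracy,
    multiclass_accuracy,
    multilabel_accuracy,
)

# Each task version only accepts the arguments that are relevant for it.
binary_accuracy(torch.rand(10), torch.randint(0, 2, (10,)), threshold=0.5)
multiclass_accuracy(torch.rand(10, 3), torch.randint(0, 3, (10,)), num_classes=3)
multilabel_accuracy(torch.rand(10, 4), torch.randint(0, 2, (10, 4)), num_labels=4)
```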
There are many smaller quality-of-life improvements hidden throughout the refactor; here are our top 3:

Standardized arguments

The input arguments for the classification package are now much more standardized. Here are a few examples:

  • Each metric now only supports arguments that influence the final result. This means that num_classes has been removed from all binary_* metrics, is now required for all multiclass_* metrics, and has been renamed to num_labels for all multilabel_* metrics.
  • The ignore_index argument is now supported by ALL classification metrics and supports any value, not only values in the [0, num_classes] range (similar to torch loss functions). An example is shown in the sketch after this list.
  • We added a new validate_args argument to all classification metrics to allow users to skip validation of inputs, making the computations faster. By default, we still do input validation because it is the safest option for the user, but if you are confident that the input to the metric is correct, you can now disable it and check for a potential speed-up (more on this later).
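A minimal sketch of the standardized arguments, assuming the v0.10 functional interface; the tensors are made-up example data:

```python
import torch
from torchmetrics.functional.classification import multiclass_accuracy

preds = torch.tensor([0, 2, 1, 2])
target = torch.tensor([0, 2, -1, 1])  # -1 marks samples to ignore

# num_classes is required for multiclass, ignore_index can be any value,
# and validate_args=False skips the (default) input validation for extra speed.
acc = multiclass_accuracy(preds, target, num_classes=3, ignore_index=-1, validate_args=False)
print(acc)
```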

Constant memory implementations

Some of the most useful metrics for evaluating classification problems are metrics such as ROC, AUROC, AveragePrecision, etc., because they not only evaluate your model for a single threshold but a whole range of thresholds, essentially giving you the ability to see the trade-off between Type I and Type II errors. However, a big problem with the standard formulation of these metrics (which we have been using) is that they require access to all data for their calculation. Our implementation has been extremely memory-intensive for these kinds of metrics.

In v0.10 of TorchMetrics, all these metrics now have an argument called thresholds. By default, it is None, and the metric will still save all targets and predictions in memory as you are used to. However, if this argument is instead set to a tensor, e.g. thresholds=torch.linspace(0, 1, 100), the metric will use a constant-memory approximation, evaluating the metric at the provided thresholds.

Setting thresholds=None has an approximate memory footprint of O(num_samples), whereas using thresholds=torch.linspace(0, 1, 100) has an approximate memory footprint of O(num_thresholds). In this particular case, users will save memory whenever the metric is computed on more than 100 samples. In modern machine learning, where evaluation is often done on thousands to millions of data points, this feature can therefore save a lot of memory.
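A minimal sketch of the two modes, assuming the class-based interface from v0.10 and using BinaryAveragePrecision as an example:

```python
import torch
from torchmetrics.classification import BinaryAveragePrecision

# Exact but memory-hungry: all preds/targets are kept in memory until compute().
exact = BinaryAveragePrecision(thresholds=None)

# Constant-memory approximation: only per-threshold confusion statistics are kept.
approx = BinaryAveragePrecision(thresholds=torch.linspace(0, 1, 100))

preds = torch.rand(1000)
target = torch.randint(0, 2, (1000,))
print(exact(preds, target), approx(preds, target))
```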

This also means that the Binned* metrics that currently exist in TorchMetrics are being deprecated as their functionality is now captured by this argument.

All metrics are faster (ish)

By splitting each metric into 3 separate metrics, we reduce the number of calculations needed. We therefore expected, out of the box, that our new implementations would be faster. The table below shows the timings of different metrics with the old and new implementations (with and without input validation). Numbers in parentheses denote the speed-up over the old implementations.

The following observations can be made:

  • Some metrics are a bit faster (1.3x), and others are much faster (4.6x) after the refactor!
  • Disabling input validation can speed things up. For example, multiclass_confusion_matrix goes from a speed-up of 3.36x to 4.81x when input validation is disabled. A clear advantage for users who are familiar with the metrics and do not need validation of their input at every update.
  • If we compare binary with multiclass, the biggest speedup can be seen for multiclass problems.
  • Every metric is faster except for the precision-recall curve, even with the new approximate binning method. This is a bit strange, as the non-approximated version should be equally fast (it is the same code). We are actively looking into this.

[0.10.0] - 2022-10-04

Added

  • Added a new NLP metric InfoLM (#915)
  • Added Perplexity metric (#922)
  • Added ConcordanceCorrCoef metric to regression package (#1201)
  • Added argument normalize to LPIPS metric (#1216)
  • Added support for multiprocessing of batches in PESQ metric (#1227)
  • Added support for multioutput in PearsonCorrCoef and SpearmanCorrCoef (#1200)

Changed

Fixed

  • Fixed a bug in ssim when return_full_image=True where the score was still reduced (#1204)
  • Fixed MPS support for:
  • Fixed bug in ClasswiseWrapper such that compute gave wrong result (#1225)
  • Fixed synchronization of empty list states (#1219)

Contributors

@Borda, @bryant1410, @geoffrey-g-delhomme, @justusschock, @lucadiliello, @nicolas-dufour, @Queuecumber, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

23 Jul 21:26

[0.9.3] - 2022-08-22

Added

  • Added global option sync_on_compute to disable automatic synchronization when compute is called (#1107)

Fixed

  • Fixed missing reset in ClasswiseWrapper (#1129)
  • Fixed JaccardIndex multi-label compute (#1125)
  • Fixed SSIM to propagate device if gaussian_kernel is False, added test (#1149)

Contributors

@KeVoyer1, @krshrimali, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

29 Jun 22:48

[0.9.2] - 2022-06-29

Fixed

  • Fixed mAP calculation for areas with 0 predictions (#1080)
  • Fixed bug where the average precision state and auroc state were not merged when using MetricCollections (#1086)
  • Skip box conversion if no boxes are present in MeanAveragePrecision (#1097)
  • Fixed inconsistency in docs and code when setting average="none" in the AveragePrecision metric (#1116)

Contributors

@23pointsNorth, @kouyk, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor PL compatibility patch

08 Jun 20:23

[0.9.1] - 2022-06-08

Added

  • Added specific RuntimeError when metric object is on the wrong device (#1056)
  • Added an option to specify own n-gram weights for BLEUScore and SacreBLEUScore instead of using uniform weights only. (#1075)

Fixed

  • Fixed aggregation metrics when input only contains zero (#1070)
  • Fixed TypeError when providing superclass arguments as kwargs (#1069)
  • Fixed bug related to state reference in metric collection when using compute groups (#1076)

Contributors

@jlcsilva, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Faster forward

31 May 14:32

Highlights

TorchMetrics v0.9 is now out, and it brings significant changes to how the forward method works. This blog post goes over these improvements and how they affect both users of TorchMetrics and users that implement custom metrics. TorchMetrics v0.9 also includes several new metrics and bug fixes.

Blog: TorchMetrics v0.9 — Faster forward

The Story of the Forward Method

Since the beginning of TorchMetrics, forward has served the dual purpose of calculating the metric on the current batch and accumulating it in a global state. Internally, this was achieved by calling update twice: once for each purpose, which meant repeating the same computation. However, for many metrics, calling update twice is unnecessary for obtaining both the local batch statistics and the global accumulation, because the global statistics are simple reductions of the local batch states.

In v0.9, we have finally implemented logic that takes advantage of this and will only call update once before performing a simple reduction. As you can see in the figure below, this can lead to a single call of forward being 2x faster in v0.9 compared to v0.8 for the same metric.

With the improvements to forward, many metrics have become significantly faster (up to 2x). It should be noted that this change mainly benefits metrics (for example, ConfusionMatrix) where calling update is expensive.

We went through all existing metrics in TorchMetrics and enabled this feature for all appropriate metrics, which was almost 95% of all metrics. We want to stress that if you are using metrics from TorchMetrics, nothing has changed in the API and no code changes are necessary.
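If you implement your own metrics, the behavior is controlled by the new full_state_update class attribute (see the changelog entry below). Here is a minimal sketch of a custom metric that opts into the faster forward; the metric itself (a simple running sum) is just an illustrative example:

```python
import torch
from torchmetrics import Metric

class MySum(Metric):
    # The global state is a simple sum-reduction of the batch states, so
    # forward only needs to call update once (the faster path in v0.9).
    full_state_update = False

    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, x: torch.Tensor) -> None:
        self.total += x.sum()

    def compute(self) -> torch.Tensor:
        return self.total
```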

[0.9.0] - 2022-05-31

Added

  • Added RetrievalPrecisionRecallCurve and RetrievalRecallAtFixedPrecision to retrieval package (#951)
  • Added class property full_state_update that determines whether forward should call update once or twice (#984, #1033)
  • Added support for nested metric collections (#1003)
  • Added Dice to classification package (#1021)
  • Added support for the segmentation type segm as IoU for mean average precision (#822)

Changed

  • Renamed reduction argument to average in Jaccard score and added additional options (#874)

Removed

Fixed

  • Fixed non-empty state dict for a few metrics (#1012)
  • Fixed bug when comparing states while finding compute groups (#1022)
  • Fixed torch.double support in stat score metrics (#1023)
  • Fixed FID calculation for non-equal size real and fake input (#1028)
  • Fixed case where KLDivergence could output Nan (#1030)
  • Fixed deterministic for PyTorch<1.8 (#1035)
  • Fixed default value for mdmc_average in Accuracy (#1036)
  • Fixed missing copy of property when using compute groups in MetricCollection (#1052)

Contributors

@Borda, @burglarhobbit, @charlielito, @gianscarpe, @MrShevan, @phaseolud, @razmikmelikbekyan, @SkafteNicki, @tanmoyio, @vumichien

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

06 May 06:25

[0.8.2] - 2022-05-06

Fixed

  • Fixed multi-device aggregation in PearsonCorrCoef (#998)
  • Fixed MAP metric when using a custom list of thresholds (#995)
  • Fixed compatibility between compute groups in MetricCollection and prefix/postfix arg (#1007)
  • Fixed compatibility with future Pytorch 1.12 in safe_matmul (#1011, #1014)

Contributors

@ben-davidson-6, @Borda, @SkafteNicki, @tanmoyio

If we forgot someone due to not matching commit email with GitHub account, let us know :]