
Make most metrics work on GPU #3851

Merged: 8 commits merged into master from the metrics-gpu branch on Feb 27, 2020

Conversation

@bryant1410 (Contributor) commented Feb 26, 2020

Fix #3017.

I did the following: renamed unwrap_to_tensors to detach_tensors and changed it to move tensors to cuda() instead of cpu(). Then I ran all the tests with 2 GPUs (only 4 tests were skipped). The following failed:

  • spearman correlation
  • fbeta
  • entropy
  • boolean accuracy
  • bleu
  • auc
  • allennlp/tests/predictors/srl_test.py:82 (TestSrlPredictor.test_prediction_with_no_verbs)
  • allennlp/tests/predictors/coref_test.py:86 (TestCorefPredictor.test_replace_corefs)
  • simpleseq2seqtest
  • allennlp/tests/models/encoder_decoders/copynet_seq2seq_test.py:15 (CopyNetTest.test_model_can_train_save_load_predict)
  • coreftest
  • graphparsertest
  • allennlp/tests/interpret/simple_gradient_test.py:30 (TestSimpleGradient.test_simple_gradient_coref)

Then I went through all detach_tensors() callers and all metrics, checked that they are compatible with GPU usage, and added .cpu() calls where necessary.

The following should work fine on GPU:

  • attachment score
  • auc - __call__ should work fine on GPU, but get_metric needs CPU because it uses SciPy.
  • average
  • bleu
  • boolean accuracy
  • categorical accuracy
  • CoNLL Coref Scores - uses SciPy stuff, so I just converted to CPU.
  • covariance
  • entropy
  • evalb - doesn't use tensors; it doesn't even import torch.
  • f1
  • fbeta
  • MAE
  • mention recall - the only input tensor is converted to a list right away, so it's all CPU in the end.
  • pearson
  • perplexity
  • seq accuracy
  • span based
  • spearman - __call__ should work fine on GPU, but get_metric needs CPU because it uses SciPy (see the sketch after this list).
  • srl eval - doesn't use tensors; it doesn't even import torch.
  • unigram recall
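
A quick, hedged sketch of the auc/spearman point above: the class and names here are illustrative, not the actual AllenNLP metric code. Accumulation in __call__ can stay on whatever device the inputs live on, and the move to CPU happens only in get_metric, right before SciPy needs plain NumPy arrays.

```python
import scipy.stats
import torch


class SpearmanLikeMetric:
    """Toy accumulator illustrating the device handling, not the real AllenNLP metric."""

    def __init__(self) -> None:
        self._predictions = torch.zeros(0)
        self._gold_labels = torch.zeros(0)

    def __call__(self, predictions: torch.Tensor, gold_labels: torch.Tensor) -> None:
        # Accumulate on whatever device the inputs live on; no .cpu() needed here.
        device = predictions.device
        self._predictions = torch.cat([self._predictions.to(device), predictions.detach().reshape(-1).float()])
        self._gold_labels = torch.cat([self._gold_labels.to(device), gold_labels.detach().reshape(-1).float()])

    def get_metric(self) -> float:
        # SciPy works on NumPy arrays, so move to CPU only at read time.
        return scipy.stats.spearmanr(self._predictions.cpu().numpy(), self._gold_labels.cpu().numpy())[0]
```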

Then I ran the tests again and fixed one test that called .numpy() without calling .cpu() first. Then I ran all the tests again (2 GPUs; only 4 skipped tests) and everything passed. Then I removed the .cuda() call from detach_tensors and ran the tests once more (still with 2 GPUs available; only 4 skipped tests). They were successful. Feel free to try it yourselves, because CI won't do (multi-)GPU testing (remember to change detach_tensors so it does .cuda()).

(Note there are 4 unconditionally skipped tests.)

I realized all metrics have tests except for Average, Perplexity, and MentionRecall. However, if you look at what I changed in those files you'll see it's really minor and should still work. Note that I didn't modify Perplexity, but it subclasses Average. It'd still be good to have tests for those, though.

By the way, I saw a bunch of numerator / (denominator + 1e-13) expressions (and other similar values), which makes me think those won't work well on FP16. Ideally those should be eps arguments that could be changed when using FP16, for example; something like the sketch below.
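A hedged sketch of that suggestion (my own illustration, not code from this PR; the class and attribute names are made up). The point is the API: 1e-13 underflows to zero in float16, so a caller-supplied eps lets FP16 users pick a representable value.

```python
class F1WithConfigurableEps:
    """Toy F1-style metric illustrating an eps argument instead of a hard-coded 1e-13."""

    def __init__(self, eps: float = 1e-13) -> None:
        # 1e-13 underflows to 0.0 in float16; FP16 users could pass e.g. eps=1e-4 instead.
        self._eps = eps
        self._true_positives = 0.0
        self._false_positives = 0.0
        self._false_negatives = 0.0

    def get_metric(self) -> float:
        precision = self._true_positives / (self._true_positives + self._false_positives + self._eps)
        recall = self._true_positives / (self._true_positives + self._false_negatives + self._eps)
        return 2.0 * precision * recall / (precision + recall + self._eps)
```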

```diff
@@ -108,7 +108,7 @@ def _get_brevity_penalty(self) -> float:
         return math.exp(1.0 - self._reference_lengths / self._prediction_lengths)

     def _get_valid_tokens_mask(self, tensor: torch.LongTensor) -> torch.ByteTensor:
-        valid_tokens_mask = torch.ones(tensor.size(), dtype=torch.bool)
+        valid_tokens_mask = torch.ones_like(tensor, dtype=torch.bool)
```
@bryant1410 (Contributor, Author):

Using the _like variant also puts the result on the same device as the input tensor.
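A minimal illustration of that behavior (illustrative only, not part of the diff):

```python
import torch

t = torch.arange(6).view(2, 3)               # put this on any device you like
mask = torch.ones_like(t, dtype=torch.bool)  # inherits t's shape *and* device
assert mask.shape == t.shape and mask.device == t.device
```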

```diff
         return 0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

     def get_recall(self):
-        if self.recall_numerator == 0:
+        if self.recall_denominator == 0:
```
@bryant1410 (Contributor, Author) commented Feb 26, 2020:

This and the next one were bugs IMO.

```diff
             return 0
         else:
-            return self.recall_numerator / float(self.recall_denominator)
+            return self.recall_numerator / self.recall_denominator
```
@bryant1410 (Contributor, Author):

These float castings are no longer necessary in Python 3. Before removing them, I checked whether the values could be tensors, so as not to remove the casts in cases where they were.

```diff
@@ -18,7 +18,7 @@ def __call__(
         batched_top_spans: torch.Tensor,
         batched_metadata: List[Dict[str, Any]],
     ):
-        for top_spans, metadata in zip(batched_top_spans.data.tolist(), batched_metadata):
+        for top_spans, metadata in zip(batched_top_spans.tolist(), batched_metadata):
```
@bryant1410 (Contributor, Author):

I think in previous PyTorch versions this was necessary, but not anymore.

```diff
         # the vectors, since each element in the predictions and gold_labels tensor is assumed
         # to be a separate observation.
         predictions = predictions.view(-1)
         gold_labels = gold_labels.view(-1)

+        self.total_predictions = self.total_predictions.to(predictions.device)
```
@bryant1410 (Contributor, Author):

Note that at initialization time we don't know which device we should use, so we move the tensor here. If it's already on that device, .to() is a no-op, so this is fine. (A short sketch of the pattern follows.)
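A hedged sketch of that lazy-device pattern (illustrative names only, not the actual Covariance code):

```python
import torch


class RunningTotal:
    """Toy accumulator showing state created on CPU and moved on first use."""

    def __init__(self) -> None:
        # At construction time we have no idea which device the inputs will be on.
        self.total_predictions = torch.zeros(1)

    def __call__(self, predictions: torch.Tensor) -> None:
        # .to() returns the tensor unchanged when it is already on the right device.
        self.total_predictions = self.total_predictions.to(predictions.device)
        self.total_predictions += predictions.detach().sum()
```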

@DeNeutoy (Contributor) left a comment:

@bryant1410 This is awesome and very impactful.

One question - we have GPU tests which we run occasionally in our CI. Is it easy to parametrise all of the metrics tests conditionally, such that they run on the CPU when only the CPU is available, and on both CPU and GPU when both are?

@bryant1410 (Contributor, Author) commented Feb 26, 2020

> One question - we have GPU tests which we run occasionally in our CI. Is it easy to parametrise all of the metrics tests conditionally, such that they run on the CPU when only the CPU is available, and on both CPU and GPU when both are?

The approach I can come up with is using pytest.param for every test function that should support both GPU and CPU, like this (I guess it works with self):

```python
@pytest.mark.parametrize("device", [
    "cpu",
    pytest.param(
        "cuda",
        marks=pytest.mark.skipif(not torch.cuda.is_available(), reason="requires cuda")
    ),
])
def test_func(self, device):
    ...
```

Does it make sense?

@bryant1410 (Contributor, Author) commented Feb 26, 2020

> The approach I can come up with is using pytest.param for every test function that should support both GPU and CPU [...] Does it make sense?

In JAX they do something similar.

@DeNeutoy (Contributor) commented:

Nice, is there a way to wrap that whole parametrization up as a single decorator? It's a bit verbose, and it would be nice to be able to do:

```python
@multi_device
def test_func(self, device):
    ...
```

This might not be possible with pytest; I know it's a bit finicky about how it uses those decorators.
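For what it's worth, here is a hedged sketch of one way such a multi_device helper could be written so that it also works on unittest-style test methods (this is just a guess at the idea, not necessarily what the PR ended up adding):

```python
import functools

import torch


def multi_device(test_method):
    """Run the decorated test on "cpu" and, when CUDA is available, on "cuda" too."""

    @functools.wraps(test_method)
    def wrapper(self):
        devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
        for device in devices:
            test_method(self, device=device)

    return wrapper
```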

Comment on lines +263 to +264
```python
with open(gold_file_path, "w") as gold_file, open(
    prediction_file_path, "w"
```
@bryant1410 (Contributor, Author):

I changed this because the test failed (the function was called twice), and the append mode wasn't necessary anyway.

@bryant1410 (Contributor, Author) commented:

I added that utility. I discovered that pytest and unittest don't play well together, and it's hard to parametrize in that context with the utilities they provide.

I had to change the tests to actually use the device, which also meant avoiding NumPy where I could. In the end, torch provides a testing module with assert_allclose that's convenient. It has good defaults, although when you specify rtol you also have to specify atol (either both or neither); I don't know why. I also had to replace some FloatTensor creations (and similar ones) with tensor, because of some exceptions (and I saw they are no longer recommended).
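A tiny illustration of that rtol/atol point with torch.testing.assert_allclose (illustrative values only):

```python
import torch
from torch.testing import assert_allclose

# Good dtype-dependent defaults when you pass no tolerances at all.
assert_allclose(torch.tensor([1 / 3]), torch.tensor([0.3333333]))

# If you want custom tolerances, rtol and atol must be given together.
assert_allclose(torch.tensor([0.33]), torch.tensor([1 / 3]), rtol=1e-2, atol=1e-2)
```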

As a general practice, I think we should specify the device when we create new tensors with ones, zeros, randn, rand, and tensor (or any constructor; not with the *_like variants such as ones_like, because they copy the device). A small illustration is below.
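A small illustration of that convention (illustrative only):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.zeros(3, 4, device=device)   # constructors need an explicit device argument
b = torch.randn(3, 4, device=device)
c = torch.ones_like(a)                 # *_like variants already copy a's device (and dtype)
assert a.device == b.device == c.device
```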

Comment on lines -80 to +92
```diff
-        numpy.testing.assert_almost_equal(precisions, self.desired_precisions, decimal=2)
-        numpy.testing.assert_almost_equal(recalls, self.desired_recalls, decimal=2)
-        numpy.testing.assert_almost_equal(fscores, self.desired_fscores, decimal=2)
+        assert_allclose(precisions, self.desired_precisions)
+        assert_allclose(recalls, self.desired_recalls)
+        assert_allclose(fscores, self.desired_fscores)
```
@bryant1410 (Contributor, Author):

In many of these changes, the comparison is actually stricter now. It works because I changed expected values like 0.33 to 1 / 3.

```diff
@@ -44,14 +44,14 @@ def __call__(
         # Flatten predictions, gold_labels, and mask. We calculate the Spearman correlation between
         # the vectors, since each element in the predictions and gold_labels tensor is assumed
         # to be a separate observation.
-        predictions = predictions.view(-1)
-        gold_labels = gold_labels.view(-1)
+        predictions = predictions.reshape(-1)
```
@bryant1410 (Contributor, Author):

For some reason, this view sometimes fails on GPU. We can use reshape anyway: it tries to do a view, and falls back to a copy if it can't.
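A quick illustration of why reshape is the safer call (illustrative only; it shows the general non-contiguity case, not the specific GPU failure from the PR):

```python
import torch

x = torch.randn(4, 5).t()   # the transpose makes x non-contiguous
try:
    x.view(-1)              # view needs compatible strides and raises here
except RuntimeError as error:
    print("view failed:", error)

flat = x.reshape(-1)        # reshape returns a view when possible, otherwise a copy
```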

Comment on lines +214 to +215
```python
assert metric._ignore_classes == ["V"]  # type: ignore
assert metric._label_vocabulary == self.vocab.get_index_to_token_vocabulary(  # type: ignore
```
@bryant1410 (Contributor, Author):

I don't know why these mypy errors didn't appear before.

@DeNeutoy (Contributor) left a comment:

Sweeeeet, looks great, thanks @bryant1410 !

allennlp/common/testing/test_case.py (outdated, resolved)
```diff
@@ -40,3 +43,34 @@ def setUp(self):

     def tearDown(self):
         shutil.rmtree(self.TEST_DIR)


+def parametrize(arg_names: Iterable[str], arg_values: Iterable[Iterable[Any]]):
```
A contributor commented:

Very cute, this is really nice!

Co-Authored-By: Mark Neumann <markn@allenai.org>
```python
from allennlp.common.testing import AllenNlpTestCase, multi_device


class TestFromParams(AllenNlpTestCase):
```
Another contributor commented:

Class name here needs updating.

(I came to see what Mark thought looked cute, noticed a copy-paste bug.)

@bryant1410 (Contributor, Author):

Thanks! Good catch.

@bryant1410 (Contributor, Author):

I named the class TestTesting, after the module name; hope that's fine.

@DeNeutoy merged commit ddebbdc into allenai:master on Feb 27, 2020
@bryant1410 deleted the metrics-gpu branch on February 27, 2020 at 19:50
@bryant1410 mentioned this pull request on Feb 28, 2020
@bryant1410 (Contributor, Author) commented:

@DeNeutoy, related to this and to be on the safe side: in Trainer, shouldn't we wrap the call to get_metrics() in with torch.no_grad(): during training? (In validation it's already there.)
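A minimal, self-contained illustration of what no_grad buys there (nothing here is the actual Trainer code):

```python
import torch

predictions = torch.randn(4, 3, requires_grad=True).softmax(dim=-1)

mae = (predictions - 1.0).abs().mean()      # metric-style arithmetic keeps the autograd graph alive
assert mae.requires_grad

with torch.no_grad():
    mae = (predictions - 1.0).abs().mean()  # same arithmetic, but no graph is retained
assert not mae.requires_grad
```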

Successfully merging this pull request may close these issues:

  • Why tensors are moved to CPU when calculating metrics?