Better align methods for creating tensors from .NET arrays with Pytorch APIs. #670

Closed
TheBlackPlague opened this issue Jul 31, 2022 · 29 comments


@TheBlackPlague

TheBlackPlague commented Jul 31, 2022

Hello,

It seems to me that this library's tensor allocation performance is not good at all. See the following code:

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static (Tensor, Tensor) Batch(this Board[] boards, float[,] sideToMove, float[,] notSideToMove, int batchSize)
    {
        // Time only the array-filling loop; the from_array() calls below are not measured.
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < batchSize; i++) {
            FillArrayFromBoard(boards[i], sideToMove, notSideToMove, i);
        }
        stopwatch.Stop();
        LastElapsedMs = (int)stopwatch.ElapsedMilliseconds;

        return (from_array(sideToMove, dtype: float32), from_array(notSideToMove, dtype: float32));
    }

When using arrays of size [100000, 768] for both sideToMove and notSideToMove, LastElapsedMs is 201ms on a run. However, the whole method takes 6570ms. That means tensor creation alone is taking > 6000ms, even after accounting for the time Stopwatch.StartNew() and Stopwatch.Stop() take.

To reproduce it, you can just call from_array() twice on a [100000, 768] array. Is there a faster way to do this? Otherwise, it slows things down quite a lot.
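
For reference, here's the kind of minimal repro I mean (same shape and calls as above):

Stopwatch stopwatch = Stopwatch.StartNew();
float[,] data = new float[100000, 768];
Tensor a = from_array(data, dtype: float32);
Tensor b = from_array(data, dtype: float32);
stopwatch.Stop();
Console.WriteLine("Two from_array() calls took: " + stopwatch.ElapsedMilliseconds + "ms");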

I'm using TorchSharp-cuda-windows v0.97.0.

@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

I'd like to add that I've confirmed the tensors are being allocated on the CPU, not the GPU. Moreover, the results are consistent with TorchSharp-cpu v0.97.0.

@TheBlackPlague changed the title from "Tensor allocation insanely slow." to "Tensor allocation insanely slow for from_array(Array)." Aug 1, 2022
@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

It seems that normal tensor allocation isn't slow.

Stopwatch stopwatch = Stopwatch.StartNew();
Tensor zero = zeros(100000, 768);
stopwatch.Stop();
zero.Dispose();

Console.WriteLine("Allocation took: " + stopwatch.ElapsedMilliseconds + "ms");

The output of this code is:

Allocation took: 140ms

However, the from_array() method is insanely slow.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 1, 2022

@TheBlackPlague -- I deleted my first comment after measuring for myself. Thanks for reporting this performance issue, which pertains to both from_array() and the more canonical 'tensor()' factories for creating tensors from .NET arrays.

I have verified that it is the rawArray.Cast<T>().ToArray() calls in both from_array() and tensor() that take up all the time.

@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

Will there be a version shipping soon with the fix (if you have one in mind)? Sorry for rushing, but I intended to use this library to train neural networks for my chess engine, StockNemo.

@NiklasGustafsson
Contributor

@TheBlackPlague -- that depends on how much time it takes to get a reasonable fix in place. In the meantime, switch to using 'tensor()' instead of 'from_array()' which I think is more general than you need. The fix I have in mind may not apply to from_array().
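
For example, assuming your inputs are float[,], the type-specific overload would look like this (a sketch):

float[,] sideToMove = new float[100000, 768];
Tensor t = tensor(sideToMove);   // type-specific overload, avoiding the generic Array-based path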

@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

> @TheBlackPlague -- that depends on how much time it takes to get a reasonable fix in place. In the meantime, switch to using 'tensor()' instead of 'from_array()' which I think is more general than you need. The fix I have in mind may not apply to from_array().

Yeah, after opening the issue I realized tensor() was the better fit, since I don't need the fully generic n-dimensional handling. However, as you mentioned, its performance is nearly identical. That currently makes the library unusable for my case, since my project relies on creating tensors quickly and then training on them. I understand it's hard to come up with a good fix instantly, so I'm not expecting one within 24 hours, but I'd love it if you could keep me updated.

If there's anything necessary or that might be helpful from my end, please feel free to reach out.

@NiklasGustafsson
Contributor

It looks like the fix is going to work, and will be several orders of magnitude faster. I can make a release later this week after more testing.

NiklasGustafsson added a commit to NiklasGustafsson/TorchSharp that referenced this issue Aug 1, 2022
@NiklasGustafsson
Contributor

@TheBlackPlague -- I have a PR for this bug. Maybe you can help with the code review?

#671

@TheBlackPlague
Author

Thank you for working hard on this. I appreciate it. I've submitted some comments and suggestions, both on stack allocations and on overall improvements to the testing.

Thank you for requesting my input.

@NiklasGustafsson
Contributor

By the way, from_array() will be just as fast as tensor(), except for complex numbers, which will retain the copy for now.

@TheBlackPlague
Author

> By the way, from_array() will be just as fast as tensor(), except for complex numbers, which will retain the copy for now.

Yep, I saw. That's great to hear. :)

@NiklasGustafsson
Contributor

@TheBlackPlague -- thanks for reviewing. If you want, it would be great if you could introduce yourself to the community around TorchSharp here: #333

Not required, by any means, but appreciated. It's great to know the people who are working on the code, whether putting in PRs or reviewing them.

NiklasGustafsson added a commit that referenced this issue Aug 2, 2022
Fix performance issue reported in #670
@NiklasGustafsson
Contributor

@TheBlackPlague -- 0.97.1 has been released to NuGet. Please validate that it addresses your performance issue.

@TheBlackPlague
Author

> @TheBlackPlague -- 0.97.1 has been released to NuGet. Please validate that it addresses your performance issue.

I will try today. Thank you very much.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 3, 2022

@TheBlackPlague, @tarekgh:

A follow-up on this bug. According to the Pytorch docs, torch.tensor() is supposed to make a copy of the input, which the current implementation doesn't do. There are others, like as_tensor() and frombuffer(), which share the buffer, but tensor() is not supposed to.

I have an implementation of frombuffer(), which allocates a [100000,768] tensor in 5 μs, while a proper tensor() call cloning the array will take 190 ms, i.e. a 40,000x difference.

In order to comply with Pytorch, I will have to change the implementation of tensor() and recommend using frombuffer() when speed is essential, with the caveat that the resulting tensor shares its buffer with the source.
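
For example, a sketch of how that would look (frombuffer() takes a flat buffer, so the caller reshapes afterwards):

float[] buffer = new float[100000 * 768];
Tensor t = frombuffer(buffer, float32).reshape(100000, 768);
buffer[0] = 1.0f;   // visible through t -- the tensor and the buffer share memory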

@TheBlackPlague
Author

That's understandable. Of course complying with Pytorch documentation is necessary.

@tarekgh
Member

tarekgh commented Aug 3, 2022

@NiklasGustafsson should we have an option to allow full cloning and not sharing the buffer? Something like torch.tensor(fullCloning = false).

@NiklasGustafsson
Contributor

I would think that cloning would be the default, if anything.

@NiklasGustafsson
Contributor

@MovGP0 -- FYI, since you introduced from_array()

@NiklasGustafsson
Contributor

I was thinking that from_array() would correspond to from_numpy() in Pytorch, with the same non-copying semantics.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 3, 2022

These would be the various ways of creating tensors from arrays, parallel to Pytorch, if from_array() is treated as from_numpy():

// Never copy:
public static Tensor from_array(Array data)

// Copy if necessary:
public static Tensor frombuffer(Array data, ScalarType dtype, long count = -1, long offset = 0, bool requiresGrad = false, Device? device = null)
public static Tensor as_tensor(Array data, ScalarType? dtype = null, Device? device = null)
public static Tensor as_tensor(Tensor data, ScalarType? dtype = null, Device? device = null)

// Always copy:
public static Tensor as_tensor(IList<<VARIOUS>> data, torch.Device? device = null, bool requiresGrad = false)
public static Tensor tensor(<<VARIOUS>> data, torch.Device? device = null, bool requiresGrad = false)
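
In practice, the proposed semantics would look like this (a sketch):

float[] data = new float[768];
Tensor shared = from_array(data);   // never copies: writes to 'data' are visible through 'shared'
Tensor copied = tensor(data);       // always copies: independent of 'data' from here on
data[0] = 42.0f;
// 'shared' now sees 42.0f at index 0, while 'copied' still reads 0.0f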

@NiklasGustafsson changed the title from "Tensor allocation insanely slow for from_array(Array)." to "Better align methods for creating tensors from .NET arrays with Pytorch APIs." Aug 3, 2022
@GeorgeS2019

Related discussion

@MovGP0
Contributor

MovGP0 commented Aug 4, 2022

> However, the from_array() method is insanely slow.

The generic method uses reflection internally, which works for almost any kind of array but has a performance impact. There are surely ways to optimize the method, but for now the best approach is to prefer the method overloads for specific types over the generic one.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 4, 2022

The problem turned out to be that a copy was being made, making the method linear in the size of the array. With this PR, from_array() is constant-time (a few microseconds) but shares memory with the original, which is also why the overloaded versions of torch.tensor() were fast (and incorrect).

With the proposed semantics, everything has been flipped -- from_array() is fast, tensor() is slow.

@NiklasGustafsson
Contributor

To summarize, here's what the Pytorch docs have to say about copying semantics. Since both frombuffer() and as_tensor() take dtype and/or device arguments, sharing memory is not always possible: the resulting tensor may end up with a different element type or on a different device.

torch.tensor():
Constructs a tensor with no autograd history (also known as a “leaf tensor”, see Autograd mechanics) by copying data

torch.frombuffer():
The returned tensor and buffer share the same memory. Modifications to the tensor will be reflected in the buffer and vice versa. The returned tensor is not resizable.

torch.as_tensor():
Converts data into a tensor, sharing data and preserving autograd history if possible.

torch.from_numpy():
The returned tensor and ndarray share the same memory.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 4, 2022

As measured on my workstation, for a 100000x768 tensor as originally reported:

Total time per call to as_tensor(float[]): 4.78 µs
Total time per call to frombuffer(): 3.212 µs
Total time per call to from_array(): 4.002 µs
Total time per call to tensor(): 187.8 ms
Total time per call to as_tensor(List<float>): 209.13 ms

@TheBlackPlague
Author

@NiklasGustafsson, I wanted to ask, are the recent changes available on NuGet?

@NiklasGustafsson
Contributor

No, I haven't made another release yet. Maybe early next week.

@NiklasGustafsson
Contributor

@TheBlackPlague -- it is now on NuGet: 0.97.2.
