Better align methods for creating tensors from .NET arrays with Pytorch APIs. #670

Closed
TheBlackPlague opened this issue Jul 31, 2022 · 29 comments


@TheBlackPlague

TheBlackPlague commented Jul 31, 2022

Hello,

It seems to me that this library's tensor allocation performance is not good at all. See the following code:

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static (Tensor, Tensor) Batch(this Board[] boards, float[,] sideToMove, float[,] notSideToMove, int batchSize)
    {
        // Time only the array-filling loop; the from_array() calls below are not measured.
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < batchSize; i++) {
            FillArrayFromBoard(boards[i], sideToMove, notSideToMove, i);
        }
        stopwatch.Stop();
        LastElapsedMs = (int)stopwatch.ElapsedMilliseconds;

        return (from_array(sideToMove, dtype: float32), from_array(notSideToMove, dtype: float32));
    }

When using arrays of size [100000, 768] for both sideToMove and notSideToMove, LastElapsedMs is 201ms on a run. However, the whole method takes 6570ms. That means tensor creation alone is taking > 6000ms, even after accounting for the time Stopwatch.StartNew() and Stopwatch.Stop() take.

To reproduce it, you can just call from_array() twice on a [100000, 768] array. Is there a faster way to do this? Otherwise, it slows things down quite a lot.
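
For reference, here's the kind of minimal repro I mean (same shape and calls as above):

Stopwatch stopwatch = Stopwatch.StartNew();
float[,] data = new float[100000, 768];
Tensor a = from_array(data, dtype: float32);
Tensor b = from_array(data, dtype: float32);
stopwatch.Stop();
Console.WriteLine("Two from_array() calls took: " + stopwatch.ElapsedMilliseconds + "ms");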

I'm using TorchSharp-cuda-windows v0.97.0.

@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

I'd like to add that I've confirmed the tensors are being allocated on the CPU, not the GPU. Moreover, the results are consistent with TorchSharp-cpu v0.97.0.

@TheBlackPlague changed the title from "Tensor allocation insanely slow." to "Tensor allocation insanely slow for from_array(Array)." Aug 1, 2022
@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

It seems that normal tensor allocation isn't slow.

Stopwatch stopwatch = Stopwatch.StartNew();
Tensor zero = zeros(100000, 768);
stopwatch.Stop();
zero.Dispose();

Console.WriteLine("Allocation took: " + stopwatch.ElapsedMilliseconds + "ms");

The output of this code is:

Allocation took: 140ms

However, the from_array() method is insanely slow.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 1, 2022

@TheBlackPlague -- I deleted my first comment after measuring for myself. Thanks for reporting this performance issue, which pertains to both from_array() and the more canonical 'tensor()' factories for creating tensors from .NET arrays.

I have verified that it is the rawArray.Cast<T>().ToArray() calls in both from_array() and tensor() that take up all the time.

@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

Will there be a version shipping soon with the fix (if you have one in mind)? Sorry for rushing, but I intended to use this library to train neural networks for my chess engine, StockNemo.

@NiklasGustafsson
Contributor

@TheBlackPlague -- that depends on how much time it takes to get a reasonable fix in place. In the meantime, switch to using 'tensor()' instead of 'from_array()' which I think is more general than you need. The fix I have in mind may not apply to from_array().
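
For example, assuming your inputs are float[,], the type-specific overload would look like this (a sketch):

float[,] sideToMove = new float[100000, 768];
Tensor t = tensor(sideToMove);   // type-specific overload, avoiding the generic Array-based path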

@TheBlackPlague
Author

TheBlackPlague commented Aug 1, 2022

> @TheBlackPlague -- that depends on how much time it takes to get a reasonable fix in place. In the meantime, switch to using 'tensor()' instead of 'from_array()' which I think is more general than you need. The fix I have in mind may not apply to from_array().

Yeah, after opening the issue I realized tensor() was the better fit, since I don't need the fully generic n-dimensional handling. However, as you mentioned, its performance is nearly identical. That currently makes the library unusable for my case, since my project relies on creating tensors quickly and then training on them. I understand it's hard to come up with a good fix instantly, so I'm not expecting one within 24 hours, but I'd love it if you could keep me updated.

If there's anything necessary or that might be helpful from my end, please feel free to reach out.

@NiklasGustafsson
Contributor

It looks like the fix is going to work, and will be several orders of magnitude faster. I can make a release later this week after more testing.

NiklasGustafsson added a commit to NiklasGustafsson/TorchSharp that referenced this issue Aug 1, 2022
@NiklasGustafsson
Contributor

@TheBlackPlague -- I have a PR for this bug. Maybe you can help with the code review?

#671

@TheBlackPlague
Author

Thank you for working hard on this. I appreciate it. I've submitted some comments and suggestions, both on stack allocations and on overall improvements to the testing.

Thank you for requesting my input.

@NiklasGustafsson
Contributor

By the way, from_array() will be just as fast as tensor(), except for complex numbers, which will retain the copy for now.

@TheBlackPlague
Author

> By the way, from_array() will be just as fast as tensor(), except for complex numbers, which will retain the copy for now.

Yep, I saw. That's great to hear. :)

@NiklasGustafsson
Contributor

@TheBlackPlague -- thanks for reviewing. If you want, it would be great if you could introduce yourself to the community around TorchSharp here: #333

Not required, by any means, but appreciated. It's great to know the people who are working on the code, whether putting in PRs or reviewing them.

NiklasGustafsson added a commit that referenced this issue Aug 2, 2022
Fix performance issue reported in #670
@NiklasGustafsson
Contributor

@TheBlackPlague -- 0.97.1 has been released to NuGet. Please validate that it addresses your performance issue.

@TheBlackPlague
Author

> @TheBlackPlague -- 0.97.1 has been released to NuGet. Please validate that it addresses your performance issue.

I will try today. Thank you very much.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 3, 2022

@TheBlackPlague, @tarekgh:

A follow-up on this bug. According to the Pytorch docs, torch.tensor() is supposed to make a copy of the input, which the current implementation doesn't do. There are others, like as_tensor() and frombuffer(), which share the buffer, but tensor() is not supposed to.

I have an implementation of frombuffer(), which allocates a [100000,768] tensor in 5 μs, while a proper tensor() call cloning the array will take 190 ms, i.e. a 40,000x difference.

In order to comply with Pytorch, I will have to change the implementation of tensor() and recommend using frombuffer() when speed is essential, with the caveat that the resulting tensor shares its buffer with the source.
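
For example, a sketch of how that would look (frombuffer() takes a flat buffer, so the caller reshapes afterwards):

float[] buffer = new float[100000 * 768];
Tensor t = frombuffer(buffer, float32).reshape(100000, 768);
buffer[0] = 1.0f;   // visible through t -- the tensor and the buffer share memory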

@TheBlackPlague
Author

That's understandable. Of course complying with Pytorch documentation is necessary.

@tarekgh
Member

tarekgh commented Aug 3, 2022

@NiklasGustafsson should we have an option to allow full cloning and not sharing the buffer? Something like torch.tensor(fullCloning = false).

@NiklasGustafsson
Contributor

I would think that cloning would be the default, if anything.

@NiklasGustafsson
Contributor

@MovGP0 -- FYI, since you introduced from_array()

@NiklasGustafsson
Contributor

I was thinking that from_array() would correspond to from_numpy() in Pytorch, with the same non-copying semantics.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 3, 2022

These would be the various ways of creating tensors from arrays, parallel to Pytorch, if from_array() is treated as from_numpy():

// Never copy:
public static Tensor from_array(Array data)

// Copy if necessary:
public static Tensor frombuffer(Array data, ScalarType dtype, long count = -1, long offset = 0, bool requiresGrad = false, Device? device = null)
public static Tensor as_tensor(Array data, ScalarType? dtype = null, Device? device = null)
public static Tensor as_tensor(Tensor data, ScalarType? dtype = null, Device? device = null)

// Always copy:
public static Tensor as_tensor(IList<<VARIOUS>> data, torch.Device? device = null, bool requiresGrad = false)
public static Tensor tensor(<<VARIOUS>> data, torch.Device? device = null, bool requiresGrad = false)
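
In practice, the proposed semantics would look like this (a sketch):

float[] data = new float[768];
Tensor shared = from_array(data);   // never copies: writes to 'data' are visible through 'shared'
Tensor copied = tensor(data);       // always copies: independent of 'data' from here on
data[0] = 42.0f;
// 'shared' now sees 42.0f at index 0, while 'copied' still reads 0.0f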

@NiklasGustafsson changed the title from "Tensor allocation insanely slow for from_array(Array)." to "Better align methods for creating tensors from .NET arrays with Pytorch APIs." Aug 3, 2022
@GeorgeS2019

Related discussion

@MovGP0
Contributor

MovGP0 commented Aug 4, 2022

> However, the from_array() method is insanely slow.

The generic method uses reflection internally, which works for almost any kind of array but has a performance impact. There are surely ways to optimize the method, but for now the best approach is to prefer the method overloads for specific types over the generic one.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 4, 2022

The problem turned out to be that a copy was being made, making the method linear in the size of the array. With this PR, from_array() is constant-time (a few microseconds) but shares memory with the original, which is also why the overloaded versions of torch.tensor() were fast (and incorrect).

With the proposed semantics, everything has been flipped -- from_array() is fast, tensor() is slow.

@NiklasGustafsson
Contributor

To summarize, here's what the Pytorch docs have to say about copying semantics. Since both frombuffer() and as_tensor() take dtype and/or device arguments, sharing memory is not always possible: the resulting tensor may end up with a different element type or on a different device.

torch.tensor():
Constructs a tensor with no autograd history (also known as a “leaf tensor”, see Autograd mechanics) by copying data

torch.frombuffer():
The returned tensor and buffer share the same memory. Modifications to the tensor will be reflected in the buffer and vice versa. The returned tensor is not resizable.

torch.as_tensor():
Converts data into a tensor, sharing data and preserving autograd history if possible.

torch.from_numpy():
The returned tensor and ndarray share the same memory.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Aug 4, 2022

As measured on my workstation, for a 100000x768 tensor as originally reported:

Total time per call to as_tensor(float[]): 4.78 µs
Total time per call to frombuffer(): 3.212 µs
Total time per call to from_array(): 4.002 µs
Total time per call to tensor(): 187.8 ms
Total time per call to as_tensor(List<float>): 209.13 ms

@TheBlackPlague
Author

@NiklasGustafsson, I wanted to ask, are the recent changes available on NuGet?

@NiklasGustafsson
Contributor

No, I haven't made another release yet. Maybe early next week.

@NiklasGustafsson
Contributor

@TheBlackPlague -- it is now on NuGet: 0.97.2.
