Better align methods for creating tensors from .NET arrays with Pytorch APIs. #670
I would like to add that I've confirmed the Tensors are being allocated on the CPU & not the GPU. Moreover, the results are consistent with
It seems that normal tensor allocation isn't slow:

// using System.Diagnostics; using static TorchSharp.torch;
Stopwatch stopwatch = Stopwatch.StartNew();
Tensor zero = zeros(100000, 768);
stopwatch.Stop();
zero.Dispose();
Console.WriteLine("Allocation took: " + stopwatch.ElapsedMilliseconds + "ms");

The output of this code is:
However, the
@TheBlackPlague -- I deleted my first comment after measuring for myself. Thanks for reporting this performance issue, which pertains to both from_array() and the more canonical 'tensor()' factories for creating tensors from .NET arrays. I have verified that it is the
Will there be a version shipping soon with the fix (if you have one in mind)? Sorry for rushing, but I intended to use this library to train neural networks for my chess engine, StockNemo.
@TheBlackPlague -- that depends on how much time it takes to get a reasonable fix in place. In the meantime, switch to using 'tensor()' instead of 'from_array()', which I think is more general than you need. The fix I have in mind may not apply to from_array().
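For reference, a switch along those lines might look like the following sketch. It assumes a float[] overload of tensor() (the exact set of overloads in 0.97.0 may differ) and reshapes the flat buffer afterwards; the 100000x768 shape is taken from the report above.

// Slow path on 0.97.0: the generic, reflection-based factory.
float[,] grid = new float[100000, 768];
Tensor viaFromArray = from_array(grid);

// Suggested workaround: a typed tensor() overload over a flat buffer,
// reshaped to the desired dimensions afterwards.
float[] flat = new float[100000 * 768];
Tensor viaTensor = tensor(flat).reshape(100000, 768);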
Yeah, after making the issue, I learned that. If there's anything necessary or that might be helpful from my end, please feel free to reach out.
It looks like the fix is going to work, and will be several orders of magnitude faster. I can make a release later this week after more testing.
@TheBlackPlague -- I have a PR for this bug. Maybe you can help code review?
Thank you for working hard on this. I appreciate it. I've submitted some comments and suggestions, both for stack allocations & overall improvement of testing, etc. Thank you for requesting my input.
By the way,
Yep, I saw. That's great to hear. :)
@TheBlackPlague -- thanks for reviewing. If you want, it would be great if you could introduce yourself to the community around TorchSharp here: #333. Not required, by any means, but appreciated. It's great to know the people who are working on the code, whether putting in PRs or reviewing them.
Fix performance issue reported in #670
@TheBlackPlague -- 0.97.1 has been released to NuGet. Please validate that it addresses your performance issue.
I will try today. Thank you very much.
A follow-up on this bug. According to the Pytorch docs, frombuffer() shares memory with its source rather than copying, and I have an implementation of it. In order to comply with Pytorch, I will have to change the implementation of tensor() and recommend using frombuffer() when speed is essential, with the caveat that the resulting tensor shares its buffer with the source.
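A small sketch of that caveat, using the frombuffer() signature proposed later in this thread (the released parameter list may differ):

float[] buffer = new float[] { 1f, 2f, 3f, 4f };
Tensor shared = frombuffer(buffer, ScalarType.Float32);
buffer[0] = 100f;
// No copy was made, so the write to buffer[0] is visible through
// 'shared' as well; clone() would be needed for an independent tensor.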
That's understandable. Of course complying with Pytorch documentation is necessary.
@NiklasGustafsson should we have an option to allow full cloning and not sharing the buffer? Like
I would think that cloning would be the default, if anything.
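Either way, an explicit, independent copy can always be obtained from a sharing factory via clone(); a sketch, assuming the sharing semantics discussed above:

float[,] data = new float[100000, 768];
Tensor view = from_array(data);   // shares the .NET array's buffer
Tensor owned = view.clone();      // owns its own storage; later writes to 'data' won't affect it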
@MovGP0 -- FYI, since you introduced from_array()
I was thinking that from_array() will correspond to from_numpy() in Pytorch, with the same non-copying semantics. |
This would be the various ways of creating tensors from arrays, parallel to Pytorch, if from_array() is treated as from_numpy():

// Never copy:
public static Tensor from_array(Array data)

// Copy if necessary:
public static Tensor frombuffer(Array data, ScalarType dtype, long count = -1, long offset = 0, bool requiresGrad = false, Device? device = null)
public static Tensor as_tensor(Array data, ScalarType? dtype = null, Device? device = null)
public static Tensor as_tensor(Tensor data, ScalarType? dtype = null, Device? device = null)

// Always copy:
public static Tensor as_tensor(IList<<VARIOUS>> data, torch.Device? device = null, bool requiresGrad = false)
public static Tensor tensor(<<VARIOUS>> data, torch.Device? device = null, bool requiresGrad = false)
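As a usage sketch of the three groups (assuming the signatures above; dtype and device arguments are omitted where defaults apply, and <<VARIOUS>> stands in for the typed overloads):

double[] source = { 1.0, 2.0, 3.0 };

// Never copies: the tensor aliases the .NET array's memory.
Tensor aliased = from_array(source);

// Copies only when required, e.g. when a dtype conversion is requested.
Tensor converted = as_tensor(source, ScalarType.Float32);

// Always copies: the tensor owns its own storage.
Tensor copied = tensor(source);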
The generic method uses reflection internally, which works for almost any kind of array, but has a performance impact. There are surely ways to optimize the method, but currently the best way is to prefer method overloads for specific types instead of the generic one.
The problem turned out to be that a copy was being made, making the method's cost linear in the size of the array. With this PR, that is no longer the case. With the proposed semantics, everything has been flipped -- from_array() never copies, while tensor() always does.
To summarize, here's what the Pytorch docs have to say about copying semantics. Since both
As measured on my workstation for a tensor 100000x768, as originally reported:
@NiklasGustafsson, I wanted to ask, are the recent changes available on NuGet?
No, I haven't made another release yet. Maybe early next week.
@TheBlackPlague -- it is now on NuGet: 0.97.2.
Hello,

It seems to me that this library's tensor allocation performance is not good at all. See the following code:

When using arrays of size [100000, 768] for both sideToMove and notSideToMove, LastElapsedMs = 201ms on a run. However, the whole method takes 6570ms. That means the tensor allocation alone is taking > 6000ms, even accounting for the time Stopwatch.StartNew() and Stopwatch.Stop() take.

To test it, you can just do from_array() 2x on a [100000, 768] array. Is there a faster way to do this? Because otherwise, it slows down performance quite a lot.

I'm using TorchSharp-cuda-windows v0.97.0.
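The code referenced above did not survive the formatting; a minimal reproduction of the measurement might look like the following sketch (the shape and the sideToMove/notSideToMove names come from the report, everything else is assumed):

using System;
using System.Diagnostics;
using TorchSharp;
using static TorchSharp.torch;

class Repro
{
    static void Main()
    {
        float[,] sideToMove = new float[100000, 768];
        float[,] notSideToMove = new float[100000, 768];

        // Time the two from_array() calls in isolation.
        Stopwatch stopwatch = Stopwatch.StartNew();
        Tensor a = from_array(sideToMove);
        Tensor b = from_array(notSideToMove);
        stopwatch.Stop();

        // Reported to take > 6000ms in total on v0.97.0 (see above).
        Console.WriteLine("from_array x2 took: " + stopwatch.ElapsedMilliseconds + "ms");

        a.Dispose();
        b.Dispose();
    }
}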