
[Pytorch] New version of the presets #1360

Merged · 71 commits · Jul 25, 2023

Conversation
@HGuillemet (Collaborator) commented May 22, 2023

Reorganization of the Pytorch presets.

Main improvements are

  • Handling of the virtual inheritance between torch::nn::Module and its subclasses. This avoids segmentation faults when a function accepting a torch::nn::Module, or a Module instance method, is called from a subclass instance.
  • Transparent and more rigorous handling of shared_ptr. This concerns Module and its subclasses, Tensor, and a few less important classes. As a consequence, modules are now deallocated normally (they were never deallocated in previous versions of the presets).
  • Module holders, the C++ counterpart of the transparent handling of shared_ptr, are no longer mapped.
  • Friend functions that were previously ignored are now mapped, giving access to additional operator overloads, such as those on iterators.
  • Module is now virtualized, meaning that if you override its virtual member functions in Java (train, is_training, to, zero_grad, save, load, pretty_print, is_serializable), your Java code will get called by libtorch.
  • Added support for Sequential, AnyModule and AnyValue.
  • Hopefully, better coverage of the API. Almost everything that is available in C++ with:
#include <torch/torch.h>
#include <ATen/native/TensorShape.h>
#include <torch/csrc/jit/runtime/custom_operator.h>
#include <torch/csrc/jit/serialization/storage_context.h>
#include <torch/csrc/jit/serialization/import.h>
#include <ATen/cudnn/Descriptors.h>
#include <ATen/cudnn/Types.h>
#include <c10/cuda/CUDAGuard.h>

should be mapped, barring a number of exceptions and missing template instantiations.
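To illustrate the virtualization of Module mentioned above: a Java subclass can now intercept calls made from the C++ side. A minimal sketch, assuming the generated signature of train mirrors the C++ one (the LoggingModule class itself is hypothetical):

```java
import org.bytedeco.pytorch.Module;

// Hypothetical subclass: because Module is virtualized, this override is
// visible to libtorch, so calling train() from the C++ side (e.g. through
// a parent module) lands in this Java method.
class LoggingModule extends Module {
    @Override
    public void train(boolean on) {
        System.out.println("train(" + on + ") called from native code");
        super.train(on);  // keep the default behavior
    }
}
```

Compiling and loading this class requires the pytorch presets and the native libtorch libraries on the classpath.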

The CUDA versions of the library now include PTX code for compute capability 7.0 and binary code for 5.0 and 6.0. This means you need at least a Maxwell card, and that on first launch, if you have a Volta or more recent card, the library will compile the PTX code of all libtorch kernels into binaries for your card. This can take 10 to 20 minutes. The binaries are stored in a cache for future launches. If your application still lags at startup after the first launch, your cache size is probably too low. Try increasing it by setting the environment variable CUDA_CACHE_MAXSIZE to 2000000.
If this is not acceptable for your needs and you cannot afford the delay at first launch, or if you want to benefit from newer compute capabilities, you can install libtorch from pytorch.org and set the org.bytedeco.javacpp.pathsFirst system property to true.
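As a sketch, the pathsFirst property can be set on the command line with -Dorg.bytedeco.javacpp.pathsFirst=true, or programmatically before any JavaCPP-loaded class is touched (the class name below is made up for the example):

```java
public class PreferSystemLibtorch {
    public static void main(String[] args) {
        // Must run before any JavaCPP class is loaded, so that the loader
        // searches system library paths (e.g. an installed libtorch) first.
        System.setProperty("org.bytedeco.javacpp.pathsFirst", "true");

        // CUDA_CACHE_MAXSIZE, by contrast, is read by the CUDA driver and
        // must be set in the environment before the JVM starts, e.g.:
        //   export CUDA_CACHE_MAXSIZE=2000000
        System.out.println(System.getProperty("org.bytedeco.javacpp.pathsFirst"));
    }
}
```

prints "true".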

Please give it a test and report here any problem, comment, missing API or other RFE.

@saudet (Member) commented May 23, 2023

Could you undo unnecessary changes in indenting?

Add missing ATen/ops/from_blob.h
Add missing ATen/ops/tensor.h
@HGuillemet (Collaborator, Author)

Could you undo unnecessary changes in indenting?

I reindented using rules that are, I think, closer to the ones you use.

Also added 2 missing includes.

@saudet (Member) commented May 23, 2023

Could you undo unnecessary changes in indenting?

I reindented using rules that are, I think, closer to the ones you use.

Also added 2 missing includes.

I mean, could you please not change the lines that you haven't touched? It makes it harder to review.

@HGuillemet (Collaborator, Author) commented May 23, 2023

You won't be able to review by comparing line by line with the previous version. Too much has changed.
I grouped the rules by categories, and added comments. I suggest that you simply read them and see if you have comments or if I got something wrong.

But please hold on, another commit is coming. I realized some includes are still missing, and I need to commit the gen directory so that the CI checks pass.

Edit: in fact, for a thorough review, it might be easier to check the differences in the generated files, after filtering out insignificant changes.

Added parsing of files included from additional includes (Descriptors.h, Types.h...)
Remove _object
Remove _compute_enum_name
Remove _CopyBytesFunctionRegisterer
@HGuillemet (Collaborator, Author)

I removed the includes that depend on CUDA includes.
But, if needed, I guess there is a way to patch the include list depending on the availability of CUDA, like you did for libraries.

@saudet (Member) commented May 24, 2023

The build is still failing on Windows. Please fix this!

@HGuillemet (Collaborator, Author)

PyTorch tries to dlopen nvfuser_codegen at startup on my machine and issues a warning because it does not find it.
I added this library to the list of preloads; no warning anymore. Was that the right thing to do?

@saudet (Member) commented May 27, 2023 via email

@HGuillemet (Collaborator, Author) commented May 27, 2023

The answer is that I didn't remove anything. I started this before your update for PyTorch 2.0 and I didn't merge your changes.

Well, I did merge 2.0.0, but not 2.0.1.

@sbrunk (Contributor) commented Jul 22, 2023

I just built this locally to get a better estimate of how much I'll have to change when upgrading.

One thing I noticed is that sqrt for Tensor in global.torch is missing:

// aten::sqrt(Tensor self) -> Tensor
- @Namespace("at") public static native @ByVal Tensor sqrt(@Const @ByRef Tensor self);

Instead we have the complex variants:

@Namespace("c10_complex_math") public static native @ByVal @Name("sqrt<float>") FloatComplex sqrt(@Const @ByRef FloatComplex x);

@Namespace("c10_complex_math") public static native @ByVal @Name("sqrt<double>") DoubleComplex sqrt(@Const @ByRef DoubleComplex x);

Probably due to this mapping:

        infoMap.put(new Info("c10_complex_math::sqrt<float>").javaNames("sqrt"))
               .put(new Info("c10_complex_math::sqrt<double>").javaNames("sqrt"))

Is it possible to keep the original sqrt mapping somehow?

@saudet (Member) commented Jul 22, 2023

I don't think it's due to those, but if it works by adding one more for at::sqrt() as well, let's just do that?

@HGuillemet (Collaborator, Author)

complex_math.h is parsed before ATen/ops/sqrt.h, where at::sqrt is defined.
So when ATen/ops/sqrt.h is parsed, because of this loop, the qualified name of sqrt is resolved to the first matching cppName that has an Info — c10_complex_math::sqrt in this case.

I added an explicit Info for at::sqrt to prevent this, and I added the other missing complex math operators for good measure.
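The resulting preset configuration amounts to something like the following sketch (the exact Info entries in the presets may differ):

```java
// Keep at::sqrt mapped under its own name, while still renaming the
// complex variants. Without the first entry, "sqrt" resolves to
// c10_complex_math::sqrt because complex_math.h is parsed first.
infoMap.put(new Info("at::sqrt"))
       .put(new Info("c10_complex_math::sqrt<float>").javaNames("sqrt"))
       .put(new Info("c10_complex_math::sqrt<double>").javaNames("sqrt"));
```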

@sbrunk (Contributor) commented Jul 23, 2023

sqrt is working again. Thanks! I got Storch compiling now with minimal changes (just a few type names, additional parameters needed, etc.).

I'm seeing a bunch of crashes now when running our tests, especially in IndexingSlicingJoiningOpsSuite. I think they are starting to become a nice regression test suite for the native bindings. :)

I've tried to isolate the crashing ops. Most seem to take lists of tensors in some form, like stack or cat:

import org.bytedeco.pytorch.*;
global.torch.stack(
  new TensorArrayRef(
    new TensorVector(
      AbstractTensor.create(0),
      AbstractTensor.create(1)
    )
  ),
  0
).print();
[info] # C  [libtorch_cpu.so+0x1de6e4f]  at::_ops::stack::call(c10::ArrayRef<at::Tensor>, long)+0x9f

The same as above, cat instead of stack:

[info] # C  [libtorch_cpu.so+0x1cc0a2c]  c10::detail::MultiDispatchKeySet::operator()(c10::IListRef<at::Tensor>)+0xcc

Calling stack variants like dstack, column_stack and hstack also causes the crash. The other ops we're testing are all working fine.

@saudet (Member) commented Jul 23, 2023

Should we wait until that is fixed before merging this pull request?

@sbrunk (Contributor) commented Jul 23, 2023

From my side it's fine merging now and then fixing this in a separate PR.

I'd actually like to try with the CI snapshots to rule out that the issue is caused by me building locally.

@HGuillemet (Collaborator, Author)

I can reproduce your crash, so it's not related to your local build.
@saudet, please give me some minutes before merging in case I quickly find out the reason for this.

@HGuillemet (Collaborator, Author)

The ArrayRef constructor taking a Vector isn't mapped anymore, so the "pointer-cast" constructor is called instead in your case, causing the crash.
I'll try to re-add it, but you can also use the constructor taking a Tensor and a length instead.

@sbrunk (Contributor) commented Jul 23, 2023

So you mean something like this, relying on the fact that vectors are stored contiguously?

var vector = new TensorVector(AbstractTensor.create(0), AbstractTensor.create(1));
global.torch.stack(
  new TensorArrayRef(vector.front(), vector.size()),
  0
).print();

Seems to be working, I got all tests passing again. :)

@HGuillemet (Collaborator, Author)

Or something like:

Tensor t = new Tensor(2);
t.put(AbstractTensor.create(0));
t.position(1).put(AbstractTensor.create(1));
torch.stack(new TensorArrayRef(t.position(0), 2)).print();

sbrunk added a commit to sbrunk/storch that referenced this pull request Jul 23, 2023
@saudet saudet merged commit d370dbc into bytedeco:master Jul 25, 2023
6 checks passed
@HGuillemet HGuillemet deleted the hg_pytorch branch July 25, 2023 13:30
@sbrunk (Contributor) commented Jul 25, 2023

This is great! Thanks for these improvements @HGuillemet!
