Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PyTorch] Update to 2.2.1 and minor changes #1466

Merged
merged 24 commits into from
Mar 3, 2024

Conversation

HGuillemet
Copy link
Collaborator

@HGuillemet HGuillemet commented Feb 2, 2024

Work in Progress

Included in this PR:

  • Update to PyTorch 2.2.1
  • Restore ExampleStack and TensorExampleStack constructors
  • Generate more overloads for methods taking an array ref
  • Add parsing of CUDAFunctions.h (wrappers around commonly used CUDA API functions)
  • Add AOTInductor (new way to run models exported from Python)
  • Virtualize FunctionPreHook and FunctionPostHook to enable setting hooks on autograd graph
  • Passing a vector of tensors, and some other classes, by value does not remove the data anymore
  • Add Module.asXXX to test if a Module is of a specific subclass (and do the cast).
  • Add new AMD Float8 types Float8_e5m2fnuz and Float8_e4m3fnuz with unsigned zero
  • Add preload of nvrtc-builtins (fixes Problems deploying pytorch 2.1.2-1.5.10 #1468)

@HGuillemet HGuillemet marked this pull request as draft February 2, 2024 09:57
@sbrunk
Copy link
Contributor

sbrunk commented Feb 2, 2024

@HGuillemet thanks for the update and especially adding the AOTInductor mapping. I think that's an interesting new variant to be able to use the new optimizations at least for inference. Training still needs to happen in Python in that case but we can export it to a C++ lib and then use that from Java with the bindings.

Have you been able to try it already?I'll give it a try myself, just curious if you got something working already.

@HGuillemet
Copy link
Collaborator Author

HGuillemet commented Feb 2, 2024

No I haven't, yet.
I'm training from Java, so there is little chance I use it in the near future. I must rely on Torchscript when I need to import python models, and not a lot of people bother to write Torchscript compatible models :(

@HGuillemet
Copy link
Collaborator Author

I don't understand the linking error on Windows:

jnitorch_cuda.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) public: __cdecl torch::inductor::AOTIModelContainerRunnerCuda::~AOTIModelContainerRunnerCuda(void)" (__imp_??1AOTIModelContainerRunnerCuda@inductor@torch@@QEAA@XZ)
jnitorch_cuda.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) public: __cdecl torch::inductor::AOTIModelContainerRunnerCuda::AOTIModelContainerRunnerCuda(char const *,unsigned __int64,char const *)" (__imp_??0AOTIModelContainerRunnerCuda@inductor@torch@@QEAA@PEBD_K0@Z)

AOTIModelContainerRunnerCuda constructor is defined in aoti_model_container_runner_cuda.h which is included from jnitorch_cuda.cpp and no destructor is declared anywhere for this class.
Any idea ?
@saudet, could this be related to ccache and, if yes, could you clear the cache ?

@saudet
Copy link
Member

saudet commented Feb 6, 2024

There's probably some template somewhere that requires them. You'll probably get the same error on Linux and Mac if you try to link with -Wl,--no-undefined, so try to fix the errors you get with that, and it should fix those errors on Windows too.

@HGuillemet
Copy link
Collaborator Author

Thanks for the suggestion. Adding the linker option raised an error about cudnn not linked to jnitorch_cuda. Let's see but I doubt it's related to the error on windows.

@HGuillemet
Copy link
Collaborator Author

It seems it has been spotted and fixed upstream in pytorch/pytorch@79ba397 after 2.2.0 release.
I guess we'd better postpone the inclusion of the AOTInductor feature to next release.

@HGuillemet
Copy link
Collaborator Author

HGuillemet commented Feb 7, 2024

To enable the setting of hooks in autograd graphs, I need to virtualize FunctionPreHook and FunctionPostHook, which have a virtual method taking a ref to a vector of tensors and returning a vector of tensors. Compilation passes only if I remove the valueTypes in this info:

new Info("std::vector<torch::Tensor>", "std::vector<at::Tensor>", "std::vector<torch::autograd::Variable>", "torch::autograd::variable_list")
  .valueTypes("@Cast({\"\", \"std::vector<torch::Tensor>\"}) @StdMove TensorVector")
  .pointerTypes("TensorVector").define())

I wonder why this valueTypes is here.
It will save a copy when we pass a vector of tensors to a native functions but OTOH it will destroy the vector, while the user could need it after the function call.
If I understand well, if a native function takes a rvalue ref (&&) , parser will generate @ByRef(true) which is enough to avoid copies.
@saudet could you share your infinite knowledge about this point ?
Could it break something if I remove the valueTypes ? First attempts seem to show it does not.

There are some other types with this kind of @Cast @StdMove value types (DataPtr, Storage, TensorMaybeOwned, TensorBaseMaybeOwned, TensorName, EdgeVector)

@saudet
Copy link
Member

saudet commented Feb 8, 2024

If you're not getting any compile errors, then I guess PyTorch's API was improved so that we don't need them anymore, yes

@HGuillemet
Copy link
Collaborator Author

@sbrunk could you run your tests on this PR ?
Anything you'd like to be added ?

@HGuillemet HGuillemet marked this pull request as ready for review February 11, 2024 20:37
@saudet saudet requested a review from sbrunk February 12, 2024 01:40
@sbrunk
Copy link
Contributor

sbrunk commented Feb 12, 2024

@sbrunk could you run your tests on this PR ? Anything you'd like to be added ?

Tests are looking good!

@saudet
Copy link
Member

saudet commented Mar 1, 2024

If you're not planning on making more changes for now, we can merge this?

@HGuillemet
Copy link
Collaborator Author

Ok for me.

@HGuillemet HGuillemet changed the title [PyTorch] WIP [PyTorch] Update to 2.2.1 and minor changes Mar 2, 2024
@saudet saudet merged commit 575a44e into bytedeco:master Mar 3, 2024
7 checks passed
@saudet
Copy link
Member

saudet commented Mar 5, 2024

2024-03-04T11:44:04.1783504Z Caused by: java.io.IOException: No space left on device

Could you try to fix this? We probably just need to uninstall a couple of large unnecessary packages...

@HGuillemet
Copy link
Collaborator Author

I had already added a bunch of rm in deploy-ubuntu once, on downloaded archives after there installation, and you reverted that.
I can try to add them again, that seems the easiest and fastest way to make room.
In a new PR ?

@saudet
Copy link
Member

saudet commented Mar 5, 2024

Really? Could you point me to that revert and I'll try it on the actions branch here

@HGuillemet
Copy link
Collaborator Author

I'm seeing it was on deploy-centos, in fact:
3e3fe5c

I'm reviewing deploy-ubuntu and adding similar cleanup. Shall I push the commit here or on a new PR ?

@saudet
Copy link
Member

saudet commented Mar 5, 2024

Ah, that won't be enough. We'll probably need to remove a lot more stuff. You can try it here, but we won't know if it works until actual deploy, so I don't know. Let's check how many more GB we can with df -h I guess for now is good indication

@HGuillemet
Copy link
Collaborator Author

I pushed aaa37a1 on my branch but it doesn't update this PR now that it's merged.

Anyway, this commit will indeed only save ~ 700Mb for pytorch build, due to mkl archive removal.

If it's not enough, what about a maven clean phase on main artifact after its deploy phase and before the deploy phase of the platform/ext artifact ? This should get rid of cppbuild directory (about 7 or 8G)

@saudet
Copy link
Member

saudet commented Mar 5, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems deploying pytorch 2.1.2-1.5.10
3 participants