Java API: Fix OrtSession.close() crashing in multi-threaded environments #4329
Conversation
My concern is that this is a lot of complexity to add into the system, for something that I'd really consider to be the application code's job (i.e. managing shared resources like an OrtSession). I'm a little worried that blocking on calls to close will be unexpected behaviour.
Given your use case I can understand why you would want the session to keep running after close has been called, but I don't think that's how any of the other language interfaces work. If I called close on the session I would expect it to kill all the running jobs and return, which I guess it does at the moment by causing it to crash. We could make that safe by adding an implicit RunOptions to all session.run calls, setting the terminate flag when close is called, and then returning from close once each terminate call has finished.
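The terminate-on-close idea described above could be sketched roughly as follows. This is a hedged, self-contained illustration with made-up names (TerminatingSession is not part of the ONNX Runtime API); it only models the control flow of an implicit terminate flag shared between close() and in-flight run() calls.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: run() calls share a terminate flag that close() sets,
// so close() can return once in-flight calls have observed it and unwound.
final class TerminatingSession {
    private final AtomicBoolean terminate = new AtomicBoolean(false);
    private final AtomicInteger inFlight = new AtomicInteger(0);

    /** Stand-in for a native run call; checks the shared terminate flag. */
    public String run(String input) {
        inFlight.incrementAndGet();
        try {
            if (terminate.get()) {
                throw new IllegalStateException("session terminated");
            }
            return "result:" + input; // pretend native inference happened here
        } finally {
            inFlight.decrementAndGet();
        }
    }

    /** Sets the terminate flag, then waits until all run() calls return. */
    public void close() {
        terminate.set(true);
        while (inFlight.get() > 0) {
            Thread.onSpinWait();
        }
    }

    public boolean isTerminated() {
        return terminate.get();
    }
}
```

Under this design, a run() call that starts after close() fails fast instead of segfaulting, and close() only blocks for as long as already-started calls take to notice the flag.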
I made a few comments about the changes, mostly minor things like reformatting of comments which changes the meaning.
I'll let people from MS comment on the overall idea.
@@ -104,12 +94,12 @@ public static OnnxMapValueType mapFromOnnxJavaType(OnnxJavaType type) {
 * <p>Called from native code.
 *
 * @param nativeHandle The reference to the native map object.
 * @param allocatorHandle The reference to the allocator that created the map.
 * @param allocator The allocator used to when extracting data from this object.
"The allocator used to when" -> "The allocator used when"
public static class SessionOptions implements AutoCloseable {
public static class SessionOptions extends NativeObject {

static final OrtSession.SessionOptions DEFAULT_SESSION_OPTIONS =
Is this a good idea? I think the session options isn't designed to be shared between different sessions. I realise the default one doesn't seem to currently hold any shared resources, but that's not to say it won't in the future, and such a change wouldn't be visible from the Java API.
Yeah I'm fine creating a new session options each time, since they are very minimal.
 * The order they are called indicates the preference order as well. In other words call this method
 * on your most preferred execution provider first followed by the less preferred ones.
 * If none are called Ort will use its internal CPU execution provider.
 * functions to enable them in the session: OrtSessionOptionsAppendExecutionProvider_CPU
The reformatting broke this list here, please revert it.
(void) jobj; // Required JNI parameter not needed by functions which don't need to access their host object.
const OrtApi* api = (const OrtApi*) apiHandle;
OrtAllocator* allocator = (OrtAllocator*) allocatorHandle;
OrtSession* session = (OrtSession*) sessionHandle;
OrtRunOptions* runOptions = (OrtRunOptions*) runOptionsHandle;
jsize numInputs = (*jniEnv)->GetArrayLength(jniEnv, inputNamesArr);
Isn't it faster to pass in the longs rather than make an upcall into the JVM?
@Craigacp I believe I addressed your comments. The formatting changes were done automatically by the spotless plugin; I did change how they were formatted to convey the original layout better.

I do admit this is a little more complex, but I think it shields developers (Java devs especially) from the gnarly world of segmentation faults, JNI race conditions, and other nasty crashes. I'd much prefer to have some exceptions or blocking closes rather than whole-process crashes. I would consider TensorFlow to be similar to this project, and it has this behaviour.

I'm going to think about the RunOptions idea; I think it could work in conjunction with what I currently have. What do you think of adding a terminate() (or similarly named) method to OrtSession which would do what you mentioned? So if somebody was in a multithreaded environment they would call session.terminate() and then session.close(), and theoretically the close() would not block.
Hmm, it's weird that spotless wasn't changing the format when I ran it. Are you using a different one to the one pulled in by gradle? The fixes for my comments look good, thanks.

I agree exceptions are vastly preferable to runtime crashes. I contribute to the Tensorflow Java API, and I'm not sure that TF has it all figured out either. For instance in the PR what happens if the second session is created using the same GPU as the first session? Does it cause contention, or does it crash? I'm reasonably sure TF will complain if there are two processes trying to use the same GPU and I'm not too sure how graceful it is about it.

I think documenting the behaviour and then leaving it to the user to manage is probably better, rather than inducing long blocking calls, but I suppose the blocking calls are just a backup. I think I'd prefer that closing an OrtSession implicitly called terminate on the running calls, but I accept that in a long running service that's less than ideal as you'd drop some requests on the floor. Maybe we could have a flag on OrtSession which controls if close terminates running inferences or blocks?
I have been running the spotless task pulled in by gradle.

I added an OrtSession.terminate() method which can be used optionally in cases when a developer is in a multi-threaded environment and they want the subsequent call to OrtSession.close() to not block. I tried to document this. I used your idea of using a default RunOptions.

It looks like TF does block on close() in the same manner as this implementation. In both cases the blocking is only done when there are extant usages at the time of closing, specifically in multi-threaded cases. Single-threaded developers need not worry about blocking. I added some logging to indicate a wait is in progress.

Unrelated, but I also reduced the number of JNI calls done on session creation, which seemed redundant. If there was a specific reason that the quantity, names, and NodeInfos are 3 different JNI calls, let me know and I can leave it alone.
I think @saudet has made all the points that I neglected to make clear so I do not need to add anything in principle.
Whether or not one disagrees with the value of a segmentation fault in a particular setting, it is an unchangeable feature of the operating system. Any change we make should be aimed at correctness, not at running away from it. We certainly cannot afford to write code that consumes CPU cycles merely to account for different levels of coding maturity.
I've changed the nomenclature to stay away from reference counting. Here is a gist with a very barebones multithreaded environment: https://gist.github.com/yuzawa-san/e12b16a4aac7dc7bec2a94c4b8c2da89 This patch fixes the crash, so I'll update the title to reflect this.
@yuslepukhin I agree that this is more about managing object lifetimes than thread safety, but it turns out the major issue in managing the object lifetimes is sharing an object between threads (in this case the OrtSession) and having one thread accidentally close it out from under you. Everything else in the API pretty much just works when used by multiple threads, provided those threads don't call close.

try-with-resources isn't the best solution in multi-threaded code like this, especially when the object is long lived (e.g. OrtSession), but it's all we've got, which is why everything is already AutoCloseable. This PR adds another layer which alters the behaviour when closing OrtSession if another thread has a run call in flight, which prevents a JVM crash.

Preventing JVM crashes is definitely worthwhile, as unlike C# the JVM doesn't really have any operations that can cause a segfault other than JNI, so Java developers are not accustomed to seeing them.
What about doing like @mcimadamore suggests on Panama's mailing list? I'm guessing something like this:

```java
OrtSession s = ...;
s.withHandle((SessionHandle sh) -> {
    sh.run(input);
});
s.close(); // will fail (or wait) if there are any pending handles
```

The point is the OrtSession cannot be used without a handle, so all processing has to be done within the closure.
Some observations:
I guess it is just a different way of thinking about it, but I don't know if it is worth the complexity. Also what if you want to make MyResult immutable?
I agree with you: definitely interesting, but I'm not fully sold. Honestly, I have not examined Panama in close detail, but I think the closure idea could work in certain cases. If you or somebody else want to do closures here in an elegant and proper way, that could be a future PR, but I think I'll hold off for now.
@yuzawa-san Don't misunderstand, I'm not recommending to force that kind of API on users, but it does work, for some things, just like try-with-resources works, for some things, and garbage collection works, for some things, and reference counting works, for some things (but there is no magic bullet that works for everything, and Panama isn't really interested in offering one either, so don't set your hopes on that front either). There is no way however to make reference counting thread-safe, unless we follow some magic enchantment protocol, so it doesn't make sense to force users into an API like that and tell them that it's thread-safe! It doesn't offer enough to be worth forcing users to use these things, and that's the point.

With the JavaCPP wrapped C/C++ API, it's very easy to implement this kind of resource management on top, something like this:

```java
class SessionHolder implements AutoCloseable {
    Session session = ...;
    List<Session> list = ...;
    Lock lock = ...;

    public void withSession(SessionClosure closure) {
        Session s;
        lock.lock();
        try {
            if (session == null) {
                throw new IllegalStateException("holder is closed");
            }
            s = new Session(session); // only copies the pointer address
            list.add(s);
        } finally {
            lock.unlock();
        }
        try {
            closure.run(s);
        } finally {
            lock.lock();
            try {
                list.remove(s);
                s.setNull();
            } finally {
                lock.unlock();
            }
        }
    }

    public void close() {
        lock.lock();
        try {
            if (!list.isEmpty()) {
                // wait somehow or throw an exception
            }
            session.deallocate();
            session = null;
        } finally {
            lock.unlock();
        }
    }
}

SessionHolder holder = new SessionHolder(...);
holder.withSession((Session s) -> {
    s.run(input);
});
holder.close(); // fails (or waits) if there are any pending handles
```

And we don't need to add that to ONNX Runtime or anything. Users can already do that or anything else on their own just fine. JavaCPP doesn't force anyone into doing things in one specific way.

@Craigacp This is the kind of thing I'm talking about. You're offering users an API that isn't actually usable, and forcing them to open pull requests like this that are getting rejected. Why not offer them something usable, that they can build upon?
@saudet The API is perfectly usable (because we and others are using it in production); this patch fixes an issue that I hadn't hit because I use the session in a try-with-resources and don't attempt to close it while run calls are in flight, and it doesn't require adding large numbers of locks or inducing new dependencies. I don't think exposing the closure version to users is particularly helpful; it's too low level. The purpose of reviews is to discuss things, and hopefully we're still making progress on that front here. I'm not sure that presupposing the outcome of the review when you haven't contributed anything to the project is particularly helpful.

I think the reference counting approach is sufficiently hidden that it will work in the OrtSession and prevent the crashes. It's not really counting references, it's counting the number of methods in flight using the native resource, and it's completely hidden from the user of the Java API. For the similar issues in OnnxValue subclasses I think making the relevant methods synchronized is sufficient, as they are immutable and it's not necessary to have concurrent access to their
@saudet please check your facts. The benchmark you refer to is a year old, and is using a completely different Panama architecture (and API) compared to the one we are using now, for both memory and foreign function access. Out of curiosity I put together a small (unscientific) benchmark which does something similar to the one you linked (but using Linux, I used
We could debate the nuances of the internal implementation (which was done prior to my arrival) ad infinitum, and yes, there are philosophical differences between segfaults in C/C++-world and the somewhat-equivalent NullPointerException in Java that could be re-litigated, but I would like to get this underlying bug fixed, as it is a blocker to my (and likely others') future use of this API. The main objective here was to fix the defect shown here, which @Craigacp mentioned they had overlooked in the original implementation: https://gist.github.com/yuzawa-san/e12b16a4aac7dc7bec2a94c4b8c2da89 I did not alter the public-facing Java API with the exception of adding the OrtSession.terminate() method, which I documented. What are the next steps to getting this approved?
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 5 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
@Craigacp @yuslepukhin - is this change good to go?
I have been trying to contribute, but with your authority here as a representative of Oracle you are preventing any potential contributions that I have to offer. We need something like JavaCPP for performance reasons and to interoperate with other libraries like OpenCV and TensorFlow. (BTW, you do know that OpenCV is probably the second most used AI-related framework after TensorFlow, right? Why isn't Oracle doing anything about that one?) The only reasons I have ever heard from you about why you are refusing to consider using (or even creating yourself) something similar to JavaCPP, is what you said above, but you provide no specifics:
Where exactly is the build "taken over"? What is "overly complex" about a ~100-line Bash script and a Java config file? What additional dependencies? How do you want to package the binaries to have them share stuff with OpenCV and TensorFlow? The way this sounds to me is you're saying that everything the Python community does with tools like conda, pip, setuptools, etc. is wrong, and Java developers shouldn't use tools like that. If that's not what you mean, then at least please offer an alternative! Remember that you're speaking for Oracle here.
That's exactly the problem: it's all hidden, and it prevents users from doing things in whatever way they may want to do them!
That's great news! Thanks for the update. I'm looking forward to the day when Panama actually stops using JNI though.
It's not just a bug fix (a bug fix would just throw an exception or something from somewhere); it forces users to adopt a way of doing things that purports to be thread-safe, but that really isn't, and that will slow down processing. If there were a way to disable all of this, I would be fine with these changes though. Is there a way to disable all of this? (I also think that kind of feature should be optional in TensorFlow. I'll probably get to it when I need to hack on that there. The developer that used to work at Google on this API wasn't an expert in Java and I don't think his code got reviewed.)
@saudet As I've told you several times before, I work in Oracle Labs, and do not speak for the company or the Java org in any way. In this particular case I'm speaking as the person who wrote the Java API for this project. I've not written anything for OpenCV for two reasons: one, there is already a Java API, and two, I don't work on computer vision. I'm unclear why the ORT library would need to interop with Tensorflow. If you want to use a TF model then export it in onnx format and compose it using the mechanisms already available. If there are performance problems that need addressing then open an issue with some numbers and I'll take a look.
Using JavaCPP requires having JavaCPP build the native layer, opting into JavaCPP's mechanisms for binary distribution, and you need to get the right set of dependencies. As we've found out with the recent issues in the TF Java project, the way that JavaCPP works can be confusing for some developers and it can be tricky to figure out the dependencies. I don't think it's worthwhile to switch to something more complex when we have something that works and is simpler for users of the library. None of your points are relevant to this discussion, and adopting JavaCPP wouldn't fix the race condition between object use and closing here, given it's essentially a problem with the C API for the library (as calling the C API's close method while there is a run call in flight shouldn't segfault the process). |
I took a look at the PR and the scenario. While the scenario is valid, I do not feel that this specific user scenario's complexity should be pushed down to the API level. A simple inference would have to pay for all the refcounting and synchronization mechanisms.
synchronized (handleLock) {
    if (closed) {
        return;
    }
This should throw some exception that clearly tells that things are going wrong. Somehow, a thread is operating on an object that was already destroyed, but it thinks it has a valid handle on it. With properly implemented refcounting this should never happen, as it looks like a thread is holding a reference to the object but the object is destroyed anyway. This condition should prompt an assert or a non-recoverable exception, which is what the segfault is currently doing for you.
Refcounting or not, each native object should have a proper and clear owner, and judging by this piece of code, you do not expect that refcounting would actually save you. Why introduce it?
You can wrap a session object into your refcounted class and save other users from bearing the expense of it.
The close() method is called by developers. It is within reason that somebody may close the object more than once, and we do not really care since the object was closed already. An exception is thrown if the session is used again (like if they use run() again).
The active usage (NOT reference) counting is beside the point here, since that is done in a private implementation hidden from the users and with try-with-resources.
close() is usually the last thing that is called on the object. Since the calling code seems to be unaware that the underlying native object is no longer there, it can call all sorts of functions on it before close(). By this reasoning, one would need to check if the object is still around in every interface, which is not being done. So it would still lead you to a segfault, would it not? Discovering that the object is gone at the time of close() does nothing to save you if the program continues to be invalid. If there is not a clear ownership and the refcounting is not implemented right, such checks actually do nothing. If two threads are trying to close the object but the refcount is 1, both would pass the check, and they would happily destroy the object twice, thereby corrupting memory. And you already know where that leads: SEGFAULT.
Let me summarize: this PR introduces atomics and locks, both of which impose additional usage requirements and performance costs on other users for the sake of your usage scenario. However, I feel that the additional overhead and complexity should be shifted to your code.
On another note, there are interdependencies between some objects. You will see in the next release the OrtAllocator API being exposed, along with IOBinding. Those depend on Session. Session depends on Environment. Refcounting them individually would not make sense.
The docs for Java's AutoCloseable interface say:
Note that unlike the close method of Closeable, this close method is not required to be idempotent. In other words, calling this close method more than once may have some visible side effect, unlike Closeable.close which is required to have no effect if called more than once. However, implementers of this interface are strongly encouraged to make their close methods idempotent.
So having it accept multiple calls to close is idiomatic.
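For illustration, an idempotent close() along the lines encouraged by that documentation might look like this minimal sketch (the class and methods are hypothetical, not the PR's actual code; the counter stands in for the real native release):

```java
// Minimal sketch of an idempotent AutoCloseable: the first close() releases
// the resource, later calls are silent no-ops, and other methods fail fast
// once the resource is gone.
final class IdempotentResource implements AutoCloseable {
    private boolean closed = false;
    private int closeCount = 0;

    /** Any normal operation must fail once the resource is closed. */
    public synchronized void use() {
        if (closed) {
            throw new IllegalStateException("resource is closed");
        }
    }

    @Override
    public synchronized void close() {
        if (closed) {
            return; // second and later calls have no visible effect
        }
        closed = true;
        closeCount++; // stands in for the real native release
    }

    synchronized int closeCount() {
        return closeCount;
    }
}
```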
However, the upcoming changes to allocators and IOBinding do make me worried we'll need a bigger solution to this. The allocators were originally exposed and had a more complicated tracking mechanism, but during the initial integration of the Java API we decided that it wasn't necessary, and lots of complexity was removed by making the allocators an internal detail.
The initial allocator exposed was a default CPU allocator. It was a singleton that never required destruction.
The allocators obtained with the new API CreateAllocator() are wrapping instances that require a ReleaseAllocator() call when no longer needed. Those refer to the internal Session allocators.
The internal session allocators are tied to the execution providers. Those depend on two factors: 1) whether the provider is present in the binary, 2) whether it was added in the session options.
For example, if you have a build with CPU and CUDA, you specify them in the SessionOptions and create a Session. Then you create an instance of OrtMemoryInfo that has all the characteristics of the allocator required. (Note, this is also a native Closeable/Disposable class.) Using that you call CreateAllocator(), which internally checks if the requested allocator is present within the session. It would then create a wrapper and return it. You now can allocate memory on the device, which returns an OrtMemoryAllocation instance.
This is useful in some scenarios:
1) You do not want to reallocate memory for input/output and just re-use a pre-allocated chunk, progressively specifying bigger shapes but re-using the same chunk.
2) You want to feed output back to input, so you just re-bind input/output trading places.
3) You have a dynamically shaped output. In this case you cannot pre-allocate, but by using OrtMemoryInfo you can specify where to allocate and it will do so. This is especially useful if you need it to be on the device or you want to feed input that is already allocated, say on CUDA, so no copying to host memory is needed.
In cases 1 and 2 the caller owns the memory; in case 3, ONNX Runtime owns it (the returned OrtValue does).
> You have a dynamically shaped output. In this case you cannot pre-allocate, but by using OrtMemoryInfo you can specify where to allocate and it will do so. This is especially useful if you need it to be on the device or you want to feed input that is already allocated, say on CUDA, so no copying to host memory is needed. In cases 1 and 2 the caller owns the memory; in case 3, ONNX Runtime owns it (the returned OrtValue does).
@Craigacp Here you go, you should consider your Java API not supporting CUDA memory a "performance problem". No need to run benchmarks to know that copying data around unnecessarily is going to be slow. And, no, the official Java API of OpenCV doesn't support CUDA, so even if you figure out something for ORT, it's still going to be a problem! Why do you refuse to recognize the importance of this? I'm not saying that JavaCPP is perfect, but at least I'm trying to do something about it, while you are not.
Ok. I'll work up a PR in a few weeks that starts to incorporate this. I need to expose the getAvailableSessions call anyway and I think there are a couple of other C API features that landed while I was distracted working on Tribuo. Tracking the allocators might be a little tricky, as sharing them across sessions adds yet another layer of things to be held in try-with-resources. Are the new allocators thread safe?
@saudet these are in the context of the IOBinding work happening elsewhere in the ORT project. It allows the use of ORT allocators to allocate memory for specific devices and then bind the inputs to that; the allocation is, as far as I can tell, performed by ORT, just in a separate call from the call to run. I do not know if it's possible to pass in a slab of GPU memory allocated by some other system using either the C or C++ APIs of ORT.
- Yes, the allocators are thread-safe. They are the same ones used by the session object, and the session object supports concurrent execution by multiple threads.
- Yes, it is possible to bind inputs/outputs to a slab of memory allocated by someone else. When we bind a pre-allocated piece of memory, it is not owned by ORT or the corresponding OrtValue. One of the python binding tests makes use of CUDA memory allocated by PyTorch. The current C# API requires passing an OrtMemoryAllocation wrapping the chunk; that currently requires an allocator ptr. So I can see that in C# we could make it more convenient to supply 3rd party allocations.
Nope. We cannot accept this change.
I'm going to port the non-controversial code cleanups into a new PR.
@Craigacp FYI, we can easily integrate JavaCPP into other build systems, using it only to generate code. I've just done it successfully for Android Studio, and that uses Gradle to call CMake. Without any explanation of why writing JNI code manually is less "complex" than using a code generator that does all that for us, your statements are simply not true!
Description: This change makes all of the classes backed by native objects extend a new NativeObject class which properly handles the safe access and execution of native methods. Specifically, any native object needed for a specific native method must not be closed, and an object's close method waits until the last active usage is completed. Failure to do those things may result in a native method being invoked with a closed native handle, or a native close occurring during an active native execution. I was inspired by tensorflow's JNI: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/Graph.java#L359 Like their implementation, I utilized a lot of try-with-resources. I did not like the amount of locking in their implementation, so I decided to use reference counts using AtomicInteger to ensure consistency using CAS rather than synchronization. I also did some general cleanup:
Thread-Safety Description
Here is some pseudocode of what is going on in NativeObject:
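(The original pseudocode was lost in extraction; what follows is a hedged reconstruction based only on the description above: an AtomicInteger tracks in-flight native calls plus one slot for the open handle, updated with CAS instead of locks, and whoever drops the count to zero performs the single native release. All names are illustrative, not the PR's actual code.)

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the NativeObject idea: CAS-maintained usage count,
// idempotent close(), and exactly one native release.
abstract class NativeObjectSketch implements AutoCloseable {
    // in-flight native calls + 1 while the handle is open; 0 once released
    private final AtomicInteger useCount = new AtomicInteger(1);
    private final AtomicBoolean closed = new AtomicBoolean(false);

    /** Claim the handle before a native call; fails fast if already closed. */
    public final void acquire() {
        int c;
        do {
            c = useCount.get();
            if (c == 0) {
                throw new IllegalStateException("native object is closed");
            }
        } while (!useCount.compareAndSet(c, c + 1));
    }

    /** Release after a native call; the last release frees the handle. */
    public final void release() {
        if (useCount.decrementAndGet() == 0) {
            doClose();
        }
    }

    /** Idempotent: only the first call drops the handle's own count slot. */
    @Override
    public final void close() {
        if (closed.compareAndSet(false, true)) {
            release();
        }
    }

    /** Stands in for the JNI call that frees the native handle. */
    protected abstract void doClose();
}
```

In this sketch, if close() races with an in-flight call, the native release is deferred to whichever side finishes last, so the native handle is never freed while a call is using it.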
Motivation and Context