
Java API: Fix OrtSession.close() crashing in multi-threaded environments #4329

Closed

wants to merge 35 commits
Conversation

yuzawa-san
Contributor

Description: This change makes all of the classes backed by native objects extend a new NativeObject class which properly handles the safe access and execution of native methods. Specifically, any native object needed for a specific native method must not be closed. Additionally, an object's close method now waits until the last usage of any active reference is completed. Failure to do those things may result in a native method being invoked with a closed native handle, or in a native close occurring during an active native execution. I was inspired by TensorFlow's JNI: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/Graph.java#L359 Like their implementation, I utilized a lot of try-with-resources. I did not like the amount of locking in their implementation, so I decided to track reference counts with AtomicInteger, ensuring consistency via CAS rather than synchronization. I also did some general cleanup:

  • Immutable maps and collections.
  • Reading session information only at construction time.
  • Pre-sizing maps and collections.
  • Update gitignore for IDE files.
  • Update javadoc config to only show public methods.
  • Cleaned up default allocator and session options in OrtEnvironment. This did require moving some OrtAllocator-specific code in the C code to a different file.

Thread-Safety Description

Here is some pseudocode of what is going on in NativeObject:

class NativeObject {
  constructor() {
    increment referenceCount
  }

  close() {
    if already closed, do not close again
    decrement referenceCount
    if the new value is greater than zero, some other thread is still using a reference (at least one NativeReference has not been closed), so wait for that last live reference to notify this thread before proceeding
    else the new value is less than or equal to zero, so there are no active references (all NativeReference instances have been closed) and there is no need to wait
    free up native resources
  }

  class NativeReference {
    constructor() {
      increment referenceCount
      if the old value was less than or equal to zero, the NativeObject was already closed: call release() and throw an exception. release() must be called to restore the old value by decrementing referenceCount, and the exception means close() will not be called.
      else proceed as normal; release() will be called from close()
    }

    release() {
      decrement referenceCount
      if the new value is zero, this was the last active reference, so notify the wait in NativeObject.close()
      else do nothing, since NativeObject is still alive
    }

    close() {
      release() if this NativeReference has not already been closed
    }
  }
}
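The pseudocode above can be sketched as a small concrete Java class. This is a minimal illustration, not the actual PR code: it uses AtomicInteger for the CAS-based count as described, a CountDownLatch in place of wait/notify, and a hypothetical doClose() hook standing in for freeing the native handle.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the scheme described above; names are illustrative.
abstract class NativeObject implements AutoCloseable {
  private final AtomicInteger referenceCount = new AtomicInteger(1);
  private final CountDownLatch lastReferenceReleased = new CountDownLatch(1);
  private boolean closed = false;

  @Override
  public synchronized void close() {
    if (closed) {
      return; // do not close again
    }
    closed = true;
    if (referenceCount.decrementAndGet() > 0) {
      try {
        // some other thread still holds a live NativeReference: wait for it
        lastReferenceReleased.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    doClose(); // free up native resources
  }

  protected abstract void doClose();

  final NativeReference use() {
    return new NativeReference();
  }

  final class NativeReference implements AutoCloseable {
    private boolean released = false;

    private NativeReference() {
      if (referenceCount.getAndIncrement() <= 0) {
        release(); // restore the count before bailing out
        throw new IllegalStateException("native object has been closed");
      }
    }

    private void release() {
      if (referenceCount.decrementAndGet() == 0) {
        lastReferenceReleased.countDown(); // wake the waiting close()
      }
    }

    @Override
    public void close() {
      if (!released) {
        released = true;
        release();
      }
    }
  }
}
```

A native method would then check out a NativeReference in a try-with-resources for the duration of the native call, so close() on another thread blocks until the call completes.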

Motivation and Context

  • I intend to use onnxruntime in a multithreaded environment. The session is created in one thread, then a bunch of other threads see the reference to the OrtSession and do model evaluations in those other threads. Periodically, a new model is loaded and a new OrtSession is created. The reference to the active session is swapped out for the new OrtSession, and then the old OrtSession is closed. However, the current codebase does not handle this situation gracefully or safely. Inferences could still be occurring in the run() method when the OrtSession.close() method is invoked. Depending on the interleaving across threads this could produce a segmentation fault. At the minimum, any objects used after closure should throw an exception. And more importantly, the close() method should be able to block until the last usage is complete, while simultaneously locking out any new attempts to use the object.
  • I also updated the OrtEnvironment semantics. In most cases a single environment is created and used for the duration of the application, but I did add the ability to close the OrtEnvironment. In that case, a subsequent call to getEnvironment() would not return the existing environment, but rather a new one.
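The hot-swap scenario in the first bullet can be sketched with stdlib types only. This is a generic illustration of the usage pattern, not onnxruntime code (with onnxruntime, T would be OrtSession): worker threads read the current resource while a loader thread swaps in a replacement and closes the old one.

```java
import java.util.concurrent.atomic.AtomicReference;

// Generic sketch of the model-reload pattern described above.
final class HotSwap<T extends AutoCloseable> {
  private final AtomicReference<T> active;

  HotSwap(T initial) {
    this.active = new AtomicReference<>(initial);
  }

  // Worker threads grab the current resource and call into it.
  T current() {
    return active.get();
  }

  // Loader thread swaps in a replacement and closes the old instance.
  // This close is the call that must not pull the native handle out from
  // under a concurrent run() on another thread.
  void reload(T replacement) throws Exception {
    T old = active.getAndSet(replacement);
    old.close();
  }
}
```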

@yuzawa-san yuzawa-san requested a review from a team as a code owner June 24, 2020 22:34
Contributor

@Craigacp Craigacp left a comment


My concern is that this is a lot of complexity to add into the system, for something that I'd really consider to be the application code's job (i.e. managing shared resources like an OrtSession). I'm a little worried that blocking on calls to close will be unexpected behaviour.

Given your use case I can understand why you would want the session to keep running after close has been called, but I don't think that's how any of the other language interfaces work. If I called close on the session I would expect it to kill all the running jobs and return, which I guess it does at the moment by causing it to crash. We could make that safe by adding an implicit RunOptions to all session.run calls, setting the terminate flag when close is called, and then returning from close once each terminate call has finished.
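The terminate-flag idea above can be sketched with stdlib types only. This is an illustration of the mechanism, not the onnxruntime API: a shared flag plays the role the suggested implicit RunOptions terminate flag would play, so close() makes in-flight and subsequent run() calls fail fast instead of racing with the native free. Class and method names here are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stdlib-only sketch of "close() terminates running calls".
final class TerminatingSession implements AutoCloseable {
  private final AtomicBoolean terminated = new AtomicBoolean(false);

  String run(String input) {
    if (terminated.get()) {
      throw new IllegalStateException("session terminated");
    }
    return "result:" + input; // stand-in for the native inference call
  }

  @Override
  public void close() {
    terminated.set(true); // runs now fail fast rather than segfault
  }
}
```

In the real implementation the native run would also need to observe the flag at safe points, which is what a RunOptions terminate flag provides.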

I made a few comments about the changes, mostly minor things like reformatting of comments which changes the meaning.

I'll let people from MS comment on the overall idea.

@@ -104,12 +94,12 @@ public static OnnxMapValueType mapFromOnnxJavaType(OnnxJavaType type) {
* <p>Called from native code.
*
* @param nativeHandle The reference to the native map object.
* @param allocatorHandle The reference to the allocator that created the map.
* @param allocator The allocator used to when extracting data from this object.
Contributor

"The allocator used to when" -> "The allocator used when"

public static class SessionOptions implements AutoCloseable {
public static class SessionOptions extends NativeObject {

static final OrtSession.SessionOptions DEFAULT_SESSION_OPTIONS =
Contributor

Is this a good idea? I think the session options isn't designed to be shared between different sessions. I realise the default one doesn't seem to currently hold any shared resources, but that's not to say it won't in the future, and such a change wouldn't be visible from the Java API.

Contributor Author

Yeah I'm fine creating a new session options each time, since they are very minimal.

* The order they care called indicates the preference order as well. In other words call this method
* on your most preferred execution provider first followed by the less preferred ones.
* If none are called Ort will use its internal CPU execution provider.
* functions to enable them in the session: OrtSessionOptionsAppendExecutionProvider_CPU
Contributor

The reformatting broke this list here, please revert it.

(void) jobj; // Required JNI parameter not needed by functions which don't need to access their host object.
const OrtApi* api = (const OrtApi*) apiHandle;
OrtAllocator* allocator = (OrtAllocator*) allocatorHandle;
OrtSession* session = (OrtSession*) sessionHandle;
OrtRunOptions* runOptions = (OrtRunOptions*) runOptionsHandle;
jsize numInputs = (*jniEnv)->GetArrayLength(jniEnv, inputNamesArr);
Contributor

Isn't it faster to pass in the longs rather than make an upcall into the JVM?

@yuzawa-san
Contributor Author

@Craigacp I believe I addressed your comments. The formatting changes were done automatically by the spotless plugin. I did change how they were formatted to convey the original layout better.

I do admit this is a little more complex, but I think it shields developers (Java devs especially) from the gnarly world of segmentation faults, JNI race conditions, and other nasty crashes. I'd much prefer to have some exceptions or blocking closes rather than whole process crashes. I would consider tensorflow to be similar to this project and they have it.

I'm going to think about the RunOptions idea. I think it could work in conjunction with what I currently have. What do you think of adding a terminate() (or similarly named) method to OrtSession which would do what you mentioned? So if somebody were in a multithreaded environment they would call session.terminate() and then session.close(), and theoretically the close() would not block.

@Craigacp
Contributor

Craigacp commented Jun 26, 2020

Hmm, it's weird that spotless wasn't changing the format when I ran it. Are you using a different one to the one pulled in by gradle? The fixes for my comments look good, thanks.

I agree exceptions are vastly preferable to runtime crashes. I contribute to the Tensorflow Java API, and I'm not sure that TF has it all figured out either. For instance in the PR what happens if the second session is created using the same GPU as the first session? Does it cause contention, or does it crash? I'm reasonably sure TF will complain if there are two processes trying to use the same GPU and I'm not too sure how graceful it is about it.

I think documenting the behaviour and then leaving it to the user to manage is probably better, rather than inducing long blocking calls, but I suppose the blocking calls are just a backup. I think I'd prefer that closing an OrtSession implicitly called terminate on the running calls, but I accept that in a long running service that's less than ideal as you'd drop some requests on the floor. Maybe we could have a flag on OrtSession which controls if close terminates running inferences or blocks?

@yuzawa-san
Contributor Author

yuzawa-san commented Jun 30, 2020

I have been running gradle's spotlessApply task to reformat.

I added an OrtSession.terminate() method which can be used optionally when a developer is in a multi-threaded environment and wants the subsequent call to OrtSession.close() to not block. I tried to document this. I used your idea of using a default RunOptions.

It looks like TF does block on close() in the same manner as this implementation. In both cases blocking only happens when there are extant usages at the time of closing, specifically in multi-threaded cases. Single-threaded developers need not worry about blocking. I added some logging to indicate when a wait is in progress.

Unrelated, but I also reduced the number of JNI calls done on session creation, which seemed redundant. If there was a specific reason that the quantity, names, and NodeInfos are three different JNI calls, let me know and I can leave it alone.

@yuslepukhin
Member

I think @saudet has made all the points that I neglected to make clear, so I do not need to add anything in principle.
Here is the main takeaway.

  1. Ref-counting as implemented has nothing to do with thread safety of either the native objects or their managed wrappers, so I suggest we change the title of the PR. It has to do with managing object life spans, which I am not convinced we need here. I can see we have the same attempt in a couple of classes in our C# interfaces; I will correct it in due time.

  2. AutoCloseable interfaces are good, since there is nothing else in Java that would provide timely release of native resources. There is a good question to ponder: is there any connection between garbage collection and native objects? The correct answer: there is none. We can run out of resources and native memory a thousand times and the garbage collector may still not run. Even if it did, it would not release native resources because it does not know anything about them.
    Another equally good question, since we are on the topic: what is a good use of finalizers? The correct answer is that there is NONE. They should not have existed in the first place. Occasionally, in a DEBUG setting, you can detect that an AutoCloseable object was not closed in a timely fashion, but that is not guaranteed. Knowing this type of use is more beneficial in an interview setting than in practice.
    Finalizers do not run at the time of garbage collection; they run, if ever, at a different point in time, and by then we are already out of native memory, out of native resource handles, and deadlocked if native synchronization objects are involved. Java lacks destructors, and the closest equivalent is try/finally, AKA AutoCloseable objects now (better late than never).

Whether one agrees with the value of a segmentation fault in a particular setting or not, it is an unchangeable feature of the operating system. Any change we make should be aimed at correctness, not at running away from it. We certainly cannot afford writing code that consumes CPU cycles merely to account for different levels of coding maturity.

@yuzawa-san
Contributor Author

I've changed the nomenclature to stay away from reference counting. Here is a gist with a very barebones multithreaded environment: https://gist.github.com/yuzawa-san/e12b16a4aac7dc7bec2a94c4b8c2da89 This patch fixes the crash, so I'll update the title to reflect this.

@yuzawa-san yuzawa-san changed the title Java API: Thread-safe Native Objects Java API: Fix OrtSession.close() crashing in multi-threaded environments Aug 3, 2020
@Craigacp
Contributor

Craigacp commented Aug 4, 2020

@yuslepukhin I agree that this is more about managing object lifetimes than thread safety, but it turns out the major issue for managing the object lifetimes is sharing an object between threads (in this case the OrtSession) and having one thread accidentally close it out from under you. Everything else in the API pretty much just works when used by multiple threads, provided those threads don't call close. try-with-resources isn't the best solution in multi-threaded code like this, especially when the object is long lived (e.g. OrtSession), but it's all we've got, which is why everything is already AutoCloseable. This PR adds another layer which alters the behaviour when closing OrtSession if another thread has a run call in flight, which prevents a JVM crash. Preventing JVM crashes is definitely worthwhile, as unlike C# the JVM doesn't really have any operations that can cause a segfault other than JNI, so Java developers are not accustomed to seeing them.

@saudet

saudet commented Aug 7, 2020

What about doing like @mcimadamore suggests on Panama's mailing list? I'm guessing something like this:

OrtSession s = ...
s.withHandle((SessionHandle sh) -> {
    sh.run(input);
});
s.close(); // will fail (or wait) if there are any pending handles

The point is the OrtSession cannot be used without a handle, so all processing has to be done within the closure.
I'm not sure I agree with the usefulness of this, but this is coming from a Java compiler architect :)

@yuzawa-san
Contributor Author

Some observations:

  • Try-with-resources and closures seem pretty similar. I guess with the former we cannot guarantee the caller will close the Closeable, whereas with the closure we can put whatever code we like in the implementation of withHandle to bookend the call to the body of the closure. However, I'm not super concerned about that, since I don't even want the everyday developer to know about checking out the handle. The NativeObject.use() which returns NativeUsage is package private, so developers would not check out a reference. That is, OrtSession.run() internally calls NativeObject.use() and uses a try-with-resources in there. One value that closures have is that they contractually force a certain usage, rather than the little "internal developers should use NativeUsage in a try-with-resources" javadoc I currently have, which puts it at the mercy of future code reviewers.
  • I was playing around and the closure case gets weird when you want to return from inside the closure. I suppose we could make something like that impossible and you could pass in a value to be populated in the closure:
final MyResult result = new ...
NativeObject s = ...
s.withHandle((Handle h) -> {
    byte[] rawResult = h.run(input);
    result.setValue(rawResult);
});
// versus
try (Handle h = use()) {
    return new MyResult(h.run(input));
}

I guess it is just a different way of thinking about it, but I don't know if it is worth the complexity. Also what if you want to make MyResult immutable?

  • Also the functional interface gets a little odd when you throw different types of exceptions from within. It seems dirty to add throws Exception to the functional interface, since it is overly broad and forces the caller to handle exceptions even when no such exceptions are present. Like what if h.run() throws OrtException and maybe new MyResult throws an IOException? Or what if the body never throws any exceptions the caller would have to handle? Or perhaps the implementation of withHandle could wrap everything in a subclass of RuntimeException, but then information is lost.
  • If you did allow the functional interface to return a value (or throw different exceptions), I guess you would have to consider the nested case where multiple handles are needed (like currently in OrtSession.run()): the return values would have to be propagated out and the throws clauses must be consistent.
  • I didn't want to break external API consistency for this library, but I guess given the infancy of this project maybe it could be an option.
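The exception-typing concern in the list above has a known workaround: make the thrown type a generic parameter of the functional interface, so the caller handles exactly the checked exceptions the body declares (and none at all when the body throws nothing, since an unconstrained exception type variable is inferred as RuntimeException). This is a hypothetical sketch; HandleBody and withHandle are illustrative names, and a String stands in for the native handle type.

```java
// Hypothetical exception-generic closure interface.
@FunctionalInterface
interface HandleBody<R, E extends Exception> {
  R apply(String handle) throws E;
}

final class Handles {
  // Acquire/release bookkeeping would bracket the body in a real version.
  static <R, E extends Exception> R withHandle(String handle, HandleBody<R, E> body) throws E {
    return body.apply(handle);
  }
}
```

A body that throws IOException forces the caller to catch only IOException; a body that throws nothing can be called with no catch block at all.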

I agree with you: definitely interesting, but I'm not fully sold. Honestly, I have not examined Panama in close detail, but I think the closure idea could work in certain cases. If you or somebody else want to do closures here in an elegant and proper way in a future PR, go for it, but I think I'll hold off for now.

@saudet

saudet commented Aug 8, 2020

@yuzawa-san Don't misunderstand, I'm not recommending to force that kind of API on users, but it does work, for some things, just like try-with-resources works, for some things, and garbage collection works, for some things, and reference counting works, for some things (but there is no magic bullet that works for everything, and Panama isn't really interested in offering one either, so don't set your hopes on that front either). There is no way however to make reference counting thread-safe, unless we follow some magic enchantment protocol, so it doesn't make sense to force users into an API like that and tell them that it's thread-safe! It doesn't offer enough to be worth forcing users to use these things, and that's the point. With the JavaCPP wrapped C/C++ API, it's very easy to implement this kind of resource management on top, something like this:

class SessionHolder {
    Session session = ...
    List<Session> list = ...
    Lock lock = ...

    public void withSession(SessionClosure closure) {
        Session s;
        lock.lock();
        try {
            if (session == null) {
                // throw some exception
            }
            s = new Session(session); // only copies the pointer address
            list.add(s);
        } finally {
            lock.unlock();
        }
        try {
            closure.run(s);
        } finally {
            lock.lock();
            try {
                list.remove(s);
                s.setNull();
            } finally {
                lock.unlock();
            }
        }
    }

    public void close() {
        lock.lock();
        try {
            if (!list.isEmpty()) {
                // wait somehow or throw exception
            }
            session.deallocate();
            session = null;
        } finally {
            lock.unlock();
        }
    }
}

SessionHolder holder = new SessionHolder(...);

holder.withSession((Session s) -> {
    s.run(input);
});
holder.close();

And we don't need to add that to ONNX Runtime or anything. Users can already do that or anything else on their own just fine. JavaCPP doesn't force anyone into doing things in one specific way.

@Craigacp This is the kind of thing I'm talking about. You're offering users an API that isn't actually usable, and forcing them to open pull requests like this that are getting rejected. Why not offer them something usable, that they can build upon?

@Craigacp
Contributor

Craigacp commented Aug 8, 2020

@saudet The API is perfectly usable (because we and others are using it in production), this patch fixes an issue that I hadn't hit because I use the session in a try-with-resources and don't attempt to close it while run calls are in flight, and it doesn't require adding large numbers of locks or introducing new dependencies. I don't think exposing the closure version to users is particularly helpful; it's too low level. The purpose of reviews is to discuss things, and hopefully we're still making progress on that front here. I'm not sure that you presupposing the outcome of the review when you haven't contributed anything to the project is particularly helpful.

I think the reference counting approach is sufficiently hidden that it will work in the OrtSession and prevent the crashes. It's not really counting references, it's counting the number of methods in flight using the native resource, and it's completely hidden from the user of the Java API. For the similar issues in OnnxValue subclasses then I think making the relevant methods synchronized is sufficient as they are immutable and it's not necessary to have concurrent access to their getValue() methods.
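The synchronized-accessor idea for the OnnxValue subclasses can be sketched as follows. This is an illustrative stand-in, not the actual OnnxValue code: a long field stands in for the native pointer, and making both the getter and close() synchronized on the instance keeps an extraction from overlapping a free, so a get after close fails with an exception instead of touching a dead handle.

```java
// Sketch of a synchronized immutable value wrapper; names are illustrative.
final class ValueWrapper implements AutoCloseable {
  private long nativeHandle; // stand-in for the real native pointer; 0 = freed

  ValueWrapper(long nativeHandle) {
    this.nativeHandle = nativeHandle;
  }

  synchronized long getValue() {
    if (nativeHandle == 0) {
      throw new IllegalStateException("value has been closed");
    }
    return nativeHandle; // stand-in for the native extraction call
  }

  @Override
  public synchronized void close() {
    nativeHandle = 0; // stand-in for freeing the native object
  }
}
```

This serializes getValue() calls, which is acceptable here precisely because concurrent access to these getters is not needed.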

@mcimadamore

mcimadamore commented Aug 10, 2020

It needs to provide something substantial, and right now, Panama is still a lot slower than JNI: https://github.com/zakgof/java-native-benchmark

@saudet please check your facts. The benchmark you refer to is a year old, and is using a completely different Panama architecture (and API) compared to the one we are using now, for both memory and foreign function access. Out of curiosity I put together a small (unscientific) benchmark which does something similar to the one you linked (but using Linux, I used localtime_r). Panama is just ~10% slower than pure JNI (I did not use JavaCPP); the only overhead I can see is associated with memory allocation/deallocation (compared to Unsafe::allocateMemory/freeMemory), where some extra logic is added so that memory access can be considered "safe" (as in avoiding hard JVM crashes due to segfaults). While we need to keep improving, these numbers are very different from those in the benchmark you linked (which shows Panama to be 20x slower than JNI/JavaCPP).

@yuzawa-san
Contributor Author

We could debate the nuances of the internal implementation (which was done prior to my arrival) ad infinitum and yes there are philosophical differences between segfaults in C/CPP-world and the somewhat-equivalent NullPointerException in Java that could be re-litigated, but I would like to get this underlying bug fixed, as it is a blocker to my (and likely others') future use of this API. The main objective here was to fix the defect shown here which @Craigacp mentioned they had overlooked in the original implementation: https://gist.github.com/yuzawa-san/e12b16a4aac7dc7bec2a94c4b8c2da89 I did not alter the Java public facing API with the exception of adding the OrtSession.terminate() method, which I documented. What are the next steps to getting this approved?

@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@hariharans29
Member

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@hariharans29
Member

@Craigacp @yuslepukhin - is this change good to go ?

@saudet

saudet commented Aug 14, 2020

@saudet The API is perfectly usable (because we and others are using it in production), this patch fixes an issue that I hadn't hit because I use the session in a try-with-resources and don't attempt to close it while run calls are in flight, and it doesn't require adding large numbers of locks or introducing new dependencies. I don't think exposing the closure version to users is particularly helpful; it's too low level. The purpose of reviews is to discuss things, and hopefully we're still making progress on that front here. I'm not sure that you presupposing the outcome of the review when you haven't contributed anything to the project is particularly helpful.

I have been trying to contribute, but with your authority here as representative from Oracle you are preventing any potential contributions that I have to offer. We need something like JavaCPP for performance reasons and to interoperate with other libraries like OpenCV and TensorFlow. (BTW, you do know that OpenCV is probably the second most used AI-related framework after TensorFlow, right? Why isn't Oracle doing anything about that one?) The only reason I have ever heard from you about why you are refusing to consider using (or even creating yourself something similar to) JavaCPP is what you said above, but you provide no specifics:

Using JavaCPP here requires having JavaCPP taking over an already complex build system, adding additional runtime dependencies, changing the way the binaries are packaged, and then changing all of the code to use its idioms

Where exactly is the build "taken over"? What is "overly complex" about a ~100 lines Bash script and Java config file? What additional dependencies? How do you want to package the binaries to have them share stuff with OpenCV and TensorFlow? The way this sounds to me is you're saying that everything the Python community does with tools like conda, pip, setuptools, etc is wrong, and Java developers shouldn't use tools like that. If that's not what you mean, then at least please offer an alternative! Remember that you're speaking for Oracle here.

I think the reference counting approach is sufficiently hidden that it will work in the OrtSession and prevent the crashes. It's not really counting references, it's counting the number of methods in flight using the native resource, and it's completely hidden from the user of the Java API. For the similar issues in OnnxValue subclasses then I think making the relevant methods synchronized is sufficient as they are immutable and it's not necessary to have concurrent access to their getValue() methods.

That's exactly the problem, it's all hidden, it prevents users from doing things in whatever way they may want to do things!

@saudet

saudet commented Aug 14, 2020

@saudet please check your facts. The benchmark you refer to is a year old, and is using a completely different Panama architecture (and API) compared to the one we are using now, for both memory and foreign function access. Out of curiosity I put together a small (unscientific) benchmark which does something similar to the one you linked (but using Linux, I used localtime_r). Panama is just ~10% slower than pure JNI (I did not use JavaCPP); the only overhead I can see is associated with memory allocation/deallocation (compared to Unsafe::allocateMemory/freeMemory), where some extra logic is added so that memory access can be considered "safe" (as in avoiding hard JVM crashes due to segfaults). While we need to keep improving, these numbers are very different from those in the benchmark you linked (which shows Panama to be 20x slower than JNI/JavaCPP).

That's great news! Thanks for the update. I'm looking forward to the day when Panama actually stops using JNI though.

@saudet

saudet commented Aug 14, 2020

We could debate the nuances of the internal implementation (which was done prior to my arrival) ad infinitum and yes there are philosophical differences between segfaults in C/CPP-world and the somewhat-equivalent NullPointerException in Java that could be re-litigated, but I would like to get this underlying bug fixed, as it is a blocker to my (and likely others') future use of this API. The main objective here was to fix the defect shown here which @Craigacp mentioned they had overlooked in the original implementation: https://gist.github.com/yuzawa-san/e12b16a4aac7dc7bec2a94c4b8c2da89 I did not alter the Java public facing API with the exception of adding the OrtSession.terminate() method, which I documented. What are the next steps to getting this approved?

It's not just a bug fix (a bug fix would just throw an exception or something from somewhere), it forces users to adopt a way of doing things that purports to be thread-safe, but that really isn't, and that will slow down processing. If there were a way to disable all of this, I would be fine with these changes though. Is there a way to disable all of this? (I also think that kind of feature should be optional in TensorFlow. I'll probably get to it when I need to hack that there. The developer that used to work at Google on this API wasn't an expert in Java and I don't think his code got reviewed.)

@Craigacp
Contributor

I have been trying to contribute, but with your authority here as representative from Oracle you are preventing any potential contributions that I have to offer. We need something like JavaCPP for performance reasons and to interoperate with other libraries like OpenCV and TensorFlow. (BTW, you do know that OpenCV is probably the second most used AI-related framework after TensorFlow, right? Why isn't Oracle doing anything about that one?) The only reason I have ever heard from you about why you are refusing to consider using (or even creating yourself something similar to) JavaCPP is what you said above, but you provide no specifics:

@saudet As I've told you several times before, I work in Oracle Labs, and do not speak for the company or the Java org in any way. In this particular case I'm speaking as the person who wrote the Java API for this project. I've not written anything for OpenCV for two reasons: one there is already a Java API, and two I don't work on computer vision. I'm unclear why the ORT library would need to interop with Tensorflow. If you want to use a TF model then export it in onnx format and compose it using the mechanisms already available. If there are performance problems that need addressing then open an issue with some numbers and I'll take a look.

Using JavaCPP here requires having JavaCPP taking over an already complex build system, adding additional runtime dependencies, changing the way the binaries are packaged, and then changing all of the code to use its idioms

Where exactly is the build "taken over"? What is "overly complex" about a ~100 lines Bash script and Java config file? What additional dependencies? How do you want to package the binaries to have them share stuff with OpenCV and TensorFlow? The way this sounds to me is you're saying that everything the Python community does with tools like conda, pip, setuptools, etc is wrong, and Java developers shouldn't use tools like that. If that's not what you mean, then at least please offer an alternative! Remember that you're speaking for Oracle here.

Using JavaCPP requires having JavaCPP build the native layer, opting into JavaCPP's mechanisms for binary distribution, and you need to get the right set of dependencies. As we've found out with the recent issues in the TF Java project, the way that JavaCPP works can be confusing for some developers and it can be tricky to figure out the dependencies. I don't think it's worthwhile to switch to something more complex when we have something that works and is simpler for users of the library.

None of your points are relevant to this discussion, and adopting JavaCPP wouldn't fix the race condition between object use and closing here, given it's essentially a problem with the C API for the library (as calling the C API's close method while there is a run call in flight shouldn't segfault the process).

@yuslepukhin
Member

I took a look at the PR and the scenario. While the scenario is valid, I do not feel that this specific user scenario's complexity should be pushed down to the API level. A simple inference would have to pay for all the refcounting and synchronization mechanisms.
In the scenario described, it is very important to get the multi-threaded details correct. Yet the code silently allows destroying objects twice and other things which would not lead to good results, basically swallowing programming mistakes.

synchronized (handleLock) {
if (closed) {
return;
}
Member

@yuslepukhin yuslepukhin Aug 17, 2020

This should throw some exception that clearly tells that things are going wrong. Somehow, a thread is operating on an object that was already destroyed, but it thinks it has a valid handle on it. With properly implemented refcounting this should never happen, as it looks like a thread is holding a reference to the object but the object is destroyed anyway. This condition should prompt an assert or a non-recoverable exception, which is what the segfault is currently doing for you.

Refcounting or not, each native object should have a proper and clear ownership, and judging by this piece of code, you do not expect that refcounting would actually save you. Why introduce it?
You can wrap a session object into your own refcounted class and save other users from bearing the expense of it.

Contributor Author

The close() method is called by developers. It is within reason that somebody may close the object more than once, and we do not really care since the object was closed already. An exception is thrown if the session is used again (for example, if they call run() again).

The active usage (NOT reference) counting is beside the point here, since that is done in a private implementation hidden from the users and with try-with-resources.
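The scheme described in the PR (a usage token taken per native call, with the native free deferred until both close() has been called and the last in-flight usage has ended) can be sketched roughly as follows. All names here are hypothetical, not the actual patch:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** A usage token; narrows AutoCloseable so close() throws no checked exception. */
interface Usage extends AutoCloseable {
    @Override
    void close();
}

/**
 * Sketch of the guarded-handle idea: every native call holds a Usage token,
 * and the release of the native handle is deferred until the object is closed
 * AND no usage is in flight.
 */
class GuardedHandle implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AtomicInteger activeUses = new AtomicInteger(0);
    private final AtomicBoolean released = new AtomicBoolean(false);

    /** Enter a native call; hold the returned token in try-with-resources. */
    Usage use() {
        activeUses.incrementAndGet();
        if (closed.get()) {
            exitUse(); // lost the race with close(): back out and fail loudly
            throw new IllegalStateException("native handle is closed");
        }
        return this::exitUse;
    }

    private void exitUse() {
        // The last in-flight usage performs the deferred native free.
        if (activeUses.decrementAndGet() == 0 && closed.get()) {
            releaseNative();
        }
    }

    @Override
    public void close() {
        // Only the first close() wins the CAS; free now if nothing is in
        // flight, otherwise the last usage frees on its way out.
        if (closed.compareAndSet(false, true) && activeUses.get() == 0) {
            releaseNative();
        }
    }

    private void releaseNative() {
        // Idempotent stand-in for the JNI release call; tolerates the benign
        // double invocation possible when a racing use() backs out.
        released.compareAndSet(false, true);
    }

    boolean isReleased() {
        return released.get();
    }
}
```

Note that a variant like this never frees the handle under a caller that still holds a token: close() during an active run() merely marks the object closed, and the run's own token performs the free on exit.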

Member

close() is usually the last thing that is called on the object. Since the calling code seems to be unaware that the underlying native object is no longer there, it can call all sorts of functions on it before close(). By this reasoning, one would need to check if the object is still around in every interface, which is not being done. So it would still lead you to a segfault, would it not? Discovering that the object is gone at the time of close() does nothing to save you if the program continues to be invalid. If there is no clear ownership and the refcounting is not implemented right, such checks actually do nothing. If two threads are trying to close the object but the refcount is 1, both would pass the check, and they would happily destroy the object twice, thereby corrupting memory. And you already know where that leads: SEGFAULT.
Let me summarize: this PR introduces atomics and locks, both of which impose additional usage requirements and performance costs on other users for the sake of your usage scenario. However, I feel that the additional overhead and complexity should be shifted to your code.
On another note, there are interdependencies among some objects. You will see in the next release the OrtAllocators API being exposed, and IOBinding. Those depend on Session. Session depends on Environment. Refcounting them individually would not make sense.
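The check-then-destroy hazard described above is exactly why a plain read of the count cannot pick the destroyer: an atomic decrement makes the 1-to-0 transition observable by exactly one thread. A minimal sketch (hypothetical names, not code from this PR):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal refcount sketch: exactly one caller ever becomes the destroyer. */
class RefCounted {
    private final AtomicInteger refs = new AtomicInteger(1);

    /** Take an additional reference; fails if the object is already gone. */
    void retain() {
        if (refs.getAndIncrement() <= 0) {
            refs.getAndDecrement();
            throw new IllegalStateException("already destroyed");
        }
    }

    /**
     * Drop a reference. decrementAndGet is atomic, so two threads that both
     * "passed a check" cannot both observe the count hitting zero; only one
     * caller is told to free the native handle.
     */
    boolean release() {
        int remaining = refs.decrementAndGet();
        if (remaining < 0) {
            throw new IllegalStateException("released more times than retained");
        }
        return remaining == 0;
    }
}
```

Two threads racing on release() with a count of 2 see 1 and 0 respectively; only the thread that sees 0 frees, which is the double-destroy scenario the comment warns about, solved.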



Contributor

@Craigacp Craigacp Aug 17, 2020

The docs for Java's AutoCloseable interface say:

Note that unlike the close method of Closeable, this close method is not required to be idempotent. In other words, calling this close method more than once may have some visible side effect, unlike Closeable.close which is required to have no effect if called more than once. However, implementers of this interface are strongly encouraged to make their close methods idempotent.

So having it accept multiple calls to close is idiomatic.
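Making close() idempotent, as the docs encourage, takes only a single compareAndSet. A minimal sketch (the release counter is illustrative only, not part of any real API):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Idempotent close(): the release runs once no matter how often close() is called. */
class IdempotentResource implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AtomicInteger releases = new AtomicInteger(); // sketch-only counter

    @Override
    public void close() {
        // Exactly one call, even under contention, wins the CAS and releases.
        if (closed.compareAndSet(false, true)) {
            releases.incrementAndGet(); // stand-in for freeing the native handle
        }
    }

    int releaseCount() {
        return releases.get();
    }
}
```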

Contributor

However, the upcoming changes to allocators and IOBinding do make me worried we'll need a bigger solution to this. The allocators were originally exposed and had a more complicated tracking mechanism, but during the initial integration of the Java API we decided that it wasn't necessary, and lots of complexity was removed by making the allocators an internal detail.

Member

The documentation gap is recognized internally and there will be an effort to close it. Please do ask me questions about the behavior.
Please refer to this PR for more details; it also demonstrates the way the API is exposed in C#: #4646



Member

@yuslepukhin yuslepukhin Aug 17, 2020

The initial allocator exposed was a default CPU allocator. It was a singleton that never required destruction.
The allocators obtained with the new API CreateAllocator() are wrapping instances that require a ReleaseAllocator() call when no longer needed.
Those refer to the internal session allocators.
The internal session allocators are tied to the execution providers. Those depend on two factors: 1) whether the provider is present in the binary, 2) whether it was added in the session options.
For example, if you have a build with CPU and CUDA, you specify them in the SessionOptions and create a Session.
Then you create an instance of OrtMemoryInfo that has all the characteristics of the allocator required. (Note, this is also a native Closeable/Disposable class.)
Using that, you call CreateAllocator(), which internally checks if the requested allocator is present within the session. It would then create a wrapper and return it.
You can now allocate memory on the device, which returns an OrtMemoryAllocation instance.
This is useful in some scenarios:
1) You do not want to reallocate memory for input/output, and just re-use a pre-allocated chunk, progressively specifying bigger shapes but re-using the same chunk.
2) You want to feed output back to input, so you just re-bind input/output, trading places.
3) You have a dynamically shaped output. In this case you cannot pre-allocate, but by using OrtMemoryInfo you can specify where to allocate and it will do so. This is especially useful if you need it to be on the device, or you want to feed input that is already allocated, say, on CUDA, so no copying to host memory is needed.
In cases 1 and 2 the caller owns the memory; in case 3, ONNX Runtime owns it (the returned OrtValue does).



@saudet

  1. You have a dynamically shaped output. In this case you cannot pre-allocate, but by using OrtMemoryInfo you can specify where to allocate and it will do so.
    This is especially useful if you need it to be on the device or you want to feed input that is already allocated, say, on CUDA, so no copying to host memory is needed.
    In cases 1 and 2 the caller owns the memory; in case 3, ONNX Runtime owns it (the returned OrtValue does).

@Craigacp Here you go, you should consider your Java API not supporting CUDA memory a "performance problem". No need to run benchmarks to know that copying data around unnecessarily is going to be slow. And no, the official Java API of OpenCV doesn't support CUDA, so even if you figure out something for ORT, it's still going to be a problem! Why do you refuse to recognize the importance of this? I'm not saying that JavaCPP is perfect, but at least I'm trying to do something about it, while you are not.

Contributor

@Craigacp Craigacp Aug 21, 2020

The initial allocator exposed was a default CPU allocator. It was a singleton that never required destruction.
The allocators obtained with the new API CreateAllocator() are wrapping instances that require a ReleaseAllocator() call when no longer needed.
Those refer to the internal session allocators.
The internal session allocators are tied to the execution providers. Those depend on two factors: 1) whether the provider is present in the binary, 2) whether it was added in the session options.
For example, if you have a build with CPU and CUDA, you specify them in the SessionOptions and create a Session.
Then you create an instance of OrtMemoryInfo that has all the characteristics of the allocator required. (Note, this is also a native Closeable/Disposable class.)
Using that, you call CreateAllocator(), which internally checks if the requested allocator is present within the session. It would then create a wrapper and return it.
You can now allocate memory on the device, which returns an OrtMemoryAllocation instance.
This is useful in some scenarios:
1) You do not want to reallocate memory for input/output, and just re-use a pre-allocated chunk, progressively specifying bigger shapes but re-using the same chunk.
2) You want to feed output back to input, so you just re-bind input/output, trading places.
3) You have a dynamically shaped output. In this case you cannot pre-allocate, but by using OrtMemoryInfo you can specify where to allocate and it will do so. This is especially useful if you need it to be on the device, or you want to feed input that is already allocated, say, on CUDA, so no copying to host memory is needed.
In cases 1 and 2 the caller owns the memory; in case 3, ONNX Runtime owns it (the returned OrtValue does).


Ok. I'll work up a PR in a few weeks that starts to incorporate this. I need to expose the getAvailableSessions call anyway and I think there are a couple of other C API features that landed while I was distracted working on Tribuo. Tracking the allocators might be a little tricky, as sharing them across sessions adds yet another layer of things to be held in try-with-resources. Are the new allocators thread safe?

@saudet these are in the context of the IOBinding work happening elsewhere in the ORT project. It allows the use of ORT allocators to allocate memory for specific devices and then bind the inputs to that; the allocation is, as far as I can tell, performed by ORT, just in a separate call from the call to run. I do not know if it's possible to pass in a slab of GPU memory allocated by some other system using either the C or C++ APIs of ORT.

Member

@yuslepukhin yuslepukhin Aug 24, 2020

  1. Yes, the allocators are thread-safe. They are the same ones used by the session object, and the session object supports concurrent execution by multiple threads.
  2. Yes, it is possible to bind inputs/outputs to a slab of memory allocated by someone else. When we bind a pre-allocated piece of memory, it is not owned by ORT or the corresponding OrtValue. One of the Python binding tests makes use of CUDA memory allocated by PyTorch. The current C# API requires passing an OrtMemoryAllocation wrapping the chunk. That currently requires an allocator ptr. So I can see that in C# we could make it more convenient to supply 3rd-party allocations.


@yuslepukhin
Member

Nope. We cannot accept this change.



@yuzawa-san
Contributor Author

I'm going to port the non-controversial code cleanups into a new PR.

@yuzawa-san yuzawa-san closed this Aug 24, 2020
@saudet

saudet commented Nov 27, 2020

Using JavaCPP requires having JavaCPP build the native layer, opting into JavaCPP's mechanisms for binary distribution, and you need to get the right set of dependencies. As we've found out with the recent issues in the TF Java project, the way that JavaCPP works can be confusing for some developers and it can be tricky to figure out the dependencies. I don't think it's worthwhile to switch to something more complex when we have something that works and is simpler for users of the library.

@Craigacp FYI, we can easily integrate JavaCPP into other build systems, using it only to generate code. I've just done it successfully for Android Studio, which uses Gradle to call CMake.

Without any explanation of why writing JNI code manually is less "complex" than using a code generator that does all of that for us, your statements are simply not true!
