Improve Parser: Use the Clang API #51

Open
Arcnor opened this issue Dec 3, 2015 · 69 comments

Comments

@Arcnor

Arcnor commented Dec 3, 2015

I'm in the process of doing this right now. Currently, the following issues exist with the approach I'm taking:

  • Preprocessor directives are not supported: This is fixable, because Clang can be made aware of those, but it isn't by default. I'll leave this for later.
  • Normal comments cannot be read: Unless I've missed some Clang flag, non-doc comments are completely ignored. I'm not sure this is a big deal, but nonetheless, it's different from the current behavior.
@saudet
Member

saudet commented Dec 29, 2015

BTW, we probably want to use the C++ API of Clang for this. It is not currently mapped by the presets, so as initial work, we would either have to:

  1. Code the new parser temporarily in C++, or
  2. Create the presets for the C++ API of Clang, using the current Parser.

Either way is fine with me. Thanks for your interest in this project and let me know how I can help!

@Arcnor
Author

Arcnor commented Dec 29, 2015

Sorry, I haven't had the time to work on this lately.

My final changes allowed me to parse a lot, but some things missing from the C++ API prevented me from finishing it IIRC, so I think option 2 should help us get there (or, as you said, code it in C++, but it's non-trivial :D)

@Arcnor
Author

Arcnor commented Jan 6, 2016

I've continued a bit on this, and for now, I've decided to write the Clang bindings manually, as I think the surface we need from Clang is not that big (I might be wrong, though).

Once we have a working parser, we can generate proper bindings for Clang itself and use them in the generator, closing the circle ;)

I have some doubts about how to implement some of the bindings, though, so I'll hit the forums in a few days with my questions :)

@saudet
Member

saudet commented Dec 6, 2017

@Arcnor Any progress with this?

libclang seems to be getting pretty good for that sort of thing, for example:
https://github.com/rust-lang-nursery/rust-bindgen
https://rust-lang-nursery.github.io/rust-bindgen/
So maybe we won't need to use the C++ API after all...

@Arcnor
Author

Arcnor commented Dec 6, 2017

Hi Samuel,

No, unfortunately I haven't had the time to continue, not enough incentive for me to do so right now (the project(s) that were using JavaCPP all stopped for one reason or another, mostly priorities).

So libclang is getting better, eh? That's great news! I've had a quick look at the API again, and it seems to contain some goodies I don't remember from 2 years ago, so yeah, maybe now it's enough for our purposes.

If they kept the same names on the AST I might even be able to reuse some of the code I made years ago that parsed the unstable AST output of Clang (https://github.com/Arcnor/objc2robovm/blob/master/src/main/java/com/arcnor/objcclang/parser/CLangHandler.java for example).

Anyway, unless somebody else is working on this, I'll try to give it another look if it doesn't look too complex to interact with it, as time is limited :).

@saudet
Member

saudet commented Dec 6, 2017 via email

@Arcnor
Author

Arcnor commented Dec 7, 2017

I'm going to need some help to generate the bindings it seems. I'll put my question(s) here as they are related, but if you need me to use the forum I'll go there instead:

The bindings have the following code:

typedef struct CXVirtualFileOverlayImpl *CXVirtualFileOverlay;

CXVirtualFileOverlay clang_VirtualFileOverlay_create(unsigned options);
...

How can I rename CXVirtualFileOverlayImpl to CXVirtualFileOverlay? I've tried with javaNames but that's obviously not for that.

@Arcnor
Author

Arcnor commented Dec 7, 2017

So besides that small problem, the whole API seems to work (well, at least compile) with very few manual mappings, which is cool.

I'll take more time tomorrow to actually figure out if the stuff I couldn't do ~2 years ago is now possible :).

@saudet
Member

saudet commented Dec 7, 2017

Sounds good! BTW, the bindings for libclang are already available here:
https://github.com/bytedeco/javacpp-presets/blob/master/llvm/src/main/java/org/bytedeco/javacpp/clang.java
AFAIK, we just need to use those.

If there's anything to fix about those though, please send pull requests against the presets config:
https://github.com/bytedeco/javacpp-presets/blob/master/llvm/src/main/java/org/bytedeco/javacpp/presets/clang.java
Thanks!!

@Arcnor
Author

Arcnor commented Dec 7, 2017 via email

@saudet
Member

saudet commented Dec 7, 2017

They already work with LLVM 5.0.0 yes:
https://github.com/bytedeco/javacpp-presets/tree/master/llvm

All const char * should already get mapped to String as well as BytePointer, but if there are char * that should also be mapped to String, yes please, do let me know! Thanks

@Arcnor
Author

Arcnor commented Dec 7, 2017 via email

@saudet
Member

saudet commented Dec 7, 2017

Right, the problem with return values is that when we need a Pointer, we can't get one from the String, but we can get a String from a BytePointer with getString()...

@Arcnor
Author

Arcnor commented Dec 7, 2017 via email

@saudet
Member

saudet commented Dec 7, 2017

If not, we can add to CXString a helper String getString() { return getCString().getString(); } method :)

saudet added a commit to bytedeco/javacpp-presets that referenced this issue Dec 7, 2017
@saudet
Member

saudet commented Dec 7, 2017

Actually, after calling clang_getCString() we need to call clang_disposeString(), so simply returning a String isn't that convenient. I've added the helper function I talked about in the commit above.
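
As a rough illustration of the ordering constraint described above (copy the characters out first, dispose the native string second), here is a minimal self-contained sketch. NativeString and getStringAndDispose() are hypothetical stand-ins invented for this example; they are not the real CXString binding and not the helper from the commit.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a native string handle such as CXString; NOT the real binding. */
final class NativeString implements AutoCloseable {
    private byte[] data;      // pretend this memory is owned by the native library
    private boolean disposed;

    NativeString(String s) {
        this.data = s.getBytes(StandardCharsets.UTF_8);
    }

    /** Plays the role of clang_getCString(): the returned view is only valid until dispose. */
    byte[] cString() {
        if (disposed) {
            throw new IllegalStateException("used after dispose");
        }
        return data;
    }

    /** Plays the role of clang_disposeString(). */
    @Override
    public void close() {
        disposed = true;
        data = null;
    }

    boolean isDisposed() {
        return disposed;
    }

    /** Copies the bytes into a Java String first, then releases the native side. */
    String getStringAndDispose() {
        try {
            return new String(cString(), StandardCharsets.UTF_8);
        } finally {
            close();
        }
    }
}

final class CopyThenDisposeDemo {
    public static void main(String[] args) {
        NativeString spelling = new NativeString("CXCursor_ClassDecl");
        String copy = spelling.getStringAndDispose();
        System.out.println(copy + " (disposed: " + spelling.isDisposed() + ")");
    }
}
```

The try/finally keeps the dispose call from being skipped if the copy throws, which is the main reason a helper is more convenient than asking every caller to remember both steps.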

@Arcnor
Author

Arcnor commented Dec 7, 2017 via email

@Arcnor
Author

Arcnor commented Dec 7, 2017

I've finally checked this properly, and it seems this method was the only one that made sense to have as BytePointer, as it has the dispose. As far as I can see, the others (except clang_EvalResult_getAsStr which also has special disposing) only make sense as String (like CXUnsavedFile.Filename() or clang_getTUResourceUsageName())

Anyway, for now I'll continue as it is, we can always improve things later without many changes.

@Arcnor
Author

Arcnor commented Dec 8, 2017

I'm now getting crashes (randomly, like 1 for every 5 executions or so) like this one:

Stack: [0x000070000836e000,0x000070000846e000],  sp=0x000070000846cf30,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j  org.bytedeco.javacpp.Loader.offsetof(Ljava/lang/Class;Ljava/lang/String;)I+11
j  org.bytedeco.javacpp.Pointer.offsetof(Ljava/lang/String;)I+16
j  org.bytedeco.javacpp.Pointer.sizeof()I+24
j  org.bytedeco.javacpp.Pointer$DeallocatorReference.<init>(Lorg/bytedeco/javacpp/Pointer;Lorg/bytedeco/javacpp/Pointer$Deallocator;)V+29
j  org.bytedeco.javacpp.Pointer$NativeDeallocator.<init>(Lorg/bytedeco/javacpp/Pointer;JJ)V+3
j  org.bytedeco.javacpp.Pointer.init(JJJJ)V+44
v  ~StubRoutines::call_stub
V  [libjvm.dylib+0x2ee9aa]
V  [libjvm.dylib+0x325b59]
V  [libjvm.dylib+0x31b166]
C  [libjniclang.dylib+0x1804]  _ZL19JavaCPP_initPointerP7JNIEnv_P8_jobjectPKvxPvPFvS5_E+0x74
C  [libjniclang.dylib+0x13ef2]  Java_org_bytedeco_javacpp_clang_00024CXCursorVisitor_allocate+0x62
j  org.bytedeco.javacpp.clang$CXCursorVisitor.allocate()V+0
j  org.bytedeco.javacpp.clang$CXCursorVisitor.<init>()V+5
j  com.arcnor.javacpp.Main$Visitor.<init>()V+1
j  com.arcnor.javacpp.Main$Visitor.<init>(Lcom/arcnor/javacpp/Main$1;)V+1
j  com.arcnor.javacpp.Main.visit(Lorg/bytedeco/javacpp/clang$CXTranslationUnit;)V+11
j  com.arcnor.javacpp.Main.main([Ljava/lang/String;)V+22
v  ~StubRoutines::call_stub
V  [libjvm.dylib+0x2ee9aa]
V  [libjvm.dylib+0x3257c2]
V  [libjvm.dylib+0x31e539]
C  [java+0x3931]  JavaMain+0x9c4
C  [libsystem_pthread.dylib+0x393b]  _pthread_body+0xb4
C  [libsystem_pthread.dylib+0x3887]  _pthread_body+0x0
C  [libsystem_pthread.dylib+0x308d]  thread_start+0xd
C  0x0000000000000000

Visitor is a class I created that looks exactly like this:

private static class Visitor extends CXCursorVisitor {
  @Override
  public int call(CXCursor cursor, CXCursor parent, CXClientData client_data) {
    return CXChildVisit_Continue;
  }
}

...and I'm just instantiating by calling new Visitor(). I'm not sure if there are any extra considerations to take when instantiating FunctionPointer classes like this one?

@saudet
Member

saudet commented Dec 8, 2017

Have you disabled "crash recovery"?

∗ In the case of Clang, we might need to disable crash recovery with the LIBCLANG_DISABLE_CRASH_RECOVERY=1 environment variable to prevent clashes with the JVM's own signal handlers.

https://github.com/bytedeco/javacpp-presets/tree/master/llvm
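
Java has no standard API for changing the environment of the already running process, so the variable has to be set before the JVM that loads libclang starts, either in the shell or by a launcher. A minimal sketch of the launcher variant follows; CrashRecoveryLauncher and the child command are hypothetical, and only the variable name comes from the README quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical launcher: starts a child process with libclang crash recovery disabled. */
final class CrashRecoveryLauncher {

    /** Builds a ProcessBuilder for the given command with the variable set in the child's environment. */
    static ProcessBuilder withCrashRecoveryDisabled(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(new ArrayList<>(command));
        // Variable name taken from the JavaCPP Presets for LLVM README.
        pb.environment().put("LIBCLANG_DISABLE_CRASH_RECOVERY", "1");
        pb.inheritIO(); // forward the child's stdout/stderr to this process
        return pb;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command; replace it with the real "java -cp ... YourMain" invocation.
        ProcessBuilder pb = withCrashRecoveryDisabled(List.of("java", "-version"));
        int exit = pb.start().waitFor();
        System.out.println("child exited with " + exit);
    }
}
```

From a shell, prefixing the java command with LIBCLANG_DISABLE_CRASH_RECOVERY=1 should have the same effect.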

@Arcnor
Author

Arcnor commented Dec 8, 2017

Ahh, nice, will try that. I've had a good run of ~10 without crashes though, so it will be difficult to prove if it worked (unless I get it again :D)

saudet added a commit to bytedeco/javacpp-presets that referenced this issue Dec 8, 2017
@saudet
Member

saudet commented Dec 8, 2017

I've added getString() helper methods for CXTUResourceUsageKind and CXEvalResult as well, as per the commit above, but for things like contents and filenames, the encoding can change at runtime. AFAIK there's no pretty way to make String work under those conditions so we might as well just make sure users call BytePointer.getString(). If you have any good ideas though, let me know. Thanks!

@saudet
Member

saudet commented Dec 24, 2017

Let me know if there's anything else missing from the API that would prevent you from making progress. Thanks!!

@saudet
Member

saudet commented Dec 13, 2021

We will have some bootstrapping problem if we use a JavaCPP preset in the Parser used to build presets. Won't we?

Not really, the JavaCPP Presets for LLVM also essentially map the C API only. That's not the problem, the problem is that jextract was designed to work only with C, not C++. It fails miserably at anything that even remotely looks like C++. I think that would be the first thing to "fix" before going forward with that idea.

What do you think of this plan ?

@mcimadamore @sundararajana might have some more recent insights into what they looked at, why it doesn't work, etc.

@HGuillemet
Contributor

I meant using jextract to bind only the C API of Clang. Clang can then be used to parse C++. Where is the limitation due to jextract?

@saudet
Member

saudet commented Dec 13, 2021

jextract also already maps the C API of Clang:
https://github.com/openjdk/panama-foreign/tree/foreign-jextract/src/jdk.incubator.jextract/share/classes/jdk/internal/clang

jextract doesn't support C++, period. It never has and probably never will.

@HGuillemet
Contributor

Sure, but the C-API of Clang can parse C++.

@saudet
Member

saudet commented Dec 13, 2021

Yeah, but it's not going to be any better than the JavaCPP Presets for LLVM. You'll get the exact same thing. The only reason you may want to use jextract is to potentially get support from Oracle...

@HGuillemet
Contributor

And the bootstrapping?
Once we have switched to the new parser, how would you build the LLVM preset on a new platform?

@saudet
Member

saudet commented Dec 13, 2021

I'm not sure I understand what you mean by "bootstrapping", but whatever it is, it's not going to be a bigger problem than supporting C++. Start with getting something working for C++, and if you get that working, the rest isn't going to be a problem.

@HGuillemet
Contributor

HGuillemet commented Dec 13, 2021

I mean the problem of "chicken or the egg": You need the LLVM presets to use the parser, and you need the parser to build the LLVM presets.

@saudet
Member

saudet commented Dec 13, 2021

Didn't you just say that you'd use the one from jextract? Just do that, that's fine.

@mcimadamore

mcimadamore commented Dec 13, 2021

We will have some bootstrapping problem if we use a JavaCPP preset in the Parser used to build presets. Won't we?

I have started to play with the C API of Clang bound by Panama with jextract and it seems to do the job. Preprocessor directives and comments are available. It even parses Doxygen-like syntaxes.

I suggest rewriting the parser using this API, first to reproduce the current behavior of the parser, as a preliminary step to issue #402. Then we could try to change the parser and generator so that C++ classes are mapped to Java classes that use FMA instead of Pointer.

What do you think of this plan ?

@HGuillemet I believe you are suggesting an approach similar to that used by jextract to e.g. generate libclang bindings which rely on the foreign function API. That part works well, and, assuming a tool only needs the C clang API, that could be good enough. We did some experiments parsing C++ with the C API and these were not successful, as the C API, at the moment, does not expose enough information re. template instantiation (the information is there under the hood, just not exposed in the C API, unfortunately). These same problems were observed in other projects using the C API as well (I seem to recall Rust's bindgen having several workarounds to make C++ sort of work with that API).

I do hope that, in the future, the clang C API will be improved to add those missing 2-3 functions which will make handling templates much more manageable. At this point in time I cannot recommend using the clang C API to emit bindings for real-world C++ code.

@HGuillemet
Contributor

Thank you for this information. Yes, that's what I was suggesting.
I'll investigate a bit more to see if recent versions of LLVM provide something good enough with the C API.
Otherwise I guess we will have to stick with Samuel's present magic parser.

@junlarsen
Member

junlarsen commented Dec 13, 2021

If I'm understanding correctly, the "bootstrap problem" is the problem that we would depend on the libclang implementation to create the libclang implementation, similar to how you need GCC to build GCC.

We already solved that part, as we already have a stage 1 libclang implementation at https://github.com/bytedeco/javacpp-presets/blob/master/llvm/src/gen/java/org/bytedeco/llvm/global/clang.java made with the old parser which would suffice to build the new javacpp parser.

I actually had a go at this some time back, and I seemed to be able to parse some very basic C headers with the libclang API from JavaCPP Presets. If missing Clang C functions is an issue, we can either:

  1. upstream changes and pull them down (very slow process due to us building clang releases) - can be done alongside 2)
  2. add the functions ourselves like we already do in https://github.com/bytedeco/javacpp-presets/blob/master/llvm/src/main/resources/org/bytedeco/llvm/include/TargetStubs.h

@mcimadamore

Thank you for this information. Yes, that's what I was suggesting. I'll investigate a bit more to see if recent versions of LLVM provide something good enough with the C API. Otherwise I guess we will have to stick with Samuel's present magic parser.

IIRC, one of the main missing bits of functionality was being able to retrieve all template instantiations for a given template method/class (as a binder would need to generate special code for all of these).

@saudet
Member

saudet commented Dec 14, 2021

@HGuillemet Ah, you were referring to missing functionality from the C API of Clang. We can easily "extend" the C API ourselves, that's not an issue. I thought I mentioned that in this thread, but it's actually in bytedeco/javacpp-presets#475 (comment). So just add anything you need along the way, that's not a problem.

@saudet
Member

saudet commented Jan 7, 2022

FYI, here's something that looks more useful than Panama since it supports C++ and it's actually able to inline native functions:

@HGuillemet You may want to start looking at that, in addition to Panama.

Thanks to @frankfliu for letting me know about that!

@HGuillemet
Contributor

This project is interesting. It aims at providing a full alternative to JavaCPP (and Panama).
Like JavaCPP, Java code instrumented with specific annotations is used to generate JNI (and Java) glue code.
Two features are worth pointing out, compared to JavaCPP:

  • the ability to translate the native glue code from LLVM-IR (LLVM bytecode) to JVM bytecode, which provides a big performance boost (what you meant by inlining native functions I guess).
  • the ability to map templates to generics.

However:

  • there is almost no documentation yet, so it's difficult to experiment.
  • there is no equivalent for the JavaCPP parser. The annotated Java code must be hand written. What I understand from the Chinese document you linked is that such a tool does exist, but has not been open-sourced yet.

@saudet
Member

saudet commented Feb 3, 2022

This project is interesting. It aims at providing a full alternative to JavaCPP (and Panama).

It doesn't aim to be an alternative to Panama, that one is never going to support C++ or function inlining, it's not part of their goals. Like I explained before, I don't think anyone is going to switch from JNI to Panama, and that project (fastFFI) demonstrates that well. JNI is just fine, it's already fast enough and can be made user-friendly with tools like JavaCPP. However, to increase performance to any meaningful degree, what we need is to bring something like LLVM on the JVM without anything "foreign", which Panama is not willing to do, so in my opinion it's never going to give us anything substantial over JNI.

As for being a "full alternative" to JavaCPP, it's possible, but JavaCPP doesn't use Clang or anything like that, so if that's what they have started to work on, I would consider that an evolution over JavaCPP, and we should probably try to collaborate with them instead of redoing the same thing ourselves. @frankfliu What do you think?

@frankfliu
Contributor

@saudet I agree with you. If their architecture is clean and foundation is solid, improving usability is relatively easier.

@HGuillemet
Contributor

Their component (LLVM4JNI) that uses clang to compile the JNI glue code to bytecode and then translates it to JVM bytecode seems more or less independent and could probably be applied as is to JavaCPP.

If they do plan to opensource a C++ parser based on clang, with support for generics, I agree that it would be interesting to know more about it before continuing to work on our own.

This project seems quite old in fact, I'd say at least 10 years. They have decided to open-source it now, and it would be interesting to know their reasons, as well as their plans and available resources.

@shanemikel

shanemikel commented Jan 25, 2023

Aside from java-port/clank, the C#/Mono/Xamarin crowd also have a lot of experience binding and porting C++ class hierarchies.

Both of these projects use the Clang frontend to produce ASTs and port the Clang AST class hierarchy to C# for consumer side codegen APIs:

I think both projects produce their own Clang C bindings and manually port the C++ AST bits they need. They also both have non-trivial C++ code they use to control the Clang frontend.

The Xamarin project has bindings for most Objective-C libraries on Mac and iOS here: xamarin/xamarin-macios. Would love to understand their process. It has to be one of the largest successful bindings projects ever. I'm sure it's largely automated and my guess is they use Clang's Objective-C frontend...

@shanemikel

shanemikel commented Jan 25, 2023

SkiaSharp is another example. A large C# binding project for Google's Skia 2D graphics library. They are a mono project used by Microsoft in .NET.

In the binding generator module they are using CppAst.NET, which implements a C++ AST in C#.

CppAst.NET does not use the C++ library, cppast.

They appear to have stolen the name, but cppast claims to expose bits of Clang's AST which are not exposed directly by libclang. If so, that may be useful.


Edit: I was mistaken that CppAst.NET binds cppast. It is merely named after the latter.

@HGuillemet
Contributor

What about using clangd?
It would remove the chicken-or-the-egg problem mentioned above and allow us to efficiently parse files as well as code chunks.

@shanemikel

I've taken a cursory look at that. I somewhat like the idea.

Clangd depends on understanding the project's build system through compile_commands.json: https://clangd.llvm.org/installation#project-setup. This is fairly easy to produce for CMake projects and there are tools like https://github.com/rizsotto/Bear that can produce it for any build system by intercepting and parsing compiler command arguments. It's kind of a hack that is more acceptable for getting IDE features than it is to reliably produce a build artifact.

Many large bindings projects seem to effectively reproduce parts of the build system, dependency graph, and source file hierarchy of their underlying library anyway. Generating compile_commands.json by hand or ad-hoc (e.g. by script) isn't totally out of step. Taking the JavaCPP approach, some of this information could be generated from Java source annotations.
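
To give an idea of how little is needed to produce that file by script, here is a minimal sketch that emits a compilation database using the "directory", "file" and "arguments" fields of Clang's JSON Compilation Database format. The CompileCommands class, the paths and the compiler flags are all made up for illustration; nothing here is an existing JavaCPP feature.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical generator for a minimal compile_commands.json. */
final class CompileCommands {

    /** Quotes a value as a JSON string, escaping the characters that require it. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Renders one entry: working directory, translation unit, and compiler argv. */
    static String entry(String directory, String file, List<String> arguments) {
        String args = arguments.stream()
                .map(CompileCommands::quote)
                .collect(Collectors.joining(", "));
        return "  {\n"
             + "    \"directory\": " + quote(directory) + ",\n"
             + "    \"file\": " + quote(file) + ",\n"
             + "    \"arguments\": [" + args + "]\n"
             + "  }";
    }

    /** Wraps the entries in the top-level JSON array. */
    static String database(List<String> entries) {
        return "[\n" + String.join(",\n", entries) + "\n]\n";
    }

    public static void main(String[] args) {
        // Example values only: in practice these might be derived from include paths
        // and compiler options already declared for the library being mapped.
        String e = entry("/path/to/project", "src/foo.cpp",
                List.of("clang++", "-std=c++17", "-Iinclude", "-c", "src/foo.cpp"));
        System.out.print(database(List.of(e)));
    }
}
```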

One possible issue: AST access is provided as an LSP protocol extension: https://clangd.llvm.org/extensions#ast.
That page features a major caveat:

These extensions may evolve or disappear over time. If you use them, try to recover gracefully if the structures aren’t what’s expected.

There is an LSP implementation for Java here: https://github.com/eclipse/lsp4j. I think the protocol is similar to HTTP, so client implementation shouldn't be too bad.
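
For a sense of what a hand-rolled client would have to do, here is a small sketch of the LSP base-protocol framing: a Content-Length header counting the UTF-8 bytes of the body, a blank line, then the JSON-RPC payload. The LspFraming class is hypothetical. The textDocument/ast method name and its textDocument/range parameters follow my reading of the clangd extensions page linked above and should be treated as an assumption, given the caveat quoted there.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing LSP base-protocol framing for a clangd AST request. */
final class LspFraming {

    /** Prefixes a JSON body with the Content-Length header required by the base protocol. */
    static byte[] frame(String json) {
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        // The length is the number of bytes in the body, not the number of chars.
        byte[] header = ("Content-Length: " + body.length + "\r\n\r\n")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[header.length + body.length];
        System.arraycopy(header, 0, out, 0, header.length);
        System.arraycopy(body, 0, out, header.length, body.length);
        return out;
    }

    /**
     * Builds a request for clangd's AST extension (assumed shape, see the lead-in).
     * The uri is assumed to contain no characters that need JSON escaping.
     */
    static String astRequest(int id, String uri,
                             int startLine, int startChar, int endLine, int endChar) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
             + ",\"method\":\"textDocument/ast\",\"params\":{"
             + "\"textDocument\":{\"uri\":\"" + uri + "\"},"
             + "\"range\":{\"start\":{\"line\":" + startLine + ",\"character\":" + startChar + "},"
             + "\"end\":{\"line\":" + endLine + ",\"character\":" + endChar + "}}}}";
    }

    public static void main(String[] args) {
        // Hypothetical file URI; a real client would first send initialize and didOpen.
        String request = astRequest(1, "file:///path/to/header.h", 0, 0, 10, 0);
        System.out.println(new String(frame(request), StandardCharsets.UTF_8));
    }
}
```

A library such as lsp4j takes care of this framing and of the JSON mapping, so a sketch like this mostly matters for judging how thin a dependency-free client could be.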

Other than providing per-file AST access, clangd provides an index which may be marginally helpful:
https://clangd.llvm.org/design/indexing

@LifeIsStrange

LifeIsStrange commented Sep 5, 2023

Hi @saudet little update,
Context:
in a previous issue I made about leveraging the foreign linker/foreign memory api

You said

they haven't been able to get any performance gains over JNI, yet, so it's unclear how it's going to be useful at this point

To which Mcimadamore outlined some possible scenarios where the foreign linker api could lead to better performance than JNI.

The news:
There is a new blog post on Java inside showing that the foreign memory api has seen a considerable performance improvement in JDK22 and for native strings, seems to be significantly better than JNI
https://minborgsjavapot.blogspot.com/2023/08/java-22-panama-ffm-provides-massive.html?m=1
The future improvements section also caught my interest:

FFM allows us to use custom allocators and so, if we make several calls, we can reuse memory segments thereby improving performance further. This is not possible with JNI.

It also mentions future internal use of the vector api.

@saudet
Member

saudet commented Sep 16, 2023

FMA is unrelated to Clang or JNI, please see issue #402

9 participants