Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mono]: Reduce Mono AOT cross compiler x64 memory footprint. #97096

Merged

Conversation

lateralusX
Copy link
Member

Building .net8 S.P.C using Mono AOT cross compiler in full AOT consumes a large amount of memory (up to 6 GB). This is mainly due to generated LLVM module not being optimized at all while kept in memory during full module generation. Mono x64 also lacks support for several intrinsics as well as Vector 256/512 that in turn leads to massive inlining of intrinsics functions generating a very large LLVM module, where majority of this code ends up as dead code due to IsSupported/IsHardwareAccelerated returning false.

The follow commit adjusts several things that will bring down the memory usage, compiling .net8/.net9 Mono S.P.C on x64 Windows from 6 GB down to ~750 MB.

  • Use PSNE implementations on intrinsics not supported on Mono.
  • Add ILLinker substitutions for intrinsics not supported on Mono. Enables ILLinker to do dead code elimination, reduce code to AOT compile.
  • Prevent aggressive inlining for a couple of unsupported intrinsics types making sure we don't end up with excessive inlining, exploding code size.
  • Run a couple of LLVM optimization passes on each generated method doing early code simplification and dead code elimination during LLVM module generation.
  • Explicit SN_get_IsHardwareAccelerated/SN_get_IsSupported intrinsics implementation for all unsupported Mono x64 SIMD intrinsics.
  • Fixed numerous memory leaks in Mono AOT cross compiler code.
  • Fix a couple of sequence points free after use errors.
  • Fix an anonymous struct build warning triggering build error for LLVM enabled cross compiler on Windows.

@@ -0,0 +1,10 @@
<linker>
<assembly fullname="System.Private.CoreLib">
<type fullname="System.Runtime.Intrinsics.Vector256">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true for all codegens (e.g. interpreter)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like interpreter mark all vectors as not being hardware accelerated:

} else if (in_corlib &&
		   (!strncmp ("System.Runtime.Intrinsics", klass_name_space, 25) &&
			!strncmp ("Vector", klass_name, 6) &&
			!strcmp (tm, "get_IsHardwareAccelerated"))) {
	*op = MINT_LDC_I4_0;
}

so for that case it should be ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @kg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We mark v128 as hardware accelerated in interp in some cases, I believe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so then this should be OK since it only affects v256/v512.

@vargaz
Copy link
Contributor

vargaz commented Jan 17, 2024

Wouldn't be better to split this into smaller PRs ?

@lambdageek
Copy link
Member

I wonder if we can make working with symbols a little more typesafe so that we have some distinction between a mempool allocated symbol and a temporary malloc-allocated symbol. Maybe we can just pass around a GString for the temporary ones?

return NULL;

if (vector_size == 256 || vector_size == 512) {
if (id == SN_get_IsHardwareAccelerated ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For RyuJIT, we have a fallback that handles any unrecognized get_IsHardwareAccelerated and get_IsSupported APIs for System.Numerics, System.Runtime.Intrinsics, and System.Runtime.Intrinsics.* to ensure they can be treated as constant false. Each namespace is being handled a little bit differently since get_IsSupported for some namespaces needs to fallback to user-code instead.

Would it be a good idea to similarly make this general-purpose. Notably vector_size == 64 should also be false on x86/x64 for example?

Copy link
Member Author

@lateralusX lateralusX Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a common fallback for IsHardwareAccelerted in instrinsics.c:

/* Fallback if SIMD is disabled */
if (in_corlib && 
	((!strcmp ("System.Numerics", cmethod_klass_name_space) && !strcmp ("Vector", cmethod_klass_name)) || 
	!strncmp ("System.Runtime.Intrinsics", cmethod_klass_name_space, 25))) {
	if (!strcmp (cmethod->name, "get_IsHardwareAccelerated")) {
		EMIT_NEW_ICONST (cfg, ins, 0);
		ins->type = STACK_I4;
		return ins;
	}
}

So I guess we should be able to drop that change (just made it explicitly for better visibility in this PR) and rely on that fallback to end up with the same result. Not sure why 64-bit vector size was not included in the past.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted this change to rely on fallback for get_IsHardwareAccelerated.

if (!strcmp (class_ns, "System.Runtime.Intrinsics")) {
if (!strcmp (class_name, "Vector128"))
if (!strcmp (class_name, "Vector128") || !strcmp(class_name, "Vector256") || !strcmp(class_name, "Vector512"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Vector64?

@@ -5911,8 +5981,14 @@ static
MonoInst*
arch_emit_simd_intrinsics (const char *class_ns, const char *class_name, MonoCompile *cfg, MonoMethod *cmethod, MonoMethodSignature *fsig, MonoInst **args)
{
if (!strcmp (class_ns, "System.Runtime.Intrinsics.X86")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a similar line for System.Runtime.Intrinsics.Arm and System.Runtime.Intrinsics.Wasm missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or general purpose code handling any unrecognized System.Runtime.Intrinsics.* namespace?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled elsewhere in that source file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will find Arm and Wasm under respective defines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do they not need a path to ensure that IsSupported returns constant false and the intrinsics directly generate a PNSE exception, rather than hitting the recursive fallback, or is that handled elsewhere as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly focused on x86 in this PR, other code will still go through the fallbacks, but could probably be enhanced as well.

Comment on lines +6218 to +6155
/*
* If a method has been marked with aggressive inlining, check if we support
* aggressive inlining of the intrinsics type, if not, ignore aggressive inlining
* since it could end up inlining a large amount of code that most likely will end
* up as dead code.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help me understand why this is needed on Mono a bit?

The software fallback for V128/V256/V512 is currently implemented as doing 2x operations on the lower/upper halves (except for a couple of methods like Shuffle which operate on the full vector). So V512 is implemented as 2x V256, V256 is implemented as 2x V128, and V128 is implemented as 2x V64. V64 is then implemented as a loop over the scalar elements with potentially large amounts of generic code.

So I would imagine that on any platform where V128 is supported, that the codegen for V256/V512 should be generally nice/small, even if aggressively inlined and with no real dead code elimination required. Even in the case where you have an unsupported type like V256<Guid>, the first V128 operation should have a PNSE thrown (since whether a type is supported or not is typically a known constant).

So I'd only expect that this is needed for V64 on x86/x64 where there is no acceleration and it has to hit the scalar fallback with the loop over the elements. For RyuJIT this loop isn't an issue due to the generic specialization we do, allowing us to get it down to only the code path that is actually used. I believe this isn't possible for Mono today and is non-trivial to add, so the skipping of inlining does make sense in that regard.

Copy link
Member Author

@lateralusX lateralusX Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So without disabling aggressive inlining I ended up with methods that where huge, and not doing aggressive inline on these types (but still honor inline size limits) made them sane again. I will need to re-iterate around that change in order to tell exactly what happened with our inliner in the aggressive case.

Copy link
Member Author

@lateralusX lateralusX Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just validated this, with disabling the aggressive inlining as above a .net9 full AOT of S.P.C takes ~1.7 GB of memory, and not doing this will consume an additional 600 MB of memory, so I believe its worth preventing aggressive inlining for these types that are not hardware accelerated on any of the Mono supported platforms. I didn't add V64 since it seems to be handled a little differently on at least ARM case. I won't have bandwidth at the moment to do more deep analysis around why aggressive inlining of these template types cause that large increase in memory so I think doing this change, at least short term is worth it, we could probably file an issue around the bloat of Vector2561 and Vector5121, but maybe the better long term solution is to actually implement intrinsic support for these types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For interpreter I'm handling this in a general fashion by detecting early dead pieces of code and not inlining any of the calls there. #97514. It is possible that a similar approach for jit can produce further improvements without having to special case classes.

Copy link
Member Author

@lateralusX lateralusX Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try to experiment with some dead code elimination, but that cause issues and doesn't work with our llvm codegen, so we explicitly turn that pass off for code that will be passed over to llvm. Instead I made sure we could do more in linker, but still these methods still explode and survives first simple llvm optimizations pass we now do in function manager pass, so feels like something in the inlined code prevents elimination until very late in the opt chain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to reenable branch optimizations when using llvm in this pr:
#97189

@steveisok
Copy link
Member

System.Collections.Concurrent failed to AOT in a couple of suites. Not sure why, but here's the log https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-97096-merge-21c2561d7c2b483faf/Invariant.Tests/1/console.3bf63a3a.log?helixlogtype=result

vargaz added a commit that referenced this pull request Feb 8, 2024
Extracted from #97096.

Author: Johan Lorensson <lateralusx.github@gmail.com>
vargaz added a commit that referenced this pull request Feb 8, 2024
Extracted from #97096.

Author: Johan Lorensson <lateralusx.github@gmail.com>.
@lewing
Copy link
Member

lewing commented Feb 9, 2024

Hopfully I resolved the conflicts correctly, someone should review.

vargaz added a commit to vargaz/runtime that referenced this pull request Feb 9, 2024
Extracted from dotnet#97096.

Author: Johan Lorensson lateralusx.github@gmail.com.
@vargaz vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from 4e885b1 to 65f6f9c Compare February 9, 2024 04:16
vargaz added a commit that referenced this pull request Feb 9, 2024
…#98151)

Extracted from #97096.

Author: Johan Lorensson lateralusx.github@gmail.com.
@vargaz vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from 65f6f9c to aa39dff Compare February 9, 2024 05:29
vargaz added a commit to vargaz/runtime that referenced this pull request Feb 9, 2024
Extracted from dotnet#97096.

Author: Johan Lorensson lateralusx.github@gmail.com.
vargaz added a commit that referenced this pull request Feb 9, 2024
Extracted from #97096.

Author: Johan Lorensson lateralusx.github@gmail.com.
Building .net8 S.P.C using Mono AOT cross compiler in full AOT consumes
a large amount of memory (up to 6 GB). This is mainly due to generated
LLVM module not being optimized at all while kept in memory during
full module generation. Mono x64 also lacks support for several
intrinsics as well as Vector 256/512 that in turn leads to massive
inlining of intrinsics functions generating a very large LLVM module,
where majority of this code ends up as dead code due to
IsSupported/IsHardwareAccelerated returning false.

The follow commit adjusts several things that will bring down the
memory usage, compiling .net8/.net9 Mono S.P.C on x64 Windows
from 6 GB down to ~750 MB.

* Use PSNE implementations on intrinsics not supported on Mono.
* Add ILLinker substitutions for intrinsics not supported on Mono. Enables
ILLinker to do dead code elimination, reduce code to AOT compile.
* Prevent aggressive inlining for a couple of unsupported intrinsics types
making sure we don't end up with excessive inlining, exploding code size.
* Run a couple of LLVM optimization passes on each generated method doing
early code simplification and dead code elimination during LLVM module
generation.
* Explicit SN_get_IsHardwareAccelerated/SN_get_IsSupported intrinsics
implementation for all unsupported Mono x64 SIMD intrinsics.
* Fixed numerous memory leaks in Mono AOT cross compiler code.
* Fix a couple of sequence points free after use errors.
* Fix an anonymous struct build warning triggering build error for
LLVM enabled cross compiler on Windows.
@vargaz vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from aa39dff to cfaf8d9 Compare February 9, 2024 08:46
@vargaz vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from ea8a484 to 5ab0d94 Compare February 9, 2024 09:06
@vargaz
Copy link
Contributor

vargaz commented Feb 9, 2024

Failures are unrelated.

@steveisok steveisok merged commit a79c62d into dotnet:main Feb 9, 2024
186 of 189 checks passed
vargaz added a commit to vargaz/runtime that referenced this pull request Feb 15, 2024
@matouskozak
Copy link
Member

This PR looks to be responsible for ~200kB package size improvement on iOS HelloWorld (range of commits 339443b...a79c62d)
image.

Good job everyone!

@matouskozak
Copy link
Member

This PR looks to be responsible for ~200kB package size improvement on iOS HelloWorld (range of commits 339443b...a79c62d) image.

Good job everyone!

Edit. after the subsequent fix to this PR (#98515), the package size improvements on iOS HelloWorld are mostly gone (i.e., back to original values).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.