[Mono]: Reduce Mono AOT cross compiler x64 memory footprint. #97096

lateralusX · 2024-01-17T13:18:55Z

Building .NET 8 System.Private.CoreLib (S.P.C) using the Mono AOT cross compiler in full AOT mode consumes a large amount of memory (up to 6 GB). This is mainly because the generated LLVM module is not optimized at all while it is kept in memory during full module generation. Mono x64 also lacks support for several intrinsics as well as Vector256/Vector512, which in turn leads to massive inlining of intrinsics functions and a very large LLVM module, where the majority of this code ends up as dead code because IsSupported/IsHardwareAccelerated return false.

The following commit adjusts several things that bring down the memory usage when compiling .NET 8/.NET 9 Mono S.P.C on x64 Windows from 6 GB to ~750 MB.

* Use PNSE (PlatformNotSupportedException) implementations for intrinsics not supported on Mono.
* Add ILLinker substitutions for intrinsics not supported on Mono. This enables ILLinker to do dead code elimination, reducing the code to AOT compile.
* Prevent aggressive inlining for a couple of unsupported intrinsics types, making sure we don't end up with excessive inlining that explodes code size.
* Run a couple of LLVM optimization passes on each generated method, doing early code simplification and dead code elimination during LLVM module generation.
* Add explicit SN_get_IsHardwareAccelerated/SN_get_IsSupported intrinsics implementations for all unsupported Mono x64 SIMD intrinsics.
* Fix numerous memory leaks in the Mono AOT cross compiler code.
* Fix a couple of sequence point use-after-free errors.
* Fix an anonymous struct build warning that triggered a build error for the LLVM-enabled cross compiler on Windows.

marek-safar · 2024-01-17T14:11:05Z

src/mono/System.Private.CoreLib/src/ILLink/ILLink.Substitutions.Intrinsics.Vectors.xml

@@ -0,0 +1,10 @@
+<linker>
+  <assembly fullname="System.Private.CoreLib">
+    <type fullname="System.Runtime.Intrinsics.Vector256">


Is this true for all codegens (e.g. the interpreter)?

Looks like the interpreter marks all vectors as not being hardware accelerated:

} else if (in_corlib &&
           (!strncmp ("System.Runtime.Intrinsics", klass_name_space, 25) &&
            !strncmp ("Vector", klass_name, 6) &&
            !strcmp (tm, "get_IsHardwareAccelerated"))) {
	*op = MINT_LDC_I4_0;
}

so for that case it should be ok.

We mark v128 as hardware accelerated in interp in some cases, I believe.

ok, so then this should be OK since it only affects v256/v512.

src/mono/mono/mini/aot-compiler.c

vargaz · 2024-01-17T14:36:01Z

Wouldn't it be better to split this into smaller PRs?

lambdageek · 2024-01-17T14:36:39Z

I wonder if we can make working with symbols a little more typesafe so that we have some distinction between a mempool allocated symbol and a temporary malloc-allocated symbol. Maybe we can just pass around a GString for the temporary ones?
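One way to read the suggestion above is to give the two kinds of symbol distinct C types, so that the compiler rejects freeing a pool-owned string. The following is only a minimal sketch of that idea: `PoolSymbol`, `TempSymbol` and the helper functions are hypothetical names that do not exist in the Mono sources, and plain `malloc`/`free` stand in for the GLib allocators and mempool the AOT compiler actually uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper types, not part of the Mono sources. */
typedef struct { const char *str; } PoolSymbol; /* owned by a pool, never freed individually */
typedef struct { char *str; } TempSymbol;       /* heap-allocated, the caller must free it */

/* Create a temporary symbol by copying `name` onto the heap. */
static TempSymbol
temp_symbol_new (const char *name)
{
	assert (name != NULL);
	size_t len = strlen (name) + 1;
	TempSymbol sym = { malloc (len) };
	if (sym.str)
		memcpy (sym.str, name, len);
	return sym;
}

/* Only a TempSymbol can be released; passing a PoolSymbol does not compile. */
static void
temp_symbol_free (TempSymbol *sym)
{
	free (sym->str);
	sym->str = NULL;
}

/* Wrap a string whose lifetime is managed elsewhere (for example by a pool). */
static PoolSymbol
pool_symbol_wrap (const char *pooled)
{
	PoolSymbol sym = { pooled };
	return sym;
}
```

With this shape, a call such as `temp_symbol_free (&pool_sym)` is a type error at compile time rather than a leak or double free found later at runtime.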

src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems

src/mono/System.Private.CoreLib/src/ILLink/ILLink.Substitutions.Intrinsics.x86.xml

tannergooding · 2024-01-17T15:39:19Z

src/mono/mono/mini/simd-intrinsics.c

-		return NULL; 
-
+	if (vector_size == 256 || vector_size == 512) {
+		if (id == SN_get_IsHardwareAccelerated ) {


For RyuJIT, we have a fallback that handles any unrecognized get_IsHardwareAccelerated and get_IsSupported APIs for System.Numerics, System.Runtime.Intrinsics, and System.Runtime.Intrinsics.* to ensure they can be treated as constant false. Each namespace is being handled a little bit differently since get_IsSupported for some namespaces needs to fallback to user-code instead.

Would it be a good idea to similarly make this general-purpose? Notably, vector_size == 64 should also be false on x86/x64, for example.

We do have a common fallback for IsHardwareAccelerated in intrinsics.c:

/* Fallback if SIMD is disabled */
if (in_corlib &&
    ((!strcmp ("System.Numerics", cmethod_klass_name_space) && !strcmp ("Vector", cmethod_klass_name)) ||
     !strncmp ("System.Runtime.Intrinsics", cmethod_klass_name_space, 25))) {
	if (!strcmp (cmethod->name, "get_IsHardwareAccelerated")) {
		EMIT_NEW_ICONST (cfg, ins, 0);
		ins->type = STACK_I4;
		return ins;
	}
}

So I guess we should be able to drop that change (just made it explicitly for better visibility in this PR) and rely on that fallback to end up with the same result. Not sure why 64-bit vector size was not included in the past.

Reverted this change to rely on fallback for get_IsHardwareAccelerated.
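To make the matching rule of that fallback easier to follow, here is a standalone sketch of just the predicate, detached from the JIT. The function name `is_hw_accelerated_fallback` and the `INTRINSICS_NS` macro are made up for illustration; the real code emits an IR constant through `EMIT_NEW_ICONST` instead of returning a boolean. Note that the namespace comparison is a prefix match over the 25 characters of "System.Runtime.Intrinsics", so sub-namespaces such as System.Runtime.Intrinsics.X86 match as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define INTRINSICS_NS "System.Runtime.Intrinsics"

/* Returns true when the call should be folded to constant 0 (false).
 * Illustrative only: the name and signature are assumptions, not Mono API. */
static bool
is_hw_accelerated_fallback (bool in_corlib, const char *ns, const char *klass, const char *method)
{
	assert (ns && klass && method);
	if (!in_corlib)
		return false;
	/* Exact match for System.Numerics.Vector. */
	bool numerics_vector = !strcmp (ns, "System.Numerics") && !strcmp (klass, "Vector");
	/* Prefix match: covers System.Runtime.Intrinsics and its sub-namespaces. */
	bool intrinsics_ns = !strncmp (ns, INTRINSICS_NS, strlen (INTRINSICS_NS));
	return (numerics_vector || intrinsics_ns) && !strcmp (method, "get_IsHardwareAccelerated");
}
```

Because only the method name is checked, the rule is independent of vector size, which is why the explicit 256/512 check could be dropped in favor of the fallback.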

src/mono/mono/mini/simd-intrinsics.c

tannergooding · 2024-01-17T15:43:58Z

src/mono/mono/mini/simd-intrinsics.c

 	if (!strcmp (class_ns, "System.Runtime.Intrinsics")) {
-		if (!strcmp (class_name, "Vector128"))
+		if (!strcmp (class_name, "Vector128") || !strcmp(class_name, "Vector256") || !strcmp(class_name, "Vector512"))


What about Vector64?

tannergooding · 2024-01-17T15:44:51Z

src/mono/mono/mini/simd-intrinsics.c

@@ -5911,8 +5981,14 @@ static
 MonoInst*
 arch_emit_simd_intrinsics (const char *class_ns, const char *class_name, MonoCompile *cfg, MonoMethod *cmethod, MonoMethodSignature *fsig, MonoInst **args)
 {
+	if (!strcmp (class_ns, "System.Runtime.Intrinsics.X86")) {


Is there a similar line for System.Runtime.Intrinsics.Arm and System.Runtime.Intrinsics.Wasm missing?

(or general purpose code handling any unrecognized System.Runtime.Intrinsics.* namespace?)

Handled elsewhere in that source file.

You will find Arm and Wasm under respective defines.

Do they not need a path to ensure that IsSupported returns constant false and the intrinsics directly generate a PNSE exception, rather than hitting the recursive fallback, or is that handled elsewhere as well?

I mainly focused on x86 in this PR, other code will still go through the fallbacks, but could probably be enhanced as well.

src/mono/mono/mini/simd-intrinsics.c

tannergooding · 2024-01-17T15:54:11Z

src/mono/mono/mini/simd-intrinsics.c

+	/*
+	* If a method has been marked with aggressive inlining, check if we support
+	* aggressive inlining of the intrinsics type, if not, ignore aggressive inlining
+	* since it could end up inlining a large amount of code that most likely will end
+	* up as dead code.
+	*/


Could you help me understand why this is needed on Mono a bit?

The software fallback for V128/V256/V512 is currently implemented as doing 2x operations on the lower/upper halves (except for a couple of methods like Shuffle which operate on the full vector). So V512 is implemented as 2x V256, V256 is implemented as 2x V128, and V128 is implemented as 2x V64. V64 is then implemented as a loop over the scalar elements with potentially large amounts of generic code.

So I would imagine that on any platform where V128 is supported, that the codegen for V256/V512 should be generally nice/small, even if aggressively inlined and with no real dead code elimination required. Even in the case where you have an unsupported type like V256<Guid>, the first V128 operation should have a PNSE thrown (since whether a type is supported or not is typically a known constant).

So I'd only expect that this is needed for V64 on x86/x64 where there is no acceleration and it has to hit the scalar fallback with the loop over the elements. For RyuJIT this loop isn't an issue due to the generic specialization we do, allowing us to get it down to only the code path that is actually used. I believe this isn't possible for Mono today and is non-trivial to add, so the skipping of inlining does make sense in that regard.
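The halving structure described above can be illustrated with a small C model. This is a sketch only: the real fallback is generic C# in System.Private.CoreLib, and the types `V64`/`V128`/`V256`, the fixed `uint32_t` element type, and the `scalar_loop_calls` counter are assumptions introduced here to show how each wider operation expands into two narrower ones until the scalar loop is reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model types: each wider vector is two halves of the narrower one. */
typedef struct { uint32_t e[2]; } V64;
typedef struct { V64 lo, hi; } V128;
typedef struct { V128 lo, hi; } V256;

/* Counts how many times the scalar loop body is instantiated per operation. */
static int scalar_loop_calls;

/* V64: loop over the scalar elements. */
static V64
v64_add (V64 a, V64 b)
{
	V64 r;
	scalar_loop_calls++;
	for (size_t i = 0; i < 2; i++)
		r.e[i] = a.e[i] + b.e[i];
	return r;
}

/* V128: two V64 operations on the lower and upper halves. */
static V128
v128_add (V128 a, V128 b)
{
	V128 r;
	r.lo = v64_add (a.lo, b.lo);
	r.hi = v64_add (a.hi, b.hi);
	return r;
}

/* V256: two V128 operations on the lower and upper halves. */
static V256
v256_add (V256 a, V256 b)
{
	V256 r;
	r.lo = v128_add (a.lo, b.lo);
	r.hi = v128_add (a.hi, b.hi);
	return r;
}

/* Load eight consecutive elements into a V256. */
static V256
v256_load (const uint32_t *src)
{
	assert (src != NULL);
	V256 r;
	r.lo.lo.e[0] = src[0]; r.lo.lo.e[1] = src[1];
	r.lo.hi.e[0] = src[2]; r.lo.hi.e[1] = src[3];
	r.hi.lo.e[0] = src[4]; r.hi.lo.e[1] = src[5];
	r.hi.hi.e[0] = src[6]; r.hi.hi.e[1] = src[7];
	return r;
}
```

In this model a single `v256_add` reaches the scalar loop four times, and a 512-bit operation built the same way would reach it eight times. When no level is hardware accelerated and everything is inlined, that multiplication is one plausible source of the code growth discussed in this thread.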

So without disabling aggressive inlining I ended up with methods that were huge; not aggressively inlining these types (while still honoring inline size limits) made them sane again. I will need to revisit that change in order to tell exactly what happened with our inliner in the aggressive case.

Just validated this: with aggressive inlining disabled as above, a .NET 9 full AOT of S.P.C takes ~1.7 GB of memory, and without it compilation consumes an additional 600 MB. So I believe it's worth preventing aggressive inlining for these types, which are not hardware accelerated on any of the Mono supported platforms. I didn't add V64 since it seems to be handled a little differently, at least on ARM. I don't have bandwidth at the moment for a deeper analysis of why aggressive inlining of these template types causes such a large increase in memory, so I think this change is worth doing, at least short term. We could probably file an issue around the bloat of Vector256`1 and Vector512`1, but maybe the better long-term solution is to actually implement intrinsic support for these types.

For interpreter I'm handling this in a general fashion by detecting early dead pieces of code and not inlining any of the calls there. #97514. It is possible that a similar approach for jit can produce further improvements without having to special case classes.

I did try to experiment with some dead code elimination, but that caused issues and doesn't work with our LLVM codegen, so we explicitly turn that pass off for code that will be passed over to LLVM. Instead I made sure we could do more in the linker, but these methods still explode and survive the first simple LLVM optimization pass we now run in the function pass manager, so it feels like something in the inlined code prevents elimination until very late in the opt chain.

Trying to reenable branch optimizations when using llvm in this pr:
#97189

steveisok · 2024-01-18T16:58:03Z

System.Collections.Concurrent failed to AOT in a couple of suites. Not sure why, but here's the log https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-97096-merge-21c2561d7c2b483faf/Invariant.Tests/1/console.3bf63a3a.log?helixlogtype=result

src/mono/mono/mini/aot-runtime.c

src/mono/mono/mini/exceptions-amd64.c

lewing · 2024-02-09T02:13:50Z

Hopefully I resolved the conflicts correctly; someone should review.

vargaz · 2024-02-09T15:21:37Z

Failures are unrelated.

matouskozak · 2024-02-19T11:55:08Z

This PR looks to be responsible for a ~200 kB package size improvement on iOS HelloWorld (range of commits 339443b...a79c62d).

Good job everyone!

matouskozak · 2024-02-28T09:11:31Z

Edit: after the subsequent fix to this PR (#98515), the package size improvements on iOS HelloWorld are mostly gone (i.e., back to the original values).

lateralusX requested review from fanyang-mono, vargaz, BrzVlad, lambdageek, SamMonoRT and marek-safar as code owners January 17, 2024 13:18

dotnet-issue-labeler bot added the area-Codegen-LLVM-mono label Jan 17, 2024

ghost assigned lateralusX Jan 17, 2024

lateralusX mentioned this pull request Jan 17, 2024

[mono][aot] Investigate RAM consumption in Mono AOT compiler #95791

Closed

2 tasks

marek-safar reviewed Jan 17, 2024

lambdageek reviewed Jan 17, 2024

src/mono/mono/mini/aot-compiler.c (outdated)

tannergooding reviewed Jan 17, 2024

src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems (outdated)

tannergooding reviewed Jan 17, 2024

src/mono/System.Private.CoreLib/src/ILLink/ILLink.Substitutions.Intrinsics.x86.xml

tannergooding reviewed Jan 17, 2024

src/mono/mono/mini/simd-intrinsics.c (outdated)

tannergooding reviewed Jan 17, 2024

src/mono/mono/mini/simd-intrinsics.c (outdated)

tannergooding reviewed Jan 17, 2024

steveisok requested review from lewing and steveisok January 17, 2024 19:43

vargaz reviewed Jan 21, 2024

src/mono/mono/mini/aot-runtime.c (outdated)

vargaz reviewed Jan 21, 2024

src/mono/mono/mini/exceptions-amd64.c (outdated)

vargaz mentioned this pull request Feb 8, 2024

[mono][jit] Add minimal support for unsupported avx instruction sets. #98151

Merged

vargaz added a commit that referenced this pull request Feb 8, 2024

[mono][aot] Fix some memory leaks. (#98147)

f682619

Extracted from #97096. Author: Johan Lorensson <lateralusx.github@gmail.com>

vargaz added a commit that referenced this pull request Feb 8, 2024

[mono][aot] Fix a use after free. (#98149)

ac14935

Extracted from #97096. Author: Johan Lorensson <lateralusx.github@gmail.com>.

vargaz added a commit to vargaz/runtime that referenced this pull request Feb 9, 2024

[mono][jit] Add minimal support for unsupported avx instruction sets.

6925ab8

Extracted from dotnet#97096. Author: Johan Lorensson lateralusx.github@gmail.com.

build-analysis bot mentioned this pull request Feb 9, 2024

[8.0] Failing to build native components #94823

Closed

vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from 4e885b1 to 65f6f9c Compare February 9, 2024 04:16

vargaz added a commit that referenced this pull request Feb 9, 2024

[mono][jit] Add minimal support for unsupported avx instruction sets. (…

512bcaf

…#98151) Extracted from #97096. Author: Johan Lorensson lateralusx.github@gmail.com.

vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from 65f6f9c to aa39dff Compare February 9, 2024 05:29

vargaz mentioned this pull request Feb 9, 2024

[mono] Fix a windows build issue. #98206

Merged

vargaz added a commit to vargaz/runtime that referenced this pull request Feb 9, 2024

[mono] Fix a windows build issue.

661364f

Extracted from dotnet#97096. Author: Johan Lorensson lateralusx.github@gmail.com.

vargaz added a commit that referenced this pull request Feb 9, 2024

[mono] Fix a windows build issue. (#98206)

b4c400a

Extracted from #97096. Author: Johan Lorensson lateralusx.github@gmail.com.

lateralusX added 2 commits February 9, 2024 03:46

Fix review feedback.

cfaf8d9

vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from aa39dff to cfaf8d9 Compare February 9, 2024 08:46

Minor cleanups.

5ab0d94

vargaz force-pushed the lateralusX/reduce-aot-llvm-memory-use branch from ea8a484 to 5ab0d94 Compare February 9, 2024 09:06

vargaz approved these changes Feb 9, 2024

build-analysis bot mentioned this pull request Feb 9, 2024

Segmentation fault in System.Text.RegularExpressions.Tests #93206

Closed

steveisok merged commit a79c62d into dotnet:main Feb 9, 2024
186 of 189 checks passed

This was referenced Feb 14, 2024

[Perf] Linux/x64: 52 Regressions on 2/11/2024 7:44:56 AM dotnet/perf-autofiling-issues#28882

Open

[wasm][perf] Tracking #96444

Open

vargaz added a commit to vargaz/runtime that referenced this pull request Feb 15, 2024

[mono][wasm] Fix a performance regression introduced by dotnet#97096.

78d5fbf

vargaz added a commit that referenced this pull request Feb 20, 2024

[mono][wasm] Fix a performance regression introduced by #97096. (#98515)

2756c94

vargaz mentioned this pull request Mar 9, 2024

System.Numerics.Tensors.Tests cause long-running/hanging tests on Mono JIT (not interpreter) #97295

Open

github-actions bot locked and limited conversation to collaborators Mar 30, 2024
