Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

condition var->opcode == OP_REGOFFSET not met on Mono-based SDK #90292

Closed
ayakael opened this issue Aug 10, 2023 · 23 comments · Fixed by #91008
Closed

condition var->opcode == OP_REGOFFSET not met on Mono-based SDK #90292

ayakael opened this issue Aug 10, 2023 · 23 comments · Fixed by #91008
Assignees
Milestone

Comments

@ayakael
Copy link
Contributor

ayakael commented Aug 10, 2023

Description

I get the following error when building roslyn with the latest 8.0.0-preview.7 of dotnet on s390x musl-libc system (Alpine Linux):

* Assertion at /var/lib/gitlab-runner/builds/ayakael/dotnet-stage0/src/dotnet-8.0.0-preview.7.23375.6/src/runtime/src/mono/mono/mini/method-to-ir.c:13122, condition `var->opcode == OP_REGOFFSET' not met
3799=================================================================
3800	Native Crash Reporting
3801=================================================================
3802Got a SIGABRT while executing native code. This usually indicates
3803a fatal error in the mono runtime or one of the native libraries 
3804used by your application.
3805=================================================================
3806=================================================================
3807	Basic Fault Address Reporting
3808=================================================================
3809Memory around native instruction pointer (0x3ffa1cd715e):0x3ffa1cd714e  00 af a7 29 00 02 a7 49 00 00 a7 59 00 08 0a 00  ...)...I...Y....
38100x3ffa1cd715e  07 fe eb ef f0 70 00 24 a7 19 00 69 e3 f0 ff 60  .....p.$...i...`
38110x3ffa1cd716e  ff 71 0a 00 c0 e5 ff fe 59 9f eb ef f1 10 00 04  .q......Y.......
38120x3ffa1cd717e  b9 14 00 22 07 fe 07 07 07 07 eb ef f0 70 00 24  ...".........p.$
3813=================================================================
3814	Managed Stacktrace:
3815=================================================================
3816	  at <unknown> <0xffffffff>
3817	  at <unknown> <0xffffffff>
3818	  at System.CommandLine.Parsing.SymbolResult:GetValue <0x0008c>
3819	  at System.CommandLine.ParseResult:GetValue <0x00060>
3820	  at Microsoft.DotNet.Tools.Tool.Restore.ToolRestoreCommand:.ctor <0x0059a>
3821	  at <>c:<ConstructCommand>b__6_0 <0x00088>
3822	  at System.CommandLine.Invocation.AnonymousCliAction:Invoke <0x00068>
3823	  at System.CommandLine.Invocation.InvocationPipeline:Invoke <0x00184>
3824	  at System.CommandLine.ParseResult:Invoke <0x00044>
3825	  at Microsoft.DotNet.Cli.Program:ProcessArgs <0x00b50>
3826	  at Microsoft.DotNet.Cli.Program:Main <0x002b2>
3827	  at <Module>:runtime_invoke_int_object <0x000da>
3828=================================================================
3829./eng/build.sh: line 311: 14307 Aborted                 (core dumped) dotnet tool restore

Reproduction Steps

Build roslyn on Alpine Linux s390x platform using the following bootstrap

Expected behavior

Build should work.

Actual behavior

Build fails

Regression?

I've never been able to build the whole dotnet8 stack using mono-flavored runtime.

Known Workarounds

None

Configuration

Alpine Linux s390x

Other information

I couldn't even get a binlog, so it seems pretty low-level.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Aug 10, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 10, 2023
@ayakael ayakael changed the title condition `var->opcode == OP_REGOFFSET' not met on s390x musl-libc condition var->opcode == OP_REGOFFSET not met on s390x musl-libc Aug 10, 2023
@vargaz vargaz added area-Codegen-JIT-mono arch-s390x Related to s390x architecture (unsupported) and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Aug 10, 2023
@omajid
Copy link
Member

omajid commented Aug 10, 2023

cc @uweigand

@ayakael
Copy link
Contributor Author

ayakael commented Aug 10, 2023

I reproduced this with ppc64le, as well. Should I wait till GA to test out mono on Alpine?

@omajid
Copy link
Member

omajid commented Aug 10, 2023

Should I wait till GA to test out mono on Alpine?

I think testing this now is good. I am not aware of any reason why mono shouldn't work by design right now. If you are finding bugs that others haven't found, filing/fixing them now is probably easier than waiting until GA, which has a lower tolerance for risk and any extensive musl/mono changes that might be required.

@ayakael
Copy link
Contributor Author

ayakael commented Aug 10, 2023

Should I wait till GA to test out mono on Alpine?

I think testing this now is good. I am not aware of any reason why mono shouldn't work by design right now. If you are finding bugs that others haven't found, filing/fixing them now is probably easier than waiting until GA, which has a lower tolerance for risk and any extensive musl/mono changes that might be required.

Perfect, I have a bit of time this week to work on this as cut-off is probably just around the corner for GA. I have not built a bootstrap for preview.6 so I'll get that going and see if anything changes. My last bootstrap (available here) was preview.3 and it spits out even uglier error messages.

@ayakael
Copy link
Contributor Author

ayakael commented Aug 10, 2023

preview.6 is proving difficult to crossbuild. Instead, I've prepared an x86_64 mono-flavored runtime of preview.7 for easier testing. The issue persists, allowing for reproduction on x86_64.

https://lab.ilot.io/ayakael/dotnet-stage0/-/jobs/4496/artifacts/raw/dotnet-sdk-8.0.100-preview.7.23376.3-mono-r0-linux-musl-x86_64.tar.xz

@SamMonoRT
Copy link
Member

cc @directhex @akoeplinger

@ayakael
Copy link
Contributor Author

ayakael commented Aug 11, 2023

This problem doesn't occur when building roslyn within source-build, only when building roslyn directly. It is definitely an issue with the restore tool, but fortunately in my case I can workaround the issue as I do not need to build roslyn directly like I had to when bootstrapping dotnet7 and dotnet6. I tried applying the exact same properties as source-build roslyn with no luck.

@SamMonoRT
Copy link
Member

assigning to @uweigand and @directhex to discuss in their syncs. Moving to Future milestone.

@SamMonoRT SamMonoRT removed the untriaged New issue has not been triaged by the area owner label Aug 14, 2023
@SamMonoRT SamMonoRT added this to the Future milestone Aug 14, 2023
@iii-i
Copy link
Contributor

iii-i commented Aug 16, 2023

Smaller repro (works on Fedora too):

$ dotnet new console -o test
$ cd test
$ dotnet add package System.CommandLine --version 2.0.0-beta4.23307.1
$ cat >Program.cs <<HERE
using System.CommandLine;
using System.IO;
(new CliCommand("cmd")).Parse("cmd").GetValue(new CliOption<FileAccess>("--access"));
HERE
$ dotnet build
$ dotnet run
* Assertion at [...]/runtime/src/mono/mono/mini/method-to-ir.c:13122, condition `var->opcode == OP_REGOFFSET' not met

@iii-i
Copy link
Contributor

iii-i commented Aug 16, 2023

The assertion is triggered by ldaddr R91 <- R16, which is produced by mono_decompose_vtype_opts():

BEFORE LOWER-VTYPE-OPTS  14: [IN:  BB13(11), OUT:  BB1(3) ]
 il_seq_point il: 0xe3
 vzero R16 <-
 il_seq_point il: 0xe8, nonempty-stack
 il_seq_point intr il: 0xe8
 br [B1]
AFTER LOWER-VTYPE-OPTS  14: [IN:  BB13(11), OUT:  BB1(3) ]
 il_seq_point il: 0xe3
 ldaddr R91 <- R16
 storei4_membase_imm [R91] <-
 il_seq_point il: 0xe8, nonempty-stack
 il_seq_point intr il: 0xe8
 br [B1]

var->opcode is OP_REGVAR.

@directhex
Copy link
Member

@vargaz @lambdageek

@vargaz
Copy link
Contributor

vargaz commented Aug 16, 2023

Somebody who knows s390x/ppc needs to look into this.

@iii-i
Copy link
Contributor

iii-i commented Aug 16, 2023

I'm trying to debug this, but it seems to be very specific to mono, in particular, the gsharedvt area. I suspect that something goes wrong with the gsharedvt metadata when mini_emit_inst_for_method() creates OP_VZERO (in the System.Activator.CreateInstance case), but I'm still struggling with wrapping my head around it.

Edit: In particular, I don't see why we are not taking the cfg->gsharedvt && cfg->gsharedvt_vreg_to_idx [var->dreg] case in mono_spill_global_vars(). We are clearly dealing with a local variable. This should only be possible if mini_is_gsharedvt_variable_type (ins->inst_vtype) is false in mono_allocate_gsharedvt_vars().

@vargaz
Copy link
Contributor

vargaz commented Aug 16, 2023

cfg->gsharedvt should not even be true on s390x.

@iii-i
Copy link
Contributor

iii-i commented Aug 17, 2023

Thanks, I must have been looking into a completely wrong direction. cfg->gshared seems to be set only for AOT, which s390x doesn't have.

I wonder what would be the best approach to investigate this further. ldaddr R91 <- R16 must be valid IR, since R16 is global (used in more than 1 basic block). Should the OP_LDADDR handling logic in mono_spill_global_vars() support OP_REGVAR right hand sides?

@iii-i
Copy link
Contributor

iii-i commented Aug 17, 2023

One more finding: the following works

C<E>.M();
enum E {}
class C<T> { public static T? M() { return System.Activator.CreateInstance<T>(); } }

because the result is assigned to an OP_LOCAL, which mono_arch_allocate_vars() converts to OP_REGOFFSET. In the more complex case of System.CommandLine.CliArgument`1<FileAccess>:CreateDefaultValue(), the result is initially in an OP_LOCAL too, but is then moved to OP_REGVAR by:

        Reverse copyprop in BB14 on  move R16 <- R88

@iii-i
Copy link
Contributor

iii-i commented Aug 17, 2023

FWIW the reproducer from #90292 (comment) triggers the same issue on the x86_64 9.0.100-alpha.1.23416.26 built with --usemonoruntime. Somehow this worked on s390x as-is, but for x86_64 I also needed to copy https://github.com/dotnet/sdk/blob/main/NuGet.config into the test/ directory.

Let me summarize what I found so far. Here is the problematic method: https://github.com/dotnet/command-line-api/blob/a045dd54a4c44723c215d992288160eb1401bb7f/src/System.CommandLine/Argument%7BT%7D.cs#L166. Here is the JIT log: https://gist.github.com/iii-i/042ecd6a3c555fdb386994f6059ed378. The problematic basic block is B14.

The IL is:

converting (in B14: stack: 0) IL_00e3: call      0x2b000004
cmethod = System.IO.FileAccess System.Activator:CreateInstance<System.IO.FileAccess> ()
converting (in B14: stack: 1) IL_00e8: ret       

It's converted into:

AFTER METHOD-TO-IR 14: [IN:  BB13(0), OUT:  BB1(0) ]
 il_seq_point il: 0xe3
 vzero R93 <-
 il_seq_point il: 0xe8, nonempty-stack
 il_seq_point intr il: 0xe8
 move R16 <- R93
 br [B1]

Note that the method call became vzero. I think R93 is OP_LOCAL here. After cprop it's replaced by an OP_REGVAR:

	Reverse copyprop in BB14 on  move R16 <- R93
[...]
BEFORE SAFEPOINTS 14: [IN:  BB13(11), OUT:  BB1(3) ]
 il_seq_point il: 0xe3
 vzero R16 <-
[...]

Then vzero is replaced by ldaddr + storei4_membase_imm:

AFTER LOWER-VTYPE-OPTS  14: [IN:  BB13(11), OUT:  BB1(3) ]
 il_seq_point il: 0xe3
 ldaddr R96 <- R16
 storei4_membase_imm [R96] <-

Finally, the attempt to lower ldaddr causes an assertion error:

SPILL BLOCK 14:
 il_seq_point il: 0xe3
	     -1
	1  il_seq_point il: 0xe3
 ldaddr R96 <- R0
* Assertion at /home/iii/dotnet-s390x/runtime/src/mono/mono/mini/method-to-ir.c:13122, condition `var->opcode == OP_REGOFFSET' not met

I have a few thoughts on how to fix that, but it all would boil down to a bunch of "if" statements preventing some steps in the chain above from happening, which looks like a hack and would probably prevent some useful optimizations. So I'd like to ask - what would be a proper way to fix this?

@uweigand
Copy link
Contributor

Thanks Ilya for the write-up! I've had another look, and it appears to be that the copy propagation may be the main problem here. Note that in this case we have a transformation of:

 vzero R93 <-
 move R16 <- R93

into

 vzero R16 <-

where R16 (the return variable) is of type IREG and R93 is of type VREG.

As vzero is defined to take a VREG, propagating an IREG into it seems wrong.

On the other hand, move is defined to take IREG for both operands, so having a VREG in here may itself already be wrong. That move is generated by the mono_arch_emit_setret routine - but that code seems to have been copied across all the architectures, so if it's wrong, it's equally wrong everywhere.

To fix this, I think we should either:

  • not generate a move between IREG and VREG in the first place (but what else should mono_arch_emit_setret do in this case?); or
  • not consider such moves for copy propagation

To implement the second option, I think we should simply modify this condition in local-propagation.c:

                                /*
                                 * Perform a limited kind of reverse copy propagation, i.e.
                                 * transform B <- FOO; A <- B into A <- FOO
                                 * This isn't copyprop, not deadce, but it can only be performed
                                 * after handle_global_vregs () has run.
                                 */
                                if (!get_vreg_to_inst (cfg, ins->sreg1) && (spec2 [MONO_INST_DEST] != ' ') && (def->dreg == ins->sreg1) && !mono_bitset_test_fast (used, ins->sreg1) && !MONO_IS_STORE_MEMBASE (def) && reg_is_softreg (ins->sreg1, spec [MONO_INST_DEST]) && !mono_is_simd_accessor (def)) {

by adding a (spec2 [MONO_INST_DEST] == spec [MONO_INST_DEST]) check.

This does in fact fix the crash for me, without disabling any other copy propagation instances (in the small tests I did so far).

@vargaz does this look reasonable or am I missing something here?

@uweigand
Copy link
Contributor

Looking into this even further, maybe the actual problem is already the expansion into a vzero. The C# source code triggering the problem is

                        return Activator.CreateInstance<T>();

where the generic T is instantiated as System.IO.FileAccess, which is an enum type.

When allocating the return register, Mono uses the underlying base type (which is a primitive type System.Int32) in place of the enum type. However, when replacing the CreateInstance call with an inlined initialization, the code in mini_emit_inst_for_method treats this type not as a primitive type, but as a class type, and therefore uses the vzero path in the following condition:

                        MonoType *t = method_context->method_inst->type_argv [0];
                        MonoClass *arg0 = mono_class_from_mono_type_internal (t);
                        if (m_class_is_valuetype (arg0) && !mono_class_has_default_constructor (arg0, FALSE)) {
                                if (m_class_is_primitive (arg0)) {
                                        int dreg = alloc_dreg (cfg, mini_type_to_stack_type (cfg, t));
                                        mini_emit_init_rvar (cfg, dreg, t);
                                        ins = cfg->cbb->last_ins;
                                } else {
                                        MONO_INST_NEW (cfg, ins, mini_class_is_simd (cfg, arg0) ? OP_XZERO : OP_VZERO);
                                        ins->dreg = mono_alloc_dreg (cfg, STACK_VTYPE);
                                        ins->type = STACK_VTYPE;
                                        ins->klass = arg0;
                                        MONO_ADD_INS (cfg->cbb, ins);
                                }
                                return ins;
                        }

If I change this code to handle enum type by using the base type here as well, the problem also goes away. Plus, the generated code is more efficient than the code I got with the local-propagation.c change from the previous comment (just sets a register to 0 instead of storing 0 to a stack slot and reloading from there).

For reference, here's the change I'm testing:

diff --git a/src/mono/mono/mini/intrinsics.c b/src/mono/mono/mini/intrinsics.c
index 68dc78bd22b..ad2cd406073 100644
--- a/src/mono/mono/mini/intrinsics.c
+++ b/src/mono/mono/mini/intrinsics.c
@@ -2074,7 +2074,9 @@ mini_emit_inst_for_method (MonoCompile *cfg, MonoMethod *cmethod, MonoMethodSign
                        MonoType *t = method_context->method_inst->type_argv [0];
                        MonoClass *arg0 = mono_class_from_mono_type_internal (t);
                        if (m_class_is_valuetype (arg0) && !mono_class_has_default_constructor (arg0, FALSE)) {
-                               if (m_class_is_primitive (arg0)) {
+                               if (m_class_is_primitive (arg0) || m_class_is_enumtype (arg0)) {
+                                       if (m_class_is_enumtype (arg0))
+                                               t = mono_class_enum_basetype_internal (arg0);
                                        int dreg = alloc_dreg (cfg, mini_type_to_stack_type (cfg, t));
                                        mini_emit_init_rvar (cfg, dreg, t);
                                        ins = cfg->cbb->last_ins;

I'd still appreciate feedback on whether this (or the above) is the right approach.

@uweigand
Copy link
Contributor

Note that this JIT bug is triggered by code that is present in the .NET 8 version of msbuild starting with preview7. This means that .NET 8 msbuild will crash in many scenarios on Mono-based platforms. (For example, I cannot build the current runtime repository natively on s390x any more, since this now requires preview7 to build). While the title of this issue still refers to musl-libc, this is actually unrelated and happens on glibc-based platforms just the same.

This represents a serious regression in .NET 8 and should be fixed before release.

@directhex directhex changed the title condition var->opcode == OP_REGOFFSET not met on s390x musl-libc condition var->opcode == OP_REGOFFSET not met on Mono-based SDK Aug 23, 2023
@directhex directhex removed the arch-s390x Related to s390x architecture (unsupported) label Aug 23, 2023
@akoeplinger akoeplinger modified the milestones: Future, 8.0.0 Aug 23, 2023
uweigand added a commit to uweigand/runtime that referenced this issue Aug 23, 2023
Use underlying base type when deciding how to inline a
CreateInstance invocation in mini_emit_inst_for_method.

Fixes dotnet#90292
(Mono abort causing .NET 8 msbuild regression).
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 23, 2023
@uweigand
Copy link
Contributor

FYI - I've now submitted the patch in my latest comment as PR, to verify it passes CI.

@directhex
Copy link
Member

Do you want that dispatched on PPC/390?

@uweigand
Copy link
Contributor

Ah yes, that would be good.

github-actions bot pushed a commit that referenced this issue Aug 24, 2023
Use underlying base type when deciding how to inline a
CreateInstance invocation in mini_emit_inst_for_method.

Fixes #90292
(Mono abort causing .NET 8 msbuild regression).
fanyang-mono pushed a commit that referenced this issue Aug 28, 2023
Use underlying base type when deciding how to inline a
CreateInstance invocation in mini_emit_inst_for_method.

Fixes #90292
(Mono abort causing .NET 8 msbuild regression).
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 28, 2023
carlossanlop pushed a commit that referenced this issue Aug 28, 2023
Use underlying base type when deciding how to inline a
CreateInstance invocation in mini_emit_inst_for_method.

Fixes #90292
(Mono abort causing .NET 8 msbuild regression).

Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Larry Ewing <lewing@microsoft.com>
@ghost ghost locked as resolved and limited conversation to collaborators Sep 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants