Hopper-only crash: Unsupported loop structure. Two loops are mapped together. #2685
Can you try this again with ToT? It is not failing for me locally.
I'm running pretty close to ToT:

```console
$ python3 -m pip freeze | grep -i nvfuse
nvfuser @ git+https://github.com/NVIDIA/Fuser.git@db95e48689ed640cff577c87ca3b0913c2d6989f
```

Are you perhaps using Ampere? Now that I look, I also can't reproduce on my Ampere-based workstation, only on Hopper-based nodes. Apologies for not including this originally; let me edit the bug to be clearer.
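(Side note for anyone triaging similar reports: the build and hardware can also be checked from inside Python. A minimal sketch; `nvfuser.version()` is an assumption based on the version string the auto-generated repros print:)

```python
import torch
import nvfuser

print("nvfuser:", nvfuser.version())                    # e.g. 0.2.8+git...
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
# Hopper reports (9, 0); Ampere parts report (8, 0) or (8, 6).
```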
Ah, that's it. Thanks for the pointer! Will look into it more soon.
I have a smaller repro. This repro schedules with the inner persistent scheduler without segmenting:

```python
# CUDA devices:
# 0: NVIDIA H100 80GB HBM3
# torch version: 2.5.0a0+git8927fc2
# nvfuser version: 0.2.8+gitdd6886f
import torch
from nvfuser import FusionDefinition, DataType
def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
S0 = fd.define_scalar(None, dtype=DataType.Int)
T1 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[True, True, True, True], dtype=DataType.Float, is_cpu=False, stride_order=[3, 2, 1, 0])
T2 = fd.define_tensor(shape=[-1], contiguity=[True], dtype=DataType.Float, is_cpu=False, stride_order=[0])
T3 = fd.define_tensor(shape=[-1], contiguity=[True], dtype=DataType.Float, is_cpu=False, stride_order=[0])
T4 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[True, True, True, True], dtype=DataType.Float, is_cpu=False, stride_order=[3, 2, 1, 0])
T5 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[None, True, None, None], dtype=DataType.Float, is_cpu=False, stride_order=[3, 2, 1, 0])
T6 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[None, True, None, None], dtype=DataType.Float, is_cpu=False, stride_order=[3, 2, 1, 0])
T7 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[True, True, True, True], dtype=DataType.Bool, is_cpu=False, stride_order=[3, 2, 1, 0])
T8 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[True, True, True, True], dtype=DataType.Bool, is_cpu=False, stride_order=[3, 2, 1, 0])
T9 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[True, True, True, True], dtype=DataType.Bool, is_cpu=False, stride_order=[3, 2, 1, 0])
T10 = fd.define_tensor(shape=[-1, -1, -1, -1], contiguity=[True, True, True, True], dtype=DataType.Bool, is_cpu=False, stride_order=[3, 2, 1, 0])
T11 = fd.define_tensor(shape=[-1, -1, -1, -1, 1], contiguity=[True, True, True, True, None], dtype=DataType.Float, is_cpu=False, stride_order=[4, 3, 2, 1, 0])
T12 = fd.ops.sum(T11, dims=[4], keepdim=False, dtype=DataType.Null)
T13 = fd.ops.set(T12)
T14 = fd.ops.set(T12)
T15 = fd.ops.sum(T14, dims=[0, 2, 3], keepdim=False, dtype=DataType.Null)
S16 = fd.define_scalar(1, dtype=DataType.Int)
S17 = fd.define_scalar(288, dtype=DataType.Int)
S18 = fd.define_scalar(1, dtype=DataType.Int)
S19 = fd.define_scalar(1, dtype=DataType.Int)
V20 = fd.define_vector([S16, S17, S18, S19], dtype=DataType.Int)
T21 = fd.ops.broadcast_in_dim(T15, shape=V20, broadcast_dims=[1])
T22 = fd.ops.set(T12)
T23 = fd.ops.sum(T22, dims=[0, 2, 3], keepdim=False, dtype=DataType.Null)
T24 = fd.ops.broadcast_in_dim(T23, shape=V20, broadcast_dims=[1])
S25 = fd.define_scalar(288, dtype=DataType.Int)
V26 = fd.define_vector([S25], dtype=DataType.Int)
T27 = fd.ops.reshape(T24, new_shape=V26)
S28 = fd.define_scalar(288, dtype=DataType.Int)
V29 = fd.define_vector([S28], dtype=DataType.Int)
T30 = fd.ops.reshape(T21, new_shape=V29)
S31 = fd.define_scalar(-0.500000, dtype=DataType.Double)
T32 = fd.ops.mul(S31, T30)
S33 = fd.define_scalar(3.00000, dtype=DataType.Double)
T34 = fd.ops.pow(T3, S33)
T35 = fd.ops.mul(T32, T34)
T36 = fd.ops.broadcast_in_dim(T27, shape=V20, broadcast_dims=[1])
S37 = fd.define_scalar(2, dtype=DataType.Int)
S38 = fd.define_scalar(288, dtype=DataType.Int)
S39 = fd.define_scalar(120, dtype=DataType.Int)
S40 = fd.define_scalar(160, dtype=DataType.Int)
V41 = fd.define_vector([S37, S38, S39, S40], dtype=DataType.Int)
T42 = fd.ops.broadcast_in_dim(T36, shape=V41, broadcast_dims=[0, 1, 2, 3])
S43 = fd.define_scalar(2.60417e-05, dtype=DataType.Double)
T44 = fd.ops.mul(S43, T42)
T45 = fd.ops.broadcast_in_dim(T35, shape=V20, broadcast_dims=[1])
T46 = fd.ops.broadcast_in_dim(T45, shape=V41, broadcast_dims=[0, 1, 2, 3])
T47 = fd.ops.broadcast_in_dim(T2, shape=V20, broadcast_dims=[1])
S48 = fd.define_scalar(2.00000, dtype=DataType.Double)
T49 = fd.ops.mul(S48, T46)
T50 = fd.ops.sub(T1, T47)
T51 = fd.ops.mul(T49, T50)
S52 = fd.ops.cast(S0, dtype=DataType.Double)
S53 = fd.ops.reciprocal(S52)
T54 = fd.ops.mul(T51, S53)
T55 = fd.ops.add(T44, T54)
T56 = fd.ops.add(T13, T55)
T57 = fd.ops.cast(T56, dtype=DataType.Half)
fd.add_output(T57)
with FusionDefinition() as fd:
nvfuser_fusion_id0(fd)
inputs = [
38400,
torch.randn((11059200,), dtype=torch.float32, device='cuda:0').as_strided((2, 288, 120, 160), (5529600, 19200, 160, 1)),
torch.randn((288,), dtype=torch.float32, device='cuda:0').as_strided((288,), (1,)),
torch.randn((288,), dtype=torch.float32, device='cuda:0').as_strided((288,), (1,)),
torch.randn((11059200,), dtype=torch.float32, device='cuda:0').as_strided((2, 288, 120, 160), (5529600, 19200, 160, 1)),
torch.randn((288,), dtype=torch.float32, device='cuda:0').as_strided((2, 288, 120, 160), (0, 1, 0, 0)),
torch.randn((288,), dtype=torch.float32, device='cuda:0').as_strided((2, 288, 120, 160), (0, 1, 0, 0)),
torch.randint(0, 2, (11059200,), dtype=torch.bool, device='cuda:0').as_strided((2, 288, 120, 160), (5529600, 19200, 160, 1)),
torch.randint(0, 2, (11059200,), dtype=torch.bool, device='cuda:0').as_strided((2, 288, 120, 160), (5529600, 19200, 160, 1)),
torch.randint(0, 2, (11059200,), dtype=torch.bool, device='cuda:0').as_strided((2, 288, 120, 160), (5529600, 19200, 160, 1)),
torch.randint(0, 2, (11059200,), dtype=torch.bool, device='cuda:0').as_strided((2, 288, 120, 160), (5529600, 19200, 160, 1)),
torch.randn((11059200,), dtype=torch.float32, device='cuda:0').as_strided((2, 288, 120, 160, 1), (5529600, 19200, 160, 1, 1)),
]
fd.execute(inputs)
```

The scheduler params on H100 are:
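(For anyone trying to recover these parameters themselves: nvFuser can print the chosen heuristics at runtime. A minimal sketch, assuming the dump option is still spelled `scheduler_params`; the authoritative option list lives in nvFuser's `csrc/options.cpp`:)

```python
import os

# Set the dump option before importing nvfuser, so it is visible whenever
# the library reads its environment.
os.environ["NVFUSER_DUMP"] = "scheduler_params"  # assumed option name

from nvfuser import FusionDefinition  # noqa: E402

# ... then build and execute the fusion exactly as in the repro above; the
# selected scheduler's heuristic parameters are printed during compilation.
```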
cc @liqiangxl in case anything pops out at you about this fusion.
BTW, I just checked, and the above smaller repro also errors on A100 with these params:
Ah, I think this is maybe a GeForce vs. Tesla type of thing. Maybe the additional smem or higher SM count could explain the H100/A100 nodes hitting the issue while my 3090 Ti and your workstation device do not.
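(To make that hardware comparison concrete, the relevant properties can be read from torch. A small sketch using only attributes I believe `get_device_properties` exposes; the counts in the comments are from public spec sheets:)

```python
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                   # e.g. "NVIDIA H100 80GB HBM3"
print(props.multi_processor_count)  # SM count: 132 on H100 SXM vs 84 on a 3090 Ti
print(props.total_memory // 2**20, "MiB device memory")
# Per-SM shared memory also differs (larger on H100 than on GA102), but that
# figure comes from the CUDA programming guide, not from these attributes.
```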
CC @naoyam, as the error is happening in loop promotion.
The fusion (Jacob's repro) generates a 3D inner persistent kernel, then uses shared-memory persistence, which is not supported yet. I added a WAR fix to disable the use of shared memory for 3D inner persistent kernels. I will add this support and enable it later.
The original fusion (Tom's repro) works fine with #2754 on both A100 and H100.
Thanks for the fast help, @liqiangxl! I can confirm that your #2754 fixes this issue. 🎉 However, the larger model (that this reproducer came from) dies a second later with a very similar error message.
Thanks for double-checking. #2754 fixes a real issue, so we need to get it in, but it is not a fix for #2685.
I was very excited, so I had already started it running beforehand :-) #2759 does indeed fix the issue. The larger program now dies with a different error, which sounds like a separate problem. I can file a new issue for that one.
Thanks for the check. Let me know when the new issue is created and I'll check whether it is related to #2759 or something else.
@naoyam is trying to root-cause this issue. The WAR doesn't seem sufficient to me given that we don't know what the actual issue is. @tfogal, if the WARs provided help you make forward progress, great. @liqiangxl, please don't merge them in unless we know the root cause. It also doesn't seem like your WAR would last very long if we continued to exercise failing cases, given the line I commented on.
Here are all the tensors after scheduling:
All tensors appear to be scheduled consistently, except for T15, and I'm not sure why. @liqiangxl, is it because of #2754? Supposing it's indeed due to #2754, disabling the scheduler seems to be the first thing we should do. Although it should not result in an error, since it doesn't fail with the new indexer, I don't think it's worthwhile to fix the legacy indexer.
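(To reproduce this kind of post-scheduling dump from the Python frontend, something like the following should work. Treat the helper names as assumptions and check `FusionDefinition` in your checkout if they have moved:)

```python
# After fd.execute(inputs) from the repro above has run:
print(fd.last_scheduled_fusion_ir(tensor_transforms=True))  # scheduled tensors
print(fd.last_cuda_code())                                  # generated kernel
```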
Also, as a clarification, the original error doesn't have anything specific to do with loop promotion. The problem is self-mapping, which in this case should be a false alarm.
Actually, the smem support doesn't seem to be directly related. #2754 still generates a similarly scheduled fusion pattern with Tom's next.py:

whereas the rest of the tensors look like:
This seems to indicate there's indeed some problem with the inner persistent scheduler, whether or not smem is used. @liqiangxl, what exactly did you mean by smem persistence not being supported yet?
(1) For 3D reduction we use the following schedule, which is not set up in smem persistence:

(2) You are right, this issue is not related to smem persistence; disabling it fixes the original issue only because it changed the segmentation results and bypassed the pattern that causes the error.

(3) Further investigation found that the inconsistent schedule of T15 comes from:

T15 after:

During:
So, does that mean we are currently generating invalid heuristics? I'm not sure what consequence it would have, but it sounds like something that needs to be addressed quickly.

I'll look into it.
Seems due to:
Here's what I believe is the root cause of the issue (thanks @zasdfgbnm for the discussion on transform propagation). This is an inner normalization fusion, but in addition to the usual normalization pattern, there's also a broadcast domain that's just squeezed without concretization. As usual, we use the reduction tensor as the reference tensor, but it knows nothing about the squeezed, non-concretized broadcast domain. While that may not matter in some cases, in the case of the original repro, the non-concretized broadcast domain results in the inconsistent ordering seen below:
I thought #2765 could work around the issue, but, while it's sufficient for this particular fusion, we could come up with other fusions that might not work as we would like. In fact, since such broadcast domains are not represented by the reference, it doesn't seem well defined how they should be transformed, and I'm not sure if there's a generic algorithm to find a right transformation. Instead of trying to transform those non-concretized broadcast domains, I think it makes more sense to just ignore or remove them from the fusion. That would solve the propagation issue without changing the semantics of the original user fusion. To detect such safe-to-remove broadcast domains, I think we can use the Permissive graph: if a Permissive group has no non-broadcast domain, the group should only contain non-concretized broadcast domains.

Short-term action: As long as #2765 doesn't result in other failures, I believe it's a strict improvement, so we could move forward with it as a short-term workaround.

Long-term action: We should consider a pre-segmentation pass to remove non-concretized broadcast domains.
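(To make the problematic pattern concrete: below is a hand-distilled sketch, not the original repro. The trailing size-1 axis of T11 in the repro above is reduced away without ever being broadcast against a real extent, which is the "squeezed without concretization" situation described. Whether this distilled version takes the exact same scheduler path depends on segmentation, so the full repro remains authoritative:)

```python
import torch
from nvfuser import FusionDefinition, DataType

def squeezed_broadcast_fusion(fd: FusionDefinition) -> None:
    def ivec(vals):
        # Build shape vectors the same way the generated repro does.
        return fd.define_vector(
            [fd.define_scalar(v, dtype=DataType.Int) for v in vals],
            dtype=DataType.Int,
        )

    # [N, C, H, W, 1]: the trailing size-1 axis is a broadcast domain that
    # is never concretized anywhere in the fusion.
    t_in = fd.define_tensor(
        shape=[-1, -1, -1, -1, 1],
        contiguity=[True, True, True, True, None],
        dtype=DataType.Float,
    )
    # Summing over the broadcast axis is effectively a squeeze.
    t_sq = fd.ops.sum(t_in, dims=[4], keepdim=False)
    # Inner-normalization-style reduction over everything but channels.
    t_red = fd.ops.sum(t_sq, dims=[0, 2, 3], keepdim=False)
    t_bc = fd.ops.broadcast_in_dim(
        t_red, shape=ivec([1, 288, 1, 1]), broadcast_dims=[1])
    t_full = fd.ops.broadcast_in_dim(
        t_bc, shape=ivec([2, 288, 120, 160]), broadcast_dims=[0, 1, 2, 3])
    # Consuming both the squeezed tensor and its reduction is what makes the
    # fusion persistent, with the reduction as the natural reference tensor.
    fd.add_output(fd.ops.add(t_sq, t_full))

with FusionDefinition() as fd:
    squeezed_broadcast_fusion(fd)

fd.execute([torch.randn(2, 288, 120, 160, 1, device="cuda")])
```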
#2765 seems like a great generic improvement, whether or not it's the right fix here. I talked in Slack with @naoyam, and he's trying out another approach that we're hopeful will work: simply move any non-concretized broadcasts to the innermost dimension in scheduling. These dimensions can be found easily with the permissive ID graph. If we move them to the innermost position pre-scheduling, and the reference does not have these dimensions represented, then they won't get moved during propagation. @naoyam is trying this out, and it should be straightforward relative to the other approaches.
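(For intuition, a toy sketch of that reordering in plain Python. nvFuser's real implementation operates on `IterDomain`s and the permissive ID graph on the C++ side; the `Domain` type and its flags below are illustrative stand-ins, not nvFuser API:)

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    name: str
    is_broadcast: bool
    is_concretized: bool  # in nvFuser, derived from the permissive ID graph

def broadcasts_innermost(domains):
    """Stable partition: keep every other domain in its original order and
    append non-concretized broadcasts as the innermost dimensions. Since the
    reference tensor doesn't map these domains, transform propagation then
    leaves them untouched at the innermost position."""
    stray = [d for d in domains if d.is_broadcast and not d.is_concretized]
    kept = [d for d in domains if not (d.is_broadcast and not d.is_concretized)]
    return kept + stray

# A layout like T11's, but with the squeezed broadcast in the middle:
doms = [Domain("N", False, True), Domain("C", False, True),
        Domain("b1", True, False),
        Domain("H", False, True), Domain("W", False, True)]
print([d.name for d in broadcasts_innermost(doms)])
# -> ['N', 'C', 'H', 'W', 'b1']
```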
The given program crashes on Hopper with the error:
Full program
Interestingly, the program runs just fine on Ampere.