Unsafe.BitCast/As does not remove stack spilling for custom SIMD type #86448

@xoofx

Hello JIT compiler friends, 😊

Description

Some legacy vector math APIs provide their own Vector3/Vector4 types and would like to transition to a SIMD-optimized implementation through System.Numerics.Vector3/Vector4. I tried Unsafe.As, but it generates stack spills, so I was hoping that the new .NET 8 Unsafe.BitCast (#82917) would help here. It does not seem to. Fixing this might require deeper changes in the JIT so that it recognizes when a type is effectively an alias of a SIMD type.
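For reference, the earlier Unsafe.As attempt mentioned above might look roughly like the sketch below. The class name `UnsafeAsSketch` is only illustrative, and the exact shape of the original attempt is an assumption; it relies on `MyVector4` having the same size and field layout as `Vector4`.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static class UnsafeAsSketch
{
    // Same layout as in the repro: four consecutive floats (16 bytes).
    public struct MyVector4
    {
        public float X, Y, Z, W;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static MyVector4 Add(MyVector4 v1, MyVector4 v2)
    {
        // Unsafe.As reinterprets a reference, so it needs the arguments
        // to be addressable locals.
        Vector4 sum = Unsafe.As<MyVector4, Vector4>(ref v1)
                    + Unsafe.As<MyVector4, Vector4>(ref v2);

        // Reinterpret the result back and copy it out by value.
        return Unsafe.As<Vector4, MyVector4>(ref sum);
    }
}
```

Because Unsafe.As works on references rather than values, the locals have their address taken, which is one plausible reason for the spills; Unsafe.BitCast takes its argument by value.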

With the latest .NET 8 preview (8.0.100-preview.4.23260.5), the following code:

using System.Numerics;
using System.Runtime.CompilerServices;

namespace BenchCSharpApp;

public class TestCustomVector4
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static float Calculator(MyVector4 v1, MyVector4 v2)
    {
        var a = Add(v1, v2);
        var b = Multiply(v1, v2);
        var c = Subtract(v1, v2);
        var d = Subtract(Add(a, b), c);

        return Length(d);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static float Length(MyVector4 v1)
    {
        return Unsafe.BitCast<MyVector4, Vector4>(v1).Length();
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Multiply(MyVector4 v1, MyVector4 v2)
    {
        return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) * Unsafe.BitCast<MyVector4, Vector4>(v2));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Add(MyVector4 v1, MyVector4 v2)
    {
        return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) + Unsafe.BitCast<MyVector4, Vector4>(v2));
    }


    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Subtract(MyVector4 v1, MyVector4 v2)
    {
        return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) - Unsafe.BitCast<MyVector4, Vector4>(v2));
    }

    public struct MyVector4
    {
        public MyVector4(float x, float y, float z, float w)
        {
            X = x;
            Y = y;
            Z = z;
            W = w;
        }

        public float X;
        public float Y;
        public float Z;
        public float W;
    }
}

generates the following ASM:

; Assembly listing for method BenchCSharpApp.TestCustomVector4Bis:Calculator(BenchCSharpApp.TestCustomVector4Bis+MyVector4,BenchCSharpApp.TestCustomVector4Bis+MyVector4):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       sub      rsp, 104
       vzeroupper 
 
G_M000_IG02:                ;; offset=0007H
       vmovups  xmm0, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+58H], xmm0
       vmovups  xmm0, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+48H], xmm0
       vmovups  xmm0, xmmword ptr [rsp+58H]
       vaddps   xmm0, xmm0, xmmword ptr [rsp+48H]
       vmovups  xmm1, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+38H], xmm1
       vmovups  xmm1, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+28H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+38H]
       vmulps   xmm1, xmm1, xmmword ptr [rsp+28H]
       vmovups  xmm2, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+18H], xmm2
       vmovups  xmm2, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+08H], xmm2
       vmovups  xmm2, xmmword ptr [rsp+18H]
       vsubps   xmm2, xmm2, xmmword ptr [rsp+08H]
       vaddps   xmm0, xmm0, xmm1
       vsubps   xmm0, xmm0, xmm2
       vdpps    xmm0, xmm0, xmm0, -1
       vsqrtss  xmm0, xmm0
 
G_M000_IG03:                ;; offset=0079H
       add      rsp, 104
       ret      
 
; Total bytes of code 126

while a version that would not perform any stack spill could achieve the following:


G_M000_IG01:                ;; offset=0000H
       vzeroupper 
 
G_M000_IG02:                ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rcx]
       vmovups  xmm1, xmmword ptr [rdx]
       vaddps   xmm2, xmm0, xmm1
       vmulps   xmm3, xmm0, xmm1
       vaddps   xmm2, xmm2, xmm3
       vsubps   xmm0, xmm0, xmm1
       vsubps   xmm0, xmm2, xmm0
       vdpps    xmm0, xmm0, xmm0, -1
       vsqrtss  xmm0, xmm0
 
G_M000_IG03:                ;; offset=0029H
       ret      

I would not mind putting an attribute on such custom types, e.g. [AliasVector(typeof(Vector4))], if it could help the JIT detect SIMD-only usage.
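To make the suggestion concrete, here is a minimal sketch of what such an opt-in could look like from the user side. `AliasVectorAttribute` is hypothetical: it does not exist in the runtime, and the JIT does not recognize it, so declaring it has no effect on codegen today.

```csharp
using System;
using System.Numerics;

// Hypothetical marker attribute; nothing in the runtime or JIT reads it.
[AttributeUsage(AttributeTargets.Struct)]
public sealed class AliasVectorAttribute : Attribute
{
    public AliasVectorAttribute(Type simdType) => SimdType = simdType;

    // The SIMD type the annotated struct claims to be layout-compatible with.
    public Type SimdType { get; }
}

// The custom type declares that it should be treated like Vector4.
[AliasVector(typeof(Vector4))]
public struct MyVector4
{
    public float X, Y, Z, W;
}
```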

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
