Skip to content

Tentative support for avx512vl extensions to 256 bit registers#1345

Open
serge-sans-paille wants to merge 4 commits into
masterfrom
feature/avx512f_256
Open

Tentative support for avx512vl extensions to 256 bit registers#1345
serge-sans-paille wants to merge 4 commits into
masterfrom
feature/avx512f_256

Conversation

@serge-sans-paille
Copy link
Copy Markdown
Contributor

In addition to missing instructions (e.g. bas on int64_t etc) this mostly changes the mask representation from vector register to scalar, thus the big diff.

@DiamonDinoia
Copy link
Copy Markdown
Contributor

Hi @serge-sans-paille,

Are we sure that we only need avx512f for this? It seems to me that instructions like https://diamondinoia.com/simdref/#_mm256_cmp_epi32_mask requires avx512f + VL.

Let me know where I am wrong.

Cheers,
Marco

@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

You're right, all of this requires avx512f+avx512vl. It turns out most build have both (see https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512) but we currently don't have anything to model avx512vl, which should be the parent of avx512f_256.

@DiamonDinoia
Copy link
Copy Markdown
Contributor

I guess it might be called avx512vl_256 at that point as we will also have avx512vl_128. What do you think?

I agree, most CPU have 512+extensions.

@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

I guess it might be called avx512vl_256 at that point as we will also have avx512vl_128. What do you think?

I agree, most CPU have 512+extensions.

It looks like avx512vl does not have any 512bit instruction :-) but it still has avx512 in its name and it implies avx512f, so I agree with you.

@DiamonDinoia
Copy link
Copy Markdown
Contributor

Actually, my suggestion might be wrong: https://diamondinoia.com/simdref/#_mm_maskz_andnot_pd

There are instructions that require DQ + VL for example.

@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

based on the graph above, looks like DQ => VL, so we should still be fine?

@DiamonDinoia
Copy link
Copy Markdown
Contributor

DiamonDinoia commented May 12, 2026

It seems like it, also basically all architectures listed in the chart have DQ,VL,BW together. So in any case is should be fine.

@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

@DiamonDinoia I've move the minimal creation of avx512vl to #1350 , once merged I'll rebase this PR

@DiamonDinoia
Copy link
Copy Markdown
Contributor

Sure! Ping me when you need a review. It might be spotty when I'm on holiday

serge-sans-paille added a commit that referenced this pull request May 14, 2026
avx512vl just extends 128 and 256 bits register with some operations, it
does not have any 512 bit instructions, so the description is mostly
empty and preliminary work for #1345
@serge-sans-paille serge-sans-paille changed the title Tentative support for avx512f extensions to 256 bit registers Tentative support for avx512vl extensions to 256 bit registers May 14, 2026
@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

Sure! Ping me when you need a review. It might be spotty when I'm on holiday

ping ;-)

Copy link
Copy Markdown
Contributor

@DiamonDinoia DiamonDinoia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @serge-sans-paille,

Just some suggestions and maybe we could add to the tests:

TEST_CASE_TEMPLATE_DEFINE("batch_bool mask hygiene", B, batch_bool_hygiene_id)
{
    using value_type = typename B::value_type;
    using batch_type = xsimd::batch<value_type, typename B::arch_type>;

    SUBCASE("any(a != a) is false")        
    {
        batch_type a(value_type(1));
        CHECK_FALSE(xsimd::any(a != a));
    }

    SUBCASE("any(~(a == a)) is false")        
    {
        batch_type a(value_type(1));
        CHECK_FALSE(xsimd::any(~(a == a)));
    }

    SUBCASE("eq(false_mask, false_mask) is all-true")  
    {
        auto m0 = (batch_type(value_type(1)) != batch_type(value_type(1)));
        CHECK_UNARY(xsimd::all(m0 == m0));
    }

    SUBCASE("from_mask ignores bits above lane count")  
    {
        constexpr std::size_t N = B::size;
        uint64_t valid_all_true = (N == 64) ? ~uint64_t(0)
                                            : ((uint64_t(1) << N) - 1);
        uint64_t junk            = valid_all_true | (uint64_t(1) << 63);
        B clean = B::from_mask(valid_all_true);
        B dirty = B::from_mask(junk);
        CHECK_EQ(clean.mask(), dirty.mask());
        CHECK_UNARY(xsimd::all(clean == dirty));
    }

    SUBCASE("batch_bool stored to bool[] is canonical 0/1") 
    {
        batch_type a(value_type(1)), b(value_type(2));
        auto m = (a == b);                    // all false
        alignas(64) bool buf[B::size + 1] = {true, true};  // sentinel
        xsimd::store_aligned(buf, m);
        for (std::size_t i = 0; i < B::size; ++i)
        {
            // bit-level check: must be exactly 0, not just falsy
            CHECK_EQ(*reinterpret_cast<uint8_t const*>(&buf[i]), uint8_t(0));
        }
    }
}
TEST_CASE_TEMPLATE_APPLY(batch_bool_hygiene_id, batch_bool_types);
```cpp
TEST_CASE_TEMPLATE_DEFINE("store_masked respects Mode", B, store_masked_mode_id)
{
    using T = typename B::value_type;
    using A = typename B::arch_type;
    constexpr std::size_t N = B::size;

    // Unaligned-mode + unaligned pointer: must not fault.       
    alignas(64) T big[2 * N + 1] = {};
    T* unaligned_p = big + 1;                 // sizeof(T)-aligned only

    struct AllTrue { static constexpr bool get(std::size_t, std::size_t) { return true; } };
    auto cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();

    B v(T(7));
    xsimd::kernel::store_masked(unaligned_p, v, cst,
                                xsimd::unaligned_mode{},
                                xsimd::kernel::requires_arch<A>{});
    for (std::size_t i = 0; i < N; ++i)
        CHECK_EQ(unaligned_p[i], T(7));

    // Overload resolution: store_masked with same-typed mask must compile.
    // If C3 regresses, this becomes a compile error.                
    auto signed_cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();
    alignas(64) T aligned_buf[N] = {};
    xsimd::kernel::store_masked(aligned_buf, v, signed_cst,
                                xsimd::aligned_mode{},
                                xsimd::kernel::requires_arch<A>{});
    for (std::size_t i = 0; i < N; ++i)
        CHECK_EQ(aligned_buf[i], T(7));
}
TEST_CASE_TEMPLATE_APPLY(store_masked_mode_id, batch_types_for_masked_store);

Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp Outdated
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp
Comment thread .github/workflows/linux.yml
Comment thread include/xsimd/config/xsimd_cpu_features_x86.hpp Outdated
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp Outdated
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp Outdated
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp Outdated
serge-sans-paille added a commit that referenced this pull request May 16, 2026
avx512vl just extends 128 and 256 bits register with some operations, it
does not have any 512 bit instructions, so the description is mostly
empty and preliminary work for #1345
@serge-sans-paille serge-sans-paille force-pushed the feature/avx512f_256 branch 6 times, most recently from e5816f7 to 085ad2f Compare May 16, 2026 13:34
@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

All green!

Copy link
Copy Markdown
Contributor

@DiamonDinoia DiamonDinoia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Can you check that the following code produces the expected output? Because on my machine I get that a == a is true and a != is also true in some cases for example.

Also are we handling the store/load alignment correctly? I seem to notice (unless overload resolution dispatches to some other types) that we always use unaligned for integer types and always aligned for float/double?

// clang++ -march=x64-64-v4 -I/home/marco/repos/xsimd/include -DXSIMD_DEFAULT_ARCH=xsimd::avx512vl_256 /tmp/xsimd_stray_bits.cpp -o /tmp/xsimd_stray_bits
#include <cstdio>
#include <cstdint>
#include "xsimd/xsimd.hpp"


#include <iostream>
#include <typeinfo>
#include <cxxabi.h>
#include <memory>

// Helper function to demangle typeid names
std::string demangle(const char* mangled) {
    int status = 0;
    std::unique_ptr<char, decltype(&free)> result(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status),
        &free
    );
    return (status == 0) ? result.get() : mangled;
}


static int fail_count = 0;

#define EXPECT(label, expr, expected)                                          \
    do {                                                                       \
        const bool got = (expr);                                               \
        const bool ok  = (got == (expected));                                  \
        std::printf("    %-58s expected %-5s got %-5s  %s\n",                  \
                    (label),                                                   \
                    (expected) ? "true" : "false",                             \
                    got ? "true" : "false",                                    \
                    ok ? "PASS" : "FAIL");                                     \
        if (!ok) ++fail_count;                                                 \
    } while (0)

template <class T>
int check() {
    using A = xsimd::avx512vl_256;
    using B = xsimd::batch<T, A>;
    std::cout << "\n=== " << demangle(typeid(B).name())
              << "  (arch=" << demangle(typeid(A).name())
              << ", lanes=" << B::size << ") ===\n";

    B a(T(1));
    std::cout << "  a = " << a << "\n";

    // a equals itself
    EXPECT("a should equal itself                                 [all(a == a)]",
           xsimd::all(a == a), true);
    EXPECT("a should not differ from itself                       [any(a != a)]",
           xsimd::any(a != a), false);

    // combine two logically empty masks
    auto m1 = (a != a);
    auto m2 = (a != a);
    EXPECT("AND of two empty masks should be empty                [any(m1 & m2)]",
           xsimd::any(m1 & m2), false);
    EXPECT("OR  of two empty masks should be empty                [any(m1 | m2)]",
           xsimd::any(m1 | m2), false);
    EXPECT("XOR of identical empty masks should be empty          [any(m1 ^ m2)]",
           xsimd::any(m1 ^ m2), false);

    // complement of an empty mask is a full mask
    auto m_not = ~m1;
    EXPECT("complement of an empty mask should be full            [all(~m1)]",
           xsimd::all(m_not), true);
    EXPECT("complement of an empty mask should be nonempty        [any(~m1)]",
           xsimd::any(m_not), true);

    // a bool batch equals itself
    EXPECT("a bool mask should equal itself                       [all(m_not == m_not)]",
           xsimd::all(m_not == m_not), true);

    return 0;
}

int main() {
    int fails = 0;
    fails += check<int8_t>();
    fails += check<int16_t>();
    fails += check<int32_t>();
    fails += check<int64_t>();
    fails += check<float>();
    fails += check<double>();
    std::printf("%d failures\n", fails);
    return fails != 0 ? 1 : 0;
}

Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp Outdated
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp Outdated
@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

bug found, the all / any implementation was incorrect, fixed and test case added (your example was strange, the fails accumulator was always zero)

@serge-sans-paille
Copy link
Copy Markdown
Contributor Author

And all green again :-)

Copy link
Copy Markdown
Contributor

@DiamonDinoia DiamonDinoia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! The code works locally now without issues. I just have some suggestions which are not crucial. One is more a question to make sure things are okay.

Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp
Comment thread include/xsimd/config/xsimd_cpu_features_x86.hpp Outdated
Comment thread include/xsimd/arch/xsimd_avx512vl_256.hpp
In addition to missing instructions (e.g. bas on int64_t etc) this
mostly changes the mask representation from vector register to scalar,
thus the big diff.
-DXSIMD_DEFAULT_ARCH is not a cmake option but a preprocessor option
@DiamonDinoia
Copy link
Copy Markdown
Contributor

Great work! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants