Why gcc is so much worse at std::vector vectorization of a conditional multiply than clang?


Consider the following float loop, compiled with -O3 -mavx2 -mfma:

for (auto i = 0; i < a.size(); ++i) {
    a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
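
For context, a self-contained version of the snippet, assuming a, b and c are equally-sized std::vector<float> (the function name and wrapper are mine):

```cpp
#include <cstddef>
#include <vector>

// Self-contained version of the loop above; a, b and c are assumed to be
// equally-sized std::vector<float>.
void conditional_multiply(std::vector<float>& a,
                          const std::vector<float>& b,
                          const std::vector<float>& c) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
```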

Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and knows to use vandps rather than vblendps for the best performance possible.

.LBB0_7:
        vcmpltps        ymm2, ymm1, ymm0
        vmulps  ymm0, ymm0, ymm1
        vandps  ymm0, ymm2, ymm0

GCC, however, is much worse. For some reason it doesn’t get better than SSE 128-bit vectors (-mprefer-vector-width=256 won’t change anything).

.L6:
        vcomiss xmm0, xmm1
        vmulss  xmm0, xmm0, xmm1
        vmovss  DWORD PTR [rcx+rax*4], xmm0

If I replace it with a plain array (as in the guideline), gcc does vectorize it to AVX ymm.

int a[256], b[256], c[256];
auto foo (int *a, int *b, int *c) {
  int i;
  for (i=0; i<256; i++){
    a[i] =  (b[i] > c[i]) ? (b[i] * c[i]) : 0;
  }
}

However, I didn't find a way to do this with a variable-length std::vector. What sort of hint does gcc need to vectorize std::vector to AVX?

Source on Godbolt with gcc 13.1 and clang 14.0.0


  • BTW the SSE code wasn't really using 128-bit vectors as such, it's scalar code (with the ss suffix standing for 'scalar, single precision'). If it was actually vectorized with SSE, the suffixes would be ps. – harold, 2 days ago

  • Also, that's not SSE, it's using the AVX encodings (vcomiss scalar compare, not legacy-SSE comiss) with 128-bit XMM registers, because you enabled AVX. – Peter Cordes, 20 hours ago

3 Answers

It’s not std::vector that’s the problem, it’s float and GCC’s usually-bad default of -ftrapping-math that is supposed to treat FP exceptions as a visible side-effect, but doesn’t always correctly do that, and misses some optimizations that would be safe.

In this case, there is a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.

GCC does that correctly in this case using scalar code: ...ss is Scalar Single, using the bottom element of 128-bit XMM registers, not vectorized at all. Your asm isn't GCC's actual output: the real code loads both elements with vmovss, then branches on a vcomiss result before vmulss, so the multiply doesn't happen if b[i] > c[i] isn't true. So unlike the asm you quoted, GCC's actual asm does, I think, correctly implement -ftrapping-math.

Notice that your example which does auto-vectorize uses int * args, not float*. If you change it to float* and use the same compiler options, it doesn’t auto-vectorize either, even with float *__restrict a (https://godbolt.org/z/nPzsf377b).

@273K’s answer shows that AVX-512 lets float auto-vectorize even with -ftrapping-math, since AVX-512 masking (ymm2{k1}{z}) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don’t happen in the C++ abstract machine.


gcc -O3 -mavx2 -mfma -fno-trapping-math auto-vectorizes all 3 functions (Godbolt)

void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
  for (int i=0; i<256; i++){
    a[i] =  (b[i] > c[i]) ? (b[i] * c[i]) : 0;
  }
}
foo(float*, float*, float*):
        xor     eax, eax
.L143:
        vmovups ymm2, YMMWORD PTR [rsi+rax]
        vmovups ymm3, YMMWORD PTR [rdx+rax]
        vmulps  ymm1, ymm2, YMMWORD PTR [rdx+rax]
        vcmpltps        ymm0, ymm3, ymm2
        vandps  ymm0, ymm0, ymm1
        vmovups YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rax, 1024
        jne     .L143
        vzeroupper
        ret

BTW, I’d recommend -march=x86-64-v3 for an AVX2+FMA feature-level. That also includes BMI1+BMI2 and stuff. It still just uses -mtune=generic I think, but could hopefully in future ignore tuning things that only matter for CPUs that don’t have AVX2+FMA+BMI2.

The std::vector functions are bulkier since we didn’t use float *__restrict a = avec.data(); or similar to promise non-overlap of the data pointed-to by the std::vector control blocks (and the size isn’t known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same vmulps / vcmpltps / vandps.
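
A sketch of what that could look like (hypothetical code, not from the answer; __restrict is a GCC/Clang extension): hoisting the data pointers and size out of the std::vector objects lets the compiler assume the three arrays don't overlap and avoids reloading the control blocks inside the loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: extract .data() pointers with __restrict so the
// compiler can assume the three arrays don't alias each other.
void foo_vec(std::vector<float>& av,
             const std::vector<float>& bv,
             const std::vector<float>& cv) {
    float* __restrict a = av.data();
    const float* __restrict b = bv.data();
    const float* __restrict c = cv.data();
    const std::size_t n = av.size();
    for (std::size_t i = 0; i < n; ++i)
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
```

With -O3 -mavx2 -mfma -fno-trapping-math this removes the overlap-check prologue, though the cleanup loop for sizes that aren't a multiple of the vector width remains.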





GCC by default compiles for older CPU architectures.

Setting -march=native enables using 256-bit ymm registers.

.L7:
        vmovups ymm1, YMMWORD PTR [rsi+rax]
        vmovups ymm0, YMMWORD PTR [rdx+rax]
        vcmpps  k1, ymm1, ymm0, 14
        vmulps  ymm2{k1}{z}, ymm1, ymm0
        vmovups YMMWORD PTR [rcx+rax], ymm2

Setting -march=x86-64-v4 enables using 512-bit zmm registers.

.L7:
        vmovups zmm2, ZMMWORD PTR [rsi+rax]
        vcmpps  k1, zmm2, ZMMWORD PTR [rdx+rax], 14
        vmulps  zmm0{k1}{z}, zmm2, ZMMWORD PTR [rdx+rax]
        vmovups ZMMWORD PTR [rcx+rax], zmm0



  • Thanks. Yes, I tested with -mavx512f (both of your snippets implicitly use this flag) before asking the question. It's still strange that gcc gives either SSE or AVX-512F assembly without AVX/AVX2 as an intermediate. For example, -march=skylake or -march=x86-64-v3 won't use AVX/AVX2 despite including them. – Vladislav Kogan, 2 days ago

  • Yes, agree, it's strange: GCC makes one big step forward without intermediate smaller steps. – 273K, 2 days ago

  • @VladislavKogan: AVX-512 masking suppresses FP exceptions from masked elements, letting GCC make vectorized asm that respects -ftrapping-math (which is on by default). That's why it can vectorize with AVX-512 but not earlier extensions if you don't turn off -ftrapping-math. BTW, -march=native allowing 256-bit vectorization only applies on CPUs with AVX-512, like Ice Lake and Zen 4. (On most CPUs the default is -mprefer-vector-width=256, but apparently -march=x86-64-v4 prefers vector-width=512.) – Peter Cordes, 2 days ago


Assuming -ftrapping-math, another option is to zero the ignored inputs before multiplying them (untested):

for (size_t i = 0; i < size; i += 4) {
    __m128 x = _mm_loadu_ps(a + i);
    __m128 y = _mm_loadu_ps(b + i);
    __m128 cmp = _mm_cmplt_ps(x, y);

    // zero the lanes of both inputs where the condition is false
    x = _mm_and_ps(x, cmp);
    y = _mm_and_ps(y, cmp);

    _mm_storeu_ps(a + i, _mm_mul_ps(x, y));
}

This of course translates to larger widths.

Both inputs must be zeroed, because +0.0 * x is -0.0 if x < 0. On some processors this will probably have the same throughput as other solutions of the same vector width. This same method will work for addition, subtraction, and square root. Division will require a divisor other than zero.
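
The sign-bit caveat is easy to verify in scalar code (a small standalone check, not from the answer):

```cpp
#include <cmath>

// Returns the sign bit of 0.0f * x. Demonstrates that +0.0f * x is -0.0f
// when x is negative (IEEE 754 multiplication XORs the sign bits), which is
// why both multiply inputs must be zeroed: 0.0f * 0.0f is always +0.0f.
bool sign_of_zero_product(float x) {
    return std::signbit(0.0f * x);
}
```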

Even under -fno-trapping-math, this solution might be slightly superior to one that masks after the multiplication, because it avoids penalties (such as microcode assists) for ignored inputs that would require a microcoded multiplication. But I'm not sure whether the throughput can be the same as a version which zeroes after the multiplication.



  • You mean asm like what that might compile to? Obviously if you're writing intrinsics by hand, you would just do it efficiently with the compare in parallel with the multiply, and one and or andn instruction, unless you actually were going to check the FP environment afterwards. What I might do is write the source to make the multiply unconditional, like tmp = b[i]*c[i]; and then use it. That could be vectorized easily, but this is one of those missed-optimization cases where GCC -ftrapping-math sucks and we get scalar code: godbolt.org/z/zMrvrh8EG – Peter Cordes, 12 hours ago

  • (And if the number of FP exceptions raised is supposed to be the same as in the source, -ftrapping-math doesn't preserve that when it still vectorizes an unconditional a[i] = b[i] * c[i]. It makes no sense that that vectorizes but not the conditional selecting between the product and 0, especially when it is willing to use vcmpps with AVX-512. Good example of -ftrapping-math being inconsistent and bad.) – Peter Cordes, 12 hours ago

  • Anyway, yes, if GCC did want to vectorize while respecting -ftrapping-math, it could do it this way. 0 * 0 doesn't raise any FP exceptions; I'm pretty sure not even denormal or underflow, even though 0.0 is encoded with an exponent field of all-zero. – Peter Cordes, 12 hours ago
