Consider the following float loop (a, b, c are std::vector<float>), compiled with -O3 -mavx2 -mfma:
for (auto i = 0; i < a.size(); ++i) {
    a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and understands the difference between vblendps and vandps, for the best performance possible.
.LBB0_7:
vcmpltps ymm2, ymm1, ymm0
vmulps ymm0, ymm0, ymm1
vandps ymm0, ymm2, ymm0
GCC, however, is much worse. For some reason it doesn’t get better than SSE 128-bit vectors (-mprefer-vector-width=256 won’t change anything).
.L6:
vcomiss xmm0, xmm1
vmulss xmm0, xmm0, xmm1
vmovss DWORD PTR [rcx+rax*4], xmm0
If I replace it with a plain array (as in the guideline), GCC does vectorize it to AVX ymm registers.
int a[256], b[256], c[256];

auto foo (int *a, int *b, int *c) {
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    }
}
However, I couldn't find a way to get this with a variable-length std::vector. What sort of hint does GCC need to vectorize a std::vector loop to AVX?
3 Answers
It's not std::vector that's the problem, it's float and GCC's usually-bad default of -ftrapping-math, which is supposed to treat FP exceptions as a visible side-effect, but doesn't always correctly do that, and misses some optimizations that would be safe.
In this case, there is a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.
GCC does that correctly in this case using scalar code: ...ss is Scalar Single, using the bottom element of 128-bit XMM registers, not vectorized at all. Your asm isn't GCC's actual output: it loads both elements with vmovss, then branches on a vcomiss result before vmulss, so the multiply doesn't happen if b[i] > c[i] isn't true. So unlike your "GCC" asm, GCC's actual asm does, I think, correctly implement -ftrapping-math.
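In C terms, that branchy scalar code corresponds roughly to the per-element logic below (a sketch for illustration, not a decompilation of GCC's output; the function name is mine):

void foo_scalar(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; ++i) {
        float result = 0.0f;
        if (b[i] > c[i])           // vcomiss + branch
            result = b[i] * c[i];  // vmulss only on the taken path, so no FP
                                   // exception when the compare is false
        a[i] = result;             // vmovss store
    }
}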
Notice that your example which does auto-vectorize uses int * args, not float*. If you change it to float* and use the same compiler options, it doesn't auto-vectorize either, even with float *__restrict a (https://godbolt.org/z/nPzsf377b).
@273K's answer shows that AVX-512 lets float auto-vectorize even with -ftrapping-math, since AVX-512 masking (ymm2{k1}{z}) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don't happen in the C++ abstract machine.
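Written by hand, that masking approach looks roughly like this with AVX-512VL intrinsics (a sketch assuming an AVX-512 target such as -march=x86-64-v4; the function name and the omitted scalar tail are mine):

#include <immintrin.h>
#include <cstddef>

void foo_avx512vl(float *a, const float *b, const float *c, std::size_t n) {
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        // Predicate 14 (_CMP_GT_OS) matches the vcmpps ..., 14 in @273K's asm.
        __mmask8 k = _mm256_cmp_ps_mask(vb, vc, _CMP_GT_OS);
        // Zero-masking multiply: masked-off lanes become 0.0f and raise no FP exceptions.
        __m256 prod = _mm256_maskz_mul_ps(k, vb, vc);
        _mm256_storeu_ps(a + i, prod);
    }
    // Scalar tail for n not a multiple of 8 omitted for brevity.
}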
gcc -O3 -mavx2 -mfma -fno-trapping-math auto-vectorizes all 3 functions (Godbolt):
void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
    for (int i = 0; i < 256; i++) {
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    }
}
foo(float*, float*, float*):
xor eax, eax
.L143:
vmovups ymm2, YMMWORD PTR [rsi+rax]
vmovups ymm3, YMMWORD PTR [rdx+rax]
vmulps ymm1, ymm2, YMMWORD PTR [rdx+rax]
vcmpltps ymm0, ymm3, ymm2
vandps ymm0, ymm0, ymm1
vmovups YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, 1024
jne .L143
vzeroupper
ret
BTW, I'd recommend -march=x86-64-v3 for an AVX2+FMA feature level. That also includes BMI1+BMI2 and other extensions. It still just uses -mtune=generic, I think, but could hopefully in future ignore tuning choices that only matter for CPUs that don't have AVX2+FMA+BMI2.
The std::vector functions are bulkier since we didn't use float *__restrict a = avec.data(); or similar to promise non-overlap of the data pointed to by the std::vector control blocks (and the size isn't known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same vmulps / vcmpltps / vandps.
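A minimal sketch of that .data() + __restrict idea (the function name is illustrative; for float this still needs -fno-trapping-math, or an AVX-512 target, before GCC will vectorize it):

#include <cstddef>
#include <vector>

void foo_vec(std::vector<float> &a, const std::vector<float> &b,
             const std::vector<float> &c) {
    // Hoist the pointers and promise non-overlap, as suggested above.
    float *__restrict pa = a.data();
    const float *__restrict pb = b.data();
    const float *__restrict pc = c.data();
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        pa[i] = (pb[i] > pc[i]) ? (pb[i] * pc[i]) : 0.0f;
}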
See also:
- -ftrapping-math is broken and "never worked" according to GCC dev Marc Glisse. But https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54192 from 2012, proposing to make it not the default, is still open.
- How to force GCC to assume that a floating-point expression is non-negative? (various FP options other than the full -ffast-math, such as -fno-math-errno, which allows many functions to inline and is not a problem for normal code which doesn't check errno after calling sqrt or whatever!)
- Semantics of Floating Point Math in GCC
- Auto vectorization on double and ffast-math (of course reductions are only vectorized with -ffast-math or #pragma omp simd reduction(+:my_sum_var), but @phuclv's answer has some good links)
GCC by default compiles for older CPU architectures.
Setting -march=native enables using 256-bit ymm registers:
.L7:
vmovups ymm1, YMMWORD PTR [rsi+rax]
vmovups ymm0, YMMWORD PTR [rdx+rax]
vcmpps k1, ymm1, ymm0, 14
vmulps ymm2{k1}{z}, ymm1, ymm0
vmovups YMMWORD PTR [rcx+rax], ymm2
Setting -march=x86-64-v4 enables using 512-bit zmm registers:
.L7:
vmovups zmm2, ZMMWORD PTR [rsi+rax]
vcmpps k1, zmm2, ZMMWORD PTR [rdx+rax], 14
vmulps zmm0{k1}{z}, zmm2, ZMMWORD PTR [rdx+rax]
vmovups ZMMWORD PTR [rcx+rax], zmm0
- Thanks. Yes, I tested with -mavx512f (both of your answers implicitly use this flag) before asking the question. It's still strange that GCC gives either SSE or AVX-512F assembly, without AVX/AVX2 as an intermediate. For example, -march=skylake or -march=x86-64-v3 won't use AVX/AVX2 despite including the latter. – Vladislav Kogan, 2 days ago
- Yes, I agree, it's strange; GCC makes one big step forward without intermediate smaller steps. – 273K, 2 days ago
- @VladislavKogan: AVX-512 masking suppresses FP exceptions from masked elements, letting GCC make vectorized asm that respects -ftrapping-math (which is on by default). That's why it can vectorize with AVX-512 but not earlier extensions if you don't turn off -ftrapping-math. BTW, -march=native allowing 256-bit vectorization only applies on CPUs with AVX-512, like Ice Lake and Zen 4. (On most CPUs the default is -mprefer-vector-width=256, but apparently -march=x86-64-v4 prefers vector-width=512.) – Peter Cordes, 2 days ago
Assuming -ftrapping-math, another option is to zero the ignored inputs before multiplying them (untested):
#include <immintrin.h>

for (size_t i = 0; i < size; i += 4) {
    __m128 x = _mm_loadu_ps(a + i);
    __m128 y = _mm_loadu_ps(b + i);
    __m128 cmp = _mm_cmplt_ps(x, y);         // all-ones where x < y, else all-zeros
    x = _mm_and_ps(x, cmp);                  // zero both inputs in the masked-off lanes
    y = _mm_and_ps(y, cmp);
    _mm_storeu_ps(a + i, _mm_mul_ps(x, y));  // 0.0f * 0.0f raises no FP exceptions
}
This of course translates to larger widths.
Both inputs must be zeroed, because +0.0 * x is -0.0 if x < 0. On some processors this will probably have the same throughput as other solutions of the same vector width. This same method will work for addition, subtraction, and square root. Division will require a divisor other than zero.
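A tiny demo of that signed-zero point (illustrative only):

#include <cstdio>

int main() {
    float x = -3.0f;
    std::printf("%g\n", 0.0f * x);     // -0: zeroing only one input isn't enough
    std::printf("%g\n", 0.0f * 0.0f);  // 0: zeroing both gives +0.0, no FP exceptions
}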
Even under -fno-trapping-math, this solution might be slightly superior to one masking after the multiplication, because it avoids penalties associated with ignored inputs that require microcoded multiplication. But I’m not sure whether the throughput can be the same as a version which zeroes after the multiplication.
- You mean asm like what that might compile to? Obviously if you're writing intrinsics by hand, you would just do it efficiently, with the compare in parallel with the multiply and one and or andn, unless you actually were going to check the FP environment afterwards. What I might do is write the source to make the multiply unconditional, like tmp = b[i]*c[i];, and then use it (a sketch of that rewrite follows these comments). This could be vectorized easily, but this is one of those missed-optimization cases where GCC -ftrapping-math sucks and we get scalar code: godbolt.org/z/zMrvrh8EG – Peter Cordes, 12 hours ago
- (And if the number of FP exceptions raised is supposed to be the same as the source, -ftrapping-math doesn't preserve that when it still vectorizes an unconditional a[i] = b[i] * c[i]. It makes no sense that that vectorizes but not the conditional selecting between prod and 0, especially when it is willing to use vcmpps with AVX-512. Good example of -ftrapping-math being inconsistent and bad.) – Peter Cordes, 12 hours ago
- Anyway, yes, if GCC did want to vectorize while respecting -ftrapping-math, it could do it this way. 0 * 0 doesn't raise any FP exceptions; I'm pretty sure not even denormal or underflow, even though 0.0 is encoded with an exponent field of all-zero. – Peter Cordes, 12 hours ago
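For reference, the source-level rewrite described in the first comment above would look roughly like this (an untested sketch; tmp and the function name are illustrative):

void foo_unconditional(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; ++i) {
        float tmp = b[i] * c[i];            // multiply every element unconditionally
        a[i] = (b[i] > c[i]) ? tmp : 0.0f;  // then select
    }
}

This is trivially vectorizable, yet per the Godbolt link in that comment, GCC under -ftrapping-math still emits scalar code for it, even though it happily vectorizes a plain unconditional a[i] = b[i] * c[i].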
- BTW, the SSE code wasn't really using 128-bit vectors as such, it's scalar code (with the ss suffix standing for "scalar, single precision"). If it was actually vectorized with SSE, the suffixes would be ps. – 2 days ago
- Also, that's not SSE, it's using the AVX encodings (vcomiss scalar compare, not legacy-SSE comiss) with 128-bit XMM registers, because you enabled AVX. – 20 hours ago