Consider the following float loop (a, b, c are std::vector<float>), compiled with -O3 -mavx2 -mfma:
for (auto i = 0; i < a.size(); ++i) {
    a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and understands the difference between vblendps and vandps, for the best performance possible.
.LBB0_7:
vcmpltps ymm2, ymm1, ymm0
vmulps ymm0, ymm0, ymm1
vandps ymm0, ymm2, ymm0
GCC, however, is much worse. For some reason it doesn’t get better than SSE 128-bit vectors (-mprefer-vector-width=256 won’t change anything).
.L6:
vcomiss xmm0, xmm1
vmulss xmm0, xmm0, xmm1
vmovss DWORD PTR [rcx+rax*4], xmm0
If I replace it with a plain array (as in the guideline), GCC does vectorize it to AVX ymm registers.
int a[256], b[256], c[256];

auto foo (int *a, int *b, int *c) {
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    }
}
However, I couldn't find a way to get this with a variable-length std::vector. What sort of hint does GCC need to vectorize a std::vector loop to AVX?
3 Answers
It's not std::vector that's the problem, it's float and GCC's usually-bad default of -ftrapping-math, which is supposed to treat FP exceptions as a visible side-effect, but doesn't always correctly do that, and misses some optimizations that would be safe.
In this case, there is a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.
GCC does that correctly in this case using scalar code: ...ss is Scalar Single, using the bottom element of 128-bit XMM registers, not vectorized at all. Your asm isn't GCC's actual output: it loads both elements with vmovss, then branches on a vcomiss result before vmulss, so the multiply doesn't happen if b[i] > c[i] isn't true. So unlike your "GCC" asm, GCC's actual asm does, I think, correctly implement -ftrapping-math.
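In C terms, that branchy scalar code corresponds roughly to the per-element logic below (a sketch for illustration, not a decompilation of GCC's output; the function name is mine):

void foo_scalar(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; ++i) {
        float result = 0.0f;
        if (b[i] > c[i])           // vcomiss + branch
            result = b[i] * c[i];  // vmulss only on the taken path, so no FP
                                   // exception when the compare is false
        a[i] = result;             // vmovss store
    }
}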
Notice that your example which does auto-vectorize uses int * args, not float*. If you change it to float* and use the same compiler options, it doesn't auto-vectorize either, even with float *__restrict a (https://godbolt.org/z/nPzsf377b).
@273K's answer shows that AVX-512 lets float auto-vectorize even with -ftrapping-math, since AVX-512 masking (ymm2{k1}{z}) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don't happen in the C++ abstract machine.
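Written by hand, that masking approach looks roughly like this with AVX-512VL intrinsics (a sketch assuming an AVX-512 target such as -march=x86-64-v4; the function name and the omitted scalar tail are mine):

#include <immintrin.h>
#include <cstddef>

void foo_avx512vl(float *a, const float *b, const float *c, std::size_t n) {
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        // Predicate 14 (_CMP_GT_OS) matches the vcmpps ..., 14 in @273K's asm.
        __mmask8 k = _mm256_cmp_ps_mask(vb, vc, _CMP_GT_OS);
        // Zero-masking multiply: masked-off lanes become 0.0f and raise no FP exceptions.
        __m256 prod = _mm256_maskz_mul_ps(k, vb, vc);
        _mm256_storeu_ps(a + i, prod);
    }
    // Scalar tail for n not a multiple of 8 omitted for brevity.
}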
gcc -O3 -mavx2 -mfma -fno-trapping-math auto-vectorizes all 3 functions (Godbolt):
void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
    for (int i = 0; i < 256; i++) {
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    }
}
foo(float*, float*, float*):
xor eax, eax
.L143:
vmovups ymm2, YMMWORD PTR [rsi+rax]
vmovups ymm3, YMMWORD PTR [rdx+rax]
vmulps ymm1, ymm2, YMMWORD PTR [rdx+rax]
vcmpltps ymm0, ymm3, ymm2
vandps ymm0, ymm0, ymm1
vmovups YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, 1024
jne .L143
vzeroupper
ret
BTW, I'd recommend -march=x86-64-v3 for an AVX2+FMA feature level. That also includes BMI1+BMI2 and other extensions. It still just uses -mtune=generic, I think, but could hopefully in future ignore tuning choices that only matter for CPUs that don't have AVX2+FMA+BMI2.
The std::vector functions are bulkier since we didn't use float *__restrict a = avec.data(); or similar to promise non-overlap of the data pointed to by the std::vector control blocks (and the size isn't known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same vmulps / vcmpltps / vandps.
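A minimal sketch of that .data() + __restrict idea (the function name is illustrative; for float this still needs -fno-trapping-math, or an AVX-512 target, before GCC will vectorize it):

#include <cstddef>
#include <vector>

void foo_vec(std::vector<float> &a, const std::vector<float> &b,
             const std::vector<float> &c) {
    // Hoist the pointers and promise non-overlap, as suggested above.
    float *__restrict pa = a.data();
    const float *__restrict pb = b.data();
    const float *__restrict pc = c.data();
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        pa[i] = (pb[i] > pc[i]) ? (pb[i] * pc[i]) : 0.0f;
}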
See also:
- -ftrapping-math is broken and "never worked" according to GCC dev Marc Glisse. But https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54192 from 2012, proposing to make it not the default, is still open.
- How to force GCC to assume that a floating-point expression is non-negative? (various FP options other than the full -ffast-math, such as -fno-math-errno, which allows many functions to inline and is not a problem for normal code which doesn't check errno after calling sqrt or whatever!)
- Semantics of Floating Point Math in GCC
- Auto vectorization on double and ffast-math (of course reductions are only vectorized with -ffast-math or #pragma omp simd reduction(+:my_sum_var), but @phuclv's answer has some good links)
GCC by default compiles for older CPU architectures.
Setting -march=native enables using 256-bit ymm registers:
.L7:
vmovups ymm1, YMMWORD PTR [rsi+rax]
vmovups ymm0, YMMWORD PTR [rdx+rax]
vcmpps k1, ymm1, ymm0, 14
vmulps ymm2{k1}{z}, ymm1, ymm0
vmovups YMMWORD PTR [rcx+rax], ymm2
Setting -march=x86-64-v4 enables using 512-bit zmm registers:
.L7:
vmovups zmm2, ZMMWORD PTR [rsi+rax]
vcmpps k1, zmm2, ZMMWORD PTR [rdx+rax], 14
vmulps zmm0{k1}{z}, zmm2, ZMMWORD PTR [rdx+rax]
vmovups ZMMWORD PTR [rcx+rax], zmm0
- Thanks. Yes, I tested with -mavx512f (both of your answers implicitly use this flag) before asking the question. It's still strange that GCC gives either SSE or AVX-512F assembly, without AVX/AVX2 as an intermediate. For example, -march=skylake or -march=x86-64-v3 won't use AVX/AVX2 despite including the latter. – Vladislav Kogan, 2 days ago
- Yes, I agree, it's strange; GCC makes one big step forward without intermediate smaller steps. – 273K, 2 days ago
- @VladislavKogan: AVX-512 masking suppresses FP exceptions from masked elements, letting GCC make vectorized asm that respects -ftrapping-math (which is on by default). That's why it can vectorize with AVX-512 but not earlier extensions if you don't turn off -ftrapping-math. BTW, -march=native allowing 256-bit vectorization only applies on CPUs with AVX-512, like Ice Lake and Zen 4. (On most CPUs the default is -mprefer-vector-width=256, but apparently -march=x86-64-v4 prefers vector-width=512.) – Peter Cordes, 2 days ago
Assuming -ftrapping-math, another option is to zero the ignored inputs before multiplying them (untested):
#include <immintrin.h>

for (size_t i = 0; i < size; i += 4) {
    __m128 x = _mm_loadu_ps(a + i);
    __m128 y = _mm_loadu_ps(b + i);
    __m128 cmp = _mm_cmplt_ps(x, y);         // all-ones where x < y, else all-zeros
    x = _mm_and_ps(x, cmp);                  // zero both inputs in the masked-off lanes
    y = _mm_and_ps(y, cmp);
    _mm_storeu_ps(a + i, _mm_mul_ps(x, y));  // 0.0f * 0.0f raises no FP exceptions
}
This of course translates to larger widths.
Both inputs must be zeroed, because +0.0 * x is -0.0 if x < 0. On some processors this will probably have the same throughput as other solutions of the same vector width. This same method will work for addition, subtraction, and square root. Division will require a divisor other than zero.
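A tiny demo of that signed-zero point (illustrative only):

#include <cstdio>

int main() {
    float x = -3.0f;
    std::printf("%g\n", 0.0f * x);     // -0: zeroing only one input isn't enough
    std::printf("%g\n", 0.0f * 0.0f);  // 0: zeroing both gives +0.0, no FP exceptions
}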
Even under -fno-trapping-math, this solution might be slightly superior to one masking after the multiplication, because it avoids penalties associated with ignored inputs that require microcoded multiplication. But I’m not sure whether the throughput can be the same as a version which zeroes after the multiplication.
- You mean asm like what that might compile to? Obviously if you're writing intrinsics by hand, you would just do it efficiently, with the compare in parallel with the multiply and one and or andn, unless you actually were going to check the FP environment afterwards. What I might do is write the source to make the multiply unconditional, like tmp = b[i]*c[i];, and then use it (a sketch of that rewrite follows these comments). This could be vectorized easily, but this is one of those missed-optimization cases where GCC -ftrapping-math sucks and we get scalar code: godbolt.org/z/zMrvrh8EG – Peter Cordes, 12 hours ago
- (And if the number of FP exceptions raised is supposed to be the same as the source, -ftrapping-math doesn't preserve that when it still vectorizes an unconditional a[i] = b[i] * c[i]. It makes no sense that that vectorizes but not the conditional selecting between prod and 0, especially when it is willing to use vcmpps with AVX-512. Good example of -ftrapping-math being inconsistent and bad.) – Peter Cordes, 12 hours ago
- Anyway, yes, if GCC did want to vectorize while respecting -ftrapping-math, it could do it this way. 0 * 0 doesn't raise any FP exceptions; I'm pretty sure not even denormal or underflow, even though 0.0 is encoded with an exponent field of all-zero. – Peter Cordes, 12 hours ago
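For reference, the source-level rewrite described in the first comment above would look roughly like this (an untested sketch; tmp and the function name are illustrative):

void foo_unconditional(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; ++i) {
        float tmp = b[i] * c[i];            // multiply every element unconditionally
        a[i] = (b[i] > c[i]) ? tmp : 0.0f;  // then select
    }
}

This is trivially vectorizable, yet per the Godbolt link in that comment, GCC under -ftrapping-math still emits scalar code for it, even though it happily vectorizes a plain unconditional a[i] = b[i] * c[i].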
- BTW, the SSE code wasn't really using 128-bit vectors as such, it's scalar code (with the ss suffix standing for "scalar, single precision"). If it was actually vectorized with SSE, the suffixes would be ps. – 2 days ago
- Also, that's not SSE, it's using the AVX encodings (vcomiss scalar compare, not legacy-SSE comiss) with 128-bit XMM registers, because you enabled AVX. – 20 hours ago