For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))? (ctz stands for "count trailing zeros".)
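For illustration, here is a quick check of the min identity (a sketch with arbitrary example values):

fn main() {
    let (x, y) = (0b1000u64, 0b0110u64); // ctz(x) = 3, ctz(y) = 1
    // OR-ing keeps the lower of the two lowest set bits, so
    // ctz(x | y) = ctz(0b1110) = 1 = min(3, 1).
    assert_eq!((x | y).trailing_zeros(),
               x.trailing_zeros().min(y.trailing_zeros()));
}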
C++ Version (Compiler Explorer)
#include <algorithm>
#include <bit>
#include <cstdint>
int32_t test2(uint64_t x, uint64_t y) {
return std::max(std::countr_zero(x), std::countr_zero(y));
}
Rust Version (Compiler Explorer)
pub fn test2(x: u64, y: u64) -> u32 {
x.trailing_zeros().max(y.trailing_zeros())
}
4 Answers
I don’t think there’s anything better than the naive approach for the maximum. One attempt is using the identity
x + y = min(x, y) + max(x, y)
and thus
max(ctz(x), ctz(y)) = ctz(x) + ctz(y) - min(ctz(x), ctz(y))
This way, we can reduce the max function to the min function we already optimized, albeit with a few additional operations.
Here are some Rust implementations of the different approaches:
pub fn naive(x: u64, y: u64) -> u32 {
x.trailing_zeros().max(y.trailing_zeros())
}
pub fn sum_minus_min(x: u64, y: u64) -> u32 {
x.trailing_zeros() + y.trailing_zeros() - (x | y).trailing_zeros()
}
pub fn nielsen(x: u64, y: u64) -> u32 {
let x_lsb = x & x.wrapping_neg();
let y_lsb = y & y.wrapping_neg();
let xy_lsb = x_lsb | y_lsb;
let lsb = xy_lsb & xy_lsb.wrapping_neg();
let xy_max_lsb = if xy_lsb == lsb { lsb } else { xy_lsb ^ lsb };
xy_max_lsb.trailing_zeros()
}
pub fn timmermans(x: u64, y: u64) -> u32 {
let loxs = !x & x.wrapping_sub(1);
let loys = !y & y.wrapping_sub(1);
return (loxs | loys).count_ones();
}
pub fn kealey(x: u64, y: u64) -> u32 {
((x | x.wrapping_neg()) & (y | y.wrapping_neg())).trailing_zeros()
}
Results on my machine:
ctz_max/naive time: [279.09 ns 279.55 ns 280.10 ns]
ctz_max/sum_minus_min time: [738.91 ns 742.87 ns 748.61 ns]
ctz_max/nielsen time: [935.35 ns 937.63 ns 940.40 ns]
ctz_max/timmermans time: [803.39 ns 806.98 ns 810.76 ns]
ctz_max/kealey time: [295.03 ns 295.93 ns 297.03 ns]
The naive implementation beats all other implementations. The only implementation that can compete with the naive one is the approach suggested by Martin Kealey. Note that the actual factors between the implementations may be even higher than the timings indicate, due to some overhead of the test harness.
It’s clear that you only have like a couple of CPU instructions to spare to optimize the naive implementation, so I don’t think there is anything you can do. For reference, here is the assembly emitted by the Rust compiler when these implementations are compiled as standalone functions on a modern x86_64 processor:
example::naive:
tzcnt rcx, rdi
tzcnt rax, rsi
cmp ecx, eax
cmova eax, ecx
ret
example::sum_minus_min:
tzcnt rcx, rdi
tzcnt rax, rsi
add eax, ecx
or rsi, rdi
tzcnt rcx, rsi
sub eax, ecx
ret
example::nielsen:
blsi rax, rdi
blsi rcx, rsi
or rcx, rax
blsi rax, rcx
xor edx, edx
cmp rcx, rax
cmovne rdx, rcx
xor rdx, rax
tzcnt rax, rdx
ret
example::timmermans:
lea rax, [rdi - 1]
andn rax, rdi, rax
lea rcx, [rsi - 1]
andn rcx, rsi, rcx
or rcx, rax
xor eax, eax
popcnt rax, rcx
ret
example::kealey:
mov rax, rdi
neg rax
or rax, rdi
mov rcx, rsi
neg rcx
or rcx, rsi
and rcx, rax
tzcnt rax, rcx
ret
In the benchmarks I ran, the functions get inlined, the loops partially unrolled and some subexpressions pulled out of the inner loops, so the assembly looks a lot less clean than the above.
For testing, I used Criterion. Here is the additional code:
use criterion::{black_box, criterion_group, criterion_main, Criterion};
const NUMBERS: [u64; 32] = [
...
];
fn bench<F>(func: F)
where
F: Fn(u64, u64) -> u32,
{
for x in NUMBERS {
for y in NUMBERS {
black_box(func(x, y));
}
}
}
fn compare(c: &mut Criterion) {
let mut group = c.benchmark_group("ctz_max");
group.bench_function("naive", |b| b.iter(|| bench(naive)));
group.bench_function("sum_minus_min", |b| b.iter(|| bench(sum_minus_min)));
group.bench_function("nielsen", |b| b.iter(|| bench(nielsen)));
group.bench_function("timmermans", |b| b.iter(|| bench(timmermans)));
group.bench_function("kealey", |b| b.iter(|| bench(kealey)));
}
criterion_group!(benches, compare);
criterion_main!(benches);
NUMBERS was generated with this Python code, with the intention of making branch prediction for the max() function as hard as possible:
[
random.randrange(2 ** 32) * 2 ** random.randrange(32)
for dummy in range(32)
]
I’m running the benchmark using
RUSTFLAGS='-C target-cpu=native -C opt-level=3' cargo bench
on an 8th generation i7 processor (Whiskey Lake).
-
You might want to accumulate a sum of all the results and throw if it's incorrect, just to make sure that nothing important is being optimized away. Also use -O3, and anything you might need to do to enable inlining in Rust.
– Matt Timmermans 18 hours ago
-
@MattTimmermans cargo bench does optimized builds automatically. The default is using the -O option to rustc, which is equivalent to -O2 for clang. I tried with -C opt-level=3 as well, which degrades the naive implementation by 5% and improves all other versions by 5%. I used black_box() to prevent the function return values from being optimized away. If I remove black_box(), the entire code is optimized away, and all timings are exactly 0. Inlining happens automatically in optimized builds, and I verified the assembly to ensure that the functions actually got inlined.
– Sven Marnach 18 hours ago
-
Unfortunate that Rustc/LLVM picked cmova, which is 2 uops (since it needs 4 inputs including CF and the SPAZO group for ZF), instead of cmovb or cmovae, which are only 1 uop on Broadwell and later, including Skylake-family. (They only need CF.) Yeah, really hard to beat 2× tzcnt/cmp/cmov, especially on AMD CPUs or Skylake or later where tzcnt doesn't have false dependencies. Its 1/clock throughput on Intel is almost certainly fine.
– Peter Cordes 2 hours ago
-
Given the variation in timings, and LLVM's general recklessness with false dependencies (preferring not to spend uops on xor-zeroing unless it fully sees the loop containing the false dep), it might be bottlenecking on tzcnt latency not throughput in some of the tests? But no, your Whiskey Lake CPU doesn't have tzcnt false deps so that can't be it.
– Peter Cordes 2 hours ago
-
@PeterCordes The actual benchmark timings are rather noisy, and the full assembly of the functions inlined into the benchmarking loop is rather complex and hard to understand. From the machine code of the isolated functions alone, it's impossible to explain the timings I've observed, and the timings vary based on factors like whether the functions are defined in the same crate, even if they are inlined. However, one result has been consistent: Whatever I did, the naive implementation was fastest on my machine.
– Sven Marnach 2 hours ago
You can do it like this:
#include <algorithm>
#include <bit>
#include <cstdint>
int32_t maxr_zero(uint64_t x, uint64_t y) {
uint64_t loxs = ~x & (x-1); // low zeros of x
uint64_t loys = ~y & (y-1); // low zeros of y
return std::countr_zero((loxs|loys)+1);
}
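Why this works (my reading of the trick, as a sketch): ~x & (x-1) evaluates to a mask of the form 2^k - 1 with k = ctz(x), i.e. ones exactly below the lowest set bit of x. OR-ing two such masks keeps the wider one, and adding 1 carries all the way up, leaving a single bit at position max(ctz(x), ctz(y)). A Rust port for comparison with the benchmarks above (the name timmermans_ctz is mine):

pub fn timmermans_ctz(x: u64, y: u64) -> u32 {
    let loxs = !x & x.wrapping_sub(1); // 2^ctz(x) - 1: the low zeros of x as a ones-mask
    let loys = !y & y.wrapping_sub(1); // 2^ctz(y) - 1
    // The OR is the wider mask; +1 turns 2^m - 1 into 2^m, whose ctz is m.
    (loxs | loys).wrapping_add(1).trailing_zeros()
}

The timmermans variant benchmarked earlier applies count_ones() to the mask instead, which yields the same value, since popcount(2^m - 1) = m.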
-
Even something as simple as this will already use far too many CPU instructions to compete with the naive implementation. CTZ is a single, fast machine instruction on modern CPUs, so the naive implementation is really hard to beat.
– Sven Marnach 23 hours ago
-
Had a bug, fixed it.
– Matt Timmermans 23 hours ago
-
I benchmarked a Rust version of this, and it's much slower than the naive implementation.
– Sven Marnach 23 hours ago
-
Both GCC and Clang used cmov to implement the max (but GCC also goes nuts and reintroduces a redundant branch to test whether y is zero, and a redundant test/cmov pair to test if x is zero).
– harold 23 hours ago
-
@SvenMarnach That shouldn't happen. Did you forget to allow the compiler to use modern instructions? popcnt shouldn't be slower than tzcnt (or bsf) except on some oddball CPUs like AMD Piledriver, but it would work out that way without modern instructions: if countr_zero was done with bsf and popcount emulated.
– harold 21 hours ago
These are equivalent:
max(ctz(a),ctz(b))
ctz((a|-a)&(b|-b))
ctz(a)+ctz(b)-ctz(a|b)
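A quick sanity check that the three agree (a sketch using Rust's u64, where trailing_zeros() of 0 is 64, so the identities hold even for zero inputs):

fn naive(a: u64, b: u64) -> u32 {
    a.trailing_zeros().max(b.trailing_zeros())
}
fn bit_mash(a: u64, b: u64) -> u32 {
    ((a | a.wrapping_neg()) & (b | b.wrapping_neg())).trailing_zeros()
}
fn math_identity(a: u64, b: u64) -> u32 {
    a.trailing_zeros() + b.trailing_zeros() - (a | b).trailing_zeros()
}
fn main() {
    for a in [0u64, 1, 0b0101, 0b1010, 1 << 63] {
        for b in [0u64, 1, 0b0110, 0b1100] {
            assert_eq!(naive(a, b), bit_mash(a, b));
            assert_eq!(naive(a, b), math_identity(a, b));
        }
    }
}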
The math-identity ctz(a)+ctz(b)-ctz(a|b) requires 6 CPU instructions, parallelizable to 3 steps on a 3-way superscalar CPU:
- 3× ctz
- 1× bitwise-or
- 1× addition
- 1× subtraction
The bit-mashing ctz((a|-a)&(b|-b)) requires 6 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:
- 2× negation
- 2× bitwise-or
- 1× bitwise-and
- 1× ctz
The naïve max(ctz(a),ctz(b)) requires 5 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:
- 2× ctz
- 1× comparison
- 1× conditional branch
- 1× load/move (so that the "output" is always in the same register)
… but note that branch instructions can be very expensive.
If your CPU has a conditional load/move instruction, this reduces to 4 CPU instructions taking 3 super-scalar steps.
If your CPU has a max instruction (I’m not aware of any that do, but maybe there’s one out there), this reduces to 3 CPU instructions taking 2 super-scalar steps.
All that said, the opportunities for super-scalar operation depend on which instructions you’re trying to put against each other. Typically you get the most by putting different instructions in parallel, since they use different parts of the CPU (all at once). Typically there will be more "add" and "bitwise or" units than "ctz" units, so doing multiple ctz instructions may actually be the limiting factor, especially for the "math-identity" version.
If "compare and branch" is too expensive, you can make a non-branching "max" in 4 CPU instructions. Assuming A and B are positive integers:
- C = A-B
- subtract-with-borrow D from itself, i.e. D = D - D - carry (D is now either 0 if A ≥ B or -1 if A < B, regardless of whatever value it previously held)
- C &= D (C is now min(0, A-B))
- A -= C (A’ is now max(A,B))
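Here is a sketch of that branchless max in Rust, building the 0/-1 mask with an arithmetic right shift instead of the subtract-with-borrow step (the same idea on a two's-complement machine, and safe whenever A - B cannot overflow, as with ctz results):

fn branchless_max(a: i64, b: i64) -> i64 {
    let c = a - b;
    let d = c >> 63; // arithmetic shift: 0 if a >= b, -1 (all ones) if a < b
    a - (c & d)      // a - min(0, a - b) = max(a, b)
}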
-
I like the second option. It is the simplest alternative to the naive solution and I think what the OP was looking for (though theoretically the language lawyer must use ~a+1 instead of -a until C23 specifies two's complement).
– nielsen 15 hours ago
-
@nielsen -a is already OK for unsigned types (though MSVC may unreasonably complain and force you to write 0 - a instead, which is also OK). Edit: here's a reference, stackoverflow.com/q/8026694/555045
– harold 14 hours ago
-
Shouldn't it be bitwise or rather than bitwise and in the third option?
– Sven Marnach 14 hours ago
-
The second option is comparable with the naive one on Haswell and Skylake with default compile flags (i.e. no tzcnt), according to llvm-mca: godbolt.org/z/a81ceGWPc. Although llvm-mca shows the naive one costs a bit fewer instructions, that's because it cannot predict branch cost. I believe that is the farthest we can reach, so I'm gonna accept this answer. With tzcnt, maybe no code can beat the naive one.
– QuarticCat 14 hours ago
-
Note that non-branching max is usually implemented using a conditional move, e.g. cmov on x86_64.
– Sven Marnach 3 hours ago
I am not sure whether or not it is faster, but this function will take x and y and calculate the input to ctz for getting the max value:
uint64_t getMaxTzInput(uint64_t x, uint64_t y)
{
uint64_t x_lsb = x & (~x + 1); // Least significant 1 of x
uint64_t y_lsb = y & (~y + 1); // Least significant 1 of y
uint64_t xy_lsb = x_lsb | y_lsb; // Least significant 1s of x and y (could be the same)
uint64_t lsb = (xy_lsb) & (~(xy_lsb)+1); // Least significant 1 among x and y
// If the least significant 1s are different for x and y, remove the least significant 1
// to get the second least significant 1.
uint64_t xy_max_lsb = (xy_lsb == lsb) ? lsb : xy_lsb ^ lsb;
return xy_max_lsb;
}
Thus, ctz(getMaxTzInput(x,y)) should at least give the correct value with only one call of ctz.
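As a quick sanity check (a sketch reusing the Rust port nielsen from the first answer):

// x has ctz 2, y has ctz 3: the OR of the isolated LSBs is 0b1100; stripping
// its lowest set bit leaves 0b1000, whose ctz is the max, 3.
assert_eq!(nielsen(0b0100, 0b1000), 3);
// When both lowest set bits coincide, xy_lsb == lsb and it is kept as-is.
assert_eq!(nielsen(0b0100, 0b1100), 2);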
-
… and it's passing my enhanced version of Marek's unit test too, which includes the case {0, 0, 64} and also checks for UB (which my own solution failed).
– Ted Lyngmo 23 hours ago
-
But it's still much slower and much more complex than the naive implementation. (I measured with a Rust version of this code.)
– Sven Marnach 23 hours ago
-
Note that (~x + 1) is just a fancy way of writing -x.
– Sven Marnach 16 hours ago
Unit tests: godbolt.org/z/1hY4ch9sh
– yesterday
Note that specifying the processor architecture changes the code to something nicer. In that case, clang nails it and makes it branchless: godbolt.org/z/dWse6hxbY
– 23 hours ago
On ARM, ctz(x) is implemented as clz(rbit(x)). And since we have max(clz(x), clz(y)) = clz(min(x, y)), that lets us do clz(min(rbit(x), rbit(y))), which saves one clz. (And min is easy to do branchless on this architecture.) So it probably helps to know how your architecture actually does ctz.
– 8 hours ago
Any specific architectures you care about? A lot of discussion so far has involved modern x86. Can you assume BMI1 instructions? Are zeroed inputs possible, which would require care if using x86 bsf?
– 2 hours ago