Re: [PATCH v3 1/3] lib: find_*_bit reimplementation

From: George Spelvin
Date: Thu Feb 12 2015 - 18:46:15 EST


> That, and if the compiler was smart enough, the AND should actually be
> free (at least on x86), since it can replace a test instruction. [1]

Ah, right, x86 loads don't set the flags, so you need a TEST
instruction anyway. So the AND is free!

Of course, not all the world's an x86. But load/store architectures
generally need a test instruction as well. It's things like MIPS and
Aloha, that don't have condition codes, but only conditional branches
on register values, that will need an extra cycle.

> In any case, my code compiles to fewer bytes (though not an entire cache
> line), and I think it is somewhat clearer - I'm obviously biased on the
> latter point.

Myself, I think the code that shows that only the first word gets masked
by control flow is clearer, but since the complexity is completely
localized to this function, I'll take smaller and faster.

> [1] Unfortunately, the compiler doesn't seem to be smart enough :-( When I
> compile my code with gcc 5.0, I get
>
> 0000000000000000 <find_last_bit_rv>:
> 0: 53 push %rbx
> 1: 89 f1 mov %esi,%ecx
> 3: 48 8d 5e 3f lea 0x3f(%rsi),%rbx
> 7: f7 d9 neg %ecx
> 9: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
> 10: 48 83 e3 c0 and $0xffffffffffffffc0,%rbx
> 14: 48 d3 ea shr %cl,%rdx
> 17: eb 1a jmp 33 <find_last_bit_rv+0x33>
> 19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
>
> 20: 48 89 d1 mov %rdx,%rcx
> 23: 48 23 0c df and (%rdi,%rbx,8),%rcx
> 27: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
> 2e: 48 85 c9 test %rcx,%rcx
> 31: 75 15 jne 48 <find_last_bit_rv+0x48>
> 33: 48 83 eb 01 sub $0x1,%rbx
> 37: 48 83 fb ff cmp $0xffffffffffffffff,%rbx
> 3b: 75 e3 jne 20 <find_last_bit_rv+0x20>
>
> 3d: 48 89 f0 mov %rsi,%rax
> 40: 5b pop %rbx
> 41: c3 retq
> 42: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> 48: 48 89 cf mov %rcx,%rdi
> 4b: 48 c1 e3 06 shl $0x6,%rbx
> 4f: e8 00 00 00 00 callq 54 <find_last_bit_rv+0x54>
> 54: 48 98 cltq
> 56: 48 01 d8 add %rbx,%rax
> 59: 5b pop %rbx
> 5a: c3 retq
> 5b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

> the main loop is 20--3b. The test instruction at 2e seems to be
> redundant. The same at 37: the sub instruction already sets plenty of
> flags that could be used, so explicitly comparing %rbx to -1 seems
> redundant.

Er... I think you hand-edited that code; it's wrong. The loop assumes that
%rbx is in units of words, but the prologue sets it up in units of bits.

The mov to %rcx is also redundant, since it could be eliminated with some
minor rescheduling.

The code generation I *want* for that function is:

# addr in %rdi, size in %rsi
movl %esi, %ecx
leaq 0x3f(%rsi), %rax
negl %ecx
movq $-1, %rdx
shrq $6, %rax
shrq %cl, %rdx
jmp 2f
1:
movq $-1, %rdx
2:
subq $1, %rax
jc 3f
andq (%rdi,%rax,8), %rdx
jeq 1b

bsrq %rdx, %rdx
salq $6, %rax
addq %rdx, %rax
ret
3:
movq %rsi, %rax
retq

I wonder if the compiler can be convinced by a bit of source tweaking?

unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
{
size_t idx = DIV_ROUND_UP(size, BITS_PER_LONG);
unsigned long val = LAST_WORD_MASK(size);

while (idx--) {
val &= addr[idx];
if (val)
return idx * BITS_PER_LONG + __fls(val);
val = ~0ul;
}
return size;
}

Darn, no difference...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/