Re: Control Dependencies vs C Compilers

From: Willy Tarreau
Date: Wed Oct 07 2020 - 06:31:12 EST


On Wed, Oct 07, 2020 at 11:32:43AM +0200, Peter Zijlstra wrote:
> A branch that cannot be optimized away and prohibits lifting stores
> over. One possible suggestion would be allowing the volatile keyword as
> a qualifier to if.
>
> x = *foo;
> volatile if (x > 42)
> *bar = 1;
>
> This would tell the compiler that the condition is special in that it
> must emit a conditional branch instruction and that it must not lift
> stores (or sequence points) over it.

This test is interesting, because if foo and bar are of the same type,
nothing prevents them from aliasing and the compiler cannot make wild
guesses on them (i.e. they may be plain memory as well as memory-mapped
registers).

Extending it like this shows a difference between the use of volatile
and __atomic_{load,store}_n. While both are correct in that each access
is properly performed, for an unknown reason the compiler decided to
implement two distinct branches in the atomic case and to inflate the
code:

$ gcc -v
gcc version 9.3.0 (GCC)

$ cat foo-volatile.c
long foobar(long *foo, long *bar)
{
*(volatile long *)bar = 10;
if (*(volatile long *)foo <= 42)
*(volatile long *)bar = 64;
if (*(volatile long *)foo > 42)
*(volatile long *)bar = 0;
return *(volatile long *)bar;
}
$ gcc -c -O2 foo-volatile.c
$ objdump -dr foo-volatile.o
0000000000000000 <foobar>:
0: 48 c7 06 0a 00 00 00 movq $0xa,(%rsi)
7: 48 8b 07 mov (%rdi),%rax
a: 48 83 f8 2a cmp $0x2a,%rax
e: 7f 07 jg 17 <foobar+0x17>
10: 48 c7 06 40 00 00 00 movq $0x40,(%rsi)
17: 48 8b 07 mov (%rdi),%rax
1a: 48 83 f8 2a cmp $0x2a,%rax
1e: 7e 07 jle 27 <foobar+0x27>
20: 48 c7 06 00 00 00 00 movq $0x0,(%rsi)
27: 48 8b 06 mov (%rsi),%rax
2a: c3 retq


$ cat foo-atomic.c
long foobar(long *foo, long *bar)
{
__atomic_store_n(bar, 10, __ATOMIC_RELAXED);
if (__atomic_load_n(foo, __ATOMIC_RELAXED) <= 42)
__atomic_store_n(bar, 64, __ATOMIC_RELAXED);
if (__atomic_load_n(foo, __ATOMIC_RELAXED) > 42)
__atomic_store_n(bar, 0, __ATOMIC_RELAXED);
return __atomic_load_n(bar, __ATOMIC_RELAXED);
}
$ objdump -dr foo-atomic.o
0000000000000000 <foobar>:
0: 48 c7 06 0a 00 00 00 movq $0xa,(%rsi)
7: 48 8b 07 mov (%rdi),%rax
a: 48 83 f8 2a cmp $0x2a,%rax
e: 7e 10 jle 20 <foobar+0x20>
10: 48 8b 07 mov (%rdi),%rax
13: 48 83 f8 2a cmp $0x2a,%rax
17: 7f 17 jg 30 <foobar+0x30>
19: 48 8b 06 mov (%rsi),%rax
1c: c3 retq
1d: 0f 1f 00 nopl (%rax)
20: 48 c7 06 40 00 00 00 movq $0x40,(%rsi)
27: 48 8b 07 mov (%rdi),%rax
2a: 48 83 f8 2a cmp $0x2a,%rax
2e: 7e e9 jle 19 <foobar+0x19>
30: 48 c7 06 00 00 00 00 movq $0x0,(%rsi)
37: 48 8b 06 mov (%rsi),%rax
3a: c3 retq

When building at -Os both produce the same code as the volatile version
above. It *seems* to me that the volatile version always produces more
optimal code, but is it always correct ? This is just an illustration of
how tricky this can currently be and how confusing it can sometimes be
for the developer to make sure the desired code is emitted in a few
special cases. And just for this, having the compiler support more easily
predictable constructs would be a nice improvement.

Willy