Subject: RE: [PATCH] tools/nolibc: x86: Remove `r8`, `r9` and `r10` from the clobber list

From: Ammar Faizi
Date: Tue Oct 12 2021 - 19:02:30 EST


On Wed, Oct 13, 2021 at 4:21 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Willy Tarreau
> > Sent: 12 October 2021 10:07
> >
> > On Tue, Oct 12, 2021 at 03:36:44PM +0700, Ammar Faizi wrote:
> > > I have tried to search for the documentation about this one, but I
> > > couldn't find any. Checking at `Documentation/x86/entry_64.rst`, but
> > > it doesn't tell anything relevant.
> > (...)
> >
> > OK thanks for the detailed story, thus I didn't miss any obvious
> > reference.
> >
> > > My stance comes from SO, Telegram group discussion, and reading source
> > > code. Therefore, I don't think I can attach the link to it as
> > > "authoritative information". Or can I?
> >
> > You're right, that's not exactly what we can call authoritative :-)
>
> Given the cost of a system call the code benefit from telling
> gcc that r8 to r10 are preserved is likely to be noise.
> Especially since most syscalls are made from C library stubs
> so the application calling code will assume they are trashed.
>
> There may even be a bigger gain from the syscall exit code just
> setting the registers to zero (instead of restoring them).

Setting those registers to zero in "syscall_return_via_sysret" would
require editing entry_64.S, and that would apparently break userspace,
since it is an ABI change.

>
> There are probably even bigger gains from zeroing the AVX
> registers (which, IIRC, are all caller-saved) somewhere
> between syscall entry and the process sleeping.
> (This can't be done for non-syscall kernel entry.)
>

I am copying my earlier message here to clear up the misunderstanding. We
don't intend to change the ABI, so the best we can do is optimize based
on how things currently behave.

I know for a fact that every "syscall" in the libc is wrapped with a
function call.

However, that is not the case for nolibc.h, because there the "syscall"
instruction (0f 05) can be inlined into the calling function.

All syscalls in nolibc.h are written as static functions with inline
ASM, and they are very likely to be inlined when optimization is enabled,
so there is a benefit in not having r8, r9 and r10 in the clobber list
(currently we have them).
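
To make the idea concrete, here is a rough sketch of a three-argument
syscall wrapper whose clobber list only names rcx, r11 and "memory".
This is not the actual nolibc macro; the name sketch_syscall3 is made up
for illustration, and it assumes x86-64 Linux and a GCC-compatible
compiler.

```c
#if !defined(__x86_64__) || !defined(__linux__)
#error "This sketch assumes x86-64 Linux."
#endif

/*
 * Hypothetical helper for illustration only (not taken from nolibc.h).
 * Issues a raw 3-argument syscall and returns the kernel's raw result,
 * i.e. a negative errno value on failure.
 */
static inline long sketch_syscall3(long num, long arg1, long arg2, long arg3)
{
	long ret;

	__asm__ volatile (
		"syscall"
		: "=a"(ret)                 /* rax: return value */
		: "0"(num),                 /* rax: syscall number */
		  "D"(arg1),                /* rdi: 1st argument */
		  "S"(arg2),                /* rsi: 2nd argument */
		  "d"(arg3)                 /* rdx: 3rd argument */
		: "rcx", "r11", "memory"    /* r8, r9, r10 intentionally absent */
	);
	return ret;
}
```

Because r8, r9 and r10 are not listed, the compiler is free to keep live
values in them across the inlined "syscall" instruction.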

FWIW, I created an example where this matters.

```
#include "tools/include/nolibc/nolibc.h"

#define read_abc(a, b, c) __asm__ volatile(""::"r"(a),"r"(b),"r"(c))

int main(void)
{
	int a = 0xaa;
	int b = 0xbb;
	int c = 0xcc;

	read_abc(a, b, c);
	write(1, "test\n", 5);
	read_abc(a, b, c);

	return 0;
}
```

Compile with:
gcc -Os test.c -o test -nostdlib


With r8, r9 and r10 in the clobber list, this results in:

0000000000001000 <main>:
    1000:  f3 0f 1e fa            endbr64
    1004:  41 54                  push   %r12
    1006:  41 bc cc 00 00 00      mov    $0xcc,%r12d
    100c:  55                     push   %rbp
    100d:  bd bb 00 00 00         mov    $0xbb,%ebp
    1012:  53                     push   %rbx
    1013:  bb aa 00 00 00         mov    $0xaa,%ebx
    1018:  b8 01 00 00 00         mov    $0x1,%eax
    101d:  bf 01 00 00 00         mov    $0x1,%edi
    1022:  ba 05 00 00 00         mov    $0x5,%edx
    1027:  48 8d 35 d2 0f 00 00   lea    0xfd2(%rip),%rsi
    102e:  0f 05                  syscall
    1030:  31 c0                  xor    %eax,%eax
    1032:  5b                     pop    %rbx
    1033:  5d                     pop    %rbp
    1034:  41 5c                  pop    %r12
    1036:  c3                     ret

GCC thinks that the syscall will clobber r8, r9 and r10, so it keeps 0xaa,
0xbb and 0xcc in callee-saved registers (rbx, rbp and r12) instead. Using
those registers forces main to push and pop them, which is clearly extra
memory access and extra stack usage.

But syscall does not actually clobber them, so this is a missed
optimization.
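
For anyone who wants to check this locally, below is a small sketch that
loads known values into r8, r9 and r10, issues getpid (syscall number 39
on x86-64) and reads the registers back afterwards. The function name is
made up for illustration, and this is only an experiment on whatever
kernel it runs on, not an authoritative statement of the ABI.

```c
#if !defined(__x86_64__) || !defined(__linux__)
#error "This sketch assumes x86-64 Linux."
#endif

/*
 * Hypothetical check for illustration only. Returns 1 if r8, r9 and r10
 * still hold the values written just before the syscall instruction,
 * 0 otherwise.
 */
static inline int sketch_r8_r10_preserved(void)
{
	unsigned long a, b, c;
	long ret;

	__asm__ volatile (
		"movq $0xaa, %%r8\n\t"
		"movq $0xbb, %%r9\n\t"
		"movq $0xcc, %%r10\n\t"
		"syscall\n\t"               /* getpid(), takes no arguments */
		"movq %%r8, %1\n\t"
		"movq %%r9, %2\n\t"
		"movq %%r10, %3\n\t"
		: "=a"(ret), "=&r"(a), "=&r"(b), "=&r"(c)
		: "0"(39L)                  /* rax: __NR_getpid on x86-64 */
		/*
		 * r8-r10 are listed here only because this asm block writes
		 * them itself, not because the syscall changes them.
		 */
		: "rcx", "r11", "r8", "r9", "r10", "memory"
	);
	(void)ret;

	return a == 0xaa && b == 0xbb && c == 0xcc;
}
```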

Now, without r8, r9 and r10 in the clobber list, GCC generates better code:

0000000000001000 <main>:
    1000:  f3 0f 1e fa            endbr64
    1004:  41 b8 aa 00 00 00      mov    $0xaa,%r8d
    100a:  41 b9 bb 00 00 00      mov    $0xbb,%r9d
    1010:  41 ba cc 00 00 00      mov    $0xcc,%r10d
    1016:  b8 01 00 00 00         mov    $0x1,%eax
    101b:  bf 01 00 00 00         mov    $0x1,%edi
    1020:  ba 05 00 00 00         mov    $0x5,%edx
    1025:  48 8d 35 d4 0f 00 00   lea    0xfd4(%rip),%rsi
    102c:  0f 05                  syscall
    102e:  31 c0                  xor    %eax,%eax
    1030:  c3                     ret

Does that make sense?

--
Ammar Faizi