Re: [PATCH] x86/usercopy: speed up 64-bit __clear_user() with stos{b,q}

From: Borislav Petkov
Date: Sun May 23 2021 - 15:04:29 EST


On Sun, May 23, 2021 at 07:04:23PM +0100, Samuel Neves wrote:
> The current 64-bit implementation of __clear_user consists of a simple loop
> writing an 8-byte register per iteration. On typical x86_64 chips, this will
> result in a rate of ~8 bytes per cycle.
>
> On those same typical chips, much better is often possible, ranging from 16
> to 32 to 64 bytes per cycle. Here we want to avoid bringing vector
> instructions for this, but we can still achieve something close to those fill
> rates using `rep stos{b,q}`. This is actually how it is already done in
> usercopy_32.c.
>
> This patch does precisely this. But because `rep stosb` can be slower for
> short fills, I've retained the old loop for sizes below 256 bytes.

Oh yes, you wanna retain the old code for old machines.

But instead of adding more unreadable asm, you can test the size and if
it is > 256 or whatever we decide is the magic value, call a separate
function which contains the ERMS alternative. Similar to how those
different functions are done in arch/x86/lib/copy_user_64.S.
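
Roughly the shape I mean, as a plain userspace C sketch and not kernel
code. All names here (clear_small, clear_bulk, clear_buf) and the 256
cutoff are made up for illustration; in the real thing the bulk function
would carry the ERMS alternative and both would need the usual fault
handling:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; the actual magic value is still to be decided. */
#define CLEAR_BULK_THRESHOLD 256

/* Stand-in for the existing loop: one 8-byte store per iteration, then the tail. */
void clear_small(unsigned char *dst, size_t size)
{
	size_t i;

	for (i = 0; i + 8 <= size; i += 8)
		memset(dst + i, 0, 8);
	for (; i < size; i++)
		dst[i] = 0;
}

/* Stand-in for the separate function that would hold the rep stos variant. */
void clear_bulk(unsigned char *dst, size_t size)
{
	memset(dst, 0, size);
}

/* Size test up front; returns 1 if the bulk path was taken, 0 otherwise. */
int clear_buf(unsigned char *dst, size_t size)
{
	assert(dst != NULL || size == 0);

	if (size > CLEAR_BULK_THRESHOLD) {
		clear_bulk(dst, size);
		return 1;
	}
	clear_small(dst, size);
	return 0;
}
```

The point is that the small path stays exactly as it is today and the
new asm lives in one place that can be read and patched on its own.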

> This is a somewhat arbitrary threshold; some documents say that `rep
> stosb` should be faster after 128 bytes, whereas glibc puts the
> threshold at 2048 bytes (but there it is competing against vector
> instructions). My measurements on various (but not an exhaustive
> variety of) machines suggest this is a reasonable threshold, but I
> could be mistaken.

Those measurements should be part of this commit message. Also, you
wanna test on the currently widely used microarchitectures.

> It should also be mentioned that the existing code contains a bug. In the loop
>
> "0: movq $0,(%[dst])\n"
> " addq $8,%[dst]\n"
> " decl %%ecx ; jnz 0b\n"
>
> The `decl %%ecx` instruction truncates the register containing `size/8` to
> 32 bits, which means that calling __clear_user on a buffer longer than 32 GiB
> would leave part of it unzeroed.

That needs to be a separate pre-patch fixing only this.
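
For the commit message of that pre-patch, the arithmetic can be spelled
out with a small model like the one below. It is only a userspace
illustration: the function names are invented, and it assumes the loop
is entered only when size/8 is non-zero and that the fix is a
full-width decrement (decq or equivalent).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of "decl %ecx ; jnz 0b": only the low 32 bits of the counter
 * take part in the decrement and in the zero test.
 */
uint64_t iterations_with_decl(uint64_t qwords)
{
	uint32_t low = (uint32_t)qwords;

	/* low == 0 wraps to 0xffffffff on the first decrement: 2^32 iterations. */
	return low ? (uint64_t)low : (UINT64_C(1) << 32);
}

/* Model of a full 64-bit decrement: the loop runs once per qword. */
uint64_t iterations_with_decq(uint64_t qwords)
{
	assert(qwords != 0);	/* assumed: the loop is skipped for a zero count */
	return qwords;
}

/* Bytes the 32-bit variant never reaches, for a buffer of `size` bytes. */
uint64_t bytes_left_unzeroed(uint64_t size)
{
	uint64_t qwords = size / 8;

	return (iterations_with_decq(qwords) - iterations_with_decl(qwords)) * 8;
}
```

With size = 32 GiB + 40 bytes the 32-bit counter starts at 5, so the
loop stops after 40 bytes and the remaining 32 GiB stay untouched.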

> This change is noticeable from userspace. That is in fact how I spotted it; in
> a hashing benchmark that read from /dev/zero, around 10-15% of the CPU time
> was spent in __clear_user. After this patch, on a Skylake CPU, these are the
> before/after figures:

I'm guessing you got those 10-15% with perf profiles?

It is a lot more persuasive when you have a before/after perf profile in
your commit message showing how __clear_user() disappears from the list
of hot functions.

> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 94402248704 bytes (94 GB, 88 GiB) copied, 6 s, 15.7 GB/s
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 446476320768 bytes (446 GB, 416 GiB) copied, 15 s, 29.8 GB/s

As said, you wanna test a couple of currently widespread microarchitectures
and also use a proper benchmark (not dd) to make sure you're not
introducing regressions.
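
A minimal sketch of what such a measurement could look like, assuming
reads from /dev/zero are what ends up in __clear_user() as in your
hashing benchmark. The helper name timed_read is invented and this is
not an existing tool; a real run would want repetitions, pinning and
perf on top:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Read `total` bytes from `path` in chunks of at most `bufsize` bytes.
 * Returns the number of bytes actually read, or -1 if the buffer or the
 * file could not be set up. The elapsed wall time goes to *seconds.
 */
int64_t timed_read(const char *path, size_t bufsize, size_t total, double *seconds)
{
	struct timespec t0, t1;
	unsigned char *buf;
	int64_t done = 0;
	int fd;

	assert(seconds != NULL);

	buf = malloc(bufsize);
	if (!buf)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(buf);
		return -1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while ((size_t)done < total) {
		size_t want = total - (size_t)done;
		ssize_t n;

		if (want > bufsize)
			want = bufsize;
		n = read(fd, buf, want);	/* each read zeroes `want` user bytes */
		if (n <= 0)
			break;
		done += n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*seconds = (double)(t1.tv_sec - t0.tv_sec) +
		   (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	close(fd);
	free(buf);
	return done;
}
```

Running it with buffer sizes on both sides of the threshold (say 64,
256, 4096 and 1 MiB), before and after the patch, on each machine, is
what shows whether the short fills regress.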

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette