Re: [PATCH 1/2] x86: use alternatives for clear_user()

From: Ingo Molnar
Date: Sun Jun 21 2015 - 02:52:47 EST



* Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:

> Alternatives allow picking the faster code: REP STOSQ, REP STOSB or something else.
> Default to REP STOSQ (as memset() does).
>
> Sticking a printk() into clear_user() and booting Debian showed that it is usually
> executed with 4-8 bytes of alignment and ~2000-4000 bytes of length. At this
> scale the difference should be noticeable.
>
> Also make __clear_user() more, shall we say, "modern": stop storing zeroes via a
> zeroed register, and replace INC/DEC with ADD/SUB for dependency breakage.

Yeah, so I like the direction of these changes (modulo the suggestions I make for
the patch, see further below), but note that our (new) policy with x86 assembly
patches is pretty brutal:

1)

This patch should be split into at least a series of 3 patches:

- patch #1 moves the old code into the new file

- patch #2 modernizes the assembly ops of the old MOVQ method

- patch #3 adds the alternatives patching to introduce the STOSB and STOSQ
methods

2)

Please cite before/after /usr/bin/size comparisons of the .o (and of vmlinux where
sensible) in the changelogs (you can also do objdump -d comparisons), especially for
the first and second patches: for patch #1 this will show that the move is an
invariant, and for patch #2 it will show the effects of the modernization of the MOVQ
method.
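
For example, something like this (clear_user_64.o is just my guess at the new file's
name - for the move patch you'd compare against the usercopy_64.o it moves out of):

	size arch/x86/lib/clear_user_64.o vmlinux
	objdump -d arch/x86/lib/clear_user_64.o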

3)

We require benchmark numbers for changes to such commonly used APIs: please add
clear_user() support to 'perf bench', as an extension to the existing memcpy and
memset benchmarks:

triton:~/tip> ls -l tools/perf/bench/mem-*x86*.S
-rw-rw-r-- 1 mingo mingo 413 Jun 18 22:53 tools/perf/bench/mem-memcpy-x86-64-asm.S
-rw-rw-r-- 1 mingo mingo 414 Jun 18 22:53 tools/perf/bench/mem-memset-x86-64-asm.S

You can run them with something like:

galatea:~> taskset 1 perf bench mem memset -o -i 10000 -l 1600 -r all

# Running 'mem/memset' benchmark:
Routine default (Default memset() provided by glibc)
# Copying 1600 Bytes ...

44.348694 GB/Sec (with prefault)
Routine x86-64-unrolled (unrolled memset() in arch/x86/lib/memset_64.S)
# Copying 1600 Bytes ...

24.548865 GB/Sec (with prefault)
Routine x86-64-stosq (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 1600 Bytes ...

27.645939 GB/Sec (with prefault)
Routine x86-64-stosb (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 1600 Bytes ...

47.760132 GB/Sec (with prefault)

You might want to test the specific length values that your printk profiling has
shown to be characteristic, via the -l <length> parameter, because some of these
routines are sensitive to alignment and length details.
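
For example, with a length in the middle of the ~2000-4000 byte range you observed
(3072 is just an illustrative value):

	taskset 1 perf bench mem memset -o -i 10000 -l 3072 -r all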

Note that this kind of benchmarking matters: for example, on the box where I ran the
above memset test, it clearly shows that the STOSB method is superior.

I can help out with extending 'perf bench' with clear_user measurements if you'd
like.

> +# unsigned long __clear_user(void __user *, unsigned long);
> +ENTRY(__clear_user)
> + CFI_STARTPROC

There's no CFI_* in the x86 tree anymore, please send a patch against tip:master.

> +
> + ALTERNATIVE_2 "jmp __clear_user_movq", \
> + "", X86_FEATURE_REP_GOOD, \
> + "jmp __clear_user_rep_stosb", X86_FEATURE_ERMS

Can we move this into clear_user(), and patch in CALL instructions instead of
jumps? There's no reason to do these extra jumps.
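
I.e. patch the call site directly, the way copy_user_generic() picks between its
variants via alternative_call_2(). At the patching site that would boil down to
something like this (untested sketch - the __clear_user_rep_stosq name assumes the
default method gets split out into its own function, as suggested below):

	/* untested sketch: patch CALLs instead of JMPs at the call site */
	ALTERNATIVE_2 "call __clear_user_movq", \
		"call __clear_user_rep_stosq", X86_FEATURE_REP_GOOD, \
		"call __clear_user_rep_stosb", X86_FEATURE_ERMS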

> +

So for consistency's sake I'd put a label here that names the default function
__clear_user_rep_stosq, so that we know what it is when it shows up in 'perf top'.

(With the 'CALL' patching approach this would become __clear_user_rep_stosq in a
natural fashion - so in that case the extra label is not needed.)
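
With JMP patching the placement would be something like (sketch):

	ALTERNATIVE_2 "jmp __clear_user_movq", \
		"", X86_FEATURE_REP_GOOD, \
		"jmp __clear_user_rep_stosb", X86_FEATURE_ERMS

	/* default, REP STOSQ based variant - named so profiles can attribute it: */
__clear_user_rep_stosq: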

> + ASM_STAC
> + xor %eax, %eax
> + mov %rsi, %rcx
> + and $7, %esi
> + shr $3, %rcx
> +1: rep stosq
> + mov %esi, %ecx
> +2: rep stosb
> +3:
> + mov %rcx, %rax
> + ASM_CLAC
> + ret

So I'd switch the ASM_CLAC with the MOV, because flags manipulation probably has
higher latency than a simple register move. (the same reason we do the STAC as the
first step)
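
I.e. (sketch):

3:
	ASM_CLAC			/* start the higher-latency flags op first */
	mov	%rcx, %rax
	ret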

> +ENTRY(__clear_user_movq)
> + CFI_STARTPROC
> +
> + ASM_STAC

With JMP patching: every __clear_user_* method starts with a STAC, so I'd move the STAC
into the __clear_user() function prologue, before the __clear_user_* alternatives
patching site - this might reduce flag value dependencies as well by the time it's
used.
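
For the JMP-patching case that would be something like this (untested sketch, with
ASM_STAC dropped from the individual __clear_user_* bodies):

ENTRY(__clear_user)
	ASM_STAC	/* done once, before the patching site */
	ALTERNATIVE_2 "jmp __clear_user_movq", \
		"", X86_FEATURE_REP_GOOD, \
		"jmp __clear_user_rep_stosb", X86_FEATURE_ERMS
__clear_user_rep_stosq:
	/* ... REP STOSQ default body, without an ASM_STAC of its own ... */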

With CALL patching: this needs to stay in the individual functions, like you have it
now.

> + mov %rsi, %rcx
> + and $7, %esi
> + shr $3, %rcx
> + jz 2f
> + .p2align 4

There's no need to align this small loop - the NOP only slows things down.

> +1:
> + movq $0, (%rdi)
> + add $8, %rdi
> + sub $1, %rcx
> + jnz 1b
> +2:
> + mov %esi, %ecx
> + test %ecx, %ecx
> + jz 4f
> + .p2align 4

Ditto.

> +3:
> + movb $0, (%rdi)
> + add $1, %rdi
> + sub $1, %ecx
> + jnz 3b
> +4:
> + mov %rcx, %rax
> + ASM_CLAC

I'd flip this one too.

> + ret
> +
> + .section .fixup,"ax"
> +5: lea (%rsi,%rcx,8),%rcx
> + jmp 4b
> + .previous
> +
> + _ASM_EXTABLE(1b,5b)
> + _ASM_EXTABLE(3b,4b)
> +
> + CFI_ENDPROC
> +ENDPROC(__clear_user_movq)
> +
> +ENTRY(__clear_user_rep_stosb)
> + CFI_STARTPROC
> +
> + ASM_STAC
> + xor %eax, %eax
> + mov %rsi, %rcx
> +1: rep stosb
> +2:
> + mov %rcx, %rax
> + ASM_CLAC

And I'd flip this one too.

> + ret
> +
> + _ASM_EXTABLE(1b,2b)
> +
> + CFI_ENDPROC
> +ENDPROC(__clear_user_rep_stosb)

> @@ -53,6 +19,7 @@ unsigned long clear_user(void __user *to, unsigned long n)
> return n;
> }
> EXPORT_SYMBOL(clear_user);
> +EXPORT_SYMBOL(__clear_user);

Also, please consider inlining clear_user()'s access_ok() check, so that the only
function left is __clear_user(). (If so, that should be a separate patch.)
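
I.e. something like this, presumably in asm/uaccess_64.h (sketch - it just mirrors the
current out-of-line wrapper):

static __always_inline unsigned long
clear_user(void __user *to, unsigned long n)
{
	/* sketch: only the range check is inlined, the clearing stays in __clear_user() */
	if (access_ok(VERIFY_WRITE, to, n))
		return __clear_user(to, n);
	return n;
}

(The EXPORT_SYMBOL(clear_user) in the quoted hunk above would then go away as well.)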

Thanks,

Ingo