Re: [RFC] Improve memset

From: Ingo Molnar
Date: Fri Sep 13 2019 - 03:35:37 EST

* Borislav Petkov <bp@xxxxxxxxx> wrote:

> Hi,
> since the merge window is closing in and y'all are at a conference, I
> thought I should take another stab at it. It being something which Ingo,
> Linus and Peter have suggested in the past at least once.
> Instead of calling memset:
> ffffffff8100cd8d:   e8 0e 15 7a 00          callq  ffffffff817ae2a0 <__memset>
> and having a JMP inside it depending on the feature supported, let's simply
> have the REP; STOSB directly in the code:
> ...
> ffffffff81000442:   4c 89 d7                mov    %r10,%rdi
> ffffffff81000445:   b9 00 10 00 00          mov    $0x1000,%ecx
>         <---- new memset
> ffffffff8100044a:   f3 aa                   rep stos %al,%es:(%rdi)
> ffffffff8100044c:   90                      nop
> ffffffff8100044d:   90                      nop
> ffffffff8100044e:   90                      nop
>         <----
> ffffffff8100044f:   4c 8d 84 24 98 00 00    lea    0x98(%rsp),%r8
> ffffffff81000456:   00
> ...
> And since the majority of x86 boxes out there is Intel, they haz
> X86_FEATURE_ERMS so they won't even need to alternative-patch those call
> sites when booting.
> In order to patch on machines which don't set X86_FEATURE_ERMS, I need
> to do a "reversed" patching of sorts, i.e., patch when the x86 feature
> flag is NOT set. See the below changes in alternative.c which basically
> add a flags field to struct alt_instr and thus control the patching
> behavior in apply_alternatives().
> The result is this:
> static __always_inline void *memset(void *dest, int c, size_t n)
> {
>         void *ret, *dummy;
>
>         asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
>                                            "call memset_rep",  X86_FEATURE_ERMS,
>                                            "call memset_orig", X86_FEATURE_REP_GOOD)
>                 : "=&D" (ret), "=a" (dummy)
>                 : "0" (dest), "a" (c), "c" (n)
>                 /* clobbers used by memset_orig() and memset_rep_good() */
>                 : "rsi", "rdx", "r8", "r9", "memory");
>
>         return dest;
> }
> and so in the !ERMS case, we patch in a call to the memset_rep() version
> which is the old variant in memset_64.S. There we need to do some reg
> shuffling because I need to map the registers from where REP; STOSB
> expects them to where the x86_64 ABI wants them. Not a big deal - a push
> and two moves and a pop at the end.
> If X86_FEATURE_REP_GOOD is not set either, we fall back to another call
> to the original unrolled memset.
> The rest of the diff is me trying to untangle memset()'s definitions
> from the early code too because we include kernel proper headers there
> and all kinds of crazy include hell ensues, but more on that later.
> Anyway, this is just a pre-alpha version to get people's thoughts and
> see whether I'm heading in the right direction or you guys might have better
> ideas.
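
For anyone who wants to poke at the inlined sequence outside the kernel,
here is a minimal user-space sketch of the REP; STOSB form quoted above. It
assumes GCC or Clang on x86-64; memset_inline() is a hypothetical name (not
a kernel function), there is no alternatives patching, and the byte loop is
only there so the sketch builds on other architectures. Unlike the quoted
version, RCX is marked read-write here because REP decrements it.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for the inlined memset quoted above. */
static inline void *memset_inline(void *dest, int c, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
        void *d = dest;

        /*
         * REP STOSB stores AL to [RDI] RCX times. Both RDI and RCX are
         * modified by the instruction, so they are read-write operands.
         * The x86-64 ABI guarantees the direction flag is clear here.
         */
        __asm__ volatile("rep stosb"
                         : "+D" (d), "+c" (n)
                         : "a" (c)
                         : "memory");
#else
        /* Portable fallback, assumed only for non-x86-64 builds. */
        unsigned char *p = dest;

        while (n--)
                *p++ = (unsigned char)c;
#endif
        return dest;
}
```

Calling memset_inline(buf, 0, sizeof(buf)) should behave like memset() for
the same arguments; comparing the disassembly of a caller against a plain
memset() call is one way to look at the per-call-site size.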

That looks exciting - I'm wondering what effect this has on code
footprint - for example on defconfig vmlinux code size, and what the
average per-call-site footprint impact is?

If the footprint effect is acceptable, then I'd expect this to improve
performance, especially in hot loops.
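
The fallback order described in the quoted mail can be modelled in plain C
as below. This is only a sketch of how I read the description (inline
REP; STOSB when ERMS is set, otherwise memset_rep when REP_GOOD is set,
otherwise memset_orig); it is not the apply_alternatives() logic from the
patch, and the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical labels for the three variants named in the quoted mail. */
enum memset_variant {
        MEMSET_INLINE_REP_STOSB,        /* default code, left unpatched */
        MEMSET_CALL_REP,                /* "call memset_rep" patched in */
        MEMSET_CALL_ORIG,               /* "call memset_orig" patched in */
};

/*
 * "Reversed" selection: the call site is patched only when a feature
 * flag is NOT set, so ERMS machines keep the code as compiled.
 */
static enum memset_variant pick_memset_variant(bool has_erms, bool has_rep_good)
{
        if (has_erms)
                return MEMSET_INLINE_REP_STOSB;
        if (has_rep_good)
                return MEMSET_CALL_REP;
        return MEMSET_CALL_ORIG;
}
```

The point of ordering it this way is that the common case needs no boot-time
patching at all; only the two fallback cases touch the call site.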