Re: [RFC] Improve memset

From: Rasmus Villemoes
Date: Fri Sep 13 2019 - 04:51:52 EST


On 13/09/2019 09.22, Borislav Petkov wrote:
>
> Instead of calling memset:
>
> ffffffff8100cd8d: e8 0e 15 7a 00 callq ffffffff817ae2a0 <__memset>
>
> and having a JMP inside it depending on the feature supported, let's simply
> have the REP; STOSB directly in the code:
>
> ...
> ffffffff81000442: 4c 89 d7 mov %r10,%rdi
> ffffffff81000445: b9 00 10 00 00 mov $0x1000,%ecx
>
> <---- new memset
> ffffffff8100044a: f3 aa rep stos %al,%es:(%rdi)
> ffffffff8100044c: 90 nop
> ffffffff8100044d: 90 nop
> ffffffff8100044e: 90 nop
> <----
>

> The result is this:
>
> static __always_inline void *memset(void *dest, int c, size_t n)
> {
> void *ret, *dummy;

How is this going to affect cases like memset(p, 0, 4/8/16); where gcc
would normally just do one or two word stores? Is rep; stosb still
competitive in that case? It seems that gcc doesn't recognize memset as
a builtin with this always_inline definition present [1].

> asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
> "call memset_rep", X86_FEATURE_ERMS,
> "call memset_orig", X86_FEATURE_REP_GOOD)
> : "=&D" (ret), "=a" (dummy)
> : "0" (dest), "a" (c), "c" (n)
> /* clobbers used by memset_orig() and memset_rep_good() */
> : "rsi", "rdx", "r8", "r9", "memory");
>
> return dest;
> }
>

Also, am I reading this

> #include <asm/export.h>
>
> -.weak memset
> -
> /*
> */
> -ENTRY(memset)
> -ENTRY(__memset)

right so there's no longer an actual memset symbol in vmlinux? It seems
that despite the above always_inline definition of memset, gcc can still
emit calls to that to implement large initalizations. (The gcc docs are
actually explicit about that, even under -ffreestanding, "GCC requires
the freestanding environment provide 'memcpy', 'memmove', 'memset' and
'memcmp'.")

[1] I tried this silly stripped-down version with gcc-8:

//#include <string.h>
#include <stddef.h>

#if 1
#define always_inline __inline__ __attribute__((__always_inline__))
static always_inline void *memset(void *dest, int c, size_t n)
{
void *ret, *dummy;

asm volatile("rep; stosb"
: "=&D" (ret), "=a" (dummy)
: "0" (dest), "a" (c), "c" (n)
/* clobbers used by memset_orig() and memset_rep_good() */
: "rsi", "rdx", "r8", "r9", "memory");

return dest;
}
#endif

struct s { long x; long y; };
int h(struct s *s);
int f(void)
{
struct s s;
memset(&s, 0, sizeof(s));
return h(&s);
}

int g(void)
{
struct s s[1024] = {};
return h(s);
}

With or without the string.h include, the result was

0000000000000000 <f>:
0: 48 83 ec 18 sub $0x18,%rsp
4: 31 c0 xor %eax,%eax
6: b9 10 00 00 00 mov $0x10,%ecx
b: 49 89 e2 mov %rsp,%r10
e: 4c 89 d7 mov %r10,%rdi
11: f3 aa rep stos %al,%es:(%rdi)
13: 4c 89 d7 mov %r10,%rdi
16: e8 00 00 00 00 callq 1b <f+0x1b>
17: R_X86_64_PLT32 h-0x4
1b: 48 83 c4 18 add $0x18,%rsp
1f: c3 retq

0000000000000020 <g>:
20: 48 81 ec 08 40 00 00 sub $0x4008,%rsp
27: ba 00 40 00 00 mov $0x4000,%edx
2c: 31 f6 xor %esi,%esi
2e: 48 89 e1 mov %rsp,%rcx
31: 48 89 cf mov %rcx,%rdi
34: e8 00 00 00 00 callq 39 <g+0x19>
35: R_X86_64_PLT32 memset-0x4
39: 48 89 c7 mov %rax,%rdi
3c: e8 00 00 00 00 callq 41 <g+0x21>
3d: R_X86_64_PLT32 h-0x4
41: 48 81 c4 08 40 00 00 add $0x4008,%rsp
48: c3 retq


With the asm version #if 0'ed out, f() uses two movq (and yields a
warning if the string.h include is omitted, but it still recognizes
memset()).

Rasmus