Re: [PATCH v3 2/2] vfs: avoid duplicating creds in faccessat if possible

From: Mateusz Guzik
Date: Sat Mar 04 2023 - 15:31:52 EST


On 3/4/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On 3/3/23, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>>>
>>> I think there is a systemic problem which comes with the kzalloc API
>>
>> Well, it's not necessarily the API that is bad, but the implementation.
>>
>> We could easily make kzalloc() with a constant size just expand to
>> kmalloc+memset, and get the behavior you want.
>>
>> We already do magical things for "find the right slab bucket" part of
>> kmalloc too for constant sizes. It's changed over the years, but that
>> policy goes back a long long time. See
>>
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>>
>> from the BK history tree.
>>
>> Exactly because some things are worth optimizing for when the size is
>> known at compile time.
>>
>> Maybe just extending kzalloc() similarly? Trivial and entirely untested
>> patch:
>>
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
>> kmem_cache *k, gfp_t flags)
>> */
>> static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>> {
>> + if (__builtin_constant_p(size)) {
>> + void *ret = kmalloc(size, flags);
>> + if (ret)
>> + memset(ret, 0, size);
>> + return ret;
>> + }
>> return kmalloc(size, flags | __GFP_ZERO);
>> }
>>
>
> So I played with this and have a rather nasty summary. Bullet points:
> 1. patched kzalloc does not reduce memsets calls during kernel build
> 2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop
> it significantly (36150671 -> 14414454)
> 3. ... inline memset generated by gcc sucks by resorting to rep stosq
> around 48 bytes
> 4. memsets not sorted out have sizes not known at compilation time and
> are not necessarily perf bugs on their own [read: would benefit from
> faster memset]
>
> Onto the the meat:
>
> I patched the kernel with a slightly tweaked version of the above:
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..7abb5490690f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
> */
> static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
> {
> + if (__builtin_constant_p(size)) {
> + void *ret = kmalloc(size, flags);
> + if (likely(ret))
> + memset(ret, 0, size);
> + return ret;
> + }
> return kmalloc(size, flags | __GFP_ZERO);
> }
>
> and verified it indeed zeroes inline:
>
> void kztest(void)
> {
> void *ptr;
>
> ptr = kzalloc(32, GFP_KERNEL);
> if (unlikely(!ptr))
> return;
> memsettest_rec(ptr);
> }
>
> $ objdump --disassemble=kztest vmlinux
> [snip]
> call ffffffff8135e130 <kmalloc_trace>
> test %rax,%rax
> je ffffffff81447d5f <kztest+0x4f>
> movq $0x0,(%rax)
> mov %rax,%rdi
> movq $0x0,0x8(%rax)
> movq $0x0,0x10(%rax)
> movq $0x0,0x18(%rax)
> call ffffffff81454060 <memsettest_rec>
> [snip]
>
> This did *NOT* lead to reduction of memset calls when building the kernel.
>
> I verified few cases by hand, it is all either kmem_cache_zalloc or
> explicitly added memsets with sizes not known at compilation time.
>
> Two most frequent callers:
> @[
> memset+5
> __alloc_file+40
> alloc_empty_file+73
> path_openat+77
> do_filp_open+182
> do_sys_openat2+159
> __x64_sys_openat+89
> do_syscall_64+93
> entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
> @[
> memset+5
> security_file_alloc+45
> __alloc_file+89
> alloc_empty_file+73
> path_openat+77
> do_filp_open+182
> do_sys_openat2+159
> __x64_sys_openat+89
> do_syscall_64+93
> entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
>
> My wip addition is:
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..12b5b02ef3d3 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
> return kmem_cache_alloc(k, flags | __GFP_ZERO);
> }
>
> +#define kmem_cache_zalloc_ptr(k, f, retp) ({ \
> + __typeof(retp) _retp = kmem_cache_alloc(k, f); \
> + bool _rv = false; \
> + retp = _retp; \
> + if (likely(_retp)) { \
> + memset(_retp, 0, sizeof(*_retp)); \
> + _rv = true; \
> + } \
> + _rv; \
> +})
> +
> diff --git a/security/security.c b/security/security.c
> index cf6cc576736f..0f769ede0e54 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file)
> return 0;
> }
>
> - file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL);
> - if (file->f_security == NULL)
> + if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL,
> file->f_security))
> return -ENOMEM;
> return 0;
> }

This one is actually buggy -- f_security is a void * pointer and
sizeof(*file->f_security) returns just 1. The macro does not have any
safety belts against that -- it should probably check for void * at
compilation time and get a BUG_ON for runtime mismatch. Does not
affect the idea though.

Good news: gcc provides a lot of control as to how it inlines string
ops, most notably:
-mstringop-strategy=alg
Override the internal decision heuristic for the particular
algorithm to use for inlining string
operations. The allowed values for alg are:

rep_byte
rep_4byte
rep_8byte
Expand using i386 "rep" prefix of the specified size.

byte_loop
loop
unrolled_loop
Expand into an inline loop.

libcall
Always use a library call.

I'm going to play with it and send something more presentable.

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 372653b92617..8e0dabf9530e 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const
> struct cred *cred)
> struct file *f;
> int error;
>
> - f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
> - if (unlikely(!f))
> + if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f))
> return ERR_PTR(-ENOMEM);
>
> f->f_cred = get_cred(cred);
>
> As mentioned above it cuts total calls in more than half.
>
> The problem is it is it rolls with rep stosq way too easily, partially
> defeating the point of inlining anything. clang does not have this
> problem.
>
> Take a look at __alloc_file:
> [snip]
> mov 0x19cab05(%rip),%rdi # ffffffff82df4318 <filp_cachep>
> call ffffffff813dd610 <kmem_cache_alloc>
> test %rax,%rax
> je ffffffff814298b7 <__alloc_file+0xc7>
> mov %rax,%r12
> mov $0x1d,%ecx
> xor %eax,%eax
> mov %r12,%rdi
> rep stos %rax,%es:(%rdi)
> [/snip]
>
> Here is a sample consumer which can't help but have a variable size --
> select, used by gmake:
> @[
> memset+5
> do_pselect.constprop.0+202
> __x64_sys_pselect6+101
> do_syscall_64+93
> entry_SYSCALL_64_after_hwframe+114
> ]: 13160
>
> In conclusion:
> 1. fixing up memsets is worthwhile regardless of what happens to its
> current consumers -- not all of them are necessarily doing something
> wrong
> 2. inlining memset can be a pessimization vs non-plain-erms memset as
> evidenced above. will have to figure out how to convince gcc to be
> less eager to use it.
>
> Sometimes I hate computers.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>


--
Mateusz Guzik <mjguzik gmail.com>