Re: [PATCH v3 2/2] vfs: avoid duplicating creds in faccessat if possible

From: Mateusz Guzik
Date: Sat Mar 04 2023 - 14:01:58 EST


On 3/3/23, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>>
>> I think there is a systemic problem which comes with the kzalloc API
>
> Well, it's not necessarily the API that is bad, but the implementation.
>
> We could easily make kzalloc() with a constant size just expand to
> kmalloc+memset, and get the behavior you want.
>
> We already do magical things for "find the right slab bucket" part of
> kmalloc too for constant sizes. It's changed over the years, but that
> policy goes back a long long time. See
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>
> from the BK history tree.
>
> Exactly because some things are worth optimizing for when the size is
> known at compile time.
>
> Maybe just extending kzalloc() similarly? Trivial and entirely untested
> patch:
>
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
> */
> static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
> {
> + if (__builtin_constant_p(size)) {
> + void *ret = kmalloc(size, flags);
> + if (ret)
> + memset(ret, 0, size);
> + return ret;
> + }
> return kmalloc(size, flags | __GFP_ZERO);
> }
>

So I played with this and have a rather nasty summary. Bullet points:
1. patched kzalloc does not reduce memsets calls during kernel build
2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop
it significantly (36150671 -> 14414454)
3. ... inline memset generated by gcc sucks by resorting to rep stosq
around 48 bytes
4. memsets not sorted out have sizes not known at compilation time and
are not necessarily perf bugs on their own [read: would benefit from
faster memset]

Onto the the meat:

I patched the kernel with a slightly tweaked version of the above:
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..7abb5490690f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
kmem_cache *k, gfp_t flags)
*/
static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
{
+ if (__builtin_constant_p(size)) {
+ void *ret = kmalloc(size, flags);
+ if (likely(ret))
+ memset(ret, 0, size);
+ return ret;
+ }
return kmalloc(size, flags | __GFP_ZERO);
}

and verified it indeed zeroes inline:

void kztest(void)
{
void *ptr;

ptr = kzalloc(32, GFP_KERNEL);
if (unlikely(!ptr))
return;
memsettest_rec(ptr);
}

$ objdump --disassemble=kztest vmlinux
[snip]
call ffffffff8135e130 <kmalloc_trace>
test %rax,%rax
je ffffffff81447d5f <kztest+0x4f>
movq $0x0,(%rax)
mov %rax,%rdi
movq $0x0,0x8(%rax)
movq $0x0,0x10(%rax)
movq $0x0,0x18(%rax)
call ffffffff81454060 <memsettest_rec>
[snip]

This did *NOT* lead to reduction of memset calls when building the kernel.

I verified few cases by hand, it is all either kmem_cache_zalloc or
explicitly added memsets with sizes not known at compilation time.

Two most frequent callers:
@[
memset+5
__alloc_file+40
alloc_empty_file+73
path_openat+77
do_filp_open+182
do_sys_openat2+159
__x64_sys_openat+89
do_syscall_64+93
entry_SYSCALL_64_after_hwframe+114
]: 11028994
@[
memset+5
security_file_alloc+45
__alloc_file+89
alloc_empty_file+73
path_openat+77
do_filp_open+182
do_sys_openat2+159
__x64_sys_openat+89
do_syscall_64+93
entry_SYSCALL_64_after_hwframe+114
]: 11028994

My wip addition is:

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..12b5b02ef3d3 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct
kmem_cache *k, gfp_t flags)
return kmem_cache_alloc(k, flags | __GFP_ZERO);
}

+#define kmem_cache_zalloc_ptr(k, f, retp) ({ \
+ __typeof(retp) _retp = kmem_cache_alloc(k, f); \
+ bool _rv = false; \
+ retp = _retp; \
+ if (likely(_retp)) { \
+ memset(_retp, 0, sizeof(*_retp)); \
+ _rv = true; \
+ } \
+ _rv; \
+})
+
diff --git a/security/security.c b/security/security.c
index cf6cc576736f..0f769ede0e54 100644
--- a/security/security.c
+++ b/security/security.c
@@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file)
return 0;
}

- file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL);
- if (file->f_security == NULL)
+ if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL,
file->f_security))
return -ENOMEM;
return 0;
}
diff --git a/fs/file_table.c b/fs/file_table.c
index 372653b92617..8e0dabf9530e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const
struct cred *cred)
struct file *f;
int error;

- f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
- if (unlikely(!f))
+ if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f))
return ERR_PTR(-ENOMEM);

f->f_cred = get_cred(cred);

As mentioned above it cuts total calls in more than half.

The problem is it is it rolls with rep stosq way too easily, partially
defeating the point of inlining anything. clang does not have this
problem.

Take a look at __alloc_file:
[snip]
mov 0x19cab05(%rip),%rdi # ffffffff82df4318 <filp_cachep>
call ffffffff813dd610 <kmem_cache_alloc>
test %rax,%rax
je ffffffff814298b7 <__alloc_file+0xc7>
mov %rax,%r12
mov $0x1d,%ecx
xor %eax,%eax
mov %r12,%rdi
rep stos %rax,%es:(%rdi)
[/snip]

Here is a sample consumer which can't help but have a variable size --
select, used by gmake:
@[
memset+5
do_pselect.constprop.0+202
__x64_sys_pselect6+101
do_syscall_64+93
entry_SYSCALL_64_after_hwframe+114
]: 13160

In conclusion:
1. fixing up memsets is worthwhile regardless of what happens to its
current consumers -- not all of them are necessarily doing something
wrong
2. inlining memset can be a pessimization vs non-plain-erms memset as
evidenced above. will have to figure out how to convince gcc to be
less eager to use it.

Sometimes I hate computers.

--
Mateusz Guzik <mjguzik gmail.com>