Re: [PATCH v3 2/2] vfs: avoid duplicating creds in faccessat if possible

From: Mateusz Guzik
Date: Fri Mar 03 2023 - 12:39:19 EST


On 3/3/23, Alexander Potapenko <glider@xxxxxxxxxx> wrote:
> On Thu, Mar 2, 2023 at 9:11 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Thu, Mar 02, 2023 at 11:54:03AM -0800, Kees Cook wrote:
>> > On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote:
>> > > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
>> > > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
>> > > > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> > > > >
>> > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO
>> > > > > thing,
>> > > > > just to make it possible to say - exactly in situations like this
>> > > > > -
>> > > > > that this particular slab cache has no advantage from
>> > > > > pre-zeroing.
>> > > >
>> > > > Actually, maybe it's just as well to keep it per-allocation, and
>> > > > just
>> > > > special-case getname_flags() itself.
>> > > >
>> > > > We could replace the __getname() there with just a
>> > > >
>> > > > kmem_cache_alloc(names_cachep, GFP_KERNEL |
>> > > > __GFP_SKIP_ZERO);
>> > > >
>> > > > we're going to overwrite the beginning of the buffer with the path
>> > > > we
>> > > > copy from user space, and then we'd have to make people comfortable
>> > > > with the fact that even with zero initialization hardening on, the
>> > > > space after the filename wouldn't be initialized...
>> > >
>> > > ACK; same in getname_kernel() and sys_getcwd(), at the very least.
>> >
>> > FWIW, much earlier analysis suggested opting out these kmem caches:
>> >
>> > buffer_head
>> > names_cache
>> > mm_struct
>> > anon_vma
>> > skbuff_head_cache
>> > skbuff_fclone_cache
>>
>> I would probably add dentry_cache to it; the only subtle part is
>> ->d_iname and I'm convinced that explicit "make sure there's a NUL
>> at the very end" is enough.
>
> FWIW, a couple of years ago I was looking into implementing the
> following scheme for opt-out that I also discussed with Kees:
>
> 1. Add a ___GFP_SKIP_ZERO flag that is not intended to be used
> explicitly (e.g. disallow passing it to kmalloc(), add a checkpatch.pl
> warning). Explicitly passing an opt-out flag to allocation functions
> was considered harmful previously:
> https://lore.kernel.org/kernel-hardening/20190423083148.GF25106@xxxxxxxxxxxxxx/
>
> 2. Define new allocation API that will allow opt-out:
>
> struct page *alloc_pages_uninit(gfp_t gfp, unsigned int order, const
> char *key);
> void *kmalloc_uninit(size_t size, gfp_t flags, const char *key);
> void *kmem_cache_alloc_uninit(struct kmem_cache *, gfp_t flags,
> const char *key);
>
> , where @key is an arbitrary string that identifies a single
> allocation or a group of allocations.
>
> 3. Provide a boot-time flag that holds a comma-separated list of
> opt-out keys that actually take effect:
>
> init_on_alloc.skip="xyz-camera-driver,some_big_buffer".
>
> The rationale behind this two-step mechanism is that despite certain
> allocations may be appealing opt-out targets for performance reasons,
> some vendors may choose to be on the safe side and explicitly list the
> allocations that should not be zeroed.
>
> Several possible enhancements include:
> 1. A SLAB_NOINIT memcache flag that prohibits cache merging and
> disables initialization. Because the number of caches is relatively
> small, it might be fine to explicitly pass SLAB_NOINIT at cache
> creation sites.
> Again, if needed, we could only use this flag as a hint that needs to
> be acknowledged by a boot-time option.
> 2. No opt-out for kmalloc() - instead advise users to promote the
> anonymous allocations they want to opt-out to memory caches with
> SLAB_NOINIT.
> 3. Provide an emergency brake that completely disables
> ___GFP_SKIP_ZERO if a major security breach is discovered.
>
> Extending this idea of per-callsite opt-out we could generate unique
> keys for every allocation in the kernel (or e.g. group them together
> by the caller name) and decide at runtime if we want to opt them out.
> But I am not sure anyone would want this level of control in their distro.
>

I intended to write a longer e-mail concerning the entire
init-on-alloc situation along with some patches in not-so-distant
future, but the bare minimum I wrote previously already got numerous
people involved (unsurprisingly now that I look at it). I don't have
time to write the thing I wanted at the moment, but now that there is
traffic I think I should at least mention one other important bit.

I'm not going to argue for any particular level of granularity -- I'm
a happy camper as long as I can end up with names_cachep allocations
excluded without disabling the entire thing.

So the key is: memset is underperforming at least on x86-64 for
certain sizes and the init-on-alloc thingy makes it used significantly
more, exacerbating the problem. Fixing it is worthwhile on its own and
will reduce the impact of the option, but the need for some form of
opt-out will remain.

I don't have any numbers handy nor time to produce them, so the mail
will be a little handwave-y. I only write it in case someone decides
to bench now what would warrant exclusion with the mechanism under
discussion. Should any of the claims below be challenged, I can
respond some time late next week with hard data(tm).

Onto the issue:
Most cpus in use today have the ERMS bit, in which case the routine is:

SYM_FUNC_START_LOCAL(memset_erms)
movq %rdi,%r9
movb %sil,%al
movq %rdx,%rcx
rep stosb
movq %r9,%rax
RET
SYM_FUNC_END(memset_erms)

The problem here is that instructions with the rep prefix have a very
high startup latency. Afair this remains true even on cpus with FSRM
in case of rep *stos* (what is helped by FSRM is rep *mov*, whereas
stos remains unaffected).

Interestingly looks like the problem was recognized in general --
memcpy and copy_{to,from}_user have hand rolled smaller copies. Apart
from that __clear_user relatively recently got a treatment of that
sort but it somehow never got implemented in memset itself.

If memory serves right, the least slow way to do it is to *NOT* use
rep stos below 128 bytes of size (and this number is even higher the
better the cpu). Instead, a 32-byte loop (as in 4 times movsq) would
do it as long as there is space, along with overlapping stores to take
care of whatever < 32 bytes.

__clear_user got rep stos if FSRM is present and 64 byte non-rep
treatment, with an 8 byte loop and 1 byte loop to cover the tail. As
noted above, this leaves perf on the table. Even then, it would be an
improvement for memset if transplanted over. Maybe this was mentioned
in the discussion concerning the func, I had not read the thread -- I
only see that memset remains unpatched.

memset, even with init-on-alloc disabled, is used *a lot* with very
small sizes. For that bit I do have data collected over 'make' in the
kernel directory.

bpftrace -e 'kprobe:memset { @ = lhist(arg2, 0, 128, 8); }'

[0, 8) 9126030 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 515728 |@@ |
[16, 24) 621902 |@@ |
[24, 32) 110822 | |
[32, 40) 11003451 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[40, 48) 488 | |
[48, 56) 164 | |
[56, 64) 1493851 |@@@@@@ |
[64, 72) 214613 | |
[72, 80) 10468 | |
[80, 88) 10524 | |
[88, 96) 121 | |
[96, 104) 81591 | |
[104, 112) 1659699 |@@@@@@@ |
[112, 120) 3240 | |
[120, 128) 9058 | |
[128, ...) 11287204 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

note: i failed to figure out how to attach to memset on stock kernel,
thus the kernel was booted with the crappery below:

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 6143b1a6fa2c..141d899d5f1d 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -45,9 +45,6 @@ SYM_FUNC_START(__memset)
SYM_FUNC_END(__memset)
EXPORT_SYMBOL(__memset)

-SYM_FUNC_ALIAS(memset, __memset)
-EXPORT_SYMBOL(memset)
-
/*
* ISO C memset - set a memory block to a byte value. This function uses
* enhanced rep stosb to override the fast string function.
diff --git a/fs/open.c b/fs/open.c
index 4401a73d4032..6e11e95ad9c3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1564,3 +1564,11 @@ int stream_open(struct inode *inode, struct file *filp)
}

EXPORT_SYMBOL(stream_open);
+
+void *(memset)(void *s, int c, size_t n)
+{
+ return __memset(s, c, n);
+}
+
+EXPORT_SYMBOL(memset);

--
Mateusz Guzik <mjguzik gmail.com>