Re: [PATCH v5] Randomized slab caches for kmalloc()

From: Kees Cook
Date: Mon Sep 11 2023 - 23:14:32 EST


On Mon, Sep 11, 2023 at 11:18:15PM +0200, jvoisin wrote:
> I wrote a small blogpost[1] about this series, and was told[2] that it
> would be interesting to share it on this thread, so here it is, copied
> verbatim:

Thanks for posting!

> Ruiqi Gong and Xiu Jianfeng got their
> [Randomized slab caches for
> kmalloc()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6152940584290668b35fa0800026f6a1ae05fe)
> patch series merged upstream, and I've had enough discussions about it
> to warrant summarising them into a small blogpost.
>
> The main idea is to have multiple slab caches, and pick one at random
> based on
> the address of code calling `kmalloc()` and a per-boot seed, to make
> heap-spraying harder.
> It's a great idea, but comes with some shortcomings for now:
>
> - Objects being allocated via wrappers around `kmalloc()`, like
> `sock_kmalloc`,
> `f2fs_kmalloc`, `aligned_kmalloc`, … will end up in the same slab cache.

I'd love to see some way to "unwrap" these kinds of allocators. Right
now we try to manually mark them so the debugging options can figure out
what did the allocation, but it's not complete by any means.

I'd kind of like to see a common front end that specified some set of
"do stuff" routines. e.g. to replace devm_kmalloc(), we could have:

void *alloc(size_t usable, gfp_t flags,
	    ssize_t (*prepare)(size_t, gfp_t *, void *ctx),
	    void *(*finish)(size_t, gfp_t, void *ctx, void *allocated),
	    void *ctx)

ssize_t devm_prep(size_t usable, gfp_t *flags, void *ctx)
{
	ssize_t tot_size;

	if (unlikely(check_add_overflow(sizeof(struct devres),
					usable, &tot_size)))
		return -ENOMEM;

	tot_size = kmalloc_size_roundup(tot_size);
	*flags |= __GFP_ZERO;

	return tot_size;
}

void *devm_finish(size_t usable, gfp_t flags, void *ctx, void *allocated)
{
	struct devres *dr = allocated;
	struct device *dev = ctx;

	INIT_LIST_HEAD(&dr->node.entry);
	dr->node.release = devm_kmalloc_release;

	set_node_dbginfo(&dr->node, "devm_kmalloc_release", usable);
	devres_add(dev, dr->data);
	return dr->data;
}

#define devm_kmalloc(dev, size, gfp) \
alloc(size, gfp, devm_prep, devm_finish, dev)

And now there's no wrapper any more, just a routine to get the actual
size, and a routine to set up the memory and return the "usable"
pointer.

> - The slabs need to be pinned, otherwise an attacker could
> [feng-shui](https://en.wikipedia.org/wiki/Heap_feng_shui) their way
> into having the whole slab free'ed, garbage-collected, and have a slab for
> another type allocated at the same VA. [Jann Horn](https://thejh.net/)
> and [Matteo Rizzo](https://infosec.exchange/@nspace) have a [nice
> set of
> patches](https://github.com/torvalds/linux/compare/master...thejh:linux:slub-virtual-upstream),
> discussed a bit in [this Project Zero
> blogpost](https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html),
> for a feature called [`SLAB_VIRTUAL`](
> https://github.com/torvalds/linux/commit/f3afd3a2152353be355b90f5fd4367adbf6a955e),
> implementing precisely this.

I'm hoping this will get posted to LKML soon.

> - There are 16 slabs by default, so one chance out of 16 to end up in
> the same
> slab cache as the target.

Future work can make this more deterministic.

> - There are no guard pages between caches, so inter-caches overflows are
> possible.

This may be addressed by SLAB_VIRTUAL.

> - As pointed by
> [andreyknvl](https://twitter.com/andreyknvl/status/1700267669336080678)
> and [minipli](https://infosec.exchange/@minipli/111045336853055793),
> the fewer allocations hitting a given cache means less noise,
> so it might even help with some heap feng-shui.

That may be true, but I suspect it'll be mitigated by the overall
reduction in cache sharing.

> - minipli also pointed out that "randomized caches still freely
> mix kernel allocations with user controlled ones (`xattr`, `keyctl`,
> `msg_msg`, …).
> So even though merging is disabled for these caches, i.e. no direct
> overlap
> with `cred_jar` etc., other object types can still be targeted (`struct
> pipe_buffer`, BPF maps, its verifier state objects,…). It’s just a
> matter of
> probing which allocation index the targeted object falls into.",
> but I considered this out of scope, since it's much more involved;
> albeit something like
> [`CONFIG_KMALLOC_SPLIT_VARSIZE`](https://github.com/thejh/linux/blob/slub-virtual/MITIGATION_README)
> wouldn't significantly increase complexity.

Now that we have a mechanism to easily deal with "many kmalloc buckets",
I think we can easily start carving out specific variable-sized caches
(like msg_msg). Basically doing a manual type-based separation.

So, yeah, we're in a better place than we were before, and better
positioned to continue to make improvements here. I think an easy win
would be doing this last one: separate out the user controlled
variable-sized caches and give them their own distinct buckets outside
of the 16 random ones. Can you give that a try and send patches?

-Kees

--
Kees Cook