Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: Hyeonggon Yoo
Date: Mon Jun 13 2022 - 13:35:19 EST
On Wed, Jun 08, 2022 at 08:22:05PM +0200, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
>
> If multiple operations race the right way, this could lead to an object
> getting lost; or, in an even more unlikely situation, it could even lead to
> an object being freed onto the wrong slab's freelist, messing up the
> `inuse` counter and eventually causing a page to be freed to the page
> allocator while it still contains slab objects.
>
> (I haven't actually tested these cases though, this is just based on
> looking at the code. Writing testcases for this stuff seems like it'd be
> a pain...)
>
> The race leading to state inconsistency is (all operations on the same CPU
> and kmem_cache):
>
> - task A: begin do_slab_free():
> - read TID
> - read pcpu freelist (==NULL)
> - check `slab == c->slab` (true)
> - [PREEMPT A->B]
> - task B: begin slab_alloc_node():
> - fastpath fails (`c->freelist` is NULL)
> - enter __slab_alloc()
> - slub_get_cpu_ptr() (disables preemption)
> - enter ___slab_alloc()
> - take local_lock_irqsave()
> - read c->freelist as NULL
> - get_freelist() returns NULL
> - write `c->slab = NULL`
> - drop local_unlock_irqrestore()
> - goto new_slab
> - slub_percpu_partial() is NULL
> - get_partial() returns NULL
> - slub_put_cpu_ptr() (enables preemption)
> - [PREEMPT B->A]
> - task A: finish do_slab_free():
> - this_cpu_cmpxchg_double() succeeds()
> - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
I can see this happening (!c->slab && c->freelist becoming true)
when I synthetically add scheduling points in the code:
diff --git a/mm/slub.c b/mm/slub.c
index b97fa5e21046..b8012fdf2607 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3001,6 +3001,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_slab;
slub_put_cpu_ptr(s->cpu_slab);
+
+ if (!in_atomic())
+ schedule();
+
slab = new_slab(s, gfpflags, node);
c = slub_get_cpu_ptr(s->cpu_slab);
@@ -3456,9 +3460,13 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
if (likely(slab == c->slab)) {
#ifndef CONFIG_PREEMPT_RT
void **freelist = READ_ONCE(c->freelist);
+ unsigned long flags;
set_freepointer(s, tail_obj, freelist);
+ if (!in_atomic())
+ schedule();
+
if (unlikely(!this_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
freelist, tid,
@@ -3467,6 +3475,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
note_cmpxchg_failure("slab_free", s, tid);
goto redo;
}
+
+ local_irq_save(flags);
+ WARN_ON(!READ_ONCE(c->slab) && READ_ONCE(c->freelist));
+ local_irq_restore(flags);
#else /* CONFIG_PREEMPT_RT */
/*
* We cannot use the lockless fastpath on PREEMPT_RT because if
> From there, the object on c->freelist will get lost if task B is allowed to
> continue from here: It will proceed to the retry_load_slab label,
> set c->slab, then jump to load_freelist, which clobbers c->freelist.
>
> But if we instead continue as follows, we get worse corruption:
>
> - task A: run __slab_free() on object from other struct slab:
> - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
> - task A: run slab_alloc_node() with NUMA node constraint:
> - fastpath fails (c->slab is NULL)
> - call __slab_alloc()
> - slub_get_cpu_ptr() (disables preemption)
> - enter ___slab_alloc()
> - c->slab is NULL: goto new_slab
> - slub_percpu_partial() is non-NULL
> - set c->slab to slub_percpu_partial(c)
> - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
> from slab-2]
> - goto redo
> - node_match() fails
> - goto deactivate_slab
> - existing c->freelist is passed into deactivate_slab()
> - inuse count of slab-1 is decremented to account for object from
> slab-2
I didn't try to reproduce this -- but I agree SLUB can be fooled
by the condition (!c->slab && c->freelist).
> At this point, the inuse count of slab-1 is 1 lower than it should be.
> This means that if we free all allocated objects in slab-1 except for one,
> SLUB will think that slab-1 is completely unused, and may free its page,
> leading to use-after-free.
>
> Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
> ---
> mm/slub.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index e5535020e0fdf..b97fa5e210469 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2936,6 +2936,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> if (!freelist) {
> c->slab = NULL;
> + c->tid = next_tid(c->tid);
> local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> stat(s, DEACTIVATE_BYPASS);
> goto new_slab;
> @@ -2968,6 +2969,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> freelist = c->freelist;
> c->slab = NULL;
> c->freelist = NULL;
> + c->tid = next_tid(c->tid);
> local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> deactivate_slab(s, slab, freelist);
>
>
> base-commit: 9886142c7a2226439c1e3f7d9b69f9c7094c3ef6
> --
> 2.36.1.476.g0c4daa206d-goog
With this patch I couldn't reproduce it.
This work is really nice. Thanks!
Tested-by: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
BTW I wonder how much this race will affect machines in the real world.
Maybe just rare and undetectable memory leak?
--
Thanks,
Hyeonggon