Re: [PATCH RESEND] mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release

From: Vlastimil Babka
Date: Fri Oct 11 2019 - 05:31:25 EST


I believe it's better to include To: Andrew directly to have the patch
picked up?

On 10/10/19 6:05 PM, Roman Gushchin wrote:
> Karsten reported the following panic in __free_slab() happening on a s390x
> machine:
>
> 349.361168Â Unable to handle kernel pointer dereference in virtual kernel address space
> 349.361210Â Failing address: 0000000000000000 TEID: 0000000000000483
> 349.361223Â Fault in home space mode while using kernel ASCE.
> 349.361240Â AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> 349.361340Â Oops: 0004 ilc:3 Ã#1Â PREEMPT SMP
> 349.361349Â Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT \
> nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle \
> ip6table_raw ip6table_security iptable_at nf_nat
> 349.361436Â CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> 349.361445Â Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> 349.361450Â Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> 349.361464Â R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> 349.361470Â Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> 349.361475Â 0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> 349.361481Â 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> 349.361481Â 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> 349.361486Â 000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> 349.361500Â Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10)
> 349.361500Â 00000000003cadac: c0e50046c286 brasl %r14,ca32b8
> 349.361500Â #00000000003cadb2: a7f4fe36 brc 15,3caa1e
> 349.361500Â >00000000003cadb6: e32060800024 stg %r2,128(%r6)
> 349.361500Â 00000000003cadbc: a7f4fd9e brc 15,3ca8f8
> 349.361500Â 00000000003cadc0: c0e50046790c brasl %r14,c99fd8
> 349.361500Â 00000000003cadc6: a7f4fe2c brc 15,3caa
> 349.361500Â 00000000003cadc6: a7f4fe2c brc 15,3caa1e
> 349.361500Â 00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1
> 349.361619Â Call Trace:
> 349.361627Â (Ã<00000000003cabcc>Â __free_slab+0x49c/0x6b0)
> 349.361634Â Ã<00000000001f5886>Â rcu_core+0x5a6/0x7e0
> 349.361643Â Ã<0000000000ca2dea>Â __do_softirq+0xf2/0x5c0
> 349.361652Â Ã<0000000000152644>Â irq_exit+0x104/0x130
> 349.361659Â Ã<000000000010d222>Â do_IRQ+0x9a/0xf0
> 349.361667Â Ã<0000000000ca2344>Â ext_int_handler+0x130/0x134
> 349.361674Â Ã<0000000000103648>Â enabled_wait+0x58/0x128
> 349.361681Â (Ã<0000000000103634>Â enabled_wait+0x44/0x128)
> 349.361688Â Ã<0000000000103b00>Â arch_cpu_idle+0x40/0x58
> 349.361695Â Ã<0000000000ca0544>Â default_idle_call+0x3c/0x68
> 349.361704Â Ã<000000000018eaa4>Â do_idle+0xec/0x1c0
> 349.361748Â Ã<000000000018ee0e>Â cpu_startup_entry+0x36/0x40
> 349.361756Â Ã<000000000122df34>Â arch_call_rest_init+0x5c/0x88
> 349.361761Â Ã<0000000000000000>Â 0x0
> 349.361765Â INFO: lockdep is turned off.
> 349.361769Â Last Breaking-Event-Address:
> 349.361774Â Ã<00000000003ca8f4>Â __free_slab+0x1c4/0x6b0
> 349.361781Â Kernel panic - not syncing: Fatal exception in interrupt
>
> The kernel panics on an attempt to dereference the NULL memcg pointer.
> When shutdown_cache() is called from the kmem_cache_destroy() context,
> a memcg kmem_cache might have empty slab pages in a partial list,
> which are still charged to the memory cgroup. These pages are released
> by free_partial() at the beginning of shutdown_cache(): either
> directly or by scheduling a RCU-delayed work (if the kmem_cache has
> the SLAB_TYPESAFE_BY_RCU flag). The latter case is when the reported
> panic can happen: memcg_unlink_cache() is called immediately after
> shrinking partial lists, without waiting for scheduled RCU works.
> It sets the kmem_cache->memcg_params.memcg pointer to NULL,
> and the following attempt to dereference it by __free_slab()
> from the RCU work context causes the panic.
>
> To fix the issue, let's postpone the release of the memcg pointer
> to destroy_memcg_params(). It's called from a separate work context
> by slab_caches_to_rcu_destroy_workfn(), which contains a full RCU
> barrier. This guarantees that all scheduled page release RCU works
> will complete before the memcg pointer will be zeroed.
>
> Big thanks for Karsten for the perfect report containing all necessary
> information, his help with the analysis of the problem and testing
> of the fix.
>
> Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> Reported-by: Karsten Graul <kgraul@xxxxxxxxxxxxx>
> Tested-by: Karsten Graul <kgraul@xxxxxxxxxxxxx>
> Signed-off-by: Roman Gushchin <guro@xxxxxx>
> Cc: Karsten Graul <kgraul@xxxxxxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx

Acked-by: Vlastimil Babka <vbabka@xxxxxxx>

> ---
> mm/slab_common.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 0b94a37da531..8afa188f6e20 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -178,10 +178,13 @@ static int init_memcg_params(struct kmem_cache *s,
>
> static void destroy_memcg_params(struct kmem_cache *s)
> {
> - if (is_root_cache(s))
> + if (is_root_cache(s)) {
> kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
> - else
> + } else {
> + mem_cgroup_put(s->memcg_params.memcg);
> + WRITE_ONCE(s->memcg_params.memcg, NULL);
> percpu_ref_exit(&s->memcg_params.refcnt);
> + }
> }
>
> static void free_memcg_params(struct rcu_head *rcu)
> @@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s)
> } else {
> list_del(&s->memcg_params.children_node);
> list_del(&s->memcg_params.kmem_caches_node);
> - mem_cgroup_put(s->memcg_params.memcg);
> - WRITE_ONCE(s->memcg_params.memcg, NULL);
> }
> }
> #else
>