Re: [rfc/patch] mm/slub: restore/expand unfreeze_partials() local exclusion scope

From: Vlastimil Babka
Date: Tue Jul 20 2021 - 05:03:52 EST


On 7/15/21 6:34 PM, Mike Galbraith wrote:
> Greetings crickets,
>
> Methinks the problem is the hole these patches opened only for RT.
>
> static void put_cpu_partial(struct kmem_cache *s, struct page *page,
>                             int drain)
> {
> #ifdef CONFIG_SLUB_CPU_PARTIAL
>         struct page *oldpage;
>         int pages;
>         int pobjects;
>
>         slub_get_cpu_ptr(s->cpu_slab);
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That is an assertion that the stuff under it is preempt safe for RT and
> ONLY RT. My box says the RT portion of that assertion is horse pookey.
> What it does is let kmem_cache_free()/kfree() blast straight through
> ___slab_alloc() critical sections, swapping out ->partial underneath
> it. Sprinkling percpu fingerprint powder about, I saw lots of
> ___slab_alloc() preempting put_cpu_partial()... and vice versa. I don't
> think it's a coincidence that ___slab_alloc() and __unfreeze_partials()
> both explode trying to access page->freelist. You've seen the former,
> here's one of the latter.
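
For reference, the slub_get_cpu_ptr()/slub_put_cpu_ptr() helpers from that
series are roughly the following (quoting from memory, so double-check
against your tree):

#ifndef CONFIG_PREEMPT_RT
#define slub_get_cpu_ptr(var)   get_cpu_ptr(var)   /* disables preemption */
#define slub_put_cpu_ptr(var)   put_cpu_ptr(var)
#else
#define slub_get_cpu_ptr(var)           \
({                                      \
        migrate_disable();              \
        this_cpu_ptr(var);              \
})
#define slub_put_cpu_ptr(var)           \
do {                                    \
        (void)(var);                    \
        migrate_enable();               \
} while (0)
#endif

i.e. on PREEMPT_RT only migration is disabled, so another task on the same
CPU can still preempt us below that point, which is exactly the window
described above.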
>
> crash> bt -sx
> PID: 18761 TASK: ffff88812fff0000 CPU: 0 COMMAND: "hackbench"
> #0 [ffff88818f8ff980] machine_kexec+0x14f at ffffffff81051c8f
> #1 [ffff88818f8ff9c8] __crash_kexec+0xd2 at ffffffff8111ef72
> #2 [ffff88818f8ffa88] crash_kexec+0x30 at ffffffff8111fd10
> #3 [ffff88818f8ffa98] oops_end+0xd3 at ffffffff810267e3
> #4 [ffff88818f8ffab8] exc_general_protection+0x195 at ffffffff8179fdb5
> #5 [ffff88818f8ffb50] asm_exc_general_protection+0x1e at ffffffff81800a0e
> [exception RIP: __unfreeze_partials+156]

Hm, going back to this report: if that's a direct result of __slab_alloc()
preempting (interrupting) put_cpu_partial(), then I think that's something
that could happen even on !RT, as all we do in put_cpu_partial() there is
disable preemption, and the allocation could come in from an irq handler.
It would mean the whole idea of "mm, slub: detach percpu partial list in
unfreeze_partials() using this_cpu_cmpxchg()" isn't safe. I'll have to
ponder it.
But we haven't seen crashes on !RT yet. So could it be that it was still
put_cpu_partial() preempting __slab_alloc() and messing up the partial
list, but that for some reason the put_cpu_partial() side crashed this
time?
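
To spell out the !RT window I mean (a simplified sketch of the
put_cpu_partial() fast path, not a claim about the exact failing
interleaving):

        /* put_cpu_partial() on !RT: slub_get_cpu_ptr() only disables preemption */
        preempt_disable();
        oldpage = this_cpu_read(s->cpu_slab->partial);
                /* <-- hard irq here: the handler can kfree()/kmalloc(),
                 *     reach __unfreeze_partials() and detach or rewrite
                 *     the very ->partial list we just sampled */
        page->next = oldpage;
        this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page);
        preempt_enable();

The this_cpu_cmpxchg() is meant to detect such interference and retry; the
question is whether every path that looks at ->partial really tolerates
being interleaved like that.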

> RIP: ffffffff81248bac RSP: ffff88818f8ffc00 RFLAGS: 00010202
> RAX: 0000000000200002 RBX: 0000000000200002 RCX: 000000017fffffff
> RDX: 00000001ffffffff RSI: 0000000000000202 RDI: ffff888100040b80
> RBP: ffff88818f8ffca0 R8: ffff88818f9cba30 R9: 0000000000000001
> R10: ffff88818f8ffcc0 R11: 0000000000000000 R12: ffff888100043700
> R13: ffff888100040b80 R14: 00000001ffffffff R15: ffffea00046c0808
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #6 [ffff88818f8ffcb8] put_cpu_partial+0x8e at ffffffff81248ece
> #7 [ffff88818f8ffcd8] consume_skb+0x2b at ffffffff815eddeb
> #8 [ffff88818f8ffce8] unix_stream_read_generic+0x788 at ffffffff8170b238
> #9 [ffff88818f8ffdc0] unix_stream_recvmsg+0x43 at ffffffff8170b433
> #10 [ffff88818f8ffdf8] sock_read_iter+0x104 at ffffffff815dd554
> #11 [ffff88818f8ffe68] new_sync_read+0x16a at ffffffff812674fa
> #12 [ffff88818f8ffee8] vfs_read+0x1ae at ffffffff81269c8e
> #13 [ffff88818f8fff00] ksys_read+0x40 at ffffffff8126a070
> #14 [ffff88818f8fff38] do_syscall_64+0x37 at ffffffff8179f5e7
> #15 [ffff88818f8fff50] entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8180007c
> RIP: 00007fbda4438f2e RSP: 00007fff3bf9d798 RFLAGS: 00000246
> RAX: ffffffffffffffda RBX: 00007fff3bf9e7a0 RCX: 00007fbda4438f2e
> RDX: 0000000000001000 RSI: 00007fff3bf9d7a0 RDI: 0000000000000007
> RBP: 00007fff3bf9e800 R8: 00007fff3bf9e6e0 R9: 00007fbda4641010
> R10: 00007fbda4428028 R11: 0000000000000246 R12: 0000000000001000
> crash> dis __unfreeze_partials+156
> 0xffffffff81248bac <__unfreeze_partials+156>: lock cmpxchg16b 0x20(%r15)
> crash> gdb list *__unfreeze_partials+156
> 0xffffffff81248bac is in __unfreeze_partials (mm/slub.c:475).
> 470 if (!disable_irqs)
> 471 lockdep_assert_irqs_disabled();
> 472 #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
> 473 defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
> 474 if (s->flags & __CMPXCHG_DOUBLE) {
> 475 if (cmpxchg_double(&page->freelist, &page->counters,
> 476 freelist_old, counters_old,
> 477 freelist_new, counters_new))
> 478 return true;
> 479 } else
> crash> kmem ffffea00046c0808
> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> ffffea00104b3000 412cc0000 0 c 2 8000000000003000 reserved,private
> crash
>
> Regarding $subject, Lockdep thinks my hole plugging skills suck rocks
> (shrug, I've given it plentiful cause) but that's ok, I sometimes think
> the same of its bug reporting skills :)
>
> [ 2.459456] ============================================
> [ 2.459456] WARNING: possible recursive locking detected
> [ 2.459457] 5.14.0.g79e92006-tip-rt #83 Tainted: G E
> [ 2.459458] --------------------------------------------
> [ 2.459458] rcuc/7/56 is trying to acquire lock:
> [ 2.459459] ffff88840edf01c0 (&(&c->lock)->lock){+.+.}-{0:0}, at: kfree+0x280/0x670
> [ 2.459466]
> but task is already holding lock:
> [ 2.459466] ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600
> [ 2.459469]
>
> live kernel snooping....
>
> crash> ps | grep UN
> crash> bt -sx 56
> PID: 56 TASK: ffff888100c19a40 CPU: 7 COMMAND: "rcuc/7"
> #0 [ffff888100c63e40] __schedule+0x2eb at ffffffff818969fb
> #1 [ffff888100c63ec8] schedule+0x3b at ffffffff8189723b
> #2 [ffff888100c63ee0] smpboot_thread_fn+0x18c at ffffffff810a90dc
> #3 [ffff888100c63f18] kthread+0x1af at ffffffff810a27bf
> #4 [ffff888100c63f50] ret_from_fork+0x1f at ffffffff81001cbf
> crash> task_struct ffff888100c19a40 | grep __state
> __state = 1,
> crash> gdb list *__slab_free+0x4d6
> 0xffffffff812923c6 is in __slab_free (mm/slub.c:2568).
> 2563 /*
> 2564 * partial array is full. Move the existing
> 2565 * set to the per node partial list.
> 2566 */
> 2567 local_lock(&s->cpu_slab->lock);
> 2568 unfreeze_partials(s);
> 2569 local_unlock(&s->cpu_slab->lock);
> 2570 oldpage = NULL;
> 2571 pobjects = 0;
> 2572 pages = 0;
> crash> gdb list *kfree+0x280
> 0xffffffff81292770 is in kfree (mm/slub.c:3442).
> 3437 * __slab_free() as that wouldn't use the cpu freelist at all.
> 3438 */
> 3439 void **freelist;
> 3440
> 3441 local_lock(&s->cpu_slab->lock);
> 3442 c = this_cpu_ptr(s->cpu_slab);
> 3443 if (unlikely(page != c->page)) {
> 3444 local_unlock(&s->cpu_slab->lock);
> 3445 goto redo;
> 3446 }
> crash>
>
> Check, those are what's in the stacktrace below... but the allegedly
> deadlocked-for-real task is very much alive, as is the rest of the box.
>
> other info that might help us debug this:
> [ 2.459470] Possible unsafe locking scenario:
>
> [ 2.459470] CPU0
> [ 2.459470] ----
> [ 2.459471] lock(&(&c->lock)->lock);
> [ 2.459471] lock(&(&c->lock)->lock);
> [ 2.459472]
> *** DEADLOCK ***
>
> [ 2.459472] May be due to missing lock nesting notation
>
> [ 2.459472] 6 locks held by rcuc/7/56:
> [ 2.459473] #0: ffff88840edd9820 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip+0xc3/0x190
> [ 2.459479] #1: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0
> [ 2.459484] #2: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip+0xa0/0x190
> [ 2.459487] #3: ffffffff82382ac0 (rcu_callback){....}-{0:0}, at: rcu_do_batch+0x195/0x520
> [ 2.459490] #4: ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600
> [ 2.459493] #5: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0
> [ 2.459496]
> stack backtrace:
> [ 2.459497] CPU: 7 PID: 56 Comm: rcuc/7 Tainted: G E 5.14.0.g79e92006-tip-rt #83
> [ 2.459498] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> [ 2.459499] Call Trace:
> [ 2.459501] dump_stack_lvl+0x44/0x57
> [ 2.459504] __lock_acquire+0xb7f/0x1ab0
> [ 2.459508] lock_acquire+0x2a6/0x340
> [ 2.459510] ? kfree+0x280/0x670
> [ 2.459513] ? __free_slab+0xa4/0x1b0
> [ 2.459515] rt_spin_lock+0x2a/0xd0
> [ 2.459516] ? kfree+0x280/0x670 somewhere in the multiverse lies a dead rcuc/7
> [ 2.459518] kfree+0x280/0x670 <== local_lock(&s->cpu_slab->lock); #2
> [ 2.459521] __free_slab+0xa4/0x1b0 ==> kfree(page_objcgs(page))
> [ 2.459523] __unfreeze_partials+0x1d8/0x330 ==> discard_slab(s, page);
> [ 2.459526] ? _raw_spin_unlock_irqrestore+0x30/0x80
> [ 2.459530] ? __slab_free+0x4de/0x600
> [ 2.459532] __slab_free+0x4de/0x600 <== local_lock(&s->cpu_slab->lock); #1
> [ 2.459534] ? find_held_lock+0x2d/0x90
> [ 2.459536] ? kmem_cache_free+0x276/0x630
> [ 2.459539] ? rcu_do_batch+0x1c3/0x520
> [ 2.459540] ? kmem_cache_free+0x364/0x630
> [ 2.459542] kmem_cache_free+0x364/0x630
> [ 2.459544] ? rcu_do_batch+0x195/0x520
> [ 2.459546] rcu_do_batch+0x1c3/0x520
> [ 2.459547] ? rcu_do_batch+0x195/0x520
> [ 2.459550] ? rcu_cpu_kthread+0x7e/0x4b0
> [ 2.459552] ? rcu_cpu_kthread+0x57/0x4b0
> [ 2.459553] rcu_core+0x2c3/0x590
> [ 2.459555] ? rcu_cpu_kthread+0x78/0x4b0
> [ 2.459557] ? rcu_cpu_kthread+0x7e/0x4b0
> [ 2.459558] ? rcu_cpu_kthread+0x57/0x4b0
> [ 2.459560] rcu_cpu_kthread+0xc2/0x4b0
> [ 2.459562] ? smpboot_thread_fn+0x23/0x300
> [ 2.459565] smpboot_thread_fn+0x249/0x300
> [ 2.459567] ? smpboot_register_percpu_thread+0xe0/0xe0
> [ 2.459569] kthread+0x1af/0x1d0
> [ 2.459571] ? set_kthread_struct+0x40/0x40
> [ 2.459574] ret_from_fork+0x1f/0x30
>
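
Summarizing the annotated backtrace, the nesting lockdep trips over looks
like this (sketched from the trace and the gdb listings above, so take it
with a grain of salt):

kmem_cache_free()
  __slab_free()
    put_cpu_partial()                         /* inlined */
      local_lock(&s->cpu_slab->lock);         /* mm/slub.c:2567, ffff88840edf4d60 */
      unfreeze_partials() / __unfreeze_partials()
        discard_slab() -> __free_slab()
          kfree(page_objcgs(page))            /* objcg vector sits in a kmalloc-* cache */
            local_lock(&s->cpu_slab->lock);   /* mm/slub.c:3441, now a different s,
                                                 ffff88840edf01c0 */

i.e. two different per-cpu locks of the same lockdep class (different
kmem_cache), so presumably the "missing lock nesting notation" case rather
than a genuine self-deadlock, which would match the box staying alive.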