Possible reproduction of CSD locking issue

From: Corey Minyard
Date: Tue Jan 25 2022 - 13:28:15 EST


We have a customer who has been seeing CSD lock issues on a CentOS 7
kernel (unfortunately). I couldn't find any kernel changes that might
fix it, so I suspected it was the CSD locking issue you have been
chasing for a while.

So I backported the debug patches. Of course, once those were in
place they mostly stopped seeing the issue, and they also had trouble
with the extra CPU time the debug code consumed. But they have just
reproduced it. Here are the logs:

Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
Jan 23 23:39:43 worker0 kernel: [285737.522744] csd: CSD lock (#1) unresponsive.
Jan 23 23:39:43 worker0 kernel: [285737.522747] csd: cnt(0000000): 0000->0000 queue
Jan 23 23:39:43 worker0 kernel: [285737.522748] csd: cnt(0000001): ffff->0037 idle
Jan 23 23:39:43 worker0 kernel: [285737.522749] csd: cnt(63d8dd8): 0003->0037 ipi
Jan 23 23:39:43 worker0 kernel: [285737.522750] csd: cnt(63d8dd9): 0003->0037 ping
Jan 23 23:39:43 worker0 kernel: [285737.522750] csd: cnt(63d8dda): 0003->ffff pinged
Jan 23 23:39:43 worker0 kernel: [285737.522751] csd: cnt(63d8dea): 0035->0037 pinged
Jan 23 23:39:43 worker0 kernel: [285737.522752] csd: cnt(63d8deb): ffff->0037 gotipi
Jan 23 23:39:43 worker0 kernel: [285737.522752] csd: cnt(63d8dec): ffff->0037 handle
Jan 23 23:39:43 worker0 kernel: [285737.522753] csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
Jan 23 23:39:43 worker0 kernel: [285737.522754] csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
Jan 23 23:39:43 worker0 kernel: [285737.522754] csd: cnt(63d8e1f): 0003->0037 queue
Jan 23 23:39:43 worker0 kernel: [285737.522755] csd: cnt(63d8e20): 0003->0037 ipi
Jan 23 23:39:43 worker0 kernel: [285737.522756] csd: cnt(63d8e21): 0003->0037 ping
Jan 23 23:39:43 worker0 kernel: [285737.522756] csd: cnt(63d8e22): 0003->0037 queue
Jan 23 23:39:43 worker0 kernel: [285737.522757] csd: cnt(63d8e23): 0003->0037 noipi
Jan 23 23:39:43 worker0 kernel: [285737.522757] csd: cnt now: 63fe4cd
Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet R running task 0 277695 1 0x00080000
Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
Jan 23 23:39:43 worker0 kernel: [285737.522769] [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
Jan 23 23:39:43 worker0 kernel: [285737.522774] [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
Jan 23 23:39:43 worker0 kernel: [285737.522776] [<ffffffff843775b5>] ? schedule_user+0x45/0x50
Jan 23 23:39:43 worker0 kernel: [285737.522779] [<ffffffff8437b518>] ? retint_careful+0x16/0x34
Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G OE ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
Jan 23 23:39:43 worker0 kernel: [285737.522793] [<ffffffff843718ba>] dump_stack+0x19/0x1b
Jan 23 23:39:43 worker0 kernel: [285737.522798] [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
Jan 23 23:39:43 worker0 kernel: [285737.522800] [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
Jan 23 23:39:43 worker0 kernel: [285737.522802] [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
Jan 23 23:39:43 worker0 kernel: [285737.522804] [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
Jan 23 23:39:43 worker0 kernel: [285737.522809] [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
Jan 23 23:39:43 worker0 kernel: [285737.522811] [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
Jan 23 23:39:43 worker0 kernel: [285737.522813] [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
Jan 23 23:39:43 worker0 kernel: [285737.522816] [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
Jan 23 23:39:43 worker0 kernel: [285737.522818] [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
Jan 23 23:39:43 worker0 kernel: [285737.522821] [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
Jan 23 23:39:43 worker0 kernel: [285737.522825] [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
Jan 23 23:39:43 worker0 kernel: [285737.522828] [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
Jan 23 23:39:43 worker0 kernel: [285737.522832] [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
Jan 23 23:39:43 worker0 kernel: [285737.522835] [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
Jan 23 23:39:43 worker0 kernel: [285737.522839] [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
Jan 23 23:39:43 worker0 kernel: [285737.522841] [<ffffffff8437f385>] do_page_fault+0x35/0x90
Jan 23 23:39:43 worker0 kernel: [285737.522842] [<ffffffff8437b728>] page_fault+0x28/0x30
Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
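
For context, the debug code that produced the "Detected non-responsive
CSD lock" and "Re-sending CSD lock (#1) IPI" lines above is, very
roughly, a timed spin on the CSD lock flag that re-kicks the stuck CPU
once the 5-second deadline passes. This is only a from-memory sketch of
the shape of it, not the actual backport; the function name and the
target_cpu parameter are mine, and field naming differs between the
3.10 kernel and current mainline:

	/*
	 * Simplified sketch of a CSD-lock wait with timeout detection.
	 * The real debug code lives in kernel/smp.c; this just shows
	 * the basic spin / warn / re-send-IPI structure.
	 */
	static void csd_lock_wait_debug(struct __call_single_data *csd,
					int target_cpu)
	{
		u64 deadline = sched_clock() + 5 * NSEC_PER_SEC;
		bool resent = false;

		/* Spin until the destination CPU clears CSD_FLAG_LOCK. */
		while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK) {
			if (sched_clock() > deadline) {
				pr_alert("csd: Detected non-responsive CSD lock on CPU#%d\n",
					 target_cpu);
				if (!resent) {
					/* Re-send the IPI in case the
					 * original one was lost. */
					arch_send_call_function_single_ipi(target_cpu);
					resent = true;
				}
				deadline += 5 * NSEC_PER_SEC;
			}
			cpu_relax();
		}
	}

The point of the re-send is to help distinguish a lost IPI from a
wedged target CPU: if the lock releases right after the re-sent IPI,
as the "got unstuck" line at the end of the log suggests here, that
points toward the original IPI having been lost.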

Hopefully this is the issue you are chasing and not something else.
I've been studying the logs to work out what they mean, but I thought
you would want to see them as soon as possible.

-corey