Re: [External] Re: [PATCH 1/1] mm/vmalloc: Add preempt point in purge_vmap_node() when enabling kasan

From: Uladzislau Rezki
Date: Fri Jul 19 2024 - 07:40:33 EST

Next message: Paul Menzel: "Re: Linux logs error `Failed to query (GET_CUR) UVC control X on unit Y: -75 (exp. 1).` (75 == EOVERFLOW?)"
Previous message: Geert Uytterhoeven: "Re: [GIT PULL] Btrfs updates for 6.11"
Next in thread: Adrian Huang: "Re: [PATCH 1/1] mm/vmalloc: Add preempt point in purge_vmap_node() when enabling kasan"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello, Peter!

Could you please support here? See below:

> On Mon, Jul 08, 2024 at 01:39:57PM +0000, Adrian Huang12 wrote:
> > Hi,
> >
> > > Could you please test it:
> > >
> > > <snip>
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 03b82fb8ecd3..6dc204b8495a
> > > 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2190,10 +2190,12 @@ static void purge_vmap_node(struct work_struct
> > > *work) {
> > > struct vmap_node *vn = container_of(work,
> > > struct vmap_node, purge_work);
> > > + unsigned long resched_threshold;
> > > struct vmap_area *va, *n_va;
> > > LIST_HEAD(local_list);
> > >
> > > vn->nr_purged = 0;
> > > + resched_threshold = lazy_max_pages() << 1;
> > >
> > > list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
> > > unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; @@
> > > -2210,6 +2212,9 @@ static void purge_vmap_node(struct work_struct *work)
> > > atomic_long_sub(nr, &vmap_lazy_nr);
> > > vn->nr_purged++;
> > >
> > > + if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> > > + cond_resched();
> > > +
> > > if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> > > if (node_pool_add_va(vn, va))
> > > continue;
> > > <snip>
> >
> > This patch can fix the issue. Feel free to add my tested-by.
> > Tested-by: Adrian Huang <ahuang12@xxxxxxxxxx>
> >
> Thank you. I will add you tested-by!
>
I tried to simulate the reported splat and i can reproduce it with KASAN
enabled. I use qemu on my 64-core system, it allows me to specify 255
cores while running VM. The kernel is 6.10.0-rc5.

The kernel should be built with CONFIG_KASAN=y and CONFIG_KASAN_VMALLOC=y

The "soft lockup" can be triggered when the kernel is compiled within a
VM using 256 jobs and preemption is disabled:

echo none > /sys/kernel/debug/sched/preempt
make -C coding/linux.git/ -j256 bzImage

<snip>
watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760]
CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95
Workqueue: events drain_vmap_area_work
RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0
...
<snip>

(See the full splat in the beginning of this email-thread).

After some debugging, i figured out that a CSD lock is taken and not released
for too long. To debug this the kernel should be compiled with CONFIG_CSD_LOCK_WAIT_DEBUG.
The /sys/module/smp/parameters/csd_lock_timeout i set to 1000, i.e. 1 second.

See below the CSD-lock debug-info during running the compiling test:

<snip>
[ 163.697057] smp: csd: Detected non-responsive CSD lock (#1) on CPU#206, waiting 1000000292 ns for CPU#17 do_kernel_range_flush+0x0/0xb0(0xffff888dfd943d80).
[ 163.697165] smp: csd: CSD lock (#1) unresponsive.
[ 163.697198] Sending NMI from CPU 206 to CPUs 17:
[ 163.697214] NMI backtrace for cpu 17
[ 163.697223] CPU: 17 PID: 36681 Comm: as Kdump: loaded Tainted: G L 6.10.0-rc5-00019-g4236f0255ea8-dirty #3439
[ 163.697228] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 163.697231] RIP: 0010:native_queued_spin_lock_slowpath+0x2b9/0xae0
[ 163.697241] Code: 40 ba 01 00 00 00 ..............<cut>...............
[ 163.697244] RSP: 0018:ffffc900396afb98 EFLAGS: 00000002
[ 163.697248] RAX: 0000000000000001 RBX: ffffffffb6e0a084 RCX: ffffffffb5eff77c
[ 163.697250] RDX: fffffbfff6dc1411 RSI: 0000000000000004 RDI: ffffffffb6e0a084
[ 163.697252] RBP: 0000000000000001 R08: 0000000000000000 R09: fffffbfff6dc1410
[ 163.697254] R10: ffffffffb6e0a087 R11: ffffffffb80f8112 R12: 1ffff920072d5f75
[ 163.697255] R13: 0000000000000007 R14: fffffbfff6dc1410 R15: ffffc900396afbd8
[ 163.697260] FS: 0000000000000000(0000) GS:ffff888df7a80000(0000) knlGS:0000000000000000
[ 163.697263] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 163.697266] CR2: 00007f354595c3e8 CR3: 0000000088a44000 CR4: 00000000000006f0
[ 163.697268] Call Trace:
[ 163.697270] <NMI>
[ 163.697272] ? nmi_cpu_backtrace+0xd1/0x190
[ 163.697277] ? nmi_cpu_backtrace_handler+0x11/0x20
[ 163.697282] ? nmi_handle+0xb7/0x2a0
[ 163.697288] ? default_do_nmi+0x45/0x110
[ 163.697292] ? exc_nmi+0x104/0x190
[ 163.697295] ? end_repeat_nmi+0xf/0x53
[ 163.697299] ? native_queued_spin_lock_slowpath+0xbc/0xae0
[ 163.697303] ? native_queued_spin_lock_slowpath+0x2b9/0xae0
[ 163.697306] ? native_queued_spin_lock_slowpath+0x2b9/0xae0
[ 163.697308] ? native_queued_spin_lock_slowpath+0x2b9/0xae0
[ 163.697311] </NMI>
[ 163.697312] <TASK>
[ 163.697313] ? __pfx_native_queued_spin_lock_slowpath+0x10/0x10
[ 163.697317] queued_write_lock_slowpath+0x3c6/0x440
[ 163.697321] ? __pfx_queued_write_lock_slowpath+0x10/0x10
[ 163.697324] ? task_rq_lock+0xd0/0x390
[ 163.697327] ? perf_lock_task_context+0x106/0x310
[ 163.697333] _raw_write_lock_irq+0xcf/0xe0
[ 163.697335] ? __pfx__raw_write_lock_irq+0x10/0x10
[ 163.697339] exit_notify+0x86/0x780
[ 163.697342] ? __pfx_exit_notify+0x10/0x10
[ 163.697345] ? exit_tasks_rcu_start+0x173/0x230
[ 163.697350] do_exit+0x707/0xcb0
[ 163.697352] ? __count_memcg_events+0xe1/0x340
[ 163.697357] ? __pfx_do_exit+0x10/0x10
[ 163.697359] ? _raw_spin_lock_irq+0x86/0xe0
[ 163.697361] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 163.697364] ? handle_mm_fault+0x150/0x740
[ 163.697368] do_group_exit+0xac/0x230
[ 163.697371] __x64_sys_exit_group+0x3e/0x50
[ 163.697374] do_syscall_64+0x5f/0x170
[ 163.697377] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 163.697381] RIP: 0033:0x7f35457aa349
[ 163.697384] Code: Unable to access opcode bytes at 0x7f35457aa31f.
[ 163.697385] RSP: 002b:00007ffce767e198 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 163.697388] RAX: ffffffffffffffda RBX: 00007f35458a49e0 RCX: 00007f35457aa349
[ 163.697390] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
[ 163.697392] RBP: 0000000000000000 R08: ffffffffffffff80 R09: 00007ffce767e0af
[ 163.697394] R10: 00007ffce767e010 R11: 0000000000000246 R12: 00007f35458a49e0
[ 163.697395] R13: 00007f35458aa2e0 R14: 0000000000000001 R15: 00007f35458aa2c8
[ 163.697399] </TASK>
<snip>

i see that CPU_17 does not respond waiting in the do_exit() trying to grab
a writing lock. Below one:

<snip>
static void exit_notify(struct task_struct *tsk, int group_dead)
{
bool autoreap;
struct task_struct *p, *n;
LIST_HEAD(dead);

write_lock_irq(&tasklist_lock);
...
<snip>

As i see it disables interrupts and tries to acquire the rw-lock. In
order to debug this further i built the kernel with CONFIG_LOCK_STAT
to see the contention and waiting time on it:

echo 0 > /proc/lock_stat; echo 1 > /proc/sys/kernel/lock_stat; make -C coding/linux.git/ -j256 bzImage > /dev/null; echo 0 > /proc/sys/kernel/lock_stat;

<snip>
...
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
...
tasklist_lock-W: 56291 56830 0.23 3352084.59 3191025005.50 56150.36 86555 110489 0.60 65916.78 50226089.66 454.58
tasklist_lock-R: 40925 40987 0.23 3450788.93 2149685495.54 52447.98 73248 88581 0.09 32061.59 237549.19 2.68
---------------
tasklist_lock 21879 [<000000004337def3>] exit_notify+0x86/0x790
tasklist_lock 40987 [<000000008c6daba0>] __do_wait+0xdc/0x700
tasklist_lock 17632 [<00000000c86b4505>] release_task+0x106/0x470
tasklist_lock 17319 [<000000004c9a1afc>] copy_process+0x2a1b/0x4b00
---------------
tasklist_lock 73231 [<000000004337def3>] exit_notify+0x86/0x790
tasklist_lock 15082 [<00000000c86b4505>] release_task+0x106/0x470
tasklist_lock 4233 [<000000004c9a1afc>] copy_process+0x2a1b/0x4b00
tasklist_lock 5271 [<000000008c6daba0>] __do_wait+0xdc/0x700
...
<snip>

A waiting time can be quite long on a kernel built with KASAN. If i
interpret it correctly, the time is in microseconds, so we have 3.3
seconds as a maximum waiting time on the "tasklist_lock.

Since it disables interrupts on a current CPU, do_kernel_range_flush()
on that CPU is delayed thus it delays kworker which does TLB flushing.

Could you please comment on it? Maybe you can give some good input about
long wait on the tasklist_lock(it disables IRQs also) when KASAN is enabled.

Is that something which is expectable for debug kernel?

Thank you in advance!

--
Uladzislau Rezki

Next message: Paul Menzel: "Re: Linux logs error `Failed to query (GET_CUR) UVC control X on unit Y: -75 (exp. 1).` (75 == EOVERFLOW?)"
Previous message: Geert Uytterhoeven: "Re: [GIT PULL] Btrfs updates for 6.11"
Next in thread: Adrian Huang: "Re: [PATCH 1/1] mm/vmalloc: Add preempt point in purge_vmap_node() when enabling kasan"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]