On 7/3/24 5:11 PM, Bharata B Rao wrote:
Many soft and hard lockups are seen with the upstream kernel when running a
bunch of tests that include FIO and LTP filesystem tests on 10 NVME
disks. The lockups can appear anywhere between 2 and 48 hours in. Originally
this was reported on a large customer VM instance with passthrough NVME
disks on older kernels (v5.4-based). However, similar problems were
reproduced when running the tests on bare metal with the latest upstream
kernel (v6.10-rc3). Other lockups with different signatures are seen, but
this report discusses only those related to the MM area.
Also note that the description below pertains to the lockups seen on
bare metal with the upstream kernel (not the VM).
The general observation is that the problem usually surfaces when system
free memory goes very low and page cache/buffer consumption hits the
ceiling. Most of the time the two contended locks are the lruvec and
inode->i_lock spinlocks.
- Could this be a scalability issue in LRU list handling and/or page
cache invalidation typical of a large system configuration?
It seems to me it could be (except for that ZONE_DMA corner case) a general
scalability issue, in the sense that you tweak some part of the kernel and the
contention moves elsewhere. At least in MM we have per-node locks, so this
means 256 CPUs per lock? It used to be that there were not that many
cores/threads per physical CPU and its NUMA node, so many CPUs also meant
more NUMA nodes, with the lock contention distributed among them. I think
you could try fakenuma to create these nodes artificially and see if it
helps for the MM part. But if the contention moves to e.g. an inode lock,
I'm not sure what to do about that.
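For reference, the fakenuma experiment suggested above could be set up roughly as follows. This is a sketch, not a tested recipe: it assumes an x86 kernel built with CONFIG_NUMA_EMU=y and a GRUB-based distro, and the node count of 8 is an arbitrary choice to mirror the NPS=4 two-socket layout.

```shell
# Sketch: emulate NUMA nodes via the numa=fake= boot parameter
# (assumptions: x86, CONFIG_NUMA_EMU=y, GRUB; node count 8 is arbitrary).
#
# 1. Add the parameter to the kernel command line, e.g. in /etc/default/grub:
#      GRUB_CMDLINE_LINUX="... numa=fake=8"
# 2. Regenerate the bootloader config and reboot:
#      sudo update-grub && sudo reboot
# 3. After reboot, confirm the emulated nodes are visible in sysfs
#    (numactl --hardware shows the same, with per-node CPU/memory layout):
ls -d /sys/devices/system/node/node*
```

With the fake nodes in place, the per-node lruvec locks (and kswapd threads) multiply, which should show whether the lruvec contention distributes as hoped.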
<SNIP>
3) AMD has a BIOS setting called NPS (Nodes Per Socket), with which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there are four NUMA nodes per socket, and hence 8 NUMA nodes in the
system. This was done to check whether having more kswapd threads, each
working on fewer folios per node, would make a difference. However, here
too, multiple soft lockups were seen (in clear_shadow_entry(), as in the
MGLRU case). No hard lockups were observed.
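Whether NPS=4 actually produced the expected extra per-node kswapd threads can be cross-checked from userspace. An inspection sketch; the counts depend on the machine, and kernel threads are not visible from inside a PID namespace (e.g. a container), where the second count may be 0:

```shell
# The kernel starts one kswapd thread per online NUMA node, named "kswapdN",
# so on the affected machine the two counts below should match.
ls -d /sys/devices/system/node/node* | wc -l   # number of online NUMA nodes
ps -e -o comm= | grep -c '^kswapd' || true     # per-node kswapd threads
```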