On 7/3/24 5:11 PM, Bharata B Rao wrote:
Many soft and hard lockups are seen with the upstream kernel when running a
bunch of tests that include FIO and LTP filesystem tests on 10 NVME
disks. The lockups can appear anywhere between 2 and 48 hours in. Originally
this was reported on a large customer VM instance with passthrough NVME
disks on older kernels (v5.4-based). However, similar problems were
reproduced when running the tests on bare metal with the latest upstream
kernel (v6.10-rc3). Other lockups with different signatures are seen, but
this report discusses only those related to the MM area.
Also note that the description below pertains to the lockups seen on
bare metal with the upstream kernel (not the VM).
The general observation is that the problem usually surfaces when system
free memory goes very low and page cache/buffer consumption hits the
ceiling. Most of the time the two contended locks are the lruvec and
inode->i_lock spinlocks.
- Could this be a scalability issue in LRU list handling and/or page
cache invalidation typical of a large system configuration?
It seems to me it could be (except for that ZONE_DMA corner case) a general
scalability issue, in the sense that you tweak some part of the kernel and the
contention moves elsewhere. At least in MM we have per-node locks, so this
means 256 CPUs per lock? It used to be that there were not that many
cores/threads per physical CPU and its NUMA node, so many CPUs also meant
more NUMA nodes, with the lock contention distributed among them. I think
you could try fakenuma to create these nodes artificially and see if it
helps for the MM part. But if the contention moves to e.g. an inode lock,
I'm not sure what to do about that.
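For reference, the fakenuma experiment suggested above could be set up roughly as follows. This is a sketch, not a tested recipe: it assumes an x86 kernel built with CONFIG_NUMA_EMU=y and a GRUB-based distro, and the node count of 8 is an arbitrary choice to mirror the NPS=4 two-socket layout.

```shell
# Sketch: emulate NUMA nodes via the numa=fake= boot parameter
# (assumptions: x86, CONFIG_NUMA_EMU=y, GRUB; node count 8 is arbitrary).
#
# 1. Add the parameter to the kernel command line, e.g. in /etc/default/grub:
#      GRUB_CMDLINE_LINUX="... numa=fake=8"
# 2. Regenerate the bootloader config and reboot:
#      sudo update-grub && sudo reboot
# 3. After reboot, confirm the emulated nodes are visible in sysfs
#    (numactl --hardware shows the same, with per-node CPU/memory layout):
ls -d /sys/devices/system/node/node*
```

With the fake nodes in place, the per-node lruvec locks (and kswapd threads) multiply, which should show whether the lruvec contention distributes as hoped.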
<SNIP>
3) AMD has a BIOS setting called NPS (Nodes Per Socket), with which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there are four NUMA nodes per socket, and hence 8 NUMA nodes in the
system. This was done to check whether having more kswapd threads, each
working on fewer folios per node, would make a difference. However, here
too, multiple soft lockups were seen (in clear_shadow_entry(), as in the
MGLRU case). No hard lockups were observed.
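Whether NPS=4 actually produced the expected extra per-node kswapd threads can be cross-checked from userspace. An inspection sketch; the counts depend on the machine, and kernel threads are not visible from inside a PID namespace (e.g. a container), where the second count may be 0:

```shell
# The kernel starts one kswapd thread per online NUMA node, named "kswapdN",
# so on the affected machine the two counts below should match.
ls -d /sys/devices/system/node/node* | wc -l   # number of online NUMA nodes
ps -e -o comm= | grep -c '^kswapd' || true     # per-node kswapd threads
```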