stalling IO regression in linux 5.12
From: Chris Murphy
Date: Wed Aug 10 2022 - 13:16:37 EST
CPU: Intel E5-2680 v3
RAM: 128 G
02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02), using megaraid_sas driver
8 Disks: TOSHIBA AL13SEB600
The problem shows up as increasing load and increasing IO pressure (PSI), while actual IO goes to zero. It never happens on the 5.11 series; it always happens after 5.12-rc1 and persists through 5.18.0. There's a new mix of behaviors with 5.19; I suspect the mm improvements in that series might be masking the problem.
The workload involves openQA, which spins up 30 qemu-kvm instances and runs a bunch of tests, generating quite a lot of writes for each VM: qcow2 files, video in the form of many screenshots, and various log files. Each VM is in its own cgroup. As the problem begins, I see increasing IO pressure and decreasing IO for each qemu instance's cgroup, and for the cgroups for httpd, journald, auditd, and postgresql. IO pressure goes to ~99% and IO drops to 0.
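For anyone wanting to watch the same per-cgroup numbers, here is a minimal sketch of how the IO pressure can be sampled. It assumes cgroup v2 mounted at /sys/fs/cgroup and PSI enabled in the kernel; the function names and the 90% threshold are illustrative choices, not part of any existing tool.

```python
import re
from pathlib import Path

# A PSI line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
PSI_LINE = re.compile(r"^(some|full)\s+(.*)$")


def parse_psi(text):
    """Parse the contents of a PSI file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        m = PSI_LINE.match(line.strip())
        if not m:
            continue
        fields = dict(kv.split("=", 1) for kv in m.group(2).split())
        result[m.group(1)] = {k: float(v) for k, v in fields.items()}
    return result


def stalled_cgroups(root="/sys/fs/cgroup", threshold=90.0):
    """Yield (cgroup path, full avg10) for cgroups at or above the threshold.

    Assumption: cgroup v2 is mounted at `root` and each cgroup exposes an
    io.pressure file. The threshold is an arbitrary illustrative value.
    """
    for psi_file in Path(root).rglob("io.pressure"):
        try:
            psi = parse_psi(psi_file.read_text())
        except OSError:
            continue  # cgroup went away, or pressure accounting is unavailable
        full = psi.get("full", {}).get("avg10", 0.0)
        if full >= threshold:
            yield str(psi_file.parent), full
```

Iterating over stalled_cgroups() in a loop with a short sleep, and printing each path with its avg10 value, is enough to see which cgroups are stalling as the problem develops.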
Left unattended, the problem eventually results in a completely unresponsive system, with no kernel messages. It reproduces in the following configurations; for the first two I provide links to full dmesg with sysrq+w:
btrfs raid10 (native) on plain partitions [1]
btrfs single/dup on dmcrypt on mdadm raid10 and parity raid [2]
XFS on dmcrypt on mdadm raid10 or parity raid
I've started a bisect, but for some reason I haven't figured out, I've started getting compiled kernels that don't boot the hardware. The failure is very early on, such that the UUID for the root file system isn't found, but there's not much to go on as to why. [3] I have tested the first and last skipped commits in the bisect log below; they successfully boot a VM but not the hardware.
Anyway, I'm kinda stuck at this point trying to narrow it down further. Any suggestions? Thanks.
[1] btrfs raid10, plain partitions
https://drive.google.com/file/d/1-oT3MX-hHYtQqI0F3SpgPjCIDXXTysLU/view?usp=sharing
[2] btrfs single/dup, dmcrypt, mdadm raid10
https://drive.google.com/file/d/1m_T3YYaEjBKUROz6dHt5_h92ZVRji9FM/view?usp=sharing
[3]
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [c03c21ba6f4e95e406a1a7b4c34ef334b977c194] Merge tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect bad c03c21ba6f4e95e406a1a7b4c34ef334b977c194
# status: waiting for good commit(s), bad commit known
# good: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect good f40ddce88593482919761f74910f42f4b84c004b
# bad: [df24212a493afda0d4de42176bea10d45825e9a0] Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad df24212a493afda0d4de42176bea10d45825e9a0
# good: [82851fce6107d5a3e66d95aee2ae68860a732703] Merge tag 'arm-dt-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 82851fce6107d5a3e66d95aee2ae68860a732703
# good: [99f1a5872b706094ece117368170a92c66b2e242] Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect good 99f1a5872b706094ece117368170a92c66b2e242
# bad: [9eef02334505411667a7b51a8f349f8c6c4f3b66] Merge tag 'locking-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 9eef02334505411667a7b51a8f349f8c6c4f3b66
# bad: [9820b4dca0f9c6b7ab8b4307286cdace171b724d] Merge tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block
git bisect bad 9820b4dca0f9c6b7ab8b4307286cdace171b724d
# good: [bd018bbaa58640da786d4289563e71c5ef3938c7] Merge tag 'for-5.12/libata-2021-02-17' of git://git.kernel.dk/linux-block
git bisect good bd018bbaa58640da786d4289563e71c5ef3938c7
# skip: [203c018079e13510f913fd0fd426370f4de0fd05] Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.12/drivers
git bisect skip 203c018079e13510f913fd0fd426370f4de0fd05
# skip: [49d1ec8573f74ff1e23df1d5092211de46baa236] block: manage bio slab cache by xarray
git bisect skip 49d1ec8573f74ff1e23df1d5092211de46baa236
# bad: [73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7] nvme: cleanup zone information initialization
git bisect bad 73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7
# skip: [71217df39dc67a0aeed83352b0d712b7892036a2] block, bfq: make waker-queue detection more robust
git bisect skip 71217df39dc67a0aeed83352b0d712b7892036a2
# bad: [8358c28a5d44bf0223a55a2334086c3707bb4185] block: fix memory leak of bvec
git bisect bad 8358c28a5d44bf0223a55a2334086c3707bb4185
# skip: [3a905c37c3510ea6d7cfcdfd0f272ba731286560] block: skip bio_check_eod for partition-remapped bios
git bisect skip 3a905c37c3510ea6d7cfcdfd0f272ba731286560
# skip: [3c337690d2ebb7a01fa13bfa59ce4911f358df42] block, bfq: avoid spurious switches to soft_rt of interactive queues
git bisect skip 3c337690d2ebb7a01fa13bfa59ce4911f358df42
# skip: [3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea] bio: add a helper calculating nr segments to alloc
git bisect skip 3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea
# skip: [4eb1d689045552eb966ebf25efbc3ce648797d96] blk-crypto: use bio_kmalloc in blk_crypto_clone_bio
git bisect skip 4eb1d689045552eb966ebf25efbc3ce648797d96
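
To make it easier to see which commits in the log above were tested and which were skipped, here is a small sketch that groups the hashes by verdict. It assumes one hash per "git bisect good/bad/skip" line, as in this log; the function name is made up for illustration, and ranges or multiple revisions on one line are not handled.

```python
import re
from collections import defaultdict

# Matches the command lines of a bisect log, e.g. "git bisect skip <sha>".
# Assumption: exactly one hexadecimal commit hash follows the verdict.
BISECT_CMD = re.compile(r"^git bisect (good|bad|skip) ([0-9a-f]{7,40})\b")


def summarize_bisect_log(text):
    """Group commit hashes from a `git bisect log` by verdict, in log order."""
    verdicts = defaultdict(list)
    for line in text.splitlines():
        m = BISECT_CMD.match(line.strip())
        if m:
            verdicts[m.group(1)].append(m.group(2))
    return dict(verdicts)
```

Feeding it the saved output of "git bisect log" returns a dict with "good", "bad", and "skip" lists; the first and last entries of the "skip" list are the two skipped commits I boot-tested.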
--
Chris Murphy