rtmutex, pi_blocked_on, and blk_flush_plug()

From: Crystal Wood
Date: Thu Feb 09 2023 - 23:32:58 EST


Hello!

It is possible for blk_flush_plug() to be called while
current->pi_blocked_on is set, in the process of trying to acquire an rwsem.
If the block flush blocks trying to acquire some lock, then it appears that
current->pi_blocked_on will be overwritten, and then set to NULL once that
lock is acquired, even though the task is still blocked on the original
rwsem. Am I missing something that deals with this situation? It seems
like the lock types that are supposed to call blk_flush_plug() should do so
before calling task_blocks_on_rt_mutex().

I originally noticed this while investigating a related issue on an older
RHEL kernel where task_blocked_on_mutex() has a BUG_ON if entered with
current->pi_blocked_on non-NULL. Current kernels lack this check.

To demonstrate that the recursive blocking scenario can happen (without
actually waiting to hit the scenario where the second lock is contended),
I put a WARN_ON_ONCE(current->pi_blocked_on) in rtlock_lock() (plus a few
other places, but this is the one I hit):

[83246.479396] ------------[ cut here ]------------
[83246.479400] WARNING: CPU: 94 PID: 75492 at kernel/locking/spinlock_rt.c:40 rt_spin_lock+0x46/0x70
[83246.494329] Modules linked in: xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) bridge(E) stp(E) llc(E) rfkill(E) sunrpc(E) vfat(E) fat(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) ipmi_si(E) ipmi_devintf(E) wmi_bmof(E) ipmi_msghandler(E) pcspkr(E) mlx5_ib(E) ib_uverbs(E) joydev(E) sp5100_tco(E) ib_core(E) ccp(E) ptdma(E) i2c_piix4(E) k10temp(E) i2c_designware_platform(E) i2c_designware_core(E) acpi_cpufreq(E) xfs(E) libcrc32c(E) crc32c_intel(E) sd_mod(E) t10_pi(E) ast(E) crc64_rocksoft(E) crc64(E) drm_kms_helper(E) sg(E) syscopyarea(E) sysfillrect(E) mlx5_core(E) sysimgblt(E) i2c_algo_bit(E) ahci(E) drm_shmem_helper(E) libahci(E) mlxfw(E) drm(E) libata(E) tls(E) psample(E) wmi(E) dm_mirror(E) dm_region_hash(E)
[83246.494667] dm_log(E) dm_mod(E) fuse(E)
[83246.593228] CPU: 94 PID: 75492 Comm: kworker/u514:1 Tainted: G E 6.2.0-rc3-rt1.std+ #4
[83246.603919] Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM1009B 05/14/2022
[83246.614174] Workqueue: writeback wb_workfn (flush-253:2)
[83246.620021] RIP: 0010:rt_spin_lock+0x46/0x70
[83246.626265] Code: 48 89 fb 8b 90 34 08 00 00 75 1f 65 48 8b 14 25 c0 14 03 00 31 c0 f0 48 0f b1 53 18 75 10 e8 d1 45 5c ff 5b 5d e9 fa c2 56 ff <0f> 0b eb dd 48 89 df e8 ee 42 00 00 48 89 df 48 89 c5 e8 63 ff ff
[83246.646613] RSP: 0018:ffffafd3f0cb7438 EFLAGS: 00010286
[83246.651943] RAX: ffff8ee41531cc80 RBX: ffff8ec428839f48 RCX: 0000000000000001
[83246.659327] RDX: 0000000000000000 RSI: ffffafd3f0cb7510 RDI: ffff8ec428839f48
[83246.669216] RBP: 0000000000000001 R08: ffff8ec4494d3ec8 R09: 0000000000031480
[83246.676709] R10: ffff8ec41ab79000 R11: 0000000000000000 R12: ffffafd3f0cb7510
[83246.684536] R13: ffffafd3f0cb7510 R14: ffffcfd37e596bc0 R15: 0000000000000001
[83246.692026] FS: 0000000000000000(0000) GS:ffff8f02c7980000(0000) knlGS:0000000000000000
[83246.702144] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[83246.709800] CR2: 0000565164f09000 CR3: 00000003a2f58003 CR4: 0000000000770ee0
[83246.718014] PKRU: 55555554
[83246.722313] Call Trace:
[83246.726219] <TASK>
[83246.729719] dd_insert_requests+0x55/0x360
[83246.735712] blk_mq_sched_insert_requests+0x64/0x170
[83246.741579] blk_mq_flush_plug_list+0x135/0x330
[83246.747219] __blk_flush_plug+0xf6/0x130
[83246.752481] schedule+0x46/0xd0
[83246.756625] __rt_mutex_slowlock.isra.18.constprop.21+0x3c2/0x9d0
[83246.763522] __rwbase_read_lock+0xae/0x210
[83246.768737] xlog_cil_commit+0x1e0/0xc80 [xfs]
[83246.771346] ? __pfx_xfs_trans_precommit_sort+0x10/0x10 [xfs]
[83246.771697] __xfs_trans_commit+0x135/0x400 [xfs]
[83246.772313] xfs_bmapi_convert_delalloc+0x39f/0x500 [xfs]
[83246.773696] xfs_map_blocks+0x1ad/0x480 [xfs]
[83246.774339] iomap_writepage_map+0x105/0x710
[83246.774345] write_cache_pages+0x1bf/0x4f0
[83246.774640] ? __pfx_iomap_do_writepage+0x10/0x10
[83246.774644] ? submit_bio_noacct_nocheck+0xe9/0x2b0
[83246.775236] iomap_writepages+0x20/0x40
[83246.775238] xfs_vm_writepages+0x7e/0xb0 [xfs]
[83246.775628] do_writepages+0xd9/0x1c0
[83246.775633] __writeback_single_inode+0x4a/0x490
[83246.775636] writeback_sb_inodes+0x1f4/0x4c0
[83246.775640] __writeback_inodes_wb+0x5d/0xd0
[83246.775643] wb_writeback+0x244/0x370
[83246.775646] wb_workfn+0x3c0/0x530
[83246.776510] ? __raw_spin_unlock+0x10/0x50
[83246.776514] ? finish_task_switch+0xbd/0x320
[83246.776517] process_one_work+0x1a5/0x400
[83246.776521] ? __pfx_worker_thread+0x10/0x10
[83246.776523] worker_thread+0x34/0x360
[83246.776526] ? __pfx_worker_thread+0x10/0x10
[83246.776527] kthread+0x10f/0x130
[83246.776530] ? __pfx_kthread+0x10/0x10
[83246.776532] ret_from_fork+0x2c/0x50
[83246.776537] </TASK>
[83246.776538] ---[ end trace 0000000000000000 ]---

-Crystal