Re: [PATCH v3 0/2] PCI/IOV: Fix deadlock when removing PF with enabled SR-IOV

From: Dragos Tatulea

Date: Mon Feb 23 2026 - 09:11:05 EST


Hi Niklas,

On 03.02.26 01:48, Bjorn Helgaas wrote:
> On Tue, Dec 16, 2025 at 11:14:01PM +0100, Niklas Schnelle wrote:
>> Hi Bjorn,
>>
>> Doing additional testing for a distribution backport of commit
>> 05703271c3cd ("PCI/IOV: Add PCI rescan-remove locking when
>> enabling/disabling SR-IOV") Benjamin found a hang with s390's
>> recover attribute. Further investigation showed this to be a deadlock
>> caused by recursively taking the pci_rescan_remove lock when removing
>> a PF with SR-IOV enabled.
>>
>> The issue can be reproduced on both s390 and x86_64 with:
>>
>> $ echo <NUM> > /sys/bus/pci/devices/<pf>/sriov_numvfs
>> $ echo 1 > /sys/bus/pci/devices/<pf>/remove
>>
>> As this seems worse than the original, hard-to-hit race fixed by the
>> cited commit, I think we first want to revert the broken fix.
>>
>> Following that, patch 2 attempts to fix the original issue by taking
>> the PCI rescan/remove lock directly before calling into the driver's
>> sriov_configure() callback, enforcing the rule that it may only be
>> called with pci_rescan_remove_lock held.
>>
>> Thanks,
>> Niklas
>>
>> Signed-off-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
>> ---
>> Changes in v3:
>> - Rebased on v6.19-rc1, also verified issue is still there and the fix
>> still works
>> - Added more of the lockdep splat for better context
>> - Link to v2: https://lore.kernel.org/r/20251119-revert_sriov_lock-v2-0-ea50eb1e8f96@xxxxxxxxxxxxx
>>
>> Changes in v2:
>> - Collected R-b from Benjamin
>> - Link to v1: https://lore.kernel.org/r/20251030-revert_sriov_lock-v1-0-70f82ade426f@xxxxxxxxxxxxx
>>
>> ---
>> Niklas Schnelle (2):
>> Revert "PCI/IOV: Add PCI rescan-remove locking when enabling/disabling SR-IOV"
>> PCI/IOV: Fix race between SR-IOV enable/disable and hotplug
>>
>> drivers/pci/iov.c | 9 ++++-----
>> 1 file changed, 4 insertions(+), 5 deletions(-)
>> ---
>> base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
>> change-id: 20251029-revert_sriov_lock-aef4557f360f
>
> Applied to pci/iov for v6.20, thanks!

After pulling these commits into our internal tree we see the lockdep
splat below in many internal tests. We are still trying to find an easy
reproducer; for now we had to revert both commits internally.

I noticed a similar discussion in another thread [1], but there these
changes seem to actually fix the issue, which is not the case for us.

------------[ cut here ]------------
WARNING: drivers/pci/remove.c:130 at pci_stop_and_remove_bus_device+0x39/0x40, CPU#2: modprobe/12956
Modules linked in: mlx5_core(-) act_tunnel_key vxlan dummy act_mirred act_gact cls_flower act_police act_ct nf_flow_table [...]
CPU: 2 UID: 0 PID: 12956 Comm: modprobe Not tainted 6.19.0net_next_e834b5e #1 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:pci_stop_and_remove_bus_device+0x39/0x40
Code: [...]
RSP: 0018:ffff888164c9fd10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888188ff2000 RCX: 0000000000000001
RDX: 0000000000000046 RSI: ffffffff8307e068 RDI: ffff88816bf4c9c0
RBP: ffff888188ff2000 R08: 00000000000000f4 R09: ffff88816bf4c080
R10: 0000000000000001 R11: 0000000000000003 R12: 0000000000000000
R13: ffff888164c9fd27 R14: 0000000000000002 R15: 0000000000000000
FS: 00007f52364bd740(0000) GS:ffff8885a9019000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005622dbf749d8 CR3: 0000000169132004 CR4: 0000000000372eb0
Call Trace:
<TASK>
pci_iov_remove_virtfn+0xbd/0x120
sriov_disable+0x30/0xe0
mlx5_sriov_disable+0x50/0xa0 [mlx5_core]
remove_one+0x68/0xe0 [mlx5_core]
pci_device_remove+0x39/0xa0
device_release_driver_internal+0x1e4/0x240
driver_detach+0x47/0x90
bus_remove_driver+0x84/0x110
pci_unregister_driver+0x3b/0x90
mlx5_cleanup+0x13/0x40 [mlx5_core]
__x64_sys_delete_module+0x16f/0x290
? kmem_cache_free+0x221/0x520
do_syscall_64+0xa8/0x13f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f5235f2c3fb
Code: [...]
RSP: 002b:00007ffc6ba11518 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 00005558c4278f30 RCX: 00007f5235f2c3fb
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005558c4278f98
RBP: 00007ffc6ba11540 R08: 1999999999999999 R09: 0000000000000000
R10: 00007f5235fa5fe0 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffc6ba11570 R14: 0000000000000000 R15: 0000000000000000
</TASK>
irq event stamp: 44859
hardirqs last enabled at (44869): [<ffffffff814af7ca>] __up_console_sem+0x5a/0x70
hardirqs last disabled at (44878): [<ffffffff814af7af>] __up_console_sem+0x3f/0x70
softirqs last enabled at (44844): [<ffffffff81430312>] irq_exit_rcu+0x82/0xe0
softirqs last disabled at (44821): [<ffffffff81430312>] irq_exit_rcu+0x82/0xe0
---[ end trace 0000000000000000 ]---

[1] https://lore.kernel.org/all/20260222112904.171858-1-ionut.nechita@xxxxxxxxxxxxx/

Thanks,
Dragos