Re: [PATCH] x86/apic: Add retry mechanism to add_pin_to_irq_node()

From: Breno Leitao
Date: Mon Jul 29 2024 - 12:56:18 EST


Hello Thomas,

On Mon, Jul 29, 2024 at 06:13:34PM +0200, Thomas Gleixner wrote:
> On Mon, Jul 29 2024 at 07:06, Breno Leitao wrote:
> > I've been running some experiments with failslab fault injector running
> > to detect a different problem, and the machine always crash with the
> > following stack:
> >
> > can not alloc irq_pin_list (-1,0,20)
> > Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed
> >
> > Call Trace:
> > panic
> > _printk
> > panic_smp_self_stop
> > rcu_is_watching
> > intel_irq_remapping_free
>
> This completely lacks context. When does this happen? What's the system
> state? What has intel_irq_remapping_free() to do with the allocation path?

Sorry, let me clarify it a bit better:

1) This happens when the machine is booted up, and being under stress
2) This happens when I have failslab fault injection enabled.
3) The machine crashes after hitting this error.
4) This is reproducible with `stress-ng` using the `--aggressive` parameter
5) This is the full stack (sorry for not decoding the stack, but if you
need it, I am more than happy to give you a decoded stack)


04:12:34 can not alloc irq_pin_list (-1,0,20)
Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed
CPU: 11 UID: 0 PID: 335023 Comm: stress-ng-dev Kdump: loaded Tainted: G S E N 6.10.0-12563-gdb0610128a16 #48

Call Trace:
<TASK>
panic+0x4e9/0x590
? _printk+0xb3/0xe0
? panic_smp_self_stop+0x70/0x70
? rcu_is_watching+0xe/0xb0
? intel_irq_remapping_free+0x30/0x30
? __add_pin_to_irq_node+0xf4/0x2d0
? rcu_is_watching+0xe/0xb0
mp_irqdomain_alloc+0x9ab/0xa80
? IO_APIC_get_PCI_irq_vector+0x850/0x850
? __kmalloc_cache_node_noprof+0x1e0/0x360
? mutex_lock_io_nested+0x1420/0x1420
irq_domain_alloc_irqs_locked+0x25d/0x8d0
__irq_domain_alloc_irqs+0x80/0x110
mp_map_pin_to_irq+0x645/0x890
? __acpi_get_override_irq+0x350/0x350
? mutex_lock_io_nested+0x1420/0x1420
? lockdep_hardirqs_on_prepare+0x400/0x400
? mp_map_gsi_to_irq+0xe6/0x1b0
acpi_register_gsi_ioapic+0xe6/0x150
? acpi_unregister_gsi_ioapic+0x40/0x40
? mark_held_locks+0x9f/0xe0
? _raw_spin_unlock_irq+0x24/0x50
hpet_open+0x313/0x480
misc_open+0x306/0x420
chrdev_open+0x218/0x660
? __unregister_chrdev+0xe0/0xe0
? security_file_open+0x3d4/0x740
do_dentry_open+0x4a1/0x1300
? __unregister_chrdev+0xe0/0xe0
vfs_open+0x7e/0x350
path_openat+0xb46/0x2740
? kernel_tmpfile_open+0x60/0x60
? lock_acquire+0x1e4/0x650
do_filp_open+0x1af/0x3e0
? path_openat+0x2740/0x2740
? do_raw_spin_lock+0x12d/0x270
? spin_bug+0x1d0/0x1d0
? _raw_spin_unlock+0x29/0x40
? alloc_fd+0x1e6/0x640
do_sys_openat2+0x117/0x150
? build_open_flags+0x450/0x450
? lock_downgrade+0x690/0x690
__x64_sys_openat+0x11f/0x1d0
? __x64_sys_open+0x1a0/0x1a0
? do_syscall_64+0x36/0x190
do_syscall_64+0x6e/0x190
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f6c406fd784
Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 d6 88 f8 ff 44 8b 54 24 0c 44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 89 44 24 0c e8 28 89 f8 ff 8b 44
RSP: 002b:00007fff72413a70 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f6c408c43a8 RCX: 00007f6c406fd784
RDX: 0000000000000800 RSI: 000055759a5fc910 RDI: 00000000ffffff9c
RBP: 000055759a5fc910 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000800
R13: 00007fff72413c90 R14: 000055759a5fc910 R15: 00007f6c408c43a8
</TASK>

> > This happens because add_pin_to_irq_node() function would panic if
> > adding a pin to an IRQ failed due to -ENOMEM (which was injected by
> > failslab fault injector). I've been running with this patch in my test
> > cases in order to be able to pick real bugs, and I thought it might be a
> > good idea to have it upstream also, so, other people trying to find real
> > bugs don't stumble upon this one. Also, this makes sense in a real
> > world(?), when retrying a few times might be better than just
> > panicking.
>
> While it seems to make sense, the reality is that this is mostly early
> boot code. If there is a real world memory allocation failure during
> early boot then retries will not help at all.

This is not happening at early boot, this is reproducible when running
stress-ng in this aggressive mode.

Since I have failslab injecting a kmalloc fault,
__add_pin_to_irq_noder() returns -ENOMEM, which causes the undesired
panic().

> > Introduce a retry mechanism that attempts to add the pin up to 3 times
> > before giving up and panicking. This should improve the robustness of
> > the IO-APIC code in the face of transient errors.
>
> I'm absolutely not convinced by this loop heuristic. That's just a bad
> hack.

I will not disagree with you here, but I need to use this patch in order
to be able t keep the system not panicking and stable while fault
injecting slab errors and trying to reproduce a real bug in the network
stack.

Thanks for the review,
--breno