[RFC] Processing of raised_list can stall if an IPI/interrupt is missed
From: Herton R. Krzesinski
Date: Tue Mar 03 2026 - 14:07:54 EST
Hello,
I recently saw a report where a system went down after it stopped processing
irq work items in raised_list (from kernel/irq_work.c). The system in question,
from the vmcore data I got, is a Linux guest under VMware (on an x86_64 host).
It seems to be a very rare occurrence; as far as I know, only two different users
have reported it so far. While it was reported on an old RHEL-based kernel (4.18),
I believe the issue could still happen in newer kernels, since the processing of
raised_list has not changed in principle.
Considering an x86_64 system, from my understanding of the code there are two
ways raised_list can be consumed: either through irq_work_tick() or through the irq
work interrupt/IPI. If the system has a working APIC, raised_list items are only
consumed through the interrupt/IPI, with irq_work_run() being called from
arch/x86/kernel/irq_work.c, and irq_work_tick() will not call
irq_work_run_list(raised) because of the arch_irq_work_has_interrupt() check
in this case.
So in this specific case, if the interrupt/IPI is somehow missed, processing of
items in raised_list can stall forever: __irq_work_queue_local() calls
llist_add(), which returns false for a non-empty list, so if the list was
not consumed due to a missed interrupt/IPI, irq_work_raise() will never be called again.
This is what I saw in the vmcore from one of the reports I mentioned above, where
the system died after some time; from it we got some pending irq work items
in raised_list on CPU 2:
crash> pd raised_list:all
per_cpu(raised_list, 0) = $1 = {
first = 0x0
}
per_cpu(raised_list, 1) = $2 = {
first = 0x0
}
per_cpu(raised_list, 2) = $3 = {
first = 0xffffbb22d1609020
}
...
crash> list 0xffffbb22d1609020
ffffbb22d1609020
ffffbb233d06b020
ffffbb233901d020
ffffbb2324ec1020
ffffbb232cf59020
ffffbb2328f0d020
ffffbb2320e7d020
ffffbb2334fd1020
ffffbb2330f95020
ffffbb231ce39020
ffffbb2318da5020
ffffbb2314d29020
ffffbb22e45f4020
ffffbb23007c5020
ffffbb2310cdd020
ffffbb230c8b1020
ffffbb2308857020
ffffbb2304821020
ffffbb22fc789020
ffffbb22e05f0020
ffffbb22f8715020
ffffbb22f46db020
ffffbb22f06ad020
ffffbb22e8635020
ffffbb22ec673020
ffff93d3a6a1efe0
ffff93d65151e6d0
crash> list 0xffffbb22d1609020 | wc -l
27
All other CPUs had no items, only CPU 2. These pending items look to have
caused cascading effects which led to soft lockups and the system dying
(e.g. a work item doesn't run, holds up resources, and several tasks end up
stuck...).
It appears relying on the IPI only could be too strict, as in this case, although
I don't know if the system missing an IPI/interrupt is something that can
be expected. It looks to me like we could have a virtualization bug/issue in this
specific case (since it's running under VMware), but maybe we should add a fallback
for when something like this happens? For example, making it less strict and
allowing irq_work_tick() to also process the list, like below:
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 73f7e1fd4ab4..e47d64b56a38 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -188,9 +188,8 @@ bool irq_work_needs_cpu(void)
raised = this_cpu_ptr(&raised_list);
lazy = this_cpu_ptr(&lazy_list);
- if (llist_empty(raised) || arch_irq_work_has_interrupt())
- if (llist_empty(lazy))
- return false;
+ if (llist_empty(raised) && llist_empty(lazy))
+ return false;
/* All work should have been flushed before going offline */
WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
@@ -270,7 +269,7 @@ void irq_work_tick(void)
{
struct llist_head *raised = this_cpu_ptr(&raised_list);
- if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
+ if (!llist_empty(raised))
irq_work_run_list(raised);
if (!IS_ENABLED(CONFIG_PREEMPT_RT))
However, the above essentially reverts commit 76a33061b9323b7fdb220ae5fa116c10833ec22e
("irq_work: Force raised irq work to run on irq work interrupt") and could
reintroduce the issue it fixed. However, since nohz_full_kick_func() (which is the
renamed nohz_full_kick_work_func()) is empty now, maybe it's OK not to be
strict anymore about making raised_list run only in the irq work interrupt?
Or maybe it's not worth changing this, since this is rare and a missed self-IPI should
not be expected?