Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()

From: Waiman Long

Date: Sat Jan 10 2026 - 16:48:04 EST


On 1/8/26 3:26 AM, Marc Zyngier wrote:
On Wed, 07 Jan 2026 21:53:53 +0000,
Waiman Long <longman@xxxxxxxxxx> wrote:
When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system,
the following bug report was produced at bootup time.

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 1
:
CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G W 6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)}
Tainted: [W]=WARN
Call trace:
:
rt_spin_lock+0xe4/0x408
rmqueue_bulk+0x48/0x1de8
__rmqueue_pcplist+0x410/0x650
rmqueue.constprop.0+0x6a8/0x2b50
get_page_from_freelist+0x3c0/0xe68
__alloc_frozen_pages_noprof+0x1dc/0x348
alloc_pages_mpol+0xe4/0x2f8
alloc_frozen_pages_noprof+0x124/0x190
allocate_slab+0x2f0/0x438
new_slab+0x4c/0x80
___slab_alloc+0x410/0x798
__slab_alloc.constprop.0+0x88/0x1e0
__kmalloc_cache_noprof+0x2dc/0x4b0
allocate_vpe_l1_table+0x114/0x788
its_cpu_init_lpis+0x344/0x790
its_cpu_init+0x60/0x220
gic_starting_cpu+0x64/0xe8
cpuhp_invoke_callback+0x438/0x6d8
__cpuhp_invoke_callback_range+0xd8/0x1f8
notify_cpu_starting+0x11c/0x178
secondary_start_kernel+0xc8/0x188
__secondary_switched+0xc0/0xc8

This is due to the fact that allocate_vpe_l1_table() will call
kzalloc() to allocate a cpumask_t when the first CPU of the
second node of the 72-CPU Grace system is brought up via the
CPUHP_AP_MIPS_GIC_TIMER_STARTING state inside the starting section of
Surely *not* that particular state.

My mistake, it should be CPUHP_AP_IRQ_GIC_STARTING. There are three static gic_starting_cpu() functions, which confused me.


the CPU hotplug bringup pipeline, where interrupts are disabled. This is
an atomic context where sleeping is not allowed, and acquiring a sleeping
rt_spin_lock within kzalloc() may lead to a system hang if there is
lock contention.

To work around this issue, a static buffer is used for cpumask
allocation when running a PREEMPT_RT kernel, via the newly introduced
vpe_alloc_cpumask() helper. The static buffer is currently 4 KB in
size. As only one cpumask is needed per node, this should be big
enough as long as (cpumask_size() * nr_node_ids) does not exceed 4 KB.
What role does the node play here? The GIC topology has nothing to do
with NUMA. It may be true on your particular toy, but that's
definitely not true architecturally. You could, at worst, end up with
one such cpumask per *CPU*. That'd be a braindead system, but this
code is written to support the architecture, not any particular
implementation.

It is just what I observed on the hardware I used to reproduce the problem. I agree that it may be different on other arm64 systems.
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index ada585bfa451..9185785524dc 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
 	return true;
 }
 
+static void *vpe_alloc_cpumask(void)
+{
+	/*
+	 * With a PREEMPT_RT kernel, we can't call any k*alloc() APIs, as
+	 * they may acquire a sleeping rt_spin_lock in an atomic context.
+	 * So use a pre-allocated buffer instead.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		static unsigned long mask_buf[512];
+		static atomic_t alloc_idx;
+		int idx, mask_size = cpumask_size();
+		int nr_cpumasks = sizeof(mask_buf) / mask_size;
+
+		/*
+		 * Fetch an allocation index and, if it points to a buffer
+		 * within mask_buf[], return that. Fall back to kzalloc()
+		 * otherwise.
+		 */
+		idx = atomic_fetch_inc(&alloc_idx);
+		if (idx < nr_cpumasks)
+			return &mask_buf[idx * mask_size / sizeof(long)];
+	}
Err, no. That's horrible. I can see three ways to address this in a
more appealing way:

- you give RT a generic allocator that works for (small) atomic
allocations. I appreciate that's not easy, and even probably
contrary to the RT goals. But I'm also pretty sure that the GIC code
is not the only pile of crap being caught doing that.

- you pre-compute upfront how many cpumasks you are going to require,
based on the actual GIC topology. You do that on CPU0, outside of
the hotplug constraints, and allocate what you need. This is
difficult as you need to ensure the RD<->CPU matching without the
CPUs having booted, which means wading through the DT/ACPI gunk to
try and guess what you have.

- you delay the allocation of L1 tables to a context where you can
perform allocations, and before we have a chance of running a guest
on this CPU. That's probably the simplest option (though dealing
with late onlining while guests are already running could be
interesting...).

But I'm always going to say no to something that is a poor hack and
ultimately falling back to the same broken behaviour.

Thanks for the suggestion. I will try the first alternative: a more generic memory allocator.
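[Editor's note: the kind of small fixed-pool atomic allocator being discussed might look roughly like the following userspace sketch, using C11 stdatomic. The names, pool sizes, and the fail-with-NULL policy are all assumptions for illustration, not the eventual kernel design.]

```c
#include <stdatomic.h>
#include <stddef.h>

/*
 * Sketch of a tiny lock-free bump allocator that is safe to call from
 * contexts where sleeping is forbidden, because it takes no locks.
 * POOL_SLOTS and SLOT_LONGS are illustrative sizes.
 */
#define POOL_SLOTS	8
#define SLOT_LONGS	8	/* 64 bytes per slot on 64-bit */

static unsigned long pool[POOL_SLOTS][SLOT_LONGS];
static atomic_int pool_idx;

/*
 * Hand out the next free slot, or NULL once the pool is exhausted.
 * Unlike the patch above, there is deliberately no fallback to a
 * sleeping allocator: the caller must handle NULL.
 */
static void *atomic_pool_alloc(void)
{
	int idx = atomic_fetch_add(&pool_idx, 1);

	return idx < POOL_SLOTS ? pool[idx] : NULL;
}
```

A real kernel version would also need a free path (or accept that the pool is allocate-once, as the vPE L1 tables effectively are) and a way to size the pool from the actual topology, which is exactly the hard part Marc points out above.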

Cheers,
Longman


Thanks,

M.