Re: [bug report] GICv4.1: multiple vcpus execute vgic_v4_load at the same time will greatly increase the time consumption

From: Kunkun Jiang
Date: Sun Aug 25 2024 - 23:10:56 EST


Hi Marc,

On 2024/8/23 16:49, Marc Zyngier wrote:
On Thu, 22 Aug 2024 22:20:43 +0100,
Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

On Thu, Aug 22 2024 at 13:47, Marc Zyngier wrote:
On Thu, 22 Aug 2024 11:59:50 +0100,
Kunkun Jiang <jiangkunkun@xxxxxxxxxx> wrote:
but that will eat a significant portion of your stack if your kernel is
configured for a large number of CPUs.


Currently CONFIG_NR_CPUS=4096, so each `struct cpumask` occupies 512 bytes.

This seems crazy. Why would you build a kernel with something *that*
big, especially considering that you have far fewer than 1k CPUs?

That's why CONFIG_CPUMASK_OFFSTACK exists, but that does not help in
that context. :)

The removal of this global lock is the only option in my opinion.
Either the cpumask becomes a stack variable, or it becomes a static
per-CPU variable. Both have drawbacks, but they are not a bottleneck
anymore.

I also prefer to remove the global lock. Which variable do you think is
better?

Given the number of CPUs your system is configured for, there is no
good answer. An on-stack variable is dangerously large, and a per-CPU
cpumask results in 2MB being allocated, which I find insane.

Only if there are actually 4096 CPUs enumerated. The per CPU magic is
smart enough to limit the damage to the actual number of possible CPUs
which are enumerated at boot time. It will still over-allocate due to
NR_CPUS being insanely large, but on a 4 CPU machine this boils down to
2k of memory waste, unless Aaarg64 is stupid enough to allocate for
NR_CPUS instead of num_possible_cpus()...

No difference between arm64 and xyz85.999 here.


That said, on a real 4k CPU system 2M of memory should be the least of
your worries.

Don't underestimate the general level of insanity!
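
For reference, here is the arithmetic behind the numbers above, as a small
user-space sketch. In the kernel, struct cpumask is a bitmap of NR_CPUS bits
(declared via DECLARE_BITMAP), so the struct below is only a size stand-in,
not the real definition:

#include <stdio.h>

#define NR_CPUS 4096

/* Size stand-in for the kernel's NR_CPUS-bit cpumask bitmap. */
struct cpumask {
	unsigned long bits[NR_CPUS / (8 * sizeof(unsigned long))];
};

int main(void)
{
	size_t per_mask = sizeof(struct cpumask);

	printf("one cpumask:                  %zu bytes\n", per_mask);            /* 512 bytes */
	printf("one copy per NR_CPUS CPUs:    %zu bytes\n", per_mask * NR_CPUS);  /* ~2M */
	printf("one copy per 4 possible CPUs: %zu bytes\n", per_mask * 4);        /* ~2k */
	return 0;
}

The 2M figure assumes a copy for all 4096 configured CPUs; a static per-CPU
variable is instantiated once per possible CPU, which is where the 2k figure
for a 4 CPU machine comes from.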


You'll have to pick your own poison and convince Thomas of the
validity of your approach.

As this is an operation which is really not suitable for on-demand or
large stack allocations, the per CPU approach makes sense.

Right, so let's shoot for that. Kunkun, can you please give the
following hack a go with your workload?

I tested my workload with the patch below. It solved my problem.
Thank you very much.

Thanks,
Kunkun Jiang


Thanks,

M.

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dd53298ef1a5..b6aa259ac749 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -224,15 +224,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	const struct cpumask *prog_mask;
+	struct cpumask *tmp_mask;
 	int ret;
 
-	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-	static struct cpumask tmp_mask;
+	static DEFINE_PER_CPU(struct cpumask, __tmp_mask);
 
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	raw_spin_lock(&tmp_mask_lock);
+	tmp_mask = this_cpu_ptr(&__tmp_mask);
+
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -258,11 +259,11 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 
 		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 
-		cpumask_and(&tmp_mask, mask, hk_mask);
-		if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
+		cpumask_and(tmp_mask, mask, hk_mask);
+		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
 			prog_mask = mask;
 		else
-			prog_mask = &tmp_mask;
+			prog_mask = tmp_mask;
 	} else {
 		prog_mask = mask;
 	}
@@ -272,16 +273,14 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 * unless we are being asked to force the affinity (in which
 	 * case we do as we are told).
 	 */
-	cpumask_and(&tmp_mask, prog_mask, cpu_online_mask);
-	if (!force && !cpumask_empty(&tmp_mask))
-		ret = chip->irq_set_affinity(data, &tmp_mask, force);
+	cpumask_and(tmp_mask, prog_mask, cpu_online_mask);
+	if (!force && !cpumask_empty(tmp_mask))
+		ret = chip->irq_set_affinity(data, tmp_mask, force);
 	else if (force)
 		ret = chip->irq_set_affinity(data, mask, force);
 	else
 		ret = -EINVAL;
 
-	raw_spin_unlock(&tmp_mask_lock);
-
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE: