According to our analysis, this problem is caused by the execution of
vgic_v4_load():
vcpu_load or kvm_sched_in
  kvm_arch_vcpu_load
    ...
      vgic_v4_load
        irq_set_affinity
          ...
            irq_do_set_affinity
              raw_spin_lock(&tmp_mask_lock)
              chip->irq_set_affinity
                ...
                  its_vpe_set_affinity
The tmp_mask_lock is the key: it is a global lock, and I don't quite
understand why it is needed here. I think there are two possible
solutions:
1. Remove this tmp_mask_lock
Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
would allow you to understand the nature of the problem.
This can probably be replaced with a per-CPU cpumask, which would
avoid the locking, but potentially result in a larger memory usage.
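For a rough sense of that memory cost: struct cpumask is NR_CPUS bits
wide, so a per-CPU scratch mask adds NR_CPUS / 8 bytes for each possible
CPU, e.g. up to 1 KiB x 8192 = 8 MiB on a kernel built with
CONFIG_NR_CPUS=8192 (in practice it scales with the number of possible
CPUs, which is usually far smaller).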
Thanks, I will try it.
A simple alternative would be this:
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dd53298ef1a5..0d11b74af38c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -224,15 +224,12 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	const struct cpumask *prog_mask;
+	struct cpumask tmp_mask = {};
 	int ret;
-	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-	static struct cpumask tmp_mask;
-
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
-	raw_spin_lock(&tmp_mask_lock);
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -280,8 +277,6 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	else
 		ret = -EINVAL;
-	raw_spin_unlock(&tmp_mask_lock);
-
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
but that will eat a significant portion of your stack if your kernel is
configured for a large number of CPUs.
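To put a number on it: struct cpumask is NR_CPUS bits, so with e.g.
CONFIG_NR_CPUS=8192 the on-stack tmp_mask above costs 8192 / 8 = 1 KiB
per irq_do_set_affinity() call frame.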
2. Modify the GICv4 driver so that it does not perform VMOVP via
irq_set_affinity().
Sure. You could also not use KVM at all if you don't care about interrupts
being delivered to your VM. We do not send a VMOVP just for fun. We
send it because your vcpu has moved to a different CPU, and the ITS
needs to know about that.
When a vcpu is moved to a different CPU, of course a VMOVP has to be sent.
What I mean is: is it possible to call its_vpe_set_affinity() to send the
VMOVP by some other means (instead of through the irq_set_affinity() API),
so that we can bypass this tmp_mask_lock?
The whole point of this infrastructure is that the VPE doorbell is the
control point for the VPE. If the VPE moves, then the change of
affinity *must* be done using irq_set_affinity(). All the locking is
constructed around that. Please read the abundant documentation that
exists in both the GIC code and KVM describing why this is done like
that.
You seem to be misunderstanding the use case for GICv4: a partitioned
system, without any over-subscription, no vcpu migration between CPUs.
If that's not your setup, then GICv4 will always be a net loss
compared to SW injection with GICv3 (additional HW interaction,
doorbell interrupts).
Thanks for the explanation. The key to the problem is not vcpu migration
between CPUs; the key point is that many vcpus execute vgic_v4_load() at
the same time. Even when vcpus are not migrated to another CPU, a large
number of them may still execute vgic_v4_load() concurrently. For example,
the workload running in the VMs performs a large number of MMIO accesses
that have to return to userspace for emulation, so each exit and re-entry
goes through vcpu_put()/vcpu_load(). Due to the contention on
tmp_mask_lock, performance deteriorates.
That's only a symptom. And that doesn't affect only pathological VM
workloads, but all interrupts being moved around for any reason.
When the target CPU is the same CPU the vcpu last ran on, there seems to
be no need to call irq_set_affinity() at all. I did a test, and skipping
the call in that case did indeed alleviate the problem described above.
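A rough sketch of what such a check could look like in vgic_v4_load()
(the last_cpu field is hypothetical and only added here for illustration;
this is not the exact patch that was tested):

	struct its_vpe *vpe = &vcpu->arch.vgic_cpu.vgic_v3.its_vpe;
	int cpu = smp_processor_id();

	/*
	 * Sketch only: skip the affinity update (and therefore the
	 * VMOVP path and the tmp_mask_lock section) when the vcpu is
	 * loaded on the same physical CPU it last ran on. 'last_cpu'
	 * is a hypothetical field used purely for illustration.
	 */
	if (vpe->last_cpu != cpu) {
		err = irq_set_affinity(vpe->irq, cpumask_of(cpu));
		if (err)
			return err;
		vpe->last_cpu = cpu;
	}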
The premise is that irq_set_affinity() should be cheap when there
isn't much to do, and you are papering over the problem.
I feel it might be better to remove tmp_mask_lock or call
its_vpe_set_affinity() in another way. So I mentioned these two ideas
above.
The removal of this global lock is the only option in my opinion.
Either the cpumask becomes a stack variable, or it becomes a static
per-CPU variable. Both have drawbacks, but neither of them is a bottleneck
anymore.
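For completeness, a minimal sketch of the per-CPU flavour (illustration
only, not a tested patch; the irq_tmp_mask name is made up for this
example, and it assumes irq_do_set_affinity() always runs with interrupts
disabled under the irq_desc lock, so the per-CPU mask is never re-entered
on the same CPU):

/* file scope in kernel/irq/manage.c */
static DEFINE_PER_CPU(struct cpumask, irq_tmp_mask);

	/* inside irq_do_set_affinity(), replacing tmp_mask + tmp_mask_lock */
	struct cpumask *tmp_mask = this_cpu_ptr(&irq_tmp_mask);

	/* the managed-irq/housekeeping branch would use *tmp_mask the same way */
	cpumask_and(tmp_mask, mask, cpu_online_mask);
	prog_mask = tmp_mask;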