Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

From: Purcareata Bogdan
Date: Mon Apr 27 2015 - 02:45:56 EST


On 24.04.2015 00:26, Scott Wood wrote:
On Thu, 2015-04-23 at 15:31 +0300, Purcareata Bogdan wrote:
On 23.04.2015 03:30, Scott Wood wrote:
On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:
On 21.04.2015 03:52, Scott Wood wrote:
On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
function is kvmppc_set_epr, which is a static inline. Removing the static inline
yields a compiler crash (Segmentation fault (core dumped) -
scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
but that's a different story, so I just let it be for now. The point is that the
measured time may include other work done after the lock has been released, but
before the function actually returned. I noticed this was the case for
.kvm_set_msi, which could run for up to 90 ms that were not actually spent under
the lock. This made me change what I was looking at.

kvm_set_msi does pretty much nothing outside the lock -- I suspect
you're measuring an interrupt that happened as soon as the lock was
released.

That's exactly right. I've seen things like a timer interrupt occurring right
after the spin_unlock_irqrestore, but before kvm_set_msi actually returned.
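
To avoid counting that, the end timestamp has to be taken while interrupts are
still disabled. A minimal sketch of the idea, with made-up names (demo_lock,
demo_timed_section), not the actual instrumentation from this thread:

#include <linux/ktime.h>
#include <linux/spinlock.h>
#include <linux/printk.h>

static DEFINE_SPINLOCK(demo_lock);

/* Measure only the critical section: sample both timestamps while
 * interrupts are still disabled, so an interrupt that fires right after
 * spin_unlock_irqrestore() is not counted in the measured interval. */
static void demo_timed_section(void)
{
	unsigned long flags;
	ktime_t start, end;

	spin_lock_irqsave(&demo_lock, flags);
	start = ktime_get();

	/* ... work done under the lock ... */

	end = ktime_get();	/* still under the lock, irqs still off */
	spin_unlock_irqrestore(&demo_lock, flags);

	pr_debug("lock held for %lld ns\n", ktime_to_ns(ktime_sub(end, start)));
}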

[...]

Or perhaps a different stress scenario involving a lot of VCPUs
and external interrupts?

You could instrument the MPIC code to find out how many loop iterations
you maxed out on, and compare that to the theoretical maximum.

The numbers are pretty low; I'll try to explain based on my observations.

The problematic section in openpic_update_irq is this [1], since it loops
through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops
through all pending interrupts for a VCPU [2].
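
For reference, the shape of the two loops can be sketched like this. This is a
simplified stand-in for the code at [1] and [2], not the actual mpic.c code;
MAX_VCPUS, MAX_IRQS and the struct/function names are made up for the
illustration:

#define MAX_VCPUS 32	/* illustrative sizes, not the emulation's limits */
#define MAX_IRQS  256

struct model_dest {
	int pending[MAX_IRQS];	/* raised IRQs for this VCPU */
	int prio[MAX_IRQS];	/* priority of each raised IRQ */
};

/* IRQ_check-like step: scan all pending interrupts of one destination
 * VCPU to find the highest-priority one. */
static int check_pending(const struct model_dest *d)
{
	int irq, best = -1, best_prio = -1;

	for (irq = 0; irq < MAX_IRQS; irq++) {
		if (d->pending[irq] && d->prio[irq] > best_prio) {
			best = irq;
			best_prio = d->prio[irq];
		}
	}
	return best;
}

/* openpic_update_irq-like step: in the non-unicast case the scan above
 * can run once per VCPU in the destination mask, all while the single
 * emulation lock is held - this is what bounds the worst case. */
static void update_irq(struct model_dest *dests, unsigned int destmask)
{
	int cpu;

	for (cpu = 0; cpu < MAX_VCPUS; cpu++)
		if (destmask & (1u << cpu))
			check_pending(&dests[cpu]);
}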

The guest interfaces are virtio-vhostnet, which are based on MSI
(/proc/interrupts in the guest shows they are MSI). For external interrupts to
the guest, the irq_source destmask is currently 0, and last_cpu is 0
(uninitialized), so [1] will go ahead and deliver the interrupt directly as a
unicast (no loop over VCPUs).

I activated the pr_debugs in arch/powerpc/kvm/mpic.c to see how many interrupts
are actually pending for the destination VCPU. At most, there were 3 interrupts
- n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that the
guest virtio interrupts are cascaded over one or a couple of shared MSI
interrupts.

So the worst case in this scenario was checking the priorities of 3 pending
interrupts for 1 VCPU. Something like this (some of my prints included):

[61010.582033] openpic_update_irq: destmask 1 last_cpu 0
[61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
[61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
[61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
[61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
[61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1

It would be really helpful to get your comments on whether these are realistic
numbers for everyday use, or whether they are relevant only to this particular
scenario.

RT isn't about "realistic numbers for everyday use". It's about worst
cases.

- Can these interrupts be used in directed delivery, so that the destination
mask can include multiple VCPUs?

The Freescale MPIC does not support multiple destinations for most
interrupts, but the (non-FSL-specific) emulation code appears to allow
it.

The MPIC manual states that timer and IPI interrupts support directed delivery,
although I'm not sure how much of this is used in the emulation. I know that
kvmppc uses the decrementer outside of the MPIC.

- How are virtio interrupts cascaded over the shared MSI interrupts?
/proc/device-tree/soc@e0000000/msi@41600/interrupts in the guest shows 8 values
- 224 through 231 - so at most there might be 8 pending interrupts in IRQ_check.
Is that correct?

It looks like that's currently the case, but actual hardware supports
more than that, so it's possible (albeit unlikely any time soon) that
the emulation eventually does as well.

But it's possible to have interrupts other than MSIs...

Right.

So given that the raw spinlock conversion is not suitable for all the scenarios
supported by the OpenPIC emulation, is it OK if my next step is to send a patch
containing both the raw spinlock conversion and a mandatory disable of the
in-kernel MPIC on PREEMPT_RT? This is actually the conclusion we came to some
time ago, but I guess it was good to get some more insight into how things
actually work (at least for me).

Fine with me. Have you given any thought to ways to restructure the
code to eliminate the problem?

My first thought would be to create a separate lock for each VCPU's pending interrupt queue, so that the whole openpic_update_irq path becomes more granular. However, this is just a very preliminary thought. Before I can come up with anything worthy of consideration, I must read the OpenPIC specification and the current KVM emulated OpenPIC implementation thoroughly. I currently have other things on my plate, and will come back to this once I have some time.
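
Purely as a sketch of that idea, something like the following (hypothetical
structure and names, not based on an actual patch; MAX_IRQ is made up):

#include <linux/spinlock.h>
#include <linux/bitops.h>
#include <linux/types.h>

#define MAX_IRQ 256	/* hypothetical, just for the sketch */

/* Hypothetical per-destination state: each VCPU's pending queue gets its
 * own raw lock, instead of one lock serializing the whole emulated MPIC. */
struct irq_dest_state {
	raw_spinlock_t lock;			/* protects this VCPU's queue only */
	DECLARE_BITMAP(pending, MAX_IRQ);	/* raised IRQs for this VCPU */
};

static void deliver_to_cpu(struct irq_dest_state *dst, int n_IRQ)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&dst->lock, flags);
	__set_bit(n_IRQ, dst->pending);
	/* ... recompute the highest-priority pending IRQ for this VCPU ... */
	raw_spin_unlock_irqrestore(&dst->lock, flags);
}

Whether the MPIC's global state can really be split this way is exactly what
needs to be checked against the spec and the existing code first.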

Meanwhile, I've sent a v2 of this raw_spinlock conversion to the PPC and RT mailing lists, alongside a change disabling the in-kernel MPIC emulation for PREEMPT_RT. I would be grateful to hear your feedback on it, so that it can get applied.
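
For anyone following the thread who has not seen the v2 yet, the conversion
itself is the usual mechanical pattern; roughly this (a sketch only, not the
actual patch, with a made-up struct name):

#include <linux/spinlock.h>

/* Only the lock type and the lock/unlock calls change. On PREEMPT_RT a
 * raw spinlock is not converted into a sleeping lock, so the sections it
 * protects must be provably short - which is why the in-kernel MPIC is
 * additionally disabled there for now. */
struct openpic_like {
	raw_spinlock_t lock;		/* was: spinlock_t lock; */
	/* ... emulated MPIC state ... */
};

static void mmio_write_handler(struct openpic_like *opp)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&opp->lock, flags);	/* was: spin_lock_irqsave() */
	/* ... update emulated MPIC state ... */
	raw_spin_unlock_irqrestore(&opp->lock, flags);	/* was: spin_unlock_irqrestore() */
}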

Thank you,
Bogdan P.