Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
From: Marcelo Tosatti
Date: Wed Sep 27 2017 - 20:51:21 EST
On Wed, Sep 27, 2017 at 11:37:48AM +0200, Paolo Bonzini wrote:
> On 27/09/2017 00:49, Marcelo Tosatti wrote:
> > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
> >> On 25/09/2017 11:13, Peter Zijlstra wrote:
> >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> >>>> I think you are missing the following point:
> >>>>
> >>>> "vcpu0 can be interrupted when its not in a spinlock protected section,
> >>>> otherwise it can't."
> >>
> >> Who says that? Certainly a driver can dedicate a single VCPU to
> >> periodic polling of the device, in such a way that the polling does not
> >> require a spinlock.
> >
> > This sequence:
> >
> >
> > VCPU-0 VCPU-1 (running realtime workload)
> >
> > takes spinlock A
> > scheduled out
> > spinlock(A) (busy spins until
> > VCPU-0 is scheduled
> > back in)
> > scheduled in
> > finishes execution of
> > code under protected section
> > releases spinlock(A)
> >
> > takes spinlock(A)
>
> Yes, but then you have
>
> busy waits for flag to be set
> polls device
> scheduled out
> device receives data
> ...
> scheduled in
> set flag
>
> or
>
> check for work to do
> scheduled out
> submit work
> busy waits for work to complete
> scheduled in
> do work
>
> None of which have anything to do with spinlocks. They're just
> different ways to get priority inversion, and this...
>
> >>>> So this point of "everything needs to be RT and the priorities must be
> >>>> designed carefully", is this:
> >>>>
> >>>> WHEN in spinlock protected section (more specifically, when
> >>>> spinlock protected section _shared with realtime vcpus_),
> >>>>
> >>>> priority of vcpu0 > priority of emulator thread
> >>>>
> >>>> OTHERWISE
> >>>>
> >>>> priority of vcpu0 < priority of emulator thread.
> >>
> >> This is _not_ designed carefully, this is messy.
> >
> > This is very precise to me. What is "messy" about it? (its clearly
> > defined).
>
> ... it's not how you design RT systems to avoid priority inversion.
> It's just solving an instance of the issue.
>
> >> The emulator thread can interrupt the VCPU thread, so it has to be at
> >> higher RT priority (+ priority inheritance of mutexes).
> >
> > It can only do that _when_ the VCPU thread is not running a critical
> > section which a higher priority task depends on.
>
> All VCPUs must have the same priority. There's no such thing as a
> housekeeping VCPU.
>
> There could be one sacrificial VCPU that you place on the same physical
> CPU as the emulator thread, but that's it. The whole system must run
> smoothly even if you place those on the same physical CPU, without any
> PV hacks.
>
> > So when you say "The emulator thread can interrupt the VCPU thread",
> > you're saying that it has to be modified to interrupt for a maximum
> > amount of time (say 15us).
> >
> > Is that what you are suggesting?
>
> This is correct. But I think you are missing a fundamental point, that
> is not specific to virt. If a thread 1 can trigger an event in thread
> 2, and thread 2 runs at priority N, thread 1 must be placed at priority >N.
>
> In this case, the emulator thread can signal I/O completion to the VCPU:
> it doesn't matter if the VCPU is polling or using interrupts, the
> emulator thread must be placed at higher priority than the VCPU. This
> is the root cause of the SeaBIOS issue that you were seeing. Yes, it
> was me who suggested moving VCPU0 and the emulator thread to
> SCHED_NORMAL, but that was just a bandaid until the real fix was done
> (which is to set up SCHED_FIFO priorities correctly and use PI mutexes).
>
> In fact, this is not specific to the emulator thread. It applies just
> as well to vhost threads, to QEMU iothreads, even to the KVM PIT
> emulation thread if the guest were using it.
>
> > Paolo, you don't control how many interruptions of the emulator thread
> > happen per second.
>
> Indeed these non-VCPU threads should indeed run rarely and for a small
> amount of time only to achieve bounded latencies in the VCPUs. But if
> they don't, those are simply bugs and we fix them. In fact, sources of
> frequent interruptions have all been fixed or moved outside the main
> thread; for example, disks can use separate iothreads. Configuration
> problems are also bugs, just in a different place.
>
> For a very basic VM that I've just tried (whatever QEMU does by default,
> only tweak was not using the QEMU GUI), there is exactly one
> interruption per second when the VM is idle. That interruption in turn
> is caused by udev periodically probing the guest CDROM (!). So that's a
> configuration problem: remove the CDROM, and it does one interruption
> every 10-30 seconds, again all of them guest triggered.
>
> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> design, it's just that someone is doing something stupid. It could be
> the guest (e.g. unnecessary devices or daemons as in the example above),
> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> just to increment the clock), or the management (e.g. polling "is the VM
> running" 50 times per second). But it can and must be fixed.
No, i mean you can run anything in VCPU-0 (it is valid to do that).
And that "anything" can generate 1 interrupt per second, 1000 or 10.000
interrupts per second. Which are all valid things to be done.
"I can't run a kernel compilation on VCPU-0 because that will impact
latency on the realtime VCPU-1" is not acceptable.
> The design should be the basic one: attribute the right priority to each
> thread and things just work.
>
> Paolo
>
> > So if you let the emulator thread interrupt the
> > emulator thread at all times, without some kind of bounding
> > of these interruptions per time unit, you have a similar
> > problem as (*) (where the realtime task is scheduled).
> >