Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
From: Peter Zijlstra
Date: Mon Sep 25 2017 - 04:58:49 EST
On Sun, Sep 24, 2017 at 11:22:38PM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote:
> >
> > > Are you arguing it's invalid for the following application to execute on
> > > a housekeeping vcpu of a realtime system:
> > >
> > > void main(void)
> > > {
> > >         submit_IO();
> > >         do {
> > >                 computation();
> > >         } while (!interrupted());
> > > }
> > >
> > > Really?
> >
> > No. Nobody cares about random crap tasks.
>
> Nobody has control over all code that runs in userspace, Peter. And not
> supporting a valid sequence of steps because it's "crap" (whatever your
> definition of crap is) makes no sense.
>
> It might be that someone decides to do the above (I really can't see
> any actual reasoning I can follow and agree with in your "it's crap"
> argument); this truly seems valid to me.
We don't care what other tasks do. This isn't a hard thing to
understand. You're free to run whatever junk on your CPUs. This doesn't
(much) affect the correct functioning of RT tasks that you also run
there.
> So let's follow the reasoning steps:
>
> 1) "NACK, because you didn't understand the problem".
>
> OK, that's an invalid NACK; you did understand the problem
> later and now your argument is the following.
It was a NACK because you wrote a shit changelog that didn't explain the
problem. But yes.
> 2) "NACK, because all VCPUs should be SCHED_FIFO all the time".
Very much. If you want an RT guest, all VCPUs should run at RT prio and
the interaction between the VCPUs and all supporting threads should be
designed for RT.
> But the existence of this code path from userspace:
>
>         submit_IO();
>         do {
>                 computation();
>         } while (!interrupted());
>
> It's a supported code sequence, and works fine in a non-RT environment.
Who cares about that chunk of code? Have you forgotten to mention that
this is the form of the emulation thread?
> Therefore it should work on an -RT environment.
No, this is where you're wrong. That code works on -RT as long as you
don't expect it to be a valid RT program. -RT kernels will run !RT stuff
just fine.
But the moment you run a program as RT (FIFO/RR/DEADLINE) it had better
damn well be a valid RT program, and that excludes a lot of code.
> So please give me some logical reasoning for the NACK (people can live with
> it, but it has to be good enough to justify the decreasing packing of
> guests in pCPUs):
>
> 1) "Voodoo programming" (it's hard for me to parse what you mean by
> that... do you mean you foresee this style of priority boosting causing
> problems in the future? Can you give an example?).
Your 'solution' only works if you sacrifice a goat on a full moon,
because only that ensures the guest doesn't VM_EXIT and cause the
self-same problem while you've boosted it.
Because you've _not_ fixed the actual problem!
> Is there something fundamentally wrong about priority boosting in spinlock
> sections, or is this particular style of priority boosting wrong?
Yes, it's fundamentally crap, because it doesn't guarantee anything.
RT is about making guarantees. An RT program needs a provable forward
progress guarantee at the very least. Including a priority inversion
disqualifies it from being sane.
> 2) "Pollution of the kernel code path". That makes sense to me, if that's
> what you're concerned about.
Also..
> 3) "Reduction of spinlock performance". It's true, but for NFV workloads
> people don't care about that.
I've no idea what an NFV is.
> 4) "All vcpus should be SCHED_FIFO all the time". OK, why is that?
> What dictates that to be true?
Solid engineering. Does the guest kernel function as a bunch of
independent CPUs or does it assume all CPUs are equal and have strong
inter-cpu connections? Linux is the latter, therefore if one VCPU is RT
they all should be.
Dammit, you even recognise this in the spin-owner preemption issue
you're hacking around, but then go arse-about-face 'solving' it.
> What the patch does is the following:
> It reduces the window where SCHED_FIFO is applied to vcpu0
> to those where a spinlock is shared between -RT vcpus and vcpu0
> (why: because otherwise, when the emulator thread is sharing a
> pCPU with vcpu0, it's unable to generate interrupts for vcpu0).
>
> And its being rejected because:
It's not fixing the actual problem. The real problem is the prio
inversion between the VCPU and the emulation thread. _That_ is what
needs fixing.
Rewrite that VCPU/emulator interaction to be a proper RT construct.
Then you can run the VCPU at RT prio as you should, and the guest can
issue all the VM_EXIT things it wants at any time and still function
correctly.
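
To make "a proper RT construct" a bit more concrete, here is a minimal
userspace sketch, assuming (purely for illustration) that the VCPU thread
and the emulator thread contend on a POSIX mutex. With
PTHREAD_PRIO_INHERIT, a low-priority lock holder temporarily inherits the
priority of the highest-priority waiter, which is one standard way to
bound a priority inversion. pi_mutex_init() and pi_mutex_smoke_test() are
hypothetical names; nothing here is taken from the patch, KVM or QEMU, and
the real VCPU/emulator interaction may need a different mechanism.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/*
 * Hypothetical helper (an assumption, not from the patch): initialise a
 * mutex with the priority-inheritance protocol, so that whoever holds it
 * is boosted to the priority of the highest-priority thread blocked on it.
 * Returns 0 on success or a pthread error number.
 */
int pi_mutex_init(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;

	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	if (!ret)
		ret = pthread_mutex_init(m, &attr);

	/* The attribute object is no longer needed once the mutex exists. */
	pthread_mutexattr_destroy(&attr);
	return ret;
}

/* Hypothetical smoke test: create the PI mutex, take and release it once. */
void pi_mutex_smoke_test(void)
{
	pthread_mutex_t m;

	assert(pi_mutex_init(&m) == 0);
	assert(pthread_mutex_lock(&m) == 0);
	assert(pthread_mutex_unlock(&m) == 0);
	assert(pthread_mutex_destroy(&m) == 0);
}
```

The point of the sketch is only that the boost is tied to the lock itself
and happens whenever a higher-priority thread blocks on it, rather than
depending on the holder asking for a boost at the right moment.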