Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

From: Quentin Perret
Date: Thu Apr 06 2023 - 08:53:03 EST


On Wednesday 05 Apr 2023 at 14:07:18 (-0700), Saravana Kannan wrote:
> On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team
> > And I concur with all the above as well. Putting this in the kernel is
> > not an obvious fit at all as that requires a number of assumptions about
> > the VMM.
> >
> > As Oliver pointed out, the guest topology, and how it maps to the host
> > topology (vcpu pinning etc) is very much a VMM policy decision and will
> > be particularly important to handle guest frequency requests correctly.
> >
> > In addition to that, the VMM's software architecture may have an impact.
> > Crosvm for example does device emulation in separate processes for
> > security reasons, so it is likely that adjusting the scheduling
> > parameters ('util_guest', uclamp, or else) only for the vCPU thread that
> > issues frequency requests will be sub-optimal for performance, we may
> > want to adjust those parameters for all the tasks that are on the
> > critical path.
> >
> > And at an even higher level, assuming in the kernel a certain mapping of
> > vCPU threads to host threads feels kinda wrong, this too is a host
> > userspace policy decision I believe. Not that anybody in their right
> > mind would want to do this, but I _think_ it would technically be
> > feasible to serialize the execution of multiple vCPUs on the same host
> > thread, at which point the util_guest thingy becomes entirely bogus. (I
> > obviously don't want to conflate this use-case, it's just an example
> > that shows the proposed abstraction in the series is not a perfect fit
> > for the KVM userspace delegation model.)
>
> See my reply to Oliver and Marc. To me it looks like we are converging
> towards having shared memory between guest, host kernel and VMM and
> that should address all our concerns.

Hmm, that is not at all my understanding of what has been the most
important part of the feedback so far: this whole thing belongs in
userspace.

> The guest will see a MMIO device, writing to it will trigger the host
> kernel to do the basic "set util_guest/uclamp for the vCPU thread that
> corresponds to the vCPU" and then the VMM can do more on top as/if
> needed (because it has access to the shared memory too). Does that
> make sense?

Not really, no. I've given examples of why it doesn't make sense for the
kernel to do this, and they still seem to apply to what you're
suggesting here.

> Even in the extreme example, the stuff the kernel would do would still
> be helpful, but not sufficient. You can aggregate the
> util_guest/uclamp and do whatever from the VMM.
> Technically in the extreme example, you don't need any of this. The
> normal util tracking of the vCPU thread on the host side would be
> sufficient.
>
> Actually any time we have only 1 vCPU host thread per VM, we shouldn't
> be using anything in this patch series and not instantiate the guest
> device at all.

> > So +1 from me to move this as a virtual device of some kind. And if the
> > extra cost of exiting all the way back to userspace is prohibitive (is
> > it btw?),
>
> I think the "13% increase in battery consumption for games" makes it
> pretty clear that going to userspace is prohibitive. And that's just
> one example.

I beg to differ. We need to understand in more detail where this 13%
comes from. Is it really the actual cost of the userspace exit? Or is it
just that from userspace the only knob you can play with is uclamp, and
that didn't reach the expected level of performance?

If it is the userspace exit, then we can work to optimize that -- it's
a fairly common problem in the virt world, nothing special here.

And if the issue is the lack of expressiveness in uclamp, then that too
is something we should work on, but clearly giving vCPU threads more
'power' than normal host threads is a bit of a red flag IMO. vCPU
threads must be constrained in the same way that userspace threads are,
because they _are_ userspace threads.
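
To make the "they are userspace threads" point concrete, here is a
minimal sketch of how a VMM could already clamp one of its own vCPU
threads with the existing sched_setattr(2) syscall. The struct name
vcpu_sched_attr and the helpers fill_uclamp_attr() and
set_thread_uclamp() are hypothetical names made up for this example;
the flag values and the 56-byte attribute layout are the ones from the
kernel UAPI headers. It assumes a host kernel built with
CONFIG_UCLAMP_TASK, otherwise the syscall is expected to fail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Flag values from include/uapi/linux/sched.h, guarded in case libc has them. */
#ifndef SCHED_FLAG_KEEP_POLICY
#define SCHED_FLAG_KEEP_POLICY    0x08
#endif
#ifndef SCHED_FLAG_KEEP_PARAMS
#define SCHED_FLAG_KEEP_PARAMS    0x10
#endif
#ifndef SCHED_FLAG_UTIL_CLAMP_MIN
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#endif
#ifndef SCHED_FLAG_UTIL_CLAMP_MAX
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
#endif

/* uclamp values live in [0, 1024], like CPU capacity. */
#define UCLAMP_SCALE 1024u

/* Hypothetical local mirror of the kernel's struct sched_attr (uclamp layout). */
struct vcpu_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Build an attribute that only touches the clamps, not policy or priority. */
void fill_uclamp_attr(struct vcpu_sched_attr *attr,
		      uint32_t util_min, uint32_t util_max)
{
	assert(util_min <= util_max && util_max <= UCLAMP_SCALE);

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
			    SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
	attr->sched_util_min = util_min;
	attr->sched_util_max = util_max;
}

/* Apply the clamps to one thread (tid 0 means the calling thread). */
int set_thread_uclamp(pid_t tid, uint32_t util_min, uint32_t util_max)
{
	struct vcpu_sched_attr attr;

	fill_uclamp_attr(&attr, util_min, util_max);
	/* Returns 0 on success, -1 with errno set on failure. */
	return (int)syscall(SYS_sched_setattr, tid, &attr, 0u);
}
```

A VMM would call something like set_thread_uclamp(vcpu_tid, 512, 1024)
from its handler for the guest request, and could do the same for any
other task on the critical path. Nothing here is specific to vCPU
threads, which is the point.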

> > then we can try to work on that. Maybe something a la vhost
> > can be done to optimize, I'll have a think.
> >
> > > The one thing I'd like to understand that the comment seems to imply
> > > that there is a significant difference in overhead between a hypercall
> > > and an MMIO. In my experience, both are pretty similar in cost for a
> > > handling location (both in userspace or both in the kernel). MMIO
> > > handling is a tiny bit more expensive due to a guaranteed TLB miss
> > > followed by a walk of the in-kernel device ranges, but that's all. It
> > > should hardly register.
> > >
> > > And if you really want some super-low latency, low overhead
> > > signalling, maybe an exception is the wrong tool for the job. Shared
> > > memory communication could be more appropriate.
> >
> > I presume some kind of signalling mechanism will be necessary to
> > synchronously update host scheduling parameters in response to guest
> > frequency requests, but if the volume of data requires it then a shared
> > buffer + doorbell type of approach should do.
>
> Part of the communication doesn't need synchronous handling by the
> host. So, what I said above.

I've also replied to another message about the scale invariance issue,
and I'm not convinced the frequency-based interface proposed here really
makes sense. An AMU-like interface is very likely to be superior.
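
As a rough illustration of what I mean by AMU-like: with two
monotonic counters, one counting core cycles and one counting at a
constant reference rate, a frequency scale factor can be derived from
what actually ran over a window rather than from what was requested.
The sketch below is only that, a sketch: guest_freq_scale() and its
parameters are hypothetical names, the 1024 scale mirrors the
scheduler's capacity scale, and it assumes short windows (a tick or
so) such that the intermediate product does not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SCALE 1024u

/*
 * Derive a frequency scale factor in [0, 1024] from counter deltas.
 *
 * d_core:  core cycles elapsed over the window
 * d_const: constant-rate reference cycles elapsed over the same window
 * const_hz: rate of the reference counter
 * max_hz:  maximum frequency of the CPU being described
 */
uint64_t guest_freq_scale(uint64_t d_core, uint64_t d_const,
			  uint64_t const_hz, uint64_t max_hz)
{
	uint64_t avg_hz, scale;

	assert(d_const > 0 && max_hz > 0);

	/* Average frequency the core actually ran at during the window. */
	avg_hz = d_core * const_hz / d_const;

	/* Express it relative to the maximum, on the capacity scale. */
	scale = avg_hz * CAPACITY_SCALE / max_hz;

	/* Clamp in case the core ran above the advertised maximum. */
	return scale > CAPACITY_SCALE ? CAPACITY_SCALE : scale;
}
```

For example, with a 100 MHz reference counter and a 4 ms window
(400000 reference cycles), a core that retired 4000000 cycles ran at
1 GHz on average, which is half of a 2 GHz maximum.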

> > Thinking about it, using SCMI over virtio would implement exactly that.
> > Linux-as-a-guest already supports it IIRC, so possibly the problem
> > being addressed in this series could be 'simply' solved using an SCMI
> > backend in the VMM...
>
> This will be worse than all the options we've tried so far because it
> has the userspace overhead AND uclamp overhead.

But it doesn't violate the whole KVM userspace delegation model, so we
should start from there and then optimize further if need be.

Thanks,
Quentin