Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

From: Mel Gorman
Date: Mon Apr 03 2023 - 06:19:42 EST


On Fri, Mar 31, 2023 at 01:49:48AM +0100, Matthew Wilcox wrote:
> On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > Hi,
> >
> > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > of the talk is that workloads running in a guest VM get terrible task
> > placement and DVFS behavior when compared to running the same workload in
>
> DVFS? Some new filesystem, perhaps?
>

Dynamic Voltage and Frequency Scaling (DVFS) -- it's a well-known term in
cpufreq/cpuidle/schedutil land.

> > the host. Effectively, no EAS for threads inside VMs. This would make power
>
> EAS?
>

Energy Aware Scheduling (EAS) is mostly a kernel/sched thing that has an
impact on cpufreq, and my recollection is that it was discussed at
conferences long before kernel/sched had any EAS awareness. I don't have
the full series in my inbox and didn't dig further, but patch 1 at least is
providing additional information to schedutil, which impacts CPU frequency
selection to varying degrees. The full impact would depend on which cpufreq
driver is in use and on the specific hardware, so even if the series
benefits one set of hardware, it's not necessarily a guaranteed win.
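
For context, schedutil's frequency selection is roughly proportional to
utilisation plus ~25% headroom, so whatever additional information patch 1
feeds into schedutil can translate fairly directly into different frequency
requests. A simplified sketch of the relation in
kernel/sched/cpufreq_schedutil.c:get_next_freq() (not the exact code, and
the function name below is made up):

	/* Sketch only: frequency request proportional to util, ~25% headroom. */
	static unsigned long sugov_sketch_next_freq(unsigned long util,
						    unsigned long max_cap,
						    unsigned long max_freq)
	{
		util += util >> 2;		/* ~1.25x headroom */
		return max_freq * util / max_cap;
	}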

> Two unfamiliar and undefined acronyms in your opening paragraph.
> You're not making me want to read the rest of your opus.

It depends on the audience, and mm/ is not the audience. VM in the title
refers to Virtual Machine, not Virtual Memory, although I confess I
originally read it as mm/ and wondered how mm/ affects DVFS, to the extent
that it triggered a "wtf happened in mm/ recently that I completely
missed?" moment. This series is mostly of concern to scheduler, cpufreq or
KVM, depending on your perspective.

For example, on KVM, I'd immediately wonder if the hypercall overhead
exceeds any benefit from better task placement, although the cover letter
suggests the answer is "no". However, it didn't comment (or I didn't read
carefully enough) on whether the overhead of MMIO or other alternative
communication methods is constant across different hardware or, much more
likely, depends on the hardware, in which case such methods could be
opt-in. Various cpufreq hardware has very different costs for measuring or
altering CPU frequency, even across different generations of chips from
the same vendor.
While the data also shows performance improvements, it doesn't indicate how
close to bare metal the improvement gets. Even if it's 50% faster within a
VM, how much slower than bare metal is it? In terms of data presentation,
it might be better to assign bare metal a score of 1 as the best possible
score and show the VM performance as a ratio relative to that (1.00 for
bare metal, 0.50 for a VM with a vanilla kernel, 0.75 for a VM with
improved task placement).
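
Purely as an illustration of the layout, reusing the numbers above,
something like

	                                    Ratio vs bare metal
	bare metal                                  1.00
	VM, vanilla guest kernel                    0.50
	VM, improved task placement                 0.75

makes it obvious at a glance how much of the gap to bare metal remains.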

It would also be preferable to have x86-64 data, as the hazards the series
deals with affect both arm64 and x86-64, and x86-64 has the additional
challenge that cpufreq is often managed by the hardware, so it should be
demonstrated that the series "does no harm" on recent generations of Intel
and AMD chips if possible. The lack of that data doesn't kill the series as
a large improvement is still very interesting even if it's imperfect and
possibly specific to arm64.

If this *was* my area or I happened to be paying close attention to it at
the time, I would likely favour using hypercalls only at the start because
they can be used universally, and suggest adding alternative communication
methods later using the same metric: "is an alternative method of
Guest<->Host communication worse, neutral or better at getting close to
bare metal performance?" I'd also push for the ratio tables as it's easier
to see at a glance how close to bare metal performance the series gets.
Finally, I would look for x86-64 data just in case the series causes harm
due to hypercall overhead on chips that manage frequency in firmware.

So while I haven't read the series and only patches 2+6 reached my inbox,
I understand the point in principle. The scheduler's wakeup paths on bare
metal also try to favour recently used CPUs and avoid spurious CPU
migrations, even though that is only tangentially related to EAS. For
example, a recently used CPU may still be polling
(drivers/cpuidle/poll_state.c:poll_idle) or at least may not have entered
a deep C-state, so the wakeup penalty is lower.
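
As a conceptual sketch only (the real logic lives in
kernel/sched/fair.c:select_idle_sibling() and is considerably more
involved; the function names below are made up), the wakeup path
effectively does something like:

	/*
	 * Conceptual sketch, not the actual kernel code: prefer the CPU the
	 * task last ran on if it is still idle, since it has warm caches and
	 * may not have entered a deep C-state, so waking there is cheaper
	 * than migrating.
	 */
	static int pick_wakeup_cpu_sketch(struct task_struct *p, int prev_cpu)
	{
		if (available_idle_cpu(prev_cpu))
			return prev_cpu;

		return find_idle_cpu_elsewhere(p);	/* hypothetical fallback */
	}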

So whatever criticism the series deserves, it's not for using obscure
terms that no one in kernel/sched/, drivers/cpuidle or drivers/cpufreq
would recognise.

--
Mel Gorman
SUSE Labs