Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

From: Mi, Dapeng
Date: Thu Apr 25 2024 - 21:51:12 EST



On 4/26/2024 4:43 AM, Liang, Kan wrote:

On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:


On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:

On 4/24/2024 11:00 PM, Sean Christopherson wrote:
On Wed, Apr 24, 2024, Dapeng Mi wrote:
On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
Maybe, (just maybe), it is possible to do PMU context switch at vcpu
boundary normally, but doing it at VM Enter/Exit boundary when host is
profiling KVM kernel module. So, dynamically adjusting PMU context
switch location could be an option.
If there are two VMs with pmu enabled both, however host PMU is not
enabled. PMU context switch should be done in vcpu thread sched-out path.

If host pmu is used also, we can choose whether PMU switch should be
done in vm exit path or vcpu thread sched-out path.

host PMU is always enabled, ie., Linux currently does not support KVM
PMU running standalone. I guess what you mean is there are no active
perf_events on the host side. Allowing a PMU context switch drifting
from vm-enter/exit boundary to vcpu loop boundary by checking host
side events might be a good option. We can keep the discussion, but I
won't propose that in v2.
I suspect if it's really doable to do this deferring. This still makes host
lose the most of capability to profile KVM. Per my understanding, most of
KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
We have no idea when host want to create perf event to profile KVM, it could
be at any time.
No, the idea is that KVM will load host PMU state asap, but only when host PMU
state actually needs to be loaded, i.e. only when there are relevant host events.

If there are no host perf events, KVM keeps guest PMU state loaded for the entire
KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
for the guest while host perf events are active.
I see. So KVM needs to provide a callback which needs to be called in
the IPI handler. The KVM callback needs to be called to switch PMU state
before perf really enabling host event and touching PMU MSRs. And only
the perf event with exclude_guest attribute is allowed to create on
host. Thanks.
Do we really need a KVM callback? I think that is one option.

Immediately after VMEXIT, KVM will check whether there are "host perf
events". If so, do the PMU context switch immediately. Otherwise, keep
deferring the context switch to the end of vPMU loop.

Detecting if there are "host perf events" would be interesting. The
"host perf events" refer to the perf_events on the host that are
active and assigned with HW counters and that are saved when context
switching to the guest PMU. I think getting those events could be done
by fetching the bitmaps in cpuc.
The cpuc is ARCH specific structure. I don't think it can be get in the
generic code. You probably have to implement ARCH specific functions to
fetch the bitmaps. It probably won't worth it.

You may check the pinned_groups and flexible_groups to understand if
there are host perf events which may be scheduled when VM-exit. But it
will not tell the idx of the counters which can only be got when the
host event is really scheduled.

I have to look into the details. But
at the time of VMEXIT, kvm should already have that information, so it
can immediately decide whether to do the PMU context switch or not.

oh, but when the control is executing within the run loop, a
host-level profiling starts, say 'perf record -a ...', it will
generate an IPI to all CPUs. Maybe that's when we need a callback so
the KVM guest PMU context gets preempted for the host-level profiling.
Gah..

hmm, not a fan of that. That means the host can poke the guest PMU
context at any time and cause higher overhead. But I admit it is much
better than the current approach.

The only thing is that: any command like 'perf record/stat -a' shot in
dark corners of the host can preempt guest PMUs of _all_ running VMs.
So, to alleviate that, maybe a module parameter that disables this
"preemption" is possible? This should fit scenarios where we don't
want guest PMU to be preempted outside of the vCPU loop?

It should not happen. For the current implementation, perf rejects all
the !exclude_guest system-wide event creation if a guest with the vPMU
is running.
However, it's possible to create an exclude_guest system-wide event at
any time. KVM cannot use the information from the VM-entry to decide if
there will be active perf events in the VM-exit.
Hmm, why not? If there is any exclude_guest system-wide event,
perf_guest_enter() can return something to tell KVM "hey, some active
host events are swapped out. they are originally in counter #2 and
#3". If so, at the time when perf_guest_enter() returns, KVM will ack
that and keep it in its pmu data structure.
I think it's possible that someone creates !exclude_guest event after
the perf_guest_enter(). The stale information is saved in the KVM. Perf
will schedule the event in the next perf_guest_exit(). KVM will not know it.

Now, when doing context switching back to host at just VMEXIT, KVM
will check this data and see if host perf context has something active
(of course, they are all exclude_guest events). If not, deferring the
context switch to vcpu boundary. Otherwise, do the proper PMU context
switching by respecting the occupied counter positions on the host
side, i.e., avoid doubling the work on the KVM side.

Kan, any suggestion on the above approach?
I think we can only know the accurate event list at perf_guest_exit().
You may check the pinned_groups and flexible_groups, which tell if there
are candidate events.

Totally understand that
there might be some difficulty, since perf subsystem works in several
layers and obviously fetching low-level mapping is arch specific work.
If that is difficult, we can split the work in two phases: 1) phase
#1, just ask perf to tell kvm if there are active exclude_guest events
swapped out; 2) phase #2, ask perf to tell their (low-level) counter
indices.

If you want an accurate counter mask, the changes in the arch specific
code is required. Two phases sound good to me.

Besides perf changes, I think the KVM should also track which counters
need to be saved/restored. The information can be get from the EventSel
interception.

Yes, that's another optimization from guest point view. It's in our to-do list.



Thanks,
Kan
The perf_guest_exit() will reload the host state. It's impossible to
save the guest state after that. We may need a KVM callback. So perf can
tell KVM whether to save the guest state before perf reloads the host state.

Thanks,
Kan

My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom