[RFC] perf_events: exclude_guest impact on time_enabled/time_running

From: Stephane Eranian
Date: Thu Jun 06 2024 - 03:58:36 EST


Hi Peter,

In the context of the new vPMU passthru patch series, we have to look
closer at the definition and implementation of the exclude_guest
filter in the perf_event_attr structure. This filter has been in the
kernel for many years. See patch:
https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@xxxxxxxxxx/

The presumed definition of the filter is that the user does not want
the event to count while the processor is running in guest mode (i.e.,
inside the virtual machine guest OS or guest user code).

The perf tool sets is by default on all core PMU events:
$ perf stat -vv -e cycles sleep 0
------------------------------------------------------------
perf_event_attr:
size 112
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------

In the kernel, the way this is treated differs between AMD and Intel because AMD
does provide a hardware filter for guest vs. host in the PMU counters
whereas Intel
does not. For the latter, the kernel simply disables the event in the
hardware counters,
i.e., the event is not descheduled. Both approaches produce pretty
much the same
desired effect, the event is not counted while in guest mode.

The issue I would like to raise has to do with the effects on
time_enabled and time_running
for exclude_guest=1 events.

Given the event is not scheduled out while in guest mode, even though
it is stopped, both time_enabled and time_running continue ticking
while in guest mode. If a measurement is 10s
long but only 5s are in non-guest mode, then time_enabled=10s,
time_running=10s. The count
represents 10s worth of non guest mode, of which only 5s were really
actively monitoring, but
the user has no way of determining this.

If we look at vPMU passthru, the host event must have exclude_guest=1
to avoid going into
an error state on context switch to the vCPU thread (with vPMU
enabled). But this time,
the event is scheduled out, that means that time_enabled keeps
counting, but time_running
stops. On context switch back in, the host event is scheduled again
and time_running restarts
ticking. For a 10s measurement, where 5s here in the guest, the event
will come out with
time_enabled=10s, time_running=5s, and the tool will scale it up
because it thinks the event
was multiplexed, when in fact it was not. This is not the intended
outcome here. The tool should
not scale the count, it was not multiplexed, it was descheduled
because the filter forced it out.
Note that if the event had been multiplexed while running on the host,
then the scaling would be
appropriate.

In that case, I argue, time_running should be updated to cover the
time the event was not running. That would bring us back to the case I
was describing earlier.

It boils down to the exact definition of exclude_guest and expected
impact on time_enabled
and time_running. Then, with or without vPMU passthru, we can fix the
kernel to ensure a
uniform behavior.

What are your thoughts on this problem?