Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running

From: Peter Zijlstra
Date: Mon Jun 10 2024 - 16:54:34 EST


On Thu, Jun 06, 2024 at 12:57:35AM -0700, Stephane Eranian wrote:
> Hi Peter,
>
> In the context of the new vPMU passthru patch series, we have to look
> closer at the definition and implementation of the exclude_guest
> filter in the perf_event_attr structure. This filter has been in the
> kernel for many years. See patch:
> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@xxxxxxxxxx/
>
> The presumed definition of the filter is that the user does not want
> the event to count while the processor is running in guest mode (i.e.,
> inside the virtual machine guest OS or guest user code).
>
> The perf tool sets is by default on all core PMU events:
> $ perf stat -vv -e cycles sleep 0
> ------------------------------------------------------------
> perf_event_attr:
> size 112
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
>
> In the kernel, the way this is treated differs between AMD and Intel
> because AMD does provide a hardware filter for guest vs. host in the
> PMU counters whereas Intel does not. For the latter, the kernel
> simply disables the event in the hardware counters, i.e., the event is
> not descheduled. Both approaches produce pretty much the same desired
> effect, the event is not counted while in guest mode.
>
> The issue I would like to raise has to do with the effects on
> time_enabled and time_running for exclude_guest=1 events.
>
> Given the event is not scheduled out while in guest mode, even though
> it is stopped, both time_enabled and time_running continue ticking
> while in guest mode. If a measurement is 10s long but only 5s are in
> non-guest mode, then time_enabled=10s, time_running=10s. The count
> represents 10s worth of non guest mode, of which only 5s were really
> actively monitoring, but the user has no way of determining this.
>
> If we look at vPMU passthru, the host event must have exclude_guest=1
> to avoid going into an error state on context switch to the vCPU
> thread (with vPMU enabled). But this time, the event is scheduled out,
> that means that time_enabled keeps counting, but time_running stops.
> On context switch back in, the host event is scheduled again and
> time_running restarts ticking. For a 10s measurement, where 5s here in
> the guest, the event will come out with time_enabled=10s,
> time_running=5s, and the tool will scale it up because it thinks the
> event was multiplexed, when in fact it was not. This is not the
> intended outcome here. The tool should not scale the count, it was not
> multiplexed, it was descheduled because the filter forced it out.
> Note that if the event had been multiplexed while running on the host,
> then the scaling would be appropriate.
>
> In that case, I argue, time_running should be updated to cover the
> time the event was not running. That would bring us back to the case I
> was describing earlier.
>
> It boils down to the exact definition of exclude_guest and expected
> impact on time_enabled and time_running. Then, with or without vPMU
> passthru, we can fix the kernel to ensure a uniform behavior.
>
> What are your thoughts on this problem?

So with those patches having explicit scheduling points, we can actually
do this time accounting accurately, so I don't see a reason to not do
the right thing here.

Hysterically this was left vague in order to be able to avoid the
scheduling for these scenarios -- performance raisins etc.

The thing is, if you push this to its limits, we should start time
accounting for the ring selectors too, and that's going to be painful.