Re: [PATCH] KVM: x86/pmu: Prevent any host user from enabling PEBS for profiling guest

From: Liang, Kan
Date: Tue Dec 05 2023 - 10:19:22 EST




On 2023-12-05 2:24 a.m., Like Xu wrote:
>
>
> On 4/12/2023 11:19 pm, Liang, Kan wrote:
>>
>>
>> On 2023-12-04 3:32 a.m., Like Xu wrote:
>>>
>>>
>>> On 1/12/2023 10:38 pm, Liang, Kan wrote:
>>>>
>>>>
>>>> On 2023-11-30 10:59 p.m., Like Xu wrote:
>>>>> On 30/11/2023 11:49 pm, Liang, Kan wrote:
>>>>>>
>>>>>>
>>>>>> On 2023-11-30 2:29 a.m., Like Xu wrote:
>>>>>>> On 29/11/2023 10:38 pm, Liang, Kan wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2023-11-29 4:50 a.m., Like Xu wrote:
>>>>>>>>> From: Like Xu <likexu@xxxxxxxxxxx>
>>>>>>>>>
>>>>>>>>> Stop using PEBS counters on host to profiling guest. Limit the
>>>>>>>>> range of
>>>>>>>>> enabled PEBS counters to only those counters enabled from the
>>>>>>>>> guest
>>>>>>>>> PEBS
>>>>>>>>> emulation perspective.
>>>>>>>>>
>>>>>>>>> If there is a perf-record agent on host that uses perf-tools
>>>>>>>>> events
>>>>>>>>> like
>>>>>>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event
>>>>>>>>> counter)
>>>>>>>>> to capture guest performance events, then the guest will be
>>>>>>>>> hanged.
>>>>>>>>> This is
>>>>>>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit
>>>>>>>>> linear
>>>>>>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA.
>>>>>>>>>
>>>>>>>>> Any perf user using PEBS counters to profile guest on host is, in
>>>>>>>>> perf/core
>>>>>>>>> implementation details, trying to set bits on
>>>>>>>>> cpuc->intel_ctrl_guest_mask
>>>>>>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation
>>>>>>>>> behaviour.
>>>>>>>>> But the subsequent PEBS memory write, regardless of whether guest
>>>>>>>>> PEBS is
>>>>>>>>> enabled, can overshoot guest entry and corrupt guest memory.
>>>>>>>>>
>>>>>>>>> Profiling guest via PEBS-DS buffer on host is not supported at
>>>>>>>>> this
>>>>>>>>> time.
>>>>>>>>> Fix this by filtering the real configured value of
>>>>>>>>> arr[pebs_enable].guest
>>>>>>>>> with the emulated state of guest enabled PEBS counters, under the
>>>>>>>>> condition
>>>>>>>>> of none cross-mapped PEBS counters.
>>>>>>>>
>>>>>>>> So the counter will be silently disabled. The user never knows why
>>>>>>>> nothing is sampled.
>>>>>>>> Since we don't support the case, profiling guest via PEBS-DS
>>>>>>>> buffer on
>>>>>>>> host. Maybe we should error out when creating the event. For
>>>>>>>> example
>>>>>>>> (not tested),
>>>>>>>
>>>>>>> Test failed.
>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/events/intel/core.c
>>>>>>>> b/arch/x86/events/intel/core.c
>>>>>>>> index 3871267d3237..24b90c70737f 100644
>>>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct
>>>>>>>> perf_event
>>>>>>>> *event)
>>>>>>>>              if ((event->attr.config & INTEL_ARCH_EVENT_MASK) ==
>>>>>>>> INTEL_FIXED_VLBR_EVENT)
>>>>>>>>                  return -EINVAL;
>>>>>>>>
>>>>>>>> +        /* Profiling guest via PEBS-DS buffer on host is not
>>>>>>>> supported. */
>>>>>>>> +        if (event->attr.exclude_host)
>>>>>>>> +            return -EINVAL;
>>>>>>>> +
>>>>>>>
>>>>>>> Guest PEBS emulation also sets this bit, a typical call stack looks
>>>>>>> like:
>>>>>>>
>>>>>>>        intel_pmu_hw_config+0x441/0x4d0
>>>>>>>        hsw_hw_config+0x12/0xa0
>>>>>>>        x86_pmu_event_init+0x98/0x370
>>>>>>>        perf_try_init_event+0x47/0x130
>>>>>>>        perf_event_alloc+0x446/0xeb0
>>>>>>>        perf_event_create_kernel_counter+0x38/0x190
>>>>>>>        pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm]
>>>>>>>        kvm_pmu_handle_event+0x1a6/0x310 [kvm]
>>>>>>>        vcpu_enter_guest+0x1388/0x19b0 [kvm]
>>>>>>>        vcpu_run+0x117/0x6c0 [kvm]
>>>>>>>        kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm]
>>>>>>>        kvm_vcpu_ioctl+0x301/0x6e0 [kvm]
>>>>>>>
>>>>>>
>>>>>> Oh right, the event from the KVM guest is also exclude_host.
>>>>>> So we should only error out with the non-KVM exclude_host PEBS event.
>>>>>>
>>>>>>> Alternatively, this path is taken when using PEBS-via-PT to profile
>>>>>>> guests on host.
>>>>>>
>>>>>> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT.
>>>>>>
>>>>>> Seems we just need to distinguish a KVM event and a normal host
>>>>>> event.
>>>>>> I don't have a better way to do it except using
>>>>>> event->overflow_handler_context, which is NULL for a normal host
>>>>>> event.
>>>>>
>>>>> The assumption that event->overflow_handler_context == NULL is unsafe,
>>>>> considering .overflow_handler_context hook has its acclaimed
>>>>> generality.
>>>>>
>>>>
>>>> Yes, I agree it's very hacky.
>>>>
>>>> What we need here is a way to distinguish the KVM guest request from
>>>> the
>>>> others. How about the PF_VCPU flag?
>>>>
>>>>        if (event->attr.exclude_host &&
>>>>            !is_pebs_pt(event) &&
>>>>            (!(event->attach_state & PERF_ATTACH_TASK) ||
>>>> !(current->flags &
>>>> PF_VCPU))
>>>>                return -EINVAL;
>>>
>>> Unfortunately, the tests are not passed w/ the above diff.
>>
>> What's the exact failed perf command?
>
> [0] one vcpu thread running //usr/bin/yes
> [1] tools/perf/perf kvm --host --guest \
> --guestkallsyms="..." \ --guestvmlinux="..." \
> record --raw-samples --kcore \
> -e "cpu-cycles:GP" \
> -p `pidof cloud-hypervisor`
>
>>
>> Is it because that the KVM guest is mistakenly killed or the host
>> command is not detected?
>
> Theoretically it should error_out, but instead of just printing
> unsupported info,
> this command even collects samples pointing to vmx_vmexit, which is also
> not
> expected, there should be no samples taken for "cpu-cycles:GP" event.

If the event creation is rejected by the kernel, the perf tool will
automatically remove the P and give another try.
What's the perf tool output with -vvv?
Do you see "decreasing precise_ip by one"?

If the precise_ip is decreased to 0, the event is a non-pebs which can
be created successfully. So you will still see the samples.

>
> The point at which PF_VCPU is set/cleared is not synchronised or even
> controllable with the time of perf_event being added and enabled.
>
> Going the way of PF_VCPU will bring more complexity.

Seems we need a new flag to distinguish the KVM created event.

>
>>
>>> But it's good to know that there is PF_VCPU and more things to play
>>> around with.
>>>
>>> The AMD IBS also takes the precise path, but it puts the recorded values
>>> on group of MSRs instead of the linear address-based memory.
>>
>> The check is in the intel_pmu_hw_config(). So AMD isn't impacted.
>>
>>>
>>> The root cause of this issue is the hardware limitation, just like what
>>> we did
>>> in the commit 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry"),
>>> a similar fix should also belong in the context of directly configuring
>>> VMX-switch related hardware configuration values.
>>>
>>
>> The above fix is different than this case. IIRC, It's caused by a ucode
>> issue. So once there is a guest, SW has to explicitly disable the PEBS.
>> Otherwise, the guest crashes even the host doesn't intend to profile the
>> guest.
>>
>> While the case in the patch is apparently a violation of the current
>> rules. I think it's better to detect it and error out early.
>>
>> Thanks,
>> Kan
>
> "The host does not intend to profile the guest" implies that at the time
> 26a4f3c08de4 was introduced, we didn't have the capability to use PEBS to
> profiling guest, and the error_out message was not prompted to the end-user
> when vPEBS is not supported, and the fix to prevent guest crashes appears
> in intel_guest_get_msrs();
>
> Now vPEBS is supported and not enabled, but at this time, we're moving
> the same logic that fixes guest crashes from intel_guest_get_msrs() to
> generic intel_pmu_hw_config(). This might not be a good move.
> For me, it fixes the same issue in the same way (I should have fixed
> it in c59a1f106f5c).

The problem is that the current KVM addicts to silently manipulating the
MSRs. It doesn't care if the end user is informed properly about the
changes. I don't think it's a proper way.

Let's say if a user runs a system-wide PEBS event which tries to collect
both host and guest, they should only get samples from the host, right?
How do they know the reason why there is 0 samples from the guest?

>
> If we do more implementation to make PEBS on the host profiling guest,
> with shared memory and read/write ordering, we don't need to change
> intel_pmu_hw_config() any more, any of the required change is not out
> of the scope of intel_guest_get_msrs().

The intel_pmu_hw_config() is to check whether the feature is supported.
If we support the host profiling guest later, we should remove the check
in the intel_pmu_hw_config().

>
> To move forward, this fix won't prevent you from exploring more further
> fixes in the intel_pmu_hw_config() or elsewhere, could we agree on that ?

I think we need a real fix (from error handling to MSR manipulation),
not just a workaround.

I know the current error handling of perf is not perfect. Some KVM
failures may not be able to be explicitly passed to the end user, e.g.,
if I recall correctly PEBS crossmapping. But it doesn't mean it's not
important. We should try our best to notify the end user when it's possible.

I think we can patch both intel_pmu_hw_config() and
intel_guest_get_msrs() if necessary.

Thanks,
Kan
>
>>
>>> I haven't find a better location than intel_guest_get_msrs().
>>>
>>>>
>>>>
>>>>> I understand your motivation very well, but this is not the right move
>>>>> (based on my previous history of being sprayed by Peter). For
>>>>> perf/core,
>>>>> in-kernel perf_events should be treated equally, and the semantics of
>>>>> kvm_pmu should only be accounted when a perf/core API is only used for
>>>>> guest-only path. In this case for KVM perf_events,
>>>>> intel_guest_get_msrs()
>>>>> and x86_pmu_handle_guest_pebs() have this context.
>>>>>
>>>>>>
>>>>>> diff --git a/arch/x86/events/intel/core.c
>>>>>> b/arch/x86/events/intel/core.c
>>>>>> index a968708ed1fb..c93a2aaff7c3 100644
>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct
>>>>>> perf_event
>>>>>> *event)
>>>>>>             if ((event->attr.config & INTEL_ARCH_EVENT_MASK) ==
>>>>>> INTEL_FIXED_VLBR_EVENT)
>>>>>>                 return -EINVAL;
>>>>>>
>>>>>> +        /*
>>>>>> +         * Profiling guest via PEBS-DS buffer on host is not
>>>>>> supported.
>>>>>> +         * The event->overflow_handler_context is to distinguish
>>>>>> a KVM
>>>>>> +         * event and a normal host event.
>>>>>> +         */
>>>>>> +        if (event->attr.exclude_host &&
>>>>>> +            !is_pebs_pt(event) &&
>>>>>> +            !event->overflow_handler_context)
>>>>>> +            return -EINVAL;
>>>>>> +
>>>>>>             if (!(event->attr.freq || (event->attr.wakeup_events &&
>>>>>> !event->attr.watermark))) {
>>>>>>                 event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
>>>>>>                 if (!(event->attr.sample_type &
>>>>>>
>>>>>>>
>>>>>>> The status of the guest can only be queried in the NMI handler and
>>>>>>> the func
>>>>>>> intel_guest_get_msrs() in the perf/core context, where it's easier
>>>>>>> and more
>>>>>>> centrally to review this part of changes that affects vPMU for
>>>>>>> corner
>>>>>>> cases.
>>>>>>>
>>>>>>> Maybe adding print info on the perf-tool side would help.
>>>>>>>
>>>>>>> For perf-tool users, it will get 0 number of sample for
>>>>>>> "cpu-cycles:GP"
>>>>>>> events,
>>>>>>> just like other uncounted perf-tool events.
>>>>>>
>>>>>> perf-tool would never know such details, e.g., whether the platform
>>>>>> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is
>>>>>> because of an unsupported hardware or nothing sampled in the guest.
>>>>>
>>>>> It kind of looks like this use case to me:
>>>>>
>>>>>       perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload #
>>>>> ICX
>>>>>
>>>>> # Total Lost Samples: 0
>>>>> #
>>>>> # Samples: 0  of event 'cpu/event=0xf4,umask=0x10/'
>>>>> # Event count (approx.): 0
>>>>>
>>>>> A end-user has to check if the event-umask combination is supported
>>>>> or not,
>>>>> or nothing sampled for the workload. Is there any room for
>>>>> improvement in
>>>>> perf-tool to reduce the pain of this part ? If any, the same thing
>>>>> could
>>>>> be applied
>>>>> to cpu-cycles:GP, isn't it ?
>>>>
>>>> I don't think we can expect the end user knows such details. Most of
>>>> them may even don't know what's PEBS-via-DS.
>>>
>>> Maybe the following generic reminder helps:
>>>
>>> # Total Lost Samples: 0
>>> # Note: the event is not counted or unsupported.
>>> #
>>> # Samples: 0  of event 'cpu/event=0xf4,umask=0x10/'
>>> # Event count (approx.): 0
>>>
>>>>
>>>> Thanks,
>>>> Kan
>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Kan
>>>>>>>
>>>>>>>>              if (!(event->attr.freq ||
>>>>>>>> (event->attr.wakeup_events &&
>>>>>>>> !event->attr.watermark))) {
>>>>>>>>                  event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
>>>>>>>>                  if (!(event->attr.sample_type &
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kan
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>>>>>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR
>>>>>>>>> emulation for extended PEBS")
>>>>>>>>> Signed-off-by: Like Xu <likexu@xxxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>>      arch/x86/events/intel/core.c | 8 +++++++-
>>>>>>>>>      1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/x86/events/intel/core.c
>>>>>>>>> b/arch/x86/events/intel/core.c
>>>>>>>>> index a08f794a0e79..17afd504c35b 100644
>>>>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr
>>>>>>>>> *intel_guest_get_msrs(int *nr, void *data)
>>>>>>>>>              .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask,
>>>>>>>>>          };
>>>>>>>>>      +    /* In any case, clear guest PEBS bits first. */
>>>>>>>>> +    arr[global_ctrl].guest &= ~arr[pebs_enable].guest;
>>>>>>>>> +
>>>>>>>>>          if (arr[pebs_enable].host) {
>>>>>>>>>              /* Disable guest PEBS if host PEBS is enabled. */
>>>>>>>>>              arr[pebs_enable].guest = 0;
>>>>>>>>>          } else {
>>>>>>>>>              /* Disable guest PEBS thoroughly for cross-mapped
>>>>>>>>> PEBS
>>>>>>>>> counters. */
>>>>>>>>>              arr[pebs_enable].guest &=
>>>>>>>>> ~kvm_pmu->host_cross_mapped_mask;
>>>>>>>>> -        arr[global_ctrl].guest &=
>>>>>>>>> ~kvm_pmu->host_cross_mapped_mask;
>>>>>>>>> +
>>>>>>>>> +        /* Prevent any host user from enabling PEBS for profiling
>>>>>>>>> guest. */
>>>>>>>>> +        arr[pebs_enable].guest &= (kvm_pmu->pebs_enable &
>>>>>>>>> kvm_pmu->global_ctrl);
>>>>>>>>> +
>>>>>>>>>              /* Set hw GLOBAL_CTRL bits for PEBS counter when
>>>>>>>>> it runs
>>>>>>>>> for guest */
>>>>>>>>>              arr[global_ctrl].guest |= arr[pebs_enable].guest;
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
>>>