Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces

From: Zhang, Xiong Y
Date: Fri Apr 26 2024 - 00:10:10 EST



>>> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
>>> + int cpu, struct task_struct *task)
>>> +{
>>> + bool *force_exclude_guest = NULL;
>>> +
>>> + if (!has_vpmu_passthrough_cap(event->pmu))
>>> + return 0;
>>> +
>>> + if (event->attr.exclude_guest)
>>> + return 0;
>>> +
>>> + if (cpu != -1) {
>>> + force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
>>> + } else if (task && (task->flags & PF_VCPU)) {
>>> + /*
>>> + * Just need to check the running CPU in the event creation. If the
>>> + * task is moved to another CPU which supports the force_exclude_guest.
>>> + * The event will filtered out and be moved to the error stage. See
>>> + * merge_sched_in().
>>> + */
>>> + force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
>>> + }
>>
>> These checks are extremely racy, I don't see how this can possibly do the
>> right thing. PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
>> to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
>> include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
>> time slices).
>>
>
> This is to reject an !exclude_guest event creation for a running
> "passthrough" guest from host perf tool.
> Could you please suggest a way to detect it via the struct task_struct?
Here PF_VCPU is used to tell whether a perf event is profiling a userspace
VMM process, e.g. perf record -e {} -p $QEMU_PID. Many earlier mails in this
thread discussed how to handle system-wide perf events, which are created
with a NULL task. But a per-process perf event targeting a userspace VMM
should be handled the same way as a system-wide event, so perf needs a way
to identify that a per-process event is attached to a userspace VMM. If
PF_VCPU isn't the right indicator, then it's an open question how to detect
this.
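
(For reference on the PF_VCPU lifetime Sean describes: the flag is only set
around each VM-Enter/VM-Exit by the guest time accounting helpers, so it
means "currently running guest code", not "this task is a vCPU". A
simplified sketch of that behavior, not the exact vtime code:

	/* set just before VM-Enter, from guest_timing_enter_irqoff() */
	static inline void vtime_account_guest_enter(void)
	{
		/* ... account the preceding kernel time ... */
		current->flags |= PF_VCPU;
	}

	/* cleared right after VM-Exit, from guest_timing_exit_irqoff() */
	static inline void vtime_account_guest_exit(void)
	{
		/* ... account the guest time ... */
		current->flags &= ~PF_VCPU;
	}

So at event creation time the flag only says whether the task happened to be
in guest mode at that instant.)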

thanks
>
>
>> Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
>> perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
>> per-CPU context mutex.
>
> Do you mean that the perf_guest_enter() check could happen right
> after the perf_force_exclude_guest_check()?
> It's possible. For this case, the event can still be created. It will be
> treated as an existing event and handled in merge_sched_in(). It will
> never be scheduled when a guest is running.
>
> The perf_force_exclude_guest_check() is to make sure most of the cases
> can be rejected at the creation place. For the corner cases, they will
> be rejected in the schedule stage.
>
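
To illustrate the schedule-time rejection described above: conceptually, the
check in merge_sched_in() would look something like the below. This is only
a sketch of the idea, not the actual patch:

	static int merge_sched_in(struct perf_event *event, void *data)
	{
		/* ... existing state and filter checks ... */

		/*
		 * While a passthrough guest owns the PMU on this CPU, an
		 * !exclude_guest event is simply not scheduled in.
		 */
		if (__this_cpu_read(__perf_force_exclude_guest) &&
		    !event->attr.exclude_guest)
			return 0;

		/* ... group_sched_in() etc. ... */
		return 0;
	}
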
>>
>>> + if (force_exclude_guest && *force_exclude_guest)
>>> + return -EBUSY;
>>> + return 0;
>>> +}
>>> +
>>> /*
>>> * Holding the top-level event's child_mutex means that any
>>> * descendant process that has inherited this event will block
>>> @@ -11973,6 +12142,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>>> goto err_ns;
>>> }
>>>
>>> + if (perf_force_exclude_guest_check(event, cpu, task)) {
>>
>> This should be:
>>
>> err = perf_force_exclude_guest_check(event, cpu, task);
>> if (err)
>> goto err_pmu;
>>
>> i.e. shouldn't effectively ignore/override the return result.
>>
>
> Sure.
>
> Thanks,
> Kan
>
>>> + err = -EBUSY;
>>> + goto err_pmu;
>>> + }
>>> +
>>> /*
>>> * Disallow uncore-task events. Similarly, disallow uncore-cgroup
>>> * events (they don't make sense as the cgroup will be different
>>> --
>>> 2.34.1
>>>
>