Re: [RFC] drm/msm: Add UABI to request perfcntr usage

From: Akhil P Oommen
Date: Mon Dec 16 2024 - 12:03:14 EST


On 12/13/2024 10:40 PM, Antonino Maniscalco wrote:
> On 12/13/24 5:50 PM, Akhil P Oommen wrote:
>> On 12/12/2024 9:44 PM, Antonino Maniscalco wrote:
>>> On 12/12/24 4:58 PM, Akhil P Oommen wrote:
>>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
>>>>> From: Rob Clark <robdclark@xxxxxxxxxxxx>
>>>>>
>>>>> Performance counter usage falls into two categories:
>>>>>
>>>>> 1. Local usage, where the counter configuration, start, and end read
>>>>>      happen within (locally to) a single SUBMIT.  In this case,
>>>>> there is
>>>>>      no dependency on counter configuration or values between submits,
>>>>> and
>>>>>      in fact counters are normally cleared on context switches,
>>>>> making it
>>>>>      impossible to rely on cross-submit state.
>>>>>
>>>>> 2. Global usage, where a single privilaged daemon/process is sampling
>>>>>      counter values across all processes for profiling.
>>>>>
>>>>> The two categories are mutually exclusive.  While you can have many
>>>>> processes making local counter usage, you cannot combine global and
>>>>> local usage without the two stepping on each others feet (by changing
>>>>> counter configuration).
>>>>>
>>>>> For global counter usage, there is already a SYSPROF param (since
>>>>> global
>>>>> counter usage requires disabling counter clearing on context switch).
>>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
>>>>> one or more processes has requested counter usage, then a SYSPROF
>>>>> request will fail with -EBUSY.  And if SYSPROF is active, then
>>>>> REQ_CNTRS
>>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
>>>>>
>>>>> This is purely an advisory interface to help coordinate userspace.
>>>>> There is no real means of enforcement, but the worst that can
>>>>> happen if
>>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
>>>>> profiling data.
>>>>>
>>>>> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
>>>>> ---
>>>>> kgsl takes a different approach, which involves a lot more UABI for
>>>>> assigning counters to different processes.  But I think by taking
>>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
>>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
>>>>> counter extensions, we can take this simpler approach.
>>>>
>>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
>>>> that will come up in future generations). How will we ensure that here?
>>>>
>>>> I have plans to bring up IFPC support in near future. Also, I
>>>> brought up
>>>> this point during preemption series. But from the responses, I felt
>>>> that
>>>> profiling was not considered a serious usecase. Still I wonder how the
>>>> perfcounter extensions work accurately with preemption.
>>>
>>> So back then I implemented the postamble IB to clear perf counters and
>>> that gets disabled when sysprof (so global usage) is happening. The
>>> kernel is oblivious to "Local isage" of profiling but in that case
>>> really what we want to do is disable preemption which in my
>>> understanding can be done from userspace with a PKT. In my understanding
>>> this had us covered for all usecases.
>>
>> I think this wasn't mentioned at that time. Which UMD PKT in a6x+ did
>> you mean?
>
> Ah, I thought it wasmentioned, sorry.
> The packet I was referring to is:
>     <doc> Make next dword 1 to disable preemption, 0 to re-enable it. </
> doc>
>     <value name="CP_PREEMPT_DISABLE" value="0x6c" variants="A6XX"/>

Ah! Okay. I think this packet is not used by the downstream blob. IMO,
disabling preemption is still a suboptimal solution.

>
> BTW you mentioned wanting to look into IFPC. Since I too wanted to look
> into implementing it wonder if you could let me know when you planned on
> working on it.

I have few patches in progress. Nothing final yet and need verification
on the hw side. Also, I need to do some housekeeping here to debug gmu
issues since IFPC increases the probability of those a lot.

I will try to send out the patches very soon.

-Akhil.

>
>>
>> -Akhil.
>>
>>>
>>> So what would you expect instead we should do kernel side to make
>>> profiling preemption safe?
>>>
>>>>
>>>> -Akhil
>>>>
>>>>>
>>>>>    drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>>>>>    drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>>>>>    drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>>>>>    drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>>>>>    drivers/gpu/drm/msm/msm_submitqueue.c   | 52 +++++++++++++++++++
>>>>> +++++-
>>>>>    include/uapi/drm/msm_drm.h              |  1 +
>>>>>    6 files changed, 85 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/
>>>>> drm/msm/adreno/adreno_gpu.c
>>>>> index 31bbf2c83de4..f688e37059b8 100644
>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct
>>>>> msm_file_private *ctx,
>>>>>            if (!capable(CAP_SYS_ADMIN))
>>>>>                return UERR(EPERM, drm, "invalid permissions");
>>>>>            return msm_file_private_set_sysprof(ctx, gpu, value);
>>>>> +    case MSM_PARAM_REQ_CNTRS:
>>>>> +        return msm_file_private_request_counters(ctx, gpu, value);
>>>>>        default:
>>>>>            return UERR(EINVAL, drm, "%s: invalid param: %u", gpu-
>>>>>> name, param);
>>>>>        }
>>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/
>>>>> msm_drv.c
>>>>> index 6416d2cb4efc..bf8314ff4a25 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device
>>>>> *dev, struct drm_file *file)
>>>>>         * It is not possible to set sysprof param to non-zero if gpu
>>>>>         * is not initialized:
>>>>>         */
>>>>> -    if (priv->gpu)
>>>>> +    if (ctx->sysprof)
>>>>>            msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>>>>>    +    if (ctx->counters_requested)
>>>>> +        msm_file_private_request_counters(ctx, priv->gpu, 0);
>>>>> +
>>>>>        context_close(ctx);
>>>>>    }
>>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/
>>>>> msm_gpu.c
>>>>> index 82f204f3bb8f..013b59ca3bb1 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct
>>>>> platform_device *pdev,
>>>>>        gpu->nr_rings = nr_rings;
>>>>>          refcount_set(&gpu->sysprof_active, 1);
>>>>> +    refcount_set(&gpu->local_counters_active, 1);
>>>>>          return 0;
>>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/
>>>>> msm_gpu.h
>>>>> index e25009150579..83c61e523b1b 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
>>>>>        int nr_rings;
>>>>>          /**
>>>>> -     * sysprof_active:
>>>>> +     * @sysprof_active:
>>>>>         *
>>>>> -     * The count of contexts that have enabled system profiling.
>>>>> +     * The count of contexts that have enabled system profiling plus
>>>>> one.
>>>>> +     *
>>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>>>> keep
>>>>> +     * the under/overflow checks that refcount_t provides, but allow
>>>>> +     * multiple on/off transitions so we track the logical value
>>>>> plus one.)
>>>>>         */
>>>>>        refcount_t sysprof_active;
>>>>>    +    /**
>>>>> +     * @local_counters_active:
>>>>> +     *
>>>>> +     * The count of contexts that have requested local (intra-submit)
>>>>> +     * performance counter usage plus one.
>>>>> +     *
>>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>>>> keep
>>>>> +     * the under/overflow checks that refcount_t provides, but allow
>>>>> +     * multiple on/off transitions so we track the logical value
>>>>> plus one.)
>>>>> +     */
>>>>> +    refcount_t local_counters_active;
>>>>> +
>>>>>        /**
>>>>>         * lock:
>>>>>         *
>>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
>>>>>         */
>>>>>        int sysprof;
>>>>>    +    /**
>>>>> +     * @counters_requested:
>>>>> +     *
>>>>> +     * Has the context requested local perfcntr usage.
>>>>> +     */
>>>>> +    bool counters_requested;
>>>>> +
>>>>>        /**
>>>>>         * comm: Overridden task comm, see MSM_PARAM_COMM
>>>>>         *
>>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>>>>>      int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>                     struct msm_gpu *gpu, int sysprof);
>>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>>> +                      struct msm_gpu *gpu, int reqcntrs);
>>>>>    void __msm_file_private_destroy(struct kref *kref);
>>>>>      static inline void msm_file_private_put(struct msm_file_private
>>>>> *ctx)
>>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/
>>>>> msm/msm_submitqueue.c
>>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>> @@ -10,6 +10,15 @@
>>>>>    int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>                     struct msm_gpu *gpu, int sysprof)
>>>>>    {
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    mutex_lock(&gpu->lock);
>>>>> +
>>>>> +    if (sysprof && (refcount_read(&gpu->local_counters_active) >
>>>>> 1)) {
>>>>> +        ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
>>>>> +        goto out_unlock;
>>>>> +    }
>>>>> +
>>>>>        /*
>>>>>         * Since pm_runtime and sysprof_active are both refcounts, we
>>>>>         * call apply the new value first, and then unwind the previous
>>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct
>>>>> msm_file_private *ctx,
>>>>>          switch (sysprof) {
>>>>>        default:
>>>>> -        return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d",
>>>>> sysprof);
>>>>> +        ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>>> +        goto out_unlock;
>>>>>        case 2:
>>>>>            pm_runtime_get_sync(&gpu->pdev->dev);
>>>>>            fallthrough;
>>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct
>>>>> msm_file_private *ctx,
>>>>>          ctx->sysprof = sysprof;
>>>>>    -    return 0;
>>>>> +out_unlock:
>>>>> +    mutex_unlock(&gpu->lock);
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>>> +                      struct msm_gpu *gpu, int reqctrs)
>>>>> +{
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    mutex_lock(&gpu->lock);
>>>>> +
>>>>> +    if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
>>>>> +        ret = UERR(EBUSY, gpu->dev, "System profiling active");
>>>>> +        goto out_unlock;
>>>>> +    }
>>>>> +
>>>>> +    if (reqctrs) {
>>>>> +        if (ctx->counters_requested) {
>>>>> +            ret = UERR(EINVAL, gpu->dev, "Already requested");
>>>>> +            goto out_unlock;
>>>>> +        }
>>>>> +
>>>>> +        ctx->counters_requested = true;
>>>>> +        refcount_inc(&gpu->local_counters_active);
>>>>> +    } else {
>>>>> +        if (!ctx->counters_requested) {
>>>>> +            ret = UERR(EINVAL, gpu->dev, "Not requested");
>>>>> +            goto out_unlock;
>>>>> +        }
>>>>> +        refcount_dec(&gpu->local_counters_active);
>>>>> +        ctx->counters_requested = false;
>>>>> +    }
>>>>> +
>>>>> +out_unlock:
>>>>> +    mutex_unlock(&gpu->lock);
>>>>> +
>>>>> +    return ret;
>>>>>    }
>>>>>      void __msm_file_private_destroy(struct kref *kref)
>>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>>>> index 2342cb90857e..ae7fb355e4a1 100644
>>>>> --- a/include/uapi/drm/msm_drm.h
>>>>> +++ b/include/uapi/drm/msm_drm.h
>>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>>>>>    #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>>>>>    #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>>>>>    #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
>>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-
>>>>> submit) perfcntr usage  */
>>>>>      /* For backwards compat.  The original support for preemption was
>>>>> based on
>>>>>     * a single ring per priority level so # of priority levels equals
>>>>> the #
>>>>
>>>
>>>
>>> Best regards,
>>
>
>
> Best regards,