Re: [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting

From: Boris Brezillon
Date: Mon Aug 19 2024 - 06:59:14 EST


Hi,

On Mon, 19 Aug 2024 08:48:24 +0100
Steven Price <steven.price@xxxxxxx> wrote:

> Hi Adrián,
>
> On 31/07/2024 13:41, Adrián Larumbe wrote:
> > Hi Steven, thanks for the remarks.
> >
> > On 19.07.2024 15:14, Steven Price wrote:
> >> On 16/07/2024 21:11, Adrián Larumbe wrote:
> >>> Enable calculations of job submission times in clock cycles and wall
> >>> time. This is done by expanding the boilerplate command stream when running
> >>> a job to include instructions that compute said times right before and after
> >>> a user CS.
> >>>
> >>> Those numbers are stored in the queue's group's sync objects BO, right
> >>> after them. Because the queues in a group might have a different number of
> >>> slots, one must keep track of the overall slot tally when reckoning the
> >>> offset of a queue's time sample structs, one for each slot.
> >>>
> >>> This commit is done in preparation for enabling DRM fdinfo support in the
> >>> Panthor driver, which depends on the numbers calculated herein.
> >>>
> >>> A profile mode device flag has been added that will in a future commit
> >>> allow UM to toggle time sampling behaviour, which is disabled by default to
> >>> save power. It also enables marking jobs as being profiled and picks one of
> >>> two call instruction arrays to insert into the ring buffer. One of them
> >>> includes FW logic to sample the timestamp and cycle counter registers and
> >>> write them into the job's syncobj, and the other does not.
> >>>
> >>> A profiled job's call sequence takes up two ring buffer slots, and this is
> >>> reflected when initialising the DRM scheduler for each queue, with a
> >>> profiled job contributing twice as many credits.
> >>>
> >>> Signed-off-by: Adrián Larumbe <adrian.larumbe@xxxxxxxxxxxxx>
> >>
> >> Thanks for the updates, this looks better. A few minor comments below.
> >>
> >>> ---
> >>> drivers/gpu/drm/panthor/panthor_device.h | 2 +
> >>> drivers/gpu/drm/panthor/panthor_sched.c | 244 ++++++++++++++++++++---
> >>> 2 files changed, 216 insertions(+), 30 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> >>> index e388c0472ba7..3ede2f80df73 100644
> >>> --- a/drivers/gpu/drm/panthor/panthor_device.h
> >>> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> >>> @@ -162,6 +162,8 @@ struct panthor_device {
> >>> */
> >>> struct page *dummy_latest_flush;
> >>> } pm;
> >>> +
> >>> + bool profile_mode;
> >>> };
> >>>
> >>> /**
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> >>> index 79ffcbc41d78..6438e5ea1f2b 100644
> >>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> >>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> >>> @@ -93,6 +93,9 @@
> >>> #define MIN_CSGS 3
> >>> #define MAX_CSG_PRIO 0xf
> >>>
> >>> +#define NUM_INSTRS_PER_SLOT 16
> >>> +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
> >>> +
> >>> struct panthor_group;
> >>>
> >>> /**
> >>> @@ -466,6 +469,9 @@ struct panthor_queue {
> >>> */
> >>> struct list_head in_flight_jobs;
> >>> } fence_ctx;
> >>> +
> >>> + /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
> >>> + unsigned long time_offset;
> >>
> >> AFAICT this doesn't need to be stored. We could just pass this value
> >> into group_create_queue() as an extra parameter where it's used.
> >
> > I think we need to keep this offset value around, because queues within the same group
> > could have a variable number of slots, so when fetching the sampled values from the
> > syncobjs BO in update_fdinfo_stats, it would have to traverse the entire array of
> > preceding queues and figure out their size in slots so as to jump over as many
> > struct panthor_job_times after the preceding syncobj array.
>
> Yes I think I was getting myself confused - for some reason I'd thought
> the ring buffers in each queue were the same size. It makes sense to
> keep this.
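
For context, the traversal Adrián describes boils down to something
like this (just a sketch to make the cost visible; the helper and a
couple of the names are approximations, not the actual patch):

static u32 queue_time_offset(struct panthor_group *group, u32 queue_idx)
{
	/* Time samples live right after the per-queue syncobjs. */
	u32 offset = group->queue_count * sizeof(struct panthor_syncobj_64b);
	u32 i;

	/* Skip the sample arrays of all preceding queues, whose ring
	 * buffers (and thus slot counts) can have different sizes.
	 */
	for (i = 0; i < queue_idx; i++)
		offset += queue_slot_count(group->queues[i]) *
			  sizeof(struct panthor_job_times);

	return offset;
}

which is why caching queue->time_offset at group creation time makes
sense with the current layout.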

IIRC, I was the one who suggested putting all the metadata in a single
BO to avoid losing too much space when ring buffers are small (because
of the 4k granularity of BOs). But given the size of panthor_job_times
(32 bytes per slot), a 4k chunk only covers 128 job slots, which is not
that big to be honest. So I'm no longer sure this optimization makes
sense. If we still want to allocate one big BO containing the syncobjs
and timestamp slots for all queues, say, to optimize the GEM metadata
vs. actual BO data overhead, I'd recommend keeping a CPU/GPU pointer
relative to the queue in panthor_queue, which we can easily dereference
as time_slots[job_slot_id].
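
Something along these lines, purely illustrative (the field names are
made up for the example, not taken from the patch):

struct panthor_job_times {
	u64 cycles_before;
	u64 cycles_after;
	u64 time_before;
	u64 time_after;
};	/* 32 bytes, one instance per ring buffer slot */

struct panthor_queue {
	/* ... */

	struct {
		/* Queue-relative views into the (shared or per-queue)
		 * BO, set up once at queue creation time.
		 */
		struct panthor_job_times *slots;	/* CPU mapping */
		u64 gpu_va;				/* GPU address of slots[0] */
	} profiling;
};

That way, fetching a job's sample boils down to
&queue->profiling.slots[job_slot_id], and update_fdinfo_stats() doesn't
need any per-group offset bookkeeping.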

>
> <snip>
>
> >>> @@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
> >>> goto err_put_job;
> >>> }
> >>>
> >>> + job->is_profiled = pfile->ptdev->profile_mode;
> >>> +
> >>> ret = drm_sched_job_init(&job->base,
> >>> &job->group->queues[job->queue_idx]->entity,
> >>> - 1, job->group);
> >>> + job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
> >>> + NUM_INSTRS_PER_SLOT, job->group);
> >>
> >> Is there a good reason to make the credits the number of instructions,
> >> rather than the number of slots? If we were going to count the actual
> >> number of non-NOP instructions then there would be some logic (although
> >> I'm not convinced that makes much sense), but considering we only allow
> >> submission in "slot granules" we might as well use that as the unit of
> >> "credit".
> >
> > In my initial pre-ML version of the patch series I was passing the number of
> > queue slots as the total credit count, but Boris was keener on setting it to
> > the total number of instructions instead.
> >
> > I agree with you that both are equivalent, one just being an integer multiple
> > of the other, so I'm fine with either choice. Maybe Boris can pitch in, in
> > case he has a strong opinion about this.
>
> I wouldn't say I have a strong opinion, it just seems a little odd to be
> multiplying the value by a constant everywhere. If you'd got some
> changes planned where the instructions could vary more dynamically that
> would be good to know about.

Yeah, the long-term plan is to make the number of instructions variable
based on the profiling parameters, so that we don't waste ringbuf slots
when profiling is completely disabled. Of course, we always want to
keep it a multiple of a cache line (i.e. 8 instructions), but I find it
easier to reason in numbers of instructions rather than in blocks of
instructions.
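
Roughly what I have in mind, as a hand-wavy sketch (none of these
helpers exist yet and the instruction counts are placeholders):

#define CACHELINE_INSTRS	8	/* 8 * sizeof(u64) == one cache line */

static u32 job_cs_instr_count(struct panthor_device *ptdev)
{
	u32 instrs = 4;		/* base: wait, call into user CS, sync update, ... */

	if (ptdev->profile_mode) {
		instrs += 4;	/* sample the cycle counter before/after the CS */
		instrs += 4;	/* sample the timestamp before/after the CS */
	}

	/* Credits passed to drm_sched_job_init() stay in instruction
	 * units, padded to a cache line so slots keep their alignment.
	 */
	return ALIGN(instrs, CACHELINE_INSTRS);
}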

Regards,

Boris