On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
Hi Steven,panthor doesn't yet have a shrinker, so all memory is pinned, which means
Am 29.08.24 um 15:37 schrieb Steven Price:
Hi Christian,Mhm, that is unfortunately exactly what I feared.
Mihail should be able to give more definitive answers, but I think I can
answer your questions.
On 29/08/2024 10:40, Christian König wrote:
Am 28.08.24 um 19:25 schrieb Mihail Atanassov:A timeout. Although looking at it I think it's probably set too high
Hello all,Question is how do you guarantee forward progress for fence signaling?
This series implements a mechanism to expose Mali CSF GPUs' queue
ringbuffers directly to userspace, along with paraphernalia to allow
userspace to control job synchronisation between the CPU and GPU.
The goal of these changes is to allow userspace to control work
submission to the FW/HW directly without kernel intervention in the
common case, thereby reducing context switching overhead. It also allows
for greater flexibility in the way work is enqueued in the ringbufs.
For example, the current kernel submit path only supports indirect
calls, which is inefficient for small command buffers. Userspace can
also skip unnecessary sync operations.
currently:
+#define JOB_TIMEOUT_MS 5000But basically the XGS queue is a DRM scheduler just like a normal GPU
queue and the jobs have a timeout. If the timeout is hit then any fences
will be signalled (with an error).
How do you guarantee forward progress and that resuming of suspended queuesE.g. when are fences created and published? How do they signal?The actual userspace queue can be suspended. This is actually a
How are dependencies handled? How can the kernel suspend an userspace
queue?
combination of firmware and kernel driver, and this functionality is
already present without the user submission. The firmware will multiplex
multiple 'groups' onto the hardware, and if there are too many for the
firmware then the kernel multiplexes the extra groups onto the ones the
firmware supports.
doesn't end up in a circle dependency?
I haven't studied Mihail's series in detail yet, but if I understandYou might have misunderstood my question (and I might misunderstand the
correctly, the XGS queues are handled separately and are not suspended
when the hardware queues are suspended. I guess this might be an area
for improvement and might explain the currently very high timeout (to
deal with the case where the actual GPU work has been suspended).
How does memory management work in this case?I'm not entirely sure what you mean here. If you are referring to the
potential memory issues with signalling path then this should be handled
by the timeout - although I haven't studied the code to check for bugs here.
code), but on first glance it strongly sounds like the current approach will
be NAKed.
The actual new XGS queues don't allocate/free memory during the queueBut the kernel and the hardware could suspend the queues, right?
execution - so it's just the memory usage related to fences (and the
other work which could be blocked on the fence).
In terms of memory management for the GPU work itself, this is handledI don't know the details, but that again strongly sounds like that won't
the same as before. The VM_BIND mechanism allows dependencies to be
created between syncobjs and VM operations, with XGS these can then be
tied to GPU HW events.
work.
What you need is to somehow guarantee that work doesn't run into memory
management deadlocks which are resolved by timeouts.
Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
memory management easy mode.
But also this means there might be an uapi design bug in here, and we
really don't want to commit to that. My stance is that panthor should gain
a proper shrinker first, which means there will be some changes here too.
And then we can properly evaluate this. As-is it's a bit too much on the
toy end of things.
That aside, I've thought this all through with tons of people, and I do
think it's all possible.
-Sima
Regards,
Christian.
Fundamentally (modulo bugs) there is little change compared to kernel
submission - it's already fairly trivial to write GPU job which will
make no forward progress (a 'while (1)' equivalent job). The only
difference here is that XGS makes this 'easy' and doesn't involve the
GPU spinning. Either way we rely on a timeout to recover from these
situations.
Thanks,
Steve
Regards,
Christian.
This is still a work-in-progress, there's an outstanding issue with
multiple processes using different submission flows triggering
scheduling bugs (e.g. the same group getting scheduled twice), but we'd
love to gather some feedback on the suitability of the approach in
general and see if there's a clear path to merging something like this
eventually.
I've also CCd AMD maintainers because they have in the past done
something similar[1], in case they want to chime in.
There are two uses of this new uAPI in Mesa, one in gallium/panfrost
(link TBD), and one in panvk [2].
The Gallium implementation is a naïve change just to switch the
submission model and exercise the new kernel code, and we don't plan
on pursuing this at this time.
The panvk driver changes are, however, a better representation of the
intent behind this new uAPI, so please consider that as the reference
userspace. It is still very much also a work in progress.
* patch 1 adds all the uAPI changes;
* patch 2 implements the GROUP_CREATE ioctl changes necessary to expose
the required objects to userspace;
* patch 3 maps the doorbell pages, similarly to how the user I/O
page is
mapped;
* patch 4 implements GROUP_KICK, which lets userspace request an
inactive group to be scheduled on the GPU;
* patches 5 & 6 implement XGS queues, a way for userspace to
synchronise GPU queue progress with DRM syncobjs;
* patches 7 & 8 add notification mechanisms for user & kernel to signal
changes to native GPU syncobjs.
[1]
https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@xxxxxxxxxxxxxx/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
[2]
https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
Ketil Johnsen (7):
drm/panthor: Add uAPI to submit from user space
drm/panthor: Extend GROUP_CREATE for user submission
drm/panthor: Map doorbell pages
drm/panthor: Add GROUP_KICK ioctl
drm/panthor: Factor out syncobj handling
drm/panthor: Implement XGS queues
drm/panthor: Add SYNC_UPDATE ioctl
Mihail Atanassov (1):
drm/panthor: Add sync_update eventfd handling
drivers/gpu/drm/panthor/Makefile | 4 +-
drivers/gpu/drm/panthor/panthor_device.c | 66 ++-
drivers/gpu/drm/panthor/panthor_device.h | 35 +-
drivers/gpu/drm/panthor/panthor_drv.c | 233 +++++++-
drivers/gpu/drm/panthor/panthor_fw.c | 2 +-
drivers/gpu/drm/panthor/panthor_sched.c | 408 +++++++++-----
drivers/gpu/drm/panthor/panthor_sched.h | 8 +-
drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
drivers/gpu/drm/panthor/panthor_syncobj.h | 27 +
drivers/gpu/drm/panthor/panthor_xgs.c | 638 ++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_xgs.h | 42 ++
include/uapi/drm/panthor_drm.h | 243 +++++++-
12 files changed, 1696 insertions(+), 177 deletions(-)
create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h