[SNIP]
The background is that core memory management requires that signaling aAnd hardware progress is exactly the only dependency here...
fence only depends on signaling other fences and hardware progress and
nothing else. Otherwise you immediately run into problems because of
circle dependencies or what we call infinite fences.
Jason Ekstrand gave a create presentation on that problem a few yearsFaith Ekstrand (it looks like you mistyped that name...)
ago on LPC. I strongly suggest you google that one up.
is the person
who proposed that I should use drm_sched in this way (see below), we've
had a few private meetings about this design ^^
I don't think so...? It's just an atomic that gets checked inYou can see the driver implementation of that callback inYeah and that hw_submission_limit is based on a fence signaling again.
drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
does the actual available slot count checks.
The can_run_job logic is written to mirror the hw_submission_limit logic
(just a bit later in the sched main loop since we need to actually pick
a job to do the check), and just like for that case, completion of any
job in the same scheduler will cause another run of the main loop and
another check (which is exactly what we want here).
drm_sched_ready(). There are no extra fences involved (other than the
job completion fences that trigger another scheduler run). The idea is
that when the hardware queue makes forward progress you check against
the limit again and submit more jobs as needed. I'm doing the same exact
thing, I'm just using more complex logic for the notion of in-flight
queue limits!
When you have some firmware limitation that a job needs resources whichI think we have a disconnect in our views of what is going on here...
are currently in use by other submissions then those other submissions
have fences as well and you can return those in the prepare_job callback.
If those other submissions don't have fences, then you have a major
design problem inside your driver and we need to get back to square one
and talk about that dependency handling.
This hardware has firmware-side scheduling with an arbitrary (as far as
I know) number of queues. There is one scheduler instance and one entity
per userspace queue (not global!). These queues process jobs in some
logical sequence, though at the firmware level they get split into up to
three queues each (and there is some parallelism allowed). The
limitation here is in the number of in-flight jobs per firmware queue,
not global.
There is no way for things to deadlock. If jobs have been submitted to
the firmware queue, that means their dependencies were signaled already.
Jobs have intra-job dependencies via driver barriers (which drm_sched
knows nothing about), but the submission code in the driver guarantees
that they are deadlock-free since you can only barrier on past commands,
which by definition submit first.
If a firmware queue is full, drm_sched blocks. Since it is full, that
means it will run those commands (since they have no outside
dependencies and they are already queued and ready to run by the
firmware), eventually space will be freed, and each time a job completes
drm_sched will do the can_run_job check again and decide whether to run
a new job.
Since the firmware queues contain commands which only have past-facing
barriers on other already submitted commands, by definition they will
become empty at some point as long as the firmware is making forward
progress. And therefore, by definition, can_run_job will eventually
return true at some point after a job completion fence is signaled (the
one for the last job submitted prior). There is a check in the driver to
ensure that we do not allow submissions which, by themselves, would
exceed the queued command limit (we actually just limit to 64 commands
overall right now, which is conservative but seems reasonable given the
128-per-firmware-queue limit).
I get the feeling that you are conflating pending jobs with submitted
jobs. This isn't about how many jobs you can have pending in drm_sched
before running them or anything like that. Of course, at that point,
arbitrary dependencies come into play and you can end up with deadlocks
on dependency fences. But that's not the case here. What can_run_job is
waiting on is guaranteed to make forward progress.
But the resource can_run_job is checking on isn't globally shared! It'sThis case (potentially scheduling more than the FW job limit) is rareActually that's a pretty normal use case. I've have rejected similar
but handling it is necessary, since otherwise the entire job
completion/tracking logic gets screwed up on the firmware end and queues
end up stuck (I've managed to trigger this before).
requirements like this before as well.
For an example how this can work see amdgpu_job_prepare_job():
https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L251
The gang submit gives and example of a global fence lock and the VMIDs
are an example of a global shared firmware resource.
specific to this scheduler instance, just like hw_submission_limit is,
so as long as the firmware behind the scheduler is making forward
progress, the resource will be guaranteed to be freed until another job
can run.
I actually know I have a different theoretical deadlock issue along
these lines in the driver because right now we grab actually global
resources (including a VMID) before job submission to drm_sched. This is
a known issue, and to fix it without reducing performance I need to
introduce some kind of "patching/fixup" system for firmware commands
(because we need to inject those identifiers in dozens of places, but we
don't want to construct those commands from scratch at job run time
because that introduces latency at the wrong time and makes error
handling/validation more complicated and error-prone), and that is
exactly what should happen in prepare_job, as you say. And yes, at that
point that should use fences to block when those resources are
exhausted. But that's a different discussion we should have when
reviewing the driver, it has nothing to do with the DRM abstractions nor
the can_run_job callback I'm adding here nor the firmware queue length
limit issue! (And also the global hardware devices are plentiful enough
that I would be very surprised if anyone ever deadlocks it in practice
even with the current code, so I honestly don't think that should be a
blocker for driver submission either, I can and will fix it later...)
~~ Lina