On 09/03/2023 00.30, Christian König wrote:
Am 08.03.23 um 15:53 schrieb Asahi Lina:I do, it's the prior job hardware completion fences that drm_sched
[SNIP]Well then you should have a fence for that hardware progress.
The background is that core memory management requires that signaling aAnd hardware progress is exactly the only dependency here...
fence only depends on signaling other fences and hardware progress and
nothing else. Otherwise you immediately run into problems because of
circle dependencies or what we call infinite fences.
already knows about!
Yes, I could return those in the prepare callback, it just means I need
to start stashing fence references in the underlying firmware job queue
command objects so I can find out what is the oldest pending fence is,
and return it when a queue is full. As long as drm_sched doesn't mind if
I keep giving it fences (since multiple commands can have to complete
before there is space) or the occasional already signaled fence (because
this process is inherently racy), it should work fine.
If you think this is the better way, I'll do it that way and drop this
patch. It just seemed simpler to do it with another callback, since
drm_sched is already tracking those fences and doing a hardware queue
limit check anyway, and that way I can avoid tracking the fences down
into the hardware queue code... *
(But I still maintain what I'm trying to do here is entirely correct and
deadlock-free! If you prefer I use prepare_job and return prior job
fences from that instead, that's very different from NAKing the patch
saying it's broken...)
* If you're wondering how the fences get signaled at all then: callback
closures that capture a reference to the fence when firmware commands
are constructed and submitted. I know, I know, fancy Rust stuff... ^^
If you'd rather have me use the fences for the blocking, I'll probably
just drop the signaling bit from the closures so we don't need to keep
two redundant fence references in different places per command. I still
need the closures for command completion processing though, since I use
them to process statistics too...
I see that we have a disconnection here. As far as I can see you can useIt isn't... I agree, it would be problematic. It doesn't make any sense
the can_run callback in only three ways:
1. To check for some userspace dependency (We don't need to discuss
that, it's evil and we both know it).
2. You check for some hw resource availability. Similar to VMID on
amdgpu hw.
This is what I think you do here (but I might be wrong).
to check for global resources this way, not just because you might
deadlock but also because there might be nothing to signal to the
scheduler that a resource was freed at all once it is!
But thisThis particular issue aside, fairness in global resource allocation is a
would be extremely problematic because you can then live lock.
E.g. queue A keeps submitting jobs which take only a few resources
and by doing so delays submitting jobs from queue B indefinitely.
conversation I'd love to have! Right now the driver doesn't try to
ensure that, a queue can easily monopolize certain hardware resources
(though one queue can only monopolize one of each, so you'd need
something like 63 queues with 63 distinct VMs all submitting
free-running jobs back to back in order to starve other queues of
resources forever). For starters, one thing I'm thinking of doing is
reserving certain subsets of hardware resources for queues with a given
priority, so you can at least guarantee forward progress of
higher-priority queues when faced with misbehaving lower-priority
queues. But if we want to guarantee proper fairness, I think I'll have
to start doing things like switching to a CPU-roundtrip submission model
when resources become scarce (to guarantee that queues actually release
the resources once in a while) and then figure out how to add fairness
to the allocation code...
But let's have that conversation when we talk about the driver (or maybe
on IRC or something?), right now I'm more interested in getting the
abstractions reviewed ^^
3. You have an intra queue dependency. E.g. you have jobs which take XYes, I can do this. I can just do the same check can_run_job() does and
amount of resources, you can submit only to a specific limit.
But in this case you should be able to return fences from the same
queue as dependency and won't need that callback.
if it fails, pick the oldest job in the full firmware queue and return
its fence (it just means I need to keep track of those fences there, as
I said above).
We would just need to adjust drm_sched_entity_add_dependency_cb() aActually that should be fine, because I'd be returning the underlying
bit because dependencies from the same queue are currently filtered out
because it assumes a pipeline nature of submission (e.g. previous
submissions are finished before new submissions start).
hardware completion fences (what the run() callback returns) which the
driver owns, and wouldn't be recognized as belonging to the sched.
We don't even have a shrinker yet, and I'm sure that's going to be a lotI actually know I have a different theoretical deadlock issue alongWell this is what I thought about those problems in amdgpu as well and
these lines in the driver because right now we grab actually global
resources (including a VMID) before job submission to drm_sched. This is
a known issue, and to fix it without reducing performance I need to
introduce some kind of "patching/fixup" system for firmware commands
(because we need to inject those identifiers in dozens of places, but we
don't want to construct those commands from scratch at job run time
because that introduces latency at the wrong time and makes error
handling/validation more complicated and error-prone), and that is
exactly what should happen in prepare_job, as you say. And yes, at that
point that should use fences to block when those resources are
exhausted. But that's a different discussion we should have when
reviewing the driver, it has nothing to do with the DRM abstractions nor
the can_run_job callback I'm adding here nor the firmware queue length
limit issue! (And also the global hardware devices are plentiful enough
that I would be very surprised if anyone ever deadlocks it in practice
even with the current code, so I honestly don't think that should be a
blocker for driver submission either, I can and will fix it later...)
it totally shipwrecked.
We still have memory allocations in the VMID code path which I'm still
not sure how to remove.
of fun when we add it too... but yes, if we can't do any memory
allocations in some of these callbacks (is this documented anywhere?),
that's going to be interesting...
It's not all bad news though! All memory allocations are fallible in
kernel Rust (and therefore explicit, and also failures have to be
explicitly handled or propagated), so it's pretty easy to point out
where they are, and there are already discussions of higher-level
tooling to enforce rules like that (and things like wait contexts).
Also, Rust makes it a lot easier to refactor code in general and not be
scared that you're going to regress everything, so I'm not really
worried if I need to turn a chunk of the driver on its head to solve
some of these problems in the future ^^ (I already did that when I
switched it from the "demo" synchronous submission model to the proper
explicit sync + fences one.)
~~ Lina