Re: [PATCH RFC 10/18] drm/scheduler: Add can_run_job callback

From: Faith Ekstrand
Date: Thu Mar 09 2023 - 13:51:27 EST


Jumping in here quick... (Sorry, I was out yesterday and was ignoring
my e-mail on Tuesday so I could finally type some compiler code.)

On Thu, 2023-03-09 at 18:14 +0900, Asahi Lina wrote:
> On 09/03/2023 17.05, Christian König wrote:
> > Am 09.03.23 um 07:30 schrieb Asahi Lina:
> > > On 09/03/2023 05.14, Christian König wrote:
> > > > > I think you mean wake_up_interruptible(). That would be
> > > > > drm_sched_job_done(), on the fence callback when a job
> > > > > completes, which
> > > > > as I keep saying is the same logic used for
> > > > > hw_rq_count/hw_submission_limit tracking.
> > > > As the documentation to wait_event says:
> > > >
> > > >    * wake_up() has to be called after changing any variable
> > > > that could
> > > >    * change the result of the wait condition.
> > > >
> > > > So what you essentially try to do here is to skip that and say
> > > > drm_sched_job_done() would call that anyway, but when you read
> > > > any
> > > > variable to determine that state then as far as I can see
> > > > nothing is
> > > > guaranteeing that order.
> > > The driver needs to guarantee that any changes to that state
> > > precede a
> > > job completion fence signal of course, that's the entire idea of
> > > the
> > > API. It's supposed to represent a check for per-scheduler (or
> > > more
> > > specific, but not more global) resources that are released on job
> > > completion. Of course if you misuse the API you could cause a
> > > problem,
> > > but what I'm trying to say is that the API as designed and when
> > > used as
> > > intended does work properly.
> > >
> > > Put another way: job completions always need to cause the sched
> > > main
> > > loop to run an iteration anyway (otherwise we wouldn't make
> > > forward
> > > progress), and job completions are exactly the signal that the
> > > can_run_job() condition may have changed.
> > >
> > > > The only other possibility how you could use the callback
> > > > correctly
> > > > would be to call drm_fence_is_signaled() to query the state of
> > > > your hw
> > > > submission from the same fence which is then signaled. But then
> > > > the
> > > > question is once more why you don't give that fence directly to
> > > > the
> > > > scheduler?
> > > But the driver is supposed to guarantee that the ordering is
> > > always 1.
> > > resources freed, 2. fence signaled. So you don't need to check
> > > for the
> > > fence, you can just check for the resource state.
> >
> > Yeah, but this is exactly what the dma_fence framework tried to
> > prevent.
> > We try very hard to avoid such side channel signaling :)
>
> Right, and it's fine, I can use the fences directly easily enough.
> I'm
> just trying to explain why my original idea works too, even if it's
> not
> the best solution for other reasons!
>
> Of course I don't have the context of what other drivers are doing or
> did historically and what the pitfalls are, so I can't know what the
> "right" solution for any of this is in that context. I did my best to
> understand the drm_sched code and come up with a solution that works
> (which it does) without any more info. When I saw the hw submission
> limit stuff, I thought "okay, I need the same thing but with slightly
> more complex logic, so let's add a callback so the driver can
> customize
> it and do its own inflight counting".

So, I think there's a difference here between "impossible to implement
correctly", "likely to be implemented correctly", and "impossible to
implement incorrectly". It's obviously possible to implement
correctly. You can just always return true or do exactly the same
check or do some simple thing where you can guarantee that it will only
ever return false when there's a bunch of other stuff in the queue.
That doesn't mean that it's likely to be implemented correctly by some
other driver. Some idiot will come along and try to take advantage of
it and cause themselves horrible problems.
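
Just to illustrate what I mean by "the same check": here's a sketch
against the can_run_job hook as proposed in this RFC. Every
driver-side name in it is made up for the example, it's not anyone's
actual driver code.

    struct my_fw_queue {
            spinlock_t      lock;           /* the firmware queue lock */
            unsigned int    slots_used;     /* dropped on job completion,
                                             * before the done fence
                                             * signals */
            unsigned int    slots_max;
    };

    static bool my_can_run_job(struct drm_sched_job *sched_job)
    {
            struct my_fw_queue *q = to_my_fw_queue(sched_job->sched);
            unsigned long flags;
            bool ok;

            spin_lock_irqsave(&q->lock, flags);
            /* Only ever false while something is still in flight, and
             * that something's completion frees a slot and then signals
             * its done fence, which kicks the scheduler main loop and
             * re-runs this check. */
            ok = q->slots_used < q->slots_max;
            spin_unlock_irqrestore(&q->lock, flags);

            return ok;
    }

Written like that, it's basically hw_submission_limit with
driver-defined accounting, which is why it happens to work. The
problem is that nothing in the API forces the next driver to keep
those ordering guarantees.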

And, to be clear, for the purposes of this discussion, we're ALL
idiots, myself included. If there's one thing the DRM community has
learned over the years, it's that drivers are so complex that we all
turn into idiots at some point, relative to the complexity of the code
and hardware behavior. That's why things like dma_fence are written so
incredibly defensively and why we're so harsh about the rules. It's
the rules and not our individual smarts that keep us from making
mistakes. (Kinda like Rust, in a way.) So while I appreciate the
frustration of "I'm just trying to do something that's clearly correct
here", that doesn't mean that then next person to come by and fix a bug
by tweaking that callback isn't going to screw it up irreparably. That
person may even be you in 6 to 12 months after this e-mail thread is a
distant memory.

So, yes, does the implementation you have today work without deadlocks
or starvation? Maybe it does. I've not verified. Is the suggested
callback a giant foot-gun in the already treacherous territory of
scheduling and fencing? Yeah, it probably is and there's another way
to implement the same behavior which is likely safer in the long run.
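
For comparison, here's the prepare_job() shape of the same thing,
reusing the made-up struct from the sketch above (again, invented
names, not anyone's real code): instead of answering "not yet" out of
band, the driver hands the scheduler an actual dma_fence to wait on,
the done fence of the oldest in-flight job, taken under the same lock
as the resource check so that job can't complete in between.

    static struct dma_fence *
    my_prepare_job(struct drm_sched_job *sched_job,
                   struct drm_sched_entity *s_entity)
    {
            struct my_fw_queue *q = to_my_fw_queue(sched_job->sched);
            struct dma_fence *wait = NULL;
            unsigned long flags;

            spin_lock_irqsave(&q->lock, flags);
            if (q->slots_used >= q->slots_max)
                    /* Same critical section as the check, so the oldest
                     * in-flight job can't retire between the check and
                     * the dma_fence_get(). */
                    wait = dma_fence_get(my_oldest_inflight_fence(q));
            spin_unlock_irqrestore(&q->lock, flags);

            /* NULL means no extra dependency; if we return a fence, the
             * scheduler waits on it and then asks again. */
            return wait;
    }

The nice property is that the "wait for a resource" is now expressed
as a fence the rest of the stack can see and reason about, rather
than a side channel.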

> After this discussion, I can see that this is equivalent to doing the
> same check in prepare_job() followed by returning the oldest running
> job's fence (as long as there's no race there... it should be fine if
> the fence reference is taken first, before the resource check, or if
> everything is done within the same critical section taking the
> firmware
> queue lock), so I'm happy to switch to that and drop this patch.
>
> But keep in mind none of this is documented, and there's no way for
> us
> driver authors to understand what we're supposed to do without
> documentation. As I said I spent a long time trying to understand
> drm_sched, and then my original attempt missed the drm_sched_fini()
> issue with dangling jobs and Alyssa managed to hit an oops on the
> test
> branch, I guessed what the problem was from her trace, figured out a
> way
> to reproduce it (the kill-loop glmark2 thing), and fixed it in the
> next
> patch in this series. So even trying my best to figure out how to do
> this, reading the code and what scarce docs there are, I managed to
> miss
> something that caused a potential oops on the first try. If I can't
> even
> get the API usage right after spending hours on it trying really hard
> not to (because it's not just about my driver, I need the Rust
> abstraction to be safe for any driver), there's no way I'm going to
> divine what approaches to resource/dependency signaling are
> problematic/easy to abuse... the most I can hope for is "I got the
> wrapper right and the API/driver interaction is correct and
> guarantees
> forward progress if the driver follows the rules".

Your frustration with the lack of good documentation in DRM is entirely
justified. It's a mess and there's not a whole lot of people who
understand all these subtleties. Connecting to the hive mind via e-
mail and asking questions is the best you can do a lot of the time, I'm
afraid. I wish we had better documentation for a lot of things and I'd
be happy to see the situation improved, but we've got a lot of
debt there and not always a lot of time. (Yeah, I know, that's every
senior engineer's excuse...) We really are trying to be better about
it moving forward, though. Daniel has been pushing people to document
things a lot more in recent years. But, yeah, lots of debt...

Also, in a weird way, I think these conversations are sometimes better
than documentation. It took a while to get around to it all but
there's a lot of context that was brought together in this e-mail
thread that wouldn't have been in the docs no matter how good they are.
A lot of it isn't an isolated thing that should clearly be explained in
the run_job docs. It's subtle interactions which happen when all the
pieces come together. I see this complaint a lot about Vulkan as well.
There are behaviors which only become evident when you find the right 5
pieces of the spec and put them all together and squint. It'd be good
to call those out sometimes but there's no way we can document all of
them.

> So when I submit something, and you reply with "Well complete NAK",
> that's just not nice. Honestly, I was kind of upset when I got that
> email. It sounded as if you were saying my solution was completely
> broken and couldn't work, but no matter how I looked at it I couldn't
> figure out how it's broken. And then it took several emails to even
> understand what you were suggesting with the prepare_job callback
> (and
> yes, that works too and is probably harder to abuse than a new
> callback). I'm trying really hard to make this all work and be
> correct,
> and of course I make mistakes too... but then I look at the code and
> no
> matter what I can come up with it seems to work and be correct, what
> am
> I supposed to do? I'm happy to learn and figure out better approaches
> for everything that lead to better drivers, but I need an actual
> explanation of the issues, not just a NAK...
>
> I also would appreciate it if people give me the benefit of the doubt
> and let me explain what I'm doing and how I'm doing it and how this
> hardware works, because the whole thing is subtle to the core and
> very
> different to other GPUs. Honestly, I don't think any reviewer that
> hasn't spent hours poring over the driver/abstraction code could
> confidently say that a certain subtle sync issue exists at a first
> pass
> (other than for really obvious bad code sequences). I'm happy to look
> into issues and I definitely want to know what cases to look at and
> what
> to check for and fix anything we find... but isn't it better if we
> work
> together instead of shouting "this is broken" at the first hint of
> possible trouble?

Debating if I want to wade in on this one because this thread is
already getting a bit warm and I don't want to make it worse. But, I'm
an idiot, so...

Knowing what I do of both people in this thread, I think Christian is
giving you more benefit of the doubt than you realize. Yes, his tone
may be a bit abrupt but he continued to spend his time responding in
detail to every question you raised. That means he was taking you
seriously, even if he wasn't yielding ground.

Communication is hard, especially with all the different personalities,
languages, and cultures involved in an international community like
this. Sometimes the clarity of saying "no, this isn't going to work"
up-front is necessary. Sometimes the person on the other end of the e-
mail could benefit from a gentler response. It's hard to know from
early interactions. Enough people have been wrong about dma_fence over
the years (Hi! It's me!) that "no" is often the right starting
position. 😭️ It doesn't always feel great to be on the receiving end
of that but Christian is pretty much guarding a dragon cave, so...

To be clear, none of that is a defense of the toxicity for which the
Linux community has gotten a reputation. A lot of subsystem
maintainers have been known to start of with "no" to any idea they
didn't already think of. That's bad. Generally, you shouldn't assume
everyone but you is an idiot. When it comes to dma_fence, though, the
assumption is that we're ALL idiots and the "No, seriously, don't go
into the dragon cave. You won't come out alive. You're not that
special." signs are justified. 😓️

I hope the context I'm providing here is helpful. If not, feel free to
ignore me. It looks like you got the technical issues sorted.

~Faith
