Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

From: Asahi Lina
Date: Fri Mar 10 2023 - 04:58:43 EST


On 10/03/2023 04.59, Faith Ekstrand wrote:
> On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
>> On 09/03/2023 17.42, Christian König wrote:
>>> Long story short: Don't do this! This is what the Windows drivers
>>> have
>>> been doing and it creates tons of problems.
>
> Yeah, we tried to do a bit of that in the GL days. It was a bad idea.

I think I should clarify: I was proposing re-queueing innocent jobs from
innocent queues/VMs that were impacted by a fault. The reason is that we
may be able to tweak firmware state to force it to do that safely,
during the firmware recovery cycle, such that an aborted job restarts
and then subsequent jobs/commands continue as normal. We can't leave it
to userspace because if we do nothing, the affected job ends up
incomplete but then everything after it that is already queued still
runs, and that is definitely a recipe for a bigger mess if userspace
wants to seamlessly recover. The firmware recovery cycle is a
"stop-the-world" situation for the GPU (the firmware literally
busy-loops waiting for the driver to set a continue flag in memory...),
so that's the only real chance that the driver gets to make decisions
about what is going to happen next.

Of course, that only works if individual possibly concurrently running
commands are idempotent, but I think a lot of typical GPU work is? (E.g.
any render pass without side effects other than the render targets and
where the background shader does no loads, or even render passes that do
loads but where all draws are opaque, which are all things the current
Gallium driver is intimately familiar with since Crazy Tiler
Optimizations™ need that info to be provided anyway). So I was wondering
whether it'd make sense to have such an idempotency/restartable flag on
job submission, and then the driver would do its best to recover and
rerun it if it gets killed by an unrelated concurrent bad job.

Then again this all depends on an investigation into what we *can* do
during firmware recovery that hasn't happened at all yet. It might be
that it isn't safe to do anything really, or that doing things depends
on touching even deeper firmware state structs that we treat as opaque
right now and we really don't want to have to touch...

But maybe none of this is worth it in practice, it just sounded like it
could be useful maybe?

Now that I look at it, we have a lovely "what is this flag doing anyway"
bit already passed from Mesa through to the firmware we called
ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S which, now that I look at it, is
actually getting set when any attachment (any color, Z, S) is not being
cleared for that pass (so it's loaded). That could very well be an "is
not idempotent" flag... and maybe that means the firmware does this for
us already? Sounds like something to test... I might have some 16Kx16K
GLmark runs to do concurrent with an evil faulting job now ^^ (and then
that also means we need to set it when shaders have side effects and
stuff, which right now we don't).

>>> Just signal the problem back to userspace and let the user space
>>> driver
>>> decide what to do.
>>>
>>> The background is that most graphics applications (games etc..)
>>> then
>>> rather start on the next frame instead of submitting the current
>>> one
>>> again while compute applications make sure that the abort and tell
>>> the
>>> user that the calculations might be corrupted and need to be
>>> redone.
>
> The guarantee that Vulkan makes is that, if you idle the GPU and you
> haven't gotten a DEVICE_LOST yet, your data is good. If you get a
> DEVICE_LOST, all bets are off. The problem is that, no matter how fast
> the error propagation may be in the kernel or userspace driver, errors
> can still show up in strange ways. An OOB buffer access could end up
> modifying a shader binary which gets run 3 frames later and causes a
> corruption. Once you've faulted, you really have no idea how far back
> is good or what memory is corrupted. You have to assume that
> everything mapped to the GPU VA space is potentially toast.

Yes of course, for the actually faulting VM all bets are off after a
fault (though we can try a bit harder at least... I have a READ_ONLY BO
flag now, I should set it on the shader pools!).

>> Actually I wanted to ask about error notifications. Right now we have
>> an
>> out-of-band mechanism to provide detailed fault info to userspace
>> which
>> works fine, but in principle it's optional.
>
> This is fine, in principal. Because of the nature of errors, async is
> fine as long as the error shows up eventually. Faster is better, for
> sure, but error latency doesn't really matter in practice.
>
>> However, I also mark the hw
>>  fences as errored when a fault happens (with an errno that describes
>> the overall situation), but that never makes it into the drm_sched
>> job
>> complete fence. I looked at the drm_sched code and I didn't see any
>> error propagation. Is that supposed to work, or am I supposed to
>> directly mark the drm_sched side fence as complete, or did I
>> misunderstand all this? I get the feeling maybe existing drivers just
>> rely on the recovery/timeout/etc paths to mark jobs as errored (since
>> those do it explicitly) and never need error forwarding from the hw
>> fence?
>
> The end behavior needs to be that all fences for all jobs submitted to
> the queue get signaled. That's needed to satisfy the finite time
> guarantees of dma_fence. Exactly how that happens (let the job run,
> abort all the jobs, etc.) is an implementation detail for the driver to
> decide. If you want, you can also set a bit on the context (or queue)
> to mark it as dead and start returning EIO or similar from any ioctls
> trying to submit more work if you wanted. Not required but you can.

Fences have an error flag though, does that get reported to userspace
somehow? I thought it did, but maybe not, or maybe only drm_sched not
propagating it is the issue?

In other words, absent my fancy stats reporting BO system, what is the
normal way that an explicit sync driver signals to userspace that the
job associated with a syncobj has failed?

(If there is no way, then I'll probably want to change the stats BO
system to be configurable, so if you ask for no stats/time info, you
only get overall job status and faults, which has less overhead.)

~~ Lina