On 24.04.23 at 03:43, André Almeida wrote:
When a DRM job times out, the GPU is probably hung, and amdgpu has several
ways to deal with that, ranging from soft recoveries to a full device
reset. In any case, when userspace asks the kernel for the state of the
context (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device
was reset, regardless of whether a full reset happened or not.
However, amdgpu only marks a context guilty in the ASIC reset path. This
makes the report to userspace incomplete, because on the soft recovery path
the guilty context is not told that it is the guilty one.
Fix this by marking the context guilty for every type of reset when a
job times out.
The guilty handling is pretty much broken by design and only works because we go through multiple hops of validating the entity after the job has already been pushed to the hw.
I think we should probably just remove that completely and use an approach where we check the in-flight submissions in the query state IOCTL.
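As a rough illustration of that idea, here is a minimal standalone model in C. It is not amdgpu code: every type and function name below is hypothetical, and it assumes each context simply keeps a small array of per-submission statuses that the query walks, instead of relying on a guilty flag set at timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the real amdgpu types. */
enum sub_status {
    SUB_PENDING,        /* still running on the hardware */
    SUB_DONE,           /* completed normally */
    SUB_SOFT_RECOVERED, /* timed out, cleaned up by soft recovery */
    SUB_RESET,          /* timed out, cleaned up by a full reset */
};

enum ctx_state {
    CTX_OK,
    CTX_GUILTY,
};

#define MAX_TRACKED 16

/* Model of a context that remembers the status of its submissions. */
struct model_ctx {
    enum sub_status subs[MAX_TRACKED];
    size_t num_subs;
};

/*
 * Model of the query-state path: derive the answer from the tracked
 * submissions at query time. Any submission that timed out, whether it
 * was soft-recovered or needed a full reset, makes the context guilty.
 */
static enum ctx_state model_query_state(const struct model_ctx *ctx)
{
    assert(ctx != NULL && ctx->num_subs <= MAX_TRACKED);

    for (size_t i = 0; i < ctx->num_subs; i++) {
        enum sub_status s = ctx->subs[i];

        if (s == SUB_SOFT_RECOVERED || s == SUB_RESET)
            return CTX_GUILTY;
    }
    return CTX_OK;
}
```

In this model the state is computed on demand from what the context actually submitted, so no separate guilty marking has to be kept in sync across the different recovery paths.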
In addition, I currently don't consider soft-recovered submissions fatal and continue to accept submissions from that context, but I have been meaning to talk with Marek about that behavior.
Regards,
Christian.