Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

From: Alex Deucher
Date: Fri Jun 30 2023 - 10:59:40 EST


On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
<sebastian.wick@xxxxxxxxxx> wrote:
>
> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@xxxxxxxxxx> wrote:
> >
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paalanen@xxxxxxxxxxxxx>
> > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@xxxxxxxxxx/
> >
> > Changes:
> > - Grammar fixes (Randy)
> >
> > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > 1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > mmapped regular files. Threads cause additional pain with signal
> > handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +section describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and for performing
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for a specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways for applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
>
> I still don't think it's good to let the kernel arbitrarily kill
> processes that it thinks are not well-behaved based on some heuristics
> and policy.
>
> Can't this be outsourced to user space? Expose the information about
> processes causing a device reset and let e.g. systemd deal with coming up
> with a policy and with killing stuff.

I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().
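
Roughly something like the following, as an untested sketch. The enum,
struct and function names are made up for illustration and do not come
from any real driver; how the UMD learns the reset status is
driver-specific, since there is no common DRM interface for it.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-context reset status, as a UMD might obtain it from a
 * driver-specific KMD query. */
enum reset_status {
    RESET_NONE,     /* no reset since the last query */
    RESET_INNOCENT, /* a reset happened, but this context did not cause it */
    RESET_GUILTY,   /* a job from this context caused the reset */
};

/* Hypothetical UMD-side context state. */
struct umd_context {
    bool robust; /* true if the app opted in to the API's robustness feature */
};

/* Policy only: true when the UMD would give up on the process, i.e. the
 * context is guilty and the app cannot be told about the reset. */
static bool umd_should_terminate(const struct umd_context *ctx,
                                 enum reset_status status)
{
    return status == RESET_GUILTY && !ctx->robust;
}

/* Would be called by the UMD before submitting new commands. */
static void umd_handle_reset(const struct umd_context *ctx,
                             enum reset_status status)
{
    if (umd_should_terminate(ctx, status)) {
        fprintf(stderr, "GPU reset caused by a non-robust context, exiting\n");
        exit(EXIT_FAILURE);
    }
    /* Otherwise the UMD reports the reset through the API's own error
     * path and lets the app recreate its contexts. */
}
```

The decision is all in userspace here; the kernel only reports what
happened to the context.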

Alex

>
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, given that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +the app needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +the first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
> > +
> > .. _drm_driver_ioctl:
> >
> > IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>