Re: [PATCH 1/2] drm: Create an app info option for wedge events

From: Raag Jadav
Date: Thu Mar 13 2025 - 07:08:23 EST


On Wed, Mar 12, 2025 at 06:59:33PM -0300, André Almeida wrote:
> Em 12/03/2025 07:06, Raag Jadav escreveu:
> > On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
> > > On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> > > > Em 01/03/2025 02:53, Raag Jadav escreveu:
> > > > > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > > > > Hi Raag,
> > > > > >
> > > > > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > > > > Cc: Lucas
> > > > > > >
> > > > > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > > > > When a device get wedged, it might be caused by a guilty application.
> > > > > > > > For userspace, knowing which app was the cause can be useful for some
> > > > > > > > situations, like for implementing a policy, logs or for giving a chance
> > > > > > > > for the compositor to let the user know what app caused the problem.
> > > > > > > > This is an optional argument, when `PID=-1` there's no information about
> > > > > > > > the app caused the problem, or if any app was involved during the hang.
> > > > > > > >
> > > > > > > > Sometimes just the PID isn't enough giving that the app might be already
> > > > > > > > dead by the time userspace will try to check what was this PID's name,
> > > > > > > > so to make the life easier also notify what's the app's name in the user
> > > > > > > > event.
> > > > > > > >
> > > > > > > > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
> > > >
> > > > [...]
> > > >
> > > > > > > > len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > > > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > > > > > > drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > > > > > > > "but recovered through reset" : "needs recovery");
> > > > > > > > + if (info) {
> > > > > > > > + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > > > > > > + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > > > > > > + } else {
> > > > > > > > + snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > > > > > > + snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > > > > > > + }
> > > > > > > This is not much use for wedge cases that needs recovery, since at that point
> > > > > > > the userspace will need to clean house anyway.
> > > > > > >
> > > > > > > Which leaves us with only 'none' case and perhaps the need for standardization
> > > > > > > of "optional telemetry collection".
> > > > > > >
> > > > > > > Thoughts?
> > > > > >
> > > > > > I had the feeling that 'none' was already meant to be used for that. Do you
> > > > > > think we should move to another naming? Given that we didn't reach the merge
> > > > > > window yet we could potentially change that name without much damage.
> > > > >
> > > > > No, I meant thoughts on possible telemetry data that the drivers might
> > > > > think is useful for userspace (along with PID) and can be presented in
> > > > > a vendor agnostic manner (just like wedged event).
> > > >
> > > > I'm not if I agree that this will only be used for telemetry and for the
> > > > `none` use case. As stated by Xaver, there's use case to know which app
> > > > caused the device to get wedged (like switching to software rendering) and
> > > > to display something for the user after the recovery is done (e.g. "The game
> > > > <app name> stopped working and Plasma has reset").
> > >
> > > Sure, but since this information is already available in coredump, I was
> > > hoping to have something like a standardized DRM level coredump with both
> > > vendor specific and agnostic sections, which the drivers can (and hopefully
> > > transition to) use in conjunction with wedged event to provide wider
> > > telemetry and is useful for all wedge cases.
> >
> > This is more useful because,
> >
> > 1. It gives drivers an opportunity to present the telemetry that they are
> > interested in without needing to add a new event string (like PID or APP)
> > for their case.
> >
> > 2. When we consider wedging as a usecase, there's a lot more that goes
> > into it than an application that might be behaving strangely. So a wider
> > telemetry is what I would hope to look at in such a scenario.
> >
>
> I agree that coredump is the way to go for telemetry, we already have the
> name and PID of the guilty app there, along with more information about the
> GPU state. But I don't think it should be consumed like an uAPI. Even if we
> wire up some common DRM code for that, I don't think we can guarantee the
> stability of it as we can do for an uevent. coredump can be disabled and by
> default is only accessible by root.

Hm, this made me curious about how a pid of a specific user will be dealt
with in a multi-user scenario. I know it's not a common scenario for the
usercase but setting the rules to avoid side-effects (or could there be
any?) might be a good idea.

> So I think that coredump is really good after the fact and if the user is
> willing to go ahead and report a bug somewhere. But for the goal of
> notifying the compositor, the same uevent that the compositor is already
> listening to will have everything they need to deal with this reset.

I agree that having to deal with coredump will be cumbersome for the
usecase. Although I'm still leaning towards the idea that we should
consider the room for new usecases that probably want to expose new data,
and having to add a new string each time might not be the best of the
approach.

But that's just my opinion and we should probably wait for wider feedback.

Raag