Re: [Intel-gfx] [PATCH] drm/i915: Remove instructions to file a bug report.

From: Daniel Vetter
Date: Wed Dec 07 2016 - 11:09:05 EST


On Mon, Dec 05, 2016 at 04:55:47PM -0800, Matt Turner wrote:
> On Sat, Dec 3, 2016 at 1:52 AM, Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx> wrote:
> > On Sat, 03 Dec 2016, Matt Turner <mattst88@xxxxxxxxx> wrote:
> >> From these instructions, users assume that /sys/class/drm/card0/error
> >> contains all the information a developer needs to diagnose and fix a GPU
> >> hang.
> >>
> >> In fact it doesn't, and we have no tools for solving them (other than
> >> stabbing in the dark). Most of the time the error state itself isn't
> >> even useful because it just shows a hang on a PIPE_CONTROL or similar.
> >>
> >> Until a time when the error state contains enough information to
> >> actually solve a hang, stop telling users to file unsolvable bugs, and
> >> instead rely on users who know where and how to file a good bug report
> >> to find their own way there.
> >>
> >> Signed-off-by: Matt Turner <mattst88@xxxxxxxxx>
> >> ---
> >> Maybe now's a good time to discuss what *would* be useful to put in the
> >> error state for debugging hangs. The currently executing shader program
> >> would be a great place to start.
> >
> > I'm wondering why we're getting this patch now, and my guess is that
> > it's because we have been reassigning the related bugs to Mesa more
> > actively lately. Is that the case?
>
> No, it's simply because I spent a week going through Bugzilla and
> realized how incomplete and unactionable the majority of GPU hang
> reports are.
>
> Asking users to report bugs, but not telling them what actually
> constitutes a bug report, is a recipe for a lot of wasted developer
> time.
>
> I suspect we could improve the usefulness of the reports by directing
> users to a webpage that gave a few suggestions (tell us what you were
> doing when the hang occurred would be an obvious one) about filing a
> bug and then provided a link to Bugzilla. Or even configured Bugzilla
> to have a default template that requested various bits of information.

I think dumping at least some of the aux buffers should make this tons
more useful for mesa, since it would reveal patterns like "we always die
on resolves on skl gt4". Thus far error states have been mostly used by
kernel folks to debug kernel issues, which is why none of that additional
stuff gets dumped.

But a bare-bones parser to hunt for indirect state base addresses and fish
out the aux stuff shouldn't be that hard, and might make this fully
useful.
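
Purely as a sketch of what such a scan could look like (not existing
code): the Python below assumes the batch has already been pulled out of
the error state as a list of 32-bit dwords, that STATE_BASE_ADDRESS is
identified by 0x6101 in the top 16 bits of its header dword, and that the
low 8 bits of the header hold the command length minus 2. The function
and constant names are made up for illustration.

```python
# Assumption: opcode field (upper 16 bits of the header dword) for
# STATE_BASE_ADDRESS. Verify against the hardware docs before relying on it.
STATE_BASE_ADDRESS_OPCODE = 0x6101


def find_state_base_address(dwords):
    """Linearly scan a batch for headers that look like STATE_BASE_ADDRESS.

    Returns a list of (index, payload) tuples, where payload is the list
    of dwords following the header. This is a naive scan: it does not
    follow command lengths, so data that happens to match the opcode
    pattern will show up as a false positive.
    """
    hits = []
    for i, dw in enumerate(dwords):
        if (dw >> 16) == STATE_BASE_ADDRESS_OPCODE:
            # Assumption: total command length = low 8 bits + 2.
            total_len = (dw & 0xFF) + 2
            hits.append((i, dwords[i + 1:i + total_len]))
    return hits
```

The payload dwords are where the base addresses would be read from; which
dword maps to which base differs between hardware generations, so that
part is left out here.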

Like Chris said, the goal is to at least be able to triage and classify
bugs, and I'm perfectly fine with merging additional code to the dumper to
get there for mesa folks. We z-compress the state, so size isn't really an
issue. And Ben has commit rights, so it shouldn't be a problem to get this
all merged.

> > IIUC the bug reports are useful for us when it's a kernel bug, but less
> > useful for you when it's a Mesa bug. And you'd rather have fewer
> > incoming bugs that you think are unsolvable with the information at
> > hand.
> >
> > Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed
> > out.
>
> Indeed. I'd rather have the information provided in a bug report to
> actually solve it. I hope having access to the shader program will
> make many more reports useful.
>
> I am also happy to see that there's now a sunset to the GPU hang message.

The other option is that mesa folks don't want error states that we triage
to mesa. We could definitely update the process in that area.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch