Re: [Intel-gfx] [REGRESSION] Black screen after switching desktop session (was: Re: Linux 4.10-rc5)

From: Martin Steigerwald
Date: Sat Feb 11 2017 - 09:56:07 EST

On Wednesday, 1 February 2017, 14:11:22 CET, David Weinehall wrote:
> On Wed, Jan 25, 2017 at 01:10:26PM +0100, Martin Steigerwald wrote:
> > On Sunday, 22 January 2017, 13:32:08 CET, Linus Torvalds wrote:
> > > Things seem to be calming down a bit, and everything looks nominal.
> > >
> > > There's only been about 250 changes (not counting merges) in the last
> > > week, and the diffstat touches less than 300 files (with drivers and
> > > architecture updates being the bulk, but there's tooling, networking
> > > and filesystems in there too).
> > >
> > > So keep testing, and I think we'll have a regular release schedule.
> >
> > Testing this is no fun:
> >
> > Bug 99533 - black screen after switching session
> >
> >
> >
> > This after GPU hang/lockups with Kernel 4.9 reported as for example:
> >
> > Bug 98922 - [snb] GPU hang on PlaneShift
> >
> >
> > Which may be a duplicate of #98747, #98794, #98860, #98891, #98288.
> >
> >
> > I am back at kernel 4.8.15 as I need this machine for production work.
> >
> > Sometimes I wish for a microkernel that might be able to reincarnate
> > drivers that hang or do weird things like that. That may at least give a
> > way to actually do some debugging or even get the desktop session back
> > without losing its state. Especially for graphics drivers and
> > hibernating/resuming from hibernation, which also occasionally fails -
> > again without leaving a way to interact with the machine to do further
> > debugging. Linux kernel usually just crashes completely, not even a ping
> > or ssh possible, or it is at least stuck with a black display without any
> > way to restart the graphics driver because it seems to be in some undefined
> > state. Combined with occasionally happening bugs this makes triaging bugs
> > time consuming and risky. I do like to help testing, but maybe it's time
> > to just switch to distro kernels and be done with it, as I regularly
> > come across bugs that are too expensive for me to triage.
> >
> > Please understand that I am not willing to bisect these occasionally
> > happening bugs which have the potential to cause data loss due to having
> > to switch off the machine forcefully. Fortunately at least KMail saves a
> > mail I write from time to time and also Kate does swap files.
> >
> > I am also a bit unwilling to do further debugging of this one as I usually
> > use two sessions when I am at work and I risk losing data I work on.
> > But... at least with this issue it seems I would have a way to SSH into the
> > machine before kicking it.
> >
> >
> > I am dissatisfied with the state of the Intel graphics driver on this
> > ThinkPad T520 with Sandybridge since kernel 4.9 and wonder whether you
> > guys at Intel really test things with older hardware versions.
>
> Yes, we do. But for practical reasons we can only do testing for things
> that we actually have testcases for, and obviously we don't have the
> manpower to actually do *manual* testing on every platform, so issues
> for older platforms that are only triggered by manual interaction tend
> to slip under the radar.
>
> We have a testfarm that tests every nightly build on all platforms we
> have test machines for. The testcases are publicly available here:
>
> Obviously most of our manpower is spent on development and testing for
> current and future platforms, so for issues that involve older platforms,
> especially something as old as Sandybridge (which is, by now, 6 years old)
> we are happy for help with testing and bisection.
>
> If the issues are specific to certain subsets of a platform it obviously
> gets even more complex; it'd be a combinatorial nightmare to build a
> testfarm that could test every variation of every platform.
>
> If I got the count right the i915 driver supports around a hundred
> different varieties of Intel graphics; combine that with the number of
> different displays people connect, the number of eDP displays that the
> vendors connect, the different BIOSes that vendors use, etc., and I
> think you'll begin to see what we're combating -- to make things even
> more complex you can connect several displays to each graphics card
> (possibly via adapters), displays that don't always meet the standards
> that they claim to meet. Due to limited room we are also a bit limited
> when it comes to testing with multi-monitor setups.
>
> This is why any help is welcome and sometimes even necessary. If you're
> afraid of dataloss, be aware that it's possible to boot your system with
> file systems mounted read-only; you could also boot from a USB-stick or
> similar.
>
> If you can find a testcase in i-g-t that easily reproduces the issue
> that'd also be very helpful. Do note that not all testcases in i-g-t
> are run as part of our nightly tests, since some of them are *extremely*
> time consuming; the full combinatorial testcase, for instance, can
> take weeks or months--I haven't done a full run recently--to complete.
>
> I hope this helps you understand why bugs can slip under the radar,
> and why a bisect is so important.

Wow, David. Does that mean that even Intel cannot really test the driver for
the hardware it supports?

A bisect of a hang-the-machine bug that only happens after a certain time of
using the computer and switching between sessions is too expensive for
me. That's the whole point I tried to make. The *cost* to provide a *useful*
bug report is too high.

You say you can't cover this with a test case - I think switching between
sessions *could* be automated - and then you ask for help, yet to provide this
help an effort is needed that is beyond what I am willing to invest and which
IMHO is beyond what many users are willing to invest. See:

It would easily take 10-15 iterations as far as I remember from my last
bisect. And I'd either risk data loss *or* I'd use a live Linux, which means
that during that time I can't use the machine for productive work. *Each time*
it may take anything between 10 minutes and several hours for the issue to
appear. I'd need to reboot, compile the next kernel, either copy it to a USB
drive and boot from there or install it, and then repeat the testing steps.

I bet that would take me about one or two complete days to eventually find the
offending commit. I would not feel comfortable asking my employer for these
one to two days to do that work and my leisure time is also too valuable to me
and too full of other things to reduce it by that amount of time for every
bug like this.
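
The estimate above can be reproduced with a rough back-of-the-envelope
sketch. The commit counts and the per-step build and test times used below
are illustrative assumptions, not figures measured in this thread; the only
facts relied on are that a bisect needs roughly log2(N) steps for N candidate
commits and that each step costs one kernel build plus one test run.

```python
def bisect_steps(num_commits: int) -> int:
    """Approximate number of kernels to build and test when bisecting
    a range of num_commits commits (ceil(log2(num_commits)))."""
    if num_commits < 1:
        raise ValueError("num_commits must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    return (num_commits - 1).bit_length()


def bisect_hours(num_commits: int, build_hours: float, test_hours: float) -> float:
    """Total time if every step costs one build plus one test run."""
    return bisect_steps(num_commits) * (build_hours + test_hours)


if __name__ == "__main__":
    # Hypothetical ranges and timings, chosen only for illustration.
    for commits in (1000, 30000):
        steps = bisect_steps(commits)
        hours = bisect_hours(commits, build_hours=0.5, test_hours=1.0)
        print(f"{commits} commits: {steps} steps, about {hours:.1f} hours")
```

With these assumed numbers the step count lands in the 10-15 range mentioned
above, and the total grows quickly once a single test run takes hours rather
than minutes.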

Next week I hold a training. Since in this particular case with 4.10 it
appears - I didn't verify it - that "just" the gfx driver is broken, I might
be able to log in to the machine from a training workstation, so... I could at
least try to obtain some kind of gpu state dump, while I do most of my work on
the training workstation anyway. I remember Jani having said that backtraces
are mostly useless (then why bother to do them at all instead of just logging
"gfx driver crashed, use tool xyz to obtain debug info"?) and that there is a
new way to dump the state of the gpu when it is hung.

But a bisect of an issue like this is an effort that is exceeding what I am
willing to put into it. And I think I am not alone with it.

With other issues like hangs during resume and on waking up that happen
occasionally I have given up already. I don't even remember in which kernels
it started and it is even more costly to bisect. I actually don't even
remember whether it worked okay at all since I gave up on compiling TuxOnIce
into every kernel.

I am giving up here as well now, unless there is a way to provide you with
sufficient debug information without doing a bisect here, i.e. by a gpu state
dump or something like that.

Up to now on Linux there does not seem to be a gfx driver that either
*never* *ever* hangs, or at least manages to put out enough debug information,
if need be even onto a plain block device, in order to create a useful bug
report.

Added to that, with the development speed of one new kernel every three months
I see no realistic chance to keep the driver in a fully working state for the
hardware it supports. The effort to thoroughly bisect every nasty bug like this
would just be too high. If invested with every hang bug the current kernels
have... - I have seen 4 different issues in 4.9 and 4.10 *just* on this ThinkPad
T520 - it may even exceed the development time.

So until at some time the effort needed to provide a *useful* bug report can
be reduced, I am out. I am willing to put some hours into it, but not some
days for every single hangs-sometimes issue.

If you ask me, instabilities like this... like also the instabilities within
Plasma / KWin which were related to Intel driver bugs, are one reason why
Linux still is not yet ready for the desktop.