Re: [Intel-gfx] GPU hang with kernel 4.10rc3
From: Juergen Gross
Date: Fri May 12 2017 - 00:54:36 EST

On 11/05/17 23:08, Pavel Machek wrote:
> On Mon 2017-01-23 10:39:27, Juergen Gross wrote:
>> On 13/01/17 15:41, Juergen Gross wrote:
>>> On 12/01/17 10:21, Chris Wilson wrote:
>>>> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
>>>>> On 11/01/17 18:08, Chris Wilson wrote:
>>>>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
>>>>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
>>>>>>>
>>>>>>> [ 49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell
>>>>>>> [1431], reason: Hang on render ring, action: reset
>>>>>>> [ 49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire
>>>>>>> gfx stack, including userspace.
>>>>>>> [ 49.213700] [drm] Please file a _new_ bug report on
>>>>>>> bugs.freedesktop.org against DRI -> DRM/Intel
>>>>>>> [ 49.213700] [drm] drm/i915 developers can then reassign to the right
>>>>>>> component if it's not a kernel issue.
>>>>>>> [ 49.213700] [drm] The gpu crash dump is required to analyze gpu
>>>>>>> hangs, so please always attach it.
>>>>>>> [ 49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>>>>>>> [ 49.213755] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 60.213769] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 71.189737] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 82.165747] drm/i915: Resetting chip after gpu hang
>>>>>>> [ 93.205727] drm/i915: Resetting chip after gpu hang
>>>>>>>
>>>>>>> The dump is attached.
>>>>>>
>>>>>> That's a nasty one. The first couple of pages of the batchbuffer appear
>>>>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
>>>>>> may be a concurrent write by either the GPU or CPU, or we may have
>>>>>> incorrectly mapped a set of pages. That it doesn't recover suggests
>>>>>> that the corruption occurs frequently, probably on every request/batch.
>>>>>
>>>>> I hoped someone would have an idea already.
>>>>
>>>> Sorry, first report of something like this in a long time (that I can
>>>> remember at least). And the problem is that it can be anything from a
>>>> coherency to a concurrency issue, so no one patch springs to mind.
>>>> Thankfully it appears to be kernel related.
>>>> -Chris
>>>>
>>>
>>> Bisecting took longer than I thought, but I had to cherry pick some
>>> patches and rebase one of them multiple times...
>>>
>>> Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
>>> Introduce an internal allocator for disposable private objects")
>>>
>>> In case you need me to produce some more data or test a patch
>>> feel free to reach out.
>>
>> Anything new for this severe regression?
>>
>> Without a fix 4.10 will be unusable with Xen on a machine with i915
>> graphics!
>
> Did this get solved?

Yes. Commit 7152187159193056f30ad5726741bb25028672bf.

Juergen