Re: inode->i_wb_list corruption.

From: Dave Airlie
Date: Thu Mar 15 2012 - 10:22:41 EST


On Thu, Mar 15, 2012 at 2:08 PM, Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> On Sat, 10 March 2012 02:00:15, Dave Jones wrote:
>> (trimmed cc)
>>
>> On Sat, Mar 10, 2012 at 12:14:37AM +0800, Yang Bai wrote:
>> > On Fri, Mar 9, 2012 at 11:19 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
>> > > And with that, this arrived..
>> > > https://bugzilla.redhat.com/show_bug.cgi?id=788433#c3
>> > >
>> > > I'm leaning strongly towards believing this is yet another case of
>> > > i915 corrupting memory on resume.
>> >
>> > Nice catch. I am wondering
>> > 1) why all lists being affected and
>> > 2) why all list_head's prev being set to NULL.
>> >
>> > Any ideas?
>>
>> This is probably the same bug:
>> https://bugzilla.kernel.org/show_bug.cgi?id=37142
>> Petr noticed that the corruption is 32 bytes getting zeroed at the
>> beginning of a page.
>>
>> I think this may be responsible for a lot of different bugs that we've
>> had reported.
>>
>> i915_drm_thaw is a deep nest of functions though, so this is going to be
>> hard to track down where that write is coming from. Because the corruption
>> seems to happen to pages that are already allocated, we probably can't
>> even rely on DEBUG_PAGEALLOC, though it might be worth trying.
>
> If you believe it could be written by the CPU, I can try to catch the
> instruction that writes to this memory. My plan is as follows:
>
> Set up all the hardware debug registers to trap writes to the pages that are
> likely to get corrupted. Remember, I've seen the corruption happen always
> roughly in the same physical memory area.
>
> I know, there are only 4 registers I can use, and the potential corruption
> area is much larger than 4 pages, but with enough reboots, the chance is quite
> high that I'll be lucky.
>
> I haven't gone for that plan yet, because I thought the area was in fact
> written to by someone else on the PCI bus, not the CPU. If nothing else, I can
> verify that. ;-)
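
For a rough feel of how many reboots that plan might take, here is a
back-of-the-envelope sketch. It assumes (my assumption, nothing measured)
that each resume corrupts one page picked uniformly from a fixed set of
candidate pages, and that each of the 4 debug registers watches the start
of a different candidate page. A debug register only covers a few bytes,
which should still be enough here since the zeroed 32 bytes start at the
beginning of the page. The function names are made up for illustration.

```python
import math


def catch_probability(candidate_pages, attempts, watchpoints=4):
    """Chance that at least one of `attempts` resumes hits a watched page.

    Assumes one corrupted page per resume, chosen uniformly at random
    from `candidate_pages` pages (an assumption, not a measured fact).
    """
    if candidate_pages <= 0 or attempts < 0:
        raise ValueError("need candidate_pages > 0 and attempts >= 0")
    per_attempt = min(1.0, watchpoints / candidate_pages)
    return 1.0 - (1.0 - per_attempt) ** attempts


def attempts_needed(candidate_pages, target, watchpoints=4):
    """Smallest number of resumes giving at least `target` catch chance."""
    if candidate_pages <= 0 or not 0.0 < target < 1.0:
        raise ValueError("need candidate_pages > 0 and 0 < target < 1")
    per_attempt = min(1.0, watchpoints / candidate_pages)
    if per_attempt >= 1.0:
        return 1  # every candidate page is watched
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_attempt))
```

Calling attempts_needed() with an estimate of the size of the suspect
area gives a number of suspend/resume cycles to plan for. If the
corruption is not uniform over that area, the real number will differ.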

It would be interesting to dump the GTT then and see where the pages
in which you see corruption are mapped in it. If they are in the fbcon
object, then that kinda proves the CPU writes them.
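
Something like the following could do the lookup once a dump exists. It
is only a sketch: it assumes the GTT dump has already been decoded into
a plain list of physical page addresses, one per 4 KiB GTT slot, and
that the GTT offset and size of the fbcon object are known from
elsewhere. The dump format, the function names and the example numbers
are all hypothetical.

```python
PAGE_SIZE = 4096


def gtt_offsets_for(phys_addr, gtt_pages):
    """Return the GTT offsets (in bytes) that map the page holding phys_addr.

    gtt_pages[i] is the physical address of the page behind GTT slot i
    (hypothetical, already-decoded dump format).
    """
    page = phys_addr & ~(PAGE_SIZE - 1)
    return [i * PAGE_SIZE for i, p in enumerate(gtt_pages) if p == page]


def in_object(gtt_offset, obj_start, obj_size):
    """True if gtt_offset lies inside an object bound at obj_start."""
    return obj_start <= gtt_offset < obj_start + obj_size


def corrupted_pages_in_object(corrupted, gtt_pages, obj_start, obj_size):
    """Subset of corrupted physical addresses that are mapped in the object."""
    return [
        addr
        for addr in corrupted
        if any(in_object(off, obj_start, obj_size)
               for off in gtt_offsets_for(addr, gtt_pages))
    ]
```

An empty result for the fbcon range would not rule out the GPU, it would
just mean the corrupted pages are not bound there at the time of the dump.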

Dave.