Re: [Bisected Regression in 2.6.32.8] i915 with KMS enabled causes memory corruption when resuming from suspend-to-disk

From: Rafael J. Wysocki
Date: Sat Mar 13 2010 - 15:02:56 EST


On Saturday 13 March 2010, M. Vefa Bicakci wrote:
> Hello,
>
> As you can guess from the subject, I have noticed that enabling the
> KMS feature of the i915 module with any kernel version after 2.6.32.7
> causes memory corruption after one resumes from suspend-to-disk.
>
> My hardware is a Toshiba Satellite A100, with an Intel graphics card.
> I am using an up-to-date version of Debian Sid. Here are the lspci
> entries for my graphics card:
>
> === 8< ===
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller [8086:27a2] (rev 03) (prog-if 00 [VGA controller])
> 00:02.1 Display controller [0380]: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller [8086:27a6] (rev 03)
> === >8 ===
>
> I have noticed that after upgrading from 2.6.32.7 to 2.6.32.9, I started
> to get a lot of segfaults from different programs when I resume from
> suspend-to-disk. After searching the Internet for this problem, I have
> seen that some other people also had it, and that it wasn't a new problem
> either:
>
> http://bbs.archlinux.org/viewtopic.php?id=91375
> https://bugzilla.redhat.com/show_bug.cgi?id=537494
> http://bugzilla.kernel.org/show_bug.cgi?id=13811
>
> Even though some people say that they have had this problem for a long time,
> I have only noticed it after upgrading to 2.6.32.9.
>
> After booting with "nomodeset" and confirming that the problem doesn't
> happen with that kernel option, I have determined that the problem was
> with i915.
>
> Then I used the following command to bisect the changes that i915 has
> seen between 2.6.32.7 and 2.6.32.9:
>
> git bisect start v2.6.32.9 v2.6.32.7 -- ./drivers/gpu/drm/
>
> With each iteration in the bisection, I have tried at least 3 cycles
> of suspend-to-disk and resume operations. I saw that all of the tried
> versions had memory corruption issues after resume from suspend-to-disk.
>
> Then, git told me that the culprit is the first change to i915 after the
> release 2.6.32.7. So 2.6.32.8 introduced the regression I am experiencing.
> Here's the "git bisect log" output:
>
> === 8< ===
> # bad: [7f5e918e62cbc9ac27c2f47d3c3dd4b86f67ff0e] Linux 2.6.32.9
> # good: [b4bdd73ce865213a5653dc424873e8da37e858cc] Linux 2.6.32.7
> git bisect start 'v2.6.32.9' 'v2.6.32.7' '--' './drivers/gpu/drm/'
> # bad: [192ff23a2206eb5136c779bfed73171a4d214ad6] drm/i915: Add HP nx9020/SamsungSX20S to ACPI LID quirk list
> git bisect bad 192ff23a2206eb5136c779bfed73171a4d214ad6
> # bad: [6240058ce3725f5e708e1c17c3a676217e44ba9b] drm/i915: disable hotplug detect before Ironlake CRT detect
> git bisect bad 6240058ce3725f5e708e1c17c3a676217e44ba9b
> # bad: [61d4374b51386dd40c03fd15df5a7f97347de688] drm/i915: Reload hangcheck timer too for Ironlake
> git bisect bad 61d4374b51386dd40c03fd15df5a7f97347de688
> # bad: [d8e0902806c0bd2ccc4f6a267ff52565a3ec933b] drm/i915: Selectively enable self-reclaim
> git bisect bad d8e0902806c0bd2ccc4f6a267ff52565a3ec933b
>
> d8e0902806c0bd2ccc4f6a267ff52565a3ec933b is the first bad commit
> commit d8e0902806c0bd2ccc4f6a267ff52565a3ec933b
> Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Date: Wed Jan 27 13:36:32 2010 +0000
>
> drm/i915: Selectively enable self-reclaim
>
> commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 upstream.
>
> Having missed the ENOMEM return via i915_gem_fault(), there are probably
> other paths that I also missed. By not enabling NORETRY by default these
> paths can run the shrinker and take memory from the system (but not from
> our own inactive lists because our shrinker can not run whilst we hold
> the struct mutex) and this may allow the system to survive a little longer
> whilst our drivers consume all available memory.
>
> References:
> OOM killer unexpectedly called with kernel 2.6.32
> http://bugzilla.kernel.org/show_bug.cgi?id=14933
>
> v2: Pass gfp into page mapping.
> v3: Use new read_cache_page_gfp() instead of open-coding.
>
> ...
> === >8 ===
>
> For the record, just to confirm that this commit is actually the culprit,
> I took a vanilla 2.6.32.9 source tree and reverted only this commit. I am
> happy to let you know that with this commit reverted, I can no longer
> reproduce the memory corruption issue.
>
> However, as I noted above, some people have had this problem for a longer
> time. So I am not sure if the commit above causes the bug or if it makes
> the bug easier to trigger.
>
> Finally, I would like to note that this regression is going to be important,
> because, as you know, Intel's X11 drivers are not going to support mode-setting
> in user mode starting with version 2.10.0.
>
> If there is any help I can provide in fixing this regression, please let me
> know. I am willing to try patches.

If I remember correctly, this has been fixed in the mainline, but I don't
remember the exact commit right now.

Chris, Jesse, can you please help?

Rafael