Re: Soft lockup during suspend since ~2.6.36 [bisected]

From: Thilo-Alexander Ginkel
Date: Thu Apr 14 2011 - 08:25:22 EST


On Wed, Apr 6, 2011 at 08:03, Thilo-Alexander Ginkel <thilo@xxxxxxxxxx> wrote:
> On Wed, Apr 6, 2011 at 01:28, Arnd Bergmann <arnd@xxxxxxxx> wrote:
>> On Tuesday 05 April 2011, Thilo-Alexander Ginkel wrote:
>>> Thanks, that worked pretty well. Eleven bisect builds later, I have
>>> now identified the following candidate commit, which may have
>>> introduced the bug:
>>>
>>> dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
>>> commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
>>> Author: Tejun Heo <tj@xxxxxxxxxx>
>>> Date: Tue Jun 29 10:07:14 2010 +0200
>>
>> Sorry, but looking at the patch shows that it can't possibly have introduced
>> the problem, since all the code that is modified in it is new code that
>> is not even used anywhere at that stage.
>>
>> As far as I can tell, you must have hit a false positive or a false negative
>> somewhere in the bisect.
>
> Well, you're right. I hit "Reply" too early and should have paid
> closer attention to which change the bisect actually brought up.
>
> I already found a false negative (fortunately pretty close to the end
> of the bisect sequence) and also verified the preceding good commits,
> which gives me two new commits to test. I'll provide an update once
> the builds and tests are through, which may however take until early
> next week as I will be on vacation until then.

All right... I re-verified all of my bisect tests and found yet
another mis-classified result. After correcting that one (and
confirming the correctness of the remaining tests), git bisect came up
with a commit that makes considerably more sense:

| e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
| commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
| Author: Tejun Heo <tj@xxxxxxxxxx>
| Date: Tue Jun 29 10:07:14 2010 +0200
|
| workqueue: implement concurrency managed dynamic worker pool
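For reference, the procedure was just the stock git bisect loop; a
minimal sketch (the endpoint versions are examples matching "since
~2.6.36", and the per-commit check is whatever manual or scripted
suspend test you run):

  # Mark known-good and known-bad endpoints; git then picks commits
  # to build and test, halving the search space each round.
  git bisect start
  git bisect bad v2.6.36
  git bisect good v2.6.35

  # After building and suspend-testing each commit git checks out:
  git bisect good   # the suspend/resume cycle survived
  git bisect bad    # the soft lockup appeared

  # Review the recorded answers for mistakes like my false negative:
  git bisect log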

The good news is that I can reproduce the issue within a KVM virtual
machine, so I can test for the soft lockup (which looks somewhat like
a race condition during worker / CPU shutdown) in a mostly automated
fashion. Unfortunately, that also means the issue is not hardware
specific, i.e., it most probably affects all SMP systems (with a
probability that varies with the number of CPUs).
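The automated check inside the VM is, roughly, the following sketch
(the rtcwake interval and the iteration count are arbitrary choices,
not anything the bug depends on):

  #!/bin/sh
  # Suspend to RAM repeatedly, waking via the RTC after 20 seconds,
  # and scan the kernel log for the soft-lockup warning each time.
  for i in $(seq 1 50); do
      rtcwake -m mem -s 20 || break
      sleep 5
      if dmesg | grep -q "soft lockup"; then
          echo "soft lockup after $i suspend cycles"
          break
      fi
  done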

Adding some further details about my configuration (which I
replicated in the VM); a sketch for setting up the same stack follows
the list:
- LVM running on top of
- dm-crypt (LUKS) running on top of
- md RAID1
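In case someone wants to replicate the stack, it boils down to the
following sketch (device names and sizes are placeholders, not my
actual layout):

  # md RAID1 across two (virtual) disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/vda /dev/vdb

  # dm-crypt (LUKS) on top of the RAID
  cryptsetup luksFormat /dev/md0
  cryptsetup luksOpen /dev/md0 cryptoroot

  # LVM on top of the crypto device
  pvcreate /dev/mapper/cryptoroot
  vgcreate vg0 /dev/mapper/cryptoroot
  lvcreate -L 4G -n root vg0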

If anyone is interested in getting hold of this VM for further tests,
let me know and I'll try to figure out how to get it (2 * 8 GB,
barely compressible due to dm-crypt) to its recipient.

Regards,
Thilo