Re: [BUG] long freezes on thinkpad t60

From: Linus Torvalds
Date: Thu Jun 21 2007 - 12:03:17 EST

Next message: Linus Torvalds: "Re: [PATCH] Alternative fix for kprobes&DEBUG_RODATA was Re: [1/2]2.6.22-rc5: known regressions with patches"
Previous message: Andreas Gruenbacher: "Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching"
In reply to: Jarek Poplawski: "Re: [BUG] long freezes on thinkpad t60"
Next in thread: Jarek Poplawski: "Re: [BUG] long freezes on thinkpad t60"

On Thu, 21 Jun 2007, Jarek Poplawski wrote:
>
> BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
> even sure it's necessarily the spinlock's problem (but I don't exclude
> this possibility yet). It seems both processors use task_rq_lock(), so
> there could be also a problem with that loop.

I agree that that is also a "loop getting a lock" kind of loop, but if
you can see the problem actually happening there, we have some other (and
major) bug.

That loop is indeed a loop, but realistically speaking it should never be
run through more than once. The only reason it's a loop is that there's a
very small race where we get the information at the wrong time, get the
wrong lock, and try again.

So the loop certainly *can* trigger (or it would be pointless), but I'd
normally expect it not to, and even if it does end up looping around it
should happen maybe *one* more time, absolutely not "get stuck" in the
loop for any appreciable number of iterations.
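
For readers who don't have the scheduler code in front of them, the shape
of the loop being discussed can be sketched in user space as below. This
is a minimal sketch, not the kernel's task_rq_lock(): the names
rq_sketch, task_sketch, rq_sketch_init and task_rq_lock_sketch are
hypothetical, a pthread mutex stands in for the runqueue spinlock, and
interrupt disabling is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a per-CPU runqueue. */
struct rq_sketch {
    pthread_mutex_t lock;
    int cpu;
};

/* Hypothetical stand-in for a task; rq changes only on migration. */
struct task_sketch {
    _Atomic(struct rq_sketch *) rq;
};

static void rq_sketch_init(struct rq_sketch *rq, int cpu)
{
    pthread_mutex_init(&rq->lock, NULL);
    rq->cpu = cpu;
}

/*
 * Lock the runqueue the task currently sits on and return it locked.
 * If the task migrated between reading p->rq and taking the lock, we
 * hold the wrong lock: drop it and try again. *retries (optional)
 * counts how many times that happened.
 */
static struct rq_sketch *task_rq_lock_sketch(struct task_sketch *p,
                                             int *retries)
{
    assert(p != NULL);
    for (;;) {
        struct rq_sketch *rq = atomic_load(&p->rq);

        pthread_mutex_lock(&rq->lock);
        if (rq == atomic_load(&p->rq))
            return rq;              /* still the right runqueue */
        pthread_mutex_unlock(&rq->lock);
        if (retries)
            (*retries)++;
    }
}
```

The retry branch is taken only when a migration lands in the short window
between the first read and the lock acquisition, which is why more than
one extra pass would be surprising.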

So I don't see how you could possibly have two different CPUs getting
into some lock-step in that loop: changing "task_rq()" is a really quite
heavy operation (it's about migrating between CPUs), and generally
happens at a fairly low frequency (ie "a couple of times a second" kind of
thing, not "tight CPU loop").

But bugs happen..

> Another possible problem could be a result of some wrong optimization
> or wrong propagation of change of this task_rq(p) value.

I agree, but that kind of bug would likely not cause temporary hangs, but
actual "the machine is dead" situations. If you get the totally *wrong*
value due to some systematic bug, you'd be waiting forever for it to
match, not get into a loop for half a second and then have it clear up..

But I don't think we can throw the "it's another bug" theory _entirely_
out the window.

That said, both Ingo and I have done "fairness" testing of hardware, and
yes, if you have a reasonably tight loop with spinlocks, existing hardware
really *is* unfair. Not by a "small amount" either - I had this program
that did statistics (and Ingo improved on it and had some other tests
too), and basically if you have two CPUs that try to get the same
spinlock, they really *can* get into situations where one of them gets it
millions of times in a row, and the other _never_ gets it.
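
The test programs mentioned above are not included in this message. As a
rough idea of what such a measurement could look like, here is a
user-space sketch under stated assumptions: a simple test-and-set
spinlock built on C11 atomic_flag (not the kernel's spinlock), two
pthreads, and hypothetical names throughout (demo_spin_lock,
fairness_stats, measure_fairness). It records how many acquisitions each
thread got and the longest run of back-to-back acquisitions by one
thread.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal test-and-set spinlock, for illustration only. */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

static void demo_spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&demo_lock,
                                             memory_order_acquire))
        ;   /* spin */
}

static void demo_spin_unlock(void)
{
    atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

/* All fields except limit are protected by demo_lock while running. */
struct fairness_stats {
    long limit;           /* total acquisitions across both threads */
    long total;           /* acquisitions so far */
    long count[2];        /* acquisitions per thread */
    int  last_owner;      /* previous acquirer, -1 before the first */
    long streak;          /* current run length of last_owner */
    long longest_streak;  /* longest run seen */
};

struct worker_arg {
    struct fairness_stats *st;
    int id;
};

static void *fairness_worker(void *p)
{
    struct worker_arg *a = p;
    struct fairness_stats *st = a->st;
    int done = 0;

    while (!done) {
        demo_spin_lock();
        if (st->total < st->limit) {
            st->total++;
            st->count[a->id]++;
            st->streak = (st->last_owner == a->id) ? st->streak + 1 : 1;
            st->last_owner = a->id;
            if (st->streak > st->longest_streak)
                st->longest_streak = st->streak;
        } else {
            done = 1;
        }
        demo_spin_unlock();   /* released, then re-acquired right away */
    }
    return NULL;
}

/* Runs two contending threads; returns 0 on success, -1 on error. */
static int measure_fairness(long limit, struct fairness_stats *st)
{
    pthread_t th[2];
    struct worker_arg args[2] = { { st, 0 }, { st, 1 } };

    st->limit = limit;
    st->total = 0;
    st->count[0] = st->count[1] = 0;
    st->last_owner = -1;
    st->streak = 0;
    st->longest_streak = 0;

    if (pthread_create(&th[0], NULL, fairness_worker, &args[0]) != 0)
        return -1;
    if (pthread_create(&th[1], NULL, fairness_worker, &args[1]) != 0) {
        pthread_join(th[0], NULL);
        return -1;
    }
    pthread_join(th[0], NULL);
    pthread_join(th[1], NULL);
    return 0;
}
```

A lopsided count[] or a very large longest_streak after a run would be
the kind of unfairness described above. The actual numbers depend on the
hardware and on how the threads are scheduled, so none are given here.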

But it only happens with badly coded software: the rule simply is that you
MUST NOT release and immediately re-acquire the same spinlock on the same
core, because as far as other cores are concerned, that's basically the
same as never releasing it in the first place.
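
To make the rule concrete, the sketch below contrasts a waiter that
breaks it with one possible way of respecting it. This is only an
illustration under assumptions, not the fix for the bug in this thread:
the names guarded_flag, wait_ready_hammering and wait_ready_backoff are
hypothetical, the lock is a C11 atomic_flag test-and-set lock, and
sched_yield() stands in for the kernel's cpu_relax().

```c
#include <sched.h>
#include <stdatomic.h>

/* A flag that writers update while holding the lock. */
struct guarded_flag {
    atomic_flag lock;
    atomic_int ready;
};

static void gf_init(struct guarded_flag *g)
{
    atomic_flag_clear(&g->lock);
    atomic_init(&g->ready, 0);
}

static void gf_lock(struct guarded_flag *g)
{
    while (atomic_flag_test_and_set_explicit(&g->lock,
                                             memory_order_acquire))
        ;   /* spin */
}

static void gf_unlock(struct guarded_flag *g)
{
    atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/*
 * Anti-pattern: unlock and immediately lock again on the same core.
 * Other cores may never see the lock as free. Returns the number of
 * lock acquisitions it took to observe ready != 0.
 */
static long wait_ready_hammering(struct guarded_flag *g)
{
    long acquisitions = 0;

    for (;;) {
        gf_lock(g);
        int ready = atomic_load(&g->ready);
        gf_unlock(g);
        acquisitions++;
        if (ready)
            return acquisitions;
    }
}

/*
 * One alternative: wait outside the lock until the flag looks set,
 * and only then take the lock to confirm it.
 */
static long wait_ready_backoff(struct guarded_flag *g)
{
    long acquisitions = 0;

    for (;;) {
        while (!atomic_load_explicit(&g->ready, memory_order_acquire))
            sched_yield();          /* lock stays free while waiting */
        gf_lock(g);
        int ready = atomic_load(&g->ready);
        gf_unlock(g);
        acquisitions++;
        if (ready)
            return acquisitions;
    }
}
```

In the second version the waiter touches the lock only when it has reason
to think the wait is over, so the thread that needs the lock to set the
flag is not competing with a tight unlock/lock cycle.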

Linus
