Re: [BUG] long freezes on thinkpad t60

From: Linus Torvalds
Date: Thu Jun 21 2007 - 13:33:50 EST




On Thu, 21 Jun 2007, Chuck Ebbert wrote:
>
> A while ago I showed that spinlocks were a lot more fair when doing
> unlock with the xchg instruction on x86. Probably the arbitration is all
> screwed up because we use a mov instruction, which while atomic is not
> locked.

No, the cache line arbitration doesn't know anything about "locked" vs
"unlocked" instructions (it could, but there really is no point).

The real issue is that locked instructions on x86 are serializing, which
makes them extremely slow (compared to just a write), and then by being
slow they effectively just make the window for another core bigger.

IOW, it doesn't "fix" anything, it just hides the bug with timing.

You can hide the problem other ways by just increasing the delay between
the unlock and the lock (and adding one or more serializing instruction in
between is generally the best way of doing that, simply because otherwise
micro- architecture may just be re-ordering things on you, so that your
"delay" isn't actually in between any more!).

But adding delays doesn't really fix anything, of course. It makes things
"fairer" by making *both* sides suck more, but especially if both sides
are actually the same exact thing, I could well imagine that they'd both
just suck equally, and get into some pattern where they are now both
slower, but still see exactly the same problem!

Of course, as long as interrupts are on, or things like DMA happen etc,
it's really *really* hard to be totally unlucky, and after a while you're
likely to break out of the steplock on your own, just because the CPU's
get interrupted by something else.

It's in fact entirely possible that the long freezes have always been
there, but the NOHZ option meant that we had much longer stretches of time
without things like timer interrupts to jumble up the timing! So maybe the
freezes existed before, but with timer interrupts happening hundreds of
times a second, they weren't noticeable to humans.

(Btw, that's just _one_ theory. Don't take it _too_ seriously, but it
could be one of the reasons why this showed up as a "new" problem, even
though I don't think the "wait_for_inactive()" thing has changed lately.)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/