Re: [BUG] long freezes on thinkpad t60
From: Jarek Poplawski
Date: Fri Jun 22 2007 - 06:30:38 EST
On Thu, Jun 21, 2007 at 09:01:28AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 21 Jun 2007, Jarek Poplawski wrote:
...
> So I don't see how you could possibly having two different CPU's getting
> into some lock-step in that loop: changing "task_rq()" is a really quite
> heavy operation (it's about migrating between CPU's), and generally
> happens at a fairly low frequency (ie "a couple of times a second" kind of
> thing, not "tight CPU loop").
Yes, I've agreed with Ingo it was only one of my illusions...
> But bugs happen..
>
> > Another possible problem could be a result of some wrong optimization
> > or wrong propagation of change of this task_rq(p) value.
>
> I agree, but that kind of bug would likely not cause temporary hangs, but
> actual "the machine is dead" operations. If you get the totally *wrong*
> value due to some systematic bug, you'd be waiting forever for it to
> match, not get into a loop for a half second and then it clearing up..
>
> But I don't think we can throw the "it's another bug" theory _entirely_
> out the window.
Alas, I'm the last here who should talk with you or Ingo about
hardware, but my point is that until it's not 100% proven this
is spinlocks vs. cpu case any nearby possibilities should be
considered. One of them is this loop, which can ... loop. Of
course we can't see any reason for this, but if something is
theoretically allowed it can happen. Here it's enough the
task_rq(p) is for some very unprobable reason (maybe buggy
hardware, code or compiling) cached or read too late, and
maybe it's only on this one only notebook in the world? If
you're sure this is not the case, let's forget about this.
Of course, I don't mean it's a direct optimization. I think
about something that could be triggered by something like this
tested smp_mb() in entirely another place. IMHO, it would be
very interesting to assure if this barrier fixed the spinlock
or maybe some variable.
There is also another interesting question: if it was only about
spinlocks (which may be) why on those watchdog traces there
are mainly two "fighters": wait_task_inactive() and
try_to_wake_up(). It seems there should be seen more clients
of task_rq_lock(). So, my another unlikely idea is: maybe
for some other strange, unprobable but only theoretically
possibility there could be something wrong around this
yield() in wait_task_inactive(). The comment above this
function reads about the task that *will* unschedule.
So, maybe it would be not nice if e.g. during this yield()
something wakes it up?
...
> But it only happens with badly coded software: the rule simply is that you
> MUST NOT release and immediately re-acquire the same spinlock on the same
> core, because as far as other cores are concerned, that's basically the
> same as never releasing it in the first place.
I totally agree! Let's only get certainty there is nothing
more here. BTW, is there any reason not to add some test with
a warning under spinlock debugging against such badly behaving
places?
Regards,
Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/