Re: 2.2.9-ac2 locks solid

Ingo Molnar (mingo@chiara.csoma.elte.hu)
Mon, 7 Jun 1999 14:28:01 +0200 (CEST)


On Mon, 7 Jun 1999, Alan Cox wrote:

> > > to use GFP_ATOMIC instead and let me know
> >
> > this might as well explain the 'stuck on TLB IPI wait' bugs?
>
> Im still unhappy with the smp_tlb_flush stuff. When I did 2.0 I made the
> assumption that the kernel lock was held by the person who caused this. 2.2
> does make this assumption still, but it is not true 8(

i've re-checked that code, and i think the comment is wrong. It does not
matter whether the 'smp_invalidate_needed = cpu_online_map' is a simple
assignment (atomic but not coherent write), or an atomic-or (coherent
write), because even if in the first (current) case the write gets delayed
and doesn't get to the other CPU when the IPI hits that other CPU, when
that other CPU does the atomic_test_and_clear_bit() thing we will surely
get the write flushed to RAM and passed on to the IPI-executing-CPU.
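
to illustrate the argument above, here is a minimal user-space sketch using
C11 atomics. It is a model only, not the kernel code: invalidate_needed,
request_flush_store(), request_flush_or() and ipi_test_and_clear() are
hypothetical names standing in for smp_invalidate_needed, the two ways of
writing it, and the test-and-clear done in the IPI handler. The point it
shows is that the target side uses an atomic read-modify-write, which in
this model operates on the latest value of the variable regardless of
whether the initiator wrote it with a plain store or with an atomic OR.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for smp_invalidate_needed: one bit per CPU. */
static atomic_ulong invalidate_needed;

/* Initiator, variant A: simple assignment of the whole online map. */
static void request_flush_store(unsigned long online_map)
{
    atomic_store_explicit(&invalidate_needed, online_map,
                          memory_order_relaxed);
}

/* Initiator, variant B: atomic OR of the online map into the mask. */
static void request_flush_or(unsigned long online_map)
{
    atomic_fetch_or(&invalidate_needed, online_map);
}

/*
 * Target CPU, as run from its (modelled) IPI handler: atomically clear
 * our own bit and report whether it was set before the clear.
 */
static bool ipi_test_and_clear(unsigned int cpu)
{
    assert(cpu < sizeof(unsigned long) * CHAR_BIT);

    unsigned long bit = 1UL << cpu;
    unsigned long old = atomic_fetch_and(&invalidate_needed, ~bit);

    return (old & bit) != 0;
}

/* Bits still waiting to be acknowledged by their CPUs. */
static unsigned long flush_pending(void)
{
    return atomic_load(&invalidate_needed);
}
```

in this model the target sees its bit with either variant; the sketch says
nothing about real hardware write buffers, it only restates the reasoning
in code form.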

there are two scenarios where smp_flush_tlb() can hang and cause a 'stuck
TLB ...' message:

1) we 'miss' an IPI, i.e. either the IPI does get executed on the other CPU
but for some reason the clear_bit() does not reach the assignment we did.
I believe this is impossible due to MESI coherency guarantees. Or the IPI
is getting lost altogether - this is rather improbable given that the
system(s) in question are mostly P5s which explicitly check the APIC
status.

2) the IPI _hangs_, because the other CPU has IRQs disabled for some
reason. This frequently happens to me when i mess up some sort of
spinlocked code :) We do a spin_lock_irq(), which (rightfully) loops with
IRQs disabled, but for some reason it deadlocks - and so that CPU can never
execute the TLB IPI.

the debugging trace sent by George looks like case 2). Note that case 2)
is _likely_ to cause a stuck TLB IPI because the TLB flush is about the
only kernel thing that actively waits for another CPU to do something,
synchronously. So the 'TLB stuck...' message is just a side-effect, not
the problem itself.
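
case 2) can be sketched as a small single-threaded model, assuming a
made-up struct cpu_model and a made-up flush_tlb_model() function (neither
exists in the kernel). Each CPU has an 'IRQs enabled' flag; the initiator
marks an IPI pending on every other CPU and then spins, with a bound,
until all of them have acknowledged. A CPU that sits with IRQs disabled
never acknowledges, so the waiter is the one that reports the problem even
though the bug is on the other CPU.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the model. */
struct cpu_model {
    bool irqs_enabled;  /* false: CPU is spinning with IRQs off */
    bool ipi_pending;   /* a TLB-flush IPI has been sent to this CPU */
};

/*
 * Model of the synchronous wait: send an IPI to every CPU except 'self',
 * then spin at most 'max_spins' times waiting for acknowledgements.
 * Returns the mask of CPUs that never acknowledged (0 means success).
 */
static unsigned long flush_tlb_model(struct cpu_model cpus[], int ncpus,
                                     int self, long max_spins)
{
    unsigned long needed = 0;

    assert(ncpus > 0 && ncpus <= (int)(sizeof(unsigned long) * CHAR_BIT));

    for (int i = 0; i < ncpus; i++) {
        if (i == self)
            continue;
        needed |= 1UL << i;
        cpus[i].ipi_pending = true;
    }

    for (long spin = 0; needed != 0 && spin < max_spins; spin++) {
        for (int i = 0; i < ncpus; i++) {
            /* Only a CPU with IRQs enabled can take the IPI. */
            if (cpus[i].ipi_pending && cpus[i].irqs_enabled) {
                cpus[i].ipi_pending = false;
                needed &= ~(1UL << i);
            }
        }
    }

    return needed;  /* non-zero: the 'stuck TLB IPI' situation */
}
```

with both CPUs taking interrupts the function returns 0; with CPU 1 stuck
with IRQs disabled it returns a mask with bit 1 set, which corresponds to
the 'stuck' message being printed by the CPU that is merely waiting.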

-- mingo

ps. NMI TLB flushes will probably be necessary later on - although that
just hides the real bug.
