Re: [debug PATCHes] Re: smp_call_function_single lockups

From: Daniel J Blueman
Date: Tue Mar 31 2015 - 22:00:39 EST


On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote:
> On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> >
> > * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
> > > bit clear" seems to be a real issue.
> >
> > It's interesting in particular when it happens with an edge-triggered
> > interrupt source: it's much harder to miss level triggered IRQs, which
> > stay around until actively handled. Edge triggered irqs are more
> > fragile to loss of event processing.
> >
> > > > Anyway, maybe this sheds some more light on this issue. I can
> > > > reproduce this at will, so let me know of other experiments to do.
> >
> > Btw., could you please describe (again) what your current best method
> > for reproduction is? It's been a long discussion ...
> >
>
> Ingo,
>
> To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
> ( Although I've heard of others that have reproduced on other machines. )
>
> 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
> 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
> 3) Add the following to the L1 cmdline:
> nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
> 4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
>
> Sometimes this is sufficient to reproduce the issue. I've observed that running
> KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
> If this doesn't reproduce then you can do the following:
> 5) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between
> L1 vCPUs until the hang occurs.
>
> I attempted to write a module that used smp_call_function_single calls to
> trigger IPIs but have been unable to create a simpler reproducer.

A non-intrusive way of generating a lot of IPIs is to call
stop_machine() repeatedly via something like:

while :; do
    echo "base=0x20000000000 size=0x8000000 type=write-back" >/proc/mtrr
    echo "disable=4" >| /proc/mtrr
done

Of course, ensure base is above DRAM and any 64-bit MMIO so that there
are no side effects, and ensure the new range will be entry 4, since
that is the entry the loop disables. Onlining and offlining cores in
parallel will also generate IPIs.
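
As a rough sketch of the two points above: the first helper guesses
which entry a new range will get, and the second onlines and offlines
one core through its sysfs 'online' file. The names next_free_mtrr and
toggle_cpu are made up for illustration, and SYSFS_CPU_ROOT exists only
so the loop can be dry-run against a scratch directory. The guess
assumes the kernel hands out the lowest-numbered unused register and
that /proc/mtrr lists only the registers in use, so check the file
after the first write rather than trusting it.

```shell
#!/bin/bash

# Print the lowest register number that does not appear in an MTRR listing.
# Assumption: lines look like "reg00: base=... size=... : write-back" and
# unused registers are simply absent from the listing.
next_free_mtrr() {
    local file="${1:-/proc/mtrr}"
    awk '
        /^reg[0-9]+:/ { used[substr($1, 4, length($1) - 4) + 0] = 1 }
        END { n = 0; while (n in used) n++; print n }
    ' "$file"
}

# Offline and re-online one CPU a fixed number of times.
# SYSFS_CPU_ROOT is a hypothetical override used for dry runs.
toggle_cpu() {
    local cpu="$1" rounds="$2"
    local ctl="${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}/cpu${cpu}/online"
    local i
    [ -w "$ctl" ] || { echo "cannot write $ctl" >&2; return 1; }
    for ((i = 0; i < rounds; i++)); do
        echo 0 > "$ctl"
        echo 1 > "$ctl"
    done
}
```

Running something like 'toggle_cpu 1 1000 & toggle_cpu 2 1000 & wait'
as root would hotplug two cores in parallel. cpu0 often has no 'online'
file, so start from cpu1.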

Dan
--
Daniel J Blueman