Re: futex funkiness -- massive lockups

From: Ingo Molnar
Date: Wed Mar 05 2014 - 04:01:28 EST



* Davidlohr Bueso <davidlohr@xxxxxx> wrote:

> Hi,
>
> A large number of lockups are seen on a 480-core system running some sort
> of database-like workload. All except one are soft lockups. This is a
> SLES11 system with most of the recent futex changes backported,
> including commits 63b1a816, b0c29f79, 99b60ce6, a52b89eb, 0d00c7b2,
> 5cdec2d8 and f12d5bfc.
>
> The following are some traces I put together in chronological order from
> the report I received. While the traces aren't perfect, I believe they
> exemplify the issue pretty well. There are a lot more, but they are just
> more of the same.
>
> [212046.044098] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 22
> [212046.044098] Pid: 312554, comm: XXX Tainted: GF D W N 3.0.101-0.15-default #1
> [212046.044098] Call Trace:
> [212046.044098] [<ffffffff81004935>] dump_trace+0x75/0x310
> [212046.044098] [<ffffffff8145e0b3>] dump_stack+0x69/0x6f
> [212046.044098] [<ffffffff8145e14c>] panic+0x93/0x201
> [212046.044098] [<ffffffff810c65e4>] watchdog_overflow_callback+0xb4/0xc0
> [212046.044098] [<ffffffff810f2d9a>] __perf_event_overflow+0xaa/0x230
> [212046.044098] [<ffffffff81018210>] intel_pmu_handle_irq+0x1a0/0x330
> [212046.044098] [<ffffffff81462ae1>] perf_event_nmi_handler+0x31/0xa0
> [212046.044098] [<ffffffff81464c37>] notifier_call_chain+0x37/0x70
> [212046.044098] [<ffffffff81464c7d>] __atomic_notifier_call_chain+0xd/0x20
> [212046.044098] [<ffffffff81464ccd>] notify_die+0x2d/0x40
> [212046.044098] [<ffffffff81462127>] default_do_nmi+0x37/0x200
> [212046.044098] [<ffffffff81462358>] do_nmi+0x68/0x80
> [212046.044098] [<ffffffff814618ad>] restart_nmi+0x1a/0x1e

Is this the end of the traceback, i.e. does the first anomalous lockup
show that the NMI interrupted user mode? If so, that's highly
unusual.
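
To make the question concrete, here is a minimal Python sketch that pulls
the symbol names out of the quoted trace lines and reports the outermost
frame that was printed. The helper name frame_symbols and the regular
expression are illustrative assumptions, not an existing tool. It only
shows that nothing is listed beneath restart_nmi; it cannot tell whether
the trace was cut short.

```python
import re

# Matches frames such as "[<ffffffff81462358>] do_nmi+0x68/0x80".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def frame_symbols(trace_text):
    """Return the symbol names of all frames, in the order they were printed."""
    return [m.group(2) for m in FRAME_RE.finditer(trace_text)]

# Last three frames of the trace quoted above.
TRACE = """\
[212046.044098] [<ffffffff81462127>] default_do_nmi+0x37/0x200
[212046.044098] [<ffffffff81462358>] do_nmi+0x68/0x80
[212046.044098] [<ffffffff814618ad>] restart_nmi+0x1a/0x1e
"""

symbols = frame_symbols(TRACE)
# Frames are printed innermost first, so the last entry is the outermost one.
outermost = symbols[-1]
print("outermost frame:", outermost)
```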

The 'GF D W' taint also suggests that there was something going on
before this triggered: 'W' suggests that something warned before, 'D'
suggests something died anomalously before and 'F' suggests a forced
or unsigned module.
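
As a rough illustration, the taint string can be decoded with a small
lookup table. This is a minimal Python sketch, not a kernel interface: the
table covers only the letters in this report, the wording is paraphrased
from the kernel's tainted-kernels documentation, and the entry for 'N' is
an assumption (it is not a mainline flag and is presumably specific to the
SUSE kernel). The name decode_taint is hypothetical.

```python
# Meanings for the taint letters seen in this report only.
TAINT_MEANINGS = {
    "G": "all loaded modules have a GPL-compatible license",
    "F": "a module was force-loaded",
    "D": "the kernel died earlier (OOPS or BUG)",
    "W": "a kernel warning was issued earlier",
    # Assumption: 'N' is not a mainline taint letter.
    "N": "vendor-specific flag (assumed SUSE-only)",
}

def decode_taint(flags):
    """Return (letter, meaning) pairs for each non-blank taint character."""
    decoded = []
    for ch in flags:
        if ch.isspace():
            continue  # blanks stand for taint bits that are not set
        decoded.append((ch, TAINT_MEANINGS.get(ch, "not covered by this sketch")))
    return decoded

if __name__ == "__main__":
    for letter, meaning in decode_taint("GF D W N"):
        print(f"{letter}: {meaning}")
```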

So even the earliest traces look like aftereffects of an earlier problem.

Thanks,

Ingo