Re: [ghes_copy_tofrom_phys] BUG: sleeping function called from invalid context at mm/page_alloc.c:4150
From: Borislav Petkov
Date: Mon Oct 30 2017 - 07:05:41 EST
On Mon, Oct 30, 2017 at 12:18:35AM +0100, Fengguang Wu wrote:
> CC related developers for the BUG in v4.14-rc6.
>
> On Sun, Oct 29, 2017 at 11:51:55PM +0100, Fengguang Wu wrote:
> > Hi Linus,
> >
> > Up to now we see the below boot error/warnings when testing v4.14-rc6.
> >
> > They hit the RC release mainly due to various imperfections in 0day's
> > auto bisection. So I manually list them here and CC the likely easy to
> > debug ones to the corresponding maintainers in the followup emails.
> >
> > boot_successes: 4700
> > boot_failures: 247
> >
> > BUG:kernel_hang_in_test_stage: 152
> > BUG:kernel_reboot-without-warning_in_test_stage: 10
> > BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/mutex.c: 1
> > BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/rwsem.c: 3
> > BUG:sleeping_function_called_from_invalid_context_at_mm/page_alloc.c: 21
>
> Here is the dmesg fragment:
>
> [ 47.597981] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d34d96462, max_idle_ns: 440795289520 ns
> [ 48.626601] clocksource: Switched to clocksource tsc
> [ 49.273620] ERST: Error Record Serialization Table (ERST) support is initialized.
> [ 49.290288] pstore: using zlib compression
> [ 49.299588] pstore: Registered erst as persistent store backend
> [ 49.311408] BUG: sleeping function called from invalid context at mm/page_alloc.c:4150
> [ 49.312031] in_atomic(): 1, irqs_disabled(): 1, pid: 1, name: swapper/0
> [ 49.312031] CPU: 37 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc6 #1
> [ 49.312031] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [ 49.312031] Call Trace:
> [ 49.312031] dump_stack+0x63/0x86
> [ 49.312031] ___might_sleep+0xf1/0x110
> [ 49.312031] __might_sleep+0x4a/0x80
> [ 49.312031] __alloc_pages_nodemask+0x14e/0x270
> [ 49.312031] alloc_page_interleave+0x17/0x80
> [ 49.312031] alloc_pages_current+0xc8/0xe0
> [ 49.312031] __get_free_pages+0xe/0x40
> [ 49.312031] pte_alloc_one_kernel+0x15/0x20
> [ 49.312031] __pte_alloc_kernel+0x1d/0x100
> [ 49.312031] ioremap_page_range+0x330/0x3a0
> [ 49.312031] ghes_copy_tofrom_phys+0x182/0x2b0
> [ 49.312031] ghes_read_estatus+0x76/0x140
> [ 49.312031] ghes_proc+0x1c/0x130
> [ 49.312031] ghes_probe+0x157/0x430
> [ 49.312031] platform_drv_probe+0x3b/0xa0
> [ 49.312031] driver_probe_device+0x29c/0x450
> [ 49.312031] __driver_attach+0xdf/0xf0
> [ 49.312031] ? driver_probe_device+0x450/0x450
> [ 49.312031] bus_for_each_dev+0x60/0xa0
> [ 49.312031] driver_attach+0x1e/0x20
> [ 49.312031] bus_add_driver+0x170/0x260
> [ 49.312031] ? set_debug_rodata+0x17/0x17
> [ 49.312031] driver_register+0x60/0xe0
> [ 49.312031] __platform_driver_register+0x36/0x40
> [ 49.312031] ghes_init+0x10f/0x199
> [ 49.312031] ? bert_init+0x215/0x215
> [ 49.312031] do_one_initcall+0x43/0x170
> [ 49.312031] ? set_debug_rodata+0x17/0x17
> [ 49.312031] kernel_init_freeable+0x198/0x220
> [ 49.312031] ? rest_init+0xd0/0xd0
> [ 49.312031] kernel_init+0xe/0x101
> [ 49.312031] ret_from_fork+0x25/0x30
> [ 49.670116] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
> [ 49.691436] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [ 49.729954] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> [ 49.767235] Non-volatile memory driver v1.3
> [ 49.778363] Linux agpgart interface v0.103
Looks like Tyler broke it:

  77b246b32b2c ("acpi: apei: check for pending errors when probing GHES entries")

and it went into 4.13 and -stable.

Tyler, why is it so important to do the polling immediately upon
registration? Can't we wait until the polling does it?

--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)