Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

From: Rafael J. Wysocki
Date: Mon Oct 10 2022 - 10:48:35 EST


Hi Mel,

Thanks for the report!

On Mon, Oct 10, 2022 at 4:25 PM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Rafael,
>
> I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
> Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer exception
> early in boot. It fails to boot 5 times after 10 boot attempts and I've
> only observed it on one machine so far. Either a revert or the patch below
> fixes it but it's unlikely it is the correct fix.
>
> --- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
> +++ drivers/rtc/rtc-cmos.c 2022-10-10 15:11:53.211756691 +0200
> @@ -1209,7 +1209,7 @@
> * Or else, ACPI SCI is enabled during suspend/resume only,
> * update rtc irq in that case.
> */
> - if (cmos_use_acpi_alarm())
> + if (cmos_use_acpi_alarm() && cmos)
> cmos_interrupt(0, (void *)cmos->rtc);
> else {
> /* Fix me: can we use cmos_interrupt() here as well? */

It looks like I've exposed a race condition there.

Generally speaking, it is misguided to install an event handler that
is not ready to handle the event at that time before making sure that
the event is disabled.

Does the attached patch help?

>
> Boot failure looks like the below, it's not a vanilla kernel but the
> applied patch is not relevant and it's known to fail on a vanilla kernel.
> The machine has a E5-2698 v4 CPU plugged into a SGI C2112-4GP3 platform
> with a X10DRT-P-Series motherboard.
>
> [ 10.924167][ C1] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 10.928016][ C1] #PF: supervisor read access in kernel mode
> [ 10.928016][ C1] #PF: error_code(0x0000) - not-present page
> [ 10.928016][ C1] PGD 0 P4D 0
> [ 10.928016][ C1] Oops: 0000 [#1] PREEMPT SMP PTI
> [ 10.928016][ C1] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.0.0-mm-pcpnoirq-v1r2 #1 6debc4647ebcbe3e91270f1109aebc1e85510e3e
> [ 10.928016][ C1] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> [ 10.928016][ C1] RIP: 0010:rtc_handler+0x73/0xd0
> [ 10.928016][ C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
> [ 10.928016][ C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
> [ 10.928016][ C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
> [ 10.928016][ C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
> [ 10.928016][ C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
> [ 10.928016][ C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
> [ 10.928016][ C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
> [ 10.928016][ C1] FS: 0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
> [ 10.928016][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 10.928016][ C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
> [ 10.928016][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 10.928016][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 10.928016][ C1] Call Trace:
> [ 10.928016][ C1] <IRQ>
> [ 10.928016][ C1] acpi_ev_fixed_event_detect+0x14a/0x18c
> [ 10.928016][ C1] acpi_ev_sci_xrupt_handler+0x2c/0x6e
> [ 10.928016][ C1] acpi_irq+0x18/0x40
> [ 10.928016][ C1] __handle_irq_event_percpu+0x3e/0x2d0
> [ 10.928016][ C1] handle_irq_event_percpu+0xf/0x40
> [ 10.928016][ C1] handle_irq_event+0x34/0x60
> [ 10.928016][ C1] handle_fasteoi_irq+0x7b/0x140
> [ 10.928016][ C1] __common_interrupt+0x4b/0x100
> [ 10.928016][ C1] common_interrupt+0x58/0xa0
> [ 10.928016][ C1] </IRQ>
> [ 10.928016][ C1] <TASK>
> [ 10.928016][ C1] asm_common_interrupt+0x22/0x40
> [ 10.928016][ C1] RIP: 0010:cmos_wake_setup.part.9+0x2f/0x120
> [ 10.928016][ C1] Code: 80 3d 65 16 4a 01 00 53 48 89 fb 0f 84 a5 00 00 00 48 89 da 48 c7 c6 00 0c 6d ad bf 04 00 00 00 e8 53 b8 e7 ff bf 04 00 00 00 <e8> 98 c6 e7 ff 31 f6 bf 04 00 00 00 e8 fd c8 e7 ff 0f b6 0d 34 ce
> [ 10.928016][ C1] RSP: 0000:ffffaf7f800d7ca8 EFLAGS: 00000246
> [ 10.928016][ C1] RAX: 0000000000000000 RBX: ffff94049801a000 RCX: 0000000000000004
> [ 10.928016][ C1] RDX: ffffffffadefef10 RSI: ffffffffadefee20 RDI: 0000000000000004
> [ 10.928016][ C1] RBP: ffffffffaeaf98a0 R08: 0000000000000000 R09: 0000000000000000
> [ 10.928016][ C1] R10: 0000000000000000 R11: 000000000000000a R12: ffffffffad6d1750
> [ 10.928016][ C1] R13: 0000000000000000 R14: ffff93c5111191a0 R15: ffffffffaefe47f8
> [ 10.928016][ C1] ? rdinit_setup+0x2f/0x2f
> [ 10.928016][ C1] ? cmos_do_probe+0x570/0x570
> [ 10.928016][ C1] ? cmos_wake_setup.part.9+0x2a/0x120
> [ 10.928016][ C1] cmos_pnp_probe+0x6c/0xa0
> [ 10.928016][ C1] pnp_device_probe+0x5b/0xb0
> [ 10.928016][ C1] ? driver_sysfs_add+0x75/0xe0
> [ 10.928016][ C1] really_probe+0x109/0x3e0
> [ 10.928016][ C1] ? pm_runtime_barrier+0x4f/0xa0
> [ 10.928016][ C1] __driver_probe_device+0x79/0x170
> [ 10.928016][ C1] driver_probe_device+0x1f/0xa0
> [ 10.928016][ C1] __driver_attach+0x11e/0x180
> [ 10.928016][ C1] ? __device_attach_driver+0x110/0x110
> [ 10.928016][ C1] bus_for_each_dev+0x79/0xc0
> [ 10.928016][ C1] bus_add_driver+0x1ba/0x250
> [ 10.928016][ C1] ? rtc_dev_init+0x34/0x34
> [ 10.928016][ C1] driver_register+0x5f/0x100
> [ 10.928016][ C1] ? rtc_dev_init+0x34/0x34
> [ 10.928016][ C1] cmos_init+0x12/0x70
> [ 10.928016][ C1] do_one_initcall+0x5b/0x310
> [ 10.928016][ C1] ? rcu_read_lock_held_common+0xe/0x50
> [ 10.928016][ C1] ? rcu_read_lock_sched_held+0x23/0x80
> [ 10.928016][ C1] kernel_init_freeable+0x2b7/0x319
> [ 10.928016][ C1] ? rest_init+0x1b0/0x1b0
> [ 10.928016][ C1] kernel_init+0x16/0x140
> [ 10.928016][ C1] ret_from_fork+0x22/0x30
> [ 10.928016][ C1] </TASK>
> [ 10.928016][ C1] Modules linked in:
> [ 10.928016][ C1] CR2: 0000000000000000
> [ 10.928016][ C1] ---[ end trace 0000000000000000 ]---
> [ 10.928016][ C1] RIP: 0010:rtc_handler+0x73/0xd0
> [ 10.928016][ C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
> [ 10.928016][ C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
> [ 10.928016][ C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
> [ 10.928016][ C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
> [ 10.928016][ C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
> [ 10.928016][ C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
> [ 10.928016][ C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
> [ 10.928016][ C1] FS: 0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
> [ 10.928016][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 10.928016][ C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
> [ 10.928016][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 10.928016][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 10.928016][ C1] Kernel panic - not syncing: Fatal exception in interrupt
> [ 10.928016][ C1] Kernel Offset: 0x2be00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 10.928016][ C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
> --
> Mel Gorman
> SUSE Labs
---
drivers/rtc/rtc-cmos.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

Index: linux-pm/drivers/rtc/rtc-cmos.c
===================================================================
--- linux-pm.orig/drivers/rtc/rtc-cmos.c
+++ linux-pm/drivers/rtc/rtc-cmos.c
@@ -1233,13 +1233,12 @@ static u32 rtc_handler(void *context)

static inline void rtc_wake_setup(struct device *dev)
{
- acpi_install_fixed_event_handler(ACPI_EVENT_RTC, rtc_handler, dev);
/*
- * After the RTC handler is installed, the Fixed_RTC event should
- * be disabled. Only when the RTC alarm is set will it be enabled.
+ * Disable the Fixed_RTC event before installing the RTC handler. It
+ * can be enabled only when the RTC alarm is set.
*/
- acpi_clear_event(ACPI_EVENT_RTC);
acpi_disable_event(ACPI_EVENT_RTC, 0);
+ acpi_install_fixed_event_handler(ACPI_EVENT_RTC, rtc_handler, dev);
}

static void rtc_wake_on(struct device *dev)