Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call

From: David Kozub
Date: Tue Oct 10 2017 - 17:20:00 EST


On Tue, 10 Oct 2017, Daniel Lezcano wrote:

On 09/10/2017 21:33, David Kozub wrote:
On Mon, 9 Oct 2017, Daniel Lezcano wrote:

On 07/10/2017 23:26, David Kozub wrote:
Hi all,

booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
(http://pcengines.ch/alix2c3.htm) dies with:

[ÂÂÂ 2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
registered
[ÂÂÂ 2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
[ÂÂÂ 2.355676] ledtrig-cpu: registered to indicate activity on CPUs
[ÂÂÂ 2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
[ÂÂÂ 2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
event, using IRQ 7
[ÂÂÂ 2.412687] BUG: unable to handle kernel NULL pointer dereference
atÂÂ (null)
[ÂÂÂ 2.412698] IP:ÂÂ (null)
[ÂÂÂ 2.412702] *pde = 00000000
[ÂÂÂ 2.412713] Oops: 0000 [#1]
[ÂÂÂ 2.412716] Modules linked in:
[ÂÂÂ 2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
4.14.0-rc3-humel-test17 #36
[ÂÂÂ 2.412739] task: c0010000 task.stack: c008e000
[ÂÂÂ 2.412744] EIP:ÂÂ (null)
[ÂÂÂ 2.412749] EFLAGS: 00210093 CPU: 0
[ÂÂÂ 2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
[ÂÂÂ 2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
[ÂÂÂ 2.412780]Â DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ÂÂÂ 2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
[ÂÂÂ 2.412793] Call Trace:
[ÂÂÂ 2.412800]Â <IRQ>
[ÂÂÂ 2.412825]Â ? mfgpt_tick+0x5d/0x81
[ÂÂÂ 2.412845]Â __handle_irq_event_percpu+0x56/0xb6
[ÂÂÂ 2.412864]Â ? handle_fasteoi_irq+0xf3/0xf3
[ÂÂÂ 2.412878]Â handle_irq_event_percpu+0x17/0x3f
[ÂÂÂ 2.412892]Â handle_irq_event+0x1d/0x29
[ÂÂÂ 2.412905]Â handle_level_irq+0x57/0xc6
[ÂÂÂ 2.412924]Â handle_irq+0x47/0x52
[ÂÂÂ 2.412929]Â </IRQ>
[ÂÂÂ 2.412934]Â <SOFTIRQ>
[ÂÂÂ 2.412949]Â do_IRQ+0x32/0x9b
[ÂÂÂ 2.412963]Â ? __irqentry_text_end+0x7/0x7
[ÂÂÂ 2.412976]Â common_interrupt+0x2e/0x34
...

The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
mfgpt_tick() on line 132:
...
ÂÂÂÂcs5535_clockevent.event_handler(&cs5535_clockevent);
...
cs5535_clockevent.event_handler is null.

Adding some more traces I see mfgpt_tick() gets called before
clockevents_config_and_register() finishes (invoked from
cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
event_handler, it's NULL. Wrapping the event_handler call on line 132 in
a null pointer check results in a working system.

The issue is present at least also in kernel 4.13.5. In kernel versions
<= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
4.13.5 recently), so I don't know when exactly this issue appeared.

My guess is that the timer interrupt is enabled too early. If I change
the order of clockevents_config_and_register() and
cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
cs5535_mfgpt_init() like this:
...
ÂÂÂÂ/* Set up the clock event */
ÂÂÂÂprintk(KERN_INFO DRV_NAME
ÂÂÂÂÂÂÂ ": Registering MFGPT timer as a clock event, using IRQ %d\n",
ÂÂÂÂÂÂÂ timer_irq);
ÂÂÂÂclockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ 0xF, 0xFFFE);

ÂÂÂÂ/* Set the clock scale and enable the event mode for CMP2 */
ÂÂÂÂval = MFGPT_SCALE | (3 << 8);
ÂÂÂÂcs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
...
the system boots and seems to be OK.

What happens if instead of inverting those two lines, you add
mfgpt_shutdown() early in the init function ?

Hi Daniel,

I can't call mfgpt_shutdown() before clockevents_config_and_register()
as mfgpt_shutdown() wants a struct clock_event_device* which is only
available after clockevents_config_and_register(). But I tried calling
disable_timer():

@@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
ÂÂÂÂÂÂÂ }
ÂÂÂÂÂÂÂ cs5535_event_clock = timer;

+ÂÂÂÂÂÂ disable_timer(cs5535_event_clock);
+
ÂÂÂÂÂÂÂ /* Set up the IRQ on the MFGPT side */
ÂÂÂÂÂÂÂ if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",

which should do the same thing. Unfortunately I still get the same BUG.

I added some more traces and I think the IRQ comes just after:
ÂÂÂÂret = setup_irq(timer_irq, &mfgptirq);

Mmh, the interrupt flag tells it is a shared interrupt, bad :/

While there is IRQF_SHARED, I think on the ALIX 2c3 the IRQ is not really shared:

$ cat /proc/interrupts
CPU0
0: 9959 XT-PIC timer
2: 0 XT-PIC cascade
4: 1763 XT-PIC ttyS0
7: 4648 XT-PIC cs5535-clockevt
9: 0 XT-PIC ath
10: 151 XT-PIC eth0
11: 0 XT-PIC eth1
12: 1 XT-PIC ehci_hcd:usb1, ohci_hcd:usb2
14: 4063 XT-PIC pata_cs5536
15: 0 XT-PIC eth2

(if this is sufficient proof)

The interrupt handler checks if the interrupt was for the timer.

So here, there are two options:

- the timer is in an inconsistent state, with an interrupt line up and
enabled. In this case, it should be reseted correctly before going
further in the init function (disable, ack irq if any, and then request
irq and register).

.. or ..

- the check at the beginning of the interrupt handler:
/* See if the interrupt was for us */
if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
MFGPT_SETUP_CMP1)))
return IRQ_NONE;

does not work and we try to handle an interrupt which is not for us.

If this interrupt is not shared, this option can be ruled out, right?

That could be any of these two (or both) and I suspect the commit
8f9327cbb brings to light this problem as before the interrupt handler
had the check:
if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
return IRQ_HANDLED;

with:

static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;

I agree, in the previous version any early calls would be ignored.

I see in http://pcengines.ch/pdf/alix2.pdf on page 10 in "Setup options": "M: toggles MFGPT workaround â may be required to support high speed timer. See AMD CS5536 data book section 5.16.3 for the gory details. The system may hang during boot if you get it wrongâ"

I tried toggling the option and it didn't seem to have any effect on this issue. Furthermore I did a BIOS update and now the menu option is gone.

The CD5536 Data Book is here: https://support.amd.com/TechDocs/33238G_cs5536_db.pdf and that 5.16.3 issue doesn't seem to be relevant here.

So I think the following must be happening:

- the timer is in an inconsistent state, with an interrupt line up and
enabled. In this case, it should be reseted correctly before going
further in the init function (disable, ack irq if any, and then request
irq and register).

Should I do something more to verify that it is this case?

Could you please give me more detailed info on how to reset the IRQ to a sane sate? Being a kernel noob that summary is not sufficient for me.

Best regards,
David