Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call

From: Daniel Lezcano
Date: Tue Oct 10 2017 - 09:03:55 EST


On 09/10/2017 21:33, David Kozub wrote:
> On Mon, 9 Oct 2017, Daniel Lezcano wrote:
>
>> On 07/10/2017 23:26, David Kozub wrote:
>>> Hi all,
>>>
>>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>>> (http://pcengines.ch/alix2c3.htm) dies with:
>>>
>>> [ÂÂÂ 2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>>> registered
>>> [ÂÂÂ 2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>>> [ÂÂÂ 2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>>> [ÂÂÂ 2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>>> [ÂÂÂ 2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>>> event, using IRQ 7
>>> [ÂÂÂ 2.412687] BUG: unable to handle kernel NULL pointer dereference
>>> atÂÂ (null)
>>> [ÂÂÂ 2.412698] IP:ÂÂ (null)
>>> [ÂÂÂ 2.412702] *pde = 00000000
>>> [ÂÂÂ 2.412713] Oops: 0000 [#1]
>>> [ÂÂÂ 2.412716] Modules linked in:
>>> [ÂÂÂ 2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>>> 4.14.0-rc3-humel-test17 #36
>>> [ÂÂÂ 2.412739] task: c0010000 task.stack: c008e000
>>> [ÂÂÂ 2.412744] EIP:ÂÂ (null)
>>> [ÂÂÂ 2.412749] EFLAGS: 00210093 CPU: 0
>>> [ÂÂÂ 2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>>> [ÂÂÂ 2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>>> [ÂÂÂ 2.412780]Â DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>> [ÂÂÂ 2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>>> [ÂÂÂ 2.412793] Call Trace:
>>> [ÂÂÂ 2.412800]Â <IRQ>
>>> [ÂÂÂ 2.412825]Â ? mfgpt_tick+0x5d/0x81
>>> [ÂÂÂ 2.412845]Â __handle_irq_event_percpu+0x56/0xb6
>>> [ÂÂÂ 2.412864]Â ? handle_fasteoi_irq+0xf3/0xf3
>>> [ÂÂÂ 2.412878]Â handle_irq_event_percpu+0x17/0x3f
>>> [ÂÂÂ 2.412892]Â handle_irq_event+0x1d/0x29
>>> [ÂÂÂ 2.412905]Â handle_level_irq+0x57/0xc6
>>> [ÂÂÂ 2.412924]Â handle_irq+0x47/0x52
>>> [ÂÂÂ 2.412929]Â </IRQ>
>>> [ÂÂÂ 2.412934]Â <SOFTIRQ>
>>> [ÂÂÂ 2.412949]Â do_IRQ+0x32/0x9b
>>> [ÂÂÂ 2.412963]Â ? __irqentry_text_end+0x7/0x7
>>> [ÂÂÂ 2.412976]Â common_interrupt+0x2e/0x34
>>> ...
>>>
>>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>>> mfgpt_tick() on line 132:
>>> ...
>>> ÂÂÂÂcs5535_clockevent.event_handler(&cs5535_clockevent);
>>> ...
>>> cs5535_clockevent.event_handler is null.
>>>
>>> Adding some more traces I see mfgpt_tick() gets called before
>>> clockevents_config_and_register() finishes (invoked from
>>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>>> event_handler, it's NULL. Wrapping the event_handler call on line 132 in
>>> a null pointer check results in a working system.
>>>
>>> The issue is present at least also in kernel 4.13.5. In kernel versions
>>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>>
>>> My guess is that the timer interrupt is enabled too early. If I change
>>> the order of clockevents_config_and_register() and
>>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>>> cs5535_mfgpt_init() like this:
>>> ...
>>> ÂÂÂÂ/* Set up the clock event */
>>> ÂÂÂÂprintk(KERN_INFO DRV_NAME
>>> ÂÂÂÂÂÂÂ ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>> ÂÂÂÂÂÂÂ timer_irq);
>>> ÂÂÂÂclockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ 0xF, 0xFFFE);
>>>
>>> ÂÂÂÂ/* Set the clock scale and enable the event mode for CMP2 */
>>> ÂÂÂÂval = MFGPT_SCALE | (3 << 8);
>>> ÂÂÂÂcs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>> ...
>>> the system boots and seems to be OK.
>>
>> What happens if instead of inverting those two lines, you add
>> mfgpt_shutdown() early in the init function ?
>
> Hi Daniel,
>
> I can't call mfgpt_shutdown() before clockevents_config_and_register()
> as mfgpt_shutdown() wants a struct clock_event_device* which is only
> available after clockevents_config_and_register(). But I tried calling
> disable_timer():
>
> @@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
> ÂÂÂÂÂÂÂ }
> ÂÂÂÂÂÂÂ cs5535_event_clock = timer;
>
> +ÂÂÂÂÂÂ disable_timer(cs5535_event_clock);
> +
> ÂÂÂÂÂÂÂ /* Set up the IRQ on the MFGPT side */
> ÂÂÂÂÂÂÂ if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
>
> which should do the same thing. Unfortunately I still get the same BUG.
>
> I added some more traces and I think the IRQ comes just after:
> ÂÂÂÂret = setup_irq(timer_irq, &mfgptirq);

Mmh, the interrupt flag tells it is a shared interrupt, bad :/

The interrupt handler checks if the interrupt was for the timer.

So here, there are two options:

- the timer is in an inconsistent state, with an interrupt line up and
enabled. In this case, it should be reseted correctly before going
further in the init function (disable, ack irq if any, and then request
irq and register).

.. or ..

- the check at the beginning of the interrupt handler:
/* See if the interrupt was for us */
if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
MFGPT_SETUP_CMP1)))
return IRQ_NONE;

does not work and we try to handle an interrupt which is not for us.


That could be any of these two (or both) and I suspect the commit
8f9327cbb brings to light this problem as before the interrupt handler
had the check:
if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
return IRQ_HANDLED;

with:

static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;







--
<http://www.linaro.org/> Linaro.org â Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog