Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call

From: Daniel Lezcano
Date: Wed Oct 11 2017 - 04:26:36 EST


On 10/10/2017 23:19, David Kozub wrote:
> On Tue, 10 Oct 2017, Daniel Lezcano wrote:
>
>> On 09/10/2017 21:33, David Kozub wrote:
>>> On Mon, 9 Oct 2017, Daniel Lezcano wrote:
>>>
>>>> On 07/10/2017 23:26, David Kozub wrote:
>>>>> Hi all,
>>>>>
>>>>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>>>>> (http://pcengines.ch/alix2c3.htm) dies with:
>>>>>
>>>>> [ÂÂÂ 2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>>>>> registered
>>>>> [ÂÂÂ 2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>>>>> [ÂÂÂ 2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>>>>> [ÂÂÂ 2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>>>>> [ÂÂÂ 2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>>>>> event, using IRQ 7
>>>>> [ÂÂÂ 2.412687] BUG: unable to handle kernel NULL pointer dereference
>>>>> atÂÂ (null)
>>>>> [ÂÂÂ 2.412698] IP:ÂÂ (null)
>>>>> [ÂÂÂ 2.412702] *pde = 00000000
>>>>> [ÂÂÂ 2.412713] Oops: 0000 [#1]
>>>>> [ÂÂÂ 2.412716] Modules linked in:
>>>>> [ÂÂÂ 2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>>>>> 4.14.0-rc3-humel-test17 #36
>>>>> [ÂÂÂ 2.412739] task: c0010000 task.stack: c008e000
>>>>> [ÂÂÂ 2.412744] EIP:ÂÂ (null)
>>>>> [ÂÂÂ 2.412749] EFLAGS: 00210093 CPU: 0
>>>>> [ÂÂÂ 2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>>>>> [ÂÂÂ 2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>>>>> [ÂÂÂ 2.412780]Â DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>>>> [ÂÂÂ 2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>>>>> [ÂÂÂ 2.412793] Call Trace:
>>>>> [ÂÂÂ 2.412800]Â <IRQ>
>>>>> [ÂÂÂ 2.412825]Â ? mfgpt_tick+0x5d/0x81
>>>>> [ÂÂÂ 2.412845]Â __handle_irq_event_percpu+0x56/0xb6
>>>>> [ÂÂÂ 2.412864]Â ? handle_fasteoi_irq+0xf3/0xf3
>>>>> [ÂÂÂ 2.412878]Â handle_irq_event_percpu+0x17/0x3f
>>>>> [ÂÂÂ 2.412892]Â handle_irq_event+0x1d/0x29
>>>>> [ÂÂÂ 2.412905]Â handle_level_irq+0x57/0xc6
>>>>> [ÂÂÂ 2.412924]Â handle_irq+0x47/0x52
>>>>> [ÂÂÂ 2.412929]Â </IRQ>
>>>>> [ÂÂÂ 2.412934]Â <SOFTIRQ>
>>>>> [ÂÂÂ 2.412949]Â do_IRQ+0x32/0x9b
>>>>> [ÂÂÂ 2.412963]Â ? __irqentry_text_end+0x7/0x7
>>>>> [ÂÂÂ 2.412976]Â common_interrupt+0x2e/0x34
>>>>> ...
>>>>>
>>>>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>>>>> mfgpt_tick() on line 132:
>>>>> ...
>>>>> ÂÂÂÂcs5535_clockevent.event_handler(&cs5535_clockevent);
>>>>> ...
>>>>> cs5535_clockevent.event_handler is null.
>>>>>
>>>>> Adding some more traces I see mfgpt_tick() gets called before
>>>>> clockevents_config_and_register() finishes (invoked from
>>>>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>>>>> event_handler, it's NULL. Wrapping the event_handler call on line
>>>>> 132 in
>>>>> a null pointer check results in a working system.
>>>>>
>>>>> The issue is present at least also in kernel 4.13.5. In kernel
>>>>> versions
>>>>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>>>>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>>>>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>>>>
>>>>> My guess is that the timer interrupt is enabled too early. If I change
>>>>> the order of clockevents_config_and_register() and
>>>>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>>>>> cs5535_mfgpt_init() like this:
>>>>> ...
>>>>> ÂÂÂÂ/* Set up the clock event */
>>>>> ÂÂÂÂprintk(KERN_INFO DRV_NAME
>>>>> ÂÂÂÂÂÂÂ ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>>>> ÂÂÂÂÂÂÂ timer_irq);
>>>>> ÂÂÂÂclockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>>>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ 0xF, 0xFFFE);
>>>>>
>>>>> ÂÂÂÂ/* Set the clock scale and enable the event mode for CMP2 */
>>>>> ÂÂÂÂval = MFGPT_SCALE | (3 << 8);
>>>>> ÂÂÂÂcs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>>>> ...
>>>>> the system boots and seems to be OK.
>>>>
>>>> What happens if instead of inverting those two lines, you add
>>>> mfgpt_shutdown() early in the init function ?
>>>
>>> Hi Daniel,
>>>
>>> I can't call mfgpt_shutdown() before clockevents_config_and_register()
>>> as mfgpt_shutdown() wants a struct clock_event_device* which is only
>>> available after clockevents_config_and_register(). But I tried calling
>>> disable_timer():
>>>
>>> @@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
>>> ÂÂÂÂÂÂÂ }
>>> ÂÂÂÂÂÂÂ cs5535_event_clock = timer;
>>>
>>> +ÂÂÂÂÂÂ disable_timer(cs5535_event_clock);
>>> +
>>> ÂÂÂÂÂÂÂ /* Set up the IRQ on the MFGPT side */
>>> ÂÂÂÂÂÂÂ if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
>>>
>>> which should do the same thing. Unfortunately I still get the same BUG.
>>>
>>> I added some more traces and I think the IRQ comes just after:
>>> ÂÂÂÂret = setup_irq(timer_irq, &mfgptirq);
>>
>> Mmh, the interrupt flag tells it is a shared interrupt, bad :/
>
> While there is IRQF_SHARED, I think on the ALIX 2c3 the IRQ is not
> really shared:
>
> $ cat /proc/interrupts
> ÂÂÂÂÂÂÂÂÂÂ CPU0
> Â 0:ÂÂÂÂÂÂ 9959ÂÂÂ XT-PICÂÂÂÂÂ timer
> Â 2:ÂÂÂÂÂÂÂÂÂ 0ÂÂÂ XT-PICÂÂÂÂÂ cascade
> Â 4:ÂÂÂÂÂÂ 1763ÂÂÂ XT-PICÂÂÂÂÂ ttyS0
> Â 7:ÂÂÂÂÂÂ 4648ÂÂÂ XT-PICÂÂÂÂÂ cs5535-clockevt
> Â 9:ÂÂÂÂÂÂÂÂÂ 0ÂÂÂ XT-PICÂÂÂÂÂ ath
> Â10:ÂÂÂÂÂÂÂ 151ÂÂÂ XT-PICÂÂÂÂÂ eth0
> Â11:ÂÂÂÂÂÂÂÂÂ 0ÂÂÂ XT-PICÂÂÂÂÂ eth1
> Â12:ÂÂÂÂÂÂÂÂÂ 1ÂÂÂ XT-PICÂÂÂÂÂ ehci_hcd:usb1, ohci_hcd:usb2
> Â14:ÂÂÂÂÂÂ 4063ÂÂÂ XT-PICÂÂÂÂÂ pata_cs5536
> Â15:ÂÂÂÂÂÂÂÂÂ 0ÂÂÂ XT-PICÂÂÂÂÂ eth2
>
> (if this is sufficient proof)

After digging a bit, the interrupt can be shared. There is a comment in
the drivers/misc/cs5535-mfgpt.c:cs5535_mfgpt_set_irq() function.

Moreover, the kernel can boot with a module parameter to change the irq
number. By config, the default is 7 but it could be another interrupt
shared with a device.

But your setup, if I refer the output of /proc/interrupts, does not show
the interrupt timer is shared.

>> The interrupt handler checks if the interrupt was for the timer.
>>
>> So here, there are two options:
>>
>> - the timer is in an inconsistent state, with an interrupt line up and
>> enabled. In this case, it should be reseted correctly before going
>> further in the init function (disable, ack irq if any, and then request
>> irq and register).
>>
>> .. or ..
>>
>> - the check at the beginning of the interrupt handler:
>> ÂÂÂÂÂÂ /* See if the interrupt was for us */
>> ÂÂÂÂÂÂ if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
>> ÂÂÂÂÂÂÂ MFGPT_SETUP_CMP1)))
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ return IRQ_NONE;
>>
>> Â does not work and we try to handle an interrupt which is not for us.
>
> If this interrupt is not shared, this option can be ruled out, right?

Yes.

>> That could be any of these two (or both) and I suspect the commit
>> 8f9327cbb brings to light this problem as before the interrupt handler
>> had the check:
>> ÂÂÂÂif (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
>> ÂÂÂÂÂÂÂ return IRQ_HANDLED;
>>
>> with:
>>
>> static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;
>
> I agree, in the previous version any early calls would be ignored.
>
> I see in http://pcengines.ch/pdf/alix2.pdf on page 10 in "Setup
> options": "M: toggles MFGPT workaround â may be required to support high
> speed timer. See AMD CS5536 data book section 5.16.3 for the gory
> details. The system may hang during boot if you get it wrongâ"
>
> I tried toggling the option and it didn't seem to have any effect on
> this issue. Furthermore I did a BIOS update and now the menu option is
> gone.
>
> The CD5536 Data Book is here:
> https://support.amd.com/TechDocs/33238G_cs5536_db.pdf and that 5.16.3
> issue doesn't seem to be relevant here.
>
> So I think the following must be happening:
>
>> - the timer is in an inconsistent state, with an interrupt line up and
>> enabled. In this case, it should be reseted correctly before going
>> further in the init function (disable, ack irq if any, and then request
>> irq and register).
>
> Should I do something more to verify that it is this case?
>
> Could you please give me more detailed info on how to reset the IRQ to a
> sane sate? Being a kernel noob that summary is not sufficient for me.

The code should be already there in the driver.

Usually, in order to prevent this kind issue, before calling
clockevent_config_and_register(), we disable the timer and ack
unconditionally any interrupt.

So at the first glance, I would add:

diff --git a/drivers/clocksource/cs5535-clockevt.c
b/drivers/clocksource/cs5535-clockevt.c
index a1df588..87e9581 100644
--- a/drivers/clocksource/cs5535-clockevt.c
+++ b/drivers/clocksource/cs5535-clockevt.c
@@ -152,6 +152,9 @@ static int __init cs5535_mfgpt_init(void)
}
cs5535_event_clock = timer;

+ disable_timer(timer);
+ cs5535_mfgpt_write(timer, MFGPT_REG_COUNTER, 0);
+
/* Set up the IRQ on the MFGPT side */
if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",





--
<http://www.linaro.org/> Linaro.org â Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog