Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call

From: Thomas Gleixner
Date: Wed Oct 18 2017 - 14:23:29 EST


On Mon, 16 Oct 2017, David Kozub wrote:
> On Mon, 16 Oct 2017, Thomas Gleixner wrote:
> > --- a/drivers/clocksource/cs5535-clockevt.c
> > +++ b/drivers/clocksource/cs5535-clockevt.c
> > @@ -117,7 +117,8 @@ static irqreturn_t mfgpt_tick(int irq, void *dev_id)
> > /* Turn off the clock (and clear the event) */
> > disable_timer(cs5535_event_clock);
> >
> > - if (clockevent_state_shutdown(&cs5535_clockevent))
> > + if (clockevent_state_detached(&cs5535_clockevent) ||
> > + clockevent_state_shutdown(&cs5535_clockevent))
> > return IRQ_HANDLED;
> >
> > /* Clear the counter */
>
> Hi Thomas,
>
> Thanks for the patch. I can confirm that this resolves the issue.
>
> I still think the fact that the handler is triggered at all is a bit strange
> but maybe ensuring the handler doesn't do anything bad is sufficient to
> declare this issue fixed. After all I'm probbly the last person in the known
> universe using this driver. :)

Well, any handler should be able to handle spurious interrupts sanely,
especially on legacy x86 interrupts.

> Would it not be nicer to also explicitly set cs5535_clockevent's
> state_use_accessors to CLOCK_EVT_STATE_DETACHED (in the initialization of
> cs5535_clockevent)?

Well, it's explicitely the first state to ensure that zero initialization
sets it to CLOCK_EVT_STATE_DETACHED.

> Maybe that field is not to be touched directly from outside, on the other
> hand in the current code the check in the handler relies on
> CLOCK_EVT_STATE_DETACHED being 0 (and so cs5535_clockevent's
> state_use_accessors being CLOCK_EVT_STATE_DETACHED by default).

There is lots of other code which would break if we violate that
assumption. But, yes you are right as well. The data structure is visible
before clockevents_register_device() is invoked because its statically
allocated, so we might as well explicitely set the state in the static
initializer. Though I don't think it's worth it.

> I was also wondering how is it ensured that the handler can never see an
> inconsistent state of state_use_accessors but after going through the code I
> think the key is clockevents_register_device's
> raw_spin_lock_irqsave(&clockevents_lock, flags); (even though the set to
> DETACHED happens outside, but we're already starting in DETACHED state). I
> also wonder if this would still be safe if there were multiple CPUs - but this
> is not happening on the ALIX as it has just one CPU.

In a case like this the core code cannot do much about serialization
vs. interrupts escpecially when the interrupt is enabled prior to clock
event registration. Especially not when the interrupt is shared. That's a
hen egg problem outside the scope of the core code.

There are lots of convoluted ways to prevent the interrupt from happening
early, but the best way to handle this particular issue is to make it
robust vs. both spurious and shared interrupt invocations.

Care to submit a patch with a proper changelog? Take author ship as the
complex part of that is writing a good changelog. Feel free to add

Suggested-by: Thomas .....

Thanks,

tglx