Re: Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unableto handle kernel NULL pointer dereference at 0000000000000598 1.478005]IP: [<ffffffff8107a6c4>] queue_work_on+0x4/0x30
From: Stefan Bader
Date: Wed Jan 04 2012 - 03:17:49 EST
On 04.01.2012 01:53, John Stultz wrote:
> On Wed, 2012-01-04 at 11:31 +1100, NeilBrown wrote:
>> On Tue, 03 Jan 2012 15:09:48 -0800 John Stultz <john.stultz@xxxxxxxxxx> wrote:
>>> >From the stack trace, we've kicked off a rtc_timer_do_work, probably
>>> from the rtc_initialize_alarm() schedule_work call added in Neil's
>>> patch. From there, we call __rtc_set_alarm -> cmos_set_alarm ->
>>> cmos_rq_disable -> cmos_checkintr -> rtc_update_irq -> schedule_work.
Sorry, I was off for the evening a while after sending this out. And I just
started, so a few thing I will be doing later but have not yet had time.
Over night I had still be thinking on this and maybe one important fact I had
been ignoring. This really has only been observed on paravirt guests on Xen as
far as I know. And one thing that I should have pointed out is that
[ 0.792634] rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0
[ 0.792725] rtc_cmos: probe of rtc_cmos failed with error -38
So first the registration is done and the first line is the last thing printed
in the registration function. Then, and that line always comes after, the probe,
which looks like being done asynchronously, detects that the rtc is not
implemented. I would assume that this causes the rtc to be unregistered again
and that is probably the point where, under the right circumstances, the worker
triggered by the initialize alarm is trying to set another alarm. Probably while
some of the elements of the structure started to be torn down. I need to check
on that code path, yet. So right now its more a guess.
>>>
>>> So, what it looks to me is that in cmos_checkintr, we grab the cmos->rtc
>>> and pass that along. Unfortunately, since the cmos->rtc value isn't set
>>> until after rtc_device_register() returns its null at that point. So
>>> your patch isn't really fixing the issue, but just reducing the race
>>> window for the second cpu to schedule the work.
>>>
>>> Sigh. I'd guess dropping the schedule_work call from
>>> rtc_initialize_alarm() is the right approach (see below). When reviewing
>>> Neil's patch it seemed like a good idea there, but it seems off to me
>>> now.
>>>
>>> Neil, any thoughts on the following? Can you expand on the condition you
>>> were worried about in around that call?
>>
>> If you set an alarm in the future, then shutdown and boot again after that
>> time, then you will end up with a timer_queue node which is in the past.
>
> Thanks for explaining this again.
>
> Hrm. It seems the easy answer is to simply not add alarms that are in
> the past. Further, I'm a bit perplexed, as if they are in the past, the
> enabled flag shouldn't be set. __rtc_read_alarm() does check the
> current time, so maybe we can make sure we don't return old values? I
> guess I assumed __rtc_read_alarm() avoided returning stale values, but
> apparently not.
>
>> When this happens the queue gets stuck. That entry-in-the-past won't get
>> removed until and interrupt happens and an interrupt won't happen because the
>> RTC only triggers an interrupt when the alarm is "now".
>>
>> So you'll find that e.g. "hwclock" will always tell you that 'select' timed
>> out.
>>
>> So we force the interrupt work to happen at the start just in case.
>
> Unfortunately its too early.
>
>> Did you see my proposed patch which converted those calls to do the work
>> in-process rather than passing it to a worker-thread? I think that is a
>> clean fix.
>
> I don't think I saw it today. Was it from before the holidays?
>
I fear I caused a bit of confusion there. Neil responded to my initial mail
which was done as a reply to the mail announcing this patch for stable (which
just was the first thread I could get hold of).
I will try Neil's patch as well. And in parallel try to see whether the theory I
had this night makes sense. If it does, then it is only indirectly that the work
is scheduled too early. In that case just the teardown needs to make sure that
no work is being run while removal. Well, maybe the question is whether there
should be a delay in running the irq work until the device really, really is
completely set up... But that sounds a bit more complicated.
> Even so, at this point, I don't know if we have enough time for testing,
> so I'm thinking we either just drop the problematic sched_work call or
> revert the whole thing and try again for 3.3
That was the reason I was in a bit of hurry to get this back to you. Especially
since this patch had been marked as stable material and sooner or later will or
would be added to all the stable releases it applies to.
Thanks,
Stefan
> thanks
> -john
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/