Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Mathieu Desnoyers
Date: Tue Feb 23 2016 - 15:34:23 EST


----- On Feb 21, 2016, at 1:15 PM, Ross Green rgkernel@xxxxxxxxx wrote:

> On Sun, Feb 21, 2016 at 4:04 PM, Ross Green <rgkernel@xxxxxxxxx> wrote:
>> On Sat, Feb 20, 2016 at 5:32 PM, Paul E. McKenney
>> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>>> On Sat, Feb 20, 2016 at 03:34:30PM +1100, Ross Green wrote:
>>>> On Sat, Feb 20, 2016 at 4:33 AM, Paul E. McKenney
>>>> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>>>> > On Thu, Feb 18, 2016 at 08:13:18PM -0800, John Stultz wrote:
>>>> >> On Thu, Feb 18, 2016 at 7:56 PM, Ross Green <rgkernel@xxxxxxxxx> wrote:
>>>> >> > Well a bonus extra!
>>>> >> > Kept everything running and there was another stall.
>>>> >> > So i have included the demsg output for perusal.
>>>> >> >
>>>> >> > Just to clear things up there is no hotplug involved in this system.
>>>> >> > It is a standard Pandaboard ES Ti4460 two processor system.
>>>> >> > I use this for testing as a generic armv7 processor, plus can keep it
>>>> >> > just running along for testing for a long time. the system has a total
>>>> >> > of 23-25 process running on average. Mainly standard daemons. There is
>>>> >> > certainly no heavy processing going on. I run a series of benchmarks
>>>> >> > that are cpu intensive for the first 20 miinutes after boot and then
>>>> >> > just leave it idle away. checking every so often to see how it has
>>>> >> > gone.
>>>> >> > As mentioned I have observed these stalls going back to 3.17 kernel.
>>>> >> > It will often take up to a week to record such a stall. I will
>>>> >> > typically test every new release kernel, so the -rc? series will get
>>>> >> > around a weeks testing.
>>>> >>
>>>> >> Sorry. Kind of hopping in a bit late here. Is this always happening
>>>> >> with just the pandaboard? Or are you seeing this on different
>>>> >> machines?
>>>> >>
>>>> >> Have you tried enabling CONFIG_DEBUG_TIMEKEEPING just in case
>>>> >> something is going awry there?
>>>> >
>>>> > Excellent point -- timekeeping issues have caused this sort of issue
>>>> > in the past.
>>>> >
>>>> > Ross, on your next test, could you please enable CONFIG_DEBUG_TIMEKEEPING
>>>> > as John suggests?
>>>> >
>>>> > Thanx, Paul
>>>> >
>>>> As John has suggested have already enabled CONFIG_DEBUG_TIMEKEEPING.
>>>>
>>>> So far just on 1 day running.
>>>>
>>>> Sigh...!! Nothing to report as yet, only one day on the clock.
>>>> Its like watching grass grow!
>>>
>>> I hear you! Though I was thinking in terms of watching paint dry...
>>>
>>> Thanx, Paul
>>>
>> Yes,
>>
>> but with paint drying there is an end point!
>> Grass just keeps on growing ...
>>
>> More like the children in the back of the car ...
>> Are we there yet? ...
>>
>> Well still nothing .. to report. I have just built a 4.5-rc5, but will
>> wait till I get some outcome from the previous test. That can't be too
>> much longer!
>>
>> In hope,
>>
>> Ross Green
> Patience little ones ...
>
> Well after 2 days plus running pulled another stall.
> This is with 4.5-rc4 and CONFIG_DEBUG_TIMEKEEPING set.
>
> Can't see anything related to the TIMEKEEPING.
>
> Anyway here is the dmesg output for people to look at.
>
> Paul, I was going to move onto 4.5-rc5 kernel, is there something else
> that you want me to test with that, Anyone else have any suggestions
> or ideas?

Starting from kernel 3.17, if we request a e.g. 1000 jiffies schedule_timeout
on a HZ=1000 kernel, is there an upper bound to the number of jiffies it
can actually take before the timeout happens, especially on idle systems ?
I remember a talk from Thomas Gleixner on the new timer wheel which could
add some imprecision to those timeouts. Not sure in which kernel version it
got in though.

My thinking is that it might be a good idea to try using hrtimers rather than
a jiffies-based timeout to awaken the RCU thread if it's really important to
run in a bounded amount of jiffies. Or else the jiffies-based sanity check
that triggers the warning is perhaps too strict.

Thoughts ?

Thanks,

Mathieu


>
> Regards,
>
> Ross Green

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Attachment: dmesg-4.5-rc4-3
Description: Binary data