Re: RCU stall when using function_graph

From: Daniel Lezcano
Date: Fri Aug 11 2017 - 05:38:20 EST


On 10/08/2017 23:39, Paul E. McKenney wrote:
> On Thu, Aug 10, 2017 at 11:45:09AM +0200, Daniel Lezcano wrote:

[ ... ]

>> Nothing comes to mind, but it may be worth mentioning that the slowness
>> of the CPU is the aggravating factor. In particular, I was able to
>> reproduce the issue by setting the CPU to its minimum frequency. With
>> the ondemand governor, the frequency can be high (hence enough CPU
>> power) at the moment we enable function_graph because another CPU is
>> loaded (and both CPUs share the same clock line). The system became
>> stuck at the moment the other CPU went idle and the frequency dropped
>> to its lowest value. That introduced randomness in the issue and made
>> it hard to figure out why the RCU stall was happening.
>
> Adding this, then?

Yes, sure.

Thanks Paul.

-- Daniel

> ------------------------------------------------------------------------
>
> commit f7d9ce95064f76be583c775fac32076fa59f1617
> Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Date: Thu Aug 10 14:33:17 2017 -0700
>
> documentation: Slow systems can stall RCU grace periods
>
> If a fast system has a worst-case grace-period duration of (say) ten
> seconds, then running the same workload on a system ten times as slow
> will get you an RCU CPU stall warning given default stall-warning
> timeout settings. This commit therefore adds this possibility to
> stallwarn.txt.
>
> Reported-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
>
> diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
> index 21b8913acbdf..238acbd94917 100644
> --- a/Documentation/RCU/stallwarn.txt
> +++ b/Documentation/RCU/stallwarn.txt
> @@ -70,6 +70,12 @@ o A periodic interrupt whose handler takes longer than the time
> considerably longer than normal, which can in turn result in
> RCU CPU stall warnings.
>
> +o Testing a workload on a fast system, tuning the stall-warning
> + timeout down to just barely avoid RCU CPU stall warnings, and then
> + running the same workload with the same stall-warning timeout on a
> + slow system. Note that thermal throttling and on-demand governors
> + can cause a single system to be sometimes fast and sometimes slow!
> +
> o A hardware or software issue shuts off the scheduler-clock
> interrupt on a CPU that is not in dyntick-idle mode. This
> problem really has happened, and seems to be most likely to
>
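
To put numbers on the commit message above, here is a rough sketch of the
arithmetic. It assumes that grace-period duration scales linearly with the
slowdown factor, and that the stall-warning timeout is 21 seconds (the usual
default for CONFIG_RCU_CPU_STALL_TIMEOUT, but check your own config). The
function names are made up for illustration and are not kernel interfaces.

```python
# Assumed default stall-warning timeout in seconds; verify against your kernel config.
ASSUMED_STALL_TIMEOUT_S = 21.0


def scaled_grace_period(fast_worst_case_s, slowdown):
    """Worst-case grace period on the slow system, assuming linear scaling."""
    return fast_worst_case_s * slowdown


def would_warn(fast_worst_case_s, slowdown, timeout_s=ASSUMED_STALL_TIMEOUT_S):
    """True if the scaled grace period exceeds the stall-warning timeout."""
    return scaled_grace_period(fast_worst_case_s, slowdown) > timeout_s


if __name__ == "__main__":
    # Ten-second worst case on the fast system, as in the commit message.
    for slowdown in (1, 2, 10):
        gp = scaled_grace_period(10.0, slowdown)
        print(f"slowdown x{slowdown}: grace period {gp:.0f}s, "
              f"stall warning: {would_warn(10.0, slowdown)}")
```

With these assumptions, the ten-second worst case stays under the timeout at
normal speed but lands far beyond it on a system ten times as slow, which
matches the case described in the commit message.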


--
<http://www.linaro.org/> Linaro.org | Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog