Re: INFO: rcu detected stall in ext4_write_checks

From: Paul E. McKenney
Date: Sun Jul 14 2019 - 23:11:28 EST


On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > But short term I don't see any other solution than stop testing
> > > sched_setattr because it does not check arguments enough to prevent
> > > system misbehavior. Which is a pity because syzkaller has found some
> > > bad misconfigurations that were oversights on the checking side.
> > > Any other suggestions?
> >
> > Or maybe syzkaller can put its own limitations on what parameters are
> > sent to sched_setattr? In practice, there are any number of ways a
> > root user can shoot themselves in the foot when using sched_setattr or
> > sched_setaffinity, for that matter. I imagine there must be some such
> > constraints already --- or else syzkaller might have set a kernel
> > thread to run with priority SCHED_BATCH, with similar catastrophic
> > effects --- or do similar configurations to make system threads
> > completely unschedulable.
> >
> > Real time administrators who know what they are doing --- and who know
> > that their real-time threads are well behaved --- will always want to
> > be able to do things that will be catastrophic if the real-time thread
> > is *not* well behaved. I don't think it is possible to add safety checks
> > which would allow the kernel to automatically detect and reject unsafe
> > configurations.
> >
> > An apt analogy might be civilian versus military aircraft. Most
> > airplanes are designed to be "inherently stable"; that way, modulo
> > buggy/insane control systems like on the 737 Max, the airplane will
> > automatically return to straight and level flight. On the other hand,
> > some military planes (for example, the F-16, F-22, F-35, the
> > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > way they can be more maneuverable.
> >
> > There are use cases for real-time Linux where this flexibility/power
> > vs. stability tradeoff is going to argue for giving root the
> > flexibility to crash the system. Some of these systems might
> > literally involve using real-time Linux in military applications,
> > something for which Paul and I have had some experience. :-)
> >
> > Speaking of sched_setaffinity, one thing which we can do is have
> > syzkaller move all of the system threads so they run on the "system
> > CPU's", and then move the syzkaller processes which are testing the
> > kernel to be on the "system under test CPU's". Then regardless of
> > what priority the syzkaller test programs try to run themselves at,
> > they can't crash the system.
> >
> > Some real-time systems do actually run this way, and it's a
> > recommended configuration which is much safer than letting the
> > real-time threads take over the whole system:
> >
> > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
>
> Good point! We might still have issues with some per-CPU kthreads,
> but perhaps use of nohz_full would help at least reduce these sorts
> of problems. (There could still be issues on CPUs with more than
> one runnable thread.)
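
The CPU partitioning suggested in the quoted text could be sketched
roughly as follows. This is only an illustration, not how syzkaller
does or should do it: split_cpus() and pin_to_test_cpus() are
hypothetical helper names, and the choice of the lowest-numbered CPUs
as "system CPUs" is an arbitrary assumption. Per-CPU kthreads cannot be
moved this way, as noted above.

```python
import os


def split_cpus(cpus, n_system):
    """Split a set of CPU numbers into (system_cpus, test_cpus).

    The lowest-numbered n_system CPUs become the "system CPUs"; this
    ordering is an arbitrary assumption made for the sketch.
    """
    ordered = sorted(cpus)
    if not 0 < n_system < len(ordered):
        raise ValueError("need at least one CPU on each side of the split")
    return set(ordered[:n_system]), set(ordered[n_system:])


def pin_to_test_cpus(pid, n_system=1):
    """Restrict pid to the "system under test" CPUs (Linux only)."""
    _system_cpus, test_cpus = split_cpus(os.sched_getaffinity(0), n_system)
    os.sched_setaffinity(pid, test_cpus)
    return test_cpus
```

A test harness would call pin_to_test_cpus() on each test process
before letting it change its own scheduling parameters.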

I looked at testing limitations in a bit more detail from an RCU
viewpoint, and came up with the following rough rule of thumb (which of
course might or might not survive actual testing experience, but should at
least be a good place to start). I believe that the sched_setattr()
testing rule should be that the SCHED_DEADLINE cycle be no more than
two-thirds of the RCU CPU stall warning timeout, which defaults to 21
seconds in mainline and 60 seconds in many distro kernels.

That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
testing mainline on the one hand or 40 seconds when testing enterprise
distros on the other.
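
As a rough sketch of the arithmetic, assuming the 21-second and
60-second defaults quoted above, a test harness might compute and
apply the bound like this. The function names are hypothetical; the
only outside fact relied on is that sched_setattr() takes the
SCHED_DEADLINE period in nanoseconds.

```python
from fractions import Fraction

# Stall-timeout defaults as stated in this message; any particular
# kernel build may differ.
MAINLINE_STALL_TIMEOUT_S = 21
DISTRO_STALL_TIMEOUT_S = 60

NSEC_PER_SEC = 1_000_000_000


def max_deadline_period_s(stall_timeout_s):
    """Two-thirds of the stall timeout, rounded down to whole seconds."""
    return int(Fraction(2, 3) * stall_timeout_s)


def clamp_sched_period_ns(requested_ns, stall_timeout_s):
    """Clamp a requested SCHED_DEADLINE period (ns) to the bound."""
    limit_ns = max_deadline_period_s(stall_timeout_s) * NSEC_PER_SEC
    return min(requested_ns, limit_ns)
```

With these definitions, max_deadline_period_s(21) gives 14 and
max_deadline_period_s(60) gives 40, matching the figures above.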

This assumes quite a bit, though:

o The system has ample memory to spare, and isn't running a
callback-hungry workload. For example, if you "only" have 100MB
of spare memory and you are also repeatedly and concurrently
expanding (say) large source trees from tarballs and then deleting
those source trees, the system might OOM. The reason OOM might
happen is that each close() of a file generates an RCU callback,
and 40 seconds' worth of waiting-for-a-grace-period structures
takes up a surprisingly large amount of memory.

So please be careful when combining tests. ;-)

o There are no aggressive real-time workloads on the system.
The reason for this is that RCU is going to start sending IPIs
halfway to the RCU CPU stall timeout, and, in certain situations
on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations
constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
calibrated abuse is what stress testing is all about.)

o The various RCU kthreads will get a chance to run at least once
during the SCHED_DEADLINE cycle. If in real life, they only
get a chance to run once per two SCHED_DEADLINE cycles, then of
course the 14 seconds becomes 7 and the 40 seconds becomes 20.

o The current RCU CPU stall warning defaults remain in
place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
Kconfig parameter, which may in turn be overridden by the
rcupdate.rcu_cpu_stall_timeout kernel boot parameter.

o The current SCHED_DEADLINE default for providing spare cycles
for other uses remains in place.

o Other kthreads might have other constraints, but given that you
were seeing RCU CPU stall warnings instead of other failures,
the needs of RCU's kthreads seem to be a good place to start.
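
Two of these assumptions lend themselves to a small sketch: picking up
a non-default stall timeout, and tightening the bound when the RCU
kthreads run less often than once per cycle. The helper names are
hypothetical, and the sysfs path is an assumption about where the
rcupdate.rcu_cpu_stall_timeout parameter is exposed on a given kernel;
the code falls back to a caller-supplied default if it cannot be read.

```python
def read_stall_timeout_s(
    path="/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout",
    default_s=21,
):
    """Return the stall timeout in seconds, or default_s if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default_s


def adjusted_period_bound_s(stall_timeout_s, cycles_per_kthread_run=1):
    """Two-thirds of the timeout, divided by the number of
    SCHED_DEADLINE cycles that pass between runs of the RCU kthreads
    (integer seconds, rounded down)."""
    if cycles_per_kthread_run < 1:
        raise ValueError("cycles_per_kthread_run must be at least 1")
    return (2 * stall_timeout_s) // (3 * cycles_per_kthread_run)
```

For example, adjusted_period_bound_s(21, 2) gives 7 and
adjusted_period_bound_s(60, 2) gives 20, the halved figures from the
third item above.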

Again, the candidate rough rule of thumb is that the SCHED_DEADLINE
cycle be no more than 14 seconds when testing mainline kernels on the one
hand and 40 seconds when testing enterprise distro kernels on the other.

Dmitry, does that help?

Thanx, Paul