Re: INFO: rcu detected stall in ext4_write_checks

From: Paul E. McKenney
Date: Mon Jul 15 2019 - 09:01:23 EST


On Sun, Jul 14, 2019 at 08:10:27PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > > But short term I don't see any other solution than stop testing
> > > > sched_setattr because it does not check arguments enough to prevent
> > > > system misbehavior. Which is a pity because syzkaller has found some
> > > > bad misconfigurations that were oversight on checking side.
> > > > Any other suggestions?
> > >
> > > Or maybe syzkaller can put its own limitations on what parameters are
> > > sent to sched_setattr? In practice, there are any number of ways a
> > > root user can shoot themselves in the foot when using sched_setattr or
> > > sched_setaffinity, for that matter. I imagine there must be some such
> > > constraints already --- or else syzkaller might have set a kernel
> > > thread to run with priority SCHED_BATCH, with similar catastrophic
> > > effects --- or do similar configurations to make system threads
> > > completely unschedulable.
> > >
> > > Real time administrators who know what they are doing --- and who know
> > > that their real-time threads are well behaved --- will always want to
> > > be able to do things that will be catastrophic if the real-time thread
> > > is *not* well behaved. I don't it is possible to add safety checks
> > > which would allow the kernel to automatically detect and reject unsafe
> > > configurations.
> > >
> > > An apt analogy might be civilian versus military aircraft. Most
> > > airplanes are designed to be "inherently stable"; that way, modulo
> > > buggy/insane control systems like on the 737 Max, the airplane will
> > > automatically return to straight and level flight. On the other hand,
> > > some military planes (for example, the F-16, F-22, F-36, the
> > > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > > way they can be more maneuverable.
> > >
> > > There are use cases for real-time Linux where this flexibility/power
> > > vs. stability tradeoff is going to argue for giving root the
> > > flexibility to crash the system. Some of these systems might
> > > literally involve using real-time Linux in military applications,
> > > something for which Paul and I have had some experience. :-)
> > >
> > > Speaking of sched_setaffinity, one thing which we can do is have
> > > syzkaller move all of the system threads to they run on the "system
> > > CPU's", and then move the syzkaller processes which are testing the
> > > kernel to be on the "system under test CPU's". Then regardless of
> > > what priority the syzkaller test programs try to run themselves at,
> > > they can't crash the system.
> > >
> > > Some real-time systems do actually run this way, and it's a
> > > recommended configuration which is much safer than letting the
> > > real-time threads take over the whole system:
> > >
> > > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
> >
> > Good point! We might still have issues with some per-CPU kthreads,
> > but perhaps use of nohz_full would help at least reduce these sorts
> > of problems. (There could still be issues on CPUs with more than
> > one runnable threads.)
>
> I looked at testing limitations in a bit more detail from an RCU
> viewpoint, and came up with the following rough rule of thumb (which of
> course might or might not survive actual testing experience, but should at
> least be a good place to start). I believe that the sched_setaffinity()
> testing rule should be that the SCHED_DEADLINE cycle be no more than
> two-thirds of the RCU CPU stall warning timeout, which defaults to 21
> seconds in mainline and 60 seconds in many distro kernels.
>
> That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
> testing mainline on the one hand or 40 seconds when testing enterprise
> distros on the other.
>
> This assumes quite a bit, though:
>
> o The system has ample memory to spare, and isn't running a
> callback-hungry workload. For example, if you "only" have 100MB
> of spare memory and you are also repeatedly and concurrently
> expanding (say) large source trees from tarballs and then deleting
> those source trees, the system might OOM. The reason OOM might
> happen is that each close() of a file generates an RCU callback,
> and 40 seconds worth of waiting-for-a-grace-period structures
> takes up a surprisingly large amount of memory.
>
> So please be careful when combining tests. ;-)
>
> o There are no aggressive real-time workloads on the system.
> The reason for this is that RCU is going to start sending IPIs
> halfway to the RCU CPU stall timeout, and, in certain situations
> on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations
> constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
> calibrated abuse is what stress testing is all about.)
>
> o The various RCU kthreads will get a chance to run at least once
> during the SCHED_DEADLINE cycle. If in real life, they only
> get a chance to run once per two SCHED_DEADLINE cycles, then of
> course the 14 seconds becomes 7 and the 40 seconds becomes 20.

And there are configurations and workloads that might require division
by three, so that (assuming one chance to run per cycle), the 14 seconds
becomes about 5 and the 40 seconds becomes about 15.

> o The current RCU CPU stall warning defaults remain in
> place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
> Kconfig parameter, which may in turn be overridden by the
> rcupdate.rcu_cpu_stall_timeout kernel boot parameter.
>
> o The current SCHED_DEADLINE default for providing spare cycles
> for other uses remains in place.
>
> o Other kthreads might have other constraints, but given that you
> were seeing RCU CPU stall warnings instead of other failures,
> the needs of RCU's kthreads seem to be a good place to start.
>
> Again, the candidate rough rule of thumb is that the the SCHED_DEADLINE
> cycle be no more than 14 seconds when testing mainline kernels on the one
> hand and 40 seconds when testing enterprise distro kernels on the other.
>
> Dmitry, does that help?

I checked with the people running the Linux Plumbers Conference Scheduler
Microconference, and they said that they would welcome a proposal on
this topic, which I have submitted (please see below). Would anyone
like to join as co-conspirator?

Thanx, Paul

------------------------------------------------------------------------

Title: Making SCHED_DEADLINE safe for kernel kthreads

Abstract:

Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr()
that can result in SCHED_DEADLINE tasks starving RCU's kthreads for
extended time periods, not millisecond, not seconds, not minutes, not even
hours, but days. Given that RCU CPU stall warnings are issued whenever
an RCU grace period fails to complete within a few tens of seconds,
the system did not suffer silently. Although one could argue that people
should avoid abusing sched_setattr(), people are human and humans make
mistakes. Responding to simple mistakes with RCU CPU stall warnings is
all well and good, but a more severe case could OOM the system, which
is a particularly unhelpful error message.

It would be better if the system were capable of operating reasonably
despite such abuse. Several approaches have been suggested.

First, sched_setattr() could recognize parameter settings that put
kthreads at risk and refuse to honor those settings. This approach
of course requires that we identify precisely what combinations of
sched_setattr() parameters settings are risky, especially given that there
are likely to be parameter settings that are both risky and highly useful.

Second, in theory, RCU could detect this situation and take the "dueling
banjos" approach of increasing its priority as needed to get the CPU time
that its kthreads need to operate correctly. However, the required amount
of CPU time can vary greatly depending on the workload. Furthermore,
non-RCU kthreads also need some amount of CPU time, and replicating
"dueling banjos" across all such Linux-kernel subsystems seems both
wasteful and error-prone. Finally, experience has shown that setting
RCU's kthreads to real-time priorities significantly harms performance
by increasing context-switch rates.

Third, stress testing could be limited to non-risky regimes, such that
kthreads get CPU time every 5-40 seconds, depending on configuration
and experience. People needing risky parameter settings could then test
the settings that they actually need, and also take responsibility for
ensuring that kthreads get the CPU time that they need. (This of course
includes per-CPU kthreads!)

Fourth, bandwidth throttling could treat tasks in other scheduling classes
as an aggregate group having a reasonable aggregate deadline and CPU
budget. This has the advantage of allowing "abusive" testing to proceed,
which allows people requiring risky parameter settings to rely on this
testing. Additionally, it avoids complex progress checking and priority
setting on the part of many kthreads throughout the system. However,
if this was an easy choice, the SCHED_DEADLINE developers would likely
have selected it. For example, it is necessary to determine what might
be a "reasonable" aggregate deadline and CPU budget. Reserving 5%
seems quite generous, and RCU's grace-period kthread would optimally
like a deadline in the milliseconds, but would do reasonably well with
many tens of milliseconds, and absolutely needs a few seconds. However,
for CONFIG_RCU_NOCB_CPU=y, the RCU's callback-offload kthreads might
well need a full CPU each! (This happens when the CPU being offloaded
generates a high rate of callbacks.)

The goal of this proposal is therefore to generate face-to-face
discussion, hopefully resulting in a good and sufficient solution to
this problem.