Re: INFO: rcu detected stall in ext4_write_checks

From: Dmitry Vyukov
Date: Mon Jul 15 2019 - 09:29:48 EST


On Mon, Jul 15, 2019 at 3:01 PM Paul E. McKenney <paulmck@xxxxxxxxxxxxx> wrote:
>
> On Sun, Jul 14, 2019 at 08:10:27PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> > > On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > > > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > > > But short term I don't see any other solution than stop testing
> > > > > sched_setattr because it does not check arguments enough to prevent
> > > > > system misbehavior. Which is a pity because syzkaller has found some
> > > > > bad misconfigurations that were oversight on checking side.
> > > > > Any other suggestions?
> > > >
> > > > Or maybe syzkaller can put its own limitations on what parameters are
> > > > sent to sched_setattr? In practice, there are any number of ways a
> > > > root user can shoot themselves in the foot when using sched_setattr or
> > > > sched_setaffinity, for that matter. I imagine there must be some such
> > > > constraints already --- or else syzkaller might have set a kernel
> > > > thread to run with priority SCHED_BATCH, with similar catastrophic
> > > > effects --- or do similar configurations to make system threads
> > > > completely unschedulable.
> > > >
> > > > Real time administrators who know what they are doing --- and who know
> > > > that their real-time threads are well behaved --- will always want to
> > > > be able to do things that will be catastrophic if the real-time thread
> > > > is *not* well behaved. I don't it is possible to add safety checks
> > > > which would allow the kernel to automatically detect and reject unsafe
> > > > configurations.
> > > >
> > > > An apt analogy might be civilian versus military aircraft. Most
> > > > airplanes are designed to be "inherently stable"; that way, modulo
> > > > buggy/insane control systems like on the 737 Max, the airplane will
> > > > automatically return to straight and level flight. On the other hand,
> > > > some military planes (for example, the F-16, F-22, F-36, the
> > > > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > > > way they can be more maneuverable.
> > > >
> > > > There are use cases for real-time Linux where this flexibility/power
> > > > vs. stability tradeoff is going to argue for giving root the
> > > > flexibility to crash the system. Some of these systems might
> > > > literally involve using real-time Linux in military applications,
> > > > something for which Paul and I have had some experience. :-)
> > > >
> > > > Speaking of sched_setaffinity, one thing which we can do is have
> > > > syzkaller move all of the system threads to they run on the "system
> > > > CPU's", and then move the syzkaller processes which are testing the
> > > > kernel to be on the "system under test CPU's". Then regardless of
> > > > what priority the syzkaller test programs try to run themselves at,
> > > > they can't crash the system.
> > > >
> > > > Some real-time systems do actually run this way, and it's a
> > > > recommended configuration which is much safer than letting the
> > > > real-time threads take over the whole system:
> > > >
> > > > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
> > >
> > > Good point! We might still have issues with some per-CPU kthreads,
> > > but perhaps use of nohz_full would help at least reduce these sorts
> > > of problems. (There could still be issues on CPUs with more than
> > > one runnable threads.)
> >
> > I looked at testing limitations in a bit more detail from an RCU
> > viewpoint, and came up with the following rough rule of thumb (which of
> > course might or might not survive actual testing experience, but should at
> > least be a good place to start). I believe that the sched_setaffinity()
> > testing rule should be that the SCHED_DEADLINE cycle be no more than
> > two-thirds of the RCU CPU stall warning timeout, which defaults to 21
> > seconds in mainline and 60 seconds in many distro kernels.
> >
> > That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
> > testing mainline on the one hand or 40 seconds when testing enterprise
> > distros on the other.
> >
> > This assumes quite a bit, though:
> >
> > o The system has ample memory to spare, and isn't running a
> > callback-hungry workload. For example, if you "only" have 100MB
> > of spare memory and you are also repeatedly and concurrently
> > expanding (say) large source trees from tarballs and then deleting
> > those source trees, the system might OOM. The reason OOM might
> > happen is that each close() of a file generates an RCU callback,
> > and 40 seconds worth of waiting-for-a-grace-period structures
> > takes up a surprisingly large amount of memory.
> >
> > So please be careful when combining tests. ;-)
> >
> > o There are no aggressive real-time workloads on the system.
> > The reason for this is that RCU is going to start sending IPIs
> > halfway to the RCU CPU stall timeout, and, in certain situations
> > on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations
> > constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
> > calibrated abuse is what stress testing is all about.)
> >
> > o The various RCU kthreads will get a chance to run at least once
> > during the SCHED_DEADLINE cycle. If in real life, they only
> > get a chance to run once per two SCHED_DEADLINE cycles, then of
> > course the 14 seconds becomes 7 and the 40 seconds becomes 20.
>
> And there are configurations and workloads that might require division
> by three, so that (assuming one chance to run per cycle), the 14 seconds
> becomes about 5 and the 40 seconds becomes about 15.
>
> > o The current RCU CPU stall warning defaults remain in
> > place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
> > Kconfig parameter, which may in turn be overridden by the
> > rcupdate.rcu_cpu_stall_timeout kernel boot parameter.
> >
> > o The current SCHED_DEADLINE default for providing spare cycles
> > for other uses remains in place.
> >
> > o Other kthreads might have other constraints, but given that you
> > were seeing RCU CPU stall warnings instead of other failures,
> > the needs of RCU's kthreads seem to be a good place to start.
> >
> > Again, the candidate rough rule of thumb is that the the SCHED_DEADLINE
> > cycle be no more than 14 seconds when testing mainline kernels on the one
> > hand and 40 seconds when testing enterprise distro kernels on the other.
> >
> > Dmitry, does that help?
>
> I checked with the people running the Linux Plumbers Conference Scheduler
> Microconference, and they said that they would welcome a proposal on
> this topic, which I have submitted (please see below). Would anyone
> like to join as co-conspirator?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Title: Making SCHED_DEADLINE safe for kernel kthreads
>
> Abstract:
>
> Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr()
> that can result in SCHED_DEADLINE tasks starving RCU's kthreads for
> extended time periods, not millisecond, not seconds, not minutes, not even
> hours, but days. Given that RCU CPU stall warnings are issued whenever
> an RCU grace period fails to complete within a few tens of seconds,
> the system did not suffer silently. Although one could argue that people
> should avoid abusing sched_setattr(), people are human and humans make
> mistakes. Responding to simple mistakes with RCU CPU stall warnings is
> all well and good, but a more severe case could OOM the system, which
> is a particularly unhelpful error message.
>
> It would be better if the system were capable of operating reasonably
> despite such abuse. Several approaches have been suggested.
>
> First, sched_setattr() could recognize parameter settings that put
> kthreads at risk and refuse to honor those settings. This approach
> of course requires that we identify precisely what combinations of
> sched_setattr() parameters settings are risky, especially given that there
> are likely to be parameter settings that are both risky and highly useful.
>
> Second, in theory, RCU could detect this situation and take the "dueling
> banjos" approach of increasing its priority as needed to get the CPU time
> that its kthreads need to operate correctly. However, the required amount
> of CPU time can vary greatly depending on the workload. Furthermore,
> non-RCU kthreads also need some amount of CPU time, and replicating
> "dueling banjos" across all such Linux-kernel subsystems seems both
> wasteful and error-prone. Finally, experience has shown that setting
> RCU's kthreads to real-time priorities significantly harms performance
> by increasing context-switch rates.
>
> Third, stress testing could be limited to non-risky regimes, such that
> kthreads get CPU time every 5-40 seconds, depending on configuration
> and experience. People needing risky parameter settings could then test
> the settings that they actually need, and also take responsibility for
> ensuring that kthreads get the CPU time that they need. (This of course
> includes per-CPU kthreads!)
>
> Fourth, bandwidth throttling could treat tasks in other scheduling classes
> as an aggregate group having a reasonable aggregate deadline and CPU
> budget. This has the advantage of allowing "abusive" testing to proceed,
> which allows people requiring risky parameter settings to rely on this
> testing. Additionally, it avoids complex progress checking and priority
> setting on the part of many kthreads throughout the system. However,
> if this was an easy choice, the SCHED_DEADLINE developers would likely
> have selected it. For example, it is necessary to determine what might
> be a "reasonable" aggregate deadline and CPU budget. Reserving 5%
> seems quite generous, and RCU's grace-period kthread would optimally
> like a deadline in the milliseconds, but would do reasonably well with
> many tens of milliseconds, and absolutely needs a few seconds. However,
> for CONFIG_RCU_NOCB_CPU=y, the RCU's callback-offload kthreads might
> well need a full CPU each! (This happens when the CPU being offloaded
> generates a high rate of callbacks.)
>
> The goal of this proposal is therefore to generate face-to-face
> discussion, hopefully resulting in a good and sufficient solution to
> this problem.


I would be happy to attend if this won't conflict with important
things on the testing and fuzzing MC.

If we restrict arguments for sched_attr, what would be the criteria
for 100% safe arguments? Moving the check from kernel to user-space
does not relief us from explicitly stating the condition in
black-and-white way. All of sched_runtime/sched_deadline/sched_period
be not larger than 1 second?

The problem is that syzkaller does not allow 100% reliable enforcement
for indirect arguments in memory. E.g. inputs arguments can overlap,
input/output can overlap, weird races affect what's actually being
passed to kernel, the memory being mapped from a weird device, etc.
And that's also useful as it can discover TOCTOU bugs, deadlocks, etc.
We could try to wrap sched_setattr and do some additional restrictions
by giving up on TOCTOU, device-mapped memory, etc.

I am also thinking about dropping CAP_SYS_NICE, it should still allow
some configurations, but no inherently unsafe ones.