Re: INFO: rcu detected stall in ext4_write_checks

From: Paul E. McKenney
Date: Mon Jul 15 2019 - 10:03:38 EST


On Mon, Jul 15, 2019 at 03:46:51PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2019 at 03:33:11PM +0200, Dmitry Vyukov wrote:
> > On Mon, Jul 15, 2019 at 3:29 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sun, Jul 14, 2019 at 11:49:15AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > > > But short term I don't see any other solution than stop testing
> > > > > sched_setattr because it does not check arguments enough to prevent
> > > > > system misbehavior. Which is a pity because syzkaller has found some
> > > > > bad misconfigurations that were oversights on the checking side.
> > > > > Any other suggestions?
> > > >
> > > > Keep the times down to a few seconds? Of course, that might also
> > > > fail to find interesting bugs.
> > >
> > > Right, if syzkaller can put a limit on the period/deadline parameters
> > > (and make sure to not write "-1" to
> > > /proc/sys/kernel/sched_rt_runtime_us) then the in-kernel
> > > access-control should not allow these things to happen.
> >
> > Since we are racing with emails, could you suggest 100% safe
> > parameters? Because I only hear people saying "safe", "sane",
> > "well-behaving" :)
> > If we move the check to user-space, it does not mean that we can get
> > away without actually defining what that means.
>
> Right, well, that's part of the problem. I think Paul just did the
> reverse math and figured that 95% of X must not be larger than my
> watchdog timeout and landed on 14 seconds.

I was actually working backwards from the 21-second RCU CPU stall
timeout, but there are likely many other limits to consider.

> I'm thinking 4 seconds (or rather 4.294967296) would be a very nice
> number.

Works for me! That should give the various RCU kthreads ample
opportunities to execute within the RCU CPU stall timeout.
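
For reference, 4.294967296 seconds is exactly 2^32 nanoseconds, and
sched_setattr() takes its runtime, deadline, and period in nanoseconds.
A minimal user-space sketch of such a cap, assuming the fuzzer filters
its own parameters before making the call. DL_PERIOD_CAP_NS, struct
dl_params, and dl_params_under_cap() are hypothetical names for
illustration, not kernel or syzkaller API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap: 2^32 ns = 4,294,967,296 ns = 4.294967296 s. */
#define DL_PERIOD_CAP_NS (UINT64_C(1) << 32)

/* Illustrative stand-in for the nanosecond fields of struct sched_attr. */
struct dl_params {
	uint64_t runtime_ns;
	uint64_t deadline_ns;
	uint64_t period_ns;
};

/*
 * Return true if the deadline and period both stay at or under the cap.
 * This is only the upper-bound filter discussed above; the kernel still
 * applies its own validation to whatever is passed in.
 */
static bool dl_params_under_cap(const struct dl_params *p)
{
	return p->deadline_ns <= DL_PERIOD_CAP_NS &&
	       p->period_ns <= DL_PERIOD_CAP_NS;
}
```

A fuzzer could simply skip or regenerate any parameter set for which
dl_params_under_cap() returns false.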

The rcuo callback-offload kthreads will need special handling, but if
someone has 100 CPUs wildly generating callbacks and allocates but one
CPU to invoke them, there is not much either the RCU or the scheduler
can do to make that work. ;-)

Thanx, Paul

> > Now thinking of this, if we come up with some simple criteria, could
> > we have something like a sysctl that would allow only really "safe"
> > parameters?
>
> I suppose we could do that, something like:
> sysctl_deadline_period_{min,max}. I'll have to dig back a bit on where
> we last talked about that and what the problems were.
>
> For one, setting the min is a lot harder, but I suppose we can start at
> TICK_NSEC or something.
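
The min/max idea above could be sketched roughly as follows. This is
only an illustration under stated assumptions: the HZ value of 250 is
assumed, the tick length is computed the way the kernel's TICK_NSEC
macro rounds it, and struct dl_period_limits and dl_period_allowed()
are hypothetical names rather than existing kernel interfaces:

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC      1000000000ULL
#define ASSUMED_HZ        250ULL	/* assumption: depends on kernel config */
/* Tick length in ns, rounded to nearest. */
#define ASSUMED_TICK_NSEC ((NSEC_PER_SEC + ASSUMED_HZ / 2) / ASSUMED_HZ)

/* Hypothetical pair of limits, as a min/max sysctl might expose them. */
struct dl_period_limits {
	uint64_t min_ns;
	uint64_t max_ns;
};

/* Return true if the requested period lies within [min_ns, max_ns]. */
static bool dl_period_allowed(uint64_t period_ns,
			      const struct dl_period_limits *lim)
{
	return period_ns >= lim->min_ns && period_ns <= lim->max_ns;
}
```

With the minimum set to one tick and the maximum set to 2^32 ns, a
1-microsecond period and a 10-second period would both be rejected,
while a 100-millisecond period would pass.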