Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()

From: Paul E. McKenney
Date: Fri Mar 03 2023 - 20:16:26 EST


On Fri, Mar 03, 2023 at 03:25:32PM -0800, Dave Taht wrote:
> On Fri, Mar 3, 2023 at 2:56 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Fri, Mar 03, 2023 at 01:31:43PM -0800, Jakub Kicinski wrote:
> > > On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote:
> > > > > - if (time_before(jiffies, end) && !need_resched() &&
> > > > > - --max_restart)
> > > > > + unsigned long limit;
> > > > > +
> > > > > + if (time_is_before_eq_jiffies(end) || !--max_restart)
> > > > > + limit = SOFTIRQ_OVERLOAD_TIME;
> > > > > + else if (need_resched())
> > > > > + limit = SOFTIRQ_DEFER_TIME;
> > > > > + else
> > > > > goto restart;
> > > > >
> > > > > + __this_cpu_write(overload_limit, jiffies + limit);
> > > >
> > > > The logic of all this is non-obvious and I had to reread it 5 times to
> > > > conclude that it matches the intent. Please add comments.
> > > >
> > > > While I'm not a big fan of heuristical duct tape, this looks harmless
> > > > enough to not end up in an endless stream of tweaking. Famous last
> > > > words...
> > >
> > > Would it all be more readable if I named the "overload_limit"
> > > "overloaded_until" instead? Naming..
> > > I'll add comments, too.
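> > > Roughly, something like this (a sketch only - same constants as the
> > > patch, with the per-CPU variable renamed as proposed above):
> > >
> > >	unsigned long limit;
> > >
> > >	/* Out of time budget or out of restarts: the CPU is overloaded
> > >	 * with softirq work, back off and leave the rest to ksoftirqd. */
> > >	if (time_is_before_eq_jiffies(end) || !--max_restart)
> > >		limit = SOFTIRQ_OVERLOAD_TIME;
> > >	/* Another task wants the CPU: defer softirq processing briefly. */
> > >	else if (need_resched())
> > >		limit = SOFTIRQ_DEFER_TIME;
> > >	else
> > >		goto restart;
> > >
> > >	/* Until this time, irq exit defers to ksoftirqd if it is running. */
> > >	__this_cpu_write(overloaded_until, jiffies + limit);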
> > >
> > > > But without the sched_clock() changes the actual defer time depends on
> > > > HZ and the point in time where the limit is set. That means it ranges from
> > > > 0 to 1/HZ, i.e. the 2ms defer time ends up close to 10ms on HZ=100 in
> > > > the worst case, which perhaps explains the 8ms+ stalls you are still
> > > > observing. Can you test with that sched_clock change applied, i.e. the
> > > > first two commits from
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/softirq
> > > >
> > > > 59be25c466d9 ("softirq: Use sched_clock() based timeout")
> > > > bd5a5bd77009 ("softirq: Rewrite softirq processing loop")
> > >
> > > Those will help, but I spent some time digging into the jiffies-related
> > > warts with kprobes - while annoying, they weren't a major source of
> > > wakeups. (FWIW the jiffies noise on our workloads is due to cgroup stats
> > > disabling IRQs for multiple ms on the timekeeping CPU).
> > >
> > > Here are fresh stats on why we wake up ksoftirqd on our Web workload
> > > (collected over 100 sec):
> > >
> > > Time exceeded: 484
> > > Loop max run out: 6525
> > > need_resched(): 10219
> > > (control: 17226 - number of times wakeup_process was called for ksoftirqd)
> > >
> > > As you can see need_resched() dominates.
> > >
> > > Zooming into the time exceeded - we can count nanoseconds between
> > > __do_softirq starting and the check. This is the histogram of actual
> > > usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000):
> > >
> > > [256, 512) 1 | |
> > > [512, 1K) 0 | |
> > > [1K, 2K) 217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > > [2K, 4K) 266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > >
> > > So yes, we can probably save ourselves ~200 wakeups with a better clock,
> > > but that's just 1.3% of the total wakeups :(
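> > >
> > > (For reference, the delta bucketed in the histogram above is essentially
> > > just the following, in kernel terms - a sketch of what the BPF probes
> > > compute, not the actual program:
> > >
> > >	/* Sampled at __do_softirq() entry: */
> > >	u64 t0 = ktime_get_mono_fast_ns();
> > >	/* ... softirq processing ... */
> > >	/* Sampled again at the "time exceeded" check: */
> > >	u64 usecs = (ktime_get_mono_fast_ns() - t0) / 1000;
> > >
> > > where usecs is what gets bucketed.)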
> > >
> > >
> > > Now - about the max loop count. I ORed the pending softirqs every
> > > time we got to the end of the loop. Looks like the vast majority of the
> > > loop-counter wakeups are exclusively due to RCU:
> > >
> > > @looped[512]: 5516
> > >
> > > Where 512 is the ORed pending mask over all iterations;
> > > 512 == 1 << RCU_SOFTIRQ.
> > >
> > > And they usually take less than 100us to consume the 10 iterations.
> > > Histogram of usecs consumed when we run out of loop iterations:
> > >
> > > [16, 32) 3 | |
> > > [32, 64) 4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [64, 128) 871 |@@@@@@@@@ |
> > > [128, 256) 34 | |
> > > [256, 512) 9 | |
> > > [512, 1K) 262 |@@ |
> > > [1K, 2K) 35 | |
> > > [2K, 4K) 1 | |
> > >
> > > Paul, is this expected? Is RCU not trying too hard to be nice?
> >
> > This is from way back in the day, so it is quite possible that better
> > tuning and/or better heuristics should be applied.
> >
> > On the other hand, 100 microseconds is a good long time from a
> > CONFIG_PREEMPT_RT=y perspective!
>
> All I have to add to this conversation is the observation that sampling
> things at the Nyquist rate helps to observe problems like these.
>
> So if you care about sub-8ms response times, a sub-4ms sampling rate is needed.

My guess is that Jakub is side-stepping Nyquist by sampling every call
to and return from the rcu_do_batch() function.

> > > # cat /sys/module/rcutree/parameters/blimit
> > > 10
> > >
> > > Or should we perhaps just raise the loop limit? Breaking after less
> > > than 100usec seems excessive :(
>
>
> > But note that RCU also has rcutree.rcu_divisor, which defaults to 7.
> > And an rcutree.rcu_resched_ns, which defaults to three milliseconds
> > (3,000,000 nanoseconds). This means that RCU will do:
> >
> > o All the callbacks if there are less than ten.
> >
> > o Ten callbacks or 1/128th of them, whichever is larger.
> >
> > o Unless the larger of them is more than 100 callbacks, in which
> > case there is an additional limit of three milliseconds worth
> > of them.
> >
> > Except that if a given CPU ends up with more than 10,000 callbacks
> > (rcutree.qhimark), that CPU's blimit is set to 10,000.
> >
> > So there is much opportunity to tune the existing heuristics and also
> > much opportunity to tweak the heuristics themselves.
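> >
> > In rough code form, with the defaults filled in, the per-invocation
> > budget works out to something like this (a paraphrase of rcu_do_batch(),
> > not the verbatim upstream code; next_ready_callback() and
> > invoke_callback() are placeholder helpers):
> >
> >	long bl, count = 0;
> >	u64 tlimit = 0;
> >
> >	/* At least blimit (10) callbacks, or 1/(2^rcu_divisor) == 1/128th
> >	 * of what is queued, whichever is larger. */
> >	bl = max(rdp->blimit, n_queued >> rcu_divisor);
> >
> >	/* Big batches additionally get a time cap: rcu_resched_ns (3ms). */
> >	if (bl > 100)
> >		tlimit = local_clock() + rcu_resched_ns;
> >
> >	while ((rhp = next_ready_callback()) != NULL) {
> >		invoke_callback(rhp);
> >		if (++count >= bl)
> >			break;		/* count budget spent */
> >		if (tlimit && local_clock() >= tlimit)
> >			break;		/* time budget spent */
> >	}
> >
> > And once a CPU piles up more than qhimark (10,000) callbacks, its
> > rdp->blimit itself is raised to 10,000.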
>
> This I did not know, and to best observe RCU in action the Nyquist interval is 1.5ms...

This is not an oscillator, and because this all happens within a
single system, you cannot hang your hat on speed-of-light delays.
In addition, an application can dump thousands of callbacks down RCU's
throat in a very short time, which changes RCU's timing. Also, the
time constants for expedited grace periods are typically in the tens
of microseconds. Something about prioritizing survivability over
measurability. ;-)

But that is OK because ftrace and BPF can provide fine-grained
measurements quite cheaply.

> Something with less constants and more curves seems in order.

In the immortal words of MS-DOS, are you sure?

Thanx, Paul

> > But let's see a good use case before tweaking, please. ;-)
> >
> > Thanx, Paul
> >
> > > > whether that makes a difference? Those two can be applied with some
> > > > minor polishing. The rest of that series is broken by f10020c97f4c
> > > > ("softirq: Allow early break").
> > > >
> > > > There is another issue with this overload limit. Assume max_restart or
> > > > timeout triggered and limit was set to now + 100ms. ksoftirqd runs and
> > > > gets the issue resolved after 10ms.
> > > >
> > > > So for the remaining 90ms any invocation of raise_softirq() outside of
> > > > (soft)interrupt context, which wakes ksoftirqd again, prevents
> > > > processing on return from interrupt until ksoftirqd gets on the CPU and
> > > > goes back to sleep, because task_is_running() == true and the stale
> > > > limit is not after jiffies.
> > > >
> > > > Probably not a big issue, but someone will notice on some weird workload
> > > > sooner rather than later and the tweaking will start nevertheless. :) So
> > > > maybe we fix it right away. :)
> > >
> > > Hm, Paolo raised this point as well, but the overload time is strictly
> > > to stop paying attention to the fact that ksoftirqd is running.
> > > IOW current kernels behave as if they had overload_limit of infinity.
> > >
> > > The current code already prevents processing until ksoftirqd schedules
> > > in, after raise_softirq() from a funky context.
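> > > IOW the irq-exit gate goes from today's unconditional form to something
> > > like this (a sketch, not the literal patch, using the overloaded_until
> > > rename from above):
> > >
> > >	/* Current mainline, in effect: */
> > >	if (ksoftirqd_running(local_softirq_pending()))
> > >		return;		/* always defer to ksoftirqd */
> > >
> > >	/* With the patch: only defer while inside the overload window. */
> > >	if (ksoftirqd_running(local_softirq_pending()) &&
> > >	    time_is_after_jiffies(__this_cpu_read(overloaded_until)))
> > >		return;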
>
>
>
> --
> A pithy note on VOQs vs SQM: https://blog.cerowrt.org/post/juniper/
> Dave Täht CEO, TekLibre, LLC