Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io

From: Brian Foster
Date: Mon Feb 08 2021 - 16:47:04 EST


On Tue, Feb 09, 2021 at 07:43:14AM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 08, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > > There was a v2 inline that incorporated some directed feedback.
> > > > Otherwise there were questions and ideas about making the whole thing
> > > > faster, but I've no idea if that addresses the problem or not (if so,
> > > > that would be an entirely different set of patches). I'll wait and see
> > > > what Darrick thinks about this and rebase/repost if the approach is
> > > > agreeable..
> > >
> > > There is always the school of thought that says that the best way to
> > > get people to focus on this is to rebase and repost. Otherwise, they
> > > are all too likely to assume that you lost interest in this.
> >
> > I was hoping that a better solution would emerge for clearing
> > PageWriteback on hundreds of thousands of pages, but nothing easy popped
> > out.
> >
> > The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
> > to completion workqueue" gives me unease because who's to say if marking
> > 262,144 pages on a particular CPU will actually stall it long enough to
> > trip the hangcheck? Is the number lower on (say) some pokey NAS box
> > with a lot of storage but a slow CPU?
>
> It's also not the right thing to do given the IO completion
> workqueue is a bound workqueue. Anything that is doing large amounts
> of CPU intensive work should be on an unbound workqueue so that the
> scheduler can bounce it around different CPUs as needed.
>
> Quite frankly, the problem is a huge long ioend chain being built by
> the submission code. We need to keep ioend completion overhead down.
> It runs in either softirq or bound workqueue context and so
> individual items of work that are performed in this context must not
> be -unbounded- in size or time. Unbounded ioend chains are bad for
> IO latency, they are bad for memory reclaim and they are bad for CPU
> scheduling.
>
> As I've said previously, we gain nothing by aggregating ioends past
> a few tens of megabytes of submitted IO. The batching gains are
> completely diminished once we've got enough IO in flight to keep the
> submission queue full. We're talking here about gigabytes of
> sequential IOs in a single ioend chain which are 2-3 orders of
> magnitude larger than needed for optimal background IO submission
> and completion efficiency and throughput. IOWs, we really should be
> limiting the ioend chain length at submission time, not trying to
> patch over bad completion behaviour that results from sub-optimal IO
> submission behaviour...
>

That was the patch I posted prior to the aforementioned set. Granted, it
was an RFC, but for reference:

https://lore.kernel.org/linux-fsdevel/20200825144917.GA321765@bfoster/

(IIRC, you also had a variant that was essentially the same change.)
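
For readers following along, the idea of capping ioend size at submission
time can be modeled in a few lines of userspace C. This is only a sketch
under stated assumptions, not the RFC patch above and not the actual iomap
code: struct model_ioend, MODEL_IOEND_MAX_BYTES and
model_ioend_can_append() are hypothetical names, and the 16 MiB cap is an
arbitrary placeholder rather than a proposed value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on bytes aggregated into one ioend (placeholder value). */
#define MODEL_IOEND_MAX_BYTES (16ULL << 20)

/* Simplified stand-in for an ioend: one contiguous byte range under writeback. */
struct model_ioend {
	uint64_t offset;	/* file offset of the first byte */
	uint64_t size;		/* bytes aggregated so far */
};

/*
 * Return true if [offset, offset + len) may be appended to the ioend.
 * The range must start exactly where the ioend currently ends, and the
 * append must not push the ioend past the cap. On false, a caller in
 * this model would submit the current ioend and start a new one.
 */
static bool model_ioend_can_append(const struct model_ioend *ioend,
				   uint64_t offset, uint64_t len)
{
	assert(ioend != NULL);

	if (offset != ioend->offset + ioend->size)
		return false;	/* not contiguous */
	if (ioend->size + len > MODEL_IOEND_MAX_BYTES)
		return false;	/* would exceed the cap */
	return true;
}
```

With a check like this on the submission side, the work done by any single
completion is bounded by the cap regardless of how much sequential dirty
data writeback finds.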

The discussion that followed in that thread was around the preference to
move completion of large chains into workqueue context instead of
breaking up the chains. The series referenced in my first reply fell out
of that as a targeted fix for the stall warning.

> > That said, /some/ threshold is probably better than no threshold. Could
> > someone try to confirm if that series of Brian's fixes this problem too?
>
> 262144 pages is still too much work to be doing in a single softirq
> IO completion callback. It's likely to be too much work for a bound
> workqueue, too, especially when you consider that the workqueue
> completion code will merge sequential ioends into one ioend, hence
> making the IO completion loop counts bigger and latency problems worse
> rather than better...
>

That was just a conservative number picked based on observation of the
original report (10+ GB ioends IIRC). I figured the review cycle would
involve narrowing it down to something more generically reasonable
(10s-100s of MB?) once we found an acceptable approach (and hopefully
received some testing feedback), but we've never really got to that
point..
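
To put the numbers in this thread on one scale, here is a small userspace
C sketch that converts between page counts and bytes. It assumes 4 KiB
pages, which is not stated anywhere above, and the model_* names are
hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Other page sizes change every figure below. */
#define MODEL_PAGE_SIZE 4096ULL

/* Number of pages needed to cover 'bytes', rounding up. */
static uint64_t model_bytes_to_pages(uint64_t bytes)
{
	return (bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* Bytes covered by 'pages' whole pages. */
static uint64_t model_pages_to_bytes(uint64_t pages)
{
	assert(pages <= UINT64_MAX / MODEL_PAGE_SIZE);	/* guard against overflow */
	return pages * MODEL_PAGE_SIZE;
}
```

Under that assumption, the 262,144 page threshold corresponds to 1 GiB, a
10 GiB ioend covers 2,621,440 pages, and a 100 MiB limit would be 25,600
pages.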

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>