Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io

From: Dave Chinner
Date: Mon Feb 08 2021 - 16:35:02 EST


On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 08, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > There was a v2 inline that incorporated some directed feedback.
> > > Otherwise there were questions and ideas about making the whole thing
> > > faster, but I've no idea if that addresses the problem or not (if so,
> > > that would be an entirely different set of patches). I'll wait and see
> > > what Darrick thinks about this and rebase/repost if the approach is
> > > agreeable.
> >
> > There is always the school of thought that says that the best way to
> > get people to focus on this is to rebase and repost. Otherwise, they
> > are all too likely to assume that you lost interest in this.
>
> I was hoping that a better solution would emerge for clearing
> PageWriteback on hundreds of thousands of pages, but nothing easy popped
> out.
>
> The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
> to completion workqueue" gives me unease because who's to say if marking
> 262,144 pages on a particular CPU will actually stall it long enough to
> trip the hangcheck? Is the number lower on (say) some pokey NAS box
> with a lot of storage but a slow CPU?

It's also not the right thing to do given the IO completion
workqueue is a bound workqueue. Anything that is doing large amounts
of CPU intensive work should be on an unbound workqueue so that the
scheduler can bounce it around different CPUs as needed.

Quite frankly, the problem is a huge long ioend chain being built by
the submission code. We need to keep ioend completion overhead down.
It runs in either softirq or bound workqueue context and so
individual items of work that are performed in this context must not
be -unbounded- in size or time. Unbounded ioend chains are bad for
IO latency, they are bad for memory reclaim and they are bad for CPU
scheduling.

As I've said previously, we gain nothing by aggregating ioends past
a few tens of megabytes of submitted IO. The batching gains
disappear entirely once we've got enough IO in flight to keep the
submission queue full. We're talking here about gigabytes of
sequential IOs in a single ioend chain which are 2-3 orders of
magnitude larger than needed for optimal background IO submission
and completion efficiency and throughput. IOWs, we really should be
limiting the ioend chain length at submission time, not trying to
patch over bad completion behaviour that results from sub-optimal IO
submission behaviour...
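
As a rough illustration of capping at submission time, here is a minimal
userspace sketch in C. It is a model only, not kernel code and not taken
from any posted patch: the struct, the function names, the 4 KiB page
size and the 32 MiB cap ("a few tens of megabytes") are all assumptions
made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for the model only; neither comes from a real patch. */
#define MODEL_PAGE_SIZE   4096UL
#define MODEL_IOEND_MAX   (32UL * 1024 * 1024)  /* hypothetical 32 MiB cap */

/* Hypothetical stand-in for an ioend being built by writeback. */
struct ioend_model {
	unsigned long offset;  /* file offset of the first byte */
	unsigned long size;    /* bytes accumulated so far */
};

/* A page may join the ioend only if it is contiguous and fits under the cap. */
static bool can_add_page(const struct ioend_model *io, unsigned long page_off)
{
	if (io->size == 0)
		return true;
	if (page_off != io->offset + io->size)
		return false;
	return io->size + MODEL_PAGE_SIZE <= MODEL_IOEND_MAX;
}

/* "Submit" the ioend: record its page count and start a fresh one. */
static void submit_ioend(struct ioend_model *io, unsigned long *nr_ioends,
			 unsigned long *max_pages)
{
	unsigned long pages = io->size / MODEL_PAGE_SIZE;

	if (pages > *max_pages)
		*max_pages = pages;
	(*nr_ioends)++;
	io->size = 0;
}

/*
 * Walk npages sequential dirty pages. Returns how many ioends were
 * submitted and stores the largest per-ioend page count in *max_pages.
 */
static unsigned long submit_sequential(unsigned long npages,
				       unsigned long *max_pages)
{
	struct ioend_model io = { 0, 0 };
	unsigned long nr_ioends = 0;

	assert(max_pages != NULL);
	*max_pages = 0;

	for (unsigned long i = 0; i < npages; i++) {
		unsigned long off = i * MODEL_PAGE_SIZE;

		if (!can_add_page(&io, off))
			submit_ioend(&io, &nr_ioends, max_pages);
		if (io.size == 0)
			io.offset = off;
		io.size += MODEL_PAGE_SIZE;
	}
	if (io.size > 0)
		submit_ioend(&io, &nr_ioends, max_pages);

	return nr_ioends;
}
```

Under these assumptions, 262144 sequential pages (1 GiB) come out as 32
ioends of 8192 pages each, so no single completion ever has to walk more
than 8192 pages.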

> That said, /some/ threshold is probably better than no threshold. Could
> someone try to confirm if that series of Brian's fixes this problem too?

262144 pages is still too much work to be doing in a single softirq
IO completion callback. It's likely to be too much work for a bound
workqueue, too, especially when you consider that the workqueue
completion code will merge sequential ioends into one ioend, hence
making the IO completion loop counts bigger and latency problems worse
rather than better...
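
The merging effect can be modelled the same way. The sketch below is
again userspace C with invented names (struct done_ioend,
merge_contiguous) and is only a stand-in for what the workqueue
completion path does; the optional max_bytes cap is an assumption added
for comparison, not an existing kernel parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a completed ioend awaiting processing. */
struct done_ioend {
	unsigned long long offset;  /* file offset of the first byte */
	unsigned long long size;    /* length in bytes */
};

/*
 * Coalesce adjacent, contiguous entries in place. The array must be
 * sorted by offset. max_bytes == 0 means "no cap"; otherwise a merge is
 * skipped if the result would exceed max_bytes. Returns the new count.
 */
static size_t merge_contiguous(struct done_ioend *v, size_t n,
			       unsigned long long max_bytes)
{
	size_t out = 0;

	assert(v != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (out > 0 &&
		    v[out - 1].offset + v[out - 1].size == v[i].offset &&
		    (max_bytes == 0 ||
		     v[out - 1].size + v[i].size <= max_bytes)) {
			/* Contiguous and allowed: grow the previous entry. */
			v[out - 1].size += v[i].size;
		} else {
			/* Gap or cap reached: keep as a separate entry. */
			v[out++] = v[i];
		}
	}
	return out;
}
```

Feeding this 32 contiguous 32 MiB entries with no cap gives back a single
1 GiB entry, i.e. the same 262144-page completion loop again. With a
64 MiB cap the same input collapses to 16 entries instead.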

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx