Re: single aio thread is migrated crazily by scheduler

From: Dave Chinner
Date: Mon Dec 02 2019 - 18:06:30 EST


On Mon, Dec 02, 2019 at 05:01:58PM +0800, Hillf Danton wrote:
>
> On Mon, 2 Dec 2019 14:08:44 +1100 Dave Chinner wrote:
> > On Thu, Nov 28, 2019 at 05:40:03PM +0800, Hillf Danton wrote:
> > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote:
> > > > Yeah, the fio task averages 13.4ms on any given CPU before being
> > > > switched to another CPU. Mind you, the stddev is 12ms, so the range
> > > > of how long it spends on any one CPU is pretty wide (330us to
> > > > 330ms).
> > > >
> > > Hey Dave
> > >
> > > > IOWs, this doesn't look like a workqueue problem at all - this looks
> > >
> > > Surprised to see you're so sure it has little to do with wq,
> >
> > Because I understand how the workqueue is used here.
> >
> > Essentially, the workqueue is not necessary for a -pure- overwrite
> > where no metadata updates or end-of-io filesystem work is required.
> >
> > However, change the workload just slightly, such as allocating the
> > space, writing into preallocated space (unwritten extents), using
> > AIO writes to extend the file, using O_DSYNC, etc, and we *must*
> > use a workqueue as we have to take blocking locks and/or run
> > transactions.
> >
> > These may still be very short (e.g. updating inode size) and in most
> > cases will not block, but if they do, then if we don't move the work
> > out of the block layer completion context (i.e. softirq running the
> > block bh) then we risk deadlocking the code.
> >
> > Not to mention none of the filesystem inode locks are irq safe.
> >
> > IOWs, we can remove the workqueue for this -one specific instance-
> > but it does not remove the requirement for using a workqueue for all
> > the other types of write IO that pass through this code.
> >
> So it's not true that it doesn't have anything to do with the workqueue.

You misunderstood what I was saying. I meant that this adverse
scheduler behaviour is not *unique to this specific workqueue
instance* or workload. There are another 5+ workqueues in XFS alone
that are based around the same "do all the deferred work on the same
CPU" queuing behaviour. Several of them are IO completion
processing workqueues, and they are designed this way to avoid
running completion work that accesses common structures across all
the CPUs in the system.
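
To make that concrete, here's a minimal sketch of the difference
between a bound (per-cpu) workqueue and an unbound one. This is
purely illustrative - the "demo" names are made up and it is not the
XFS code:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/smp.h>

static struct workqueue_struct *bound_wq;

static void demo_complete(struct work_struct *work)
{
	/* Runs in process context, so blocking locks/transactions are fine. */
	pr_info("completion work ran on CPU %d\n", raw_smp_processor_id());
}

static DECLARE_WORK(demo_work, demo_complete);

static int __init demo_init(void)
{
	/* Bound (per-cpu) workqueue: the work executes on the queuing CPU. */
	bound_wq = alloc_workqueue("demo-bound", WQ_MEM_RECLAIM, 0);
	if (!bound_wq)
		return -ENOMEM;

	/*
	 * Whatever CPU this runs on is the CPU the worker stays on - the
	 * "do the deferred work on the same CPU" behaviour. An unbound
	 * workqueue (WQ_UNBOUND) would instead let the scheduler place
	 * the worker wherever it likes.
	 */
	queue_work(bound_wq, &demo_work);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(bound_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The per-sb dio completion workqueue behaves like the bound case
above: the work runs on whatever CPU queued it, which is normally
the CPU that processed the IO completion.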

And, FWIW, we've had this "per-cpu delayed work" processing
mechanism in XFS since ~2002 when per-cpu work queues were
introduced in ~2.5.40. What we are doing with workqueues here is not
new or novel, and it's worked just fine for most of this time...

> > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> > > + schedule_work(&dio->aio.work);
> >
> > This does nothing but change the workqueue from a per-sb wq to the
> > system wq. The work is still bound to the same CPU it is queued on,
> > so nothing will change.
> >
> The system wq is enough here to make some visible difference, as CFS will
> be looking to make a new load-balancing decision, in particular when the
> submitter and completion are running on different CPUs.

That's noise caused by slightly different loading of the system
workqueue vs a private workqueue. It's likely just enough to move
the scheduler out of the window where it makes incorrect decisions.
i.e. add a bit more user load or load onto other CPUs, and the
problem will reappear.

As I said, this is *not* a fix for the problem - it just moves it
around so that you can't see it for this specific workload instance.

> It's claimed that "Maintaining CPU affinity across dispatch and completion
> work has been proven to be a significant performance win." If completion
> is running in the softirq context then it would take some time to sort
> out why irq (not CPU) affinity is making a difference across CPUs.

We use irq steering to provide CPU affinity for the structures being
used by completion because they are the same ones used by
submission. If completion happens quickly enough, those structures
are still hot in the cache of the submission CPU, and so we don't
drag bio and filesystem structures out of the CPU cache they sit in
by steering the completion to the submission CPU.

Most modern high-performance storage hardware has hardware
interrupt steering so the block layer doesn't have to do this. See
__blk_mq_complete_request() and __blk_complete_request(). If the
device has multiple hardware queues, they are already delivering CPU
affine completions. Otherwise __blk_complete_request() uses IPIs
to steer the completion to a CPU that shares a cache with the
submission CPU....
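
The decision being made there is conceptually simple. It's roughly
this shape - and it's only a sketch, not the actual block layer code;
the demo_* names are invented:

#include <linux/smp.h>
#include <linux/sched/topology.h>

struct demo_request {
	int	submit_cpu;	/* recorded when the IO was submitted */
	void	(*complete)(struct demo_request *rq);
};

static void demo_complete_remote(void *data)
{
	struct demo_request *rq = data;

	/* Now running on (or next to) the submission CPU's cache. */
	rq->complete(rq);
}

static void demo_steer_completion(struct demo_request *rq)
{
	int cpu = get_cpu();

	if (cpu == rq->submit_cpu || cpus_share_cache(cpu, rq->submit_cpu)) {
		/* Already cache-local to submission: complete in place. */
		rq->complete(rq);
	} else {
		/* Otherwise IPI the submission CPU and complete over there. */
		smp_call_function_single(rq->submit_cpu,
					 demo_complete_remote, rq, 0);
	}
	put_cpu();
}

Either way the completion ends up running against a cache that
already holds the submission-side structures, instead of wherever
the hardware happened to deliver the interrupt.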

IOWs, we are trying to ensure that we run the data IO completion on
the CPU that has that data hot in cache. When we are running
millions of IOs every second, this matters -a lot-. IRQ steering is
just a mechanism used to ensure completion processing hits hot
caches.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx