Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

From: David Chinner
Date: Mon Feb 11 2008 - 00:23:04 EST


On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> > > > At least they reported it to be the most efficient scheme in their
> > > > testing, and Dave thought that migrating completions out to submitters
> > > > might be a bottleneck in some cases.
> > >
> > > More so than migrating submitters to completers? The advantage of only
> > > movign submitters is that you get rid of the completion locking. Apart
> > > from that, the cost should be the same, especially for the thread based
> > > solution.
> >
> > Not specifically for the block layer, but higher layers like xfs.
>
> True, but that's parallel to the initial statement - that migrating
> completers is most costly than migrating submitters. So I'd like Dave to
> expand on why he thinks that migrating completers it more costly than
> submitters, APART from the locking associated with adding the request to
> a remote CPU list.

What I think Nick is referring to is the comments I made that at a
higher layer (e.g. filesystems) migrating completions to the
submitter CPU may be exactly the wrong thing to do. I don't recall
making any comments on migrating submitters - I think others have
already commented on that so I'll ignore that for the moment and
try to explain why completion on submitter CPU /may/ be bad.

For example, in the case of XFS it is fine for data I/O but it is
wrong for transaction I/O completion. We want to direct all
transaction completions to as few CPUs as possible (one, ideally) so
that all the completion processing happens on the same CPU, rather
than bouncing global cachelines and locks between all the CPUs
taking completion interrupts.

In more detail, the XFS transaction subsystem is asynchronous.
We submit the transaction I/O on the CPU that creates the
transaction so the I/O can come from any CPU in the system. If we
then farm the completion processing out to the submission CPU, that
will push it all over the machine and guarantee that we bounce all
of the XFS transaction log structures and locks all over the machine
on completion as well as submission (right now it's lots of
submission CPUs, few completion CPUs).

An example how bad this can get - this patch:

http://oss.sgi.com/archives/xfs/2007-11/msg00217.html

which prevents simultaneous access to the items tracked in the log
during transaction reservation. Having several hundred CPUs trying
to hit this list at once is really bad for performance - the test
app on the 2048p machine that saw this problem went from ~5500s
runtime down to 9s with the above patch.

I use this example because the transaction I/O completion touches
exactly the same list, locks and structures but is limited in
distribution (and therefore contention) by the number of
simultaneous I/O completion CPUs. Doing completion on the submitter
CPU will cause much wider distribution of completion processing and
introduce exactly the same issues as the transaction reservation
side had.

As it goes, with large, multi-device volumes (e.g. big stripe) we
already see issues with simultaneous completion processing (e.g. the
8p machine mentioned in the above link), so I'd really like to avoid
making these problems worse....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/