Re: [PATCH] direct-io: only inc/dec inode->i_dio_count for file systems

From: Jens Axboe
Date: Wed Apr 15 2015 - 15:27:27 EST

On 04/15/2015 12:56 PM, Andrew Morton wrote:
On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@xxxxxx> wrote:


This is a reposting of a patch that was originally in the blk-mq series.
It has huge upside on shared access to a multiqueue device doing
O_DIRECT, it's basically the scaling block that ends up killing
performance. A quick test here reveals that we spend 30% of all system
time just incrementing and decremening inode->i_dio_count. For block
devices this isn't useful at all, as we don't need protection against
truncate. For that test case, performance increases about 3.6x (!!) by
getting rid of this inc/dec per IO.

I've cleaned it up a bit since last time, integrating the checks in
inode_dio_done() and adding a inode_dio_begin() so that callers don't
need to know about this.

We've been running a variant of this patch in the FB kernel for a while.
I'd like to finally get this upstream.

30% overhead for one atomic_inc+atomic_dec+wake_up_bit() per IO? That
seems very high! Is there something else going on?

Nope, that is it. But for this simple test case, that is essentially the only shared state that exists and is dirtied between all CPUs that are running an application that does O_DIRECT to the device. The test case ran at ~9.6M IOPS after the patch, and at ~2.5M IOPS before. On a simple 2 socket box, no less, nothing fancy there.

And it's _just_ the atomic inc/dec. I ran with the exact patch posted, and that still does (needlessly, I think) the wake_up_bit() for inode_dio_done().

Is there similar impact to direct-io-to-file? It would be nice to fix
that up also. Many filesystems do something along the lines of


and with your patch I think we could change them to

__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);

which would halve the atomic op load.

I haven't checked pure file, but without extending, I suspect that we should see similar benefits there. In any case, it'd make sense doing, having twice the atomic inc/dec is just a bad idea in general, if we can get rid of it.

A quick grep doesn't show this use case, or I'm just blind. Where do you see that?

But that's piling hack on top of hack. Can we change the

I'd more view it as reducing the hack, the real hack is the way that we manually do atomic_inc() on i_dio_count, and then call a magic inode_dio_done() that decrements it again. It's not very pretty, I'm just reducing the scope of the hack :-)

do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
increment i_dio_count"? ie: exclusion against truncate is wholly the
caller's responsibility. That way, this awkward sharing of
responsibility between caller and callee gets cleaned up and

That would clean it up, at the expense of touching more churn. I'd be fine with doing it that way.

inode_dio_begin() would be a good place to assert that i_mutex is held,

How would you check it? If the i_dio_count is bumped, then you'd not need to hold i_mutex.

This whole i_dio_count thing is pretty nasty, really. If you stand
back and squint, it's basically an rwsem. I wonder if we can use an

We could, but a bit orthogonal to the patch since we'd do the same avoidance for blkdevs.

What's the reason for DIO_IGNORE_TRUNCATE rather than boring old

Didn't want to tie it to block devices, but rather just keep the "need lock or not" as a separate flag in case there were more use cases.

Jens Axboe

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at