Re: [PATCH 1/3] direct-io: only inc/dec inode->i_dio_count for file systems

From: Jens Axboe
Date: Wed Apr 15 2015 - 18:58:09 EST


On 04/15/2015 04:36 PM, Dave Chinner wrote:
On Wed, Apr 15, 2015 at 04:01:36PM -0600, Jens Axboe wrote:
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:
.....
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f45507..c4901c40ad65 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1946,18 +1946,31 @@ void inode_dio_wait(struct inode *inode)
EXPORT_SYMBOL(inode_dio_wait);

/*
- * inode_dio_done - signal finish of a direct I/O requests
+ * inode_dio_begin - signal start of a direct I/O requests
* @inode: inode the direct I/O happens on
*
* This is called once we've finished processing a direct I/O request,
* and is used to wake up callers waiting for direct I/O to be quiesced.
*/
-void inode_dio_done(struct inode *inode)
+void inode_dio_inc(struct inode *inode)

function name does not match docbook comment....

Oops, will fix that up.

+{
+ atomic_inc(&inode->i_dio_count);
+}
+EXPORT_SYMBOL(inode_dio_inc);
+
+/*
+ * inode_dio_dec - signal finish of a direct I/O requests
+ * @inode: inode the direct I/O happens on
+ *
+ * This is called once we've finished processing a direct I/O request,
+ * and is used to wake up callers waiting for direct I/O to be quiesced.
+ */
+void inode_dio_dec(struct inode *inode)
{
if (atomic_dec_and_test(&inode->i_dio_count))
wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
}
-EXPORT_SYMBOL(inode_dio_done);
+EXPORT_SYMBOL(inode_dio_dec);

Bikeshedding: I think this would be better suited to inode_dio_begin()
and inode_dio_end() because now we are trying to say "this is where
the DIO starts, and this is where it ends". It's not really a
"reference counting" interface, we're trying to annotate the
boundaries of where DIO is protected against truncate....

I don't really care, if people like begin/end more than inc/dec, I'm happy with that.
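
For illustration only, here is a minimal userspace sketch of what the begin/end naming discussed above could look like. It is not the patch itself: struct fake_inode, the fake_* function names, and the dio_waiter_woken flag are hypothetical stand-ins, C11 atomics replace the kernel's atomic_t, and a boolean store replaces wake_up_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant part of struct inode. */
struct fake_inode {
	atomic_int i_dio_count;       /* direct I/O requests in flight */
	atomic_bool dio_waiter_woken; /* stands in for wake_up_bit() */
};

static void fake_inode_init(struct fake_inode *inode)
{
	atomic_init(&inode->i_dio_count, 0);
	atomic_init(&inode->dio_waiter_woken, false);
}

/* Mark the start of the region where DIO is protected against truncate. */
static void fake_inode_dio_begin(struct fake_inode *inode)
{
	atomic_fetch_add(&inode->i_dio_count, 1);
}

/*
 * Mark the end of that region. Returns true when this was the last
 * in-flight request, which is the point where waiters would be woken.
 */
static bool fake_inode_dio_end(struct fake_inode *inode)
{
	int prev = atomic_fetch_sub(&inode->i_dio_count, 1);

	assert(prev > 0); /* every end must pair with an earlier begin */
	if (prev == 1) {
		atomic_store(&inode->dio_waiter_woken, true);
		return true;
	}
	return false;
}
```

The behaviour matches the inc/dec pair in the quoted diff; only the names change, so that call sites read as the boundaries of a protected region rather than as reference counting.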

And, realistically, if we are pushing this up into the filesystems
again, we should push it up into *all* filesystems and get rid of it
completely from the DIO layer. That way no new twisty passages in
the direct IO code are needed.

Let's please keep that for a potential round 2. It's not like I'm piling lots of hacks on, it's two one-liner changes. It's not adding a lot to the entropy of direct-io.c. I've been carrying this patch for years now, and I really don't want to sign up for futzing around in direct-io.c, nor is that a reasonable requirement imho.

--
Jens Axboe
