[patch/rft] jbd2: tag journal writes as metadata I/O

From: Jeff Moyer
Date: Thu Apr 01 2010 - 15:05:16 EST


In running iozone for writes to small files, we noticed a pretty big
discrepancy between the performance of the deadline and cfq I/O
schedulers. Investigation showed that I/O was being issued from 2
different contexts: the iozone process itself, and the jbd2/sdh-8 thread
(as expected). Because of the way cfq performs slice idling, the delays
introduced between the metadata and data I/Os were significant. For
example, cfq would see about 7MB/s versus deadline's 35MB/s for the
same workload. I also tested fs_mark, writing and fsyncing 1000 64k
files, and observed a similar 5x performance difference. Eric
Sandeen suggested that I flag the journal writes as metadata, and once I
did that, the performance difference went away completely (cfq has
special logic to prioritize metadata I/O).

So, I'm submitting this patch for comments and testing. I have a
similar patch for jbd that I will submit if folks agree that this is a
good idea.
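
For anyone who wants to reason about the flag arithmetic outside the
kernel tree, here is a minimal user-space sketch of how the write op is
composed in the commit path. Everything prefixed with sketch_/SKETCH_ is
hypothetical: the bit positions are made up for illustration and do not
match the kernel's BIO_RW_* enum, and the makeup of the sync-plug value
is an assumption. Only the shape of the logic (OR the metadata bit into
whichever write op is chosen) mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions, for illustration only. The real values
 * come from the kernel's BIO_RW_* enum and may differ. */
enum sketch_rw_bit {
	SKETCH_RW_WRITE  = 0,
	SKETCH_RW_SYNCIO = 1,
	SKETCH_RW_META   = 2,
	SKETCH_RW_NOIDLE = 3,
};

#define SKETCH_WRITE		(1u << SKETCH_RW_WRITE)
#define SKETCH_WRITE_META	(SKETCH_WRITE | (1u << SKETCH_RW_META))
/* Assumed makeup of a "sync, plugged" write; not taken from the patch. */
#define SKETCH_WRITE_SYNC_PLUG	(SKETCH_WRITE | \
				 (1u << SKETCH_RW_SYNCIO) | \
				 (1u << SKETCH_RW_NOIDLE))

/* Pick the write op the way the patched commit path does: start from a
 * metadata write, and keep the metadata bit when upgrading to sync. */
static unsigned int sketch_commit_write_op(bool synchronous_commit)
{
	unsigned int write_op = SKETCH_WRITE_META;

	if (synchronous_commit)
		write_op = SKETCH_WRITE_SYNC_PLUG | (1u << SKETCH_RW_META);
	return write_op;
}

/* What an I/O scheduler would test to recognise metadata I/O. */
static bool sketch_is_meta(unsigned int rw)
{
	return (rw & (1u << SKETCH_RW_META)) != 0;
}

/* Sanity check: both commit modes carry the metadata bit, while the
 * plain sync-plug value (the pre-patch behaviour) does not. */
static void sketch_self_check(void)
{
	assert(sketch_is_meta(sketch_commit_write_op(false)));
	assert(sketch_is_meta(sketch_commit_write_op(true)));
	assert(!sketch_is_meta(SKETCH_WRITE_SYNC_PLUG));
}
```

The point of the sketch is only that the metadata bit has to be ORed in
on both the async and the synchronous-commit paths; otherwise the
scheduler sees the synchronous journal writes as ordinary data.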


Signed-off-by: Jeff Moyer <jmoyer@xxxxxxxxxx>

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 671da7f..1998265 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -139,7 +139,7 @@ static int journal_submit_commit_record(journal_t *journal,
barrier_done = 1;
- ret = submit_bh(WRITE_SYNC_PLUG, bh);
+ ret = submit_bh(WRITE_SYNC_PLUG | (1<<BIO_RW_META), bh);
if (barrier_done)

@@ -160,7 +160,7 @@ static int journal_submit_commit_record(journal_t *journal,
- ret = submit_bh(WRITE_SYNC_PLUG, bh);
+ ret = submit_bh(WRITE_SYNC_PLUG | (1<<BIO_RW_META), bh);
*cbh = bh;
return ret;
@@ -369,7 +369,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
int tag_bytes = journal_tag_bytes(journal);
struct buffer_head *cbh = NULL; /* For transactional checksums */
__u32 crc32_sum = ~0;
- int write_op = WRITE;
+ int write_op = WRITE_META;

* First job: lock down the current transaction and wait for
@@ -409,7 +409,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
* instead we rely on sync_buffer() doing the unplug for us.
if (commit_transaction->t_synchronous_commit)
- write_op = WRITE_SYNC_PLUG;
+ write_op = WRITE_SYNC_PLUG | (1<<BIO_RW_META);
trace_jbd2_commit_locking(journal, commit_transaction);
stats.run.rs_wait = commit_transaction->t_max_wait;
stats.run.rs_locked = jiffies;
