Re: [PATCH v7.1] block: Coordinate flush requests

From: Shaohua Li
Date: Thu Jan 13 2011 - 20:00:39 EST


2011/1/13 Darrick J. Wong <djwong@xxxxxxxxxx>:
> On Thu, Jan 13, 2011 at 01:38:55PM +0800, Shaohua Li wrote:
>> 2011/1/13 Darrick J. Wong <djwong@xxxxxxxxxx>:
>> > On certain types of storage hardware, flushing the write cache takes a
>> > considerable amount of time.  Typically, these are simple storage systems with
>> > write cache enabled and no battery to save that cache during a power failure.
>> > When we encounter a system with many I/O threads that try to flush the cache,
>> > performance is suboptimal because each of those threads issues its own flush
>> > command to the drive instead of trying to coordinate the flushes, thereby
>> > wasting execution time.
>> >
>> > Instead of each thread initiating its own flush, we now try to detect the
>> > situation where multiple threads are issuing flush requests.  The first thread
>> > to enter blkdev_issue_flush becomes the owner of the flush, and all threads
>> > that enter blkdev_issue_flush before the flush finishes are queued up to wait
>> > for the next flush.  When that first flush finishes, one of those sleeping
>> > threads is woken up to perform the next flush and then wake up the other
>> > threads which are asleep waiting for the second flush to finish.
>> >
>> > In the single-threaded case, the thread will simply issue the flush and exit.
>> >
>> > To test the performance of this latest patch, I created a spreadsheet
>> > reflecting the performance numbers I obtained with the same ffsb fsync-happy
>> > workload that I've been running all along:  http://tinyurl.com/6xqk5bs
>> >
>> > The second tab of the workbook provides easy comparisons of the performance
>> > before and after adding flush coordination to the block layer.  Variations in
>> > the runs were never more than about 5%, so the slight performance increases and
>> > decreases are negligible.  It is expected that devices with low flush times
>> > should not show much change, whether the low flush times are due to the lack of
>> > write cache or the controller having a battery and thereby ignoring the flush
>> > command.
>> >
>> > Notice that the elm3b231_ipr, elm3b231_bigfc, elm3b57, elm3c44_ssd,
>> > elm3c44_sata_wc, and elm3c71_scsi profiles showed large performance increases
>> > from flush coordination.  These 6 configurations all feature large write caches
>> > without battery backups, and fairly high (or at least non-zero) average flush
>> > times, as was discovered when I studied the v6 patch.
>> >
>> > Unfortunately, there is one very odd regression: elm3c44_sas.  This profile is
>> > a couple of battery-backed RAID cabinets striped together with raid0 on md.  I
>> > suspect that there is some sort of problematic interaction with md, because
>> > running ffsb on the individual hardware arrays produces numbers similar to
>> > elm3c71_extsas.  elm3c71_extsas uses the same type of hardware array as does
>> > elm3c44_sas, in fact.
>> >
>> > FYI, the flush coordination patch shows performance improvements both with and
>> > without Christoph's patch that issues pure flushes directly.  The spreadsheet
>> > only captures the performance numbers collected without Christoph's patch.
>> Hi,
>> can you explain why there is improvement with your patch? If there are
>> multiple flush, blk_do_flush already has queue for them (the
>> ->pending_flushes list).
>
> With the current code, if we have n threads trying to issue flushes, the block
> layer will issue n flushes one after the other.  I think the point of
> Christoph's pure-flush patch is to skip the serialization step and allow
> issuing of the n pure flushes in parallel.  The point of this patch is optimize
> even more aggressively, such that as soon as the system becomes free to process
> a pure flush (at time t), all the requests for a pure flush that were created
> since the last time a pure flush was actually issued can be covered with a
> single flush issued at time t.  In other words, this patch tries to reduce the
> number of pure flushes issued.
Thanks, but why just doing merge for pure flush? we can merge normal
flush requests
with a pure flush request too.

+ complete_all(&new_flush->ready);
+ spin_unlock(&disk->flush_flag_lock);

- bio_put(bio);
+ complete_all(&flush->finish);
this seems can't guarantee the second waiter runs after the first
waiter, am I missing anything?

it appears we can easily implement this in blk_do_flush, I had
something at my hand too, passed test but no data yet.
Task1: ...FS.....C
Task2: ....F......S...C
Task3: ......F.........S..C
F means a flush is queued, S means a flush is dispatched, C means the flush
is completed. In above, when Task2's flush is completed, we actually can
make Task3 flush complete, as Task2's flush is dispatched after Task3's flush
is queued. With the same reason, we can't merge Task2 and Task3's flush with
Task1's.
---
block/blk-core.c | 1 +
block/blk-flush.c | 24 ++++++++++++++++++++++++
include/linux/blkdev.h | 2 ++
3 files changed, 27 insertions(+)

Index: linux/block/blk-core.c
===================================================================
--- linux.orig/block/blk-core.c 2011-01-13 21:02:24.000000000 +0800
+++ linux/block/blk-core.c 2011-01-13 21:43:03.000000000 +0800
@@ -128,6 +128,7 @@ void blk_rq_init(struct request_queue *q
rq->ref_count = 1;
rq->start_time = jiffies;
set_start_time_ns(rq);
+ rq->next_batched_flush = NULL;
}
EXPORT_SYMBOL(blk_rq_init);

Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c 2011-01-13 21:02:24.000000000 +0800
+++ linux/block/blk-flush.c 2011-01-14 08:22:56.000000000 +0800
@@ -42,8 +42,18 @@ static struct request *blk_flush_complet
/* not complete yet, queue the next flush sequence */
next_rq = queue_next_fseq(q);
} else {
+ struct request *tmp_rq = q->orig_flush_rq->next_batched_flush;
+ struct request *tmp_rq2;
+ sector_t error_sector = q->orig_flush_rq->bio->bi_sector;
+
/* complete this flush request */
__blk_end_request_all(q->orig_flush_rq, q->flush_err);
+ while (tmp_rq) {
+ tmp_rq2 = tmp_rq->next_batched_flush;
+ tmp_rq->bio->bi_sector = error_sector;
+ __blk_end_request_all(tmp_rq, q->flush_err);
+ tmp_rq = tmp_rq2;
+ }
q->orig_flush_rq = NULL;
q->flush_seq = 0;

@@ -164,6 +174,20 @@ struct request *blk_do_flush(struct requ
* processing.
*/
if (q->flush_seq) {
+ struct request *tmp_rq;
+
+ if ((rq->cmd_flags & REQ_FUA) || blk_rq_sectors(rq))
+ goto add_list;
+ list_for_each_entry(tmp_rq, &q->pending_flushes, queuelist) {
+ if (tmp_rq->cmd_flags & REQ_FLUSH) {
+ rq->next_batched_flush =
+ tmp_rq->next_batched_flush;
+ tmp_rq->next_batched_flush = rq;
+ list_del_init(&rq->queuelist);
+ return NULL;
+ }
+ }
+add_list:
list_move_tail(&rq->queuelist, &q->pending_flushes);
return NULL;
}
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h 2011-01-13 21:02:24.000000000 +0800
+++ linux/include/linux/blkdev.h 2011-01-13 21:43:03.000000000 +0800
@@ -163,6 +163,8 @@ struct request {

/* for bidi */
struct request *next_rq;
+
+ struct request *next_batched_flush;
};

static inline unsigned short req_get_ioprio(struct request *req)