write-behind on streaming writes

From: Fengguang Wu
Date: Tue May 29 2012 - 11:58:02 EST


Hi Linus,

On Mon, May 28, 2012 at 10:09:56AM -0700, Linus Torvalds wrote:
> Ok, pulled.
>
> However, I have an independent question for you - have you looked at
> any kind of per-file write-behind kind of logic?

Yes, definitely. It's especially beneficial for NFS to keep each file's
dirty pages low, because in NFS a simple stat() requires flushing all
of the file's dirty pages before it can proceed.

In general, however, there have been no strong user requests for this
feature. I guess that's mainly because users still have the choice of
using O_SYNC or O_DIRECT.

Actually O_SYNC comes pretty close to the code below for the purpose of
limiting the dirty and writeback pages, except that it's not on by
default and hence means nothing for normal users.

> The reason I ask is that pretty much every time I write some big file
> (usually when over-writing a harddisk), I tend to use my own hackish
> model, which looks like this:
>
> #define BUFSIZE (8*1024*1024ul)
>
> ...
> for (..) {
>         ...
>         if (write(fd, buffer, BUFSIZE) != BUFSIZE)
>                 break;
>         sync_file_range(fd, index*BUFSIZE, BUFSIZE,
>                         SYNC_FILE_RANGE_WRITE);
>         if (index)
>                 sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
>                                 SYNC_FILE_RANGE_WAIT_BEFORE |
>                                 SYNC_FILE_RANGE_WRITE |
>                                 SYNC_FILE_RANGE_WAIT_AFTER);
>         ....
>
> and it tends to be *beautiful* for both disk IO performance and for
> system responsiveness while the big write is in progress.

This seems to be all about optimizing the 1-dd case for desktop users.
The most beautiful thing about per-file write-behind is that it keeps
both the number of dirty and writeback pages in the system low when
there are only one or two sequential dirtier tasks, which is good for
responsiveness.
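
For reference, here is roughly what that fragment looks like when
filled out into a self-contained test program. It's only a sketch: the
output path, the open() flags and the zero-filled buffer are
placeholders I added to make it complete.

#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

#define BUFSIZE (8*1024*1024ul)

int main(int argc, char **argv)
{
        static char buffer[BUFSIZE];    /* zero-filled data to stream out */
        unsigned long index;
        int fd;

        if (argc < 2)
                return 1;

        fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return 1;

        for (index = 0; ; index++) {
                if (write(fd, buffer, BUFSIZE) != BUFSIZE)
                        break;
                /* start async writeout of the chunk just written */
                sync_file_range(fd, index * BUFSIZE, BUFSIZE,
                                SYNC_FILE_RANGE_WRITE);
                /* wait for the previous chunk to actually reach the disk */
                if (index)
                        sync_file_range(fd, (index - 1) * BUFSIZE, BUFSIZE,
                                        SYNC_FILE_RANGE_WAIT_BEFORE |
                                        SYNC_FILE_RANGE_WRITE |
                                        SYNC_FILE_RANGE_WAIT_AFTER);
        }
        close(fd);
        return 0;
}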

Note that the above user space code won't work well when there are 10+
dirtier tasks: it effectively creates 10+ IO submitters working on
different regions of the disk and thus creates lots of seeks. When
there are 10+ dirtier tasks, it's desirable not only to have one single
flusher thread submit all the IO, but also to have the flusher work on
the inodes with a large write chunk size.

I happen to have some numbers comparing the current adaptive
(write_bandwidth/2 = 50MB) and the old fixed 4MB write chunk sizes on
XFS (ext4 was not chosen because it internally enforces a >= 128MB
chunk size). It's basically a 4% performance drop in the 1-dd case and
up to ~22% in the 100-dd case.

   3.4.0-rc2              3.4.0-rc2-4M+
------------  ---------------------------
      114.02   -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
      102.25  -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
      104.17  -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
      104.94  -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
      104.76  -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

So we probably still want to keep the chunk size at 0.5 seconds' worth
of writeback.

> And I'm wondering if we couldn't expose this kind of write-behind
> logic from the kernel. Sure, it only works for the "contiguous write
> of a single large file" model, but that model isn't actually all
> *that* unusual.
>
> Right now all the write-back logic is based on the
> balance_dirty_pages() model, which is more of a global dirty model.
> Which obviously is needed too - this isn't an "either or" kind of
> thing, it's more of a "maybe we could have a streaming detector *and*
> the 'random writes' code". So I was wondering if anybody had ever been
> looking more at an explicit write-behind model that uses the same kind
> of "per-file window" that the read-ahead code does.

I can imagine it being implemented in the kernel this way:

streaming write detector in balance_dirty_pages():

        if (not globally throttled &&
            is a streaming writer &&
            it's crossing the chunk N+1 boundary) {
                queue writeback work for chunk N to the flusher
                wait for work completion
        }

The good thing is that this is not a complex addition. The potential
problem, however, is that the "wait for work completion" part has no
guaranteed completion time, especially when there are multiple dd
tasks. This could result in uncontrollable delays in the write()
syscall. So we may do this instead:

- wait for work completion
+ sleep for (chunk_size/write_bandwidth)

To avoid long write() delays, we might further split that one big sleep
(about 0.5s for a 50MB chunk at ~100MB/s) into smaller sleeps.
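
To make that a bit more concrete, here is a rough C sketch of the whole
idea. It is only an illustration: globally_throttled(),
streaming_writer() and queue_chunk_writeback() are hypothetical helper
names standing in for the global-threshold check, the sequentiality
test and the flusher work submission; msleep() and div_u64() are the
only real kernel APIs used.

/*
 * Sketch only: globally_throttled(), streaming_writer() and
 * queue_chunk_writeback() are hypothetical helpers, not existing
 * kernel functions.
 */
#include <linux/fs.h>
#include <linux/delay.h>        /* msleep() */
#include <linux/math64.h>       /* div_u64() */

static bool globally_throttled(void);
static bool streaming_writer(struct inode *inode);
static void queue_chunk_writeback(struct inode *inode, pgoff_t start,
                                  unsigned long nr_pages);

/* called from balance_dirty_pages(); write_bandwidth is in bytes/s */
static void streaming_write_behind(struct inode *inode, pgoff_t index,
                                   unsigned long chunk_pages,
                                   unsigned long write_bandwidth)
{
        unsigned int pause_ms;
        int i;

        if (globally_throttled())       /* leave it to normal dirty throttling */
                return;
        if (!streaming_writer(inode))   /* not a sequential dirtier */
                return;
        if (index % chunk_pages)        /* only act when crossing a chunk boundary */
                return;

        /* hand chunk N (the one just completed) over to the flusher thread */
        queue_chunk_writeback(inode, index - chunk_pages, chunk_pages);

        /*
         * Don't wait for the work to complete (unbounded latency with
         * many dd tasks); instead pace the dirtier at the measured
         * bandwidth, in small steps so that no single write() call
         * stalls for the full ~0.5s.
         */
        pause_ms = div_u64((u64)chunk_pages * PAGE_SIZE * 1000,
                           write_bandwidth);
        for (i = 0; i < 10; i++)
                msleep(pause_ms / 10);
}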

> (The above code only works well for known streaming writes, but the
> *model* of saying "ok, let's start writeout for the previous streaming
> block, and then wait for the writeout of the streaming block before
> that" really does tend to result in very smooth IO and minimal
> disruption of other processes..)

Thanks,
Fengguang