Re: Proposal for "proper" durable fsync() and fdatasync()

From: Andrew Morton
Date: Tue Feb 26 2008 - 11:28:56 EST


On Tue, 26 Feb 2008 15:07:45 +0000 Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote:

> SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
> pages which aren't already queued for write-out. It marks those with
> a "write-out" flag, and starts write I/Os at some unspecified time in
> the near future; it can be assumed writes for all the pages will
> complete eventually if there's no errors. When I/O completes on a
> page, it cleans the page and also clears the write-out flag.
>
> SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
> have the "write-out" flag set.
>
> SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
> pages for write-out. I don't actually see the point in this. Isn't a
> preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
> BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout.
ie: someone redirtied the page after someone started writing the page out.
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

those are the three stages of sync_file_range(). They are independently
selectable and various combinations provide various results.

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages. This is especially hurtful if
you're trying to feed a lot of little segments into the queue. In that
case perhaps userspace should do an asynchrnous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as poss into the queue, then
a SYNC_FILE_RANGE_WAIT_AFTER pass then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers. WHich mode is best very much depends on
the application's file dirtying patterns. One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside tuning
of the application's write() design.

It's an interesting problem, with potentially high payback.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/