Re: [PATCH RFC] mm: implement write-behind policy for sequential file writes

From: Konstantin Khlebnikov
Date: Mon Oct 02 2017 - 16:58:54 EST

On 02.10.2017 22:54, Linus Torvalds wrote:
On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
<khlebnikov@xxxxxxxxxxxxxx> wrote:

This patch implements write-behind policy which tracks sequential writes
and starts background writeback when have enough dirty pages in a row.

This looks lovely to me.

I do wonder if you also looked at finishing the background
write-behind at close() time, because it strikes me that once you
start doing that async writeout, it would probably be good to make
sure you try to do the whole file.

Smaller files or tails is lesser problem and forced writeback here
might add bigger overhead due to small requests or too random IO.
Also open+append+close pattern could generate too much IO.

I'm thinking of filesystems that do delayed allocation etc - I'd
expect that you'd want the whole file to get allocated on disk
together, rather than have the "first 256kB aligned chunks" allocated
thanks to write-behind, and then the final part allocated much later
(after other files may have triggered their own write-behind). Think
loads like copying lots of pictures around, for example.

As far as I know ext4 preallocates space beyond file end for writing
patterns like append + fsync. Thus allocated extents should be bigger
than 256k. I haven't looked into this yet.

I don't have any particularly strong feelings about this, but I do
suspect that once you have started that IO, you do want to finish it
all up as the file write is done. No?

I'm aiming into continuous file operations like downloading huge file
or writing verbose log. Original motivation came from low-latency server
workloads which suffers from parallel bulk operations which generates
tons of dirty pages. Probably for general-purpose usage thresholds
should be increased significantly to cover only really bulky patterns.

It would also be really nice to see some numbers. Perhaps a comparison
of "vmstat 1" or similar when writing a big file to some slow medium
like a USB stick (which is something we've done very very badly at,
and this should help smooth out)?

I'll try to find out some real cases with numbers.

For now I see that massive write + fdatasync (dd conf=fdatasync, fio)
always ends earlier because writeback now starts earlier too.
Without fdatasync it's obviously slower.

Cp to usb stick + umount should show same result, plus cp could be
interrupted at any point without contaminating cache with dirty pages.

Kernel compilation tooks almost the same time because most files are
smaller than 256k.