Re: batched write

From: David Chinner
Date: Mon Jun 19 2006 - 20:01:02 EST


On Mon, Jun 19, 2006 at 12:50:49PM -0600, Andreas Dilger wrote:
> On Jun 19, 2006 09:51 -0700, Hans Reiser wrote:
> > Andreas Dilger wrote:
> > >With the caveat that I didn't see the original patch, if this can be a step
> > >down the road toward supporting delayed allocation at the VFS level then
> > >I'm all for such changes.
> >
> > What do you mean by supporting delayed allocation at the VFS level? Do
> > you mean calling to the FS or maybe just not stepping on the FS's toes
> > so much or? Delayed allocation is very fs specific in so far as I can
> > imagine it.
>
> Currently the VM/VFS call into the filesystem in ->prepare_write for each
> page to do block allocation for the filesystem. This is the filesystem's
> chance to return -ENOSPC, etc, because after that point the dirty pages
> are written asynchronously and there is no guarantee that the application
> will even be around when they are finally written to disk.
>
> If the VFS supported delayed allocation it would call into the filesystem
> on a per-sys_write basis to allow the filesystem to RESERVE space for all
> of the pages in the write call,

The VFS doesn't need to support delalloc as delalloc is fundamentally a
filesystem property. The VFS it already provides a hook for delalloc space
reservation that can return ENOSPC - it's called ->prepare_write().

Sure, a batch interface would be nice, but that's an optimisation that
needs to be done regardless of whether the filesystem supports delalloc or
not. The current ->prepare_write() interface shows its limits when having to
do hundreds of thousands (millions, even) of ->prepare_write() calls per
second. This makes for entertaining scaling problems that batching would
make less of a problem.

> and then later (under memory pressure,
> page aging, or even "pull" from the fs) submit a whole batch of contiguous
> pages to the fs efficiently (via ->fill_pages() or whatever).

Can be done right now - XFS does this probe-and-pull operation already for
writes. See xfs_probe_cluster(), xfs_cluster_write() and friends.

Yes, it would be nice to have the VM pass us clusters of adjacent pages, but
given that the file layout drives the cluster size it is more appropriate
to do this from the filesystem. Also, the pages do not contain the state
necessary for the VM to cluster pages in an way that results in efficient
I/O patterns.

Basically, the only thing really needed from the VFS/VM is a method of tagging
delalloc (or unwritten) pages so that the writepage path knows how to treat
the page being written. Currently we keep that state in bufferheads (e.g. see
buffer_delay() usage) attached to the page......

> The fs can know at that time the final file size (if the file isn't still
> being dirtied), can allocate all these blocks in a contiguous chunk, can
> submit all of the IO in a single bio to the block layer or RPC/RDMA to net.

You don't need to know the final file size - just what is contiguous in the
page cache and in the same state as the page being flushed.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/