Re: Unexpected splice "always copy" behavior observed

From: Nick Piggin
Date: Wed May 19 2010 - 02:31:32 EST


On Tue, May 18, 2010 at 09:25:05AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 18 May 2010, Steven Rostedt wrote:
> >
> > Hopefully we can find a way to avoid the copy to file. But the splice
> > code was created to avoid the copy to and from userspace, it did not
> > guarantee no copy within the kernel itself.
>
> Well, we always _wanted_ to splice directly to a file, but it's just not
> been done properly. It's not entirely trivial, since you need to worry
> about preexisting pages and generally just do the right thing wrt the
> filesystem.
>
> And no, it should NOT use migration code. I suspect you could do something
> fairly simple like:

I was thinking it could possibly reuse some of the migration code for
swapping filesystem state to the new page. But I agree it gets hairy, and
it is probably better to just insert new pages.

>
> - get the inode semaphore.
> - check if the splice is a pure "extend size" operation for that page
> - if so, just create the page cache entry and mark it dirty
> - otherwise, fall back to copying.
>
> because the "extend file" case is the easiest one, and is likely the only
> one that matters in practice (if you are overwriting an existing file,
> things get _way_ hairier, and why the hell would anybody expect that to be
> fast anyway?)
>
> But somebody needs to write the code..
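
The steps quoted above can be modelled roughly as follows. This is only a
userspace sketch of the decision, not kernel code: pure_extend_page() and
MODEL_PAGE_SIZE are made-up names, the inode locking is left out, and it
assumes that a whole page lying entirely past the end of the file has no
preexisting page cache entry.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/*
 * Hypothetical helper (not a kernel function): returns true when a splice
 * of len bytes at offset pos is a pure "extend size" operation for one
 * whole page, so the page could be inserted instead of copied.
 * Returns false when the caller should fall back to copying.
 */
static bool pure_extend_page(uint64_t file_size, uint64_t pos, uint64_t len)
{
	uint64_t first_free;

	/* Only a full, page-aligned page can be inserted as-is. */
	if (pos % MODEL_PAGE_SIZE != 0 || len != MODEL_PAGE_SIZE)
		return false;

	/* First page boundary at or after EOF: no file data lives beyond it. */
	first_free = (file_size + MODEL_PAGE_SIZE - 1)
			/ MODEL_PAGE_SIZE * MODEL_PAGE_SIZE;

	/* Anything that starts before that boundary overlaps existing data. */
	return pos >= first_free;
}
```

With a 4097-byte file, a page-sized write at offset 4096 overlaps the last
partial page and falls back to copying, while the same write at offset 8192
only extends the file.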

We could possibly attempt to invalidate the existing pagecache and
then try to install the new page. The filesystem still needs a look
over to ensure that error handling will work properly, and that it does
not make incorrect assumptions about the contents of the page being
passed in.

This still isn't ideal because we drop the filesystem state (e.g. buffer
heads) on a page which, by definition, will need to be written out soon.
But something smarter could be added if it turns out to be important.

Big if, because I don't like adding complex code without having a
really good reason. I do like having the splice flag there, though.
The more the app can tell the kernel the better. Hopefully people use
it and we can get a better idea of whether these fancy optimisations
will be worth it.
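
For reference, a minimal sketch of how an application passes that hint from
userspace, assuming the flag in question is SPLICE_F_MOVE.
splice_pipe_to_file() is a made-up helper name. The flag is only a hint; the
kernel remains free to copy the data anyway.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from the read end of a pipe
 * into a file at its current offset, asking the kernel to move pages
 * rather than copy them where it can.
 * Returns the number of bytes spliced, or -1 on error (errno is set).
 */
static ssize_t splice_pipe_to_file(int pipe_rd, int file_fd, size_t len)
{
	size_t done = 0;

	while (done < len) {
		/* NULL offsets: the pipe has none, the file uses its own. */
		ssize_t n = splice(pipe_rd, NULL, file_fd, NULL,
				   len - done, SPLICE_F_MOVE);
		if (n < 0)
			return -1;
		if (n == 0)	/* write end closed, nothing left to move */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The caller cannot tell from the return value whether the pages were moved or
copied, so any benefit has to be measured rather than assumed.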

