Re: splice vs execve lockdep trace.

From: Dave Chinner
Date: Wed Jul 17 2013 - 01:51:19 EST

Next message: Willy Tarreau: "Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML"
Previous message: Rong Wang: "Re: [PATCH] usb: udc: add gadget state kobject uevent"
In reply to: Linus Torvalds: "Re: splice vs execve lockdep trace."
Next in thread: Linus Torvalds: "Re: splice vs execve lockdep trace."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Jul 16, 2013 at 09:54:09PM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 9:06 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > Right, and that's one of the biggest problems page based IO has - we
> > can't serialise it against other IO and other page cache
> > manipulation functions like hole punching. What happens when a
> > splice read or mmap page fault races with a hole punch? You get
> > stale data being left in the page cache because we can't serialise
> > the page read with the page cache invalidation and underlying extent
> > removal.
>
> But Dave, that's *good*.
>
> You call it "stale data".
>
> I call it "the data was valid at some point".

Yes, that's true.

But When i say "stale data" I mean that the data being returned
might not have originally belonged to the underlying file you are
reading.

> This is what "splice()" is fundamentally all about.
>
> Think of it this way: even if you are 100% serialized during the
> "splice()" operation, what do you think happens afterwards?

Same as for read(2) - I don't care. Indeed, the locking is not to
protect what splice does "afterwards". It's for IO serialisation,
not page cache serialisation.

For example, splice read races with hole punch - after the page is
invalidated, but before the blocks are punched out. So we get a
new page, a mapping from the inode, get_blocks() returns valid disk
mapping, we start an IO. it gets blocked in the elevator. Maybe it's
been throttled because the process has used it's blkio quota.
Whatever - the IO gets stuck on the dispatch queue.

Meanwhile, the holepunch runs the transaction to free the disk
blocks, and then something else reallocates those blocks - maybe
even the filesystem for new metadata. That then gets dispatched to
disk. We now have two IOs pending for the same disk blocks....

What that means is that what the splice read returns is undefined.
It *might* be the original data that belonged to the file, in which
case we are good to go, but if the write wins the IO race then the
splice read could return someone else's data or even filesystem
metadata - that's what "stale data" means. Any way you look at it,
that's a bad thing....

> Seriously, think it through.

I have ;)

> That data is in a kernel buffer - the pipe. The fact that it was
> serialized at the time of the original splice() doesn't make _one_
> whit of a difference, because after the splice is over, the data still
> sits around in that pipe buffer, and you're no longer serializing it.
> Somebody else truncating the file or punching a hole in the file DOES
> NOT MATTER. It's too late.

Yes, I agree that once we have a page with valid data in it, we
don't need serialisation for splice. Again, I'll stress that he
serialisation is not for splice page cache manipulations but for
the IO that splice may issue to initialise the pages it them
manipulates....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Willy Tarreau: "Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML"
Previous message: Rong Wang: "Re: [PATCH] usb: udc: add gadget state kobject uevent"
In reply to: Linus Torvalds: "Re: splice vs execve lockdep trace."
Next in thread: Linus Torvalds: "Re: splice vs execve lockdep trace."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]