Re: [RFC] writev() semantics with invalid iovec in the middle

From: Cedric Blancher
Date: Thu Sep 15 2016 - 18:32:58 EST


PAGE_SIZE isn't accurate on architectures that support multiple page
sizes, such as 8k, 64k, 512k, 4M, 32M and 256M on SPARC64, and likewise
on PPC64/Power.
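
For userspace code that needs a page size at all, a minimal sketch of
asking the running system instead of hard-coding 4096 could look like
this. The helper names are made up for illustration, and
sysconf(_SC_PAGESIZE) reports only the base page size, not any larger
page sizes the hardware may also use:

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: nonzero if n is a positive power of two. */
int is_power_of_two(long n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical helper: base page size reported by the running system.
 * This is a runtime value; it says nothing about additional (large)
 * page sizes that may be in use for a given mapping.
 */
long runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    assert(sz > 0);
    return sz;
}
```

A test that wants "shorter-than-page" segments could size them from
runtime_page_size() rather than from a compile-time constant.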

Ced

On 16 September 2016 at 00:29, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Thu, Sep 15, 2016 at 06:23:24AM -0400, Mike Marshall wrote:
>> If you squeeze out every byte won't you still have a short
>> write? And the written data wouldn't be cut at the bad
>> place, but it would have a weird hole or discontinuity there.
>
> ???
>
> What I mean is that if we have an invalid address in the middle of a buffer
> (unmapped, for example), we do not attempt to write every byte prior to that
> invalid address. Of course what we write is going to be contiguous.
>
> Suppose we have a buffer spanning 10 pages (amd64, so these are 4K ones) -
> 7 valid, 3 invalid:
> VVVVIIIVVV
> and it starts 100 bytes into the first page. And write goes into a regular
> file on e.g. tmpfs, starting at offset 31. We _can't_ write more than
> 4*4096-100 bytes, no matter what. It will be a short write. As a matter
> of fact, it will be even shorter than that - it will be 3*4096-31 bytes,
> up to the last pagecache boundary we can cover completely. That obviously
> depends upon the filesystem - not everything uses pagecache, for starters.
> However, the caller is *not* guaranteed that write() with an invalid page
> in the middle of a buffer would write everything up to the very beginning
> of the invalid page. A short write will happen, but the amount written
> might be up to page size less than the actual length of valid part in the
> beginning of the buffer.
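
The arithmetic in the example above can be checked with a small sketch.
This is only a model of the numbers in the quoted paragraph, not of what
any particular filesystem does; the function names are invented here,
and the "stop at the last fully covered pagecache boundary" rule is
taken from the description above:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes of valid user memory before the first invalid page, for a
 * buffer that starts buf_off bytes into its first page and has
 * valid_pages valid pages in front of the first invalid one.
 */
size_t valid_prefix(size_t page_size, size_t valid_pages, size_t buf_off)
{
    assert(buf_off < page_size);
    return valid_pages * page_size - buf_off;
}

/*
 * Bytes written if the write starts at file offset file_off, may copy
 * at most max_len bytes, and stops at the last page boundary (in file
 * offsets) that it reaches. Simplification: returns 0 when no boundary
 * is reached; a real filesystem may behave differently in that case.
 */
size_t pagecache_bounded(size_t page_size, size_t file_off, size_t max_len)
{
    size_t end = file_off + max_len;
    size_t aligned_end = end - end % page_size;

    return aligned_end > file_off ? aligned_end - file_off : 0;
}
```

With page_size 4096, 4 valid pages and a 100-byte buffer offset,
valid_prefix() gives 4*4096-100 = 16284; feeding that and file offset 31
into pagecache_bounded() gives 3*4096-31 = 12257, i.e. 4027 bytes less
than the valid part of the buffer.
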
>
> Now, for writev() we could have invalid pages in any iovec; again, we
> obviously can't write anything past the first invalid page - we'll get
> either a short write or -EFAULT (if nothing got written). That's fine;
> the question is what the caller can count upon wrt shortening.
>
> Again, we are *not* guaranteed writing up to exact boundary. However, the
> current implementation will end up shortening no more than to the iovec
> boundary. I.e. if the first iovec contains only valid pages and there's
> an invalid one in the second iovec, the current implementation will write
> at least everything in the first iovec. That's _not_ promised by POSIX
> or our manpages; moreover, I'm not sure if it's even true for each filesystem.
> And keeping that property is actually inconvenient - if we could discard it,
> we could make partial-copy ->write_end() calls a lot more infrequent.
>
> Unfortunately, some of the LTP writev tests end up checking that writev() does
> behave that way - they feed it a three-element iovec with shorter-than-page
> segments, the second of which is all invalid. And they check that the
> entire first segment had been written.
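
For reference, a rough sketch of that kind of test follows. It is not
the LTP code; the function name, SEG_LEN and the use of a PROT_NONE
mapping as the "invalid" segment are assumptions made for illustration.
It only reports what writev() returned and deliberately does not insist
that the whole first segment was written:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define SEG_LEN 64 /* assumed segment length, shorter than a page */

/*
 * Hypothetical probe: three-element iovec whose middle segment points
 * into an inaccessible mapping. Returns writev()'s result, or -2 if
 * the setup failed. On a -1 return, *err holds the errno value.
 */
ssize_t probe_writev_bad_middle(int *err)
{
    char first[SEG_LEN], third[SEG_LEN];
    long page = sysconf(_SC_PAGESIZE);
    void *bad;
    FILE *tmp;
    ssize_t n;

    assert(err != NULL);
    *err = 0;
    memset(first, 'a', sizeof first);
    memset(third, 'c', sizeof third);
    if (page <= 0)
        return -2;

    /* PROT_NONE: any attempt to read from this page faults. */
    bad = mmap(NULL, (size_t)page, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (bad == MAP_FAILED)
        return -2;

    tmp = tmpfile(); /* which filesystem backs this is system-dependent */
    if (tmp == NULL) {
        munmap(bad, (size_t)page);
        return -2;
    }

    struct iovec iov[3] = {
        { .iov_base = first, .iov_len = SEG_LEN },
        { .iov_base = bad,   .iov_len = SEG_LEN },
        { .iov_base = third, .iov_len = SEG_LEN },
    };

    n = writev(fileno(tmp), iov, 3);
    if (n == -1)
        *err = errno; /* save before cleanup can clobber errno */

    fclose(tmp);
    munmap(bad, (size_t)page);
    return n;
}
```

Under the rule proposed below, a caller could rely only on the result
being either -1 with EFAULT or a short count of at most SEG_LEN; a check
for exactly SEG_LEN is what the relaxed semantics would no longer
guarantee.
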
>
> I would really like to drop that property, making it "if some addresses
> in the buffer(s) we are asked to write are invalid, the write will be
> shortened by up to a PAGE_SIZE from the first such invalid address", making
> writev() rules exactly the same as write() ones. Does anybody have objections
> to it?



--
Cedric Blancher <cedric.blancher@xxxxxxxxx>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur