[RFC] writev() semantics with invalid iovec in the middle

From: Al Viro
Date: Wed Sep 14 2016 - 17:35:04 EST


Right now writev() with a 3-iovec array that has an unmapped address in
the second element and total length less than PAGE_SIZE will write the
first segment and stop at that. Among other things, it guarantees a
short copy, and I would rather have it yield a 0-byte write (and -EFAULT as
return value).
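
For anyone who wants to poke at this case, here is a minimal sketch of a
reproducer. The helper name writev_bad_middle(), the 16-byte segment sizes
and the /tmp scratch path are made up for illustration; none of them come
from the mail above. What it returns depends on the kernel and the
filesystem, so no expected value is given.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: writev() three small segments whose middle one
 * points into an unmapped page.  Returns the writev() result, or -errno.
 */
static long writev_bad_middle(void)
{
	static char first[16] = "first segment\n";
	static char third[16] = "third segment\n";
	char path[] = "/tmp/writev-rfc-XXXXXX";	/* assumed scratch location */
	long page = sysconf(_SC_PAGESIZE);
	struct iovec iov[3];
	char *hole;
	ssize_t n;
	long ret;
	int fd;

	assert(page > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -errno;
	unlink(path);

	/* Map one page and unmap it again to get an address that faults. */
	hole = mmap(NULL, page, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (hole == MAP_FAILED) {
		ret = -errno;
		close(fd);
		return ret;
	}
	munmap(hole, page);

	iov[0].iov_base = first;
	iov[0].iov_len = sizeof(first);
	iov[1].iov_base = hole;		/* invalid address in the middle */
	iov[1].iov_len = 16;
	iov[2].iov_base = third;
	iov[2].iov_len = sizeof(third);

	/* Total length is 48 bytes, well below PAGE_SIZE. */
	n = writev(fd, iov, 3);
	ret = n < 0 ? -errno : n;	/* save errno before close() */
	close(fd);
	return ret;
}
```

A return of 16 would be the "first segment only" behaviour described above;
-EFAULT would be the 0-byte outcome I would prefer.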

All POSIX has to say about that is this (in 2.3 Error Numbers):

[EFAULT]
Bad address. The system detected an invalid address in attempting to use
an argument of a call. The reliable detection of this error cannot be
guaranteed, and when not detected may result in the generation of a signal,
indicating an address violation, which is sent to the process.

Note that an unmapped page in the middle of a range covered already can lead
to the same kind of short write - i.e. if we have
	p = mmap(0, 3*4096, PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	munmap(p + 4096, 4096);
	fd = open("/tmp/foo", O_CREAT|O_TRUNC|O_RDWR, 0777);
	write(fd, p + 2048, 8192);

write() will yield -EFAULT rather than 2KB stored. The same will happen with
	writev(fd, &(struct iovec){p + 2048, 8192}, 1);
BTW, adding lseek(fd, 2049, SEEK_SET); before that write (or writev) will
result in 2047 bytes being written by the latter.
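
A self-contained version of the snippet above, as a sketch only. The helper
name write_across_hole(), its "start" parameter and the mkstemp() scratch
file are illustrative assumptions, and sysconf(_SC_PAGESIZE) stands in for
the hard-coded 4096 so that it also runs on other page sizes. The numbers
quoted above (-EFAULT at offset 0, 2047 bytes after seeking to 2049) are
for 4096-byte pages; other kernels or filesystems may behave differently.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map three pages, punch out the middle one, then
 * write() two pages' worth starting half a page into the first page.
 * "start" is the file offset to seek to before writing.
 * Returns the write() result, or -errno.
 */
static long write_across_hole(off_t start)
{
	char path[] = "/tmp/write-rfc-XXXXXX";	/* assumed scratch location */
	long page = sysconf(_SC_PAGESIZE);
	char *p;
	ssize_t n;
	long ret;
	int fd;

	assert(page > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -errno;
	unlink(path);

	if (lseek(fd, start, SEEK_SET) < 0) {
		ret = -errno;
		close(fd);
		return ret;
	}

	p = mmap(NULL, 3 * page, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		ret = -errno;
		close(fd);
		return ret;
	}
	munmap(p + page, page);		/* hole in the middle of the range */

	/* Only the first page/2 bytes of this buffer are readable. */
	n = write(fd, p + page / 2, 2 * page);
	ret = n < 0 ? -errno : n;	/* save errno before cleanup */

	munmap(p, page);
	munmap(p + 2 * page, page);
	close(fd);
	return ret;
}
```

write_across_hole(0) corresponds to the plain write() above, and
write_across_hole(2049) to the variant with the lseek() added.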

IOW, we do not try to squeeze every byte that can be squeezed out of the
buffer; generally, an unmapped address anywhere in PAGE_SIZE worth of data
that would go into the same page-aligned chunk of destination can result in
a short write cut at the beginning of that chunk. iovec boundaries act
as barriers to short writes, mostly by accident.
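
The rule for a single buffer can be written down as a toy model. This is
only an illustration of the arithmetic, not kernel code; the function
model_short_write() and its parameters are made up here. "good" is the
number of readable bytes at the start of the source buffer before the first
unmapped address, and a result of 0 stands for the -EFAULT case.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the worst case: walk the destination in
 * page-aligned chunks and stop at the start of the first chunk whose
 * source data contains an unreadable byte.
 *
 * pos:  file position the write starts at
 * len:  number of bytes requested
 * good: readable bytes at the start of the source buffer
 * page: page size in bytes
 */
static size_t model_short_write(size_t pos, size_t len, size_t good,
				size_t page)
{
	size_t done = 0;

	assert(page > 0);

	while (done < len) {
		/* Bytes left in the destination page holding pos + done. */
		size_t chunk = page - (pos + done) % page;

		if (chunk > len - done)
			chunk = len - done;
		/* A fault anywhere in this chunk drops the whole chunk. */
		if (done + chunk > good)
			break;
		done += chunk;
	}
	return done;
}
```

With the numbers from the example (8192 bytes requested, 2048 of them
readable, 4096-byte pages), the model gives 0 for a write at offset 0 and
2047 for a write at offset 2049.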

Do we need to preserve that special treatment of iovec boundaries? I would
really like to get rid of that - the current behaviour is an easy and reliable
way to trigger a short copy case in ->write_end() and those are fairly
brittle. Sure, we still need to cope with them, and I think I've got all
instances in the current mainline fixed, but they are often suboptimal.

Objections?