[PATCH 2/2] vfs: use &file->f_pos directly on files that have position

From: Kirill Smelkov
Date: Sat Apr 13 2019 - 13:24:54 EST


Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0087c
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

In fact, we *used* to (long ago) pass in the address of "file->f_pos"
itself to the low-level read/write routines. We then changed it to do
that indirection through a local copy of pos (and
file_pos_read/file_pos_write) because we didn't do the proper locking,
so different read/write versions could mess with each other (and with
lseek).

But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
accesses as per POSIX") did was to add the proper locking at least for
the cases that we care about deeply, so we *could* say that we have
three cases:

- FMODE_ATOMIC_POS: properly locked,
- FMODE_STREAM: no pos at all
- otherwise a "mostly don't care - don't mix!"

and so we could go back to not copying the pos at all, and instead do
something like

loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
ret = vfs_write(f.file, buf, count, ppos);

and perhaps have a long-term plan to try to get rid of the "don't mix"
case entirely (ie "if you use f_pos, then we'll do the proper
locking")

(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153ebf61 "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@xxxxxxxxxxxxxx/

Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Kirill Smelkov <kirr@xxxxxxxxxx>
---
fs/read_write.c | 36 ++++--------------------------------
1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
ssize_t ret = -EBADF;

if (f.file) {
- loff_t pos, *ppos = file_ppos(f.file);
- if (ppos) {
- pos = *ppos;
- ppos = &pos;
- }
- ret = vfs_read(f.file, buf, count, ppos);
- if (ret >= 0 && ppos)
- f.file->f_pos = pos;
+ ret = vfs_read(f.file, buf, count, file_ppos(f.file));
fdput_pos(f);
}
return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
ssize_t ret = -EBADF;

if (f.file) {
- loff_t pos, *ppos = file_ppos(f.file);
- if (ppos) {
- pos = *ppos;
- ppos = &pos;
- }
- ret = vfs_write(f.file, buf, count, ppos);
- if (ret >= 0 && ppos)
- f.file->f_pos = pos;
+ ret = vfs_write(f.file, buf, count, file_ppos(f.file));
fdput_pos(f);
}

@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
ssize_t ret = -EBADF;

if (f.file) {
- loff_t pos, *ppos = file_ppos(f.file);
- if (ppos) {
- pos = *ppos;
- ppos = &pos;
- }
- ret = vfs_readv(f.file, vec, vlen, ppos, flags);
- if (ret >= 0 && ppos)
- f.file->f_pos = pos;
+ ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
fdput_pos(f);
}

@@ -1042,14 +1021,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
ssize_t ret = -EBADF;

if (f.file) {
- loff_t pos, *ppos = file_ppos(f.file);
- if (ppos) {
- pos = *ppos;
- ppos = &pos;
- }
- ret = vfs_writev(f.file, vec, vlen, ppos, flags);
- if (ret >= 0 && ppos)
- f.file->f_pos = pos;
+ ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
fdput_pos(f);
}

--
2.21.0.593.g511ec345e1