Re: Update of file offset on write() etc. is non-atomic with I/O

From: Linus Torvalds
Date: Thu Feb 20 2014 - 12:14:38 EST


Yes, I do think we violate POSIX here because of how we handle f_pos
(the earlier thread from 2006 you point to talks about the "thread
safe" part, the point here about the actual wording of "atomic" is a
separate issue).

Long long ago we used to just pass in the pointer to f_pos directly,
and then the low-level write would update it all under the inode
semaphore, and all was good.

And then we ended up having tons of problems with non-regular files
and drivers accessing f_pos and having nasty races with it because
they did *not* have any locking (and very fundamentally didn't want
any), and we broke the serialization of f_pos. We still do the *IO*
atomically, but yes, the f_pos access itself is outside the lock.

Ho humm.. Al, any ideas of how to fix this?

Linus



On Mon, Feb 17, 2014 at 7:41 AM, Michael Kerrisk (man-pages)
<mtk.manpages@xxxxxxxxx> wrote:
> Hello all,
>
> A note from Yongzhi Pan about some of my own code led me to dig deeper
> and discover behavior that is surprising and also seems to be a
> fairly clear violation of POSIX requirements.
>
> It appears that write() (and, presumably read() and other similar
> system calls) are not atomic with respect to performing I/O and
> updating the file offset behavior.
>
> The problem can be demonstrated using the program below.
> That program takes three arguments:
>
> $ ./multi_writer num-children num-blocks block-size > somefile
>
> It creates 'num-children' children, each of which writes 'num-blocks'
> blocks of 'block-size' bytes to standard output; for my experiments,
> stdout is redirected to a file. After all children have finished,
> the parent inspects the size of the file written on stdout, calculates
> the expected size of the file, and displays these two values, and
> their difference on stderr.
>
> Some observations:
>
> * All children inherit the stdout file descriptor from the parent;
> thus the FDs refer to the same open file description, and therefore
> share the file offset.
>
> * When I run this on a multi-CPU BSD systems, I get the expected result:
>
> $ ./multi_writer 10 10000 1000 > g 2> run.log
> $ ls -l g
> -rw------- 1 mkerrisk users 100000000 Jan 17 07:34 g
>
> * Someone else tested this code for me on a Solaris system, and also got
> the expected result.
>
> * On Linux, by contrast, we see behavior such as the following:
>
> $ ./multi_writer 10 10000 1000 > g
> Expected file size: 100000000
> Actual file size: 16323000
> Difference: 83677000
> $ ls -l g
> -rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g
>
> Summary of the above output: some children are overwriting the output
> of other children because output is not atomic with respect to updates
> to the file offset.
>
> For reference, POSIX.1-2008/SUSv4 Section XSI 2.9.7 says:
>
> [[
> 2.9.7 Thread Interactions with Regular File Operations
>
> All of the following functions shall be atomic with respect to each other
> in the effects specified in POSIX.1-2008 when they operate on regular
> files or symbolic links:
>
>
> chmod()
> ...
> pread()
> read()
> ...
> readv()
> pwrite()
> ...
> write()
> writev()
>
>
> If two threads each call one of these functions, each call shall either
> see all of the specified effects of the other call, or none of them.
> ]]
>
> (POSIX.1-2001 has similar text.)
>
> This text is in one of the Threads sections, but it applies equally
> to threads in different processes as to threads in the same process.
>
> I've tested the code below on ext4, XFS, and BtrFS, on kernel 3.12 and a
> number of other recent kernels, all with similar results, which suggests
> the result is in the VFS layer. (Can it really be so simple as no locking
> around pieces such as
>
> loff_t pos = file_pos_read(f.file);
> ret = vfs_write(f.file, buf, count, &pos);
> if (ret >= 0)
> file_pos_write(f.file, pos);
>
> in fs/read_write.c?)
>
> I discovered this behavior after Yongzhi Pan reported some unexpected
> behavior in some of my code that forked to create a parent and
> child that wrote to the same file. In some cases, expected output
> was not appearing. In other words, after a fork(), and in the absence
> of any other synchronization technique, a parent and a child cannot
> safely write to the same file descriptor without risking overwriting
> each other's output. But POSIX requires this, and other systems seem
> to guarantee it.
>
> Am I correct to think there's a kernel problem here?
>
> Thanks,
>
> Michael
>
> ===
>
> /* multi_writer.c
> */
>
> #include <sys/wait.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <string.h>
> #include <errno.h>
>
> typedef enum { FALSE, TRUE } Boolean;
>
> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
> } while (0)
>
> #define fatal(msg) do { fprintf(stderr, "%s\n", msg); \
> exit(EXIT_FAILURE); } while (0)
>
> #define usageErr(msg, progName) \
> do { fprintf(stderr, "Usage: "); \
> fprintf(stderr, msg, progName); \
> exit(EXIT_FAILURE); } while (0)
>
> int
> main(int argc, char *argv[])
> {
> char *buf;
> int j, k, nblocks, nchildren;
> size_t blocksize;
> struct stat sb;
> // int nchanges;
> // off_t pos;
> long long expected;
>
> if (argc < 4 || strcmp(argv[1], "--help") == 0)
> usageErr("%s num-children num-blocks block-size [O_APPEND-flag]\n",
> argv[0]);
>
> nblocks = atoi(argv[2]);
> blocksize = atoi(argv[3]);
>
> buf = malloc(blocksize + 1);
> if (buf == NULL)
> errExit("malloc");
>
> /* If a fourth command-line argument is specified, set the O_APPEND
> flag on stdout */
>
> if (argc > 4)
> if (fcntl(STDOUT_FILENO, F_SETFL, O_APPEND) == -1)
> errExit("fcntl-F_SETFL");
>
> nchildren = atoi(argv[1]);
>
> /* Create child processes that write blocks to stdout */
>
> for (j = 0; j < nchildren; j++) {
> switch(fork()) {
> case -1:
> errExit("fork");
>
> case 0: /* Each child writes nblocks * blocksize bytes to stdout */
> // nchanges = 0;
>
> /* Put something distinctive in each child's buffer (in case
> we want to analyze byte sequences in the output) */
>
> for (k = 0; k < blocksize; k++)
> buf[k] = 'a' + getpid() % 26;
>
> for (k = 0; k < nblocks; k++) {
> // if (k > 0 && pos != lseek(STDOUT_FILENO, 0, SEEK_END))
> // nchanges++;
> if (write(STDOUT_FILENO, buf, blocksize) != blocksize)
> fatal("write");
> // pos = lseek(STDOUT_FILENO, 0, SEEK_END);
> }
>
> // fprintf(stderr, "%ld: nchanges = %d\n",
> // (long) getpid(), nchanges);
> exit(EXIT_SUCCESS);
>
> default:
> break; /* Parent falls through to create next child */
> }
> }
>
> /* Wait for all children to terminate */
>
> while (wait(NULL) > 0)
> continue;
>
> /* Compare final length of file against expected size */
>
> if (fstat(STDOUT_FILENO, &sb) == -1)
> errExit("fstat");
>
> expected = blocksize * nblocks * nchildren;
> fprintf(stderr, "Expected file size: %10lld\n", expected);
> fprintf(stderr, "Actual file size: %10lld\n", (long long) sb.st_size);
> fprintf(stderr, "Difference: %10lld\n", expected - sb.st_size);
>
> exit(EXIT_SUCCESS);
> }
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/