Re: [PATCH] clear PageError bit in msync & fsync

From: Rik van Riel
Date: Thu Nov 11 2010 - 23:37:00 EST


On 11/09/2010 04:41 PM, Andrew Morton wrote:

yup. It's a userspace bug, really. Although that bug might be
expressed as "userspace didn't know about linux-specific EIO
behaviour".

Looking at this some more, I am not convinced this is a userspace
bug.

First, let me describe the problem scenario:
1) process A calls write
2) process B calls write
3) process A calls fsync, runs into an IO error, returns -EIO
4) process B calls fsync, returns success
(even though data could have been lost!)

Common sense, as well as these snippets from the fsync man
page, suggest that this behaviour is incorrect:

DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device
...
RETURN VALUE
On success, these system calls return zero. On error, -1 is
returned, and errno is set appropriately.

Johannes had the idea of storing an IO error sequence idea
in the file descriptor and the address_space (or inode).
That way when an IO error happens, we can increment the
IO error sequence counter (mapping->io_error_seq).

When doing an fsync or msync, we can see whether the
inode's IO error sequence has been incremented since the
last time we looked. If it has, we set our own counter
to the same number and return -EIO to userspace.

The problem here is that the same file descriptor may be
shared between multiple processes. For example, on fork
the child process inherits file descriptors from the parent.
If the child reads 4kB, the file position indicator (f_pos)
will be changed for the parent (and siblings), too.

This is an odd quirk of Unix semantics, maybe inherited
from vfork (I suspect it would have been easier for everybody
if the child got a _copy_ of each file descriptor, instead of
sharing them with parent and siblings), but chances are there
is some program out there by now that relies on this behaviour.

Given that the file descriptors are shared, we will run into
the same race above, when process A and B are siblings or
parent & child.

That leaves the solution of putting the IO error sequence
number in its own little array pointed to by struct files_struct.
Not only will that complicate expand_files, it will also require
that we tell vfs_sync and similar functions either the file
descriptor number or a pointer to the io error sequence counter,
because the "struct file" does not tell us the file descriptor
number...

I'm not sure anyone will like the complexity of the above
solution (hi Al).

There is another "solution" to consider - redirtying pages
when write IO fails. This has the issue of potentially filling
up system memory with uncleanable dirty pages and deadlocking
the whole system.

There does not seem to be a nice solution to this problem, but
it does look like a problem we may want to fix.

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/