IO error semantics

From: Nick Piggin
Date: Mon Jan 18 2010 - 01:05:28 EST

On Mon, Jan 18, 2010 at 04:18:47PM +1100, Nick Piggin wrote:
> We also need to remove some ClearPageUptodate calls I think (similar
> issues), so keep those in mind too. Unfortunately it looks like there
> are also a lot of filesystem specific tests of PageUptodate... but you
> could also move those under the new compatibility s_flag.
> I don't know of a really good way to inject and test filesystem errors.
> Making requests fail causes most filesystems to quickly go read-only or
> have bigger problems. If you're careful and, say, only fail read IOs for
> data, or only fail write IOs not involved in integrity or journal
> operations, then test programs just tend to abort pretty quickly. Does
> anyone know of anything more systematic?

This might be a good time to bring up IO error behaviour again. I got
into some debates about it, I think on Andi's hwpoison thread a while
back, but that was probably not the appropriate thread to find a real
solution to this.

The problem we have now is that IO error semantics are not well defined.
It is hard to even enumerate all the issues.

read IOs
How to retry? Appropriate defaults should happen at the block layer, I
think. Should retry behaviour be tunable by the mm/fs, or should that
be coded explicitly as submission retry loops? Either way implies that
there are either similar defaults for all types (or maybe classes) of
drivers, or some way to query/set this.
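
As a rough illustration of the "explicit submission retry loop" option,
here is a minimal userspace sketch. It is not kernel code and not an
existing API: io_fn, read_with_retry, struct flaky_dev and flaky_read
are all hypothetical names, and the choice to retry only on EIO is an
assumption. The flaky device exists only to inject failures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical IO callback: returns bytes read, or -1 with errno set. */
typedef ssize_t (*io_fn)(void *ctx, void *buf, size_t len, off_t off);

/*
 * Submit a read and retry it up to max_retries more times if it fails
 * with EIO. Any other error is returned to the caller immediately.
 */
static ssize_t read_with_retry(io_fn do_read, void *ctx, void *buf,
                               size_t len, off_t off, int max_retries)
{
    ssize_t ret = -1;
    int attempt;

    assert(max_retries >= 0);

    for (attempt = 0; attempt <= max_retries; attempt++) {
        ret = do_read(ctx, buf, len, off);
        if (ret >= 0 || errno != EIO)
            break;
    }
    return ret;
}

/* Hypothetical fault-injecting device: fails the first N reads. */
struct flaky_dev {
    int failures_left;
    int calls;
};

static ssize_t flaky_read(void *ctx, void *buf, size_t len, off_t off)
{
    struct flaky_dev *dev = ctx;

    (void)off;
    dev->calls++;
    if (dev->failures_left > 0) {
        dev->failures_left--;
        errno = EIO;
        return -1;
    }
    memset(buf, 0, len);    /* pretend the device returned zeroes */
    return (ssize_t)len;
}
```

With the retry count passed in by the caller, the policy lives at the
submission site. The tunable alternative would move max_retries behind
some per-device or per-class default instead.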

It would be nice to be able to set fs/driver behaviour from userspace
too, in a generic (not driver- or fs-specific) way. But defaults should
be reasonable and similar between all, I guess.

write IOs
This is more interesting. How should write IO errors be handled? In my
opinion we must not invalidate the data before an IO error is returned
to somebody (whether it be fsync or a synchronous write syscall). Any
earlier and the app just gets read-after-write (RAW) consistency
randomly violated. And I think it is important to treat IO errors as
transparently as possible until the error can be detected.

I happen to think that actually we should go further and not
invalidate the data at all. This makes implementation simpler, and
also allows us to retry writes like we can retry reads. It's also
problematic to throw out errors at that point because *sync syscalls
coming from elsewhere could result in loss of error reporting (think,

If we go this way, we probably need another syscall and fs helper call
to invalidate the dirty data when we give up on retries. truncate_range
is probably not appropriate because it is much harder to implement and
maybe we want to try to get at the most recent data that is on disk.
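
The scheme described above can be modelled with a toy state machine.
This is only a sketch under stated assumptions: struct toy_page is not
the kernel's struct page, and toy_writeback, toy_fsync, toy_discard and
the failing device are hypothetical names made up for illustration.
Clearing the pending error once a retry succeeds is also an assumption,
not something the scheme above settles.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one cached page; not the kernel's struct page. */
struct toy_page {
    bool uptodate;  /* cached contents are valid to read */
    bool dirty;     /* cached contents still need to reach disk */
    int wb_error;   /* last writeback error, 0 if none pending */
};

/* Hypothetical device hook: returns 0 or a negative errno. */
typedef int (*toy_write_fn)(void *dev);

/*
 * Writeback attempt. On failure the page stays dirty and uptodate, so
 * readers keep seeing the new data and the write can be retried.
 */
static void toy_writeback(struct toy_page *pg, toy_write_fn wr, void *dev)
{
    int err;

    assert(pg != NULL);
    if (!pg->dirty)
        return;

    err = wr(dev);
    if (err) {
        pg->wb_error = err;
    } else {
        pg->dirty = false;
        pg->wb_error = 0;   /* assumption: success clears the error */
    }
}

/* fsync retries whatever is still dirty and reports a remaining error. */
static int toy_fsync(struct toy_page *pg, toy_write_fn wr, void *dev)
{
    toy_writeback(pg, wr, dev);
    return pg->dirty ? pg->wb_error : 0;
}

/* Hypothetical explicit "give up" call: only this drops the data. */
static void toy_discard(struct toy_page *pg)
{
    pg->dirty = false;
    pg->uptodate = false;
    pg->wb_error = 0;
}

/* Hypothetical device that fails its first N writes with -EIO. */
struct toy_dev {
    int failures_left;
};

static int toy_dev_write(void *ctx)
{
    struct toy_dev *dev = ctx;

    if (dev->failures_left > 0) {
        dev->failures_left--;
        return -EIO;
    }
    return 0;
}
```

In this model a failed write never invalidates the page by itself; the
data goes away only through toy_discard, which stands in for the extra
syscall and fs helper mentioned above.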

Also do we need to think about O_SYNC or -o sync type of writes that
are implemented via writeback cache? We could invalidate the dirtied
cache ASAP, which would leave a window where a concurrent read can see
first new, then old data. It would also kind of break the above scheme
in case the pagecache was already dirty via a descriptor without
O_SYNC. It might just make sense to leave the pagecache dirty. Either
way it should be documented I think.

Do we even care enough to bother thinking about this now? (serious question)
