Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keepAS_HWPOISON sticky

From: Dave Chinner
Date: Tue Aug 28 2012 - 23:00:02 EST


On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
> On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
> > On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> > > Let me explain more to clarify my whole scenario. If a memory error
> > > hits on a dirty pagecache, kernel works like below:
> > >
> > > 1. handles a MCE interrupt (logging MCE events,)
> > > 2. calls memory error handler (doing 3 to 6,)
> > > 3. sets PageHWPoison flag on the error page,
> > > 4. unmaps all mappings to processes' virtual addresses,
> >
> > So nothing in userspace sees the bad page after this.
> >
> > > 5. sets AS_HWPOISON on mappings to which the error page belongs
> > > 6. invalidates the error page (unlinks it from LRU list and removes
> > > it from pagecache,)
> > > (memory error handler finished)
> >
> > Ok, so the moment a memory error is handled, the page has been
> > removed from the inode's mapping, and it will never be seen by
> > aplications again. It's a transient error....
> >
> > > 7. later accesses to the file returns -EIO,
> > > 8. AS_HWPOISON is cleared when the file is removed or completely
> > > truncated.
> >
> > .... so why do we have to keep an EIO on the inode forever?
> >
> > If the page is not dirty, then just tossing it from the cache (as
> > is already done) and rereading it from disk next time it is accessed
> > removes the need for any error to be reported at all. It's
> > effectively a transient error at this point, and as such no errors
> > should be visible from userspace.
> >
> > If the page is dirty, then it needs to be treated just like any
> > other failed page write - the page is invalidated and the address
> > space is marked with AS_EIO, and that is reported to the next
> > operation that waits on IO on that file (i.e. fsync)
> >
> > If you have a second application that reads the files that depends
> > on a guarantee of good data, then the first step in that process is
> > that application that writes it needs to use fsync to check the data
> > was written correctly. That ensures that you only have clean pages
> > in the cache before the writer closes the file, and any h/w error
> > then devolves to the above transient clean page invalidation case.
>
> Thank you for detailed explanations.
> And yes, I understand it's ideal, but many applications choose not to
> do that for performance reason.

You choose: data integrity or performance.

> So I think it's helpful if we can surely report to such applications.

If performance is chosen over data integrity, we are under no
obligation to keep the error around indefinitely. Fundamentally,
ensuring a write completes successfully is the reponsibility of the
application, not the kernel. There are so many different filesytem
and storage errors that can be lost right now because data is not
fsync()d, adding another one to them really doesn't change anything.
IOWs, a memory error is no different to a disk failing or the system
crashing when it comes to data integrity. If you care, you use
fsync().

> > Hence I fail to see why this type of IO error needs to be sticky.
> > The error on the mapping is transient - it is gone as soon as the
> > page is removed from the mapping. Hence the error can be dropped as
> > soon as it is reported to userspace because the mapping is now error
> > free.
>
> It's error free only for the applications which do fsync check in
> each write, but not for the applications which don't do.
> I think the penalty for the latters (ignore dirty data lost and get
> wrong results) is too big to consider it as a reasonable trade-off.

I'm guessing that you don't deal with data integrity issues very
often. What you are suggesting is not a reasonable tradeoff - either
applications are coded correctly for data integrity, or they give
up any expectation that errors will be detected and reported
reliably. Hoping that we might be able to report an error somewhere
to someone who didn't care to avoid or collect in the first place
does not improve the situation for anyone....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/