Re: [PATCH 08/19] ceph: address space operations

From: Sage Weil
Date: Fri Jul 24 2009 - 00:45:29 EST


On Thu, 23 Jul 2009, Trond Myklebust wrote:
> On Thu, 2009-07-23 at 11:26 -0700, Sage Weil wrote:
> > A related question I had on writepages failures: what is the 'right' thing
> > to do if we get a server error on writeback? If we believe it may be
> > transient (say, ENOSPC), should we redirty pages and hope for better luck
> > next time?
>
> How would ENOSPC be transient? On most systems, ENOSPC requires some
> kind of user action in order to allow recovery, so they will pass the
> error back to the application.

In a distributed environment, other users may be deleting data, or the
cluster might be expanding/rebalancing as new storage is added to the
system. Of course, any retry after ENOSPC should be limited to a small
number of additional attempts.
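
A minimal sketch of what "limited to a small number of additional
attempts" could look like, written as plain userspace C rather than
actual ceph code. The names wb_state, wb_handle_result, and
MAX_ENOSPC_RETRIES are hypothetical, and the limit of 3 is an arbitrary
assumption:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical retry limit; the value 3 is an assumption, not from the patch. */
#define MAX_ENOSPC_RETRIES 3

/* What the writeback completion path should do with the pages. */
enum wb_action { WB_DONE, WB_REDIRTY, WB_FAIL };

/* Hypothetical per-inode (or per-request) writeback bookkeeping. */
struct wb_state {
    int enospc_retries; /* consecutive ENOSPC failures seen so far */
    int saved_err;      /* error to hand back to the application, 0 if none */
};

/*
 * Decide what to do after a writeback attempt returned err
 * (0 on success, negative errno on failure).
 */
static enum wb_action wb_handle_result(struct wb_state *st, int err)
{
    assert(st != NULL);
    assert(err <= 0);

    if (err == 0) {
        /* Success: forget about earlier transient failures. */
        st->enospc_retries = 0;
        return WB_DONE;
    }

    if (err == -ENOSPC && st->enospc_retries < MAX_ENOSPC_RETRIES) {
        /* Possibly transient: redirty and try again later. */
        st->enospc_retries++;
        return WB_REDIRTY;
    }

    /* Retry budget exhausted, or an error we never retry. */
    st->saved_err = err;
    return WB_FAIL;
}
```

With a zeroed wb_state, the first three -ENOSPC results ask for a
redirty and the fourth is treated as fatal, with the error saved for
the application.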

> On the other hand, an error due to a storage element rebooting might be
> transient, and can probably be dealt with by retrying. It depends on
> what kind of contract you have with applications w.r.t. data integrity.

The general strategy with an unresponsive server is the same as NFS: just
wait indefinitely. (Control-c works, though.)

> > What if we decide it's a fatal error?
>
> Well, the NFS client will record the error, and then pass it back to the
> application on the next write() or on close(). However this strategy
> relies partly on the fact that all NFS clients are required to flush
> pending writes to permanent storage on close().

I see. Looking through the code, I see SetPageError(page) along with the
end_page_writeback stuff, and the error code in the nfs_open_context.
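
As a rough model of that record-now, report-later pattern (a userspace
sketch only; struct open_ctx and both helpers are hypothetical stand-ins,
not the real nfs_open_context or its API, and clearing the error once it
has been reported is an assumption):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-open context. */
struct open_ctx {
    int error; /* 0, or negative errno from the last failed writeback */
};

/* Called when an asynchronous writeback completes with err (0 or -errno). */
static void ctx_record_wb_error(struct open_ctx *ctx, int err)
{
    assert(ctx != NULL);
    if (err < 0)
        ctx->error = err; /* remember it; nobody is waiting right now */
}

/*
 * Called from the next write() or close() on this open file:
 * return any saved error and clear it so it is reported once.
 */
static int ctx_report_error(struct open_ctx *ctx)
{
    int err;

    assert(ctx != NULL);
    err = ctx->error;
    ctx->error = 0;
    return err;
}
```

After ctx_record_wb_error(&ctx, -EIO), the next ctx_report_error(&ctx)
returns -EIO and the one after that returns 0.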

The part I don't understand is what actually happens to pages after the
error flag is set. They're still uptodate, but no longer dirty? And can be
overwritten/redirtied? There's also an error flag on the address_space.
Are there any guidelines as far as which should be used?

Thanks-
sage



>
> Cheers
> Trond
>
> > sage
> >
> >
> > On Thu, 23 Jul 2009, Andi Kleen wrote:
> >
> > > Sage Weil <sage@xxxxxxxxxxxx> writes:
> > >
> > > > The ceph address space methods are concerned primarily with managing
> > > > the dirty page accounting in the inode, which (among other things)
> > > > must keep track of which snapshot context each page was dirtied in,
> > > > and ensure that dirty data is written out to the OSDs in snapshot
> > > > order.
> > > >
> > > > A writepage() on a page that is not currently writeable due to
> > > > snapshot writeback ordering constraints is ignored (it was presumably
> > > > called from kswapd).
> > >
> > > Not a detailed review. You would need to get one from someone who
> > > knows the VFS interfaces very well (unfortunately those people are hard
> > > to find). I just read through it.
> > >
> > > One thing I noticed is that you seem to do a lot of memory allocation
> > > in the write out paths (some of it even GFP_KERNEL, not GFP_NOFS).
> > >
> > > The traditional wisdom is that you should not allocate memory in block
> > > writeout, because that can deadlock. The worst case is a swapfile
> > > on it, but it can happen with mmap too (e.g. one process using
> > > most memory with a file mmap from your fs). GFP_KERNEL can also recurse,
> > > which can cause other problems in your fs.
> > >
> > > There were some changes to make this problem less severe (e.g. better
> > > dirty pages accounting), but I don't think anyone has really declared
> > > it solved yet. The standard workaround for this is to use mempools
> > > for anything allocated in the writeout path, then you are at least
> > > guaranteed to make forward progress.
> > >
> > > You also had at least one unchecked kmalloc I think.
> > >
> > > -Andi
> > >
> > > --
> > > ak@xxxxxxxxxxxxxxx -- Speaking for myself only.
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/