Re: [RFC][PATCH 1/2] Add a super operation for writeback

From: Dave Chinner
Date: Mon Jun 02 2014 - 23:34:35 EST

[ please line wrap at something sane like 68 columns ]

On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> On 06/01/2014 08:15 PM, Dave Chinner wrote:
> > On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
> >> +
> >> +/*
> >> + * Add inode to writeback dirty list with current time.
> >> + */
> >> +void inode_writeback_touch(struct inode *inode)
> >> +{
> >> + struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> >> + spin_lock(&bdi->wb.list_lock);
> >> + inode->dirtied_when = jiffies;
> >> + list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> >> + spin_unlock(&bdi->wb.list_lock);
> >> +}
> >> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> > You should be able to use redirty_tail() for this....
> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
> not correct because the inode needs to end up on the dirty list
> whether it was already there or not.

redirty_tail() always moves the inode to the end of the dirty

> This requirement is analogous to __mark_inode_dirty() and must
> tolerate similar races. At the microoptimization level, calling
> redirty_tail from inode_writeback_touch would be less efficient
> and more bulky. Another small issue is, redirty_tail does not
> always update the timestamp, which could trigger some bogus
> writeback events.

redirty_tail does not update the timestamp when it doesn't need to
change. If it needs to be changed because the current value would
violate the time ordering requirements of the list, it rewrites it.

So there is essentially no functional difference between the new
function and redirty_tail....

> > Hmmmm - this is using the wb dirty lists and locks, but you
> > don't pass the wb structure to the writeback callout? That seem
> > wrong to me - why would you bother manipulating these lists if
> > you aren't using them to track dirty inodes in the first place?
> From Tux3's point of view, the wb lists just allow fs-writeback to
> determine when to call ->writeback(). We agree that inode lists
> are a suboptimal mechanism, but that is how fs-writeback currently
> thinks. It would be better if our inodes never go onto wb lists in
> the first place, provided that fs-writeback can still drive
> writeback appropriately.

It can't, and definitely not with the callout you added.

However, we already avoid the VFS writeback lists for certain
filesystems for pure metadata. e.g. XFS does not use the VFS dirty
inode lists for inode metadata changes. They get tracked internally
by the transaction subsystem which does it's own writeback according
to the requirements of journal space availability.

This is done simply by not calling mark_inode_dirty() on any
metadata only change. If we want to do the same for data, then we'd
simply not call mark_inode_dirty() in the data IO path. That
requires a custom ->set_page_dirty method to be provided by the
filesystem that didn't call

__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

and instead did it's own thing.

So the per-superblock dirty tracking is something we can do right
now, and some filesystems do it for metadata. The missing piece for
data is writeback infrastructure capable of deferring to superblocks
for writeback rather than BDIs....

> Perhaps fs-writeback should have an option to work without inode
> lists at all, and just maintain writeback scheduling statistics in
> the superblock or similar. That would be a big change, more on the
> experimental side. We would be happy to take it on after merge,
> hopefully for the benefit of other filesystems besides Tux3.

Well, I don't see it that way. If you have a requirement to be able
to track dirty inodes internally, then lets move to that implement
that infrastructure rather than hacking around with callouts that
only have a limited shelf-life.

> What we pass to ->writeback() is just a matter of taste at the
> moment because we currently ignore everything except
> &work->nr_pages. Anything sane is fine. Note that bdi_writeback is
> already available from bdi->wb, the "default writeback info",
> whatever that means. A quick tour of existing usage suggests that
> you can reliably obtain the wb that way, but a complete audit
> would be needed.
> Most probably, this API will evolve as new users arrive, and also
> as our Tux3 usage becomes more sophisticated. For now, Tux3 just
> flushes everything without asking questions. Exactly how that
> might change in the future is hard to predict. You are in a better
> position to know what XFS would require in order to use this
> interface.

XFS already has everything it needs internally to track dirty
inodes. In fact, we used to do data writeback from within XFS and we
slowly removed it as the generic writeback code was improved made
the XFS code redundant.

That said, parallelising writeback so we can support hundreds of
GB/s of delayed allocation based writeback is something we
eventually need to do, and that pretty much means we need to bring
dirty data inode tracking back into XFS.

So what we really need is writeback infrastructure that is:

a) independent of the dirty inode lists
b) implements background, periodic and immediate writeback
c) can be driven by BDI or superblock

IOWs, the abstraction we need for this writeback callout is not at
the writeback_sb_inodes() layer, it's above the dirty inode list
queuing layer. IOWs, the callout needs to be closer to the
wb_do_writeback() layer which drives all the writeback work
including periodic and background writeback, and the interactions
between foreground and background work that wb_writeback() uses
needs to be turned into a bunch of helpers...

> > Finally, I don't like the way the wb->list_lock is treated
> > differently by this code. I suspect that we need to rationalise
> > the layering of the wb->list_lock as it is currently not clear
> > what it protects and what (nested) layers of the writeback code
> > actually require it.
> One design goal of this proposed writeback interface is to hide
> the wb list lock entirely from the filesystem so core writeback
> can evolve more easily.

Yes, I realise that. Hence my point that the bleeding of it across
layers of writeback infrstructure is sub-optimal.

> This is not cast in stone, but it smells
> like decent factoring. We can save some spinlocks by violating
> that factoring (as Hirofumi's original hack did) at the cost of
> holding a potentially busy wb lock longer, which does not seem
> like a good trade.

If we abstract at a higher layer, the wb lock protects just the work
queuing and dispatch, and everything else can be done by the
superblock callout. If the superblock callout is missing, then
we simply fall down to the existing wb_writeback() code that retakes
the lock and does it's work....

Let's get the layering and abstraction in the right place the first
time, rather having to revist this in the not-to-distant future.


Dave Chinner
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at