Re: [RFC][PATCH 1/2] Add a super operation for writeback

From: Daniel Phillips
Date: Tue Jun 03 2014 - 03:01:23 EST

Hi Dave,
On 06/02/2014 08:33 PM, Dave Chinner wrote:
> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
>> not correct because the inode needs to end up on the dirty list
>> whether it was already there or not.
> redirty_tail() always moves the inode to the end of the dirty
> list.
>> This requirement is analogous to __mark_inode_dirty() and must
>> tolerate similar races. At the microoptimization level, calling
>> redirty_tail from inode_writeback_touch would be less efficient
>> and more bulky. Another small issue is, redirty_tail does not
>> always update the timestamp, which could trigger some bogus
>> writeback events.
> redirty_tail does not update the timestamp when it doesn't need to
> change. If it needs to be changed because the current value would
> violate the time ordering requirements of the list, it rewrites it.
> So there is essentially no functional difference between the new
> function and redirty_tail....

Hirofumi, would you care to comment?

>>> Hmmmm - this is using the wb dirty lists and locks, but you
>>> don't pass the wb structure to the writeback callout? That seem
>>> wrong to me - why would you bother manipulating these lists if
>>> you aren't using them to track dirty inodes in the first place?
>> From Tux3's point of view, the wb lists just allow fs-writeback to
>> determine when to call ->writeback(). We agree that inode lists
>> are a suboptimal mechanism, but that is how fs-writeback currently
>> thinks. It would be better if our inodes never go onto wb lists in
>> the first place, provided that fs-writeback can still drive
>> writeback appropriately.
> It can't, and definitely not with the callout you added.

Obviously, fs-writeback does not get to choose inodes or specific pages
with our proposal (which is the whole point) but it still decides when
to call Tux3 vs some other filesystem on the same device, and is still
able to indicate how much data it thinks should be written out. Even
though we ignore the latter suggestion and always flush everything, we
appreciate the timing of those callbacks very much, because they give
us exactly the pressure sensitive batching behavior we want.

> However, we already avoid the VFS writeback lists for certain
> filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> inode lists for inode metadata changes. They get tracked internally
> by the transaction subsystem which does it's own writeback according
> to the requirements of journal space availability.
> This is done simply by not calling mark_inode_dirty() on any
> metadata only change. If we want to do the same for data, then we'd
> simply not call mark_inode_dirty() in the data IO path. That
> requires a custom ->set_page_dirty method to be provided by the
> filesystem that didn't call
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> and instead did it's own thing.
> So the per-superblock dirty tracking is something we can do right
> now, and some filesystems do it for metadata. The missing piece for
> data is writeback infrastructure capable of deferring to superblocks
> for writeback rather than BDIs....

We agree that fs-writeback inode lists are broken for anything
more sophisticated than Ext2. An advantage of the patch under
consideration is that it still lets fs-writeback mostly work the
way it has worked for the last few years, except for not allowing it
to pick specific inodes and data pages for writeout. As far as I
can see, it still balances writeout between different filesystems
on the same block device pretty well.

>> Perhaps fs-writeback should have an option to work without inode
>> lists at all, and just maintain writeback scheduling statistics in
>> the superblock or similar. That would be a big change, more on the
>> experimental side. We would be happy to take it on after merge,
>> hopefully for the benefit of other filesystems besides Tux3.
> Well, I don't see it that way. If you have a requirement to be able
> to track dirty inodes internally, then lets move to that implement
> that infrastructure rather than hacking around with callouts that
> only have a limited shelf-life.

What you call shelf-life, I call iteration. Small changes are beautiful.
Apologies for the rhetoric, content is below.

> XFS already has everything it needs internally to track dirty
> inodes. In fact, we used to do data writeback from within XFS and we
> slowly removed it as the generic writeback code was improved made
> the XFS code redundant.
> That said, parallelising writeback so we can support hundreds of
> GB/s of delayed allocation based writeback is something we
> eventually need to do, and that pretty much means we need to bring
> dirty data inode tracking back into XFS.
> So what we really need is writeback infrastructure that is:
> a) independent of the dirty inode lists
> b) implements background, periodic and immediate writeback
> c) can be driven by BDI or superblock
> IOWs, the abstraction we need for this writeback callout is not at
> the writeback_sb_inodes() layer, it's above the dirty inode list
> queuing layer. IOWs, the callout needs to be closer to the
> wb_do_writeback() layer which drives all the writeback work
> including periodic and background writeback, and the interactions
> between foreground and background work that wb_writeback() uses
> needs to be turned into a bunch of helpers...

I agree, and that is what Tux3 really wants. I just wonder about the
wisdom of embarking on a big change to fs-writeback when a small
change will do. How will fs-writeback balancing between different
filesystems on the same device work? What will those helpers look
like? How long will it take to prove the new scheme stable? How can
we prove that no existing fs-writeback clients will regress?

What about a sporting proposal: you post a patch that trumps ours in
elegance, that you could in theory use for XFS. We verify that it
works for Tux3 at least as well as the current patch. We also test
it for you.

>> This is not cast in stone, but it smells
>> like decent factoring. We can save some spinlocks by violating
>> that factoring (as Hirofumi's original hack did) at the cost of
>> holding a potentially busy wb lock longer, which does not seem
>> like a good trade.
> If we abstract at a higher layer, the wb lock protects just the work
> queuing and dispatch, and everything else can be done by the
> superblock callout. If the superblock callout is missing, then
> we simply fall down to the existing wb_writeback() code that retakes
> the lock and does it's work....
> Let's get the layering and abstraction in the right place the first
> time, rather having to revist this in the not-to-distant future.

That would be ideal, but incremental also has its attractions. In that
vein, here is a rewrite along the lines you suggested yesterday. The
only change to your code is, the writeback lock is dropped before calling
->writeback(). I tried doing it the other way (in the fs) but it was just
too ugly. As a result, generic_writeback_sb_inodes is called under the wb
list lock, but ->writeback drops it. An oddity that would obviously go
away with a better abstraction along the lines you suggest.