Re: [PATCH] issue storage device flush via sync_blockdev() (was Re:Linux 2.6.29)

From: Ric Wheeler
Date: Thu Mar 26 2009 - 10:23:21 EST


On 03/25/2009 10:31 PM, Eric Sandeen wrote:
Jeff Garzik wrote:
Eric Sandeen wrote:

What about when you're running over a big raid device with
battery-backed cache, and you trust the cache as much as much as the
disks. Wouldn't this unconditional cache flush be painful there on any
of the callers even if they're rare? (fs unmounts, freezes, unmounts,
etc? Or a fat filesystem on that device doing an fsync?)
What exactly do you think sync_blockdev() does? :)

It used to push os cached data to the storage. Now it tells the storage
to flush cache too (with your patch). This seems fine in general,
although it's not a panacea for all the various data integrity issues
that are being tossed about in this thread. :)

It is used right before a volume goes away. If that's not a time to
flush the cache, I dunno what is.

The _whole purpose_ of sync_blockdev() is to push out the data to
permanent storage. Look at the users -- unmount volume, journal close,
etc. Things that are OK to occur after those points include: power off,
device unplug, etc.

Sure. But I was thinking about enterprise raids with battery backup
which may last for days.

But, ok, I wasn't thinking quite right about the unmount situations etc;
even on enterprise raids like this, flushing things out on unmount makes
sense in the case where you lose power post-unmount and can't restore
power before the battery backup dies.

Enterprise raid systems don't have this issue. They all have sufficient battery power to safely destage the volatile cache to persistent storage on power outage (i.e., they keep enough drives spinning and so on to empty the cache).

Hardware RAID cards that you have in some servers (not external RAID arrays) would have this issue though and are probably fairly common.

I also wondered if a cache flush on one lun issues a cache flush for the
entire controller, or just for that lun. Hopefully the latter, in which
case it's not as big a deal.

Probably just for that LUN - LUN's are usually independent in almost all ways. Firmware of course could do anything, I would assume that most high end arrays would probably ignore a flush command in any case.

A secondary purpose of sync_blockdev() is as a hack, for simple/ancient
bdev-based filesystems that do not wish to bother with barriers and all
the associated complexity to tracking what writes do/do not need flushing.

Yep, I get that and it seems reasonable except ....

xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
not enabled, I think for that reason...
Enabling barriers causes slowdowns far greater than that of simply
causing fsync(2) to trigger FLUSH CACHE, because barriers imply FLUSH
CACHE issuance for all in-kernel filesystem journalled/atomic
transactions, in addition to whatever syscalls userspace is issuing.

The number of FLUSH CACHES w/ barriers is orders of magnitude larger
than the number of fsync/fdatasync calls.

I understand all that.

My point is that the above filesystems (xfs, reiserfs, ext4) skip the
blkdev flush on fsync when barriers are explicitly disabled.

They do this because if an admin disables barriers, they are trusting
that the write cache is nonvolatile and will be able to destage fully
even if external power is lost for some time.

In that case you don't need a blkdev_issue_flush on fsync either (or are
at least willing to live with the diminished risk, thanks to the battery
backup), and on xfs, ext4 etc you can turn it off (it goes away w/ the
barriers off setting). With this change to the simple generic fsync
path, you can't turn it off for those filesystems that use it for fsync.

But I suppose it's rare that anybody ever uses a filesystem which uses
this generic sync method on any sort of interesting storage like I'm
talking about, and it's not a big deal... (or maybe that interesting
storage just ignores cache flushes anyway, I dunno).

My main concerns were that these extra cache flushes for fsync aren't
tunable, and that flushes on one lun might affect other luns. I guess
I've talked myself out of those concerns in a couple different ways now. ;)

-Eric

I do enthusiastically agree that we should not be doing barriers and the blkdev flush for file systems that do barriers correctly.

ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/