Re: [sqlite] light weight write barriers

From: Ric Wheeler
Date: Fri Nov 16 2012 - 13:03:06 EST


On 11/16/2012 10:54 AM, Howard Chu wrote:
Ric Wheeler wrote:
On 11/16/2012 10:06 AM, Howard Chu wrote:
David Lang wrote:
barriers keep getting mentioned because they are an easy concept to understand:
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of it gets done." They fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did,

*some* users may accept that. *None* should.

but they get annoyed when things get corrupted to the point
that they lose the entire file.

this includes things like modifying one option and a crash resulting in the
config file being blank. Yes, you can do the "write to temp file, sync file,
sync directory, rename file" dance, but the fact that the user must sit and
wait for the syncs to take place can be a problem. It would be far better to
be able to say "write to temp file, and after it's on disk, rename the file"
and not have the user wait. The user doesn't really care if the changes hit
disk immediately or several seconds (or even tens of seconds) later, as long
as there is no possibility of the rename hitting disk before the file contents.
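For concreteness, that dance looks roughly like the following C sketch. It is
a minimal illustration, not anyone's actual implementation; it assumes both
files live in the current directory and abbreviates error handling.

/* Minimal sketch of the "write, sync, rename" dance described above.
 * Assumes tmppath and path are both in the current directory. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *path, const char *tmppath,
                 const void *buf, size_t len)
{
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Step 1: write the new contents and force them to stable storage. */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    close(fd);

    /* Step 2: atomically replace the old file with the new one. */
    if (rename(tmppath, path) != 0)
        return -1;

    /* Step 3: sync the directory so the rename itself is durable.
     * These synchronous waits are exactly what the user sits through. */
    int dfd = open(".", O_RDONLY | O_DIRECTORY);
    if (dfd >= 0) {
        fsync(dfd);
        close(dfd);
    }
    return 0;
}

What is being asked for is the same ordering of steps 1-3, but enforced by the
kernel in the background, so the application never blocks on the fsync() calls.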

The fact that this could be implemented in multiple ways on existing hardware
does not mean that multiple ways need to be exposed to userspace; it just means
that the cost of the operation will vary depending on the hardware that you
have. It also means that if new hardware introduces a better way of
implementing this, the improvement can be passed on to the users without
needing application changes.

There are a couple of industry failures here:

1) The drive manufacturers sell drives that lie, and consumers accept it
because they don't know better. We programmers, who know better, have failed
to raise a stink and demand that this be fixed.
A) Drives should not lose data on power failure. If a drive accepts a write
request and says "OK, done," then that data should get written to stable
storage, period. Whether that requires capacitors or some other onboard power
supply, they should just do it. Keep in mind that today, most of the
difference between enterprise drives and consumer desktop drives is just a
firmware change; the hardware is already identical. Nobody should accept a
product that doesn't offer this guarantee. It's inexcusable.
B) It should go without saying: drives should reliably report back to the
host when something goes wrong. E.g., if a write request has been accepted,
cached, and reported complete, but an ECC failure is then detected in the
cache line during the actual write, the drive needs to tell the host "oh, by
the way, block XXX didn't actually make it to disk like I told you it did
10ms ago."

If the entire software industry were to simply state "your shit stinks and
we're not going to take it any more" the hard drive industry would have no
choice but to fix it. And in most cases it would be a zero-cost fix for them.

Once you have drives that are actually trustworthy and reliable (which
doesn't mean they never fail, only that they tell the truth about successes
and failures), most of these other issues disappear, and most of the need
for barriers disappears.


I think that you are arguing a fairly silly point.

Seems to me that you're arguing that we should accept inferior technology. Who's really being silly?

No, just suggesting that you either pay for the expensive stuff or learn how to use cost-effective, high-capacity storage like the rest of the world.

I don't disagree that having non-volatile write caches would be nice, but everyone has learned how to deal with volatile write caches at the low end of the market.


If you want that behaviour, you have had it for more than a decade - simply
disable the write cache on your drive and you are done.

You seem to believe it's nonsensical for someone to want both fast and reliable writes, or that it's unreasonable for a storage device to offer the same, cheaply. And yet it is clearly trivial to provide all of the above.

I look forward to seeing your products in the market.

Until you have more than "I want" and "I think" on your storage system design resume, I suggest you spend the money to get the parts with non-volatile write caches or fix your code.

Ric


If you - as a user - want to run faster and use applications that are coded to
handle data integrity properly (fsync, fdatasync, etc.), leave the write cache
enabled and use file system barriers.
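"Coded properly" here means, roughly, the following sketch: the application
does not acknowledge a write as durable until the sync call has returned.
This is illustrative only, not taken from any particular program.

/* Sketch: nothing is reported durable until fdatasync() returns. */
#include <unistd.h>

int append_record_durably(int fd, const void *rec, size_t len)
{
    if (write(fd, rec, len) != (ssize_t)len)
        return -1;
    /* Flushes the file's data (and, with barriers in use, the drive's
     * volatile write cache); skips unrelated metadata updates. */
    return fdatasync(fd);
}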

Applications aren't supposed to need to worry about such details, that's why we have operating systems.

Drives should tell the truth. In event of an error detected after the fact, the drive should report the error back to the host. There's nothing nonsensical there.

When a drive's cache is enabled, the host should maintain a queue of written pages, of a length equal to the size of the drive's cache. If a drive says "hey, block XXX failed" the OS can reissue the write from its own queue. No muss, no fuss, no performance bottlenecks. This is what Real Computers did before the age of VAX Unix.
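No commodity drive reports such deferred failures today, so the following is
purely a sketch of the host-side bookkeeping under that assumption, with
hypothetical names throughout.

/* Hypothetical host-side replay queue for the scheme described above:
 * keep a RAM copy of each written page until the drive confirms it is
 * on stable media, so a deferred "block XXX failed" can be reissued. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define CACHE_PAGES 8192   /* sized to match the drive's cache */

struct pending_write {
    uint64_t lba;
    uint8_t  data[PAGE_SIZE];
    int      in_use;
};

static struct pending_write queue[CACHE_PAGES];

/* Remember a write until the drive confirms it is durable. */
static void remember_write(uint64_t lba, const void *data)
{
    struct pending_write *p = &queue[lba % CACHE_PAGES];
    p->lba = lba;
    memcpy(p->data, data, PAGE_SIZE);
    p->in_use = 1;
}

/* Drive says "block lba is durable": drop our copy. */
static void confirm_write(uint64_t lba)
{
    struct pending_write *p = &queue[lba % CACHE_PAGES];
    if (p->in_use && p->lba == lba)
        p->in_use = 0;
}

/* Drive says "block lba failed after I acked it": reissue from RAM. */
static int replay_write(uint64_t lba,
                        int (*reissue)(uint64_t, const void *))
{
    struct pending_write *p = &queue[lba % CACHE_PAGES];
    if (!p->in_use || p->lba != lba)
        return -1;              /* copy already retired */
    return reissue(p->lba, p->data);
}

Collisions and ordering are glossed over here; the point is only that a RAM
copy sized to the drive's cache makes deferred-error replay cheap.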

Everyone has to trade off cost against something else, and this is a very,
very long-standing trade-off that drive manufacturers have made.

With the cost of storage falling as rapidly as it has in recent years, this is a stupid trade-off.

