Re: Linux 2.6.29

From: Kyle Moffett
Date: Thu Mar 26 2009 - 00:58:41 EST


On Wed, Mar 25, 2009 at 11:40 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, 25 Mar 2009, Kyle Moffett wrote:
>> To be honest I think we could provide much better data consistency
>> guarantees and remove a lot of fsync() calls with just a basic
>> per-filesystem barrier() call.
>
> The problem is not that we have a lot of fsync() calls. Quite the reverse.
> fsync() is really really rare. So is being careful in general. The number
> of applications that do even the _minimal_ safety-net of "create new file,
> rename it atomically over an old one" is basically zero. Almost everybody
> ends up rewriting files with something like
>
>        open(name, O_CREAT | O_TRUNC, 0666)
>        write();
>        close();
>
> where there isn't an fsync in sight, nor any "create temp file", nor
> likely even any real error checking on the write(), much less the
> close().
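
For reference, the minimal safety net Linus mentions (create a temp
file, flush it, rename it over the old one) looks roughly like the
sketch below. The helper name and error handling are illustrative,
not taken from any existing codebase:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of "create temp file, fsync, rename atomically". */
static int replace_file(const char *name, const char *buf, size_t len)
{
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", name);

        fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0666);
        if (fd < 0)
                return -1;

        if (write(fd, buf, len) != (ssize_t)len)
                goto fail;
        if (fsync(fd) < 0)              /* data on stable storage first */
                goto fail;
        if (close(fd) < 0) {
                fd = -1;
                goto fail;
        }
        return rename(tmp, name);       /* atomic replace of the old file */

fail:
        if (fd >= 0)
                close(fd);
        unlink(tmp);
        return -1;
}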

Really, I think virtually all of the database programs would be
perfectly happy with an "fsbarrier(fd, flags)" syscall, where if "fd"
points to a regular file or directory then it instructs the underlying
filesystem to do whatever internal barrier it supports, and if not
just fail with -ENOTSUPP (so you can fall back to fdatasync(), etc).
Perhaps "flags" would allow a "data" or "metadata" barrier, but if not
it's not a big issue.
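
To make that concrete, here is a purely hypothetical sketch of how an
application might use it; fsbarrier() does not exist, so the stub
below just fails and the caller falls back to a real flush:

#include <errno.h>
#include <unistd.h>

/* Stand-in for the proposed syscall; always "unsupported" today. */
static int fsbarrier(int fd, int flags)
{
        (void)fd;
        (void)flags;
        errno = ENOSYS;
        return -1;
}

/* Order everything issued so far before what comes next. */
static int order_barrier(int fd)
{
        if (fsbarrier(fd, 0) == 0)
                return 0;               /* cheap barrier, no flush */
        if (errno == ENOSYS || errno == ENOTSUP)
                return fdatasync(fd);   /* fall back to a full flush */
        return -1;
}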

I've ended up having to write a fair amount of high-performance
filesystem library code which almost never ends up using fsync(),
quite simply because its performance is so bad. This is one of the
big reasons why so many critical database programs use O_DIRECT and
reinvent the wheel^H^H^H^H^H pagecache. The only way you can
actually use it in high-bandwidth transaction applications is by
doing your own IO-thread and buffering system.
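
Just as an illustration of what "reinventing the pagecache" means in
practice: with O_DIRECT the application has to supply its own aligned
buffers and do its own caching. The 4096-byte alignment below is an
assumed logical block size, not something lifted from real database
code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one assumed-4096-byte block, bypassing the pagecache. */
static int direct_write_block(const char *path, off_t off)
{
        void *buf;
        int fd, ret = -1;

        if (posix_memalign(&buf, 4096, 4096))
                return -1;
        memset(buf, 0, 4096);

        fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0)
                goto out;
        if (pwrite(fd, buf, 4096, off) == 4096)
                ret = 0;
        close(fd);
out:
        free(buf);
        return ret;
}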

You have to have your own buffer ordering dependencies and call
fdatasync() or fsync() from individual threads in-between specific
ordered IOs. The threading helps you keep other IO in flight while
waiting for the flush to finish. For big databases on spinning media
(SSDs don't help here, precisely because they are small and your
databases are big), the overhead of a full flush may still be too
large. Even
with SSDs, with multiple processes vying for IO bandwidth you still
want some kind of application-level barrier to avoid introducing
bubbles in your IO pipeline.
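
Stripped down, the ordering pattern looks something like this (names
and layout are illustrative): the flush between the two writes is the
only ordering guarantee you get, and the other IO threads are what
keep the queues busy while this one waits on it:

#include <unistd.h>

/* Data must be stable before the commit record that covers it. */
static int write_then_commit(int fd,
                             const void *data, size_t dlen, off_t doff,
                             const void *rec, size_t rlen, off_t roff)
{
        if (pwrite(fd, data, dlen, doff) != (ssize_t)dlen)
                return -1;
        if (fdatasync(fd) < 0)          /* the "barrier", today */
                return -1;
        if (pwrite(fd, rec, rlen, roff) != (ssize_t)rlen)
                return -1;
        return 0;
}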

It all comes down to a trivial calculation: if you can't get
(bandwidth * latency-to-stable-storage) bytes of data queued *behind*
a flush then your disk is going to sit idle waiting for more data
after completing it. If a user-level tool needs to enforce ordering
between IOs, the only tool right now is a full flush; when
database-oriented tools can use a barrier()-ish call instead, they can
issue the op and immediately resume keeping the IO queues full.
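
(To pick purely illustrative numbers: a disk streaming ~100MB/s with
~10ms latency to stable storage needs roughly 1MB of further IO queued
up behind every flush, or it goes idle.)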

Cheers,
Kyle Moffett