Re: application syncing options (was Re: [PATCH] Memory management livelock)

From: david
Date: Tue Oct 07 2008 - 13:16:43 EST


On Tue, 7 Oct 2008, Mikulas Patocka wrote:

> If you invent a new interface that allows submitting several ordered IOs
> from userspace, it will require excessive maintenance overhead over a long
> period of time. So it is only justified if the performance improvement is
> excessive as well.
>
> It should not be like "here you improve performance by 10% on some synthetic
> benchmark in one application that was rewritten to support the new
> interface" and then create a few more security vulnerabilities (because of
> the complexity of the interface) and damage overall Linux progress, because
> everyone is catching bugs in the new interface and checking it for
> correctness.

>> The same benchmarks that show that it's far better for the in-kernel
>> filesystem code to use write barriers should apply to FUSE filesystems.

> FUSE is slow by design, and it is used in cases where performance isn't
> crucial.

FUSE is slow, but I don't believe that being slow is a design goal; it's a
limitation of the implementation. So things that could speed it up would be a
good thing.

>> This isn't a matter of a few % in performance; if an application is
>> sync-limited in a way that can be converted to write-ordered, the potential
>> is for the application to speed up by many times.

>> Programs that maintain indexes or caches of data that lives in other files
>> will be able to do "write data && barrier && write index && fsync" and
>> double their performance vs "write data && fsync && write index && fsync".
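
To make the quoted "write data && barrier && write index && fsync" sequence
concrete, here is a minimal userspace sketch. fbarrier() is purely
hypothetical, a stand-in for whatever ordering-only primitive such an
interface might provide; the function names are invented and error handling
is omitted.

#include <unistd.h>

/* Hypothetical ordering-only primitive: everything submitted before the
 * barrier must reach the disk before anything submitted after it.  No
 * such syscall exists today. */
int fbarrier(void);

void update_with_fsync(int data_fd, int index_fd,
                       const void *data, size_t dlen,
                       const void *idx, size_t ilen)
{
        write(data_fd, data, dlen);
        fsync(data_fd);                 /* sleep until the disk acks */
        write(index_fd, idx, ilen);
        fsync(index_fd);                /* sleep again */
}

void update_with_barrier(int data_fd, int index_fd,
                         const void *data, size_t dlen,
                         const void *idx, size_t ilen)
{
        write(data_fd, data, dlen);
        fbarrier();                     /* constrain the order, don't wait */
        write(index_fd, idx, ilen);
        fsync(index_fd);                /* one wait instead of two */
}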

> They can do: write data with O_SYNC; write another piece of data with
> O_SYNC.
>
> And the only difference from barriers is the waiting time after the first
> O_SYNC before the second I/O is submitted (such a delay wouldn't happen with
> barriers).
>
> And since I/O delay is in milliseconds and process wakeup time is tens of
> microseconds, it doesn't look like eliminating the process wakeup time would
> gain more than a few percent.
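
For reference, the O_SYNC approach described in the quoted paragraph above
looks roughly like this (made-up file name, no error handling); the ordering
comes for free because each write() blocks until the data is on stable
storage, which is exactly the extra waiting in question.

#include <fcntl.h>
#include <unistd.h>

void two_ordered_writes(const void *a, size_t alen,
                        const void *b, size_t blen)
{
        /* O_SYNC: each write() returns only after the data has reached
         * stable storage */
        int fd = open("datafile", O_WRONLY | O_CREAT | O_SYNC, 0644);

        write(fd, a, alen);             /* synchronous wait #1 */
        write(fd, b, blen);             /* synchronous wait #2 */

        close(fd);
}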

Each sync write needs to wait for a disk rotation (and a seek if you are
writing to different files). If you only do two writes you save one disk
rotation; if you do five writes you save four disk rotations.
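
As a rough illustration: on a 7200 RPM drive one rotation takes 60/7200 s,
about 8.3 ms, so a chain of five dependent synchronous writes can pay on the
order of 40 ms in rotational latency alone, versus roughly one rotation's
worth if the ordering could be expressed without waiting. The exact numbers
depend on the drive and on how far the head has to seek.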

>> Databases can potentially do even better: today they need to fsync data to
>> disk before they can update their journal to indicate that the data has
>> been written. With a barrier they could order the writes so that the write
>> to the journal doesn't happen until after the writes of the data; they
>> would never need to call fsync at all (when emptying the journal).
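
A sketch of the commit path described in the quoted paragraph above, reusing
the hypothetical fbarrier() from the earlier sketch; whether the journal
record itself also needs a flush depends on the durability guarantee the
database promises.

#include <unistd.h>

int fbarrier(void);                     /* hypothetical, as above */

void commit_page(int data_fd, int journal_fd,
                 const void *page, size_t plen,
                 const void *rec, size_t rlen, int have_barriers)
{
        write(data_fd, page, plen);
        if (have_barriers)
                fbarrier();             /* just constrain the order */
        else
                fdatasync(data_fd);     /* wait for the disk */
        write(journal_fd, rec, rlen);   /* safe: cannot pass the data */
}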

> Good databases can pack several user transactions into one fsync() write.
> If the database server is properly engineered, it accumulates all user
> transactions committed so far into one chunk, writes that chunk with one
> fsync() call and then reports successful commit to the clients.
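
The batching described in the quoted paragraph above is usually called group
commit. A minimal sketch, with invented names, no error handling, and C99
assumed: client threads append their commit records and wait, while one
flusher thread turns everything accumulated so far into a single fsync() and
then acknowledges the whole batch.

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct wal {
        pthread_mutex_t lock;
        pthread_cond_t  flushed;
        int             log_fd;
        off_t           written;        /* bytes appended so far */
        off_t           durable;        /* bytes known to be on disk */
};

void commit_record(struct wal *w, const void *rec, size_t len)
{
        pthread_mutex_lock(&w->lock);
        write(w->log_fd, rec, len);     /* goes to the page cache, cheap */
        off_t my_end = (w->written += len);
        while (w->durable < my_end)     /* wait for a flush that covers us */
                pthread_cond_wait(&w->flushed, &w->lock);
        pthread_mutex_unlock(&w->lock);
        /* now it is safe to report "committed" to this client */
}

void *flusher(void *arg)                /* one dedicated thread */
{
        struct wal *w = arg;

        for (;;) {                      /* a real flusher would idle when
                                         * there is nothing new to flush */
                pthread_mutex_lock(&w->lock);
                off_t target = w->written;
                pthread_mutex_unlock(&w->lock);

                fsync(w->log_fd);       /* one wait covers the whole batch */

                pthread_mutex_lock(&w->lock);
                w->durable = target;
                pthread_cond_broadcast(&w->flushed);
                pthread_mutex_unlock(&w->lock);
        }
        return NULL;
}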

If there are multiple users doing transactions at the same time they will
benefit from overlapping the fsyncs, but each user session cannot complete its
transaction until the fsync completes.

> So if you increase fsync() latency, it should have no effect on transactional
> throughput --- only on the latency of individual transactions. Similarly, if
> you decrease fsync() latency, it won't increase the number of processed
> transactions.

Only if all of your transactions are happening in parallel. In the real world,
programs sometimes need to wait for one transaction to complete before they
can start the next one.
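
Concretely: a single client issuing strictly dependent transactions, each
ending in an fsync() that takes, say, 20 ms on a busy disk, can never exceed
about 50 transactions per second no matter how well the server batches
concurrent commits. Only reducing the per-commit wait, or replacing it with an
ordering guarantee, helps that client.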

> Certainly, there are primitive embedded database libraries that fsync()
> after each transaction, but they don't have good performance anyway.

>> For systems without solid-state drives or battery-backed caches, the ability
>> to eliminate fsyncs by being able to rely on the order of the writes is a
>> huge benefit.

> I may ask --- where are the applications that actually suffer from slow
> fsync() latency? Databases are not among them; they batch transactions.

> If you want to improve things, you can try:
> * implement O_DSYNC (like O_SYNC, but doesn't update the inode mtime)
> * implement range_fsync and range_fdatasync (sync on a file range --- the
>   kernel already has support for that, you can just add a syscall)
> * turn on the FUA bit for O_DSYNC writes; that eliminates the need to flush
>   the drive cache in the O_DSYNC call
>
> --- these are definitely less invasive than a new I/O submitting interface.
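
For what it's worth, a range flush can already be approximated from userspace
with sync_file_range(), available since kernel 2.6.17. Note that it only
writes back and waits on the dirty pages in the given range; it does not flush
the drive's write cache or the file's metadata, so it is weaker than a true
range_fdatasync(). A minimal wrapper:

#define _GNU_SOURCE
#include <fcntl.h>

/* write back and wait on just the byte range we care about */
int flush_range(int fd, off64_t off, off64_t len)
{
        return sync_file_range(fd, off, len,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
}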

But all of these require that the application stop and wait for each separate
write to complete before proceeding to the next step.

If this doesn't matter, then why the big push to have the in-kernel
filesystems start using barriers? I understood that this resulted in large
performance increases in the places where they are used, simply from being
able to avoid draining the entire request queue; and you are saying that
applications would not only need to wait for the queue to flush, but also for
the disk to acknowledge the write.

Syncs are slow, in some cases _very_ slow.

David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/