Re: msync() behaviour broken for MS_ASYNC, revert patch?

From: Linus Torvalds
Date: Fri Feb 10 2006 - 12:11:08 EST




On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> No, you are thinking about what the kernel does. Subtle difference. A
> smart user wants to:
>
> - start writing this
> - start writing that
> - start writing that-other-thing
> - make sure this that and the other have reached backing store
>
> OK so in effect it is the same thing, but it is better to export the
> interface that reflects how the user interacts with pagecache.
>
> WRITE_SYNC obviously does the "wait for them all" (aka ensure they
> hit backing store) thing too, right? It performs exactly the same
> role that WRITE_WAIT would do in the above example.

NOOOOOO!

Think about it for a second. Think about the usage case you yourself were
quoting.

The "magic" in IO is "overlapping IO". If you don't get overlapping IO,
your interfaces are broken. End of story.

And WRITE_SYNC _cannot_ do overlapping IO.

It's entirely possible that somebody else (or that very same program) has
dirtied the same pages that you started write-out on earlier. And that is
when "wait for writes to finish" and "WRITE_SYNC" _differ_.

If you want synchronous writes, use synchronous writes. But if you want
asynchronous writes, you do _not_ implement them as "start writes now" and
"write synchronously". You implement them as "start writes now" and "wait
for the writes to have finished".

There's another very specific and important difference: "wait for the
writes" is fundamentally an interruptible and pollable operation, which
means that it's a lot easier to integrate into any system that has to do
other things too. In contrast, WRITE_SYNC is _neither_ easily
interruptible nor pollable.

So WRITE_SYNC has clearly different behaviour. There's a good reason the
kernel internally has "start write" + "wait for write", and I'll repeat:
none of those reasons go away just because you move to user space.

> My proposal isn't really different to Andrew's in terms of functionality
> (unless I've missed something), but it is more consistent because it
> does not introduce this completely new concept to our userspace API but
> rather uses the SYNC/ASYNC distinction like everything else.

Your proposal has two _huge_ downsides:

- it changes semantics, and you have absolutely _no_ idea of who depends
on the performance semantics of the old behaviour. In contrast, I can
tell you that we did it once before, and we reverted it.

- it's not at all consistent. The _current_ behaviour is consistent, and
matches 100% the current behaviour of sync vs async write().

I really don't see the point.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/