Re: msync() behaviour broken for MS_ASYNC, revert patch?

From: Linus Torvalds
Date: Fri Feb 10 2006 - 13:53:43 EST

On Sat, 11 Feb 2006, Nick Piggin wrote:
> What do you mean by overlapping?

I'm just talking about the "same area gets re-dirtied while it's already
busy being written". Depending on the _program_, this:
- never happens in practice
- is very common
- should just leave the page dirty
- should always start a new IO (waiting for the old one first, or use a
barrier if you want to be fancy).

Let's do a more hands-on example, just to make it less abstract.

- let's say that you have some kind of file-backed storage, and you
basically want to let the kernel know about your modifications, so that
it can DTRT (and let's ignore what the "right thing" is for a moment)

- The "dirty" bit very fundamentally is obviously at a page granularity,
but your data may well be at a much finer granularity. In particular,
your data may be a log that keeps growing.

- So let's say that you append to the log, and chose (for some reason,
never mind) to let the kernel know. So you effectively do something

memcpy(logptr, newentry, newentrysize);
logptr = logptr + newentrysize;
if (time_to_msync) {
msync(msyncptr, logptr - msyncptr, MS_ASYNC);
msyncptr = logptr;


Now, the question is, what do we want to happen at the MS_ASYNC.

In particular, what happens if the _previous_ MS_ASYNC had started the IO
(either directly, like in your world, or by bdflush just picking it up,
it really doesn't matter) on the _previous_ old end of the log area, so
the partial page at the old "msyncptr" point may actually be under IO

We have multiple choices:
- we ignore the issue (which is what the current behaviour for MS_ASYNC
is, since it just marks things dirty in the page cache)
- we mark the page dirty, but we don't start IO on it, since it's busy
(and since it's dirty, it will _eventually_ get written out)
- we actually wait for the old IO, in order to start IO on it again.

Now, I don't think that the third option is sane for MS_ASYNC (ie I don't
think even you want -that- behaviour), but in general, all these three
choices are actually sane. Notice how none of them actually involve
waiting for the new _result_. It's only a question about whether to wait
for an old write when we start a new one, or leave the new one entirely

> fadvise(fd, 100, 200, FADV_ASYNC_WRITE);
> fadvise(fd, 300, 400, FADV_ASYNC_WRITE);
> fadvise(fd, 100, 200, FADV_WRITE_WAIT);
> fadvise(fd, 300, 400, FADV_WRITE_WAIT);

I'm saying that a valid pattern is

.. dirty offset 100-200 ..
fadvice(fd, 100, 200, FADV_WRITE_START_TRY);

.. dirty offset 200-300 ..
fadvice(fd, 200, 300, FADV_WRITE_START_TRY);

.. dirty offset 300-400 ..
fadvice(fd, 300, 400, FADV_WRITE_START_TRY);

is a valid thing to do ("try to start IO, but don't guarantee it") as a
way to get things going. But that would never pair up with a "wait for
IO", because there's no guarantee that the IO got started (for example, we
may have started the IO when only bytes 100-200 were dirty, then we
dirtied the other bytes, but we didn't re-start the IO for them because
the previous IO to the same page was still pending, so the bytes never hit
storage and they aren't even outstanding).

But the "FADV_WRITE_START_TRY" is actually the best thing if what you are
trying to do is to keep changes _minimal_ so that when you later acutally
finish the whole thing, you can do

fadvice(fd, 100, 400, FADV_WRITE_WAIT);

which is your "write and wait".

So far so good, and we don't actually care. The unconditional "write and
wait" at the end means that it's irrelevant whether the "START_TRY" thing
actually started the IO or not - the START_TRY thing _can_ be a no-op if
you want to.

These sound like the semantics you want. No?

And yes, I'm perfectly happy with them. I think this is what people would
do. I just wanted to make sure that we're AWARE of the fact that it
implies that the ASYNC thing wouldn't necessarily always even start IO.

And the reason I wanted to make sure of that is that the whole thread
started from you complaining about MS_ASYNC not starting the IO. I'm
saying that if you _require_ starting of IO, then the FADV_WRITE_WAIT
actually sensibly has different semantics, which can be a lot cheaper to
do in the presense of other writers (ie then the write-wait would only
need to wait for any outstanding IO, not start writing out stuff that
somebody else had written).

And the reason I wanted to take up the semantic difference is because
there _are_ semantic differences.

If you only "commit" things when you have nothing dangling, you'll see the
above patterns. But it's a valid thing to commit things after you've made
"further" log changes (that you're _not_ ready to commit). For example,
say that your log is really dirtying all the time, but you synchronize it
at certain points and write the pointer to the synchronized state
somewhere else. What would you do?

Your pattern would actually be

.. dirty offset 100-200 ..
fadvice(fd, 100, 200, FADV_WRITE_START);

.. dirty offset 200-300 ..
fadvice(fd, 200, 300, FADV_WRITE_START);

.. dirty offset 300-400 ..
fadvice(fd, 300, 400, FADV_WRITE_START);

.. dirty offset 400-415 .. (for the next transaction)

fadvice(fd, 100, 400, FADV_JUST_WAIT); (for the previous one)

and here is where the semantics differ. The "always start IO, and just
wait for IO" won't be waiting for the partial stuff (that doesn't matter).
While the "write and wait" would synchronously write stuff that we just
don't care about (just because they happen to be on the same "IO
granularity" block).

This "unconditional write start" + "unconditional wait only" pattern in
theory allows you to optimize all the IO patterns by hand, and have less
synchronous waits, because it wouldn't wait for state that is dirty, but
that doesn't matter.

But as long as people are _aware_ of this issue, I don't much care.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at