Jeremy> Well, it could be implemented just like mmap(), but you have
Jeremy> to make sure you get the semantics right (that is, making
Jeremy> sure that if the file changes after the read, the memory
Jeremy> image *doesn't*). You need to do copy-on-file-write.
It is more complicated than mmap(), because when a page is copied by
COW, the resulting two pages are still potentially shared between
different VM addresses in different tasks, or even within the same task.
Consider what happens when a task writes a file, then reads it into a
different area, does a clone(CLONE_VM), and then one task writes to
the data it wrote. The page must be copied by COW because the data it
read has to stay the same, but both resulting pages are still shared
between the two tasks.
Also, one of the pages is still part of the page cache.
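To make the scenario concrete, here is a rough sketch of the call
sequence; the file name, buffer sizes and helper names are purely
illustrative, and the COW behaviour in the comments is the hypothetical
sharing scheme under discussion, not what current kernels do:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define PAGE 4096

static char wbuf[PAGE];         /* area the file is written from        */
static char rbuf[PAGE];         /* different area the file is read into */

static int child(void *arg)
{
        (void)arg;
        /* One task dirties the data it originally wrote.  Under the
         * sharing scheme, wbuf's page must be copied by COW so that
         * rbuf, which maps the same page-cache page, keeps seeing the
         * old contents.  Both resulting pages stay visible to both
         * tasks because of CLONE_VM, and one of them is still in the
         * page cache. */
        memset(wbuf, 'x', sizeof wbuf);
        return 0;
}

int main(void)
{
        int fd = open("scratch", O_RDWR | O_CREAT | O_TRUNC, 0600);
        char *stack = malloc(16 * PAGE);

        memset(wbuf, 'a', sizeof wbuf);
        write(fd, wbuf, sizeof wbuf);   /* page cache could share wbuf's page */
        lseek(fd, 0, SEEK_SET);
        read(fd, rbuf, sizeof rbuf);    /* rbuf could share the same page */

        /* The child stack grows downwards on most architectures. */
        clone(child, stack + 16 * PAGE, CLONE_VM | SIGCHLD, NULL);
        wait(NULL);
        close(fd);
        return 0;
}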
Me> `write' doesn't benefit in quite the same way. Assume that a
Me> page to be written starts out zero-mapped (see below for
Me> zero-mapping ideas), is filled with data, and then written. If
Me> this happens only once then it is worth using the MMU to share
Me> the page with the page-cache. If the page is filled again
Me> though, it has to be copied (as a copy-on-write page) and all you
Me> have gained is that the I/O potentially got started earlier. Of
Me> course, all writes (including NFS) will be delayed in future
Me> anyway, won't they? :-)
Jeremy> You can do things so that they work with normal Unix
Jeremy> semantics, but if done in special ways you get good
Jeremy> speedups. For write you can, as you say, just make a
Jeremy> page-aligned write buffer part of the page cache. The
Jeremy> problem then is that further file writes will change the
Jeremy> process's buffer, and of course you need to make it COW.
Jeremy> If you make the write buffer COA, then you will generally
Jeremy> get the same effect as now (that is, the data is copied),
Jeremy> except on a page-by-page basis. Also, if the process is
Jeremy> careful not to reuse the buffer (for example, by unmapping
Jeremy> it), then you can get the zero-copy case with just write(). On
Jeremy> the other hand, you can do the same with mmap(), so there's
Jeremy> no real benefit in adding all this mechanism since it still
Jeremy> needs special coding to use efficiently.
Me> Using the MMU for `write' might be worthwhile anyway, because
Me> there are special circumstances when the copy can be avoided.
Me> Programs which read and write about the same amount of data
Me> (e.g., file servers) tend to read into the same areas they use
Me> for writing. Provided `read' is using the MMU as well, there is
Me> no need for the process to copy the data it wrote earlier unless
Me> the new write is shorter than the old read. Even then, at most a
Me> page's worth of data needs to be copied.
Jeremy> Interesting idea. On the other hand, are there really
Jeremy> things which write from some memory then read the same thing
Jeremy> back?
No, they read different things into the same I/O areas. This applies to
any program that is copying data, optionally reading it as it goes.
This kind of thing is also done by the network subsystem, and although
it could be done as a special case for networking, I'd prefer to see a
more general page sharing mechanism. Perhaps we can assume that all
such programs are optimised to use mmap() already, or perhaps we can't.
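The pattern I have in mind is nothing more exotic than the usual copy
loop, sketched below with illustrative names and buffer size.  Every
iteration reads new data into the same area that was just handed to
write(), which is exactly where MMU-based sharing has to cope:

#include <unistd.h>

#define BUFSZ 65536

/* Copy everything from `in' to `out', reusing one I/O area throughout. */
int copy_fd(int in, int out)
{
        static char buf[BUFSZ];
        ssize_t n;

        while ((n = read(in, buf, sizeof buf)) > 0)
                if (write(out, buf, n) != n)
                        return -1;
        return n < 0 ? -1 : 0;
}

int main(void)
{
        return copy_fd(STDIN_FILENO, STDOUT_FILENO) ? 1 : 0;
}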
stdio can be handled as a special case: it could zero-map buffers when
they are flushed. This avoids the copy, but the page has to be
zero-filled instead. At least zero-filling is cheaper than copying (how
much cheaper depends on the effectiveness of a zero-page pool).
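A rough sketch of that stdio special case, done from user space rather
than inside the library, might look like this.  The buffer lives in its
own anonymous mapping, and after the data has been handed to write()
the pages are replaced in place with fresh zero-fill pages instead of
being reused (names and sizes are illustrative, and the benefit assumes
write() shares the old pages with the page cache in the first place):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUFSZ 4096      /* one page; a real stdio buffer may differ */

static char *buf;

static void flush_buffer(int fd, size_t len)
{
        write(fd, buf, len);            /* hand the page to the kernel */

        /* Replace the buffer's pages with zero-fill-on-demand pages,
         * so refilling the buffer later never dirties a page that
         * might still be shared with the page cache. */
        mmap(buf, BUFSZ, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}

int main(void)
{
        buf = mmap(NULL, BUFSZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(buf, 'a', BUFSZ);                /* fill the buffer...       */
        flush_buffer(STDOUT_FILENO, BUFSZ);     /* ...flush and zero-map it */
        return 0;
}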
Me> If the program knows it isn't interested in the data it just
Me> wrote, it could issue an alternative `write_and_zero' system call
Me> which remaps the page and replaces it with a zero-mapped page.
Jeremy> Yes, but you could do the same by using mmap(): mmap(),
Jeremy> fill, munmap(). That's standard.
Does the stdio system do this? Maybe it should. Then again, maybe
there's a significant overhead associated with having many VM areas, one
for each buffer. I think in this case it is better to zero-map the
pages when you've finished with them, as the zero-mapped pages can be
merged with adjacent data areas into a single VM area.
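For comparison, the mmap()/fill/munmap() pattern Jeremy refers to looks
roughly like this (a sketch only; it assumes a file descriptor opened
read-write and page-aligned offsets and lengths):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one block at page-aligned offset `off' by mapping the file
 * shared, generating the data directly in the mapping, and unmapping. */
static int write_block(int fd, off_t off, size_t len)
{
        if (ftruncate(fd, off + len) < 0)   /* mapping must cover file data */
                return -1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, off);
        if (p == MAP_FAILED)
                return -1;

        memset(p, 'x', len);    /* the "fill" step: this becomes file data */
        return munmap(p, len);
}

int main(void)
{
        int fd = open("scratch.out", O_RDWR | O_CREAT | O_TRUNC, 0600);
        int ret = write_block(fd, 0, 4096);

        close(fd);
        return ret ? 1 : 0;
}

No copy is needed, but every block costs a map and an unmap and hence
its own VM area, which is the per-buffer overhead I was wondering about
above; a zero-mapped buffer could stay where it is and merge with its
neighbours.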
-- Jamie Lokier