Well, it could be implemented just like mmap(), but you have to make sure
you get the semantics right (that is, making sure that if the file changes
after the read, the memory image *doesn't*). You need to do copy on file
write.
> `write' doesn't benefit in quite the same way. Assume that a page to be
> written starts out zero-mapped (see below for zero-mapping ideas), is
> filled with data, and then written. If this happens only once then it
> is worth using the MMU to share the page with the page-cache. If the
> page is filled again though, it has to be copied (as a copy-on-write
> page) and all you have gained is that the I/O potentially got started
> earlier. Of course, all writes (including NFS) will be delayed in
> future anyway, won't they? :-)
You can do things so that they work with normal Unix semantics, but if done
in special ways you get good speedups. For write you can, as you say, just
make a page-aligned write buffer part of the page cache. The problem then
is that further file writes will change the process's buffer, and of course
you need to make it COW.
If you make the write buffer COA, then you will generally get the same effect
as now (that is, the data is copied), except on a page-by-page basis. Also,
if the process is careful not to reuse the buffer (for example, by unmapping
it) then you can get the 0 copy case with just write(). On the other hand,
you can do the same with mmap(), so there's no real benefit in adding all
this mechanism since it still needs special coding to use efficiently.
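To make the "careful not to reuse the buffer" pattern concrete, here is a rough
sketch of what such a process would look like. The function name
write_once_page() is made up for illustration, and on a kernel without the
remapping trick write() simply copies as usual; the point is only that the
buffer is page-aligned and never touched again after the write.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: emit one page of `fill` bytes to `fd` from a
 * page-aligned buffer that is unmapped right after the write().
 * Returns the number of bytes written, or -1 on failure. */
ssize_t write_once_page(int fd, unsigned char fill)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* Anonymous mappings are page-aligned and start out zero-filled. */
    void *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, fill, (size_t)page);
    ssize_t n = write(fd, buf, (size_t)page);

    /* Drop the buffer instead of refilling it, so no copy would ever
     * be forced by a later store into the same page. */
    munmap(buf, (size_t)page);
    return n;
}
```

A caller would just do write_once_page(fd, 'x') per page; note that this is
exactly the kind of special coding I mean.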
> Using the MMU for `write' might be worthwhile anyway, because there are
> special circumstances when the copy can be avoided. Programs which read
> and write about the same amount of data (i.e., file servers) tend to
> read into the same areas they use for writing. Provided `read' is using
> the MMU as well, there is no need for the process to copy the data it
> wrote earlier unless the new write is shorter than the old read. Then
> it is only as much as a page's worth of data.
Interesting idea. On the other hand, are there really things which write
from some memory then read the same thing back?
> If the program knows it isn't interested in the data it just wrote, it
> could issue an alternative `write_and_zero' system call which remaps the
> page and replaces it with a zero-mapped page.
Yes, but you could do the same by using mmap(): mmap(), fill, munmap().
That's standard.
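For reference, the mmap(), fill, munmap() sequence looks roughly like this.
It is only a minimal sketch: write_via_mmap() is a name I made up, and the
memcpy() stands in for whatever would really generate the data directly into
the mapping.

```c
#define _DEFAULT_SOURCE         /* for ftruncate() on strict compilers */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put `len` bytes from `src` into `path` through a
 * shared mapping instead of write(). Returns 0 on success, -1 on failure. */
int write_via_mmap(const char *path, const void *src, size_t len)
{
    assert(len > 0);            /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* The file has to be as large as the mapping before we store to it. */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, src, len);        /* "fill": stores go to the file's pages */
    return munmap(p, len);      /* give the pages back; no write() call */
}
```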
> Apart from that though, how about having the idle task (or a
> low-priority kernel thread) fill out a pool of pre-zeroed pages.
This is often used on systems with non-unified VM/buffer cache. On Linux
this would be a problem because by zeroing pages, you may be throwing away
more useful data. You'd need to have some heuristic which tries to anticipate
demand for zeroed pages without overdoing it and throwing away useful cached
data.
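One possible shape for such a heuristic, as a toy user-space model rather than
anything resembling real kernel code: keep a smoothed estimate of recent demand
for zeroed pages, cap the pool, and only ever zero pages that are already free
(so nothing cached is thrown away). The struct, the function names and the 1/4
smoothing weight are all my own assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a pre-zeroed page pool (all names are hypothetical). */
struct zero_pool {
    size_t zeroed;      /* pages currently sitting pre-zeroed */
    size_t demand_avg;  /* smoothed demand per tick, in pages */
    size_t cap;         /* hard upper bound on the pool size */
};

/* Fold the demand seen during the last tick into the running estimate,
 * using an exponential moving average with weight 1/4. */
void zero_pool_observe(struct zero_pool *zp, size_t demand)
{
    assert(zp != NULL);
    zp->demand_avg = (3 * zp->demand_avg + demand) / 4;
}

/* How many pages the idle task should zero now. `free_pages` counts
 * pages holding no cached data, so zeroing them costs nothing useful. */
size_t zero_pool_refill_quota(const struct zero_pool *zp, size_t free_pages)
{
    size_t target = zp->demand_avg < zp->cap ? zp->demand_avg : zp->cap;

    if (zp->zeroed >= target)
        return 0;               /* pool already covers expected demand */

    size_t want = target - zp->zeroed;
    return want < free_pages ? want : free_pages;
}
```

The idle task would call zero_pool_observe() once per tick and then zero at
most zero_pool_refill_quota() pages. Whether restricting it to already-free
pages leaves enough to be worth the trouble is a separate question.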
J