You need to be able to copy on write from the buffer cache side as well
as from user processes - yes, it's hard, but it can be done.
> If the program knows it isn't interested in the data it just wrote, it
> could issue an alternative `write_and_zero' system call which remaps the
> page and replaces it with a zero-mapped page. Real programs won't do
It can just do an mmap of /dev/zero over it.
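
As a rough user-space sketch of that trick (assuming a POSIX system with
/dev/zero and a page-aligned region; the helper name zero_by_remap is
made up for illustration), mapping /dev/zero MAP_FIXED over the range
throws away the old pages and leaves zero-fill-on-demand pages in their
place:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace [addr, addr + len) with fresh zero pages
 * by mapping /dev/zero over it. addr and len must be page-aligned.
 * Returns 0 on success, -1 on failure.
 */
static int zero_by_remap(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd;
    void *p;

    assert(page > 0);
    assert((uintptr_t)addr % (size_t)page == 0);
    assert(len % (size_t)page == 0);

    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return -1;

    /* MAP_FIXED discards whatever was mapped at addr before. */
    p = mmap(addr, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return p == MAP_FAILED ? -1 : 0;
}
```

No bytes are written by the caller; the kernel supplies zeroed pages
only when the range is next touched.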
> `memset' in <asm-i386/strings-i486.h> might go faster on a Pentium if it
> is unrolled a little and uses paired writes, simply because many of the
> zeroes may well get written to the internal cache during the loop, and
> get written to secondary cache, etc., later while other code is happily
> doing other things in the internal cache).
Using the FPU seems fastest: a combination of the FPU storing the data and
the integer unit touching the next cache line. Intel lacks the ability to
zero a cache line without prefetch.
> Network skbuffs
> ===============
>
> Having implemented all of the above (you, not me :-), the icing on the
> cake is then to have receiving skbuffs allocated in such a way that the
> data part of the packet from a device happens to have just the right
> page alignment when it comes in... You get the idea. With this,
Counterintuitively, that's not the right trick here. We have to touch each
byte to do the checksums (no PC cards do checksums -yet-). Thus if we can
lock user-mode pages and do device->userspace in one transfer, the copy is
absorbed in the checksum at no cost. This is fun post-2.0 sort of stuff.
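
To illustrate why the copy comes for free, here is a minimal portable
sketch of a combined copy-and-checksum loop. It assumes the standard
16-bit ones'-complement Internet checksum (RFC 1071); the function name
copy_and_checksum is made up for illustration and this is not the
kernel's actual routine, which would be hand-tuned per architecture:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes from src to dst and return the
 * Internet checksum of the data. Each byte is loaded once and used for
 * both the store and the sum, so the copy adds no extra pass.
 */
static uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src,
                                  size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Process 16-bit big-endian words. */
    for (; i + 1 < len; i += 2) {
        uint8_t hi = src[i];
        uint8_t lo = src[i + 1];

        dst[i] = hi;
        dst[i + 1] = lo;
        sum += ((uint32_t)hi << 8) | lo;
        /* Fold the carry back in so sum never overflows. */
        sum = (sum & 0xffffu) + (sum >> 16);
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The memory traffic is the same as for a checksum-only pass plus the
stores, which is the sense in which the copy is absorbed.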
> Writing is similar.
Writing is far more complex. Consider TCP locking user pages COW, and unlocking
them as the ACK frames come in.
Alan