In article <20000904022227.B21981@pcep-jamie.cern.ch>,
Jamie Lokier <email@example.com> wrote:
>Alan Cox wrote:
>> > read/recv block while the NIC DMAs into user space main memory.
>> Thats actually not always a win either. A DMA to user space flushes
>> those pages out of cache which isnt so ideal if the CPU wants
>> them. Some of the results are suprisingly counter-intuitive like this
>Does it flush the CPU cache? I thought the CPU just snooped the bus and
>updated its cache with new data.
In theory you could do "snoop and update".
In practice I do not know of a single chip that actually does that.
Pretty much _everybody_ does "invalidate on write".
(The "invalidate on write" is the sane way of doing SMP cache coherency,
which is probably why. Trying to have shared dirty cache-lines is just
not a viable option in the end).
>Ugh. User space DMA gets complicated quickly. The performance question
>is, perhaps, can you do this without a TLB flush (but with locking the
>struct page of course). Note that it doesn't matter if another thread,
>and this includes truncate/write in another thread, clobbers the page
>data. That's just the normal effect of two concurrent writers to the
Simple calculation: to actually even _find_ the physical page, you
usually need to do at least three levels of page table walking. Sure,
some CPU's have "translate" instructions to do it for you in hardware
and use the TLB to help you, but the most common architecture out there
does not do that. So with the 4GB+ option, in order to just _find_ the
physical page (so that you can do DMA to it), you need to do that
complex page table walk.
Let's say that that page table walk is 50 instructions at best case.
And that pretty assumes that you did the thing in assembly code and were
Also, you can pretty much assume that even if the code is in the cache,
the page tables themselves probably aren't. So assume a minimum of three
cache misses right there (plus the code - 50 instructions).
Furthermore, to actually pin a page down, even if you do _nothing_ else,
you'd at least need a SMP-safe increment (and eventual decrement).
That's another 24 cycles just in those two instructions on x86.
End result: you've done the work equivalent to about 4 cache misses.
Just to look up the physical page, and not actually _doing_ anything
with it. Never mind locking it in memory or anything like that. And
that's assuming you got no icache misses on the actual _code_ to do all
Basically any copy <= 4 cache lines is "free" compared to trying to be
clever. That's 128 bytes on most machines right now. And cache-lines
are growing: 64 and 128 byte cache-lines are not that unlikely these
days (I think Athlon has a 64-byte cache-line, for example, just in the
PC space, and alpha and sparc64 do also).
So basically the cost of a simple memcpy() isn't neceassarily that big.
The above calculations were rather kind towards the "lock the page
down" case, and it's not all that unlikely that the cost of locking down
a page is on the same order of magnitude as just doing a "memcpy()" on
the whole page.
The above gets _much_ worse in real life. If you truly wan tto do
zero-copy from user space and get real UNIX semantics for writes(),
you'd better protect the page somehow, so that if the data hasn't made
it out to the network (or needed to be re-sent) by the time the system
call returns, the user can't change the user-mode buffer before the data
That's when you get into TLB invalidates etc. By which time you're
talking another few cache invalidates, and possibly some nasty cross-CPU
calls for SMP TLB coherency.
People who claim "zero-copy" is a great thing often ignore the costs of
_not_ copying altogether.
(This is the same mistake that people who do complexity analysis often
stumble on. Sure, "constant time" is perfect. Except it's not
necessarily unusual "constant time" is 50 times larger than O(n) in
practice. Same goes for zero-copy - it's "perfect", but can easily be
slower than just plain old "good").
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to firstname.lastname@example.org
Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Thu Sep 07 2000 - 21:00:16 EST