If we release the lock for less than 100 ticks
(e.g copy 50 bytes which are in the L2-cache), then we loose CPU time:
without release:
CPU 1:
...
copy from user (100 ticks)
...
CPU 2 : spins the whole time
--> one CPU is waiting
with release:
CPU 1:
unlock_kernel, transfers the cache line. (100 ticks)
<< CPU 2 can start
copy from user (100 ticks)
<< the CPU spins.
CPY 2:
spins,
reads the cache line, restarts.
As you can see, we loose 100 ticks to transfer the cache line,
then both CPU operate in parallel for 100 ticks,
the CPU 1 spins.
--> Sum: -100+100=0.
Later the lock will be transfered back to CPU 1.
--> overall we lose.
We only scale better if we release the lock far longer than
the time required to transfer the lock from 1 CPU to the other
CPU. I'd say we should release the lock only if we assume that
the other CPU has a good chance to release the lock before
we need it again (~1000 ticks, or 300 bytes uncached memmove,
or 500 bytes uncached memset)
> If the code that will run is very fast then we'll have less probabilty to
> scale and we risk to release the kernel lock without really improve the
> other CPU.
That's the second reason why we should release the lock only if we
assume that more than a certain minimum time is required for
the operation.
Tomorrow, I'll download your latest patch, and I'll modify my
RDTSC code so that I can measure the individual functions from
uaccess.h.
My current guess:
- the string routines are to fast--> do not unlock
- clear_page(): 4000-6000 ticks --> unlock
- copy_to/from_user(): release if more than (n) bytes
- clear_user(): ??
- cksum...(): probably, perhaps if more than (x) bytes.
-- Manfred Note: I've read that patches are on the way to Linus for 2.3.3 which make the complete page-cache parallel on SMP [from Ingo Molnar].
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/