Re: OOPS: 2.6.16-rc6 cpufreq_conservative
From: Linus Torvalds
Date: Sun Mar 19 2006 - 14:04:47 EST
On Sun, 19 Mar 2006, Linus Torvalds wrote:
>
> Actually, looking at it a bit more, I think we can do better even
> _without_ fixing the "calling convention".
Here's the "before and after" example of nr_context_switches() from
kernel/sched.c with this patch on x86-64 (selected because the whole
function is just basically one simple "loop over all possible CPU's and
return the sum of context switches").
I'll do the "after" first, because it's actually readable:
nr_context_switches:
pushq %rbp
xorl %esi, %esi
xorl %ecx, %ecx
movq %rsp, %rbp
.L210:
#APP
btl %ecx,cpu_possible_map(%rip)
sbbl %eax,%eax
#NO_APP
testl %eax, %eax
je .L211
movq _cpu_pda(,%rcx,8), %rdx
movq $per_cpu__runqueues, %rax
addq 8(%rdx), %rax
addq 40(%rax), %rsi
.L211:
incq %rcx
cmpq $8, %rcx
jne .L210
leave
movq %rsi, %rax
ret
ie the loop starts at .L210, end basically iterates %rcx from 0..7, and
you can read the assembly and it actually makes sense.
(Whether it _works_ is another matter: I haven't actually booted the
patched kernel ;)
Compare that to the "before" state:
nr_context_switches:
pushq %rbp
movl $8, %eax
movq cpu_possible_map(%rip), %rdi
#APP
bsfq %rdi,%rdx ; cmovz %rax,%rdx
#NO_APP
cmpl $9, %edx
movq %rdx, %rax
movl $8, %edx
cmovge %edx, %eax
movq %rsp, %rbp
pushq %rbx
movslq %eax,%rcx
xorl %r8d, %r8d
jmp .L261
.L262:
movq _cpu_pda(,%rcx,8), %rdx
movl $8, %esi
movq $per_cpu__runqueues, %rax
movq %rdi, %rbx
movq 8(%rdx), %rdx
addq 40(%rax,%rdx), %r8
movl %ecx, %edx
movl %esi, %eax
leal 1(%rdx), %ecx
subl %edx, %eax
decl %eax
shrq %cl, %rbx
cltq
movq %rbx, %rcx
#APP
bsfq %rcx,%rbx ; cmovz %rax,%rbx
#NO_APP
leal 1(%rdx,%rbx), %edx
cmpl $9, %edx
cmovge %esi, %edx
movslq %edx,%rcx
.L261:
cmpq $7, %rcx
jbe .L262
popq %rbx
leave
movq %r8, %rax
ret
which is not only obviously bigger (40 instructions vs just 18), I also
dare anybody to actually read and understand the assembly.
Now, the simple version will iterate 8 times over the loop, while the more
complex version will iterate just as many times as there are CPU's in the
actual system. So there's a trade-off. The "load the CPU mask first and
then shift it down by one every time" thing (that required different
syntax) would have been able to exit early. This one isn't. So the syntax
change would still help a lot (and would avoid the "btl").
Of course, if people were to set CONFIG_NR_CPUS appropriately for their
system, the shorter version gets increasingly better (since it then won't
do any unnecessary work).
Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/