Re: [patch] speed up / fix the new generic semaphore code (fixAIM7 40% regression with 2.6.26-rc1)
From: Ingo Molnar
Date: Thu May 08 2008 - 17:46:30 EST
* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > I suspect some more performance could be won in this particular
> > workload by getting rid of the BKL dependency altogether.
>
> Did somebody have any trace on which BKL taker it is? It apparently
> wasn't file locking. Was it the tty layer?
yeah, i captured a trace of all the down()s that happen in the workload,
using ftrace/sched_switch + stacktrace + a tracepoint in down(). Here's
the semaphore activities in the trace:
# grep lock_kernel /debug/tracing/trace | cut -d: -f2- | sort | \
uniq -c | sort -n | cut -d= -f1-4
9 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
12 down <= lock_kernel <= de_put <= proc_delete_inode <
14 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
19 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
58 down <= lock_kernel <= tty_release <= __fput <
62 down <= lock_kernel <= chrdev_open <= __dentry_open <
70 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
2512 down <= lock_kernel <= opost <= write_chan <
2574 down <= lock_kernel <= write_chan <= tty_write <
note that this is running the fixed semaphore code, so contended
semaphores are really rare in the trace. The histogram above includes
all calls to down().
here's the full trace file (from a single CPU, on a dual-core box
running aim7):
http://redhat.com/~mingo/misc/aim7-ftrace.txt
some other interesting stats. Top 5 wakeups sources:
# grep wake_up aim7-ftrace.txt | cut -d: -f2- | sort | uniq -c |
sort -n | cut -d= -f1-6 | tail -5
[...]
340 default_wake_function <= __wake_up_common <= __wake_up_sync <= unix_write_space <= sock_wfree <= skb_release_all <
411 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_read <= do_sync_read <
924 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_write <= do_sync_write <
1301 default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_buf <= pty_write <= write_chan <
2065 wake_up_state <= prepare_signal <= send_signal <= __group_send_sig_info <= group_send_sig_info <= __kill_pgrp_info <
top 10 scheduling points:
# grep -w schedule aim7-ftrace.txt | cut -d: -f2- | sort |
uniq -c | sort -n | cut -d= -f1-4 | tail -10
[...]
465 schedule <= __cond_resched <= _cond_resched <= shrink_active_list <
582 schedule <= cpu_idle <= start_secondary <= <
929 schedule <= pipe_wait <= pipe_read <= do_sync_read <
990 schedule <= __cond_resched <= _cond_resched <= shrink_page_list <
1034 schedule <= io_schedule <= get_request_wait <= __make_request <
1140 schedule <= worker_thread <= kthread <= child_rip <
1512 schedule <= retint_careful <= <= 0 <
1571 schedule <= __cond_resched <= _cond_resched <= shrink_active_list <
2034 schedule <= schedule_timeout <= do_select <= core_sys_select <
2355 schedule <= sysret_careful <= <= 0 <
as visible from the trace, this is a CONFIG_PREEMPT_NONE kernel, so most
of the preemptions get triggered by wakeups and get executed from the
return-from-syscall need_resched check. But there's a fair amount of
select() related sleeping as well, and a fair amount of
IRQ-hits-userspace driven preemptions as well.
a rather hectic workload, with a surprisingly large amount of TTY
related activity.
and here's a few seconds worth of NMI driven readprofile output:
1131 page_fault 70.6875
1181 find_lock_page 8.2014
1302 __isolate_lru_page 12.8911
1344 copy_page_c 84.0000
1588 page_lock_anon_vma 34.5217
1616 ext3fs_dirhash 3.4753
1683 ext3_htree_store_dirent 6.8975
1976 str2hashbuf 12.7484
1992 copy_user_generic_string 31.1250
2362 do_unlinkat 6.5069
2969 try_to_unmap 2.0791
3031 will_become_orphaned_pgrp 24.4435
4009 __copy_user_nocache 12.3735
4627 congestion_wait 34.0221
6624 clear_page_c 414.0000
18711 try_to_unmap_one 30.8254
34447 page_referenced 140.0285
38669 do_filp_open 17.7789
166569 page_referenced_one 886.0053
213361 *unknown*
216021 sync_page 3375.3281
391888 page_check_address 1414.7581
962212 total 0.3039
system overhead is consistently 20% during this test.
the page_check_address() overhead is surprising - tons of rmap
contention? about 10% wall-clock overhead in that function alone - and
this is just on a dual-core box!
Below is the instruction level profile of that function. The second
column is the # of profile hits. The spin_lock() overhead is clearly
visible.
Ingo
___----- # of profile hits
v
ffffffff80286244: 3478 <page_check_address>:
ffffffff80286244: 3478 55 push %rbp
ffffffff80286245: 1092 48 89 d0 mov %rdx,%rax
ffffffff80286248: 0 48 c1 e8 27 shr $0x27,%rax
ffffffff8028624c: 0 48 89 e5 mov %rsp,%rbp
ffffffff8028624f: 717 41 57 push %r15
ffffffff80286251: 1 25 ff 01 00 00 and $0x1ff,%eax
ffffffff80286256: 0 49 89 cf mov %rcx,%r15
ffffffff80286259: 1354 41 56 push %r14
ffffffff8028625b: 0 49 89 fe mov %rdi,%r14
ffffffff8028625e: 0 48 89 d7 mov %rdx,%rdi
ffffffff80286261: 1507 41 55 push %r13
ffffffff80286263: 0 41 54 push %r12
ffffffff80286265: 0 53 push %rbx
ffffffff80286266: 1763 48 83 ec 08 sub $0x8,%rsp
ffffffff8028626a: 0 48 8b 56 48 mov 0x48(%rsi),%rdx
ffffffff8028626e: 4 48 8b 14 c2 mov (%rdx,%rax,8),%rdx
ffffffff80286272: 3174 f6 c2 01 test $0x1,%dl
ffffffff80286275: 0 0f 84 cc 00 00 00 je ffffffff80286347 <page_check_address+0x103>
ffffffff8028627b: 0 48 89 f8 mov %rdi,%rax
ffffffff8028627e: 64468 48 be 00 f0 ff ff ff mov $0x3ffffffff000,%rsi
ffffffff80286285: 0 3f 00 00
ffffffff80286288: 1 48 b9 00 00 00 00 00 mov $0xffff810000000000,%rcx
ffffffff8028628f: 0 81 ff ff
ffffffff80286292: 4686 48 c1 e8 1b shr $0x1b,%rax
ffffffff80286296: 0 48 21 f2 and %rsi,%rdx
ffffffff80286299: 0 25 f8 0f 00 00 and $0xff8,%eax
ffffffff8028629e: 7468 48 01 d0 add %rdx,%rax
ffffffff802862a1: 0 48 8b 14 08 mov (%rax,%rcx,1),%rdx
ffffffff802862a5: 11 f6 c2 01 test $0x1,%dl
ffffffff802862a8: 4409 0f 84 99 00 00 00 je ffffffff80286347 <page_check_address+0x103>
ffffffff802862ae: 0 48 89 f8 mov %rdi,%rax
ffffffff802862b1: 0 48 21 f2 and %rsi,%rdx
ffffffff802862b4: 1467 48 c1 e8 12 shr $0x12,%rax
ffffffff802862b8: 0 25 f8 0f 00 00 and $0xff8,%eax
ffffffff802862bd: 0 48 01 d0 add %rdx,%rax
ffffffff802862c0: 944 48 8b 14 08 mov (%rax,%rcx,1),%rdx
ffffffff802862c4: 17 f6 c2 01 test $0x1,%dl
ffffffff802862c7: 1 74 7e je ffffffff80286347 <page_check_address+0x103>
ffffffff802862c9: 927 48 89 d0 mov %rdx,%rax
ffffffff802862cc: 77 48 c1 ef 09 shr $0x9,%rdi
ffffffff802862d0: 3 48 21 f0 and %rsi,%rax
ffffffff802862d3: 1238 81 e7 f8 0f 00 00 and $0xff8,%edi
ffffffff802862d9: 1 48 01 c8 add %rcx,%rax
ffffffff802862dc: 0 4c 8d 24 38 lea (%rax,%rdi,1),%r12
ffffffff802862e0: 792 41 f6 04 24 81 testb $0x81,(%r12)
ffffffff802862e5: 25 74 60 je ffffffff80286347 <page_check_address+0x103>
ffffffff802862e7: 118074 48 c1 ea 0c shr $0xc,%rdx
ffffffff802862eb: 41187 48 b8 00 00 00 00 00 mov $0xffffe20000000000,%rax
ffffffff802862f2: 0 e2 ff ff
ffffffff802862f5: 182 48 6b d2 38 imul $0x38,%rdx,%rdx
ffffffff802862f9: 25998 48 8d 1c 02 lea (%rdx,%rax,1),%rbx
ffffffff802862fd: 0 4c 8d 6b 10 lea 0x10(%rbx),%r13
ffffffff80286301: 0 4c 89 ef mov %r13,%rdi
ffffffff80286304: 80598 e8 4b 17 28 00 callq ffffffff80507a54 <_spin_lock>
ffffffff80286309: 36022 49 8b 0c 24 mov (%r12),%rcx
ffffffff8028630d: 1623 f6 c1 81 test $0x81,%cl
ffffffff80286310: 5 74 32 je ffffffff80286344 <page_check_address+0x100>
ffffffff80286312: 16 48 b8 00 00 00 00 00 mov $0x1e0000000000,%rax
ffffffff80286319: 0 1e 00 00
ffffffff8028631c: 359 48 ba b7 6d db b6 6d mov $0x6db6db6db6db6db7,%rdx
ffffffff80286323: 0 db b6 6d
ffffffff80286326: 12 48 c1 e1 12 shl $0x12,%rcx
ffffffff8028632a: 0 49 8d 04 06 lea (%r14,%rax,1),%rax
ffffffff8028632e: 492 48 c1 e9 1e shr $0x1e,%rcx
ffffffff80286332: 23 48 c1 f8 03 sar $0x3,%rax
ffffffff80286336: 0 48 0f af c2 imul %rdx,%rax
ffffffff8028633a: 1390 48 39 c8 cmp %rcx,%rax
ffffffff8028633d: 0 75 05 jne ffffffff80286344 <page_check_address+0x100>
ffffffff8028633f: 0 4d 89 2f mov %r13,(%r15)
ffffffff80286342: 165 eb 06 jmp ffffffff8028634a <page_check_address+0x106>
ffffffff80286344: 0 fe 43 10 incb 0x10(%rbx)
ffffffff80286347: 11886 45 31 e4 xor %r12d,%r12d
ffffffff8028634a: 17451 5a pop %rdx
ffffffff8028634b: 14507 5b pop %rbx
ffffffff8028634c: 42 4c 89 e0 mov %r12,%rax
ffffffff8028634f: 1736 41 5c pop %r12
ffffffff80286351: 40 41 5d pop %r13
ffffffff80286353: 1727 41 5e pop %r14
ffffffff80286355: 1420 41 5f pop %r15
ffffffff80286357: 44 c9 leaveq
ffffffff80286358: 1685 c3 retq
gcc-4.2.3, the config is at:
http://redhat.com/~mingo/misc/config-Thu_May__8_22_23_21_CEST_2008
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/