Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1

From: Mark_H_Johnson
Date: Thu Sep 09 2004 - 15:05:40 EST


>would it be possible to test with DMA disabled? (hdparm -d0 /dev/hda) It
>might take some extra work to shun the extra latency reports from the
>PIO IDE path (which is quite slow) but once that is done you should be
>able to see whether these long 0.5 msec delays remain even if all (most)
>DMA activity has been eliminated.

OK. With new patches in hand, I have a set of latency results w/ IDE
DMA turned off. For reference, tests were run with:

# hdparm /dev/hda

/dev/hda:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 0 (off)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 64 (on)
geometry = 58168/16/63, sectors = 58633344, start = 0

# cat /sys/block/hda/queue/max_sectors_kb
32
# cat /sys/block/hda/queue/read_ahead_kb
32
# cat /proc/sys/net/core/netdev_max_backlog
8
# dmesg -n 1

All tests were run with a -R8 kernel plus patches to sched, timer_tsc,
and ide-iops (to add latency trace outputs or suppress known long paths).

No latency traces exceeded 600 usec. The tests simply ran
head -c 750000000 /dev/zero >tmpfile (disk write)
cp tmpfile tmpfile2 (disk copy)
cat tmpfile tmpfile2 >/dev/null (disk read)
while capturing the latency traces in another process. 72 traces were
captured in 15 minutes of testing.

For reference, the 750 meg file size is about 1.5 x physical memory.
No I/O to the audio card. X was running (to monitor the test) and a
network was connected (but generally idle).

PIO trace
=========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 369 us, entries: 3 (3) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: ide_outsl+0x44/0x50
=> ended at: ide_outsl+0x44/0x50
=======>
00000001 0.000ms (+0.370ms): touch_preempt_timing (ide_outsl)
00000001 0.370ms (+0.000ms): touch_preempt_timing (ide_outsl)
00000001 0.370ms (+0.000ms): update_max_trace (check_preempt_timing)

This appears to be the PIO output path, ide_outsl. I saw a few similar
traces with ide_insl during the copy / read tests as well.
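
For context, ide_outsl is essentially a thin wrapper around the string
port-output instruction. The sketch below is from memory of
drivers/ide/ide-iops.c, not the verbatim 2.6.9 source, and the placement
of the touch_preempt_timing() markers is my guess at how the ide-iops
patch brackets the known-long burst, based on the trace above:

/*
 * Rough sketch, not verbatim source: ide_outsl() as instrumented by
 * the ide-iops patch.  The touch_preempt_timing() placement is an
 * assumption inferred from the trace entries above.
 */
static void ide_outsl(unsigned long port, void *addr, u32 count)
{
        touch_preempt_timing();         /* mark start of the known-long PIO burst */
        outsl(port, addr, count);       /* CPU-driven transfer of 'count' longwords */
        touch_preempt_timing();         /* mark end, so it reports as one block */
}

If the drive is in multiple-sector mode with multcount = 16, a single
burst moves 16 sectors (8 KB, i.e. 2048 longwords) across the port,
which at PIO speeds would roughly account for a ~370 usec gap.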

send_IPI_mask_bitmask
=====================

I have several traces where send_IPI_mask_bitmask (flush_tlb_others)
shows up. For example...

00000002 0.010ms (+0.001ms): kunmap_atomic (zap_pte_range)
00000001 0.011ms (+0.000ms): flush_tlb_mm (unmap_vmas)
00000002 0.012ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000002 0.012ms (+0.000ms): spin_lock (flush_tlb_others)
00000003 0.013ms (+0.001ms): spin_lock (<00000000>)
00000003 0.014ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000003 0.014ms (+0.132ms): send_IPI_mask_bitmask (flush_tlb_others)
00010003 0.147ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 0.147ms (+0.001ms): do_nmi (ide_outsl)
00010003 0.149ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 0.149ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.150ms (+0.000ms): profile_hook (profile_tick)
00010003 0.150ms (+0.000ms): read_lock (profile_hook)
00010004 0.150ms (+0.000ms): read_lock (<00000000>)
00010004 0.150ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 0.151ms (+0.000ms): _read_unlock (profile_tick)
00010003 0.151ms (+0.250ms): profile_hit (nmi_watchdog_tick)
00000003 0.401ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.402ms (+0.000ms): free_pages_and_swap_cache (unmap_vmas)
00000001 0.402ms (+0.000ms): lru_add_drain (free_pages_and_swap_cache)

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 445 us, entries: 22 (22) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.001ms (+0.001ms): prio_tree_first (vma_prio_tree_next)
00000001 0.003ms (+0.001ms): prio_tree_left (prio_tree_first)
00000001 0.005ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.005ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.006ms (+0.001ms): spin_lock (<00000000>)
00000002 0.007ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.007ms (+0.001ms): page_address (page_referenced_one)
00000003 0.009ms (+0.000ms): flush_tlb_page (page_referenced_one)
00000004 0.010ms (+0.000ms): flush_tlb_others (flush_tlb_page)
00000004 0.010ms (+0.000ms): spin_lock (flush_tlb_others)
00000005 0.011ms (+0.001ms): spin_lock (<00000000>)
00000005 0.012ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000005 0.012ms (+0.431ms): send_IPI_mask_bitmask (flush_tlb_others)
00000005 0.444ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000003 0.445ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.445ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.445ms (+0.000ms): _spin_unlock (page_referenced_file)
00000001 0.446ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.446ms (+0.000ms): update_max_trace (check_preempt_timing)

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 428 us, entries: 49 (49) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: get_ltrace.sh/5514, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: flush_tlb_mm+0x1d/0xd0
=> ended at: flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.002ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.003ms (+0.415ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.419ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010002 0.419ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010002 0.419ms (+0.000ms): profile_hook (profile_tick)
00010002 0.419ms (+0.000ms): read_lock (profile_hook)
00010003 0.420ms (+0.000ms): read_lock (<00000000>)
00010003 0.420ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.420ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.420ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010002 0.420ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
...

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 330 us, entries: 9 (9) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: get_ltrace.sh/5514, uid:0 nice:-9 policy:0 rt_prio:0
-----------------
=> started at: flush_tlb_mm+0x1d/0xd0
=> ended at: flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.002ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.004ms (+0.326ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.330ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.331ms (+0.000ms): sub_preempt_count (flush_tlb_mm)
00000001 0.331ms (+0.000ms): update_max_trace (check_preempt_timing)
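
My reading of these traces (hedged; the sketch below is from memory of
arch/i386/kernel/smp.c in this era, not the exact 2.6.9 source) is that
the time charged to send_IPI_mask_bitmask is mostly the busy-wait in
flush_tlb_others for the other CPU to acknowledge the invalidate IPI:

/*
 * Sketch, from memory, of the i386 flush_tlb_others() path these
 * traces keep hitting.
 */
static void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
                             unsigned long va)
{
        spin_lock(&tlbstate_lock);

        flush_mm = mm;
        flush_va = va;
        cpus_or(flush_cpumask, cpumask, flush_cpumask);

        /* the trace charges the whole stall to this call ... */
        send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);

        /*
         * ... but the time is presumably spent here, spinning until
         * every target CPU has run the invalidate handler and cleared
         * its bit -- which it cannot do promptly if it is busy in a
         * long stretch like the ide_outsl burst shown earlier.
         */
        while (!cpus_empty(flush_cpumask))
                mb();

        flush_mm = NULL;
        flush_va = 0;
        spin_unlock(&tlbstate_lock);
}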

try_to_wake_up
==============

Buried inside a pretty long trace in kswapd0, I saw the following...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
latency: 426 us, entries: 286 (286) | [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
-----------------
| task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: spin_lock+0x24/0x90
=> ended at: _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.000ms (+0.000ms): prio_tree_first (vma_prio_tree_next)
00000001 0.001ms (+0.000ms): prio_tree_left (prio_tree_first)
00000001 0.001ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.001ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
...
00000006 0.108ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000006 0.108ms (+0.000ms): spin_lock (task_rq_lock)
00000007 0.108ms (+0.000ms): spin_lock (<00000000>)
00000007 0.109ms (+0.000ms): wake_idle (try_to_wake_up)
00000007 0.109ms (+0.000ms): activate_task (try_to_wake_up)
00000007 0.109ms (+0.000ms): sched_clock (activate_task)
00000007 0.109ms (+0.000ms): recalc_task_prio (activate_task)
00000007 0.109ms (+0.000ms): effective_prio (recalc_task_prio)
00000007 0.110ms (+0.000ms): __activate_task (try_to_wake_up)
00000007 0.110ms (+0.000ms): enqueue_task (__activate_task)
00000007 0.110ms (+0.000ms): sched_info_queued (enqueue_task)
00000007 0.110ms (+0.000ms): resched_task (try_to_wake_up)
00000007 0.111ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000007 0.111ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000006 0.111ms (+0.298ms): preempt_schedule (try_to_wake_up)
00000005 0.409ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000004 0.409ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000003 0.410ms (+0.000ms): preempt_schedule (flush_tlb_page)
00000003 0.410ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.410ms (+0.000ms): preempt_schedule (page_referenced_one)
00000002 0.410ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.410ms (+0.000ms): preempt_schedule (page_referenced_one)

There is an almost 300 usec hit there as we unwind the nested layers.
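
If I read this right, none of those preempt_schedule calls can actually
switch away until preempt_count drops back to zero, so the newly woken
task has to sit out the whole unwind. Roughly (a sketch from memory of
kernel/sched.c, not the verbatim source):

/*
 * Sketch, from memory, of the preempt_schedule() entry check.  While
 * nested spinlocks / kmap_atomic keep preempt_count elevated, each of
 * the preempt_schedule calls in the trace returns immediately; only
 * the final unlock can really reschedule.
 */
asmlinkage void __sched preempt_schedule(void)
{
        struct thread_info *ti = current_thread_info();

        if (unlikely(ti->preempt_count || irqs_disabled()))
                return;                 /* still inside a non-preemptible region */

        ti->preempt_count = PREEMPT_ACTIVE;
        schedule();
        ti->preempt_count = 0;
}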

So the long waits on paths through sched and timer_tsc appear to be
eliminated with PIO to the disk.

Is there some "for sure" way to limit the size and/or duration of
DMA transfers so I get reasonable performance from the disk (and
other DMA devices) and reasonable latency?
--Mark
