Re: performance "regression" in cfq compared to anticipatory, deadline and noop

From: Daniel J Blueman
Date: Tue Dec 09 2008 - 10:14:30 EST

Next message: Brian King: "Re: [PATCH 1/1] sched: CPU remove deadlock fix"
Previous message: Peter Zijlstra: "Re: [PATCH 1/1] sched: CPU remove deadlock fix"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Jens, Fabio,

On Mon, Aug 25, 2008 at 5:06 PM, Fabio Checconi <fchecconi@xxxxxxxxx> wrote:
>> From: Daniel J Blueman <daniel.blueman@xxxxxxxxx>
>> Date: Mon, Aug 25, 2008 04:39:01PM +0100
>>
>> On Mon, Aug 25, 2008 at 9:29 PM, Fabio Checconi <fchecconi@xxxxxxxxx> wrote:
>> > Hi,
>> >
>> >> From: Daniel J Blueman <daniel.blueman@xxxxxxxxx>
>> >> Date: Sun, Aug 24, 2008 09:24:37PM +0100
>> >>
>> >> Hi Fabio, Jens,
>> >>
>> > ...
>> >> This was the last test I didn't get around to. Alas, is did help, but
>> >> didn't give the merging required for full performance:
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> >> bs=128k count=2000
>> >> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> >> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
>> >>
>> >> It is an improvement over the baseline performance of 2.6.27-rc4:
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> >> bs=128k count=2000
>> >> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> >> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
>> >>
>> >> Note that platter speed is around 125MB/s (which I get near at smaller
>> >> read sizes).
>> >>
>> >> I feel 128KB read requests are perhaps important, as this is a
>> >> commonly-used RAID stripe size, and may explain the read-performance
>> >> drop sometimes we see in hardware vs software RAID benchmarks.
>> >>
>> >> How can we generate some ideas or movement on fixing/improving this behaviour?
>> >>
>> >
>> > Thank you for testing. The blktrace output for this run should be
>> > interesting, esp. to compare it with a blktrace obtained from anticipatory
>> > with the same workload - IIRC anticipatory didn't suffer from the problem,
>> > and anticipatory has a slightly different dispatching mechanism that
>> > this patch tried to bring into cfq.
>> >
>> > Even if a proper fix may not belong to the elevator itself, I think
>> > that this couple (this last test + anticipatory) of traces should help
>> > in better understanding what is still going wrong.
>> >
>> > Thank you in advance.
>>
>> See http://quora.org/blktrace-n.tar.bz2
>>
>> Where n is:
>> 0 - 2.6.27-rc4 unpatched
>> 1 - 2.6.27-rc4 with your CFQ patch, CFQ scheduler
>> 2 - 2.6.27-rc4 with your CFQ patch, anticipatory scheduler
>> 3 - 2.6.27-rc4 with your CFQ patch, deadline scheduler
>>
>> I have found it's not always possible to reproduce this issue, eg now,
>> with stock CFQ, I'm seeing consistent 117-123MB/s with hdparm and dd
>> (as above), whereas I was seeing a consistent 95-103MB/s, so the
>> blktraces may not show the slower-performance pattern - even with
>> precisely the same (controlled) environment.
>>
>
> If I read them correctly, all the traces show dispatches with
> requests still growing; the elevator cannot know if a request
> will grow or not once it has been queued, and the heuristics
> we tried so far to postpone dispatches gave no results.
>
> I don't see any elevator-only solution to the problem...

I was running into this performance issue again:

Everything same as before, 2.6.24, CFQ scheduler, Seagate 7200.11
320GB SATA (SD11 firmware) on a quiescent and well-powered system:

# sync; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 2.24231 s, 58.5 MB/s

I found that tuning the AHCI SATA TCQ depth to 2 provides exactly the
performance we expect:

# echo 2 >/sys/block/sda/device/queue_depth
# sync; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 0.98503 s, 133 MB/s

depth 1: 132 MB/s
depth 2: 133 MB/s
depth 3: 69.1 MB/s
depth 4: 59.7 MB/s
depth 8: 54.9 MB/s
depth 16: 57.1 MB/s
depth 31: 58.0 MB/s

Very interesting interaction, and the figures are very stable. Could
this be a product of the maximum time the drive waits to coalesce
requests before acting on them? If so, how can we diagnose this, apart
from you guys getting one of these disks?

Thanks,
Daniel
--
Daniel J Blueman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Brian King: "Re: [PATCH 1/1] sched: CPU remove deadlock fix"
Previous message: Peter Zijlstra: "Re: [PATCH 1/1] sched: CPU remove deadlock fix"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]