Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

From: Corrado Zoccolo
Date: Mon Jan 04 2010 - 13:28:36 EST


Hi Yanmin,
On Mon, Jan 4, 2010 at 9:18 AM, Zhang, Yanmin
<yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
> On Sat, 2010-01-02 at 19:52 +0100, Corrado Zoccolo wrote:
>> Hi
>> On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin
>> <yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
>> > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote:
>> >> Hi Yanmin,
>> >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin
>> >> <yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
>> >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote:
>> >> >> Hi Yanmin,
>> >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
>> >> >> <yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
>> >> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
>> >> >> > 2.6.33-rc1.
>> >> >>
>> >> > Thanks for your timely reply. Some comments inlined below.
>> >> >
>> >> >> Can you compare the performance also with 2.6.31?
>> >> > We did. We run the Linux Kernel Performance Tracking project and run many benchmarks when an RC
>> >> > kernel is released.
>> >> >
>> >> > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about
>> >> > 8% better than the one of 2.6.31.
>> >> >
>> >> >> I think I understand what causes your problem.
>> >> >> 2.6.32, with default settings, handled even random readers as
>> >> >> sequential ones to provide fairness. This has benefits on single disks
>> >> >> and JBODs, but causes harm on raids.
>> >> > I didn't test RAID, as the machine with the hardware RAID HBA has crashed. But when we turn on
>> >> > hardware RAID in the HBA, we mostly use the noop io scheduler.
>> >> I think you should start testing cfq with them, too. From 2.6.33, we
>> >> have some big improvements in this area.
>> > Great! I once compared cfq and noop against non-raid and raid0. One interesting finding
>> > from sequential read testing is that with fewer processes reading files on the raid0
>> > JBOD, noop on raid0 is pretty good, but with lots of processes doing so on a non-raid
>> > JBOD, cfq is clearly better. I planned to investigate it, but was too busy with other issues.
>> >
>> >> >
>> >> >> For 2.6.33, we changed the way in which this is handled, restoring the
>> >> >> enable_idle = 0 for seeky queues as it was in 2.6.31:
>> >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
>> >> >> struct cfq_queue *cfqq,
>> >> >>        enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>> >> >>
>> >> >>        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>> >> >> -           (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
>> >> >> +           (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
>> >> >>                enable_idle = 0;
>> >> >> (compare with 2.6.31:
>> >> >>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>> >> >>             (cfqd->hw_tag && CIC_SEEKY(cic)))
>> >> >>                 enable_idle = 0;
>> >> >> excluding the sample_valid check, it should be equivalent for you (I
>> >> >> assume you have NCQ disks))
>> >> >> and we provide fairness for them by servicing all seeky queues
>> >> >> together, and then idling before switching to other ones.
>> >> > As for function cfq_update_idle_window, you are right. But since
>> >> > 2.6.32, many patches have been merged into CFQ, and they impact each other.
>> >> >
>> >> >>
>> >> >> The mmap 64k randreader will have a large seek_mean, resulting in
>> >> >> being marked seeky, but will send 16 * 4k sequential requests one
>> >> >> after the other, so alternating between those seeky queues will cause
>> >> >> harm.
>> >> >>
>> >> >> I'm working on a new way to compute the seekiness of queues that should
>> >> >> fix your issue, correctly identifying those queues as non-seeky (for
>> >> >> me, a queue should be considered seeky only if it submits more than 1
>> >> >> seeky request for every 8 sequential ones).
>> >> >>
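To make that rule concrete, here is a toy sketch (not the actual kernel code) of the proposed classification: a queue counts as seeky only if it issues more than 1 seeky request per 8 sequential ones, so the 64k mmap reader (1 seek followed by 15 sequential 4k faults per block) comes out non-seeky.

```shell
# Toy sketch of the proposed seekiness rule; not the actual kernel code.
# "More than 1 seeky request per 8 sequential ones" is equivalent to
# 8 * seeky > sequential.
is_seeky() {
    seeky=$1
    sequential=$2
    if [ $((8 * seeky)) -gt "$sequential" ]; then
        echo seeky
    else
        echo non-seeky
    fi
}

# A 64k mmap random reader: each block costs 1 seek plus 15 sequential
# 4k page faults, so it should be classified as non-seeky.
is_seeky 1 15   # prints "non-seeky"
# More than 1 seek per 8 sequential requests trips the threshold:
is_seeky 2 8    # prints "seeky"
```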
>> >> >> >
>> >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
>> >> >> > 8 1-GB files per partition and start 8 processes to do random reads on the 8 files
>> >> >> > per partition. There are 8*24 processes in total. The randread block size is 64K.
>> >> >> >
>> >> >> > We found the regression on 2 machines. One machine has 8GB memory and the other has
>> >> >> > 6GB.
>> >> >> >
>> >> >> > Bisection is very unstable. Several patches are involved, not just one.
>> >> >> >
>> >> >> >
>> >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
>> >> >> > Author: Corrado Zoccolo <czoccolo@xxxxxxxxx>
>> >> >> > Date:   Thu Nov 26 10:02:58 2009 +0100
>> >> >> >
>> >> >> > Â Âcfq-iosched: fix corner cases in idling logic
>> >> >> >
>> >> >> >
>> >> >> > This patch introduces a bit less than 20% regression. I just reverted the section below,
>> >> >> > and this part of the regression disappeared. It shows this regression is stable and not
>> >> >> > impacted by other patches.
>> >> >> >
>> >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>> >> >> >                return;
>> >> >> >
>> >> >> >        /*
>> >> >> > -       * still requests with the driver, don't idle
>> >> >> > +       * still active requests from this queue, don't idle
>> >> >> >         */
>> >> >> > -       if (rq_in_driver(cfqd))
>> >> >> > +       if (cfqq->dispatched)
>> >> >> >                return;
>> >> > Although 5 patches are related to the regression, the above line is quite
>> >> > independent. Reverting the above line always improves the result by about
>> >> > 20%.
>> >> I've looked at your fio script, and it is quite complex,
>> > As we have about 40 fio sub-cases, we have a script to create a fio job file from
>> > a specific parameter list. So there are some superfluous parameters.
>> >
>> My point is that there is so much going on that it is more
>> difficult to analyse the issues.
>> I prefer looking at one problem at a time, so (initially) removing the
>> possibility of queue merging, which Shaohua already investigated, can
>> help in spotting the still not-well-understood problem.
> Sounds reasonable.
>
>> Could you generate the same script, but with each process accessing
>> only one of the files, instead of choosing it at random?
> Ok. The new test starts 8 processes per partition, and every process works
> on just one file.
Great, thanks.
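For reference, a minimal sketch of how such a one-file-per-process job file could be generated (file names and parameters here are illustrative placeholders, not the actual test harness):

```shell
#!/bin/sh
# Sketch: emit a fio job file where each of 8 processes reads exactly
# one file. All names, sizes, and paths are illustrative placeholders.
OUT=${OUT:-one-file-per-process.fio}
{
    printf '[global]\nioengine=mmap\nrw=randread\nbs=64k\nsize=1g\n\n'
    for i in 1 2 3 4 5 6 7 8; do
        # One [jobN] section per process, each pinned to its own file.
        printf '[job%s]\nfilename=testfile%s\n\n' "$i" "$i"
    done
} > "$OUT"
echo "wrote $OUT"
```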
>
>>
>> > Another point is we need stable result.
>> >
>> >> with lots of
>> >> things going on.
>> >> Let's keep this for last.
>> > Ok. But a change like yours mostly reduces the regression.
>> >
>> >> I've created a smaller test that already shows some regression:
>> >> [global]
>> >> direct=0
>> >> ioengine=mmap
>> >> size=8G
>> >> bs=64k
>> >> numjobs=1
>> >> loops=5
>> >> runtime=60
>> >> #group_reporting
>> >> invalidate=0
>> >> directory=/media/hd/cfq-tests
>> >>
>> >> [job0]
>> >> startdelay=0
>> >> rw=randread
>> >> filename=testfile1
>> >>
>> >> [job1]
>> >> startdelay=0
>> >> rw=randread
>> >> filename=testfile2
>> >>
>> >> [job2]
>> >> startdelay=0
>> >> rw=randread
>> >> filename=testfile3
>> >>
>> >> [job3]
>> >> startdelay=0
>> >> rw=randread
>> >> filename=testfile4
>> >>
>> The attached patches, in particular 0005 (which applies on top of the
>> for-linus branch of Jens' tree,
>> git://git.kernel.dk/linux-2.6-block.git), fix the regression on this
>> simplified workload.
>> > I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The
>> > regression isn't resolved.
>> Can you quantify if there is an improvement, though?
>
> Ok. Because of company policy, I can only post percentages instead of real numbers.
Sure, it is fine.
>
>> Please also include Shaohua's patches.
>> I'd like to see the comparison between (always with low_latency set to 0):
>> plain 2.6.33
>> plain 2.6.33 + Shaohua's
>> plain 2.6.33 + Shaohua's + my patch
>> plain 2.6.33 + Shaohua's + my patch + rq_in_driver vs dispatched patch.
>
> 1) low_latency=0
> 2.6.32 kernel                                   0
> 2.6.33-rc1                                     -0.33
> 2.6.33-rc1_shaohua                             -0.33
> 2.6.33-rc1+corrado                              0.03
> 2.6.33-rc1_corrado+shaohua                      0.02
> 2.6.33-rc1_corrado+shaohua+rq_in_driver         0.01
>
So my patch fixes the situation for low_latency = 0, as I expected.
I'll send it to Jens with a proper changelog.

> 2) low_latency=1
> 2.6.32 kernel                                   0
> 2.6.33-rc1                                     -0.45
> 2.6.33-rc1+corrado                             -0.24
> 2.6.33-rc1_corrado+shaohua                     -0.23
> 2.6.33-rc1_corrado+shaohua+rq_in_driver        -0.23
The results are as expected. With each process working on a separate
file, Shaohua's patches do not noticeably influence the result.
Interestingly, even rq_in_driver doesn't improve things in this case, so
maybe its effect is somehow connected to queue merging.
The remaining -23% is due to timeslice shrinking, which is done to
reduce max latency when there are too many processes doing I/O, at the
expense of throughput. It is a documented change, and if you favor
throughput over latency, the suggested tuning is to set low_latency = 0.
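For reference, the low_latency knob is exposed via sysfs when CFQ is the active scheduler. A minimal sketch of flipping it (the device name in the comment is a placeholder, and a demo path is used as the default so the snippet can be tried without root):

```shell
#!/bin/sh
# Sketch: disable CFQ's low_latency to favor throughput over latency.
# On a real system the knob lives at (device name "sda" is a placeholder):
#   /sys/block/sda/queue/iosched/low_latency
# SYSFS defaults to a demo file here so the commands run without root.
SYSFS=${SYSFS:-/tmp/low_latency.demo}
echo 0 > "$SYSFS"   # on real hardware this requires root
cat "$SYSFS"        # prints 0
```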

>
>
> When low_latency=1, we get the biggest number with kernel 2.6.32.
> Compared with the low_latency=0 result, the former is about 4% better.
Ok, so 2.6.33 + corrado (with low_latency = 0) is comparable with the
fastest 2.6.32, so we can consider the first part of the problem
solved.

For the queue merging issue, maybe Jeff has some improvements w.r.t.
Shaohua's approach.

Thanks,
Corrado