Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

From: Vivek Goyal
Date: Wed Jan 20 2010 - 14:18:50 EST

Next message: Justin P. Mattock: "Re: [PATCH] Disable i8042 checks on Intel Apple Macs"
Previous message: Arnd Bergmann: "Re: Generic DMA - BUG_ON"
In reply to: Corrado Zoccolo: "Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1"
Next in thread: Shaohua Li: "Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1"

On Tue, Jan 19, 2010 at 10:58:26PM +0100, Corrado Zoccolo wrote:
> On Tue, Jan 19, 2010 at 10:40 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> > On Tue, Jan 19, 2010 at 09:10:33PM +0100, Corrado Zoccolo wrote:
> >> On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin
> >> <yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
> >> > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote:
> >> >> Hi Yanmin
> >> >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@xxxxxxxxx> wrote:
> >> >> > Hi Yanmin,
> >> >> >> When low_latency=1, we get the biggest number with kernel 2.6.32.
> >> >> >> Comparing with low_latency=0's result, the prior one is about 4% better.
> >> >> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with
> >> >> > fastest 2.6.32, so we can consider the first part of the problem
> >> >> > solved.
> >> >> >
> >> >> I think we can return now to your full script with queue merging.
> >> >> I'm wondering if (in arm_slice_timer):
> >> >> - if (cfqq->dispatched)
> >> >> + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd)))
> >> >> return;
> >> >> gives the same improvement you were experiencing just reverting to rq_in_driver.
> >> > I did a quick test against 2.6.33-rc1. With the new method, fio mmap randread 64k
> >> > has about 20% improvement. With just checking rq_in_driver(cfqd), it has
> >> > about 33% improvement.
> >> >
> >> Jeff, do you have an idea why in arm_slice_timer, checking
> >> rq_in_driver instead of cfqq->dispatched gives so much improvement in
> >> presence of queue merging, while it doesn't have noticeable effect
> >> when there are no merges?
> >
> > Performance improvement because of replacing cfqq->dispatched with
> > rq_in_driver() is really strange. This will mean we will do even less
> > idling on the cfqq. That means faster cfqq switching and that should mean
> > more seeks (for this test case) and reduced throughput. This is just the
> > opposite of your approach of treating a random read mmap queue as sync,
> > where we will idle on the queue.
> The tests (previous mails in this thread) show that, if no queue
> merging is happening, handling the queue as sync_idle, and setting
> low_latency = 0 to have bigger slices completely recovers the
> regression.
> If, though, we have queue merges, current arm_slice_timer shows
> regression w.r.t. the rq_in_driver version (2.6.32).
> I think a possible explanation is that we are idling instead of
> switching to another queue that would be merged with this one. In
> fact, my half-baked try to have the rq_in_driver check conditional on
> queue merging fixed part of the regression (not all, because queue
> merges are not symmetrical, and I could be seeing the queue that is
> 'new_cfqq' for another).
>

Just a data point. I ran 8 fio mmap jobs, bs=64K, direct=1, size=2G,
runtime=30 with a vanilla kernel (2.6.33-rc4) and with a modified kernel which
replaced cfqq->dispatched with rq_in_driver(cfqd).

I did not see any significant throughput improvement, but I did see max_clat
halved in the modified kernel.

Vanilla kernel
==============
read bw: 3701KB/s
max clat: 800515 us
Number of times idle timer was armed: 20980
Number of cfqq expired/switched: 6377
cfqq merge operations: 0

Modified kernel (rq_in_driver(cfqd))
====================================
read bw: 3645KB/s
max clat: 401050 us
Number of times idle timer was armed: 2875
Number of cfqq expired/switched: 17750
cfqq merge operations: 0
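
For scale, the counters above can be turned into ratios with a couple of
helpers. This is only a sketch for reading the numbers; the function names
are made up for illustration and are not part of fio or the kernel.

```c
/* Illustrative helpers (hypothetical names) for comparing the two runs. */

/* How many times larger counter `a` is than counter `b`. */
double ratio(double a, double b)
{
    return a / b;
}

/* Percentage change going from `before` to `after`. */
double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the figures listed here, ratio(20980, 2875) is a bit over 7 (idle timer
armed about 7 times less often in the modified kernel), ratio(17750, 6377) is
a bit under 3 (nearly 3 times as many queue switches), and
pct_change(3701, 3645) is about -1.5%, i.e. throughput is essentially flat.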

This kind of confirms that rq_in_driver(cfqd) will reduce the number of
times we idle on queues and will make queue switching faster. That also
explains the reduced max clat.
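
To spell out why the device-wide check idles less often, here is a toy model
of the two conditions. It is not the kernel code: the toy_* names are
hypothetical, and it assumes that the number of requests in the driver is
simply the sum of every queue's dispatched count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the CFQ structures; not the kernel definitions. */
struct toy_cfqq {
    int dispatched;             /* requests of this queue still in flight */
};

struct toy_cfqd {
    struct toy_cfqq *queues;    /* all queues on this device */
    size_t nr_queues;
};

/* Assumption: in-driver requests == sum of per-queue dispatched counts. */
int toy_rq_in_driver(const struct toy_cfqd *cfqd)
{
    int total = 0;

    for (size_t i = 0; i < cfqd->nr_queues; i++)
        total += cfqd->queues[i].dispatched;
    return total;
}

/* Vanilla-style check: only this queue's own in-flight requests block idling. */
bool toy_may_idle_vanilla(const struct toy_cfqq *cfqq)
{
    return cfqq->dispatched == 0;
}

/* Modified check: any in-flight request on the device blocks idling. */
bool toy_may_idle_modified(const struct toy_cfqd *cfqd)
{
    return toy_rq_in_driver(cfqd) == 0;
}
```

In this model, a queue with nothing in flight can still arm the idle timer
under the per-queue check while another queue has requests outstanding, but
not under the device-wide check. So the modified condition can only allow
idling in a subset of the cases the vanilla one does.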

If that's the case, then it should also have increased the number of seeks
(at least on Yanmin's JBOD setup) and reduced throughput. But instead the
reverse seems to be happening in his setup.

Yanmin, as Jeff mentioned, if you can capture blktraces of the vanilla and
modified kernels and upload them somewhere for us to look at, it might help.

Thanks
Vivek