Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset

From: Corrado Zoccolo
Date: Fri Nov 27 2009 - 03:16:25 EST

Next message: Ralf Baechle: "Re: [PATCH 06/38] move iodev_remove to .devexit.text"
Previous message: Eric Dumazet: "Re: [patch] spitz: better voltage-to-percentage conversion"
In reply to: Gui Jianfeng: "Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset"
Next in thread: Gui Jianfeng: "Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Guy,
On Fri, Nov 27, 2009 at 2:42 AM, Gui Jianfeng
<guijianfeng@xxxxxxxxxxxxxx> wrote:
> Corrado Zoccolo wrote:
>> Hi Gui, Jens
>> On Thu, Nov 26, 2009 at 7:20 AM, Gui Jianfeng
>> <guijianfeng@xxxxxxxxxxxxxx> wrote:
>>> Hi Jens, Czoccolo
>>>
>>> For the moment, different workload cfq queues are put into different
>>> service trees. But CFQ still uses "busy_queues" to estimate rb_key
>>> offset when inserting a cfq queue into a service tree. I think this
>>> isn't appropriate, and it should make use of service tree count to do
>>> this estimation. This patch is for for-2.6.33 branch.
>>
>> In cfq_choose_wl, we rely on consistency of rb_keys across service
>> trees to compute the next workload to be serviced.
>> Â Â Â Â for (i = 0; i < 3; ++i) {
>> Â Â Â Â Â Â Â Â /* otherwise, select the one with lowest rb_key */
>> Â Â Â Â Â Â Â Â queue = cfq_rb_first(service_tree_for(prio, i, cfqd));
>> Â Â Â Â Â Â Â Â if (queue &&
>> Â Â Â Â Â Â Â Â Â Â (!key_valid || time_before(queue->rb_key, lowest_key))) {
>> Â Â Â Â Â Â Â Â Â Â Â Â lowest_key = queue->rb_key;
>> Â Â Â Â Â Â Â Â Â Â Â cur_best = i;
>> Â Â Â Â Â Â Â Â Â Â Â key_valid = true;
>> Â Â Â Â Â Â Â }
>> Â Â Â Â }
>>
>> If you change how the rb_key is computed (so it is no longer
>> consistent across service trees) without changing how it is used can
>> introduce problems.
>
> ÂOk, I think I was missing this part. This part still behaves like old CFQ regardless
> Âof workload type. I'm wondering why you prefer starting from sync no-idle only when
> Âpriorities switched, after that, you do it like old CFQ behavior?

When switching priorities (e.g. from RT to BE), we may come from a
long stall. In this case, I think it is better to run no-idle first.
During normal operation, instead, we want a fair, starvation free way
to switch between workloads, and I thought it was simpler to mimic old
CFQ behaviour, instead of cook up a different method.
The difference between new and old CFQ is that now, when we decide to
service one no-idle request, we will then service subsequent ones from
the same workload type.
This allows processing them optimally on NCQ hardware.
Moreover, when no more no-idle requests are available, but the
timeslice for this workload did not expire yet, we will wait for more.
This guarantees fairness for no-idle workload.

> In order to improve
> Âlatency for sync no-idle workload, is it possible to take workload type into account,
> Ânot only rely on rb_keys across service trees?
When loading a program into memory, your process will go through
various phases w.r.t. disk access pattern: some are seeky, some others
are sequential.

If you just improve latency for one workload, penalizing the others,
you won't get an overall improvement of the system.
The new scheme improves overall system behaviour because grouping
no-idle requests together gives a better utilization of the disk, and
fairness allows also processes making seeky requests to progress.
Penalizing the idle service tree, instead, you will give you lower
overall throughput (forbidding progress to the processes that make
sequential requests), while penalizing writeback you will find
yourself waiting for freeing dirty pages more often, and maybe
incurring in OOM conditions.

Regarding the rb_key computation, I have done various experiments, and
found that the actual formula doesn't matter much on rotational
hardware, where the slice length has most importance.
But I think it is essential on NCQ SSDs, to obtain fairness.
Unfortunately, I don't have an NCQ SSD, so I can't test my improvement ideas.

Thanks,
Corrado

> ÂThanks,
> ÂGui
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Ralf Baechle: "Re: [PATCH 06/38] move iodev_remove to .devexit.text"
Previous message: Eric Dumazet: "Re: [patch] spitz: better voltage-to-percentage conversion"
In reply to: Gui Jianfeng: "Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset"
Next in thread: Gui Jianfeng: "Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]