Re: [PATCH] Avoid preferential treatment of groups that aren't backlogged

From: Chad Talbott
Date: Thu Feb 10 2011 - 13:58:02 EST


On Wed, Feb 9, 2011 at 7:57 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> On Wed, Feb 09, 2011 at 06:45:25PM -0800, Chad Talbott wrote:
>> On Wed, Feb 9, 2011 at 6:09 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
>> > In upstream code once a group gets backlogged we put it at the end
>> > and not at the beginning of the tree. (I am wondering are you looking
>> > at the google internal code :-))
>> >
>> > So I don't think that issue of a low weight group getting more disk
>> > time than its fair share is present in upstream kernels.
>>
>> You've caught me re-using a commit description.  :)
>>
>> Here's an example of the kind of tests that fail without this patch
>> (run via the test that Justin and Akshay have posted):
>>
>> 15:35:35 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
>> 15:35:55 INFO Experiment completed in 20.4 seconds
>> 15:35:55 INFO experiment 14 achieved DTFs: 886, 113
>> 15:35:55 INFO experiment 14 FAILED: max observed error is 64, allowed is 50
>>
>> 15:35:55 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
>> 15:36:16 INFO Experiment completed in 20.5 seconds
>> 15:36:16 INFO experiment 15 achieved DTFs: 891, 108
>> 15:36:16 INFO experiment 15 FAILED: max observed error is 59, allowed is 50
>>
>> Since this is Jens' unmodified tree, I've had to change
>> BLKIO_WEIGHT_MIN to 10 to allow this test to proceed.  We typically
>> run many jobs with small weights, and achieve the requested isolation:
>> see below results with this patch:
>>
>> 14:59:17 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
>> 14:59:36 INFO Experiment completed in 19.0 seconds
>> 14:59:36 INFO experiment 14 achieved DTFs: 947, 52
>> 14:59:36 INFO experiment 14 PASSED: max observed error is 3, allowed is 50
>>
>> 14:59:36 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
>> 14:59:55 INFO Experiment completed in 18.5 seconds
>> 14:59:55 INFO experiment 15 achieved DTFs: 944, 55
>> 14:59:55 INFO experiment 15 PASSED: max observed error is 6, allowed is 50
>>
>> As you can see, it's with seeky workloads that come and go from the
>> service tree where this patch is required.
>
> I have not look into or run the tests posted by Justin and Akshay. Can you
> give more details about these tests.

> Are you running with group_isolation=0 or 1. These tests seem to be random
> read and if group_isolation=0 (default), then all the random read queues
> should go in root group and there will be no service differentiation.

The test sets group_isolation=1 as part of its setup, as this is our
standard configuration.

> If you ran different random readers in different groups of differnet
> weight with group_isolation=1, then there is a case of having service
> differentiation. In that case we will idle for 8ms on each group before
> we expire the group. So in these test cases are low weight groups not
> submitting IO with-in 8ms? Putting a random reader in separate group
> with think time > 8, I think is going to hurt a lot because for every
> single IO dispatched group is going to weight for 8ms before it is
> expired.

You're right about the behavior of group_idle. We have more
experience with earlier kernels (before group_idle). With this patch
we are able to achieve isolation without group_idle even with these
large ratios. (Without group_idle the random reader workloads will
get marked seeky, and idling is disabled. Without group_idle, we have
to remember the vdisktime to get isolation.)

> Can you run blktrace and verify what's happenig?

I can run a blktrace, and I think it will show what you expect.

Chad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/