Re: Switching to MQ by default may generate some bug reports
From: Paolo Valente
Date: Mon Aug 07 2017 - 13:32:51 EST
> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@xxxxxxxxxx> ha scritto:
>
>>
>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> ha scritto:
>>
>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>> I took that into account; BFQ with low-latency was also tested, and the
>>>> impact was not a universal improvement, although it can be a noticeable
>>>> improvement. From the same machine:
>>>>
>>>> dbench4 Loadfile Execution Time
>>>> 4.12.0 4.12.0 4.12.0
>>>> legacy-cfq mq-bfq mq-bfq-tput
>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>>>>
>>>
>>> Thanks for trying with low_latency disabled. If I read the numbers
>>> correctly, we move from a worst case of 361% higher execution time to
>>> a worst case of 11%, with a best case of 20% lower execution time.
>>>
>>
>> Yes.
>>
>>> I asked you about none and mq-deadline in a previous email, because
>>> actually we have a double change here: change of the I/O stack, and
>>> change of the scheduler, with the first change probably not irrelevant
>>> with respect to the second one.
>>>
>>
>> True. However, the difference between legacy-deadline and mq-deadline is
>> roughly in the 5-10% range across workloads for SSD. It's not
>> universally true, but the impact is not as severe. While this is not
>> proof that the stack change is the sole root cause, it makes it less
>> likely.
>>
>
> I'm getting a little lost here. If I'm not mistaken, you are saying
> that, since the difference between two virtually identical schedulers
> (legacy-deadline and mq-deadline) is only around 5-10%, while the
> difference between cfq and mq-bfq-tput is higher, in the latter case
> it is not the stack's fault. Yet isn't the loss of mq-bfq-tput in the
> above test exactly in the 5-10% range? What am I missing? Are there
> other tests with mq-bfq-tput not yet reported?
>
>>> By chance, according to what you have measured so far, is there any
>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>> lose? I could start from there.
>>>
>>
>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>> it could be the stack change.
>>
>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>> matters.
>>
>
> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
> soon as I can, thanks.
>
>
I've run this test and tried to further investigate this regression.
For the moment, the gist seems to be that blk-mq plays an important
role, not only with bfq (unless I'm considering the wrong numbers).
Even if your main purpose in this thread was just to give a heads-up,
I guess it may be useful to share what I have found out. In addition,
I want to ask for some help, to try to get closer to the possible
causes of at least this regression. If you think it would be better
to open a new thread on this stuff, I'll do it.
First, I got mixed results on my system. I'll focus only on the case
where mq-bfq-tput achieves its worst relative performance w.r.t. cfq,
which happens with 64 clients. Still, even in this case mq-bfq is
better than cfq in all average values except Flush. I don't know which
are the best/right values to look at, so here's the final report for
both schedulers:
CFQ
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13120 20.069 348.594
Close 133696 0.008 14.642
LockX 512 0.009 0.059
Rename 7552 1.857 415.418
ReadX 270720 0.141 535.632
WriteX 89591 421.961 6363.271
Unlink 34048 1.281 662.467
UnlockX 512 0.007 0.057
FIND_FIRST 62016 0.086 25.060
SET_FILE_INFORMATION 15616 0.995 176.621
QUERY_FILE_INFORMATION 28734 0.004 1.372
QUERY_PATH_INFORMATION 170240 0.163 820.292
QUERY_FS_INFORMATION 28736 0.017 4.110
NTCreateX 178688 0.437 905.567
MQ-BFQ-TPUT
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13504 75.828 11196.035
Close 136896 0.004 3.855
LockX 640 0.005 0.031
Rename 8064 1.020 288.989
ReadX 297600 0.081 685.850
WriteX 93515 391.637 12681.517
Unlink 34880 0.500 146.928
UnlockX 640 0.004 0.032
FIND_FIRST 63680 0.045 222.491
SET_FILE_INFORMATION 16000 0.436 686.115
QUERY_FILE_INFORMATION 30464 0.003 0.773
QUERY_PATH_INFORMATION 175552 0.044 148.449
QUERY_FS_INFORMATION 29888 0.009 1.984
NTCreateX 183152 0.289 300.867
Are these results in line with yours for this test?
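As an aside, comparing two of these reports by hand is error-prone.
Here is a minimal Python sketch (the helper names are my own, not
anything from dbench or mmtests) that parses a dbench final report of
the format shown above and computes the per-operation MaxLat ratio
between two runs:

```python
import re

def parse_dbench_report(text):
    """Return {operation: (count, avg_lat, max_lat)} from a dbench final report."""
    stats = {}
    for line in text.splitlines():
        # Data lines look like: "Flush   13120   20.069   348.594"
        m = re.match(r'\s*(\w+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s*$', line)
        if m:
            op, count, avg, mx = m.groups()
            stats[op] = (int(count), float(avg), float(mx))
    return stats

def maxlat_ratios(baseline, candidate):
    """MaxLat ratio candidate/baseline for operations present in both reports."""
    return {op: candidate[op][2] / baseline[op][2]
            for op in baseline if op in candidate and baseline[op][2] > 0}
```

For instance, feeding it the CFQ and MQ-BFQ-TPUT reports above shows
Flush MaxLat growing by a factor of roughly 32, which matches the raw
numbers.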
Anyway, to investigate this regression in more depth, I took two
further steps. First, I repeated the same test with bfq-sq, my
out-of-tree version of bfq for legacy block (identical to mq-bfq apart
from the changes needed for bfq to live in blk-mq). I got:
BFQ-SQ-TPUT
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 12618 30.212 484.099
Close 123884 0.008 10.477
LockX 512 0.010 0.170
Rename 7296 2.032 426.409
ReadX 262179 0.251 985.478
WriteX 84072 461.398 7283.003
Unlink 33076 1.685 848.734
UnlockX 512 0.007 0.036
FIND_FIRST 58690 0.096 220.720
SET_FILE_INFORMATION 14976 1.792 466.435
QUERY_FILE_INFORMATION 26575 0.004 2.194
QUERY_PATH_INFORMATION 158125 0.112 614.063
QUERY_FS_INFORMATION 28224 0.017 1.385
NTCreateX 167877 0.827 945.644
So, the worst-case regression is now around 15%. This made me suspect
that blk-mq influences the results of this test a lot. To cross-check,
I compared legacy-deadline and mq-deadline too.
LEGACY-DEADLINE
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13267 9.622 298.206
Close 135692 0.007 10.627
LockX 640 0.008 0.066
Rename 7827 0.544 481.123
ReadX 285929 0.220 2698.442
WriteX 92309 430.867 5191.608
Unlink 34534 1.133 619.235
UnlockX 640 0.008 0.724
FIND_FIRST 63289 0.086 56.851
SET_FILE_INFORMATION 16000 1.254 844.065
QUERY_FILE_INFORMATION 29883 0.004 0.618
QUERY_PATH_INFORMATION 173232 0.089 1295.651
QUERY_FS_INFORMATION 29632 0.017 4.813
NTCreateX 181464 0.479 2214.343
MQ-DEADLINE
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13760 90.542 13221.495
Close 137654 0.008 27.133
LockX 640 0.009 0.115
Rename 8064 1.062 246.759
ReadX 297956 0.051 347.018
WriteX 94698 425.636 15090.020
Unlink 35077 0.580 208.462
UnlockX 640 0.007 0.291
FIND_FIRST 66630 0.566 530.339
SET_FILE_INFORMATION 16000 1.419 811.494
QUERY_FILE_INFORMATION 30717 0.004 1.108
QUERY_PATH_INFORMATION 176153 0.182 517.419
QUERY_FS_INFORMATION 30857 0.018 18.562
NTCreateX 184145 0.281 582.076
So, with both bfq and deadline there seems to be a serious regression,
especially on MaxLat, when moving from legacy block to blk-mq. The
regression is much worse with deadline, as legacy-deadline has the
lowest max latency among all the schedulers, whereas mq-deadline has
the highest one.
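For anyone reproducing these stack comparisons: the active scheduler
for a device is the bracketed entry in
/sys/block/<dev>/queue/scheduler (e.g. "noop deadline [cfq]" on legacy
block, "[mq-deadline] bfq none" on blk-mq), while the legacy/blk-mq
split itself is selected at boot, e.g. via the scsi_mod.use_blk_mq
parameter. A tiny parser for that sysfs format (a sketch of mine, not
from the thread) is:

```python
import re

def active_scheduler(sysfs_text):
    """Return the bracketed (active) scheduler name from a
    /sys/block/<dev>/queue/scheduler line, or None if none is marked."""
    m = re.search(r'\[(\S+)\]', sysfs_text)
    return m.group(1) if m else None
```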
Regardless of the actual culprit of this regression, I would like to
investigate this issue further. In this respect, I would like to ask
for a little help: I would like to isolate the workloads generating
the highest latencies. To this end, I had a look at the loadfile
client-tiny.txt, and I still have a doubt: is every item in the
loadfile executed several times (once for each value of the number of
clients), or is it executed only once? More precisely, IIUC, for each
operation reported in the above results, there are several items
(lines) in the loadfile. So, is each of these items executed only
once?
I'm asking because, if each item is executed only once, then I guess I
can find the critical tasks more easily. Finally, if an item is indeed
executed only once, is it expected that the latency for such a task is
one order of magnitude higher than the average latency for that group
of tasks? I mean, is such a task intrinsically much heavier, and thus
expectedly much longer, or is the fact that its latency is so much
higher a sign that something in the kernel misbehaves for that task?
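To make the "order of magnitude" question concrete, this is the kind
of filter I have in mind for per-task latency samples within one
operation group (purely illustrative; the function and the 10x
threshold are my own choices):

```python
def latency_outliers(samples, factor=10.0):
    """Return (index, latency) pairs whose latency is at least
    factor times the mean latency of the group."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [(i, v) for i, v in enumerate(samples) if v >= factor * mean]
```

Applied to the per-item latencies of, say, the Flush group, this would
point directly at the items whose latency dominates MaxLat.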
While waiting for some feedback, I'm going to run your test showing
great unfairness between writes and reads, and also check whether
responsiveness worsens while the write workload of that test runs in
the background.
Thanks,
Paolo
> ...
>> --
>> Mel Gorman
>> SUSE Labs