Re: Switching to MQ by default may generate some bug reports

From: Paolo Valente
Date: Tue Aug 08 2017 - 13:16:35 EST

In reply to: Mel Gorman: "Re: Switching to MQ by default may generate some bug reports"

> On 8 Aug 2017, at 12:30, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>> matters.
>>>>
>>>
>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>> soon as I can, thanks.
>>>
>>>
>>
>> I've run this test and tried to further investigate this regression.
>> For the moment, the gist seems to be that blk-mq plays an important
>> role, not only with bfq (unless I'm considering the wrong numbers).
>> Even if your main purpose in this thread was just to give a heads-up,
>> I guess it may be useful to share what I have found out. In addition,
>> I want to ask for some help, to try to get closer to the possible
>> causes of at least this regression. If you think it would be better
>> to open a new thread on this stuff, I'll do it.
>>
>
> I don't think it's necessary unless Christoph or Jens object and I doubt
> they will.
>
>> First, I got mixed results on my system.
>
> For what it's worth, this is standard. In my experience, IO benchmarks
> are always multi-modal, particularly on rotary storage. Cases of universal
> win or universal loss for a scheduler or set of tuning are rare.
>
>> I'll focus only on the
>> case where mq-bfq-tput achieves its worst relative performance w.r.t.
>> cfq, which happens with 64 clients. Even in this case,
>> mq-bfq is better than cfq in all average values except Flush. I don't
>> know which are the best/right values to look at, so here's the final
>> report for both schedulers:
>>
>
> For what it's worth, it has often been observed that dbench overall
> performance was dominated by flush costs. This is also true for the
> standard reported throughput figures rather than the modified load file
> elapsed time that mmtests reports. In dbench3 it was even worse where the
> "performance" was dominated by whether the temporary files were deleted
> before writeback started.
>
>> CFQ
>>
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13120    20.069   348.594
>> Close                   133696     0.008    14.642
>> LockX                      512     0.009     0.059
>> Rename                    7552     1.857   415.418
>> ReadX                   270720     0.141   535.632
>> WriteX                   89591   421.961  6363.271
>> Unlink                   34048     1.281   662.467
>> UnlockX                    512     0.007     0.057
>> FIND_FIRST               62016     0.086    25.060
>> SET_FILE_INFORMATION     15616     0.995   176.621
>> QUERY_FILE_INFORMATION   28734     0.004     1.372
>> QUERY_PATH_INFORMATION  170240     0.163   820.292
>> QUERY_FS_INFORMATION     28736     0.017     4.110
>> NTCreateX               178688     0.437   905.567
>>
>> MQ-BFQ-TPUT
>>
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13504    75.828 11196.035
>> Close                   136896     0.004     3.855
>> LockX                      640     0.005     0.031
>> Rename                    8064     1.020   288.989
>> ReadX                   297600     0.081   685.850
>> WriteX                   93515   391.637 12681.517
>> Unlink                   34880     0.500   146.928
>> UnlockX                    640     0.004     0.032
>> FIND_FIRST               63680     0.045   222.491
>> SET_FILE_INFORMATION     16000     0.436   686.115
>> QUERY_FILE_INFORMATION   30464     0.003     0.773
>> QUERY_PATH_INFORMATION  175552     0.044   148.449
>> QUERY_FS_INFORMATION     29888     0.009     1.984
>> NTCreateX               183152     0.289   300.867
>>
>> Are these results in line with yours for this test?
>>
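The two reports above can also be compared mechanically. What follows is only a minimal sketch: it takes a subset of the rows quoted above as input, and the helper names parse_summary and worse_ops are hypothetical, not part of dbench or mmtests. It lists the operations for which MQ-BFQ-TPUT shows a higher latency than CFQ.

```python
# Sketch: compare two dbench per-operation latency summaries.
# Rows are a subset copied from the CFQ and MQ-BFQ-TPUT reports above.
# parse_summary() and worse_ops() are illustrative helpers, not dbench tools.

CFQ = """\
Flush 13120 20.069 348.594
ReadX 270720 0.141 535.632
WriteX 89591 421.961 6363.271
Unlink 34048 1.281 662.467
NTCreateX 178688 0.437 905.567
"""

MQ_BFQ_TPUT = """\
Flush 13504 75.828 11196.035
ReadX 297600 0.081 685.850
WriteX 93515 391.637 12681.517
Unlink 34880 0.500 146.928
NTCreateX 183152 0.289 300.867
"""


def parse_summary(text):
    """Map operation name -> (count, avg_lat, max_lat)."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # separators and blank lines
        name, count, avg, mx = fields
        try:
            rows[name] = (int(count), float(avg), float(mx))
        except ValueError:
            continue  # header line such as "Operation Count AvgLat MaxLat"
    return rows


def worse_ops(base, other, column):
    """Operations whose latency in `other` exceeds `base`.

    column 1 selects AvgLat, column 2 selects MaxLat.
    """
    return [op for op in base
            if op in other and other[op][column] > base[op][column]]


cfq = parse_summary(CFQ)
bfq = parse_summary(MQ_BFQ_TPUT)
print("worse AvgLat:", worse_ops(cfq, bfq, 1))  # ['Flush']
print("worse MaxLat:", worse_ops(cfq, bfq, 2))  # ['Flush', 'ReadX', 'WriteX']
```

On these rows, Flush is the only operation with a worse average, which matches the observation above, while the maximum latencies also regress for ReadX and WriteX.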
>
> Very broadly speaking yes, but it varies. On a small machine with only a
> few CPUs, the differences in flush latency are visible but not as
> dramatic. On a machine that tops out with 32 CPUs, it is more noticeable. On
> the one machine I have that topped out with CFQ/BFQ at 64 threads, the
> latency of flush is vaguely similar:
>
>                               CFQ                  BFQ             BFQ-TPUT
> latency avg-Flush-64       287.05 (   0.00%)    389.14 ( -35.57%)    349.90 ( -21.90%)
> latency avg-Close-64         0.00 (   0.00%)      0.00 ( -33.33%)      0.00 (   0.00%)
> latency avg-LockX-64         0.01 (   0.00%)      0.01 ( -16.67%)      0.01 (   0.00%)
> latency avg-Rename-64        0.18 (   0.00%)      0.21 ( -16.39%)      0.18 (   3.28%)
> latency avg-ReadX-64         0.10 (   0.00%)      0.15 ( -40.95%)      0.15 ( -40.95%)
> latency avg-WriteX-64        0.86 (   0.00%)      0.81 (   6.18%)      0.74 (  13.75%)
> latency avg-Unlink-64        1.49 (   0.00%)      1.52 (  -2.28%)      1.14 (  23.69%)
> latency avg-UnlockX-64       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> latency avg-NTCreateX-64     0.26 (   0.00%)      0.30 ( -16.15%)      0.21 (  19.62%)
>
> So, different figures to yours but the general observation that flush
> latency is higher holds.
>
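The percentages in parentheses in the table above appear to be gains relative to the CFQ column, with a negative value meaning a higher latency than CFQ. That reading is an assumption inferred from the Flush row, not taken from the mmtests sources, and gain_percent is a hypothetical helper name. A minimal sketch:

```python
# Sketch: relative gain of a latency figure against a baseline.
# Assumption: gain = (baseline - value) / baseline * 100, so a higher
# latency than the baseline yields a negative percentage.

def gain_percent(baseline, value):
    """Percentage improvement of `value` over `baseline` (lower is better)."""
    return (baseline - value) / baseline * 100.0


# Flush row: CFQ 287.05, BFQ 389.14, BFQ-TPUT 349.90
print(f"{gain_percent(287.05, 389.14):.2f}%")  # -35.57%
print(f"{gain_percent(287.05, 349.90):.2f}%")  # -21.90%
```

The Flush row reproduces exactly. Rows with very small values (for example WriteX or Close) do not reproduce from the two-decimal figures shown, presumably because the percentages were computed from unrounded values.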
>> Anyway, to investigate this regression more in depth, I took two
>> further steps. First, I repeated the same test with bfq-sq, my
>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>> from the changes needed for bfq to live in blk-mq). I got:
>>
>> <SNIP>
>>
>> So, with both bfq and deadline there seems to be a serious regression,
>> especially on MaxLat, when moving from legacy block to blk-mq. The
>> regression is much worse with deadline, as legacy-deadline has the
>> lowest max latency among all the schedulers, whereas mq-deadline has
>> the highest one.
>>
>
> I wouldn't worry too much about max latency simply because a large
> outlier can be due to multiple factors and it will be variable.
> However, I accept that deadline is not necessarily great either.
>
>> Regardless of the actual culprit of this regression, I would like to
>> investigate this issue further. In this respect, I would like to ask
>> for a little help. I would like to isolate the workloads generating
>> the highest latencies. To this end, I had a look at the loadfile
>> client-tiny.txt, and I still have a question: is every item in the
>> loadfile executed several times (for each value of the number
>> of clients), or is it executed only once? More precisely, IIUC, for
>> each operation reported in the above results, there are several items
>> (lines) in the loadfile. So, is each of these items executed only
>> once?
>>
>
> The load file is executed multiple times. The normal loadfile was
> basically just the same commands, or very similar commands, run multiple
> times within a single load file. This made the workload too sensitive to
> the exact time the workload finished and too coarse.
>
>> I'm asking because, if it is executed only once, then I guess I can
>> find the critical tasks more easily. Finally, if it is actually
>> executed only once, is it expected that the latency for such a task is
>> one order of magnitude higher than that of the average latency for
>> that group of tasks? I mean, is such a task intrinsically much
>> heavier, and then expectedly much longer, or is the fact that latency
>> is much higher for this task a sign that something in the kernel
>> misbehaves for that task?
>>
>
> I don't think it's quite as easily isolated. It's all the operations in
> combination that replicate the behaviour. If it was just a single operation
> like "fsync" then it would be fairly straight-forward but the full mix
> is relevant as it matters when writeback kicks off, when merges happen,
> how much dirty data was outstanding when writeback or sync started etc.
>
> I see you've made other responses to the thread so rather than respond
> individually
>
> o I've queued a subset of tests with Ming's v3 patchset as that was the
> latest branch at the time I looked. It'll take quite some time to execute
> as the grid I use to collect data is backlogged with other work
>
> o I've included pgioperf this time because it is good at demonstrating
> oddities related to fsync. Granted it's mostly simulating a database
> workload that is typically recommended to use the deadline scheduler but I
> think it's still a useful demonstration
>
> o If you want a patch set queued that may improve workload pattern
> detection for dbench then I can add that to the grid with the caveat that
> results take time. It'll be a blind test as I'm not actively debugging
> IO-related problems right now.
>
> o I'll keep an eye out for other workloads that demonstrate empirically
> better performance given that a stopwatch and desktop performance is
> tough to quantify even though I'm typically working in other areas. While
> I don't spend a lot of time on IO-related problems, it would still
> be preferred if switching to MQ by default was a safe option so I'm
> interested enough to keep it in mind.
>

Hi Mel,
thanks for your thorough responses (I'm about to write something about
the read-write unfairness issue, with, again, some surprise).

I want to reply only to your last point above. With our
responsiveness benchmark of course you don't need a stopwatch, but,
yes, to get some minimally comprehensive results you need a machine
with at least a desktop application like a terminal installed.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs
