Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
From: Paolo Valente
Date: Thu Apr 14 2016 - 06:23:26 EST
On 13 Apr 2016, at 22:41, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello,
>
> Sorry about the long delay.
>
> On Wed, Mar 09, 2016 at 07:34:15AM +0100, Paolo Valente wrote:
>> This is probably the focal point of our discussion. Unfortunately, I
>> am still not convinced of your claim. In fact, basing budgets on
>> sectors (service), instead of time, still seems to me the only way to
>> provide the stronger bandwidth and low-latency guarantees that I have
>> tried to highlight in my previous email. And these guarantees do not
>> seem to concern only a single special case, but several, common use
>> cases for server and desktop systems. I will try to repeat these facts
>> more concisely, and hopefully more clearly, in my replies to next
>> points.
>
> I'm still trying to get my head wrapped around why basing the
> scheduling on bandwidth would have those benefits because the
> connection isn't intuitive to me at all. If you're saying that most
> workloads care about bandwidth a lot more and the specifics of their
> IO patterns are mostly accidental and should be discounted for
> scheduling decisions, I can understand how that could be. Is that
> what you're saying?
>
Yes. I will try to elaborate more in my replies to your points below.
>>> I see. Once a queue starts timing out its slice, it gets switched to
>>> time based scheduling; however, as you mentioned, workloads which
>>> generate moderate random IOs would still get preferential treatment
>>> over workloads which generate sequential IOs, by definition.
>>
>> Exactly. However, there is still something I don’t fully understand
>> about your concern. With BFQ, workloads that generate moderate random IOs
>> would actually do less damage to throughput, on average, than with
>> CFQ. In fact, with CFQ the queues containing these IOs would
>> systematically get a full time slice, while there are two
>> possibilities with BFQ:
>> 1) If the degree of randomness of (the IOs in) these queues is not too
>> high, then these queues are likely to finish budgets before
>> timeouts. In this case, with BFQ these queues get less service than
>> with CFQ, and thus can waste throughput less.
>> 2) If the degree of randomness of these queues is very high, then they
>> consume full time slices with BFQ, exactly as with CFQ.
>>
>> Of course, performance may differ if time slices, i.e., timeouts,
>> differ between BFQ and CFQ, but this is easy to tune, if needed.
>
> Hmm.. the above doesn't really make sense to me, so you're saying that
> bandwidth based control only cuts down the slice a random workload
> would use and thus wouldn't benefit them; however, that cut-down of
> slice is based on bandwidth consumption, so it would kick in a lot
> more for sequential workloads. It wouldn't make a random workload's
> slice longer than the timeout but it would make sequential ones'
> slices shorter. What am I missing?
>
Not all sequential or quasi-sequential workloads achieve the same
throughput. According to the tests I have run so far, the variation
is at most ~30%, depending on the device. The
"sector-based plus timeout" service scheme of BFQ adapts elastically
to this variation: parameters are configured in such a way that fast
sequential workloads finish their budgets at most ~30% of the time
before their timeout, while slow sequential workloads tend to finish
closer to the timeout (the slower they are, the closer they finish).
More precisely, the timeout is a hard limit on the maximum slowness
that BFQ tolerates for a sequential workload.
Random workloads systematically hit the timeout.
As shown in my previous numerical example, and confirmed so far by my
tests, this service scheme guarantees a high throughput as well as an
easy and accurate bandwidth distribution.
>>> Workloads are varied and underlying device performs wildly differently
>>> depending on the specific IO pattern. Because rotating disks suck so
>>> badly, it's true that there's a lot of wiggle room in what the IO
>>> scheduler can do. People are accustomed to dealing with random
>>> behaviors. That said, it still doesn't feel comfortable to use the
>>> obviously wrong unit as the fundamental basis of resource
>>> distribution.
>>
>> Actually this does not seem to match our (admittedly limited)
>> experience with: low-to-high-end rotational devices, RAIDs, SSDs, SD
>> cards and eMMCs. When stimulated with the same patterns in our tests,
>> these devices always responded with about the same IO service
>> times. And this seems to comply with intuition, because, apart from
>> different initial cache states, the same stimuli cause about the same
>> arm movements, cache hits/misses, and circuitry operations.
>
> Oh, that's not what I meant. If you feed the same sequence of IOs,
> they would behave similarly. What I meant was that the composition of
> IOs themselves would change significantly depending on how different
> types of workloads get scheduled.
>
Definitely. To better sync with you, let me rephrase my point as
follows: for all use cases characterized by a given I/O locality,
sector-based scheduling makes it easy to distribute bandwidth as
desired, and yields accurate and stable results.
As for I/O locality, IMHO many, or probably most use cases are
characterized by a rather stable locality:
- file-hosting services have mostly-sequential, greedy file
reads/writes (download/uploads)
- audio and video streaming services have periodic, mostly-sequential
reads (streaming), combined with greedy mostly-sequential writes
(updates)
- information-dumping services have mostly-sequential, often greedy,
writes
- database services have mostly-random workloads, in most cases
- single users, on their PCs, generate time-varying workloads that
depend on their habits and on the tasks at hand, but while any of
these tasks is being executed, the I/O has typically a well-defined
pattern
>>> So, yes, I see that bandwidth based control would yield a better
>>> result for this specific use case but at the same time this is a very
>>> specialized use case and probably the *only* use case where bandwidth
>>> based distribution makes sense - equivalent logically sequential
>>> workloads where the specifics of IO pattern are completely accidental.
>>> We can't really design the whole thing around that single use case.
>>
>> Actually, the tight bandwidth guarantees that I have tried to
>> highlight are the ground on which the other low-latency guarantees are
>> built. So, to sum up the set of guarantees that Linus discussed in
>> more detail in his email, BFQ mainly guarantees, even in the presence
>> of throughput fluctuations, and thanks also to sector-based
>> scheduling:
>> 1) Stable(r) and tight bandwidth distribution for mostly-sequential
>> reads/writes
>
> So, yeah, the above makes total sense.
>
>> 2) Stable(r) and high responsiveness
>> 3) Stable(r) and low latency for soft real-time applications
>> 4) Faster execution of dev tasks, such as compile and git operations
>> (checkout, merge, …), in the presence of background workloads, and
>> while guaranteeing a high responsiveness too
>
> But can you please enlighten me on why 2-4 are inherently tied to
> bandwidth-based scheduling?
Goals 2-4 are obtained by granting a higher share of the throughput
to the applications to be privileged. The more stably and accurately
the underlying scheduling engine is able to enforce the desired
bandwidth distribution, the more stably and accurately higher shares
can be guaranteed. Then 2-4 follow from 1, i.e., from the fact that
BFQ guarantees a stabler and tighter bandwidth distribution.
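Just to illustrate the mechanism (again not BFQ code; the names and
weight values are made up): once the engine distributes bandwidth in
proportion to weights, raising the weight of the application to
privilege translates directly into a higher, stable bandwidth share:

#include <stdio.h>

struct queue {
	const char *name;
	unsigned int weight;
};

int main(void)
{
	/* Hypothetical weights: the interactive task gets a raised weight. */
	struct queue queues[] = {
		{ "interactive application", 300 },
		{ "background writer",       100 },
		{ "background reader",       100 },
	};
	unsigned int i, n = sizeof(queues) / sizeof(queues[0]);
	unsigned int total = 0;

	for (i = 0; i < n; i++)
		total += queues[i].weight;

	/*
	 * With proportional-share scheduling, each queue's long-term
	 * bandwidth fraction is weight / total weight, regardless of
	 * how fast the device happens to be at any moment.
	 */
	for (i = 0; i < n; i++)
		printf("%-25s -> %5.1f%% of the device bandwidth\n",
		       queues[i].name, 100.0 * queues[i].weight / total);

	return 0;
}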
>
>>> Hmmm... it could be that I'm mistaken on how trigger happy the switch
>>> to time based scheduling is. Maybe it's sensitive enough that
>>> bandwidth based scheduling is only applied to workloads which are
>>> mostly sequential. I'm sorry if I'm being too dense on this point but
>>> can you please give me some examples on what would happen when
>>> sequential workloads and random ones mix?
>>>
>>
>> In the simplest case,
>> . sequential workloads would get sector-based service guarantees, with
>> the resulting stability and low-latency properties that I have tried
>> to highlight;
>> . random workloads would get time-based service, and thus similar
>> service guarantees as with CFQ (actually guarantees would still be
>> tighter, because of the more accurate scheduling policy of BFQ).
>
> But don't the above inherently mean that sequential workloads would
> get less in terms of IO time?
>
Yes: if a sequential workload is fast enough to finish its budgets
before the timeout, then it gets less IO time. The purpose of a
sector-based scheduler is not to distribute IO time, but to
distribute bandwidth. In particular, BFQ tries to achieve the latter
goal by giving each process as little time as possible (which further
improves latency).
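For example (with made-up numbers): give two sequential readers the
same 8 MB budget; one that streams at ~160 MB/s holds the device for
~50 ms per budget, one that streams at ~80 MB/s holds it for ~100 ms,
yet both receive exactly the same amount of service. Bandwidth is
split as intended, while each process occupies the device for no
longer than its own speed requires.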
> To summarize,
>
> 1. I still don't understand why bandwidth-based scheduling is better
> (sorry). The only reason I can think of is that most workloads
> that we care about are at least quasi-sequential and can benefit
> from ignoring randomness to a certain degree. Is that it?
>
If I have understood correctly, you refer to the maximum ~30%
throughput loss that a quasi-sequential workload can incur (because
of some randomness or other unlucky accidents). If so, then I think
you fully got the point.
> 2. I don't think strict fairness is all that important for IO
> scheduling in general. Whatever gives us the best overall result
> should work, so if bandwidth based scheduling does that great;
> however, fairness does matter across cgroups. A cgroup configured
> to receive 50% of IO resources should get close to that no matter
> what others are doing, would bfq be able to do that?
>
BFQ guarantees 50% of the bandwidth of the resource, not 50% of the
time. In this respect, with 50% of the time instead of 50% of the
bandwidth, a group suffers from the bandwidth-fluctuation,
higher-latency and throughput-loss problems that I have tried to
highlight. Plus, it is not possible to easily answer questions such
as: "how long would it take to copy this file?".
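Just to make the difference concrete with made-up numbers: suppose a
group has to copy a 1 GB file on a drive that serves that copy's
sequential I/O at about 100 MB/s. If the group is guaranteed 50% of
the bandwidth, i.e., ~50 MB/s, the copy takes roughly
1 GB / 50 MB/s = 20 s, whatever the other groups are doing. If the
group is instead guaranteed 50% of the time, the data it moves per
slice depends on the throughput it actually achieves during its
slices, which fluctuates with the overall I/O mix, so the copy time
can only be bounded within that fluctuation, not computed directly.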
In any case, it is of course possible to get time distribution also
with BFQ, by 'just' letting it work in the time domain. However,
changing BFQ to operate in the time domain, or, probably much better,
extending BFQ to operate correctly in both domains, would be a lot of
work. I don't know whether it would be worth the effort and the
extra complexity.
Thanks,
Paolo
> Thanks.
>
> --
> tejun