Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
From: Paolo Valente
Date: Wed Mar 09 2016 - 01:34:34 EST
Il giorno 01/mar/2016, alle ore 19:46, Tejun Heo <tj@xxxxxxxxxx> ha scritto:
> Hello, Paolo.
>
> Sorry about the delay.
>
> On Sat, Feb 20, 2016 at 11:23:43AM +0100, Paolo Valente wrote:
>> Before replying to your points, I want to stress that I'm not a
>> champion of budget-based scheduling at all costs. Budget-based
>> scheduling just seems to provide tight bandwidth and latency
>> guarantees that are practically impossible to get with time-based
>> scheduling. I will try to explain this fact better, and to provide
>> also a numerical example, in my replies to your points.
>
> I do like the budget-based scheduling. It just feels that the budget
> is based on the wrong unit.
>
This is probably the focal point of our discussion. Unfortunately, I
am still not convinced of your claim. In fact, basing budgets on
sectors (service), instead of time, still seems to me the only way to
provide the stronger bandwidth and low-latency guarantees that I have
tried to highlight in my previous email. And these guarantees do not
seem to concern only a single special case, but several, common use
cases for server and desktop systems. I will try to repeat these facts
more concisely, and hopefully more clearly, in my replies to next
points.
> ...
>> I think I got your point. In any case, a queue is not punished *after*
>> it has consumed an undue amount of the resource, because a queue just
>> cannot get to consume an undue amount of the resource. There is a
>> timeout: a queue cannot be served for more than a pre-defined maximum
>> time slice.
>>
>> Even if a queue expires before that timeout, BFQ checks anyway, on the
>> expiration of the queue, whether the queue is so slow to not deserve
>> accurate service-based guarantees. This is done to achieve additional
>> robustness. In fact, if service-based guarantees were provided to a
>> very slow queue that, for some reason, never causes the timeout to
>> expire, then the queue would happen to be served too often, i.e., to
>> get the undue amount of IO resource you mention.
>
> I see. Once a queue starts timing out its slice, it gets switched to
> time based scheduling; however, as you mentioned, workloads which
> generate moderate random IOs would still get preferential treatment
> over workloads which generate sequential IOs, by definition.
>
Exactly. However, there is still something I don’t fully understand in
your doubt. With BFQ, workloads that generate moderate random IOs
would actually do less damage to throughput, on average, than with
CFQ. In fact, with CFQ the queues containing these IOs would
systematically get a full time slice, while there are two
possibilities with BFQ:
1) If the degree of randomness of (the IOs in) these queues is not too
high, then these queues are likely to finish budgets before
timeouts. In this case, with BFQ these queues get less service than
with CFQ, and thus can waste throughput less.
2) If the degree of randomness of these queues is very high, then they
consume full time slices with BFQ, exactly as with CFQ.
Of course, performance may differ if time slices, i.e., timeouts,
differ between BFQ and CFQ, but this is easy to tune, if needed.
> ...
>> Your metaphor is clear, but it does not seem to match what I expect
>> from my storage device. As a user of my PC, I’m concerned, e.g., about
>> how long it takes to copy a large file. I’m not concerned about what
>> percentage of time will be guaranteed to my file copy if other
>> processes happen to do I/O in parallel. As a similar example, a good
>
> The underlying device is fundamentally incapable of giving guarantees
> like that. The only way to get a (quasi) bandwidth guarantee from a
> disk device is either ensuring that the IO is almost completely
> sequential or there's enough buffer in capacity for the expected
> seekiness of the IO pattern.
>
> For use cases where the differences in seekiness across workloads are
> accidental - e.g. all are trying to stream different files but some
> files are more fragmented by accident - using bandwidth as the
> resource unit would be helpful in mitigating the random gaps that the
> user shouldn't be bothered by, but that'd be focusing on a pretty
> narrow set of use cases.
>
> Workloads are varied and underlying device performs wildly differently
> depending on the specific IO pattern. Because rotating disks suck so
> badly, it's true that there's a lot of wiggle room in what the IO
> scheduler can do. People are accustomed to dealing with random
> behaviors. That said, it still doesn't feel comfortable to use the
> obviously wrong unit as the fundamental basis of resource
> distribution.
>
Actually this does not seem to match our (admittedly limited)
experience with: low-to-high-end rotational devices, RAIDS, SSDs, SD
cards and eMMCs. When stimulated with the same patterns in out tests,
these devices always responded with about the same IO service
times. And this seems to comply with intuition, because, apart from
different initial cache states, the same stimuli cause about the same
arm movements, cache hits/misses, and circuitry operations.
Of course, things differ with write and re-write operations on
flash-based devices, especially if remapping is performed. But
write-intensive or only-write IOs do not seem to be the best kind of
IO with these devices, so we did not focus on them much.
>> file-hosting service must probably guarantee reasonable read/write,
>> i.e., download/upload, speeds to users (of course, I/O speed matters
>> only if the bottleneck is not the network). Most likely, a user of
>> such a service does not care (directly) about how much resource-time
>> is guaranteed to the I/O related to his/her downloads/uploads.
>>
>> With a budget-based service scheme, you can easily provide these
>> service guarantees accurately. In particular, you can achieve this
>> goal even if the bandwidth fluctuates, provided that fluctuations
>> depend mostly on I/O patterns. That is, it must hold true that the
>> device achieves about the same bandwidth with the same I/O
>> pattern. This is exactly the case with both rotational and
>
> So, yes, I see that bandwidth based control would yield a better
> result for this specific use case but at the same time this is a very
> specialized use case and probably the *only* use case where bandwidth
> based distribution makes sense - equivalent logically sequential
> workloads where the specifics of IO pattern are completely accidental.
> We can't really design the whole thing around that single use case.
>
Actually, the tight bandwidth guarantees that I have tried to
highlight are the ground on which the other low-latency guarantees are
built. So, to sum up the set of guarantees that Linus discussed in
more detail in his email, BFQ mainly guarantees, even in the presence
of throughput fluctuations, and thanks also to sector-based
scheduling:
1) Stable(r) and tight bandwidth distribution for mostly-sequential
reads/writes
2) Stable(r) and high responsiveness
3) Stable(r) and low latency for soft real-time applications
4) Faster execution of dev tasks, such as compile and git operations
(checkout, merge, …), in the presence of background workloads, and
while guaranteeing a high responsiveness too
>> non-rotational devices. With a time-based scheme, it is impossible to
>> provide these service guarantees, if bandwidth fluctuates and
>> requirements are minimally stringent. I will try to show it with a
>> simple numerical example. Before this numerical example, I would like
>> to report a last practical example.
>
> I agree that bandwidth based distribution would behave better for this
> use case but think the above paragraph is going a bit too far.
> Bandwidth based distribution can stick to the line better but that
> just means that time based scheduling would need a bit more buffer in
> achieving the same level of behavioral consistency. It's not like
> bandwidth can actually be guarnateed no matter what we do.
>
I hope that my reply to the above point responds somehow also to this
point.
> ...
>> To achieve, with a time-based scheduler, the same precise and stable
>> bandwidth distribution as with a budget-based scheduler, the only
>> solution is to change weights dynamically, as a function of how the
>> throughput achieved by B varies in its time slots. Such a solution
>> would be definitely complex, if ever feasible and stable.
>
> Isn't that a use case specifically carved for bandwidth based
> distribution? Imagine how this would work when e.g. there are mostly
> sequential IO workloads and fluctuating random workloads. Addition of
> another sequential workload would behave as expected but addition of
> random workloads would cripple everyone to the same level if the
> random workloads play their cards right.
>
> One side of the coin is "if I have two parallel file copies, they
> proceed at the same speed regardless of how they're distributed across
> disk" and the other is "but if I start an application which
> intermittently issues random IOs, my copies take 5x longer". Isn't
> "the two parallel copies mostly keep the same pace but may deviate a
> bit but addition of random IOs doesn't collapse the whole thing" a
> better proposition?
>
> Hmmm... it could be that I'm mistaken on how trigger happy the switch
> to time based scheduling is. Maybe it's sensitive enough that
> bandwidth based scheduling is only applied to workloads which are
> mostly sequential. I'm sorry if I'm being too dense on this point but
> can you please give me some examples on what would happen when
> sequential workloads and random ones mix?
>
In the simplest case,
. sequential workloads would get sector-based service guarantees, with
the resulting stability and low-latency properties that I have tried
to highlight;
. random workloads would get time-based service, and thus similar
service guarantees as with CFQ (actually guarantees would still be
tighter, because of the more accurate scheduling policy of BFQ).
Things differ a bit for the queues that the low-latency heuristics of
BFQ ‘like'. In fact, BFQ forgives randomness a little bit more for
these queues, so as to improve responsiveness or low latency for soft
real-time applications. Of course, this may cause a throughput drop
while an interactive or a soft real-time application is being
privileged. If one only aims at maximum throughput, and does not care
about responsiveness or low latency, he/she can just set the
low_latency parameter to 0.
Thanks,
Paolo
> Thanks.
>
> --
> tejun