Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

From: Tejun Heo
Date: Tue Mar 01 2016 - 13:47:06 EST


Hello, Paolo.

Sorry about the delay.

On Sat, Feb 20, 2016 at 11:23:43AM +0100, Paolo Valente wrote:
> Before replying to your points, I want to stress that I'm not a
> champion of budget-based scheduling at all costs. Budget-based
> scheduling just seems to provide tight bandwidth and latency
> guarantees that are practically impossible to get with time-based
> scheduling. I will try to explain this fact better, and to provide
> also a numerical example, in my replies to your points.

I do like the budget-based scheduling. It just feels that the budget
is based on the wrong unit.

...
> I think I got your point. In any case, a queue is not punished *after*
> it has consumed an undue amount of the resource, because a queue just
> cannot get to consume an undue amount of the resource. There is a
> timeout: a queue cannot be served for more than a pre-defined maximum
> time slice.
>
> Even if a queue expires before that timeout, BFQ checks anyway, on the
> expiration of the queue, whether the queue is so slow to not deserve
> accurate service-based guarantees. This is done to achieve additional
> robustness. In fact, if service-based guarantees were provided to a
> very slow queue that, for some reason, never causes the timeout to
> expire, then the queue would happen to be served too often, i.e., to
> get the undue amount of IO resource you mention.

I see. Once a queue starts timing out its slice, it gets switched to
time based scheduling; however, as you mentioned, workloads which
generate moderate random IOs would still get preferential treatment
over workloads which generate sequential IOs, by definition.

...
> Your metaphor is clear, but it does not seem to match what I expect
> from my storage device. As a user of my PC, Iâm concerned, e.g., about
> how long it takes to copy a large file. Iâm not concerned about what
> percentage of time will be guaranteed to my file copy if other
> processes happen to do I/O in parallel. As a similar example, a good

The underlying device is fundamentally incapable of giving guarantees
like that. The only way to get a (quasi) bandwidth guarantee from a
disk device is either ensuring that the IO is almost completely
sequential or there's enough buffer in capacity for the expected
seekiness of the IO pattern.

For use cases where the differences in seekiness across workloads are
accidental - e.g. all are trying to stream different files but some
files are more fragmented by accident - using bandwidth as the
resource unit would be helpful in mitigating the random gaps that the
user shouldn't be bothered by, but that'd be focusing on a pretty
narrow set of use cases.

Workloads are varied and underlying device performs wildly differently
depending on the specific IO pattern. Because rotating disks suck so
badly, it's true that there's a lot of wiggle room in what the IO
scheduler can do. People are accustomed to dealing with random
behaviors. That said, it still doesn't feel comfortable to use the
obviously wrong unit as the fundamental basis of resource
distribution.

> file-hosting service must probably guarantee reasonable read/write,
> i.e., download/upload, speeds to users (of course, I/O speed matters
> only if the bottleneck is not the network). Most likely, a user of
> such a service does not care (directly) about how much resource-time
> is guaranteed to the I/O related to his/her downloads/uploads.
>
> With a budget-based service scheme, you can easily provide these
> service guarantees accurately. In particular, you can achieve this
> goal even if the bandwidth fluctuates, provided that fluctuations
> depend mostly on I/O patterns. That is, it must hold true that the
> device achieves about the same bandwidth with the same I/O
> pattern. This is exactly the case with both rotational and

So, yes, I see that bandwidth based control would yield a better
result for this specific use case but at the same time this is a very
specialized use case and probably the *only* use case where bandwidth
based distribution makes sense - equivalent logically sequential
workloads where the specifics of IO pattern are completely accidental.
We can't really design the whole thing around that single use case.

> non-rotational devices. With a time-based scheme, it is impossible to
> provide these service guarantees, if bandwidth fluctuates and
> requirements are minimally stringent. I will try to show it with a
> simple numerical example. Before this numerical example, I would like
> to report a last practical example.

I agree that bandwidth based distribution would behave better for this
use case but think the above paragraph is going a bit too far.
Bandwidth based distribution can stick to the line better but that
just means that time based scheduling would need a bit more buffer in
achieving the same level of behavioral consistency. It's not like
bandwidth can actually be guarnateed no matter what we do.

...
> To achieve, with a time-based scheduler, the same precise and stable
> bandwidth distribution as with a budget-based scheduler, the only
> solution is to change weights dynamically, as a function of how the
> throughput achieved by B varies in its time slots. Such a solution
> would be definitely complex, if ever feasible and stable.

Isn't that a use case specifically carved for bandwidth based
distribution? Imagine how this would work when e.g. there are mostly
sequential IO workloads and fluctuating random workloads. Addition of
another sequential workload would behave as expected but addition of
random workloads would cripple everyone to the same level if the
random workloads play their cards right.

One side of the coin is "if I have two parallel file copies, they
proceed at the same speed regardless of how they're distributed across
disk" and the other is "but if I start an application which
intermittently issues random IOs, my copies take 5x longer". Isn't
"the two parallel copies mostly keep the same pace but may deviate a
bit but addition of random IOs doesn't collapse the whole thing" a
better proposition?

Hmmm... it could be that I'm mistaken on how trigger happy the switch
to time based scheduling is. Maybe it's sensitive enough that
bandwidth based scheduling is only applied to workloads which are
mostly sequential. I'm sorry if I'm being too dense on this point but
can you please give me some examples on what would happen when
sequential workloads and random ones mix?

Thanks.

--
tejun