Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
From: Paolo Valente
Date: Sat Feb 20 2016 - 05:24:44 EST
Hi
On 17 Feb 2016, at 18:02, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello,
>
> On Wed, Feb 17, 2016 at 10:02:00AM +0100, Paolo Valente wrote:
>> your first statement “bfq is using bandwidth as the virtual time” is
>> not very clear to me. In contrast, after that you raise a clear,
>
> Another way to put that would be "bfq is using bandwidth as the
> measure of IO resource provided by a device and to be distributed".
>
Sorry for bothering you about this; it was just fear of
misunderstandings.
Before replying to your points, I want to stress that I'm not a
champion of budget-based scheduling at all costs. Budget-based
scheduling just seems to provide tight bandwidth and latency
guarantees that are practically impossible to get with time-based
scheduling. I will try to explain this fact better, and also to
provide a numerical example, in my replies to your points.
> ...
>> Bandwidth fluctuation is one of the key issues addressed in the very
>> definition of BFQ (more precisely in the definition of the engine of
>> BFQ, introduced in this patch). In particular, BFQ schedules queues
>> so as to approximate an ideal service in which every queue receives,
>> over any time interval, its share of the bandwidth, regardless of
>> the actual value of the bandwidth, and even if the bandwidth
>> fluctuates. IOW, in the ideal service approximated by BFQ, if the
>
> So, this is a problem. For illustration, let's ignore penalization of
> seeky workload and assume a hard drive which can do 100MB/s for a
> fully sequential workload and 1MB/s for 4k random one. In other
> words, a sequential workload doing 100 MB/s is consuming 100% of
> available IO resource and so does 4k random workload doing 1MB/s.
>
> Let's say there are two sequential workloads. If the two workloads are
> interleaved, the total bandwidth achieved would be lower than 100MB/s
> with the amount of drop dependent on how often the two workloads are
> switched. Given big enough scheduling stride, the total won't be too
> much lower and assuming the same weight each should be getting
> bandwidth a bit lower than 50MB/s.
>
> If we do the same for two random workloads, the story is the same.
> Each should be getting close to 512KB/s of bandwidth.
>
ok
> Now let's put one sequential workload and one random workload on the
> device. What *should* happen is clear - the sequential workload
> should be achieving close to 50MB/s while the random one 512KB/s so
> that each is still consuming the equal proportion of the available IO
> resource on the device. For devices with low concurrency such as
> disks, this measure of IO resources can be almost directly represented
> as how long each workload occupies the device.
>
Yes, and this is exactly what happens with BFQ (as well as with
CFQ). For random workloads, BFQ just switches to time-based
scheduling. I will try to provide more details on this in my reply to
your concern about providing seeky queues with an undue amount of
resources.
> If we use the proportion of bandwidth a workload is getting as the
> measure of IO resource consumed, the picture becomes very different
> and, I think, lost. The amount of bandwidth available is heavily
> dependent on what mix of IOs the device is getting which would then
> feed back into what proportion of IO resource each workload gets,
> forming a feedback loop. More importantly, random workload would be
> consuming far more IO resource than it's entitled to. I assume that
> is why bfq currently would need special punishment of seeky workloads.
> In the end what that coefficient does is trying to translate bandwidth
> consumption to IO time consumption in an ad-hoc and inaccurate way.
>
Correct, apart from the accuracy claim. The 'punishment' is the precise way to
say: “regardless of the number of sectors served, this process has
been fully served during its turn”. And this matches what actually
happened: because of the switch from budget to time slice, the process
has received all it had to receive.
Anyway, I agree that a dual-mode policy is more complex than a plain
time-based policy. But, in return, the budget-based component of a
dual-mode policy provides the benefits I will try to highlight below.
>> bandwidth fluctuates during a given time interval, then, in every
>> sub-interval during which the bandwidth is constant (possibly an
>> infinitesimal time interval), each queue receives a fraction of the
>> total bandwidth equal to the ratio between its weight and the sum of
>> the weights of the active queues. BFQ is an optimal scheduler in
>> that it guarantees the minimum possible deviation, for a
>> slice-by-slice service scheme, with respect to the ideal
>> service. Finally, BFQ actually provides these fairness guarantees
>> only to queues whose I/O pattern yields a reasonably high
>> throughput. This fact has to do with the issue you raise.
>
> I don't think it's fair at all. It distributes undue amount of IO
> resources to seeky workloads and then tries to compensate for it by
> punishing them with a special coefficient. The devices bfq is
> targeting primarily are the ones where whether the workload is seeky
> or not results in multiple orders of magnitude difference in terms of
> available IO bandwidth. It simply doesn't make sense to use bandwidth
> as the measurement for available IO resources. That'd be worse than
> calculating computation resource consumed as the number of
> instructions executed regardless of whether the instruction is a
> simple register to register move or double precision floating point
> division, which we don't do for obvious reasons.
>
I think I got your point. In any case, a queue is not punished *after*
it has consumed an undue amount of the resource, because a queue
simply cannot consume an undue amount of the resource. There is a
timeout: a queue cannot be served for more than a pre-defined maximum
time slice.
Even if a queue expires before that timeout, BFQ checks anyway, on the
expiration of the queue, whether the queue is so slow that it does not
deserve accurate service-based guarantees. This is done to achieve
additional robustness. In fact, if service-based guarantees were
provided to a very slow queue that, for some reason, never causes the
timeout to expire, then the queue would end up being served too often,
i.e., it would get the undue amount of IO resource you mention.
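To make the mechanism concrete, here is a minimal sketch of the
expiration logic I have just described (names and thresholds are made
up for illustration; this is not the actual BFQ code):

#include <stdbool.h>

struct queue {
        unsigned long budget;       /* assigned budget, in sectors */
        unsigned long service;      /* sectors served in this slot */
        unsigned long slice_start;  /* time of first dispatch, in jiffies */
};

/* made-up thresholds, for illustration only */
#define BUDGET_TIMEOUT  125     /* maximum service slot, in jiffies */
#define SLOW_RATE       8       /* minimum rate, in sectors per jiffy */

static void queue_expire(struct queue *q, unsigned long now)
{
        unsigned long elapsed = now - q->slice_start;
        bool timed_out = elapsed >= BUDGET_TIMEOUT;
        bool slow = elapsed > 0 && q->service / elapsed < SLOW_RATE;

        /*
         * A queue that triggered the timeout, or that proved too slow,
         * is charged its full budget even if it received less service:
         * its service slot becomes, de facto, a time slice.
         */
        if (timed_out || slow)
                q->service = q->budget;

        /* virtual-time bookkeeping then uses q->service as-is */
}

The point is that the switch to time-based accounting happens right at
queue expiration, before any undue service can accumulate.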
In view of the issues you are raising, I looked at the related
comments in the code again, and I realized that some of them are not
very clear. Once we have somehow solved these issues, I will try to
improve these comments (if they still make sense).
>> The virtual time is the key concept used to achieve the above
>> optimal fairness guarantees. To show how intuitively, let me restate
>> the above guarantees in terms of "amount of service”, i.e., of
>> number of sectors read/written: BFQ guarantees that every active
>> queue receives, over any time interval, an amount of service equal
>> to the total amount of service provided by the system, multiplied by
>> the share of the bandwidth for the queue (apart from the
>> above-mentioned minimum, unavoidable deviation). Informally, this
>> implies that the amount of service received by each active queue,
>> during any given time interval, is proportional to the weight of the
>> queue. As a consequence, the normalized service received by every
>> active queue, i.e., the amount of service divided by the weight of
>> the queue, grows at the same rate. In BFQ, every queue is associated
>> with a virtual time, equal exactly to this normalized service. The
>> goal of BFQ is then to equalize queues’ virtual times (using also a
>> further global quantity, called system virtual time). To achieve
>> this goal, BFQ does not schedule time slices, but budgets, measured
>> in amount of service.
>
> I think I understand what you're saying. What I'm trying to say is
> the unit bfq is currently using for budgeting is almost completely
> bogus. It's the wrong unit. The "amount of service" a group receives
> is fundamentally "the amount of time" it occupies the target device.
> If that can be approximated by bandwidth as in network interfaces,
> using bandwidth as the unit is fine. However, that isn't the case at
> all for disks.
>
I will try to make my point in my replies below.
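Before that, to make sure we mean the same thing by virtual time, here
is a toy sketch of the scheme quoted above (my own simplification,
with made-up names; the real algorithm also uses a system virtual time
and eligibility, along the lines of WF2Q+):

#include <stddef.h>

struct vt_queue {
        unsigned long service;  /* total sectors served so far */
        unsigned int weight;
};

/* the virtual time of a queue is its normalized service */
static unsigned long vtime(const struct vt_queue *q)
{
        return q->service / q->weight;
}

/*
 * Serving the active queue with the smallest virtual time keeps the
 * normalized service of all active queues growing at the same rate,
 * i.e., each queue receives service in proportion to its weight.
 */
static struct vt_queue *pick_next(struct vt_queue **active, unsigned int n)
{
        struct vt_queue *best = NULL;
        unsigned int i;

        for (i = 0; i < n; i++)
                if (!best || vtime(active[i]) < vtime(best))
                        best = active[i];
        return best;
}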
>> Hoping that we are now in sync on virtual times, I will try to
>> discuss your comment on why not to schedule time (slices) in the
>> first place. The fact that BFQ does not schedule time slices with
>> the same scheduling policy as CFQ, but budgets with an optimally
>> fair policy, provides the following main practical benefits:
>>
>> 1) It allows BFQ to distribute bandwidth as desired among processes
>> or groups of processes, regardless of: device parameters, bandwidth
>> fluctuation, overall workload composition (BFQ gives up this
>> bandwidth-fair policy only if this would cause a throughput drop, as
>> discussed in the second part of this email). It is impossible for a
>> time-based scheduling policy to provide such a service guarantee on
>> a device with a fluctuating bandwidth. In fact, a process achieving,
>> during its allotted time slice, a lower bandwidth than other luckier
>> processes, will just receive less service when it has the
>> opportunity to use the device. As a consequence, an unlucky process
>> may easily receive less bandwidth than another process with the same
>> weight, or even than a process with a lower weight. On HDDs, this
>> trivially happens to processes reading/writing on the inner zones of
>> the disk. On SSDs, the variability is due to more complex
>> causes. This problem has been one of the main motivations for the
>> definition of BFQ.
>
> I think you're confused about what IO resource is. Let's say there is
> one swing and two kids. When we say the two kids get to share the
> swing equally, it means that each kid would get the same amount of
> time on the swing, not the same number of strokes or the length of
> circumference traveled. A kid who gets 30% of the swing gets 30% time
> share of the swing regardless of who else the kid is sharing the swing
> with. This is exactly the same situation.
>
Your metaphor is clear, but it does not seem to match what I expect
from my storage device. As a user of my PC, I’m concerned, e.g., about
how long it takes to copy a large file. I’m not concerned about what
percentage of time will be guaranteed to my file copy if other
processes happen to do I/O in parallel. As a similar example, a good
file-hosting service must probably guarantee reasonable read/write,
i.e., download/upload, speeds to users (of course, I/O speed matters
only if the bottleneck is not the network). Most likely, a user of
such a service does not care (directly) about how much resource-time
is guaranteed to the I/O related to his/her downloads/uploads.
With a budget-based service scheme, you can easily provide these
service guarantees accurately. In particular, you can achieve this
goal even if the bandwidth fluctuates, provided that fluctuations
depend mostly on I/O patterns. That is, it must hold true that the
device achieves about the same bandwidth with the same I/O
pattern. This is exactly the case with both rotational and
non-rotational devices. With a time-based scheme, it is impossible to
provide these service guarantees if the bandwidth fluctuates and the
requirements are even minimally stringent. I will try to show this
with a simple numerical example. Before that, though, I would like to
report one last practical example.
A network-monitoring company had the problem of guaranteeing lossless
traffic recording even while serving queries about dumped
traffic. IOW, a certain minimum write speed had to be guaranteed also
while data were read. I could easily satisfy their needs with BFQ,
after some benchmarking of the average overall I/O rate of the system,
plus a simple weight tuning. Since the write speed that had to be
guaranteed was quite close to the I/O peak rate of the storage device,
it would have been practically impossible (or at least extremely
complicated) to achieve the same result with a time-based I/O
scheduler. I’ll try to show why with the following numerical example.
Suppose you have two processes, say A and B, doing mostly sequential
I/O, on a device with a peak rate of 100MB/s. For some reason you need
to guarantee 90MB/s to A and 10MB/s to B, with a tolerance of at most
10% for the bandwidth guaranteed to each process. This is both a
simplified version of the problem I tackled for the network-monitoring
company, and an admittedly oversimplified version of the requirements of
a high-quality file-hosting service in which, e.g., a given higher
bandwidth is to be guaranteed to user downloads/syncs, and a lower,
but non-null, bandwidth to management operations.
If both processes reach ~100MB/s during their service slot, then, with
both a budget- and time-based policy, we can meet the above bandwidth
requirements by just assigning 9 and 1 as weights to A and B,
respectively (the weight sum equals 10, thus A gets (9/10)*100 MB/s
and B gets (1/10)*100 MB/s). Things change considerably if, e.g., the
already penalized process B gets a lower throughput while it is being
served. This may happen to B systematically, as a function of the
locality of its I/O pattern and of the characteristics of the
rotational or non-rotational device. To mention a very simple yet
still realistic scenario, it is enough that the storage device is an
HDD, and that B alternately reads a large file from the inner zones
and a large file from the outer zones of the HDD. For example, suppose
that, regardless of whether the policy is budget- or time-based, B
reaches a throughput equal to ~70MB/s during some budget/time slots,
and to ~100MB/s during other budget/time slots. On the other hand, A
always reaches ~100MB/s while it is served.
For simplicity, suppose that the budget-based scheduler assigns
budgets of 10MB, which implies that, to serve a budget of A, 100ms are
needed, while either 100ms or 142ms are needed for B. To meet the
bandwidth requirement with a budget-based scheduler, it is enough to:
1) start with a sensible weight assignment, for example 9 and 1 in our
case;
2) let the processes run;
3) check the resulting aggregate throughput.
In our example, the latter will of course fluctuate between
(100*900+70*142)/1042 = ~96MB/s and 100MB/s. This implies that A and B
enjoy bandwidths, respectively, in the ranges [86.4, 90] and [9.6, 10]
MB/s. The requirements are thus met, and we are done.
With a time-based policy, if we assign weights 9 and 1 again, we get a
slightly higher average aggregate throughput, because it fluctuates
between (100*900+70*100)/1000 = ~97MB/s and 100MB/s. But B enjoys a
bandwidth in the range [7, 10] MB/s. So we have failed to satisfy the
requirements.
So, as a simple next step, we could try to reduce the weight of A as
little as possible to guarantee that B meets its requirement also
during its unlucky time intervals. This assignment is 6 and 1, and
guarantees to B a bandwidth in the range [10, 14.3] MB/s. The
aggregate throughput is still high, as it ranges between
(100*600+70*100)/700 = ~95.7MB/s and 100MB/s. But the bandwidth
guaranteed to A is now equal to 85.7 MB/s, i.e., A never meets its
requirement.
To let both A and B meet their requirements, we need to use more
precise weights. For example, with 87 and 13, we would get 87 MB/s for
A and the range [9.1, 13] MB/s for B. But this entails several major
problems:
1) It requires the scheduler to be able to enforce weights with a
two-digit precision, which is not trivial.
2) Suppose that a round-robin policy with fixed time-slices succeeds
in enforcing the above very precise weight assignment. Such a policy
may cause a high bandwidth oscillation for each process. In fact, it
implies serving 87 slices of A and 13 slices of B in a 100-slice
round. It is not easy for a round-robin scheduler to properly
interleave slices so as to reduce oscillations, when it has to
preserve a slice distribution with a two-digit precision. If, instead,
the scheduler meets weights by increasing slice sizes in proportion to
weights, then there is no way to avoid large oscillations. This
problem becomes a serious source of high latencies too, as I will try
to highlight in my reply to your next point.
3) Even if the scheduler can somehow succeed in enforcing such a
precise weight assignment smoothly, B still experiences a bandwidth
fluctuation in the range [9.1, 13], i.e., its bandwidth fluctuates by
more than 40%. With the budget-based scheduler, the oscillation was
only 4%!
4) A does not get 90 MB/s even in the time periods during which B gets
100MB/s in its time slices.
To achieve, with a time-based scheduler, the same precise and stable
bandwidth distribution as with a budget-based scheduler, the only
solution is to change weights dynamically, as a function of how the
throughput achieved by B varies in its time slots. Such a solution
would definitely be complex, if feasible and stable at all.
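In case it helps, the numbers above can be double-checked with a few
lines of C (a throwaway user-space program, of course, not scheduler
code, under the same assumptions as in the example: A at a constant
100MB/s, B at 70 or 100MB/s, 10MB budgets, 100ms slices):

#include <stdio.h>

/*
 * Budget-based: each weight unit is one 10MB budget per round.
 * Time-based:   each weight unit is one 100ms slice per round.
 */
static void report(const char *label, double wa, double wb, int budget_based)
{
        double rates_b[2] = { 70.0, 100.0 }; /* B's rate, MB/s */
        int i;

        for (i = 0; i < 2; i++) {
                double rb = rates_b[i];
                /* time A and B occupy the device per round, in seconds */
                double ta = budget_based ? wa * 10.0 / 100.0 : wa * 0.1;
                double tb = budget_based ? wb * 10.0 / rb : wb * 0.1;
                double round = ta + tb;

                printf("%s, B at %3.0fMB/s: A %5.1f, B %5.1f, total %5.1f MB/s\n",
                       label, rb, 100.0 * ta / round, rb * tb / round,
                       (100.0 * ta + rb * tb) / round);
        }
}

int main(void)
{
        report("budget  9:1 ", 9, 1, 1);
        report("time    9:1 ", 9, 1, 0);
        report("time    6:1 ", 6, 1, 0);
        report("time   87:13", 87, 13, 0);
        return 0;
}

Its output reproduces the ranges discussed above.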
>> 2) The above bandwidth fairness of a budget-based policy is key for
>> providing sound low-latency guarantees to soft real-time
>> applications, such as audio or video players. In fact, a time-based
>> policy is nasty with an unlucky soft real-time process whose I/O
>> happens to yield, in the time slice during which the I/O requests of
>> the process are served, a lower throughput than the peak rate. The
>> reason is that, first, a lower throughput makes it more difficult
>> for the process to reach its target bandwidth. Secondly, the
>> scheduling policy does not tend to balance this unlucky situation at
>> all: the time slice for the process is basically the same as for any
>> other process with the same weight. This is one of the reasons why,
>> in our experiments, BFQ constantly guarantees to soft real-time
>> applications a much lower latency than CFQ.
>
> I don't think this is because of any fundamental properties of
> bandwidth being used as the unit but more that the algorithm favors
> low bandwidth consumers and the "soft real-time" ones don't get
> punished as seeky. It seems more a product of distortion in
> distribution scheme, which btw is fine if it serves a practical
> purpose, but this should be achievable in a different way too.
>
Not really. The main part is evidently played by the heuristic for
soft real-time applications that you mentioned. But, for all the
reasons explained above, with a time-based engine beneath:
1) The result would be quite unstable. With a simple example like the
one above, as well as through proper experiments, it is possible to
show that there may be a difference of even one order of magnitude
between the
latency guaranteed by a budget-based scheduler and that guaranteed by
a time-based scheduler, e.g., 20 or 30% against 2 or 3%.
2) To provide worst-case latency guarantees comparable to those of a
budget-based scheduler, with a similarly simple weight-raising scheme,
the only possibility is to inflate the weight of the target
application much more. This increases the above bandwidth oscillation
problem, and unjustly steals bandwidth from other processes.
3) To avoid over-inflating weights, higher-precision weights must
be used. But this calls for a scheduler able to enforce the needed
precise assignment.
4) If the scheduler has to enforce a high-precision weight assignment
with a round-robin fixed-slice scheme, or by increasing slice sizes,
then the service becomes highly oscillatory. This may imply much
higher worst-case latencies, because latencies are proportional to
round durations with round-robin schemes (with the 100-slice round of
the previous example, a process may have to wait for most of a
10-second round before being served again).
These are the additional reasons (which I have tried to explain in
much more detail this time) why BFQ guarantees to soft real-time
applications latencies that a time-based scheduler could not guarantee
even using the same heuristics as BFQ.
>> 3) For the same reasons as in the above case, a budget-based policy
>> is also key for providing sound high-responsiveness guarantees,
>> i.e., for guaranteeing that, even in the presence of heavy
>> background workloads, applications are started quickly, and in
>> general, that interactive tasks are completed promptly. Again, this
>> is one of the reasons why BFQ provides responsiveness guarantees not
>> comparable with CFQ.
>
> I probably am missing something but I don't quite get how the above
> property is a fundamental property of using bandwidth as the unit. If
> it is, it'd be only because the unit distorts the actual IO resource
> distribution in a favorable way. However, because the unit is
> distorted to begin with, bfq has to apply seeky penalty to correct it.
> I'd be far happier with opt-in correction (identifying the useful
> distortions and implementing them) than the current opt-out scheme
> where the fundamental unit is wrong.
>
For the same reasons as those detailed above for soft real-time
applications, the BFQ heuristics for high responsiveness achieve a
level of performance and stability that could not be achieved with a
time-based scheduler.
>> For all the three cases above, and concerning unlucky applications,
>> i.e., applications whose I/O patterns yield a lower throughput than
>
> No, that's not an unlucky application. That's an expensive
> application in terms of IO. It's the slow swinging kid.
>
>> other applications, I think it is important to stress that the
>> unfairness, high-latency or low-responsiveness problems experienced
>> by these applications with a time-based policy are not transient: in
>> the presence of a background workload, these applications are
>> mistreated in the same way *every time* their I/O is replayed.
>
> I think you're getting it backwards. Let's go back to the swing
> example. If you think allocating per-stroke or
> per-circumference-traveled is a valid strategy when different kids can
> show multiple orders of magnitude differences on those scales, please
> convince me why.
>
As I wrote above, when there are multiple orders of magnitude, BFQ
just switches to time-based scheduling.
>> Unfortunately, there are applications for which BFQ cannot provide
>> the above desirable service guarantees: the applications whose I/O
>> patterns may kill the throughput if they are guaranteed the service
>> fairness in item 1 (they may also cause other applications to suffer
>> from a high latency). In this respect, if I have correctly understood the
>> “punish-seeky-queues” case you have highlighted, you refer to the
>> case where a process does not finish its budget within a reasonable
>> time interval. In this case, the process is de-scheduled, and
>> treated as if it has used all its budget. As you have noted, this is
>> a switch from a fair budget-based policy to a fair time-based
>> policy, to prevent I/O patterns yielding a very low throughput from
>> causing throughput and latency problems. This simple switch becomes
>> more complex with the refinements introduced by the following
>> patches, which take into account also whether the application is
>> soft real-time or interactive, and, if so, let the trade-off become
>> more favorable to the application, even if its I/O is not
>> throughput-friendly.
>
> I hope my point is clear by now. The above is correcting for the
> fundamental flaws in the unit used for distribution. It's mapping
> bandwidth to time in an ad-hoc and mostly arbitrary way.
>
> In the same cgroup, fairness can take the second seat to overall
> bandwidth or practical characteristics. That's completely fine.
> What bothers me are
>
> 1. The direction is the other way around. It starts with a flawed
> fundamental unit and then opts out "really bad ones". I think the
> better way would be using the correct resource unit as the basis
> and opt in specific distortions to achieve certain practical
> advantages. The later patches add specific opt-in behaviors
> anyway.
>
> 2. For IO control, the practical distortions should be much lower and
> the distribution should be based on the actual IO resource.
>
> Maybe I'm missing out something big. If so, please hammer it into me.
>
Maybe *I* am missing something big here. However, I hope that what I
wrote above helps us converge somehow.
Thanks,
Paolo
> Thanks.
>
> --
> tejun