Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
From: Paolo Valente
Date: Wed Feb 17 2016 - 04:02:18 EST
Hi,
your first statement “bfq is using bandwidth as the virtual time" is not very clear to me. In contrast, after that you raise a clear, important issue. So, I will first try to sync with you on your first statement (and hopefully help find related bugs in the documentation of BFQ). Then I will try to discuss the problem you highlight. Instead, as for your detailed comments and suggestions about the code, I will just try to apply all of them in the following submission of this patch.
About your first statement, I am not sure that the definition of virtual times, as you rephrased it, is correct. To check this with you, I will repeat here how these quantities are defined. In particular, to try to not add further confusion, in the next paragraph I will first repeat the goal for which these quantities are defined. This information is also tightly related to the problem you highlight. Then I will repeat the actual definition of virtual times in the following paragraph.
Bandwidth fluctuation is one of the key issues addressed in the very definition of BFQ (more precisely in the definition of the engine of BFQ, introduced in this patch). In particular, BFQ schedules queues so as to approximate an ideal service in which every queue receives, over any time interval, its share of the bandwidth, regardless of the actual value of the bandwidth, and even if the bandwidth fluctuates. IOW, in the ideal service approximated by BFQ, if the bandwidth fluctuates during a given time interval, then, in every sub-interval during which the bandwidth is constant (possibly an infinitesimal time interval), each queue receives a fraction of the total bandwidth equal to the ratio between its weight and the sum of the weights of the active queues. BFQ is an optimal scheduler in that it guarantees the minimum possible deviation, for a slice-by-slice service scheme, with respect to the ideal service. Finally, BFQ actually provides these fairness guarantees only to queues whose I/O pattern yields a reasonably high throughput. This fact has to do with the issue you raise.
The virtual time is the key concept used to achieve the above optimal fairness guarantees. To show how intuitively, let me restate the above guarantees in terms of "amount of service”, i.e., of number of sectors read/written: BFQ guarantees that every active queue receives, over any time interval, an amount of service equal to the total amount of service provided by the system, multiplied by the share of the bandwidth for the queue (apart from the above-mentioned minimum, unavoidable deviation). Informally, this implies that the amount of service received by each active queue, during any given time interval, is proportional to the weight of the queue. As a consequence, the normalized service received by every active queue, i.e., the amount of service divided by the weight of the queue, grows at the same rate. In BFQ, every queue is associated with a virtual time, equal exactly to this normalized service. The goal of BFQ is then to equalize queues’ virtual times (using also a further global quantity, called system virtual time). To achieve this goal, BFQ does not schedule time slices, but budgets, measured in amount of service.
Hoping that we are now in sync on virtual times, I will try to discuss your comment on why not to schedule time (slices) in the first place. The fact that BFQ does not schedule time slices with the same scheduling policy as CFQ, but budgets with an optimally fair policy, provides the following main practical benefits:
1) It allows BFQ to distribute bandwidth as desired among processes or groups of processes, regardless of: device parameters, bandwidth fluctuation, overall workload composition (BFQ gives up this bandwidth-fair policy only if this would cause a throughput drop, as discussed in the second part of this email). It is impossible for a time-based scheduling policy to provide such a service guarantee on a device with a fluctuating bandwidth. In fact, a process achieving, during its allotted time slice, a lower bandwidth than other luckier processes, will just receive less service when it has the opportunity to use the device. As a consequence, an unlucky process may easily receive less bandwidth than another process with the same weight, or even of a process with a lower weight. On HDDs, this trivially happens to processes reading/writing on the inner zones of the disk. On SSDs, the variability is due to more complex causes. This problem has been one of the main motivations for the definition of BFQ.
2) The above bandwidth fairness of a budget-based policy is key for providing sound low-latency guarantees to soft real-time applications, such as audio or video players. In fact, a time-based policy is nasty with an unlucky soft real-time process whose I/O happens to yield, in the time slice during which the I/O requests of the process are served, a lower throughput than the peak rate. The reason is that, first, a lower throughput makes it more difficult for the process to reach its target bandwidth. Secondly, the scheduling policy does not tend to balance this unlucky situation at all: the time slice for the process is basically the same as for any other process with the same weight. This is one of the reasons why, in our experiments, BFQ constantly guarantees to soft real-time applications a much lower latency than CFQ.
3) For the same reasons as in the above case, a budget-based policy is also key for providing sound high-responsiveness guarantees, i.e., for guaranteeing that, even in the presence of heavy background workloads, applications are started quickly, and in general, that interactive tasks are completed promptly. Again, this is one of the reasons why BFQ provides responsiveness guarantees not comparable with CFQ.
For all the three cases above, and concerning unlucky applications, i.e., applications whose I/O patterns yield a lower throughput than other applications, I think it is important to stress that the unfairness, high-latency or low-responsiveness problems experienced by these applications with a time-based policy are not transient: in the presence of a background workload, these application are mistreated in the same way *every time* their I/O is replayed.
Unfortunately, there are applications for which BFQ cannot provide the above desirable service guarantees: the applications whose I/O patterns may kill the throughput if they are guaranteed the service fairness in item 1 (they may also cause other applications to suffer from a high latency). In this respect, if I have well understood the “punish-seeky-queues” case you have highlighted, you refer to the case where a process does not finish its budget within a reasonable time interval. In this case, the process is de-scheduled, and treated as if it has used all its budget. As you have noted, this is a switch from a fair budget-based policy to a fair time-based policy, to prevent I/O patterns yielding a very low throughput from causing throughput and latency problems. This simple switch becomes more complex with the refinements introduced by the following patches, which take into account also whether the application is soft real-time or interactive, and, if so, let the trade-off become more favorable to the application, even if its I/O is not throughput-friendly.
To sum up, the reason for not enforcing a plain time-based policy for every process is simply not to lose all the above benefits. Of course, there is certainly room for further improvements.
I am sorry for the long reply, I have I tried to cover all the points in sufficient detail. Looking forward to your reply,
Paolo
Il giorno 11/feb/2016, alle ore 23:22, Tejun Heo <tj@xxxxxxxxxx> ha scritto:
> Hello,
>
> Here's the biggest question I have. If I'm reading the paper and code
> correctly, bfq is using bandwidth as the virtual time used for
> scheduling. This doesn't make sense to me because for most IO devices
> and especially for rotating disks bandwidth is a number which
> fluctuates easily and widely. Trying to schedule a process with
> sequential IO pattern and another with random IO on bandwidth basis
> would be disastrous.
>
> bfq adjusts for the fact by "punishing" seeky workloads, which, I
> guess, is because doing naive bandwidth based scheduling is simply
> unworkable. The punishment is naturally charging it more than it
> consumed but that looks like a twisted way of doing time based
> scheduling. It uses some heuristics to determine how much a seeky
> queue should get over-charged so that the overcharged amount equals
> the actual IO resource consumed, which is IO time on the device.
>
> cfq may have a lot of shortcomings but abstracting IO resource in
> terms of device IO time isn't one of them and bfq follows all the
> necessary pieces of scheduling mechanics to be able to do so too -
> single issuer at a time with opportunistic idling.
>
> So, I'm scratching my head wondering why this wasn't time based to
> begin with. My limited understanding of the scheduling algorithm
> doesn't show anything why this couldn't have been time based. What am
> I missing?
>
> I haven't scrutinized the code closely and what follows are all
> trivial stylistic comments.
>
>> block/Kconfig.iosched | 8 +-
>> block/bfq.h | 502 ++++++
>
> Why does this need to be in a separate file? It doesn't seem to get
> used anywhere other than cfq-iosched.c.
>
>> +/**
>> + * struct bfq_data - per device data structure.
>> + * @queue: request queue for the managed device.
>> + * @sched_data: root @bfq_sched_data for the device.
>> + * @busy_queues: number of bfq_queues containing requests (including the
>> + * queue in service, even if it is idling).
> ...
>
> I'm personally not a big fan of documenting struct fields this way.
> It's too easy to get them out of sync.
>
>> + * All the fields are protected by the @queue lock.
>> + */
>> +struct bfq_data {
>> + struct request_queue *queue;
>> +
>> + struct bfq_sched_data sched_data;
>
> and also I think it's easier on the eyes to align field names
>
> struct request_queue *queue;
> struct bfq_sched_data sched_data;
>
> That said, this is fundamentally bike-shedding, so please feel free
> to ignore.
>
>> +enum bfqq_state_flags {
>> + BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */
>> + BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */
>> + BFQ_BFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
> ...
>> +#define BFQ_BFQQ_FNS(name) \
> ...
>> +
>> +BFQ_BFQQ_FNS(busy);
>> +BFQ_BFQQ_FNS(wait_request);
>> +BFQ_BFQQ_FNS(must_alloc);
>
> Heh... can we not do this? What's wrong with just doing the following?
>
> enum {
> BFQQ_BUSY = 1U << 0,
> BFQQ_WAIT_REQUEST = 1U << 1,
> BFQQ_MUST_ALLOC = 1U << 2,
> ...
> };
>
>> +/* Logging facilities. */
>> +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
>> + blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
>
> I don't think it's too productive to repeat bfq's. bfqq_log()?
>
>> +#define bfq_log(bfqd, fmt, args...) \
>> + blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
>> +
>> +/* Expiration reasons. */
>> +enum bfqq_expiration {
>> + BFQ_BFQQ_TOO_IDLE = 0, /*
>> + * queue has been idling for
>> + * too long
>> + */
>> + BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */
>> + BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */
>> + BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */
>> +};
>
> Given that these are less common than the flags, maybe use something
> like BFQQ_EXP_TOO_IDLE?
>
>> +static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
>
> bfqe_to_bfqq()? And so on. I don't think being verbose with heavily
> repeated keywords helps much.
>
>> +static struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
>> +{
>> + return bic->bfqq[is_sync];
>> +}
>> +
>> +static void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq,
>> + bool is_sync)
>> +{
>> + bic->bfqq[is_sync] = bfqq;
>> +}
>
> Do helpers like the above add anything? Don't these obfuscate more
> than help?
>
>> +static struct bfq_data *bfq_get_bfqd_locked(void **ptr, unsigned long *flags)
>
> bfqd_get_and_lock_irqsave()?
>
>> +{
>> + struct bfq_data *bfqd;
>> +
>> + rcu_read_lock();
>> + bfqd = rcu_dereference(*(struct bfq_data **)ptr);
>> +
>> + if (bfqd != NULL) {
>> + spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
>> + if (ptr == NULL)
>> + pr_crit("get_bfqd_locked pointer NULL\n");
>
> I don't get it. How can @ptr can be NULL at this point? The kernel
> would have crashed on the above rcu_deref.
>
>> + else if (*ptr == bfqd)
>> + goto out;
>> + spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
>> + }
>> +
>> + bfqd = NULL;
>> +out:
>> + rcu_read_unlock();
>> + return bfqd;
>> +}
>> +
>> +static void bfq_put_bfqd_unlock(struct bfq_data *bfqd, unsigned long *flags)
>> +{
>> + spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
>> +}
>
> Ditto, I don't think these single liners help anything. Just make
> sure that the counterpart's name clearly indicates that it's a locking
> function.
>
> ...
>> +/* hw_tag detection: parallel requests threshold and min samples needed. */
>> +#define BFQ_HW_QUEUE_THRESHOLD 4
>> +#define BFQ_HW_QUEUE_SAMPLES 32
>> +
>> +#define BFQQ_SEEK_THR (sector_t)(8 * 1024)
>> +#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
>
> Inconsistent indenting?
>
> ...
>> + * Shift for timestamp calculations. This actually limits the maximum
>> + * service allowed in one timestamp delta (small shift values increase it),
>> + * the maximum total weight that can be used for the queues in the system
>> + * (big shift values increase it), and the period of virtual time
>> + * wraparounds.
>> */
>> +#define WFQ_SERVICE_SHIFT 22
>
> Collect constants in one place?
>
> ...
>> +static struct bfq_entity *bfq_entity_of(struct rb_node *node)
>> +{
>> + struct bfq_entity *entity = NULL;
>>
>> + if (node)
>> + entity = rb_entry(node, struct bfq_entity, rb_node);
>>
>> + return entity;
>> +}
>
> Maybe
>
> if (!node)
> return NULL;
> return rb_entry(node, ...);
>
>> +static unsigned short bfq_weight_to_ioprio(int weight)
>> {
>> + return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
>> + 0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;
>
> Heh, how about?
>
> return max(IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight, 0);
>
>> +static struct rb_node *bfq_find_deepest(struct rb_node *node)
>> {
>> + struct rb_node *deepest;
>> +
>> + if (!node->rb_right && !node->rb_left)
>> + deepest = rb_parent(node);
>> + else if (!node->rb_right)
>> + deepest = node->rb_left;
>> + else if (!node->rb_left)
>> + deepest = node->rb_right;
>> + else {
>> + deepest = rb_next(node);
>> + if (deepest->rb_right)
>> + deepest = deepest->rb_right;
>> + else if (rb_parent(deepest) != node)
>> + deepest = rb_parent(deepest);
>> + }
>
> According to CodyingStyle, if any clause gets braced, all clauses get
> braced.
>
> And there were some lines which were unnecessarily broken when joining
> them would still fit under 80col limit.
>
> Thanks.
>
> --
> tejun