Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler

From: Paolo Valente
Date: Tue Mar 14 2017 - 10:18:34 EST



> Il giorno 06 mar 2017, alle ore 20:40, Bart Van Assche <bart.vanassche@xxxxxxxxxxx> ha scritto:
>

Hi Bart,
thanks for such an accurate review. I'm addressing the issues you
raised, and I'll get back in touch as soon as I have finished.

Paolo

> On 03/04/2017 08:01 AM, Paolo Valente wrote:
>> BFQ is a proportional-share I/O scheduler, whose general structure,
>> plus a lot of code, are borrowed from CFQ.
>> [ ... ]
>
> This description is very useful. However, since it is identical to the
> description this patch adds to Documentation/block/bfq-iosched.txt I
> propose to leave it out from the patch description.
>
> What seems missing to me is an overview of the limitations of BFQ. Does
> BFQ e.g. support multiple hardware queues?
>
>> +3. What are BFQ's tunable?
>> +==========================
>> +[ ... ]
>
> A thorough knowledge of BFQ is required to tune it properly. Users don't
> want to tune I/O schedulers. Has it been considered to invent algorithms
> to tune these parameters automatically?
>
>> + * Licensed under GPL-2.
>
> The COPYING file at the top of the tree mentions that GPL-v2 licensing
> should be specified as follows close to the start of each source file:
>
> This program is free software; you can redistribute it and/or modify
> it under the terms of the GNU General Public License as published by
> the Free Software Foundation; either version 2 of the License, or
> (at your option) any later version.
>
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> GNU General Public License for more details.
>
> You should have received a copy of the GNU General Public License
> along with this program; if not, write to the Free Software
> Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
> 02110-1301 USA
>
>> + * BFQ is a proportional-share I/O scheduler, with some extra
>> + * low-latency capabilities. BFQ also supports full hierarchical
>> + * scheduling through cgroups. Next paragraphs provide an introduction
>> + * on BFQ inner workings. Details on BFQ benefits and usage can be
>> + * found in Documentation/block/bfq-iosched.txt.
>
> That reference should be sufficient - please do not duplicate
> Documentation/block/bfq-iosched.txt in block/bfq-iosched.c.
>
>> +/**
>> + * struct bfq_service_tree - per ioprio_class service tree.
>> + *
>> + * Each service tree represents a B-WF2Q+ scheduler on its own. Each
>> + * ioprio_class has its own independent scheduler, and so its own
>> + * bfq_service_tree. All the fields are protected by the queue lock
>> + * of the containing bfqd.
>> + */
>> +struct bfq_service_tree {
>> + /* tree for active entities (i.e., those backlogged) */
>> + struct rb_root active;
>> + /* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
>> + struct rb_root idle;
>> +
>> + struct bfq_entity *first_idle; /* idle entity with minimum F_i */
>> + struct bfq_entity *last_idle; /* idle entity with maximum F_i */
>> +
>> + u64 vtime; /* scheduler virtual time */
>> + /* scheduler weight sum; active and idle entities contribute to it */
>> + unsigned long wsum;
>> +};
>
> Inline comments next to structure members are ugly and make the
> structure definition hard to read. Please follow the instructions in
> Documentation/kernel-doc-nano-HOWTO.txt for documenting structure members.
>
>> + u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */
>> + u64 start; /* B-WF2Q+ start timestamp (aka S_i) */
>
> For all times and timestamps, please document the time unit (e.g. s, ms,
> us, ns, jiffies, ...).
>
>> +enum bfq_device_speed {
>> + BFQ_BFQD_FAST,
>> + BFQ_BFQD_SLOW,
>> +};
>
> What is the meaning of "fast" and "slow" devices in this context?
> Anyway, since the first patch that uses this enum is patch 6, please
> defer introduction of this enum until patch 6.
>
>> +
>> +/**
>> + * struct bfq_data - per-device data structure.
>> + *
>> + * All the fields are protected by @lock.
>> + */
>> +struct bfq_data {
>> + /* device request queue */
>> + struct request_queue *queue;
>> + [ ... ]
>> +
>> + /* on-disk position of the last served request */
>> + sector_t last_position;
>
> What is the relevance of last_position if there are multiple hardware
> queues? Will the BFQ algorithm fail to realize its guarantees in that case?
>
> What is the relevance of this structure member for block devices that
> have multiple spindles, e.g. arrays of hard disks?
>
>> +enum bfqq_state_flags {
>> + BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */
>> + BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */
>> + BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*
>> + * waiting for a request
>> + * without idling the device
>> + */
>> + BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
>> + BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */
>> + BFQ_BFQQ_FLAG_sync, /* synchronous queue */
>> + BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
>> + BFQ_BFQQ_FLAG_IO_bound, /*
>> + * bfqq has timed-out at least once
>> + * having consumed at most 2/10 of
>> + * its budget
>> + */
>> +};
>
> The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about
> e.g. using the prefix "BFQQF_" instead?
>
>> +#define BFQ_BFQQ_FNS(name) \
>> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
>> +{ \
>> + (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \
>> +} \
>> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \
>> +{ \
>> + (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \
>> +} \
>> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq) \
>> +{ \
>> + return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \
>> +}
>
> Are the bodies of the above functions duplicates of __set_bit(),
> __clear_bit() and test_bit()?
>
>> +/* Expiration reasons. */
>> +enum bfqq_expiration {
>> + BFQ_BFQQ_TOO_IDLE = 0, /*
>> + * queue has been idling for
>> + * too long
>> + */
>> + BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */
>> + BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */
>> + BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */
>> + BFQ_BFQQ_PREEMPTED /* preemption in progress */
>> +};
>
> The prefix of these constants refers twice to "BFQ" and does not make it
> clear that these constants are about expiration. How about using the
> "BFQQE_" prefix instead?
>
>> +/* Maximum backwards seek, in KiB. */
>> +static const int bfq_back_max = 16 * 1024;
>
> Where does this constant come from? Should it depend on geometry data
> like e.g. the number of sectors in a cylinder?
>
>> +#define for_each_entity(entity) \
>> + for (; entity ; entity = NULL)
>
> Why has this confusing #define been introduced? Shouldn't all
> occurrences of this macro be changed into the equivalent "if (entity)"?
> We don't want silly macros like this in the Linux kernel.
>
>> +#define for_each_entity_safe(entity, parent) \
>> + for (parent = NULL; entity ; entity = parent)
>
> Same question here - why has this macro been introduced and how has its
> name been chosen? Since this macro is used only once and since no value
> is assigned to 'parent' in the code controlled by this construct, please
> remove this macro and use something that is less confusing than a "for"
> loop for something that is not a loop.
>
>> +/**
>> + * bfq_weight_to_ioprio - calc an ioprio from a weight.
>> + * @weight: the weight value to convert.
>> + *
>> + * To preserve as much as possible the old only-ioprio user interface,
>> + * 0 is used as an escape ioprio value for weights (numerically) equal or
>> + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.
>> + */
>> +static unsigned short bfq_weight_to_ioprio(int weight)
>> +{
>> + return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
>> + 0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;
>> +}
>
> Please consider using max() or max_t() to make this function less verbose.
>
>> +
>> +/**
>> + * bfq_active_extract - remove an entity from the active tree.
>> + * @st: the service_tree containing the tree.
>> + * @entity: the entity being removed.
>> + */
>> +static void bfq_active_extract(struct bfq_service_tree *st,
>> + struct bfq_entity *entity)
>> +{
>> + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
>> + struct rb_node *node;
>> +
>> + node = bfq_find_deepest(&entity->rb_node);
>> + bfq_extract(&st->active, entity);
>> +
>> + if (node)
>> + bfq_update_active_tree(node);
>> +
>> + if (bfqq)
>> + list_del(&bfqq->bfqq_list);
>> +}
>
> Which locks protect the data structures manipulated by this and other
> functions? Have you considered to use lockdep_assert_held() to document
> these assumptions?
>
>> + case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
>
> Please don't use parentheses if no confusion is possible. Additionally,
> checkpatch should have requested you to insert a space before and after
> the logical or operator.
>
>> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
>> + struct bfq_queue *bfqq)
>> +{
>> + if (bfqq) {
>> + bfq_mark_bfqq_budget_new(bfqq);
>> + bfq_clear_bfqq_fifo_expire(bfqq);
>> +
>> + bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
>
> Checkpatch should have asked you to insert spaces around the
> multiplication operator.
>
>> +/*
>> + * bfq_default_budget - return the default budget for @bfqq on @bfqd.
>> + * @bfqd: the device descriptor.
>> + * @bfqq: the queue to consider.
>> + *
>> + * We use 3/4 of the @bfqd maximum budget as the default value
>> + * for the max_budget field of the queues. This lets the feedback
>> + * mechanism to start from some middle ground, then the behavior
>> + * of the process will drive the heuristics towards high values, if
>> + * it behaves as a greedy sequential reader, or towards small values
>> + * if it shows a more intermittent behavior.
>> + */
>> +static unsigned long bfq_default_budget(struct bfq_data *bfqd,
>> + struct bfq_queue *bfqq)
>> +{
>> + unsigned long budget;
>> +
>> + /*
>> + * When we need an estimate of the peak rate we need to avoid
>> + * to give budgets that are too short due to previous measurements.
>> + * So, in the first 10 assignments use a ``safe'' budget value.
>> + */
>> + if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
>> + budget = bfq_default_max_budget;
>> + else
>> + budget = bfqd->bfq_max_budget;
>> +
>> + return budget - budget / 4;
>> +}
>
> Where does the magic constant "194" come from?
>
>
>> + } else
>> + /*
>> + * Async queues get always the maximum possible
>> + * budget, as for them we do not care about latency
>> + * (in addition, their ability to dispatch is limited
>> + * by the charging factor).
>> + */
>> + budget = bfqd->bfq_max_budget;
>> +
>
> Please balance braces. Checkpatch should have warned about the use of "}
> else" instead of "} else {".
>
>> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
>> +{
>> + unsigned long max_budget;
>> +
>> + /*
>> + * The max_budget calculated when autotuning is equal to the
>> + * amount of sectors transferred in timeout at the
>> + * estimated peak rate.
>> + */
>> + max_budget = (unsigned long)(peak_rate * 1000 *
>> + timeout >> BFQ_RATE_SHIFT);
>> +
>> + return max_budget;
>> +}
>
> Where does the constant 1000 come from? What are the units of peak_rate
> and timeout? What is the maximum value of peak_rate? Can the
> multiplication overflow?
>
>> +/*
>> + * In addition to updating the peak rate, checks whether the process
>> + * is "slow", and returns 1 if so. This slow flag is used, in addition
>> + * to the budget timeout, to reduce the amount of service provided to
>> + * seeky processes, and hence reduce their chances to lower the
>> + * throughput. See the code for more details.
>> + */
>> +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> + bool compensate)
>> +{
>> + u64 bw, usecs, expected, timeout;
>> + ktime_t delta;
>> + int update = 0;
>> +
>> + if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
>> + return false;
>> +
>> + if (compensate)
>> + delta = bfqd->last_idling_start;
>> + else
>> + delta = ktime_get();
>> + delta = ktime_sub(delta, bfqd->last_budget_start);
>> + usecs = ktime_to_us(delta);
>> +
>> + /* Don't trust short/unrealistic values. */
>> + if (usecs < 100 || usecs >= LONG_MAX)
>> + return false;
>
> If usecs >= LONG_MAX that indicates a kernel bug. Please consider
> triggering a kernel warning in that case.
>
>> +/*
>> + * Budget timeout is not implemented through a dedicated timer, but
>> + * just checked on request arrivals and completions, as well as on
>> + * idle timer expirations.
>> + */
>> +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
>> +{
>> + if (bfq_bfqq_budget_new(bfqq) ||
>> + time_before(jiffies, bfqq->budget_timeout))
>> + return false;
>> + return true;
>> +}
>
> Have you considered to use time_is_after_jiffies() instead of
> time_before(jiffies, ...)?
>
>> +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> + struct bfq_io_cq *bic, pid_t pid, int is_sync)
>> +{
>> + RB_CLEAR_NODE(&bfqq->entity.rb_node);
>> + INIT_LIST_HEAD(&bfqq->fifo);
>> +
>> + bfqq->ref = 0;
>> + bfqq->bfqd = bfqd;
>> +
>> + if (bic)
>> + bfq_set_next_ioprio_data(bfqq, bic);
>> +
>> + if (is_sync) {
>> + if (!bfq_class_idle(bfqq))
>> + bfq_mark_bfqq_idle_window(bfqq);
>> + bfq_mark_bfqq_sync(bfqq);
>> + } else
>> + bfq_clear_bfqq_sync(bfqq);
>> +
>> + bfqq->ttime.last_end_request = ktime_get_ns() - (1ULL<<32);
>> +
>> + bfq_mark_bfqq_IO_bound(bfqq);
>> +
>> + bfqq->pid = pid;
>> +
>> + /* Tentative initial value to trade off between thr and lat */
>> + bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
>> + bfqq->budget_timeout = bfq_smallest_from_now();
>> + bfqq->pid = pid;
>> +
>> + /* first request is almost certainly seeky */
>> + bfqq->seek_history = 1;
>> +}
>
> What is the meaning of the 1ULL << 32 constant?
>
>> +static int __init bfq_init(void)
>> +{
>> + int ret;
>> + char msg[50] = "BFQ I/O-scheduler: v0";
>
> Please leave out "[50]" and use "static const char" instead of "char".
>
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 01139f5..786fdcd 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char *name)
>>
>> if (!e) {
>> /*
>> - * For blk-mq devices, we default to using mq-deadline,
>> - * if available, for single queue devices. If deadline
>> - * isn't available OR we have multiple queues, default
>> - * to "none".
>> + * For blk-mq devices, we default to using bfq, if
>> + * available, for single queue devices. If bfq isn't
>> + * available, we try mq-deadline. If neither is
>> + * available, OR we have multiple queues, default to
>> + * "none".
>> */
>> if (q->mq_ops) {
>> + if (q->nr_hw_queues == 1) {
>> + e = elevator_get("bfq", false);
>> + if (!e)
>> + e = elevator_get("mq-deadline", false);
>> + }
>> if (q->nr_hw_queues == 1)
>> - e = elevator_get("mq-deadline", false);
>> + e = elevator_get("bfq", false);
>> if (!e)
>> return 0;
>> } else
>>
>
> As Jens wrote, it's way too early to make BFQ the default scheduler.
>
> Bart.