Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
From: Paolo Valente
Date: Mon Jun 16 2014 - 06:47:04 EST
On 04 Jun 2014, at 15:56, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, Paolo.
Hello, and sorry for the late reply.
> […]
>>
>> Actually we have been asked several times to improve random-I/O
>> performance on HDDs over the last years, by people recording, for
>> the typical tasks performed by their machines, much lower throughput
>> than with the other schedulers. Major problems have been reported
>> for server workloads (database, web), and for btrfs. According to
>> the feedback received after introducing this optimization in bfq,
>> those problems seem to be finally gone.
>
> I see. The equal priority part can probably work in enough cases to
> be meaningful given that it just depends on the busy queues having the
> same weight instead of everything in the system. It'd be nice to note
> that in the comment tho.
>
> I'm still quite skeptical about the cgroup part tho. The triggering
> condition is too specific and fragile. If I'm reading the bfq blkcg
> implementation correctly, it seems to be applying the scheduling
> algorithm recursively walking down the tree one level at a time.
Yes.
> cfq
> does it differently. cfq flattens the hierarchy by calculating the
> nested weight of each active leaf queue and schedule all of them from
> the same service tree. IOW, scheduling algorithm per-se doesn't care
> about the hierarchy. All it sees are differing weights competing
> equally regardless of the hierarchical structure.
>
To preserve the desired distribution of the device throughput (or time), this
scheme entails updating weights every time the set of active queues changes.
For example (sorry for the trivial example, but I just want to make sure that I am
not misunderstanding what you are telling me), suppose that two groups,
A and B, are reserved 50% of the throughput each, and that the first group contains
two processes, whereas the second group contains just one process. Apart from the
additional per-queue interventions of cfq to improve latency or throughput, the
queues of the two processes in group A are each reserved 50% of the group throughput.
It follows that, if all three queues are backlogged, then a correct weight
distribution for a flat underlying scheduler is, e.g., 25 and 25 for the two processes
in group A, and 50 for the process in group B.
But, if one of the queues in group A becomes idle, then the correct weights
of the still-active queues become, e.g., 50 and 50.
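To make the arithmetic explicit, here is a minimal user-space sketch of the
recomputation that a flattening scheme has to perform whenever a queue becomes
busy or idle (plain C, not cfq or bfq code; all names are made up):

/*
 * Flattened weight of a busy leaf queue:
 *   group_share * queue_weight / (sum of weights of busy queues in the group)
 */
#include <stdio.h>

struct queue { int weight; int busy; };

static int busy_weight_sum(const struct queue *q, int n)
{
	int i, sum = 0;

	for (i = 0; i < n; i++)
		if (q[i].busy)
			sum += q[i].weight;
	return sum;
}

static void print_flat_weights(const char *tag, const struct queue *a, int na,
			       int share_a, const struct queue *b, int nb,
			       int share_b)
{
	int i, sum_a = busy_weight_sum(a, na), sum_b = busy_weight_sum(b, nb);

	printf("%s:\n", tag);
	for (i = 0; i < na; i++)
		if (a[i].busy)
			printf("  A.queue%d -> %d%%\n", i,
			       share_a * a[i].weight / sum_a);
	for (i = 0; i < nb; i++)
		if (b[i].busy)
			printf("  B.queue%d -> %d%%\n", i,
			       share_b * b[i].weight / sum_b);
}

int main(void)
{
	struct queue a[2] = { { 100, 1 }, { 100, 1 } };	/* group A: 50% */
	struct queue b[1] = { { 100, 1 } };		/* group B: 50% */

	/* All three queues busy: 25% and 25% in A, 50% in B. */
	print_flat_weights("all queues busy", a, 2, 50, b, 1, 50);

	/* One queue in A goes idle: the other must jump to 50%. */
	a[1].busy = 0;
	print_flat_weights("one queue in A idle", a, 2, 50, b, 1, 50);
	return 0;
}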
Changing weights this way has basically no influence on the per-request latency
guarantees of cfq, because cfq is based on a round-robin scheme, and the latter
already suffers from a large deviation with respect to an ideal service. In contrast,
things change dramatically with an accurate scheduler, such as the internal
B-WF2Q+ scheduler of BFQ. In that case, as explained, e.g., here for packet
scheduling (but the problem is exactly the same)
http://www.algogroup.unimore.it/people/paolo/dynWF2Q+/dynWF2Q+.pdf
weight changes would just break service guarantees, and lead to the same
deviation as a round-robin scheduler. As proved in the same
document, to preserve guarantees, weight updates should be delayed/concealed
in a non-trivial (although computationally cheap) way.
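To see where the fragility comes from, here is a sketch of the B-WF2Q+
timestamping rule (not the actual bfq code; struct and function names are made
up):

/*
 * Each queue gets a virtual start time S and a virtual finish time
 * F = S + L / w, where L is the size of the next request to serve and
 * w is the queue weight; the scheduler always dispatches, among the
 * eligible queues, the one with the smallest F.
 */
struct bfqq_stamp {
	unsigned long long S;	/* virtual start time */
	unsigned long long F;	/* virtual finish time */
	unsigned int w;		/* weight */
};

static void stamp_next_request(struct bfqq_stamp *q,
			       unsigned long long vtime,
			       unsigned int req_size)
{
	/*
	 * A backlogged queue starts its next request where the previous
	 * one finished; an idle queue starts it at the current virtual
	 * time.
	 */
	q->S = q->F > vtime ? q->F : vtime;
	q->F = q->S + req_size / q->w;
}

If w is changed while the queue is backlogged, the timestamps already computed
with the old weight no longer measure the service the queue is entitled to,
which is exactly where the deviation comes from unless the update is concealed
as described in the document.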
> If the same strategy can be applied to bfq, possibly the same strategy
> of checking whether all the active queues have the same weight can be
> used regardless of blkcg? That'd be simpler and a lot more robust.
>
Unfortunately, because of the above problems, this scheme would break
service guarantees with an accurate scheduler such as bfq.
The hierarchical scheme used by bfq does not suffer from this problem,
also because it implements the mechanisms described in the
above document. In particular, the hierarchy allows these mechanisms to be
implemented in a quite simple way, whereas things would become
more convoluted after flattening the hierarchy.
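Schematically, the selection of the next queue to serve just descends the tree
one level at a time, always following the child entity with the smallest
virtual finish time; weights are compared only among siblings, so a queue
becoming idle inside one group never forces weight updates elsewhere in the
tree. Again only a sketch, not the actual bfq code:

#include <stddef.h>

struct entity {
	unsigned long long vfinish;	/* virtual finish time */
	struct entity *child[8];	/* children; none for a leaf queue */
	int nr_children;
};

static struct entity *lookup_next_queue(struct entity *root)
{
	struct entity *e = root;

	while (e && e->nr_children) {
		struct entity *best = NULL;
		int i;

		for (i = 0; i < e->nr_children; i++)
			if (!best || e->child[i]->vfinish < best->vfinish)
				best = e->child[i];
		e = best;
	}
	return e;	/* leaf queue to serve next */
}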
If useful, I am willing to provide more details, although these aspects are most
certainly quite theoretical and boring.
> Another thing I'm curious about is that the logic that you're using to
> disable idling assumes that the disk will serve the queued commands
> more or less in fair manner over time, right? If so, why does queues
> having the same weight matter? Shouldn't the bandwidth scheduling
> eventually make them converge to the specified weights over time?
> Isn't wr_coeff > 1 test enough for maintaining reasonable
> responsiveness?
Unfortunately, this assumption turned out to be, when Fabio and I first defined bfq, one of
the main mistakes made in most existing research I/O schedulers. More precisely, your
statement is true for async queues, but fails for sync queues. The problem is that
the (only) way in which a process pushes a scheduler into giving it its reserved
throughput is by issuing requests at a higher rate than that at which they are
served. But, with sync queues this just cannot happen. In particular,
postponing the service of a sync request delays the arrival of the next,
but not-yet-arrived, sync request of the same process. Intuitively, for the scheduler,
it is as if the process were happy with the throughput it is receiving, because
it happens to issue requests exactly at that rate.
This problem is described in more detail, for example, in Section 5.3 of
http://www.algogroup.unimore.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
bfq deals with this problem by properly back-shifting request timestamps when
needed. But idling is necessary for this mechanism to work (unless more
complex solutions are adopted).
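Conceptually, the back-shifting amounts to something like the following,
reusing the bfqq_stamp sketch above (again not the actual bfq code; the real
condition, and the limit on how far back timestamps may be shifted, are more
involved):

static void stamp_sync_request(struct bfqq_stamp *q,
			       unsigned long long vtime,
			       unsigned int req_size,
			       int was_waiting_for_completion)
{
	unsigned long long S = q->F > vtime ? q->F : vtime;

	/*
	 * If the queue looks idle (F < vtime) only because the process
	 * was waiting for the completion of its last sync request,
	 * back-shift the start time to the old finish time, as if the
	 * queue had been continuously backlogged, so that the service
	 * it missed is not lost.
	 */
	if (was_waiting_for_completion && q->F < vtime)
		S = q->F;

	q->S = S;
	q->F = S + req_size / q->w;
}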
>
>> Besides, turning back to bfq, if its low-latency heuristics are
>> disabled for non rotational devices, then, according to our results
>> with slower devices, such as SD Cards and eMMCs, latency becomes
>> easily unbearable, with no throughput gain.
>
> Hmmm... looking at the nonrot optimizations again, so yeah assuming
> the weight counting is necessary for NCQ hdds the part specific to
> ssds isn't that big. You probably wanna sequence it the other way
> around tho. This really should be primarily about disks at this
> point.
>
> The thing which still makes me cringe is how it scatters
> blk_queue_nonrot() tests across multiple places without clear
> explanation on what's going on. A bfqq being constantly seeky or not
> doesn't have much to do with whether the device is rotational or not.
> Its effect does and I don't think avoiding the overhead of keeping the
> counters is meaningful. Things like this make the code a lot harder
> to maintain in the long term as code is organized according to
> seemingly arbitrary optimization rather than semantic structure. So,
> let's please keep the accounting and optimization separate.
>
Added to the list of changes to make, thanks.
Paolo
> Thanks.
>
> --
> tejun
--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/