On Thu 27-10-16 10:26:18, Jens Axboe wrote:
On 10/27/2016 03:26 AM, Jan Kara wrote:
On Wed 26-10-16 10:12:38, Jens Axboe wrote:
On 10/26/2016 10:04 AM, Paolo Valente wrote:
On 26 Oct 2016, at 17:32, Jens Axboe <axboe@xxxxxxxxx> wrote:
On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
The question to ask first is whether to actually have pluggable
schedulers on blk-mq at all, or just have one that is meant to
do the right thing in every case (and possibly can be bypassed
completely).
That would be my preference. Have a BFQ-variant for blk-mq as an
option (default to off unless opted in by the driver or user), and
no other scheduler for blk-mq. Don't bother with bfq for non
blk-mq. It's not like there is any advantage in the legacy-request
device even for slow devices, except for the option of having I/O
scheduling.
It's the only right way forward. blk-mq might not offer any substantial
advantages to rotating storage, but with scheduling, it won't offer a
downside either. And it'll take us towards the real goal, which is to
have just one IO path.
ok
Adding a new scheduler for the legacy IO path
makes no sense.
I would fully agree if effective and stable I/O scheduling would be
available in blk-mq in one or two months. But I guess that it will
take at least one year optimistically, given the current status of the
needed infrastructure, and given the great difficulties of doing
effective scheduling at the high parallelism and extreme target speeds
of blk-mq. Of course, this holds true unless only very simple
scheduling is performed.
So, what's the point in forcing a lot of users to wait another year or
more, for a solution that has yet to be even defined, while they could
enjoy a much better system, and then switch to an even better system when
scheduling is ready in blk-mq too?
That same argument could have been made 2 years ago. Saying no to a new
scheduler for the legacy framework goes back roughly that long. We could
have had BFQ for mq NOW, if we didn't keep coming back to this very
point.
I'm hesitant to add a new scheduler because it's very easy to add, very
difficult to get rid of. If we do add BFQ as a legacy scheduler now,
it'll take us years and years to get rid of it again. We should be
moving towards FEWER moving parts in the legacy path, not more.
We can keep having this discussion every few years, but I think we'd
both prefer to make some actual progress here. It's perfectly fine to
add a single queue interface for an IO scheduler for
blk-mq, since we don't care too much about scalability there. And that
won't take years, that should be a few weeks. Retrofitting BFQ on top of
that should not be hard either. That can co-exist with a real multiqueue
scheduler as well, something that's geared towards some fairness for
faster devices.
OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide the
IO scheduler a global view that it needs for complex scheduling decisions
so it should indeed be relatively easy to port BFQ to work like that.
I'd probably start off of Omar's base [1] that switches the software queues
to store bios instead of requests, since that lifts the 1:1
mapping between what we can queue up and what we can dispatch. Without
that, the IO scheduler won't have too much to work with. And with that
in place, it'll be a "bio in, request out" type of setup, which is
similar to what we have in the legacy path.
I'd keep the software queues, but as a starting point, mandate 1
hardware queue to keep that as the per-device view of the state. The IO
scheduler would be responsible for moving one or more bios from the
software queues to the hardware queue, when they are ready to dispatch.
[1] https://github.com/osandov/linux/commit/8ef3508628b6cf7c4712cd3d8084ee11ef5d2530
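
As a rough illustration of the "bio in, request out" flow described above,
here is a minimal user-space sketch in C. It is not kernel code and it is not
taken from Omar's branch: struct toy_bio, struct toy_request, struct toy_sched,
toy_submit() and toy_dispatch() are hypothetical names invented for this
example, the queue sizes are arbitrary, and the "scheduler" simply picks the
lowest starting sector across all per-CPU software queues.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS   2   /* number of per-CPU software queues (assumed) */
#define SWQ_DEPTH 8   /* capacity of each software queue (assumed) */
#define HWQ_DEPTH 2   /* depth of the single hardware queue (assumed) */

/* Toy stand-in for a bio: just a starting sector. */
struct toy_bio { unsigned long sector; };

/* Toy stand-in for a request built from exactly one bio. */
struct toy_request { unsigned long sector; };

struct toy_sched {
    struct toy_bio swq[NR_CPUS][SWQ_DEPTH];  /* software queues hold bios */
    size_t swq_len[NR_CPUS];
    struct toy_request hwq[HWQ_DEPTH];       /* the one hardware queue */
    size_t hwq_len;
};

/* "bio in": park a bio on the submitting CPU's software queue.
 * Returns 0 on success, -1 if that software queue is full. */
static int toy_submit(struct toy_sched *s, int cpu, unsigned long sector)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (s->swq_len[cpu] == SWQ_DEPTH)
        return -1;
    s->swq[cpu][s->swq_len[cpu]++].sector = sector;
    return 0;
}

/* "request out": while the hardware queue has room, pick the bio with the
 * lowest sector from any software queue and turn it into a request.
 * Returns the number of requests moved to the hardware queue. */
static size_t toy_dispatch(struct toy_sched *s)
{
    size_t moved = 0;

    while (s->hwq_len < HWQ_DEPTH) {
        int best_cpu = -1;
        size_t best_idx = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            for (size_t i = 0; i < s->swq_len[cpu]; i++) {
                if (best_cpu < 0 ||
                    s->swq[cpu][i].sector <
                    s->swq[best_cpu][best_idx].sector) {
                    best_cpu = cpu;
                    best_idx = i;
                }
            }
        }
        if (best_cpu < 0)
            break;  /* nothing left to schedule */

        s->hwq[s->hwq_len++].sector = s->swq[best_cpu][best_idx].sector;
        /* Remove the chosen bio by moving the last entry into its slot. */
        s->swq[best_cpu][best_idx] =
            s->swq[best_cpu][--s->swq_len[best_cpu]];
        moved++;
    }
    return moved;
}
```

The point of the sketch is only that the number of bios queued is decoupled
from the number of requests that can be dispatched: toy_submit() can accept
more bios than HWQ_DEPTH, and toy_dispatch() decides which ones go out.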
Yeah, but what would software queues actually be good for on a single
queue device with device-global IO scheduling? The IO scheduler doing
complex decisions will keep all the bios / requests in a single structure
anyway so there's no scalability to gain from per-cpu software queues...
So you can directly consume bios in your ->make_request handler, place them
in IO scheduler structures and then push requests out to the HW queue in
response to HW tags getting freed (i.e. IO completion). No need
for intermediate software queues. But maybe I'm missing something.
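
To make the alternative above concrete, here is a minimal user-space sketch
in C of a design with no intermediate software queues. Everything in it is
hypothetical and invented for illustration (struct toy_dev, toy_make_request(),
toy_complete(), toy_run_hw(), the sizes): bios go straight into one sorted
scheduler structure, and requests are pushed to the hardware when a tag is
free. One assumption beyond what is said above: submission also tries to
dispatch, so that IO can start while tags are still free.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS   2   /* hardware tags, i.e. max in-flight requests (assumed) */
#define SCHED_MAX 16  /* capacity of the scheduler's structure (assumed) */

struct toy_dev {
    unsigned long pending[SCHED_MAX]; /* scheduler state: sectors, ascending */
    size_t nr_pending;
    unsigned long inflight[NR_TAGS];  /* sector currently owned by each tag */
    int tag_busy[NR_TAGS];
};

/* Push pending IO to the hardware for as long as a free tag exists. */
static void toy_run_hw(struct toy_dev *d)
{
    for (int tag = 0; tag < NR_TAGS && d->nr_pending > 0; tag++) {
        if (d->tag_busy[tag])
            continue;
        d->tag_busy[tag] = 1;
        d->inflight[tag] = d->pending[0];  /* lowest sector goes first */
        for (size_t i = 1; i < d->nr_pending; i++)
            d->pending[i - 1] = d->pending[i];
        d->nr_pending--;
    }
}

/* Stand-in for a ->make_request handler: insert the bio directly into the
 * scheduler structure (sorted by sector), then try to dispatch.
 * Returns 0 on success, -1 if the scheduler structure is full. */
static int toy_make_request(struct toy_dev *d, unsigned long sector)
{
    if (d->nr_pending == SCHED_MAX)
        return -1;

    size_t i = d->nr_pending++;
    while (i > 0 && d->pending[i - 1] > sector) {
        d->pending[i] = d->pending[i - 1];
        i--;
    }
    d->pending[i] = sector;

    toy_run_hw(d);
    return 0;
}

/* Stand-in for IO completion: free the tag, then refill the hardware. */
static void toy_complete(struct toy_dev *d, int tag)
{
    assert(tag >= 0 && tag < NR_TAGS && d->tag_busy[tag]);
    d->tag_busy[tag] = 0;
    toy_run_hw(d);
}
```

Because all pending IO lives in one structure, the scheduler has the global
view it needs, at the cost of every submitter touching the same state.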