Re: [GIT PULL] Block pull request for- 4.11-rc1

From: Jens Axboe
Date: Wed Feb 22 2017 - 13:16:50 EST

On 02/21/2017 04:23 PM, Linus Torvalds wrote:
> On Tue, Feb 21, 2017 at 3:15 PM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>> But under a device managed by blk-mq, that device exposes a number of
>> hardware queues. For older style devices, that number is typically 1
>> (single queue).
> ... but why would this ever be different from the normal IO scheduler?

Because blk-mq has its own set of schedulers, separate from those on
the legacy path. mq-deadline is a basic port that will work
fine with rotational storage, but it's not going to be a good choice
for NVMe because of scalability issues.

We'll have BFQ on the blk-mq side, catering to the needs of those
folks that currently rely on the richer feature set that CFQ supports.

We've continually been working towards getting rid of the legacy
IO path, and its set of schedulers. So if it's any consolation,
those options will go away in the future.

> IOW, what makes single-queue mq scheduling so special that
> (a) it needs its own config option
> (b) it is different from just the regular IO scheduler in the first place?
> So the whole thing stinks. The fact that it then has an
> incomprehensible config option seems to be just gravy on top of the
> crap.

What do you mean by "the regular IO scheduler"? These are different
schedulers.

As explained above, single-queue mq devices generally DO want mq-deadline.
For multi-queue mq devices, we don't have a good choice right now,
so we retain the current behavior (that we've had since blk-mq was
introduced in 3.13) of NOT doing any IO scheduling for them. If you
do want scheduling for them, set the option, or configure udev to
make the right choice for you.
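
To make the default policy above concrete, here is a minimal sketch in
Python. It is a toy model of the behavior described in this mail, not the
kernel's code; the function name default_mq_scheduler is hypothetical.

```python
def default_mq_scheduler(nr_hw_queues: int) -> str:
    """Toy model of the default described above (hypothetical helper).

    nr_hw_queues is the number of hardware queues the blk-mq device
    exposes; older style devices typically expose 1.
    """
    if nr_hw_queues < 1:
        raise ValueError("a blk-mq device exposes at least one hardware queue")
    # Single-queue devices get mq-deadline. Multi-queue devices keep the
    # existing behavior of having no scheduler attached ("none").
    return "mq-deadline" if nr_hw_queues == 1 else "none"


if __name__ == "__main__":
    for queues in (1, 8):
        print(queues, "->", default_mq_scheduler(queues))
```

The Kconfig option or a udev rule would override this default; the sketch
only captures what happens when nobody makes a choice.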

I agree the wording isn't great, and we can improve that. But I do
think that the current choices make sense.

>> "none" just means that we don't have a scheduler attached.
> .. which makes no sense to me in the first place.
> People used to try to convince us that doing IO schedulers was a
> mistake, because modern disk hardware did a better job than we could
> do in software.
> Those people were full of crap. The regular IO scheduler used to have
> a "NONE" option too. Maybe it even still has one, but only insane
> people actually use it.
> Why is the MQ stuff magically so different that NONE would make sense at all?

I was never one of those people, and I've always been a strong advocate
for imposing scheduling to keep devices in check. The regular IO scheduler
pool includes "noop", which is probably the one you are thinking of. That
one is a bit different than the new "none" option for blk-mq, in that it
does do insertion sorts and it does do merges. "none" does some merging,
but only where it happens to make sense. There's no insertion sorting.
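
The difference can be illustrated with a small toy model in Python. This is
only a sketch under simplifying assumptions: requests are (start_sector,
length) tuples, only back merges are modeled, and the "none"-like path
merges solely with the most recently queued request. None of the function
names come from the kernel.

```python
import bisect


def try_back_merge(prev, new):
    """Return the merged request if `new` starts where `prev` ends, else None."""
    prev_start, prev_len = prev
    new_start, new_len = new
    if prev_start + prev_len == new_start:
        return (prev_start, prev_len + new_len)
    return None


def insert_sorted_with_merge(queue, req):
    """Toy 'noop'-like path: keep the queue sorted by sector and merge."""
    idx = bisect.bisect_left(queue, req)
    if idx > 0:
        merged = try_back_merge(queue[idx - 1], req)
        if merged is not None:
            queue[idx - 1] = merged
            return
    queue.insert(idx, req)


def append_with_opportunistic_merge(queue, req):
    """Toy 'none'-like path: plain FIFO, merge only with the last request."""
    if queue:
        merged = try_back_merge(queue[-1], req)
        if merged is not None:
            queue[-1] = merged
            return
    queue.append(req)


def run(insert_fn, requests):
    """Feed requests through one insertion policy and return the queue."""
    queue = []
    for req in requests:
        insert_fn(queue, req)
    return queue


if __name__ == "__main__":
    requests = [(100, 8), (108, 8), (0, 8), (8, 8)]
    print(run(insert_sorted_with_merge, requests))
    print(run(append_with_opportunistic_merge, requests))
```

With this input both policies merge the two sequential pairs, but only the
sorted variant ends up with the queue ordered by sector.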

> And equally importantly: why do we _ask_ people these issues? Is this
> some kind of sick "cover your ass" thing, where you can say "well, I
> asked about it", when inevitably the choice ends up being the wrong
> one?
> We have too damn many Kconfig options as-is, I'm trying to push back
> on them. These two options seem fundamentally broken and stupid.
> The "we have no good idea, so let's add a Kconfig option" seems like a
> broken excuse for these things existing.
> So why ask this question in the first place?
> Is there any possible reason why "NONE" is a good option at all? And
> if it is the _only_ option (because no other better choice exists), it
> damn well shouldn't be a kconfig option!

I'm all for NOT asking questions, and not providing tunables. That's
generally how I do write code. See the blk-wbt stuff, for instance, that
basically just has one tunable that's set sanely by default, and we
figure out the rest.

I don't want to regress performance of blk-mq devices by attaching
mq-deadline to them. When we do have a sane scheduler choice, we'll
make that the default. And yes, maybe we can remove the Kconfig option
at that point.

For single queue devices, we could kill the option. But we're expecting
bfq-mq for 4.12, and we'll want to have the option at that point unless
you want to rely solely on runtime setting of the scheduler through
udev or by the sysadmin.
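
For the runtime route, a sysadmin would normally look at the scheduler file
in sysfs. The sketch below, in Python, parses that file, assuming the usual
format where the available schedulers are listed on one line and the active
one is in square brackets. The helper names and the device name "sda" are
illustrative assumptions.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Split a sysfs scheduler line into (active, available).

    Assumes the active scheduler is wrapped in brackets, e.g.
    "[none] mq-deadline". Returns active=None if nothing is bracketed.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device, sysfs_root="/sys/block"):
    """Read and parse <sysfs_root>/<device>/queue/scheduler."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())


if __name__ == "__main__":
    # "sda" is a placeholder device name; adjust for the local system.
    try:
        print(read_scheduler("sda"))
    except OSError as err:
        print("could not read scheduler file:", err)
```

Switching schedulers at runtime is done by writing one of the listed names
back to the same file, which is also what a udev rule would do.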

Jens Axboe