Re: [RFC] blk-mq and I/O scheduling

From: Jens Axboe
Date: Wed Nov 25 2015 - 14:48:08 EST

In reply to: Christoph Hellwig: "Re: [RFC] blk-mq and I/O scheduling"

On 11/19/2015 05:02 AM, Andreas Herrmann wrote:

Hi,

I've looked into blk-mq and possible support for I/O scheduling.

The reason for this is to minimize performance degradation with
rotational devices when scsi_mod.use_blk_mq=1 is switched on.

I think that the degradation is well reflected with fio measurements.
With an increasing number of jobs you'll encounter a significant
performance drop for sequential reads and writes with blk-mq in
contrast to CFQ. blk-mq ensures that requests from different processes
(CPUs) are "perfectly shuffled" in a hardware queue. This is no
problem for the non-rotational devices that blk-mq is aimed at, but it
hurts rotational disks.
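
The shuffling effect can be illustrated with a toy model. The sketch
below is not kernel code: it assumes each CPU issues one sequential
stream of sectors, and it models the "perfect shuffle" as taking one
request per software queue per pass. All names (make_streams,
round_robin_dispatch, count_seeks) and the sector layout are made up
for illustration.

```python
# Toy model, not kernel code. Each "CPU" owns one software queue that
# holds a sequential run of sector numbers. All names are hypothetical.

def make_streams(num_cpus, reqs_per_cpu, stride=1_000_000):
    """Build one sequential run of sectors per CPU, far apart on disk."""
    return [[cpu * stride + i for i in range(reqs_per_cpu)]
            for cpu in range(num_cpus)]

def round_robin_dispatch(streams):
    """Take one request from each queue in turn (the shuffled order)."""
    order = []
    for batch in zip(*streams):
        order.extend(batch)
    return order

def count_seeks(order):
    """Count transitions where the next sector is not contiguous."""
    return sum(1 for prev, cur in zip(order, order[1:]) if cur != prev + 1)

streams = make_streams(num_cpus=4, reqs_per_cpu=8)
shuffled = round_robin_dispatch(streams)
per_process = [sector for stream in streams for sector in stream]

print("seeks, shuffled:", count_seeks(shuffled))
print("seeks, one stream at a time:", count_seeks(per_process))
```

In this model every transition in the shuffled order is a seek, while
servicing one stream at a time seeks only when switching streams.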

(i) I've done some tests with patch c2ed2f2dcf92 (blk-mq: first cut
deadline scheduling) from branch mq-deadline of linux-block
repository. I've seen no significant performance impact when
enabling it (for either non-rotational or rotational
disks).

(ii) I've played with code to enable sorting/merging of requests. I
did this in flush_busy_ctxs. This didn't have a performance
impact either. On a closer look this was due to high frequency
of calls to __blk_mq_run_hw_queue. There was almost nothing to
sort (too few requests). I guess that's also the reason why (i)
had not much impact.

(iii) With CFQ I've observed similar performance patterns to blk-mq if
slice_idle was set to 0.

(iv) I thought about introducing a per software queue time slice
during which blk-mq will service only one software queue (one
CPU) and not flush all software queues. This could help to
enqueue multiple requests belonging to the same process (as long
as it runs on the same CPU) into a hardware queue. A minimal patch
to implement this is attached below.
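
The idea in (iv) can be sketched with a toy dispatcher. This is not the
attached patch and not kernel code: the time slice is approximated by a
request budget per software queue (a real slice would be measured in
time), and sliced_dispatch, count_seeks and the sector layout are
hypothetical names chosen for illustration.

```python
from collections import deque

def sliced_dispatch(streams, budget):
    """Service one software queue for up to `budget` requests, then move on.

    `budget` stands in for the time slice. This is an assumption of the
    sketch: a real slice would be bounded by time, not by request count.
    """
    queues = [deque(stream) for stream in streams]
    order = []
    while any(queues):
        for queue in queues:
            for _ in range(min(budget, len(queue))):
                order.append(queue.popleft())
    return order

def count_seeks(order):
    """Count transitions where the next sector is not contiguous."""
    return sum(1 for prev, cur in zip(order, order[1:]) if cur != prev + 1)

# Four CPUs, each with a sequential stream of eight sectors.
streams = [[cpu * 1_000_000 + i for i in range(8)] for cpu in range(4)]

for budget in (1, 2, 4, 8):
    seeks = count_seeks(sliced_dispatch(streams, budget))
    print(f"budget={budget} seeks={seeks}")
```

A larger budget keeps more requests of one stream together. The model
deliberately ignores the cost of waiting on a queue that has nothing
pending, so it says nothing about how long a slice should be.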

The latter helped to improve performance for sequential reads and
writes. But it's not on a par with CFQ. Increasing the time slice is
suboptimal (as shown with the 2ms results, see below). It might be
possible to get better performance when further reducing the initial
time slice and adapting it up to a maximum value if there are
repeatedly pending requests for a CPU.
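
One possible shape for such an adaptive slice is sketched below. The
doubling policy, the reset on an idle slice, and the 250us/2000us
values are assumptions made for illustration only; none of them come
from the patch or from measurements.

```python
def next_slice_us(current_us, had_pending, initial_us=250, max_us=2000):
    """Pick the next time slice for a CPU's software queue.

    Hypothetical policy: double the slice while the CPU keeps having
    pending requests, capped at max_us; fall back to initial_us as soon
    as a slice ends with nothing pending.
    """
    if had_pending:
        return min(current_us * 2, max_us)
    return initial_us

# Trace the slice length over a made-up sequence of observations.
slice_us = 250
history = []
for had_pending in (True, True, True, True, False, True):
    slice_us = next_slice_us(slice_us, had_pending)
    history.append(slice_us)
print(history)
```

Starting small keeps the penalty low for CPUs that submit only a few
requests, while a CPU with a steady sequential stream grows toward the
maximum.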

After these observations and assuming that non-rotational devices are
most likely fine using blk-mq without I/O scheduling support I wonder
whether

- it's really a good idea to re-implement scheduling support for
blk-mq that eventually behaves like CFQ for rotational devices.

- it's technically possible to support both blk-mq and CFQ for different
devices on the same host adapter. This would allow using "good old"
code for "good old" rotational devices. (But this might not be a
choice if in the long run a goal is to get rid of non-blk-mq code --
not sure what the plans are.)

What do you think about this?

Sorry, I did not get around to looking at this properly this week; I'll tend to it next week. I think the concept of tying the idling to a specific CPU is likely fine, though I wonder if there are cases where we preempt more heavily and subsequently miss breaking the idling properly. I don't think we want/need cfq for blk-mq, but basic idling could potentially be enough. That's still a far cry from a full cfq implementation. The long term plan is still to move away from the legacy IO path, though with things like scheduling, that's sure to take some time.

That is actually where the mq-deadline work comes in. The mq-deadline work is missing a test patch to limit tag allocations, and a bunch of other little things to truly make it functional. There might be some options for folding it all together, with idling, as that would still be important on rotating storage going forward.

--
Jens Axboe
