Re: [PATCH v4 12/13] blk-mq: Use shared tags for shared sbitmap support

From: John Garry
Date: Wed Sep 29 2021 - 09:33:19 EST

On 24/09/2021 11:39, John Garry wrote:
+ Kashyap

On 24/09/2021 11:23, Hannes Reinecke wrote:
On 9/24/21 10:28 AM, John Garry wrote:
Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.

However a full sets of static requests are used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap depth.

As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue, and
not per HW queue.

So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared static

Since there is now no valid HW queue index to be passed to the blk_mq_ops
.init and .exit_request callbacks, pass an invalid index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).

Continue to use term "shared sbitmap" for now, as the meaning is known.

Signed-off-by: John Garry <john.garry@xxxxxxxxxx>
  block/blk-mq-sched.c   | 82 ++++++++++++++++++-------------------
  block/blk-mq-tag.c     | 61 ++++++++++------------------
  block/blk-mq-tag.h     |  6 +--
  block/blk-mq.c         | 91 +++++++++++++++++++++++-------------------
  block/blk-mq.h         |  5 ++-
  include/linux/blk-mq.h | 15 ++++---
  include/linux/blkdev.h |  3 +-
  7 files changed, 125 insertions(+), 138 deletions(-)

The overall idea to keep the full request allocation per queue was to ensure memory locality for the requests themselves.
When moving to a shared request structure we obviously loose that feature.

But I'm not sure if that matters here; the performance impact might be too small to be measurable, seeing that we'll be most likely bound by hardware latencies anyway.

Nevertheless: have you tested for performance regressions with this patchset?

I have tested relatively lower rates, like ~450K IOPS, without any noticeable regression.

I'm especially thinking of Kashyaps high-IOPS megaraid setup; if there is a performance impact that'll be likely scenario where we can measure it.

I can test higher rates, like 2M IOPS, when I get access to the HW.

@Kashyap, Any chance you can help test performance here?

But even if there is a performance impact this patchset might be worthwhile, seeing that it'll reduce the memory footprint massively.

Sure, I don't think that minor performance improvements can justify the excessive memory.

JFYI, with 6x SAS SSDs on my arm64 board, I see:

Before (5.15-rc2 baseline):
none: 445K IOPs, mq-deadline: 418K IOPs (fio read)

none: 442K IOPs, mq-deadline: 407K IOPs (fio read)

So only a marginal drop there for mq-deadline.

I'll try my 12x SAS SSD setup when I get a chance. Kashyap is kindly also testing.