Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset

From: John Garry
Date: Wed Nov 13 2019 - 09:57:41 EST


On 13/11/2019 14:06, Hannes Reinecke wrote:
On 11/13/19 2:36 PM, John Garry wrote:
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shut down.

To solve that problem, Ming's patchset to drain hctx's should ensure that
no in-flight IOs are missed [1].

However, to take advantage of that patchset, we need to map the HBA HW
queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However, another problem is that blk mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
host busy counter, which would stop the LLDD being sent more than
.can_queue commands; however, we should still ensure that the block layer
does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce shared tags per blk_mq_tag_set, which
may be requested when allocating the tagset.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset.

This is based on work originally from Ming Lei in [3].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@xxxxxxxxxxxxxxxxxxxxxxx/
[1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@xxxxxxxxxx/
[3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@xxxxxxxxxx/

Signed-off-by: John Garry <john.garry@xxxxxxxxxx>
---
block/blk-core.c | 1 +
block/blk-flush.c | 2 +
block/blk-mq-debugfs.c | 2 +-
block/blk-mq-tag.c | 85 ++++++++++++++++++++++++++++++++++++++++++
block/blk-mq-tag.h | 1 +
block/blk-mq.c | 61 +++++++++++++++++++++++++-----
block/blk-mq.h | 9 +++++
include/linux/blk-mq.h | 3 ++
include/linux/blkdev.h | 1 +
9 files changed, 155 insertions(+), 10 deletions(-)

[ .. ]
@@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 		blk_mq_tag_busy(data->hctx);
 	}

-	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_TAG_FAIL) {
-		if (clear_ctx_on_error)
-			data->ctx = NULL;
-		blk_queue_exit(q);
-		return NULL;
+	if (data->hctx->shared_tags) {
+		shared_tag = blk_mq_get_shared_tag(data);
+		if (shared_tag == BLK_MQ_TAG_FAIL)
+			goto err_shared_tag;
 	}

-	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
+	tag = blk_mq_get_tag(data);
+	if (tag == BLK_MQ_TAG_FAIL)
+		goto err_tag;
+
+	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
 	if (!op_is_flush(data->cmd_flags)) {
 		rq->elv.icq = NULL;
 		if (e && e->type->ops.prepare_request) {
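
For anyone following the ordering in this hunk, here is a rough userspace model of it (not kernel code). It takes the hostwide tag first and the per-hctx tag second. The names tag_pool, pool_get(), pool_put() and get_request_tags() are made up for illustration, and releasing the shared tag when the per-hctx allocation fails is my assumption about what the elided error labels do.

```c
#include <assert.h>
#include <stdbool.h>

#define TAG_FAIL (-1)	/* stand-in for BLK_MQ_TAG_FAIL */
#define MAX_TAGS 64

/* Hypothetical fixed-size tag pool; the kernel uses sbitmap instead. */
struct tag_pool {
	bool used[MAX_TAGS];
	int depth;
};

/* Return the lowest free tag, or TAG_FAIL if the pool is exhausted. */
static int pool_get(struct tag_pool *p)
{
	assert(p->depth <= MAX_TAGS);
	for (int i = 0; i < p->depth; i++) {
		if (!p->used[i]) {
			p->used[i] = true;
			return i;
		}
	}
	return TAG_FAIL;
}

static void pool_put(struct tag_pool *p, int tag)
{
	assert(tag >= 0 && tag < p->depth && p->used[tag]);
	p->used[tag] = false;
}

/*
 * Take a hostwide tag first (when a shared pool is given), then a
 * per-hctx tag. On per-hctx failure the shared tag is handed back
 * (assumed behaviour, see above).
 */
static bool get_request_tags(struct tag_pool *shared, struct tag_pool *hctx,
			     int *tag, int *shared_tag)
{
	*shared_tag = TAG_FAIL;
	if (shared) {
		*shared_tag = pool_get(shared);
		if (*shared_tag == TAG_FAIL)
			return false;
	}

	*tag = pool_get(hctx);
	if (*tag == TAG_FAIL) {
		if (shared)
			pool_put(shared, *shared_tag);
		*shared_tag = TAG_FAIL;
		return false;
	}
	return true;
}
```

With a shared pool of depth N, the model never hands out more than N requests in total, however many per-hctx pools exist, which is the .can_queue limit the description asks for.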

Hi Hannes,

Why do you need to keep a parallel tag accounting between 'normal' and
'shared' tags here?
Isn't it sufficient to get a shared tag only, and use that in lieu of the
'real' one?

In theory, yes. Just the 'shared' tag should be adequate.

A problem I see with this approach is that we lose the identity of which tags are allocated for each hctx. As an example, consider blk_mq_queue_tag_busy_iter(), which iterates the bits for each hctx. Now, if you're using shared tags only, that wouldn't work.

Consider blk_mq_can_queue() as another example - this tells us if a hctx has any bits unset, while with shared tags only it would tell us whether any bits are unset across all queues, and this change in semantics could break things. At a glance, function __blk_mq_tag_idle() looks problematic also.
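
To illustrate the two points above, here is a toy userspace model, not the kernel implementation. It keeps one bitmap per hw queue plus one hostwide bitmap. struct tag_model, model_alloc(), busy_count(), hctx_can_queue() and shared_can_queue() are hypothetical names, and the sizes are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_HCTX   2			/* hypothetical number of hw queues */
#define DEPTH     4			/* hypothetical tags per bitmap */
#define FULL_MASK ((1u << DEPTH) - 1)

struct tag_model {
	uint32_t hctx_bits[NR_HCTX];	/* one bitmap per hw queue */
	uint32_t shared_bits;		/* one hostwide bitmap */
};

/* Record one request on queue 'hctx' holding both kinds of tag. */
static void model_alloc(struct tag_model *m, int hctx, int tag, int shared_tag)
{
	assert(hctx >= 0 && hctx < NR_HCTX);
	assert(tag >= 0 && tag < DEPTH && shared_tag >= 0 && shared_tag < DEPTH);
	m->hctx_bits[hctx] |= 1u << tag;
	m->shared_bits |= 1u << shared_tag;
}

/* Number of set bits in a bitmap, i.e. busy tags. */
static int busy_count(uint32_t bits)
{
	int n = 0;

	for (int i = 0; i < DEPTH; i++)
		n += (bits >> i) & 1u;
	return n;
}

/* Per-hctx question: does this queue have any bit unset? */
static bool hctx_can_queue(const struct tag_model *m, int hctx)
{
	return (m->hctx_bits[hctx] & FULL_MASK) != FULL_MASK;
}

/* Shared-only question: is any bit unset over all queues? */
static bool shared_can_queue(const struct tag_model *m)
{
	return (m->shared_bits & FULL_MASK) != FULL_MASK;
}
```

With two requests on queue 0 and two on queue 1, both per-hctx checks still report free bits while the hostwide check reports none. The shared bitmap alone can also only say that four tags are busy, not which queue owns them.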

And this is where it becomes messy to implement.


I would love to combine both,

Same here...

as then we can easily do a reverse mapping
by using the 'tag' value to lookup the command itself, and can possibly
do the 'scsi_cmd_priv' trick of embedding the LLDD-specific parts within
the command. With this split we'll be wasting quite some memory there,
as the possible 'tag' values are actually nr_hw_queues * shared_tags.

Yeah, understood. That's just a trade-off I saw.
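
To put a rough number on that trade-off, a back-of-the-envelope helper. The function names and the per-command size are hypothetical; the only figure taken from this thread is that the split leaves nr_hw_queues * shared_tags possible tag values, where the host can only ever use shared_tags of them.

```c
#include <assert.h>
#include <stddef.h>

/* Possible 'tag' values when every hctx keeps its own full-depth tag space. */
static size_t split_tag_space(size_t nr_hw_queues, size_t shared_tags)
{
	assert(nr_hw_queues > 0);
	return nr_hw_queues * shared_tags;
}

/*
 * Extra per-command memory relative to a shared-only scheme, assuming
 * (hypothetically) per_cmd_bytes of LLDD data embedded per possible tag.
 */
static size_t extra_cmd_bytes(size_t nr_hw_queues, size_t shared_tags,
			      size_t per_cmd_bytes)
{
	size_t split = split_tag_space(nr_hw_queues, shared_tags);

	return (split - shared_tags) * per_cmd_bytes;
}
```

For instance, with 16 hw queues, 4096 shared tags and 256 bytes per command (all made-up numbers), the split gives 65536 possible tag values and 15728640 bytes more than a shared-only layout would need.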

Thanks,
John