[PATCH 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages

From: Julien Grall
Date: Fri Sep 11 2015 - 15:33:48 EST

Next message: Lee Jones: "Re: [PATCH v2 7/9] ARM: STi: DT: STiH407: Add FDMA driver dt nodes."
Previous message: Julien Grall: "[PATCH 0/2] block/xen-blkfront: Support non-indirect with 64KB page granularity"
In reply to: Julien Grall: "[PATCH 1/2] block/xen-blkfront: Introduce blkif_ring_get_request"
Next in thread: Bob Liu: "Re: [Xen-devel] [PATCH 0/2] block/xen-blkfront: Support non-indirect with 64KB page granularity"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

The minimal size of request in the block framework is always PAGE_SIZE.
It means that when 64KB guest is support, the request will at least be
64KB.

Although, if the backend doesn't support indirect grant (such as QDISK
in QEMU), a ring request is only able to accomodate 11 segments of 4KB
(i.e 44KB).

The current frontend is assuming that an I/O request will always fit in
a ring request. This is not true any more when using 64KB page
granularity and will therefore crash during the boot.

On ARM64, the ABI is completely neutral to the page granularity used by
the domU. The guest has the choice between different page granularity
supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
This can't be enforced by the hypervisor and therefore it's possible to
run guests using different page granularity.

So we can't mandate the block backend to support non-indirect grant
when the frontend is using 64KB page granularity and have to fix it
properly in the frontend.

The solution exposed below is based on modifying directly the frontend
guest rather than asking the block framework to support smaller size
(i.e < PAGE_SIZE). This is because the change is the block framework are
not trivial as everything seems to relying on a struct *page (see [1]).
Although, it may be possible that someone succeed to do it in the future
and we would therefore be able to use advantage.

Given that a block request may not fit in a single ring request, a
second request is introduced for the data that cannot fit in the first
one. This means that the second request should never be used on Linux
configuration using a page granularity < 44KB.

Note that the parameters blk_queue_max_* helpers haven't been updated.
The block code will set mimimum size supported and we may be able to
support directly any change in the block framework that lower down the
mimimal size of a request.

[1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html

Signed-off-by: Julien Grall <julien.grall@xxxxxxxxxx>

---
Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Cc: "Roger Pau MonnÃ" <roger.pau@xxxxxxxxxx>
Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
---
drivers/block/xen-blkfront.c | 199 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 183 insertions(+), 16 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index f9d55c3..03772c9 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -60,6 +60,20 @@

#include <asm/xen/hypervisor.h>

+/*
+ * The block framework is always working on segment of PAGE_SIZE minimum.
+ * When Linux is using a different page size than xen, it may not be possible
+ * to put all the data in a single segment.
+ * This can happen when the backend doesn't support indirect grant and
+ * therefore the maximum amount of data that a request can carry is
+ * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
+ *
+ * Note that we only support one extra request. So the Linux page size
+ * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
+ * 88KB.
+ */
+#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
+
enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
@@ -79,6 +93,19 @@ struct blk_shadow {
struct grant **indirect_grants;
struct scatterlist *sg;
unsigned int num_sg;
+ enum
+ {
+ REQ_WAITING,
+ REQ_DONE,
+ REQ_FAIL
+ } status;
+
+ #define NO_ASSOCIATED_ID ~0UL
+ /*
+ * Id of the sibling if we ever need 2 requests when handling a
+ * block I/O request
+ */
+ unsigned long associated_id;
};

struct split_bio {
@@ -467,6 +494,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,

id = get_id_from_freelist(info);
info->shadow[id].request = req;
+ info->shadow[id].status = REQ_WAITING;
+ info->shadow[id].associated_id = NO_ASSOCIATED_ID;

(*ring_req)->u.rw.id = id;

@@ -508,6 +537,9 @@ struct setup_rw_req {
bool need_copy;
unsigned int bvec_off;
char *bvec_data;
+
+ bool require_extra_req;
+ struct blkif_request *ring_req2;
};

static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
@@ -521,8 +553,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
unsigned int grant_idx = setup->grant_idx;
struct blkif_request *ring_req = setup->ring_req;
struct blkfront_info *info = setup->info;
+ /*
+ * We always use the shadow of the first request to store the list
+ * of grant associated to the block I/O request. This made the
+ * completion more easy to handle even if the block I/O request is
+ * split.
+ */
struct blk_shadow *shadow = &info->shadow[setup->id];

+ if (unlikely(setup->require_extra_req &&
+ grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
+ /*
+ * We are using the second request, setup grant_idx
+ * to be the index of the segment array
+ */
+ grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ ring_req = setup->ring_req2;
+ }
+
if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
(grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
if (setup->segments)
@@ -537,7 +585,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,

gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
ref = gnt_list_entry->gref;
- shadow->grants_used[grant_idx] = gnt_list_entry;
+ /*
+ * All the grants are stored in the shadow of the first
+ * request. Therefore we have to use the global index
+ */
+ shadow->grants_used[setup->grant_idx] = gnt_list_entry;

if (setup->need_copy) {
void *shared_data;
@@ -579,11 +631,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
(setup->grant_idx)++;
}

+static void blkif_setup_extra_req(struct blkif_request *first,
+ struct blkif_request *second)
+{
+ uint16_t nr_segments = first->u.rw.nr_segments;
+
+ /*
+ * The second request is only present when the first request uses
+ * all its segments. It's always the continuity of the first one
+ */
+ first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+ second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ second->u.rw.sector_number = first->u.rw.sector_number +
+ (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
+
+ second->u.rw.handle = first->u.rw.handle;
+ second->operation = first->operation;
+}
+
static int blkif_queue_rw_req(struct request *req)
{
struct blkfront_info *info = req->rq_disk->private_data;
- struct blkif_request *ring_req;
- unsigned long id;
+ struct blkif_request *ring_req, *ring_req2 = NULL;
+ unsigned long id, id2 = NO_ASSOCIATED_ID;
+ bool require_extra_req = false;
int i;
struct setup_rw_req setup = {
.grant_idx = 0,
@@ -628,19 +700,19 @@ static int blkif_queue_rw_req(struct request *req)
/* Fill out a communications ring structure. */
id = blkif_ring_get_request(info, req, &ring_req);

- BUG_ON(info->max_indirect_segments == 0 &&
- GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
- BUG_ON(info->max_indirect_segments &&
- GREFS(req->nr_phys_segments) > info->max_indirect_segments);
-
num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
num_grant = 0;
/* Calculate the number of grant used */
for_each_sg(info->shadow[id].sg, sg, num_sg, i)
num_grant += gnttab_count_grant(sg->offset, sg->length);

+ require_extra_req = info->max_indirect_segments == 0 &&
+ num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ BUG_ON(!HAS_EXTRA_REQ && require_extra_req);
+
info->shadow[id].num_sg = num_sg;
- if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
+ likely(!require_extra_req)) {
/*
* The indirect operation can only be a BLKIF_OP_READ or
* BLKIF_OP_WRITE
@@ -680,10 +752,29 @@ static int blkif_queue_rw_req(struct request *req)
}
}
ring_req->u.rw.nr_segments = num_grant;
+ if (unlikely(require_extra_req)) {
+ id2 = blkif_ring_get_request(info, req, &ring_req2);
+ /*
+ * Only the first request contain the scatter-gather
+ * list.
+ */
+ info->shadow[id2].num_sg = 0;
+
+ blkif_setup_extra_req(ring_req, ring_req2);
+
+ /* Link the 2 requests together */
+ info->shadow[id2].associated_id = id;
+ info->shadow[id].associated_id = id2;
+ }
}

setup.ring_req = ring_req;
setup.id = id;
+
+ setup.require_extra_req = require_extra_req;
+ if (unlikely(require_extra_req))
+ setup.ring_req2 = ring_req2;
+
for_each_sg(info->shadow[id].sg, sg, num_sg, i) {
BUG_ON(sg->offset + sg->length > PAGE_SIZE);

@@ -706,6 +797,8 @@ static int blkif_queue_rw_req(struct request *req)

/* Keep a private copy so we can reissue requests when recovering. */
info->shadow[id].req = *ring_req;
+ if (unlikely(require_extra_req))
+ info->shadow[id2].req = *ring_req2;

if (new_persistent_gnts)
gnttab_free_grant_references(setup.gref_head);
@@ -797,7 +890,15 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
memset(&info->tag_set, 0, sizeof(info->tag_set));
info->tag_set.ops = &blkfront_mq_ops;
info->tag_set.nr_hw_queues = 1;
- info->tag_set.queue_depth = BLK_RING_SIZE(info);
+ if (HAS_EXTRA_REQ && info->max_indirect_segments == 0) {
+ /* When non-indirect grant is not supported, the I/O request
+ * will be split between mutiple request in the ring.
+ * To avoid problem when sending the request, divide by
+ * 2 the depth of the queue.
+ */
+ info->tag_set.queue_depth = BLK_RING_SIZE(info) / 2;
+ } else
+ info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
info->tag_set.cmd_size = 0;
@@ -1217,19 +1318,67 @@ static void blkif_copy_from_grant(unsigned long gfn, unsigned int offset,
kunmap_atomic(shared_data);
}

-static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
+static bool blkif_completion(unsigned long *id, struct blkfront_info *info,
struct blkif_response *bret)
{
int i = 0;
struct scatterlist *sg;
int num_sg, num_grant;
+ struct blk_shadow *s = &info->shadow[*id];
struct copy_from_grant data = {
- .s = s,
.grant_idx = 0,
};

num_grant = s->req.operation == BLKIF_OP_INDIRECT ?
s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
+
+ /* The I/O request may be split in two */
+ if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) {
+ struct blk_shadow *s2 = &info->shadow[s->associated_id];
+
+ /* Keep the status of the current response in shadow */
+ s->status = (bret->status == BLKIF_RSP_OKAY) ?
+ REQ_DONE : REQ_FAIL;
+
+ /* Wait the second response if not yet here */
+ if (s2->status == REQ_WAITING)
+ return 0;
+
+ /*
+ * The status of the current response will be used in
+ * order to know if the request has failed.
+ * Update the current response status only if has not
+ * failed.
+ */
+ if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL)
+ bret->status = BLKIF_RSP_ERROR;
+
+ /*
+ * All the grants is stored in the first shadow in order
+ * to make the completion code simpler.
+ */
+ num_grant += s2->req.u.rw.nr_segments;
+
+ /*
+ * The two responses may not come in order. Only the
+ * first request will store the scatter-gather list
+ */
+ if (s2->num_sg != 0) {
+ /* Update "id" with the ID of the first response */
+ *id = s->associated_id;
+ s = s2;
+ }
+
+ /* We don't need anymore the second request, so recycling
+ * it now
+ */
+ if (add_id_to_freelist(info, s->associated_id))
+ WARN(1, "%s: can't recycle the second part (id = %ld)"
+ "of the request\n",
+ info->gd->disk_name, s->associated_id);
+ }
+
+ data.s = s;
num_sg = s->num_sg;

if (bret->operation == BLKIF_OP_READ && info->feature_persistent) {
@@ -1299,6 +1448,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
}
}
}
+
+ return 1;
}

static irqreturn_t blkif_interrupt(int irq, void *dev_id)
@@ -1339,8 +1490,14 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
}
req = info->shadow[id].request;

- if (bret->operation != BLKIF_OP_DISCARD)
- blkif_completion(&info->shadow[id], info, bret);
+ if (bret->operation != BLKIF_OP_DISCARD) {
+ /*
+ * We may need to wait for an extra response if the
+ * I/O request is split in 2
+ */
+ if (!blkif_completion(&id, info, bret))
+ continue;
+ }

if (add_id_to_freelist(info, id)) {
WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
@@ -1863,8 +2020,18 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
unsigned int psegs, grants;
int err, i;

- if (info->max_indirect_segments == 0)
- grants = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ if (info->max_indirect_segments == 0) {
+ if (!HAS_EXTRA_REQ)
+ grants = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ else {
+ /*
+ * When an extra req is required, the maximum
+ * grants supported is related to the size of the
+ * Linux block segment
+ */
+ grants = GRANTS_PER_PSEG;
+ }
+ }
else
grants = info->max_indirect_segments;
psegs = grants / GRANTS_PER_PSEG;
--
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Lee Jones: "Re: [PATCH v2 7/9] ARM: STi: DT: STiH407: Add FDMA driver dt nodes."
Previous message: Julien Grall: "[PATCH 0/2] block/xen-blkfront: Support non-indirect with 64KB page granularity"
In reply to: Julien Grall: "[PATCH 1/2] block/xen-blkfront: Introduce blkif_ring_get_request"
Next in thread: Bob Liu: "Re: [Xen-devel] [PATCH 0/2] block/xen-blkfront: Support non-indirect with 64KB page granularity"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]