Re: [Xen-devel] [PATCH 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages
From: Roger Pau MonnÃ
Date: Tue Oct 06 2015 - 05:39:52 EST
El 05/10/15 a les 19.05, Julien Grall ha escrit:
> On 05/10/15 17:01, Roger Pau Monnà wrote:
>> El 11/09/15 a les 21.32, Julien Grall ha escrit:
>>> The minimal size of request in the block framework is always PAGE_SIZE.
>>> It means that when 64KB guest is support, the request will at least be
>>> 64KB.
>>>
>>> Although, if the backend doesn't support indirect grant (such as QDISK
>>> in QEMU), a ring request is only able to accomodate 11 segments of 4KB
>>> (i.e 44KB).
>>>
>>> The current frontend is assuming that an I/O request will always fit in
>>> a ring request. This is not true any more when using 64KB page
>>> granularity and will therefore crash during the boot.
>> ^ during boot.
>>>
>>> On ARM64, the ABI is completely neutral to the page granularity used by
>>> the domU. The guest has the choice between different page granularity
>>> supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
>>> This can't be enforced by the hypervisor and therefore it's possible to
>>> run guests using different page granularity.
>>>
>>> So we can't mandate the block backend to support non-indirect grant
>>> when the frontend is using 64KB page granularity and have to fix it
>>> properly in the frontend.
>>>
>>> The solution exposed below is based on modifying directly the frontend
>>> guest rather than asking the block framework to support smaller size
>>> (i.e < PAGE_SIZE). This is because the change is the block framework are
>>> not trivial as everything seems to relying on a struct *page (see [1]).
>>> Although, it may be possible that someone succeed to do it in the future
>>> and we would therefore be able to use advantage.
>> ^ it. (no advantage IMHO)
>>>
>>> Given that a block request may not fit in a single ring request, a
>>> second request is introduced for the data that cannot fit in the first
>>> one. This means that the second request should never be used on Linux
>>> configuration using a page granularity < 44KB.
>> ^ if the page size is smaller than 44KB.
>>>
>>> Note that the parameters blk_queue_max_* helpers haven't been updated.
>>> The block code will set mimimum size supported and we may be able to
>> ^ the minimum extra space ^
>>> support directly any change in the block framework that lower down the
>>> mimimal size of a request.
>> ^ minimal
>>
>> I have a concern regarding the splitting done in this patch.
>>
>> What happens with FUA requests when split? For example the frontend
>> receives a FUA requests with 64KB of data, and this is split into two
>> different requests on the ring, is this going to cause trouble in the
>> higher layers if for example the first request is completed but the
>> second is not? Could we leave the disk in a bad state as a consequence
>> of this?
>
> If a block request is split into two ring requests, we will wait the two
> responses before reporting the completion to the higher layers (see
> blkif_interrupt and blkif_completion).
>
> Furthermore, the second ring request will always use the same operation
> as the first one. Note that you will flush twice which is not nice but
> could be improved.
>
>>
>>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>>
>>> Signed-off-by: Julien Grall <julien.grall@xxxxxxxxxx>
>>>
>>> ---
>>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>>> Cc: "Roger Pau MonnÃ" <roger.pau@xxxxxxxxxx>
>>> Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
>>> Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
>>> ---
>>> drivers/block/xen-blkfront.c | 199 +++++++++++++++++++++++++++++++++++++++----
>>> 1 file changed, 183 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>>> index f9d55c3..03772c9 100644
>>> --- a/drivers/block/xen-blkfront.c
>>> +++ b/drivers/block/xen-blkfront.c
>>> @@ -60,6 +60,20 @@
>>>
>>> #include <asm/xen/hypervisor.h>
>>>
>>> +/*
>>> + * The block framework is always working on segment of PAGE_SIZE minimum.
>>
>> The above sentence needs to be reworded.
>
> What about:
>
> "The mininal size of the segment supported by the block framework is
> PAGE_SIZE."
That sounds fine.
>
>>
>>> + * When Linux is using a different page size than xen, it may not be possible
>>> + * to put all the data in a single segment.
>>> + * This can happen when the backend doesn't support indirect grant and
>> indirect requests ^
>>> + * therefore the maximum amount of data that a request can carry is
>>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
>>> + *
>>> + * Note that we only support one extra request. So the Linux page size
>>> + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
>>> + * 88KB.
>>> + */
>>> +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
>>
>> Since you are already introducing a preprocessor define, I would like
>> the code added to be protected by it.
>
> Most of the code is protected by an if (HAS_REQ_EXTRA && ...) or
> variable setting based on HAS_REQ_EXTRA. Furthermore there is extra
> protection with some BUG_ON.
>
> I would rather avoid to use the preprocessor to avoid ending up with:
>
> #ifdef HAS_EXTRA_REQ
> /* Code */
> #else
> /* Code */
> #endif
>
> and potentially miss update on the ARM64 with 64KB pages because
> currently most of people are developing PV drivers on x86.
Ok, I was also thinking whether a compile time assert would be fine:
BUILD_BUG_ON(PAGE_SIZE > 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST *
XEN_PAGE_SIZE);
But then if indirect descriptors are available the assert is no longer true.
>>> enum blkif_state {
>>> BLKIF_STATE_DISCONNECTED,
>>> BLKIF_STATE_CONNECTED,
>>> @@ -79,6 +93,19 @@ struct blk_shadow {
>>> struct grant **indirect_grants;
>>> struct scatterlist *sg;
>>> unsigned int num_sg;
>>
>> #if HAS_EXTRA_REQ
>>
>>> + enum
>>> + {
>>> + REQ_WAITING,
>>> + REQ_DONE,
>>> + REQ_FAIL
>>> + } status;
>>> +
>>> + #define NO_ASSOCIATED_ID ~0UL
>>> + /*
>>> + * Id of the sibling if we ever need 2 requests when handling a
>>> + * block I/O request
>>> + */
>>> + unsigned long associated_id;
>>
>> #endif
>
> See my remark above.
>
>>
>>> };
>>>
>>> struct split_bio {
>>> @@ -467,6 +494,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,
>>>
>>> id = get_id_from_freelist(info);
>>> info->shadow[id].request = req;
>>> + info->shadow[id].status = REQ_WAITING;
>>> + info->shadow[id].associated_id = NO_ASSOCIATED_ID;
>>>
>>> (*ring_req)->u.rw.id = id;
>>>
>>> @@ -508,6 +537,9 @@ struct setup_rw_req {
>>> bool need_copy;
>>> unsigned int bvec_off;
>>> char *bvec_data;
>>> +
>>> + bool require_extra_req;
>>> + struct blkif_request *ring_req2;
>>
>> extra_ring_req?
>
> Will do.
>
>
>>> };
>>>
>>> static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>> @@ -521,8 +553,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>> unsigned int grant_idx = setup->grant_idx;
>>> struct blkif_request *ring_req = setup->ring_req;
>>> struct blkfront_info *info = setup->info;
>>> + /*
>>> + * We always use the shadow of the first request to store the list
>>> + * of grant associated to the block I/O request. This made the
>> ^ grants ^ makes
>>> + * completion more easy to handle even if the block I/O request is
>>> + * split.
>>> + */
>>> struct blk_shadow *shadow = &info->shadow[setup->id];
>>>
>>> + if (unlikely(setup->require_extra_req &&
>>> + grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
>>> + /*
>>> + * We are using the second request, setup grant_idx
>>> + * to be the index of the segment array
>>> + */
>>> + grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> + ring_req = setup->ring_req2;
>>> + }
>>> +
>>> if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
>>> (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
>>> if (setup->segments)
>>> @@ -537,7 +585,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>>
>>> gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
>>> ref = gnt_list_entry->gref;
>>> - shadow->grants_used[grant_idx] = gnt_list_entry;
>>> + /*
>>> + * All the grants are stored in the shadow of the first
>>> + * request. Therefore we have to use the global index
>>> + */
>>> + shadow->grants_used[setup->grant_idx] = gnt_list_entry;
>>>
>>> if (setup->need_copy) {
>>> void *shared_data;
>>> @@ -579,11 +631,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>> (setup->grant_idx)++;
>>> }
>>>
>>> +static void blkif_setup_extra_req(struct blkif_request *first,
>>> + struct blkif_request *second)
>>> +{
>>> + uint16_t nr_segments = first->u.rw.nr_segments;
>>> +
>>> + /*
>>> + * The second request is only present when the first request uses
>>> + * all its segments. It's always the continuity of the first one
>>> + */
>>> + first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +
>>> + second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> + second->u.rw.sector_number = first->u.rw.sector_number +
>>> + (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
>>
>> Does this need to take into account if sectors have been skipped in the
>> previous requests due to empty data?
>
> I'm not sure to understand this question.
>
> AFAIU, the data is always contiguous in a block segment even though it
> can be split accross multiple Linux page. The number of segments is
> calculated from the amount of data see "Calculate the number of grant
> used" the code. The second request will only be created if this number
> is greater than BLKIF_MAX_SEGMENTS_PER_REQUEST.
>
> So the second request will always be the continuity of the first one.
Ack.
>> Also I'm not sure how correct it is to hardcode 512 here.
>
> Well, we are hardcoding it in blkfront and the block framework does the
> same (see blk_limits_max_hw_sectors).
Right, seems like the whole block system is always working with 512 as
the basic unit.
>>
>>> +
>>> + second->u.rw.handle = first->u.rw.handle;
>>> + second->operation = first->operation;
>>> +}
>>> +
>>> static int blkif_queue_rw_req(struct request *req)
>>> {
>>> struct blkfront_info *info = req->rq_disk->private_data;
>>> - struct blkif_request *ring_req;
>>> - unsigned long id;
>>> + struct blkif_request *ring_req, *ring_req2 = NULL;
>>> + unsigned long id, id2 = NO_ASSOCIATED_ID;
>>> + bool require_extra_req = false;
>>> int i;
>>> struct setup_rw_req setup = {
>>> .grant_idx = 0,
>>> @@ -628,19 +700,19 @@ static int blkif_queue_rw_req(struct request *req)
>>> /* Fill out a communications ring structure. */
>>> id = blkif_ring_get_request(info, req, &ring_req);
>>>
>>> - BUG_ON(info->max_indirect_segments == 0 &&
>>> - GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>> - BUG_ON(info->max_indirect_segments &&
>>> - GREFS(req->nr_phys_segments) > info->max_indirect_segments);
>>> -
>>> num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
>>> num_grant = 0;
>>> /* Calculate the number of grant used */
>>> for_each_sg(info->shadow[id].sg, sg, num_sg, i)
>>> num_grant += gnttab_count_grant(sg->offset, sg->length);
>>>
>>> + require_extra_req = info->max_indirect_segments == 0 &&
>>> + num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> + BUG_ON(!HAS_EXTRA_REQ && require_extra_req);
>>> +
>>> info->shadow[id].num_sg = num_sg;
>>> - if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>>> + if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
>>> + likely(!require_extra_req)) {
>>> /*
>>> * The indirect operation can only be a BLKIF_OP_READ or
>>> * BLKIF_OP_WRITE
>>> @@ -680,10 +752,29 @@ static int blkif_queue_rw_req(struct request *req)
>>> }
>>> }
>>> ring_req->u.rw.nr_segments = num_grant;
>>> + if (unlikely(require_extra_req)) {
>>> + id2 = blkif_ring_get_request(info, req, &ring_req2);
>>
>> How can you guarantee that there's always going to be another free
>> request? AFAICT blkif_queue_rq checks for RING_FULL, but you don't
>> actually know if there's only one slot or more than one available.
>
> Because the depth of the queue is divided by 2 when the extra request is
> used (see xlvbd_init_blk_queue).
I see, that's quite restrictive but I guess it's better than introducing
a new ring macro in order to figure out if there are at least two free
slots.
Roger.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/