RE: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory

From: Michael Kelley
Date: Mon Mar 08 2021 - 14:30:51 EST


From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx> Sent: Monday, March 8, 2021 11:14 AM
>
> On 2/8/2021 11:45 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx> Sent: Friday, November 20, 2020 4:30 PM
> >>

[snip]

> >> @@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
> >>  	return -hv_status_to_errno(status);
> >>  }
> >>
> >> +static int
> >> +hv_call_map_gpa_pages(u64 partition_id,
> >> +		      u64 gpa_target,
> >> +		      u64 page_count, u32 flags,
> >> +		      struct page **pages)
> >> +{
> >> +	struct hv_map_gpa_pages *input_page;
> >> +	int status;
> >> +	int i;
> >> +	struct page **p;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +	int ret = 0;
> >> +
> >> +	while (remaining) {
> >> +
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +
> >> +		local_irq_save(irq_flags);
> >> +		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_input_arg));
> >> +
> >> +		input_page->target_partition_id = partition_id;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +		input_page->map_flags = flags;
> >> +
> >> +		for (i = 0, p = pages; i < rep_count; i++, p++)
> >> +			input_page->source_gpa_page_list[i] =
> >> +				page_to_pfn(*p) & HV_MAP_GPA_MASK;
> >
> > The masking seems a bit weird. The mask allows for up to 64G page frames,
> > or 256 TBytes of total physical memory, which is probably the current
> > Hyper-V limit on memory size (48-bit physical address space, though 52-bit
> > physical address spaces are coming). So the masking should never be doing
> > anything. And if it were doing something, that should probably be treated as
> > an error rather than silently dropping the high bits.
>
> Good point - It looks like the mask isn't needed.
>
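With the mask gone, the loop body presumably just becomes:

	for (i = 0, p = pages; i < rep_count; i++, p++)
		input_page->source_gpa_page_list[i] = page_to_pfn(*p);
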
> >
> > Note that this code does not handle the case where PAGE_SIZE !=
> > HV_HYP_PAGE_SIZE. But maybe we'll never run the root partition with a
> > page size other than 4K.
> >
>
> On x86 it won't happen for now, but maybe it could on ARM?
> It shouldn't be hard to support this case, especially since
> PAGE_SIZE >= HV_HYP_PAGE_SIZE. Do you think we need it in this patch set?

No, from my perspective, this case does not need to be handled in
this patch set.
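
If it ever does become necessary (e.g. 16K or 64K pages on ARM64), the
conversion is mostly bookkeeping: each Linux page expands into
PAGE_SIZE / HV_HYP_PAGE_SIZE hypervisor page frames. Roughly (untested
sketch, variable names purely illustrative):

	/* Sketch only: assumes PAGE_SIZE is a multiple of HV_HYP_PAGE_SIZE */
	unsigned int hv_pfns_per_page = PAGE_SIZE / HV_HYP_PAGE_SIZE;
	u64 hv_pfn = page_to_pfn(page) * hv_pfns_per_page;

	for (j = 0; j < hv_pfns_per_page; j++)
		input_page->source_gpa_page_list[idx++] = hv_pfn + j;

The rep count would then be in units of HV_HYP_PAGE_SIZE pages rather
than Linux pages.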

>
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >> +		local_irq_restore(irq_flags);
> >> +
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +
> >> +		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +						    partition_id, 256);
> >
> > Why add 256 pages? I'm just contrasting with other places that add
> > one page at a time. Maybe add a comment to explain the choice ...
> >
>
> Empirically determined. I'll add a #define and comment.
>
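A short comment plus a #define would capture that; for example (name
purely illustrative):

	/*
	 * Number of pages to deposit when the hypervisor reports
	 * HV_STATUS_INSUFFICIENT_MEMORY on a map/unmap hypercall.
	 * The value was determined empirically.
	 */
	#define HV_MAP_GPA_DEPOSIT_PAGES	256
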
> >> +			if (ret)
> >> +				break;
> >> +		} else if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		pages += completed;
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +	}
> >> +
> >> +	if (ret && completed) {
> >
> > Is the above the right test? Completed could be zero from the most
> > recent iteration, yet the operation could still have partially succeeded
> > based on a previous successful iteration. I think this needs to check
> > whether remaining equals page_count.
> >
>
> You're right; I'll change it to (ret && remaining < page_count)
>
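So that block would end up as:

	if (ret && remaining < page_count) {
		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
		       __func__);
		ret = -EBADFD;
	}
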
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +hv_call_unmap_gpa_pages(u64 partition_id,
> >> +			u64 gpa_target,
> >> +			u64 page_count, u32 flags)
> >> +{
> >> +	struct hv_unmap_gpa_pages *input_page;
> >> +	int status;
> >> +	int ret = 0;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +
> >> +	local_irq_save(irq_flags);
> >> +	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
> >> +		hyperv_pcpu_input_arg));
> >> +
> >> +	input_page->target_partition_id = partition_id;
> >> +	input_page->target_gpa_base = gpa_target;
> >> +	input_page->unmap_flags = flags;
> >> +
> >> +	while (remaining) {
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >
> > Similarly, this code doesn't handle PAGE_SIZE != HV_HYP_PAGE_SIZE.
> >
>
> As above - do we need this for this patch set? This won't happen on x86.

Again, not needed from my perspective.

>
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +		if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +	}
> >> +	local_irq_restore(irq_flags);
> >
> > I have some concern about holding interrupts disabled for this long.
> >
>
> How about I move the interrupt enabling/disabling inside the loop? i.e.:
> 	while (remaining) {
> 		local_irq_save(irq_flags);
> 		input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
> 			hyperv_pcpu_input_arg));
>
> 		input_page->target_partition_id = partition_id;
> 		input_page->target_gpa_base = gpa_target;
> 		input_page->unmap_flags = flags;
> 		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> 		status = hv_do_rep_hypercall(
> 			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> 		local_irq_restore(irq_flags);
>
> 		completed = (status & HV_HYPERCALL_REP_COMP_MASK) >>
> 				HV_HYPERCALL_REP_COMP_OFFSET;
> 		status &= HV_HYPERCALL_RESULT_MASK;
> 		if (status != HV_STATUS_SUCCESS) {
> 			pr_err("%s: completed %llu out of %llu, %s\n",
> 			       __func__,
> 			       page_count - remaining, page_count,
> 			       hv_status_to_string(status));
> 			ret = -hv_status_to_errno(status);
> 			break;
> 		}
>
> 		remaining -= completed;
> 		gpa_target += completed;
> 	}
>
>

Yes, that would help.
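
One related point: with interrupts enabled between hypercalls, the
contents of the per-cpu hyperv_pcpu_input_arg page can't be assumed to
survive from one iteration to the next (anything else that issues a
hypercall on that CPU reuses it), so re-populating the input page at
the top of each iteration, as your version does, is required rather
than just tidy.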

> >> +
> >> +	if (ret && completed) {
> >
> > Same comment as before.
> >
>
> Ditto as above.
>
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static long
> >> +mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
> >> +				struct mshv_user_mem_region __user *user_mem)
> >> +{
> >> +	struct mshv_user_mem_region mem;
> >> +	struct mshv_mem_region *region;
> >> +	int completed;
> >> +	unsigned long remaining, batch_size;
> >> +	int i;
> >> +	struct page **pages;
> >> +	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
> >> +	u64 region_page_count, region_user_start, region_user_end;
> >> +	u64 region_gpfn_start, region_gpfn_end;
> >> +	long ret = 0;
> >> +
> >> +	/* Check we have enough slots*/
> >> +	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
> >> +		pr_err("%s: not enough memory region slots\n", __func__);
> >> +		return -ENOSPC;
> >> +	}
> >> +
> >> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> >> +		return -EFAULT;
> >> +
> >> +	if (!mem.size ||
> >> +	    mem.size & (PAGE_SIZE - 1) ||
> >> +	    mem.userspace_addr & (PAGE_SIZE - 1) ||
> >
> > There's a PAGE_ALIGNED macro that expresses exactly what
> > each of the previous two tests is doing.
> >
>
> Since these need to be HV_HYP_PAGE_SIZE aligned, I will add a
> HV_HYP_PAGE_ALIGNED macro for this.

I was thinking that PAGE_SIZE and PAGE_ALIGNED are correct. If
this code were running on an ARM64 system with a 64K page
size, the 64K alignment would be fine and would make sense from
the user space perspective: you don't want to be mapping part
of a user space page. And 64K alignment certainly satisfies
Hyper-V's requirement for 4K alignment. The real requirement
from Hyper-V's standpoint is that the alignment not be smaller
than 4K. But maybe I'm misunderstanding.
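
In other words, the existing tests could just use PAGE_ALIGNED(),
roughly:

	if (!mem.size ||
	    !PAGE_ALIGNED(mem.size) ||
	    !PAGE_ALIGNED(mem.userspace_addr) ||
	    ...

with the rest of the condition unchanged.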

Michael