RE: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
From: Michael Kelley
Date: Thu Mar 13 2025 - 23:28:14 EST
From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, March 13, 2025 7:16 PM
>
> On 3/13/2025 9:43 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx> Sent: Wednesday,
> February 26, 2025 3:08 PM
> >>
> >
> > I've done a partial review of the code in this patch. See comments inline
> > as usual.
> >
> > I'd like to still review most of the code in mshv_root_main.c, and maybe
> > some of mshv_synic.c and include/uapi/linux/mshv.h. I'll send a separate
> > email with those comments when I complete them. The patch is huge, so
> > I'm breaking my review comments into two parts.
> >
> > I've glanced through mshv_eventfd.c, mshv_eventfd.h, and mshv_irq.c,
> > but I don't have enough knowledge/expertise in these areas to add any
> > useful comments, so I'm not planning to review them further.
> >
> Thanks for taking a look. Just so you know, I was getting ready to post v6 of
> this patchset when I saw this email. So not all the comments will be addressed
> in the next version, but I've noted them and I will keep an eye out for the
> second part if you send it after v6 is posted.
That's fine.
>
> <snip>
> >> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> index 6d1465315df3..66dcfaae698b 100644
> >> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> @@ -370,6 +370,8 @@ Code Seq# Include File Comments
> >> 0xB7 all uapi/linux/remoteproc_cdev.h <mailto:linux-
> >> remoteproc@xxxxxxxxxxxxxxx>
> >> 0xB7 all uapi/linux/nsfs.h <mailto:Andrei Vagin
> >> <avagin@xxxxxxxxxx>>
> >> 0xB8 01-02 uapi/misc/mrvl_cn10k_dpi.h Marvell CN10K DPI driver
> >> +0xB8 all uapi/linux/mshv.h Microsoft Hyper-V /dev/mshv driver
> >
> > Hmmm. Doesn't this mean that the mshv ioctls overlap with the Marvell
> > CN10K DPI ioctls? Is that intentional? I thought the goal of the central
> > registry in ioctl-number.rst is to avoid overlap.
> >
> Yes, they overlap. In practice it really doesn't matter IMO - IOCTL numbers
> are only interpreted by the driver of the device that the ioctl() syscall
> is made on.
>
> I believe the whole scheme to generate unique IOCTL numbers and try not to
> overlap them was for some case I'm not familiar with - something like
> multiple drivers handling IOCTLs on the same device FD? And maybe it's handy
> in debugging if you see an IOCTL number in isolation and want to know where
> it comes from?
Yes, I think the debugging aspect is one that is mentioned in the text in
ioctl-number.rst. For example, maybe you want to filter the ioctl() system call
based on a particular ioctl value.
>
> On a practical note, we have been using this IOCTL range for some time
> in other upstream code like our userspace rust library which interfaces with
> this driver (https://github.com/rust-vmm/mshv). So it would also be nice to
> keep that all working as much as possible with the kernel code that is on
> this mailing list.
I can see that having to change the ioctl values could be a pain. And apparently
there are already some historical overlaps. Also I saw that later in Patch 10 where
the mshv ioctls are defined, you have some overlaps just within the mshv space.
I don't really like the idea of having overlaps, either within the mshv space, or
with other drivers. Doing a transition to non-overlapping values would probably
mean the driver having to accept both the "old" and "new" values for a while
until the rust library can be updated and deployed copies using the old values
go away. It could be done in a relatively seamless fashion, but I can't really
make a strong argument that it would be worth the hassle.
>
> <snip>
> >> +#endif /* _MSHV_H */
> >> diff --git a/drivers/hv/mshv_common.c b/drivers/hv/mshv_common.c
> >> new file mode 100644
> >> index 000000000000..d97631dcbee1
> >> --- /dev/null
> >> +++ b/drivers/hv/mshv_common.c
> >> @@ -0,0 +1,161 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Copyright (c) 2024, Microsoft Corporation.
> >> + *
> >> + * This file contains functions that are called from one or more modules: ROOT,
> >> + * DIAG, or VTL. If any of these modules are configured to build, this file is
> >
> > What are the DIAG and VTL modules? I see only a root module in the Makefile.
> >
> Ah, yep, they are not in this patchset but will follow. I can remove the references to them
> here, and make this comment future tense: "functions that WILL
> be called from one or more modules".
>
> <snip>
> >> +
> >> +struct mshv_vp {
> >> + u32 vp_index;
> >> + struct mshv_partition *vp_partition;
> >> + struct mutex vp_mutex;
> >> + struct hv_vp_register_page *vp_register_page;
> >> + struct hv_message *vp_intercept_msg_page;
> >> + void *vp_ghcb_page;
> >> + struct hv_stats_page *vp_stats_pages[2];
> >> + struct {
> >> + atomic64_t vp_signaled_count;
> >> + struct {
> >> + u64 intercept_suspend: 1;
> >> + u64 root_sched_blocked: 1; /* root scheduler only */
> >> + u64 root_sched_dispatched: 1; /* root scheduler only */
> >> + u64 reserved: 62;
> >
> > Hmmm. This looks like 65 bits allocated in a u64.
> >
> Indeed it is, good catch
>
> >> +
> >> + /*
> >> + * Since MSHV does not support more than one async hypercall in flight
> >
> > Wording is a bit messed up. Drop the "Since"?
> >
> Yep, thanks
>
> >> + * for a single partition. Thus, it is okay to define per partition
> >> + * async hypercall status.
> >> + */
> <snip>
> >> +
> >> +extern struct mshv_root mshv_root;
> >> +extern enum hv_scheduler_type hv_scheduler_type;
> >> +extern u8 __percpu **hv_synic_eventring_tail;
> >
> > Per comments on an earlier patch, the __percpu is in the wrong place.
> >
> Thanks, will fix here too.
> <snip>
> >> +int hv_call_create_partition(u64 flags,
> >> + struct hv_partition_creation_properties creation_properties,
> >> + union hv_partition_isolation_properties isolation_properties,
> >> + u64 *partition_id)
> >> +{
> >> + struct hv_input_create_partition *input;
> >> + struct hv_output_create_partition *output;
> >> + u64 status;
> >> + int ret;
> >> + unsigned long irq_flags;
> >> +
> >> + do {
> >> + local_irq_save(irq_flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> >> +
> >> + memset(input, 0, sizeof(*input));
> >> + input->flags = flags;
> >> + input->compatibility_version = HV_COMPATIBILITY_21_H2;
> >> +
> >> + memcpy(&input->partition_creation_properties, &creation_properties,
> >> + sizeof(creation_properties));
> >
> > This is an example of a generic question/concern that occurs in several places. By
> > doing a memcpy into the hypercall input, the assumption is that the creation
> > properties supplied by the caller have zeros in all the reserved or unused fields.
> > Is that a valid assumption?
> >
> When the entire struct is provided as a function parameter, I think it's a valid
> assumption that that struct is initialized correctly by the caller.
>
> The alternative (taking it to an extreme, in my opinion) is that we go through
> each field in the parameters and assign them all individually, which could be quite
> a lot of fields. E.g. going through all the bits in these structs with 60+ bitfields
> and re-setting them here to be sure the reserved bits are 0.
Agreed -- I would not advocate the extreme alternative. But perhaps the
requirement on the caller to correctly initialize all the memory should
be made more explicit in the cases where that's true.
>
> >> +
> >> + memcpy(&input->isolation_properties, &isolation_properties,
> >> + sizeof(isolation_properties));
> >> +
> >> + status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
> >> + input, output);
<snip>
> >> +/* Ask the hypervisor to map guest ram pages or the guest mmio space */
> >> +static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >> + u32 flags, struct page **pages, u64 mmio_spa)
> >> +{
> >> + struct hv_input_map_gpa_pages *input_page;
> >> + u64 status, *pfnlist;
> >> + unsigned long irq_flags, large_shift = 0;
> >> + int ret = 0, done = 0;
> >> + u64 page_count = page_struct_count;
> >> +
> >> + if (page_count == 0 || (pages && mmio_spa))
> >> + return -EINVAL;
> >> +
> >> + if (flags & HV_MAP_GPA_LARGE_PAGE) {
> >> + if (mmio_spa)
> >> + return -EINVAL;
> >> +
> >> + if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
> >> + return -EINVAL;
> >> +
> >> + large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
> >> + page_count >>= large_shift;
> >> + }
> >> +
> >> + while (done < page_count) {
> >> + ulong i, completed, remain = page_count - done;
> >> + int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> >> +
> >> + local_irq_save(irq_flags);
> >> + input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +
> >> + input_page->target_partition_id = partition_id;
> >> + input_page->target_gpa_base = gfn + (done << large_shift);
> >> + input_page->map_flags = flags;
> >> + pfnlist = input_page->source_gpa_page_list;
> >> +
> >> + for (i = 0; i < rep_count; i++)
> >> + if (flags & HV_MAP_GPA_NO_ACCESS) {
> >> + pfnlist[i] = 0;
> >> + } else if (pages) {
> >> + u64 index = (done + i) << large_shift;
> >> +
> >> + if (index >= page_struct_count) {
> >
> > Can this test ever be true? It looks like the pages array must
> > have space for each 4K page even if mapping in 2Meg granularity.
> > But only every 512th entry in the pages array is looked at
> > (which seems a little weird). But based on how rep_count is set up,
> > I don't see how the algorithm could go past the end of the pages
> > array.
> >
> I don't think the test can actually be true - IIRC I wrote it as a kind
> of "is my math correct?" sanity check, and there was a pr_err() or a
> WARN() here in a previous iteration of the code.
>
> The large page list is a bit weird - When we allocate the large pages in
> the kernel, we get all the (4K) page structs for that range back from the
> kernel, and we hang onto them. When mapping the large pages into the
> hypervisor we just have to map the PFN of the first page of each 2M page,
> hence the skipping.
>
> Now I'm thinking about it again, maybe we can discard most of the 4K page
> structs the kernel gives back and keep it as a packed array of the "head"
> pages which are all we really need (and then also simplify this mapping
> code and save some memory).
>
> The current code was just the simplest way to add the large page
> functionality on top of what we already had, but looks like it could
> probably be improved.
>
> >> + ret = -EINVAL;
> >> + break;
> >> + }
> >> + pfnlist[i] = page_to_pfn(pages[index]);
> >> + } else {
> >> + pfnlist[i] = mmio_spa + done + i;
> >> + }
> >> + if (ret)
> >> + break;
> >
> > This test could also go away if the ret = -EINVAL error above can't
> > happen.
> >
> Ack
> <snip>
> >> +
> >> +/* Ask the hypervisor to map guest mmio space */
> >> +int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
> >> +{
> >> + int i;
> >> + u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
> >> + HV_MAP_GPA_NOT_CACHED;
> >> +
> >> + for (i = 0; i < numpgs; i++)
> >> + if (page_is_ram(mmio_spa + i))
> >
> > FWIW, doing this check one-page-at-a-time is somewhat expensive if numpgs
> > is large. The underlying data structures should support doing a single range
> > check, but I haven't looked at whether functions exist to do such a range check.
> >
> Indeed - I'll make a note to investigate, thanks.
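One candidate is region_intersects(), which queries an entire physical range against System RAM in a single call. Whether it is usable from this context is an assumption to verify, but the shape would be roughly:

```c
	/* untested sketch: one range query instead of numpgs page_is_ram()
	 * calls. region_intersects() takes byte addresses, so the PFN-based
	 * arguments are shifted up by the page shift. */
	if (region_intersects(mmio_spa << HV_HYP_PAGE_SHIFT,
			      numpgs << HV_HYP_PAGE_SHIFT,
			      IORESOURCE_SYSTEM_RAM,
			      IORES_DESC_NONE) != REGION_DISJOINT)
		return -EINVAL;
```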
>
> >> + return -EINVAL;
> >> +
> >> + return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
> >> + mmio_spa);
> >> +}
> >> +
> >> +int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> >> + u32 flags)
> >> +{
> >> + struct hv_input_unmap_gpa_pages *input_page;
> >> + u64 status, page_count = page_count_4k;
> >> + unsigned long irq_flags, large_shift = 0;
> >> + int ret = 0, done = 0;
> >> +
> >> + if (page_count == 0)
> >> + return -EINVAL;
> >> +
> >> + if (flags & HV_UNMAP_GPA_LARGE_PAGE) {
> >> + if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
> >> + return -EINVAL;
> >> +
> >> + large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
> >> + page_count >>= large_shift;
> >> + }
> >> +
> >> + while (done < page_count) {
> >> + ulong completed, remain = page_count - done;
> >> + int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> >
> > Using HV_MAP_GPA_BATCH_SIZE seems a little weird here since there's
> > no input array and hence no constraint based on keeping input args to
> > just one page. Is it being used as an arbitrary limit so the rep_count
> > passed to the hypercall isn't "too large" for some definition of "too large"?
> > If that's the case, perhaps a separate #define and a comment would
> > make sense. I kept trying to figure out how the batch size for unmap was
> > related to the map hypercall, and I don't think there is any relationship.
> >
> I think batching this was intentional so that we can be sure to re-enable
> interrupts periodically when unmapping an entire VM's worth of memory. That
> said, as you know the hypercall will return if it takes longer than a certain
> amount of time so I guess that is "built-in" in some sense.
>
> I think keeping the batching, but #defining a specific value for unmap as you
> suggest is a good idea.
>
> I'd be inclined to use a similar number (something like 512).
Yes, I agree the batching should be kept because interrupts are disabled.
If the rep hypercall is taking a "long time", it will by itself come up for
air to allow taking interrupts. If interrupts were not disabled, that would
solve the problem. But with interrupts disabled, you do need some
batching.
Using a number like 512 makes sense to me. Just add a comment that
the value is somewhat arbitrary and only to allow for interrupts. It's
not based on memory space for input/output arguments like all the
other batch sizes.
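Something along these lines (the name and value are just a suggestion):

```c
/*
 * Batch size for HVCALL_UNMAP_GPA_PAGES. Unlike HV_MAP_GPA_BATCH_SIZE it
 * is not derived from the one-page hypercall input area (unmap passes no
 * PFN list); the value is somewhat arbitrary and exists only so that
 * interrupts, which are disabled around the hypercall, get re-enabled
 * periodically when unmapping a large range.
 */
#define HV_UMAP_GPA_BATCH_SIZE	512
```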
>
> >> +
> >> + local_irq_save(irq_flags);
> >> + input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +
> >> + input_page->target_partition_id = partition_id;
> >> + input_page->target_gpa_base = gfn + (done << large_shift);
> >> + input_page->unmap_flags = flags;
> >> + status = hv_do_rep_hypercall(HVCALL_UNMAP_GPA_PAGES, rep_count,
> >> + 0, input_page, NULL);
> >> + local_irq_restore(irq_flags);
> >> +
> >> + completed = hv_repcomp(status);
> >> + if (!hv_result_success(status)) {
> >> + ret = hv_result_to_errno(status);
> >> + break;
> >> + }
> >> +
> >> + done += completed;
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
> >> + union hv_gpa_page_access_state_flags state_flags,
> >> + int *written_total,
> >> + union hv_gpa_page_access_state *states)
> >> +{
> >> + struct hv_input_get_gpa_pages_access_state *input_page;
> >> + union hv_gpa_page_access_state *output_page;
> >> + int completed = 0;
> >> + unsigned long remaining = count;
> >> + int rep_count, i;
> >> + u64 status;
> >> + unsigned long flags;
> >> +
> >> + *written_total = 0;
> >> + while (remaining) {
> >> + local_irq_save(flags);
> >> + input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> >> +
> >> + input_page->partition_id = partition_id;
> >> + input_page->hv_gpa_page_number = gpa_base_pfn + *written_total;
> >> + input_page->flags = state_flags;
> >> + rep_count = min(remaining, HV_GET_GPA_ACCESS_STATES_BATCH_SIZE);
> >> +
> >> + status = hv_do_rep_hypercall(HVCALL_GET_GPA_PAGES_ACCESS_STATES,
> rep_count,
> >> + 0, input_page, output_page);
> >> + if (!hv_result_success(status)) {
> >> + local_irq_restore(flags);
> >> + break;
> >> + }
> >> + completed = hv_repcomp(status);
> >> + for (i = 0; i < completed; ++i)
> >> + states[i].as_uint8 = output_page[i].as_uint8;
> >> +
> >> + states += completed;
> >> + *written_total += completed;
> >> + remaining -= completed;
> >> + local_irq_restore(flags);
> >
> > FWIW, this local_irq_restore() could move up three lines to before the progress
> > accounting is done.
> >
> Good point, thanks.
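To be concrete about the suggestion: the restore can move to just after the copy loop, since the progress accounting touches only caller-owned memory, e.g.:

```c
		completed = hv_repcomp(status);
		for (i = 0; i < completed; ++i)
			states[i].as_uint8 = output_page[i].as_uint8;
		local_irq_restore(flags);	/* per-cpu pages no longer used */

		states += completed;
		*written_total += completed;
		remaining -= completed;
```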
> <snip>
> >> + memset(input, 0, sizeof(*input));
> >> + memset(output, 0, sizeof(*output));
> >
> > Why is the output set to zero? I would think Hyper-V is responsible for
> > ensuring that the output is properly populated, with unused fields/areas
> > set to zero.
> >
> Overabundance of caution, I think! It doesn't need to be zeroed AFAIK.
>
> I recently did some cleanup (in our internal tree) to make sure we are
> memset()ing the input and *not* memset()ing the output everywhere, but
> it didn't make it into this series. There are a few more places like this.
>
> <snip>
> >> +
> >> +int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
> >> + /* Choose between pages and bytes */
> >> + struct hv_vp_state_data state_data, u64 page_count,
> >
> > The size of "struct hv_vp_state_data" looks to be 24 bytes (3 64-bit words).
> > Is there a reason to pass this by value instead of as a pointer? I guess it works
> > like this, but it seems atypical.
> >
> No particular reason. I'm guessing the compiler will pass this by copying it to this
> function's stack frame - 24 bytes is still rather small so I don't think it's an issue.
Right, I think the code works as written. It's just atypical. When I see stuff that
doesn't fit the usual pattern, I always wonder "why?" And if there's no good
reason "why", reverting to the usual pattern avoids somebody else wondering
"why" in the future. :-) But you can leave it as is.
>
> I'm also under the impression the compiler may optimize this to a pointer since it is
> not modified?
It might. I'm not sure.
>
> I usually only pass a pointer (for read-only values) when it's something really
> large that I *definitely* don't want to be copied on the stack (like, 100 bytes?).
> In that case I probably only have a pointer to vmalloc()'d/kmalloc()'d memory anyway.
>
> <snip>
> >> + local_irq_save(flags);
> >> + status = hv_do_fast_hypercall8(HVCALL_CLEAR_VIRTUAL_INTERRUPT,
> >> + partition_id) &
> >> + HV_HYPERCALL_RESULT_MASK;
> >
> > This "anding" with HV_HYPERCALL_RESULT_MASK should be removed.
> >
> Yep, thanks.
>
> >> + local_irq_restore(flags);
> >
> > The irq save/restore isn't needed here since this is a fast hypercall and
> > per-cpu arg memory is not used.
> >
> Agreed, will remove these for the fast hypercall sites.
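For reference, the cleaned-up site could look like this (a sketch, not the final code):

```c
	/* fast hypercalls pass their arguments in registers, so the per-cpu
	 * input page -- and the irq save/restore that protects it -- is not
	 * needed here */
	status = hv_do_fast_hypercall8(HVCALL_CLEAR_VIRTUAL_INTERRUPT,
				       partition_id);
	if (!hv_result_success(status))
		return hv_result_to_errno(status);
```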
>
> <snip>
> >> + input->proximity_domain_info = hv_numa_node_to_pxm_info(node);
> >> + status = hv_do_hypercall(HVCALL_CREATE_PORT, input, NULL) &
> >> + HV_HYPERCALL_RESULT_MASK;
> >
> > Use the hv_status checking macros instead of and'ing with
> > HV_HYPERCALL_RESULT_MASK.
> >
> Thanks, these need a bit of cleanup.
>
> <snip>
> >> + status = hv_do_fast_hypercall16(HVCALL_DELETE_PORT,
> >> + input.as_uint64[0],
> >> + input.as_uint64[1]) &
> >> + HV_HYPERCALL_RESULT_MASK;
> >> + local_irq_restore(flags);
> >
> > Same a previous comment about and'ing. And irq save/restore
> > isn't needed.
> >
> ack
>
> <snip>
> >> + status = hv_do_hypercall(HVCALL_CONNECT_PORT, input, NULL) &
> >> + HV_HYPERCALL_RESULT_MASK;
> >
> > Same here. Use hv_* macros.
> >
> ack
>
> <snip>
> >> + status = hv_do_fast_hypercall16(HVCALL_DISCONNECT_PORT,
> >> + input.as_uint64[0],
> >> + input.as_uint64[1]) &
> >> + HV_HYPERCALL_RESULT_MASK;
> >> + local_irq_restore(flags);
> >
> > Same as above.
> >
> ack
>
> <snip>
> >> + local_irq_save(flags);
> >> + input.sint_index = sint_index;
> >> + status = hv_do_fast_hypercall8(HVCALL_NOTIFY_PORT_RING_EMPTY,
> >> + input.as_uint64) &
> >> + HV_HYPERCALL_RESULT_MASK;
> >> + local_irq_restore(flags);
> >
> > Same as above.
> >
> ack, and I'll double check we don't have other fast hypercall sites doing this
>
> <snip>
> >> + /*
> >> + * This is required to make sure that reserved field is set to
> >> + * zero, because MSHV has a check to make sure reserved bits are
> >> + * set to zero.
> >> + */
> >
> > Is this comment about checking reserved bits unique to this hypercall? If not, it
> > seems a little odd to see this comment here, but not other places where the input
> > is zero'ed.
> >
> I agree the comment isn't necessary - memset()ing the input to zero should be the
> default policy. I'll remove it.
This is another case where something doesn't fit the usual pattern, and I wonder
"why". :-)
>
> >> + memset(input_page, 0, sizeof(*input_page));
> >> + /* Only set the partition id if you are making the pages
> >> + * exclusive
> >> + */
> >> + if (flags & HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE)
> >> + input_page->partition_id = partition_id;
> >> + input_page->flags = flags;
> >> + input_page->host_access = host_access;
> >> +
> >> + for (i = 0; i < rep_count; i++) {
> >> + u64 index = (done + i) << large_shift;
> >> +
> >> + if (index >= page_struct_count)
> >> + return -EINVAL;
> >
> > Can this test ever be true?
> >
> See above in hv_do_map_gpa_hcall(), it's more of a sanity check or assert.
>
> >> +
> >> + input_page->spa_page_list[i] =
> >> + page_to_pfn(pages[index]);
> >
> > When large_shift is non-zero, it seems weird to be skipping over most
> > of the entries in the "pages" array. But maybe there's a reason for that.
> >
> See above where we do the same thing in hv_do_map_gpa_hcall(). The hypervisor
> only needs to see the "head" pages - the GPAs of the 2MB pages.
>
> >> + }
> >> +
> >> + status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
> >> + NULL);
> >> + local_irq_restore(irq_flags);
> >> +
> >> + completed = hv_repcomp(status);
> >> +
> >> + if (!hv_result_success(status))
> >> + return hv_result_to_errno(status);
> >> +
> >> + done += completed;
> >> + }
> >> +
> >> + return 0;
> >> +}
> >
> > [snip the rest of the patch that I haven't reviewed yet]
> >
> > Michael