Re: [PATCH 2/2] mshv: Add support for integrated scheduler

From: Stanislav Kinsburskii

Date: Mon Feb 02 2026 - 12:23:05 EST


On Fri, Jan 30, 2026 at 08:22:34PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 10:51:10AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 06:43:09PM +0000, Anirudh Rayabharam wrote:
> > > On Fri, Jan 30, 2026 at 10:37:38AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> > > > > On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > > > > From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx> Sent: Wednesday, January 21, 2026 2:36 PM
> > > > > > > >
> > > > > > > > From: Andreea Pintilie <anpintil@xxxxxxxxxxxxx>
> > > > > > > >
> > > > > > > > Query the hypervisor for integrated scheduler support and use it if
> > > > > > > > configured.
> > > > > > > >
> > > > > > > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > > > > > > root scheduler allows the root partition to schedule guest vCPUs across
> > > > > > > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > > > > > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > > > > > > scheduling entirely to the hypervisor.
> > > > > > > >
> > > > > > > > Direct virtualization introduces a new privileged guest partition type, the L1
> > > > > > > > Virtual Host (L1VH), which can create child partitions from its own
> > > > > > > > resources. These child partitions are effectively siblings, scheduled by
> > > > > > > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > > > > > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > > > > > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > > > > > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > > > > > > (typically round-robin across all allocated physical CPUs). As a result,
> > > > > > > > the system may appear to "steal" time from the L1VH and its children.
> > > > > > > >
> > > > > > > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > > > > > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > > > > > > guests across its "physical" cores, effectively emulating root scheduler
> > > > > > > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > > > > > > rest of the system.
> > > > > > > >
> > > > > > > > The integrated scheduler is controlled by the root partition and gated by
> > > > > > > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > > > > > > supports the integrated scheduler. The L1VH partition must then check if it
> > > > > > > > is enabled by querying the corresponding extended partition property. If
> > > > > > > > this property is true, the L1VH partition must use the root scheduler
> > > > > > > > logic; otherwise, it must use the core scheduler.
> > > > > > > >
> > > > > > > > Signed-off-by: Andreea Pintilie <anpintil@xxxxxxxxxxxxx>
> > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx>
> > > > > > > > ---
> > > > > > > > drivers/hv/mshv_root_main.c | 79 +++++++++++++++++++++++++++++--------------
> > > > > > > > include/hyperv/hvhdk_mini.h | 6 +++
> > > > > > > > 2 files changed, 58 insertions(+), 27 deletions(-)
> > > > > > > >
> > > >
> > > > <snip>
> > > >
> > > > > > > > -root_sched_deinit:
> > > > > > > > - root_scheduler_deinit();
> > > > > > > > - return err;
> > > > > > > > }
> > > > > > > >
> > > > > > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > > > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > > > > > > {
> > > > > > > > - /*
> > > > > > > > - * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > > > > > - * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > > > > > - * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > > > > > - */
> > > > > > > > - if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > - HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > - 0, &mshv_root.vmm_caps,
> > > > > > > > - sizeof(mshv_root.vmm_caps)))
> > > > > > > > - dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > > > > > + int ret;
> > > > > > > > +
> > > > > > > > + ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > + HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > + 0, &mshv_root.vmm_caps,
> > > > > > > > + sizeof(mshv_root.vmm_caps));
> > > > > > > > + if (ret) {
> > > > > > > > + dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > > > > > + return ret;
> > > > > > > > + }
> > > > > > >
> > > > > > > This is a functional change that isn't mentioned in the commit message.
> > > > > > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > > > > > as all disabled? Presumably there are older versions of the hypervisor that
> > > > > > > don't support the requirements described in the original comment, but
> > > > > > > perhaps they are no longer relevant?
> > > > > > >
> > > > > >
> > > > > > To fail is now the only option for the L1VH partition. It must discover
> > > > > > the scheduler type. Without this information, the partition cannot
> > > > > > operate. The core scheduler logic will not work with an integrated
> > > > > > scheduler, and vice versa.
> > > > >
> > > > > I don't think we need to fail here. If we don't find vmm caps, that
> > > > > means we are on an older hypervisor that supports l1vh but not
> > > > > integrated scheduler (yes, such a version exists). In this case since
> > > > > integrated scheduler is not supported by the hypervisor, the core
> > > > > scheduler logic will work.
> > > > >
> > > >
> > > > The older hypervisor version won't have the integrated scheduler
> > > > capability bit.
> > > > And we can't operate in core scheduler mode if the integrated scheduler
> > > > is enabled underneath us.
> > >
> > > The older hypervisor won't have the integrated scheduler capability bit.
> > > This means that the older hypervisor doesn't support integrated
> > > scheduler (this is how vmm caps work: if the bit doesn't exist or
> > > vmm caps themselves don't exist the feature should be assumed as not
> > > available). If the hypervisor doesn't support integrated scheduler in the
> > > first place, it can't be enabled underneath us. So, it is safe to
> > > operate in core scheduler mode.
> > >
> >
> > We can’t tell whether the hypervisor is older and simply doesn’t have
> > the VMM caps bit, or whether we just failed to fetch the VMM caps.
>
> If we failed to fetch the VMM caps i.e. the hypervisor doesn't support
> the vmm caps property, we must assume that all the bits in vmm caps are
> 0 (i.e. no features are available). This is how vmm capabilities are
> supposed to be interpreted. This is something I checked with the
> hypervisor team some time back.
>
> >
> > In other words, we can’t distinguish between “an older hypervisor
> > without integrated scheduler support” and “a newer hypervisor with an
> > integrated scheduler, but we failed to fetch the VMM caps”.
> >
> > But for completeness: are you saying there is an older hypervisor
> > version that supports L1VH, but does not support VMM caps?
>
> I don't know how much of the Azure fleet still runs it but yes such a
> hypervisor version exists.
>

We don't need to support interim hypervisor versions in the upstream
kernel: these versions will go away, and then this logic will become not
only a dead code path but also an incorrect one.

We can keep the existing logic that treats a failure to fetch the VMM
caps as normal internally for as long as required.
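
For illustration, the two policies debated in this thread differ only in
what happens when the fetch fails. Below is a minimal userspace sketch,
not the driver code: struct vmm_caps_model, enum caps_policy and
init_vmm_caps_model() are hypothetical names, and fetch_ret merely stands
in for the return value of hv_call_get_partition_property_ex().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VMM capabilities; models a single bit. */
struct vmm_caps_model {
	bool vmm_enable_integrated_scheduler;
};

/* Hypothetical switch between the two behaviours discussed in the thread. */
enum caps_policy {
	CAPS_FAILURE_IS_FATAL,   /* this patch: propagate the error */
	CAPS_FAILURE_MEANS_ZERO, /* previous code: proceed with all caps zero */
};

/*
 * fetch_ret models the hypercall wrapper's result: 0 on success, a
 * negative errno otherwise. "fetched" is what a successful call returned.
 */
static int init_vmm_caps_model(int fetch_ret, struct vmm_caps_model fetched,
			       enum caps_policy policy,
			       struct vmm_caps_model *caps)
{
	assert(caps);

	if (fetch_ret == 0) {
		*caps = fetched;
		return 0;
	}

	if (policy == CAPS_FAILURE_MEANS_ZERO) {
		/* Treat "could not fetch" as "no capability is available". */
		*caps = (struct vmm_caps_model){ 0 };
		return 0;
	}

	/* The caller cannot tell which scheduler is active, so give up. */
	return fetch_ret;
}
```

With CAPS_FAILURE_IS_FATAL a fetch error reaches the caller, as in this
patch. With CAPS_FAILURE_MEANS_ZERO the error is swallowed and every
capability reads as disabled, as in the code this patch removes.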

Thanks,
Stanislav

> Thanks,
> Anirudh
>
> >
> > Thanks, Stanislav
> >
> > > Thanks,
> > > Anirudh.