RE: [PATCH] Drivers: hv: mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER

From: Michael Kelley

Date: Fri Apr 03 2026 - 14:37:12 EST


From: Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx> Sent: Friday, April 3, 2026 12:25 AM
>

Nit: I wonder what's the best prefix to use in the patch Subject field.
"Drivers: hv: mshv_vtl:" is rather long. There was agreement to use
just "mshv:" for the root partition code, and I probably misused that
in commit 754cf84504ea. How about just "mshv_vtl:" as the prefix for this
patch and other VTL patches going forward?

> On 4/3/2026 9:05 AM, Michael Kelley wrote:
> > From: Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx> Sent: Tuesday, March 31, 2026 10:40 PM
> >>
> >> When registering VTL0 memory via MSHV_ADD_VTL0_MEMORY, the kernel
> >> computes pgmap->vmemmap_shift as the number of trailing zeros in the
> >> OR of start_pfn and last_pfn, intending to use the largest compound
> >> page order both endpoints are aligned to.
> >>
> >> However, this value is not clamped to MAX_FOLIO_ORDER, so a
> >> sufficiently aligned range (e.g. physical range 0x800000000000-
> >> 0x800080000000, corresponding to start_pfn=0x800000000 with 35
> >> trailing zeros) can produce a shift larger than what
> >> memremap_pages() accepts, triggering a WARN and returning -EINVAL:
> >>
> >> WARNING: ... memremap_pages+0x512/0x650
> >> requested folio size unsupported
> >>
> >> The MAX_FOLIO_ORDER check was added by
> >> commit 646b67d57589 ("mm/memremap: reject unreasonable folio/compound
> >> page sizes in memremap_pages()").
> >> When CONFIG_HAVE_GIGANTIC_FOLIOS=y, CONFIG_SPARSEMEM_VMEMMAP=y, and
> >> CONFIG_HUGETLB_PAGE is not set, MAX_FOLIO_ORDER resolves to
> >> (PUD_SHIFT - PAGE_SHIFT) = 18. Any range whose PFN alignment exceeds
> >> order 18 hits this path.
> >
> > I'm not clear on what point you are making with this specific
> > configuration that results in MAX_FOLIO_ORDER being 18. Is it just
> > an example? Is 18 the largest expected value for MAX_FOLIO_ORDER?
> > And note that PUD_SHIFT and PAGE_SHIFT might have different values
> > on arm64 with a page size other than 4K.
> >
>
> Yes, this was just an example; it is not general enough, so I will
> remove it.
> MAX_FOLIO_ORDER can go beyond 18.
>
> >>
> >> Fix this by clamping vmemmap_shift to MAX_FOLIO_ORDER so we always
> >> request the largest order the kernel supports, rather than an
> >> out-of-range value.
> >>
> >> Also fix the error path to propagate the actual error code from
> >> devm_memremap_pages() instead of hard-coding -EFAULT, which was
> >> masking the real -EINVAL return.
> >>
> >> Fixes: 7bfe3b8ea6e3 ("Drivers: hv: Introduce mshv_vtl driver")
> >> Cc: <stable@xxxxxxxxxxxxxxx>
> >> Signed-off-by: Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx>
> >> ---
> >> drivers/hv/mshv_vtl_main.c | 8 ++++++--
> >> 1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
> >> index 5856975f32e12..255fed3a740c1 100644
> >> --- a/drivers/hv/mshv_vtl_main.c
> >> +++ b/drivers/hv/mshv_vtl_main.c
> >> @@ -405,8 +405,12 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
> >> /*
> >> * Determine the highest page order that can be used for the given memory range.
> >> * This works best when the range is aligned; i.e. both the start and the length.
> >> + * Clamp to MAX_FOLIO_ORDER to avoid a WARN in memremap_pages() when the range
> >> + * alignment exceeds the maximum supported folio order for this kernel config.
> >> */
> >> - pgmap->vmemmap_shift = count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn);
> >> + pgmap->vmemmap_shift = min_t(unsigned long,
> >> + count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn),
> >> + MAX_FOLIO_ORDER);
> >
> > Is it necessary to use min_t() here, or would min() work? Neither count_trailing_zeros()
> > nor MAX_FOLIO_ORDER is ever negative, so it seems like just min() would work with
> > no potential for doing a bogus comparison or assignment.
> >
>
> min() could work, yes. I just felt min_t() is safer for comparing these
> two values of different types:
> count_trailing_zeros() returns an 'int', and
> MAX_FOLIO_ORDER is a macro computed with bit operations,
>
> while the destination, pgmap->vmemmap_shift, is an 'unsigned long'.
>
>
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER (ilog2(SZ_16G) - PAGE_SHIFT)
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER (ilog2(SZ_1G) - PAGE_SHIFT)
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER (PUD_SHIFT - PAGE_SHIFT)
>
> I am fine with anything you suggest here.

There are a fair number of patches on LKML replacing min_t() with
min(). At some point in the not-too-distant past, the implementation of
min() was improved to handle different but compatible integer types.
My sense is that min() is the better choice for general integer comparisons,
particularly when the values are known to be non-negative.

>
> > The shift is calculated using the originally passed in start_pfn and last_pfn, while the
> > "range" struct in pgmap has an "end" value that is one page less. So is the idea to
> > go ahead and create the mapping with folios of a size that includes that last page,
> > and then just waste the last page of the last folio?
>
> No, no waste occurs. The vmemmap_shift determines the folio order,
> and memmap_init_zone_device() walks the range [start_pfn, last_pfn) in
> steps of (1 << vmemmap_shift) pages, creating one folio per step. The OR
> operation ensures both boundaries are aligned to multiples of (1 <<
> vmemmap_shift), guaranteeing the range divides evenly into folios with
> no partial folio at the end.
> The intention is to find the highest possible folio order here, and if
> it exceeds MAX_FOLIO_ORDER, clamp vmemmap_shift to that limit.

OK, I figured out what is confusing me. I had a misunderstanding
when I reviewed this code during its original submission, and that
misunderstanding has influenced my (incorrect) review of this change.

The struct mshv_vtl_ram_disposition that is passed from user space has
two fields: start_pfn and last_pfn. But last_pfn is somewhat misnamed
in my view. For example, an aligned 2 MiB of memory would consist of
512 PFNs. If the first PFN is 0x200, the last PFN would be 0x3FF. But in
the semantics of the struct, the last_pfn field should be 0x400.

In response to my comments in the original review, you added the comment
about last_pfn being excluded in the pagemap range, which is true. But it's
not because that page is somehow reserved or being wasted. It's because
the range is being described by specifying the PFN *after* the last PFN.

With the start_pfn and last_pfn fields used to determine the highest
page order that can be used, the slightly unorthodox semantics of
last_pfn make that calculation easy. But then you must subtract 1
from last_pfn when setting the range start and end for
devm_memremap_pages() to use. And the code does that, so the code
is all correct. The comment might be improved to describe the
semantics of the last_pfn field, rather than suggesting that a page
of memory is intentionally being excluded or wasted. And/or maybe
the struct mshv_vtl_ram_disposition definition should get a comment
clarifying the semantics of last_pfn.

Michael