RE: [PATCH] hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
From: Michael Kelley
Date: Mon Jan 06 2025 - 19:39:17 EST
From: Jacob Pan <jacob.pan@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, January 2, 2025 1:40 PM
>
> When device memory blocks are hot-added via add_memory_driver_managed(),
> the HV balloon driver intercepts them but fails to online these memory
> blocks. This could result in device driver initialization failures.
>
> To address this, fall back to the generic online callback for memory
> blocks not added by the HV balloon driver. Similar cases are handled the
> same way in virtio-mem driver.
I had a little bit of trouble figuring out what problem this patch is solving.
Is this a correct description?
The Hyper-V balloon driver installs a custom callback for handling page
onlining operations performed by the memory hotplug subsystem. This
custom callback is global, and overrides the default callback
(generic_online_page) that Linux otherwise uses. The custom callback
properly handles memory that is hot-added by the balloon driver as part
of a Hyper-V hot-add region.
But memory can also be hot-added directly by a device driver for a vPCI
device, particularly GPUs. In such a case, the custom callback installed by
the balloon driver runs, but won't find the page in its hot-add region list
and doesn't online it, which could cause driver initialization failures.
Fix this by having the balloon custom callback run generic_online_page()
when the page isn't part of a Hyper-V hot-add region, thereby doing the
default Linux behavior. This allows device driver hot-adds to work
properly. Similar cases are handled the same way in the virtio-mem driver.
>
> Suggested-by: Vikram Sethi <vsethi@xxxxxxxxxx>
> Tested-by: Michael Frohlich <mfrohlich@xxxxxxxxxxxxx>
> Signed-off-by: Jacob Pan <jacob.pan@xxxxxxxxxxxxxxxxxxx>
> ---
> drivers/hv/hv_balloon.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index a99112e6f0b8..c999daf34108 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -766,16 +766,18 @@ static void hv_online_page(struct page *pg, unsigned int
> order)
> struct hv_hotadd_state *has;
> unsigned long pfn = page_to_pfn(pg);
>
> - guard(spinlock_irqsave)(&dm_device.ha_lock);
> - list_for_each_entry(has, &dm_device.ha_region_list, list) {
> - /* The page belongs to a different HAS. */
> - if (pfn < has->start_pfn ||
> - (pfn + (1UL << order) > has->end_pfn))
> - continue;
> + scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
> + list_for_each_entry(has, &dm_device.ha_region_list, list) {
> + /* The page belongs to a different HAS. */
> + if (pfn < has->start_pfn ||
> + (pfn + (1UL << order) > has->end_pfn))
> + continue;
>
> - hv_bring_pgs_online(has, pfn, 1UL << order);
> - break;
> + hv_bring_pgs_online(has, pfn, 1UL << order);
> + return;
> + }
> }
> + generic_online_page(pg, order);
> }
>
> static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt)
> --
> 2.34.1
>
The code LGTM. I'd suggest updating the commit message along the
lines of what I wrote. My descriptions can be a bit wordy, but
hopefully they add clarity on the overall picture.
In any case,
Reviewed-by: Michael Kelley <mhklinux@xxxxxxxxxxx>