Re: [PATCH 2/2] block: allow different-pgmap pages as separate bvecs in bio_add_page

From: Naman Jain

Date: Thu Apr 02 2026 - 01:21:29 EST




On 4/1/2026 7:38 PM, Christoph Hellwig wrote:
> On Wed, Apr 01, 2026 at 08:23:29AM +0000, Naman Jain wrote:
>> bio_add_page() and bio_integrity_add_page() reject pages from a
>> different dev_pagemap entirely, returning 0 even when the page could
>> be added as a new bvec entry. The pgmap check was intended only to
>> prevent merging into the same bvec segment, not to block the page
>> from being added at all.
>>
>> This causes callers to unnecessarily start a new bio when a buffer
>> spans pages from two different pgmaps, even though the bio has room
>> for another bvec.
>
> This is not unnecessary. A single dma mapping operation can only
> map a single target pgmap. The old SG API works around this by
> doing multiple mapping operation underneath, but compared to that
> just having multiple bios is much easier and more efficient.
>
> What is your use case here?

Hello Christoph,

Thanks for reviewing these patches.

The use case driving this patch is the MSHV VTL driver (drivers/hv/mshv_vtl_main.c), used by VMs with a paravisor architecture (OpenHCL/OpenVMM: https://openvmm.dev/guide/index.html).

In this setup, the guest runs at two Virtual Trust Levels:
- VTL2 (higher privilege): runs a Linux kernel acting as a "paravisor" that handles device I/O on behalf of the guest
- VTL0 (lower privilege): runs the actual guest OS (Windows/Linux)

VTL2 Linux performs block I/O (NVMe, SCSI, etc.) using VTL0's memory as DMA buffers. To enable this, VTL0 memory is registered into the VTL2 kernel via the MSHV_ADD_VTL0_MEMORY ioctl, which calls devm_memremap_pages() to create MEMORY_DEVICE_GENERIC zone device pages.

The ioctl is called multiple times by the Virtual Machine Manager (VMM) to register VTL0's physical address space in chunks. Each call creates a separate dev_pagemap. This chunking is necessary because:

1. Firmware/UEFI fragments the guest physical address space (MMIO holes,
reserved regions)
2. Alignment constraints: vmemmap_shift is computed from the range alignment, and highly aligned large ranges can exceed MAX_FOLIO_ORDER, causing devm_memremap_pages() to fail

When a direct I/O request spans pages from different chunks (different pgmaps), the current code rejects the second page entirely:

	if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
		return 0;	/* rejection: forces a bio split or I/O error */

Both chunks are regular RAM from the DMA perspective (MEMORY_DEVICE_GENERIC, not P2PDMA). The only requirement is that they not be merged into the same bvec segment, which patch 1/2 enforces by adding the pgmap check to biovec_phys_mergeable().

This patch allows pages from different pgmaps to be added as separate bvec entries in the same bio, eliminating bio splits and I/O failures
when buffers span pgmap boundaries.

I noticed this while upgrading the OpenHCL kernel from 6.12 to 6.18.



There is another concern flagged by the Sashiko code review:
https://sashiko.dev/#/patchset/20260401082329.1602328-1-namjain%40linux.microsoft.com

From my code analysis, this issue cannot occur as of now, so the check below is future-proofing of the APIs after patch 2/2. I would need to add something like this to address it:

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 3462697331890..6f2f30a814560 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -231,6 +231,9 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 	if (bip->bip_vcnt > 0) {
 		struct bio_vec *bv = &bip->bip_vec[bip->bip_vcnt - 1];
 
+		if (is_pci_p2pdma_page(bv->bv_page) !=
+		    is_pci_p2pdma_page(page))
+			return 0;
 		if (zone_device_pages_have_same_pgmap(bv->bv_page, page) &&
 		    bvec_try_merge_hw_page(q, bv, page, len, offset)) {
 			bip->bip_iter.bi_size += len;
diff --git a/block/bio.c b/block/bio.c
index 7715e59e68613..6216a554de68b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1034,6 +1034,9 @@ int bio_add_page(struct bio *bio, struct page *page,
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
+		if (is_pci_p2pdma_page(bv->bv_page) !=
+		    is_pci_p2pdma_page(page))
+			return 0;
 		if (zone_device_pages_have_same_pgmap(bv->bv_page, page) &&
 		    bvec_try_merge_page(bv, page, len, offset)) {
 			bio->bi_iter.bi_size += len;


Please let me know what you think about this.

Thanks,
Naman