Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

From: Michael Roth
Date: Mon Nov 28 2022 - 19:08:16 EST


On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
>

<snip>

> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> + struct restrictedmem_data *data;
> + struct address_space *mapping;
> + struct inode *inode;
> + struct file *file;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL);
> + if (!data)
> + return ERR_PTR(-ENOMEM);
> +
> + data->memfd = memfd;
> + mutex_init(&data->lock);
> + INIT_LIST_HEAD(&data->notifiers);
> +
> + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> + if (IS_ERR(inode)) {
> + kfree(data);
> + return ERR_CAST(inode);
> + }
> +
> + inode->i_mode |= S_IFREG;
> + inode->i_op = &restrictedmem_iops;
> + inode->i_mapping->private_data = data;
> +
> + file = alloc_file_pseudo(inode, restrictedmem_mnt,
> + "restrictedmem", O_RDWR,
> + &restrictedmem_fops);
> + if (IS_ERR(file)) {
> + iput(inode);
> + kfree(data);
> + return ERR_CAST(file);
> + }
> +
> + file->f_flags |= O_LARGEFILE;
> +
> + mapping = memfd->f_mapping;
> + mapping_set_unevictable(mapping);
> + mapping_set_gfp_mask(mapping,
> + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);

Is this supposed to prevent migration of pages being used for
restrictedmem/shmem backend?

In my case I've been testing SNP support based on UPM v9, and for
large guests (128GB+), if I force 2M THPs via:

echo always >/sys/kernel/mm/transparent_hugepages/shmem_enabled

it will in some cases trigger the below trace, which suggests that
kcompactd is trying to call migrate_folio() on a PFN that was/is
still allocated for guest private memory (and so has been removed from
directmap as part of shared->private conversation via REG_REGION kvm
ioctl, leading to the crash). This trace seems to occur during early
OVMF boot while the guest is in the middle of pre-accepting on private
memory (no lazy accept in this case).

Is this expected behavior? What else needs to be done to ensure
migrations aren't attempted in this case?

Thanks!

-Mike


# Host logs with debug info for crash during SNP boot

...
[ 904.373632] kvm_restricted_mem_get_pfn: GFN: 0x1caced1, PFN: 0x156b7f, page: ffffea0006b197b0, ref_count: 2
[ 904.373634] kvm_restricted_mem_get_pfn: GFN: 0x1caced2, PFN: 0x156840, page: ffffea0006b09400, ref_count: 2
[ 904.373637] kvm_restricted_mem_get_pfn: GFN: 0x1caced3, PFN: 0x156841, page: ffffea0006b09450, ref_count: 2
[ 904.373639] kvm_restricted_mem_get_pfn: GFN: 0x1caced4, PFN: 0x156842, page: ffffea0006b094a0, ref_count: 2
[ 904.373641] kvm_restricted_mem_get_pfn: GFN: 0x1caced5, PFN: 0x156843, page: ffffea0006b094f0, ref_count: 2
[ 904.373645] kvm_restricted_mem_get_pfn: GFN: 0x1caced6, PFN: 0x156844, page: ffffea0006b09540, ref_count: 2
[ 904.373647] kvm_restricted_mem_get_pfn: GFN: 0x1caced7, PFN: 0x156845, page: ffffea0006b09590, ref_count: 2
[ 904.373649] kvm_restricted_mem_get_pfn: GFN: 0x1caced8, PFN: 0x156846, page: ffffea0006b095e0, ref_count: 2
[ 904.373652] kvm_restricted_mem_get_pfn: GFN: 0x1caced9, PFN: 0x156847, page: ffffea0006b09630, ref_count: 2
[ 904.373654] kvm_restricted_mem_get_pfn: GFN: 0x1caceda, PFN: 0x156848, page: ffffea0006b09680, ref_count: 2
[ 904.373656] kvm_restricted_mem_get_pfn: GFN: 0x1cacedb, PFN: 0x156849, page: ffffea0006b096d0, ref_count: 2
[ 904.373661] kvm_restricted_mem_get_pfn: GFN: 0x1cacedc, PFN: 0x15684a, page: ffffea0006b09720, ref_count: 2
[ 904.373663] kvm_restricted_mem_get_pfn: GFN: 0x1cacedd, PFN: 0x15684b, page: ffffea0006b09770, ref_count: 2

# PFN 0x15684c is allocated for guest private memory, will have been removed from directmap as part of RMP requirements

[ 904.373665] kvm_restricted_mem_get_pfn: GFN: 0x1cacede, PFN: 0x15684c, page: ffffea0006b097c0, ref_count: 2
...

# kcompactd crashes trying to copy PFN 0x15684c to a new folio, crashes trying to access PFN via directmap

[ 904.470135] Migrating restricted page, SRC pfn: 0x15684c, folio_ref_count: 2, folio_order: 0
[ 904.470154] BUG: unable to handle page fault for address: ffff88815684c000
[ 904.470314] kvm_restricted_mem_get_pfn: GFN: 0x1cafe00, PFN: 0x19f6d0, page: ffffea00081d2100, ref_count: 2
[ 904.477828] #PF: supervisor read access in kernel mode
[ 904.477831] #PF: error_code(0x0000) - not-present page
[ 904.477833] PGD 6601067 P4D 6601067 PUD 1569ad063 PMD 1569af063 PTE 800ffffea97b3060
[ 904.508806] Oops: 0000 [#1] SMP NOPTI
[ 904.512892] CPU: 52 PID: 1563 Comm: kcompactd0 Tainted: G E 6.0.0-rc7-hsnp-v7pfdv9d+ #10
[ 904.523473] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1006B 08/20/2021
[ 904.532499] RIP: 0010:copy_page+0x7/0x10
[ 904.536877] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[ 904.557831] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[ 904.563661] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[ 904.571622] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[ 904.579581] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[ 904.587541] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[ 904.595502] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[ 904.603462] FS: 0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[ 904.612489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 904.618897] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[ 904.626855] PKRU: 55555554
[ 904.629870] Call Trace:
[ 904.632594] <TASK>
[ 904.634928] ? folio_copy+0x8c/0xe0
[ 904.638818] migrate_folio+0x5b/0x110
[ 904.642901] move_to_new_folio+0x5b/0x150
[ 904.647371] migrate_pages+0x11bb/0x1830
[ 904.651743] ? move_freelist_tail+0xc0/0xc0
[ 904.656406] ? isolate_freepages_block+0x470/0x470
[ 904.661749] compact_zone+0x681/0xda0
[ 904.665832] kcompactd_do_work+0x1b3/0x2c0
[ 904.670400] kcompactd+0x257/0x330
[ 904.674190] ? prepare_to_wait_event+0x120/0x120
[ 904.679338] ? kcompactd_do_work+0x2c0/0x2c0
[ 904.684098] kthread+0xcf/0xf0
[ 904.687501] ? kthread_complete_and_exit+0x20/0x20
[ 904.692844] ret_from_fork+0x22/0x30
[ 904.696830] </TASK>
[ 904.699262] Modules linked in: nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) ip6table_mangle(E) ip6table_nat(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) bpfilter(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) bridge(E) stp(E) llc(E) kvm_amd(E) overlay(E) nls_iso8859_1(E) kvm(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) ipmi_si(E) ipmi_devintf(E) wmi_bmof(E) ipmi_msghandler(E) efi_pstore(E) binfmt_misc(E) ast(E) drm_vram_helper(E) joydev(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) input_leds(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ccp(E) k10temp(E) mac_hid(E) sch_fq_codel(E) parport_pc(E) ppdev(E) lp(E) parport(E) drm(E) ip_tables(E)
[ 904.699316] x_tables(E) autofs4(E) btrfs(E) blake2b_generic(E) zstd_compress(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) multipath(E) linear(E) crc32_pclmul(E) hid_generic(E) usbhid(E) hid(E) e1000e(E) i2c_piix4(E) wmi(E)
[ 904.828498] CR2: ffff88815684c000
[ 904.832193] ---[ end trace 0000000000000000 ]---
[ 904.937159] RIP: 0010:copy_page+0x7/0x10
[ 904.941524] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[ 904.962478] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[ 904.968305] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[ 904.976265] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[ 904.984227] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[ 904.992187] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[ 905.000145] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[ 905.008105] FS: 0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[ 905.017132] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 905.023540] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[ 905.031501] PKRU: 55555554
[ 905.034558] kvm_restricted_mem_get_pfn: GFN: 0x1cafe01, PFN: 0x19f6d1, page: ffffea00081d2150, ref_count: 2
[ 905.045455] kvm_restricted_mem_get_pfn: GFN: 0x1cafe02, PFN: 0x19f6d2, page: ffffea00081d21a0, ref_count: 2
...