Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

From: Alex Bennée
Date: Wed Jul 31 2024 - 07:41:40 EST


Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Thu, Feb 29, 2024, David Stevens wrote:
>> From: David Stevens <stevensd@xxxxxxxxxxxx>
>>
>> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
>> that is backed by struct pages that aren't currently being refcounted
>> (e.g. tail pages of non-compound higher order allocations) into the
>> guest.
>>
>> Our use case is virtio-gpu blob resources [1], which directly map host
>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>> This feature currently does not work on systems using the amdgpu driver,
>> as that driver allocates non-compound higher order pages via
>> ttm_pool_alloc_page().
>>
>> First, this series replaces the gfn_to_pfn_memslot() API with a more
>> extensible kvm_follow_pfn() API. The updated API rearranges
>> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
>> bool arguments into a FOLL_ flags argument. The refactoring changes do
>> not change any behavior.
>>
>> From there, this series extends the kvm_follow_pfn() API so that
>> non-refcounted pages can be safely handled. This involves adding an
>> input parameter to indicate whether the caller can safely use
>> non-refcounted pfns and an output parameter to tell the caller whether
>> or not the returned page is refcounted. This is a breaking change:
>> non-refcounted pfn mappings are disallowed by default, as such mappings
>> are unsafe. To allow systems that rely on them to continue to function,
>> an opt-in module parameter is added to allow the unsafe behavior.
>>
>> This series only adds support for non-refcounted pages to x86. Other
>> MMUs can likely be updated without too much difficulty, but it is not
>> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
>> straightforward [2].
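
For readers skimming the cover letter, the API shape it describes can be
sketched roughly as below. This is a userspace toy model based only on the
text above, not code from the series: every toy_*/TOY_* name, the flat
page table, and the exact policy check are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Everything named toy_* / TOY_* is hypothetical and exists only in this sketch. */

#define TOY_FOLL_WRITE 0x1u          /* stand-in for a FOLL_* bit; value is arbitrary */
#define TOY_PFN_ERR    UINT64_MAX    /* stand-in for an error pfn */

/* A host page reduced to the one property that matters here. */
struct toy_page {
	uint64_t pfn;
	bool refcounted;   /* false models e.g. a tail page of a non-compound allocation */
};

/* Former positional arguments gathered into one struct, plus the new in/out fields. */
struct toy_follow_pfn {
	/* inputs */
	uint64_t gfn;
	unsigned int flags;         /* bool arguments packed into flag bits; unused by this toy */
	bool allow_non_refcounted;  /* caller says it can handle such pages safely */
	/* output */
	bool is_refcounted;         /* tells the caller whether the page is refcounted */
};

/* Models the opt-in module parameter; off by default. */
static bool toy_allow_unsafe_mappings = false;

/* Resolve gfn -> pfn against a flat table where the gfn is simply the index. */
static uint64_t toy_follow_pfn(struct toy_follow_pfn *kfp,
			       const struct toy_page *pages, size_t npages)
{
	assert(kfp != NULL && pages != NULL);

	if (kfp->gfn >= npages)
		return TOY_PFN_ERR;

	const struct toy_page *page = &pages[kfp->gfn];

	/* Default policy: refuse non-refcounted pages unless someone opted in. */
	if (!page->refcounted && !kfp->allow_non_refcounted &&
	    !toy_allow_unsafe_mappings)
		return TOY_PFN_ERR;

	kfp->is_refcounted = page->refcounted;
	return page->pfn;
}
```

In this model a caller that leaves allow_non_refcounted unset gets an error
for a non-refcounted page unless the global opt-in is enabled, which is the
"breaking change plus escape hatch" behaviour the cover letter describes.
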
>
> FYI, on the off chance that someone else is eyeballing this, I am working on
> revamping this series. It's still a ways out, but I'm optimistic that we'll be
> able to address the concerns raised by Christoph and Christian, and maybe even
> get KVM out of the weeds straightaway (PPC looks thorny :-/).

I've applied this series to the latest 6.9.x while attempting to
diagnose some of the virtio-gpu problems it may or may not address.
However, launching KVM guests keeps triggering a bunch of BUGs that
eventually leave a hung guest:

12:16:54 [root@draig:~] # dmesg -c
[252080.141629] RAX: ffffffffffffffda RBX: 0000560a64915500 RCX: 00007faa23e81c5b
[252080.141629] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[252080.141630] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
[252080.141630] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[252080.141631] R13: 0000000000000001 R14: 00000000000000b2 R15: 0000000000000002
[252080.141632] </TASK>
[252080.141632] BUG: Bad page state in process CPU 0/KVM pfn:fb1665
[252080.141633] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0x7fa8117c3 pfn:0xfb1665
[252080.141633] flags: 0x17ffffc00a000c(referenced|uptodate|mappedtodisk|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
[252080.141634] page_type: 0x0()
[252080.141635] raw: 0017ffffc00a000c dead000000000100 dead000000000122 0000000000000000
[252080.141635] raw: 00000007fa8117c3 0000000000000000 0000000000000000 0000000000000000
[252080.141635] page dumped because: nonzero mapcount
[252080.141636] Modules linked in: vhost_net vhost vhost_iotlb tap tun uas usb_storage veth cfg80211 nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter nft_masq
wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm snd_seq_dummy snd_hrtimer snd_seq
xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink
bridge stp llc qrtr overlay cmac algif_hash algif_skcipher af_alg bnep binfmt_misc squashfs snd_hda_codec_hdmi intel_uncore_frequency snd_ctl_led intel_uncore_frequency_common
ledtrig_audio x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common kvm_intel soundwire_intel soundwire_generic_allocation btusb
snd_sof_intel_hda_mlink sd_mod soundwire_cadence btrtl snd_hda_codec_realtek kvm sg snd_sof_intel_hda btintel snd_sof_pci btbcm snd_hda_codec_generic btmtk
[252080.141656] snd_sof_xtensa_dsp crc32_pclmul bluetooth snd_hda_scodec_component ghash_clmulni_intel snd_sof sha256_ssse3 sha1_ssse3 snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core
snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress soundwire_bus sha3_generic jitterentropy_rng aesni_intel snd_hda_intel snd_intel_dspcfg crypto_simd
sha512_ssse3 snd_intel_sdw_acpi cryptd sha512_generic uvcvideo snd_hda_codec snd_usb_audio videobuf2_vmalloc uvc ctr videobuf2_memops snd_hda_core snd_usbmidi_lib videobuf2_v4l2
snd_rawmidi drbg snd_hwdep dell_wmi snd_seq_device nls_ascii ahci ansi_cprng iTCO_wdt processor_thermal_device_pci videodev nls_cp437 snd_pcm intel_pmc_bxt dell_smbios
libahci processor_thermal_device rapl rtsx_pci_sdmmc iTCO_vendor_support ecdh_generic mmc_core mei_hdcp watchdog libata intel_rapl_msr videobuf2_common rfkill vfat processor_thermal_wt_hint
pl2303 snd_timer dcdbas dell_wmi_ddv dell_wmi_sysman processor_thermal_rfim ucsi_acpi fat intel_cstate usbserial intel_uncore cdc_acm mc battery ecc
[252080.141670] firmware_attributes_class dell_wmi_descriptor wmi_bmof dell_smm_hwmon processor_thermal_rapl pcspkr scsi_mod mei_me intel_lpss_pci snd typec_ucsi igc e1000e
i2c_i801 rtsx_pci intel_rapl_common intel_lpss roles mei soundcore processor_thermal_wt_req i2c_smbus idma64 scsi_common processor_thermal_power_floor typec processor_thermal_mbox
button intel_pmc_core int3403_thermal int340x_thermal_zone intel_vsec pmt_telemetry intel_hid int3400_thermal pmt_class sparse_keymap acpi_tad acpi_pad acpi_thermal_rel
msr parport_pc ppdev lp parport fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_microsoft joydev ff_memless hid_generic usbhid
hid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq evdev dm_mod i915 i2c_algo_bit drm_buddy ttm drm_display_helper xhci_pci xhci_hcd drm_kms_helper nvme nvme_core
drm t10_pi usbcore video crc64_rocksoft crc64 crc_t10dif cec crct10dif_generic crct10dif_pclmul crc32c_intel rc_core usb_common crct10dif_common wmi
[252080.141686] pinctrl_alderlake
[252080.141686] CPU: 8 PID: 1819169 Comm: CPU 0/KVM Tainted: G B W 6.9.12-ajb-00008-gfcd4b7efbad0 #17
[252080.141687] Hardware name: Dell Inc. Precision 3660/0PRR48, BIOS 2.8.1 08/14/2023
[252080.141688] Call Trace:
[252080.141688] <TASK>
[252080.141688] dump_stack_lvl+0x60/0x80
[252080.141689] bad_page+0x70/0x100
[252080.141690] free_unref_page_prepare+0x22a/0x370
[252080.141692] free_unref_folios+0xe5/0x340
[252080.141693] ? __mem_cgroup_uncharge_folios+0x7a/0xa0
[252080.141694] folios_put_refs+0x147/0x1e0
[252080.141696] ? __pfx_lru_add_fn+0x10/0x10
[252080.141697] folio_batch_move_lru+0xc8/0x140
[252080.141699] folio_add_lru+0x51/0xa0
[252080.141700] do_wp_page+0x4dd/0xb60
[252080.141701] __handle_mm_fault+0xb2a/0xe30
[252080.141703] handle_mm_fault+0x18c/0x320
[252080.141704] __get_user_pages+0x164/0x6f0
[252080.141705] get_user_pages_unlocked+0xe2/0x370
[252080.141706] hva_to_pfn+0xa0/0x740 [kvm]
[252080.141724] kvm_faultin_pfn+0xf3/0x5f0 [kvm]
[252080.141750] kvm_tdp_page_fault+0x100/0x150 [kvm]
[252080.141774] kvm_mmu_page_fault+0x27e/0x7f0 [kvm]
[252080.141798] ? em_rsm+0xad/0x170 [kvm]
[252080.141823] ? writeback_registers+0x44/0x80 [kvm]
[252080.141848] ? vmx_set_cr0+0xc7/0x1320 [kvm_intel]
[252080.141853] ? x86_emulate_insn+0x484/0xe60 [kvm]
[252080.141877] ? vmx_vmexit+0x6e/0xd0 [kvm_intel]
[252080.141882] ? vmx_vmexit+0x99/0xd0 [kvm_intel]
[252080.141887] vmx_handle_exit+0x129/0x930 [kvm_intel]
[252080.141892] kvm_arch_vcpu_ioctl_run+0x682/0x15b0 [kvm]
[252080.141918] kvm_vcpu_ioctl+0x23d/0x6f0 [kvm]
[252080.141936] ? __seccomp_filter+0x32f/0x500
[252080.141937] ? kvm_io_bus_read+0x42/0xd0 [kvm]
[252080.141956] __x64_sys_ioctl+0x90/0xd0
[252080.141957] do_syscall_64+0x80/0x190
[252080.141958] ? kvm_arch_vcpu_put+0x126/0x160 [kvm]
[252080.141982] ? vcpu_put+0x1e/0x50 [kvm]
[252080.141999] ? kvm_arch_vcpu_ioctl_run+0x757/0x15b0 [kvm]
[252080.142023] ? kvm_vcpu_ioctl+0x29e/0x6f0 [kvm]
[252080.142040] ? __seccomp_filter+0x32f/0x500
[252080.142042] ? kvm_on_user_return+0x60/0x90 [kvm]
[252080.142065] ? fire_user_return_notifiers+0x30/0x60
[252080.142066] ? syscall_exit_to_user_mode+0x73/0x200
[252080.142067] ? do_syscall_64+0x8c/0x190
[252080.142068] ? kvm_on_user_return+0x60/0x90 [kvm]
[252080.142090] ? fire_user_return_notifiers+0x30/0x60
[252080.142091] ? syscall_exit_to_user_mode+0x73/0x200
[252080.142092] ? do_syscall_64+0x8c/0x190
[252080.142093] ? do_syscall_64+0x8c/0x190
[252080.142094] ? do_syscall_64+0x8c/0x190
[252080.142095] ? exc_page_fault+0x72/0x170
[252080.142096] entry_SYSCALL_64_after_hwframe+0x76/0x7e

This backtrace repeats for a large number of pfns.
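
To put a number on how many pfns are affected, the pfn can be pulled out
of each "BUG: Bad page state" line in the dmesg output. A minimal sketch
follows; parse_bad_page_pfn() is a hypothetical helper written for this
example, and it assumes the line format shown in the log above (pfn
printed in hex after "pfn:").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * If `line` is a "BUG: Bad page state" header, store the pfn that follows
 * "pfn:" (hex, with or without a 0x prefix) in *pfn and return true.
 * Any other line returns false and leaves *pfn untouched.
 */
static bool parse_bad_page_pfn(const char *line, unsigned long *pfn)
{
	assert(line != NULL && pfn != NULL);

	const char *p = strstr(line, "BUG: Bad page state");
	if (!p)
		return false;

	p = strstr(p, " pfn:");
	if (!p)
		return false;
	p += strlen(" pfn:");

	char *end;
	unsigned long val = strtoul(p, &end, 16);
	if (end == p)
		return false;   /* no hex digits after "pfn:" */

	*pfn = val;
	return true;
}
```

Calling this on every line of the log and counting the distinct values it
yields would show whether the affected pfns form one contiguous range or
are scattered.
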

--
Alex Bennée
Virtualisation Tech Lead @ Linaro