Re: [PATCH v2 0/3] kvm/mm: Allow GUP to respond to non fatal signals
From: Peter Xu
Date: Wed Aug 10 2022 - 15:38:47 EST
Any further comments? Thanks,
On Wed, Jul 20, 2022 at 08:03:15PM -0400, Peter Xu wrote:
> v2:
> - Added r-b
> - Rewrote the comment in faultin_page() for FOLL_INTERRUPTIBLE [John]
> - Dropped the controversial patch to introduce a flag for
> __gfn_to_pfn_memslot(), instead used a boolean for now [Sean]
> - Renamed s/is_sigpending_pfn/KVM_PFN_ERR_SIGPENDING/ [Sean]
> - Changed comment in kvm_faultin_pfn() mentioning fatal signals [Sean]
>
> rfc: https://lore.kernel.org/kvm/20220617014147.7299-1-peterx@xxxxxxxxxx
> v1: https://lore.kernel.org/kvm/20220622213656.81546-1-peterx@xxxxxxxxxx
>
> It was reported that libvirt cannot stop the virtual machine with the QMP
> command "stop" during a paused postcopy migration [1].
>
> It won't work because the "stop the VM" operation requires the hypervisor
> to kick all the vcpu threads out using SIG_IPI in QEMU (which is translated
> to a SIGUSR1). However, during a paused postcopy the vcpu threads are hung
> at handle_userfault(), so they simply do not respond to the kicks. Worse,
> the "stop" command then hangs the QMP channel as well.
>
> The mm has a facility to process generic signals (FAULT_FLAG_INTERRUPTIBLE),
> however it's used only in the PF handlers, not in GUP. Unluckily, KVM is a
> heavy GUP user on guest page faults. It means we won't be able to interrupt
> a long page fault for KVM fetching guest pages with what we have right now.
>
> I think it's reasonable for GUP to only listen to fatal signals, as most of
> the GUP users are not really ready to handle such a case. But KVM is not
> such a user: KVM actually has rich infrastructure to handle even generic
> signals, and to properly deliver the signal to the userspace. Then the
> page fault can be retried in the next KVM_RUN.
>
> This patchset adds FOLL_INTERRUPTIBLE to enable FAULT_FLAG_INTERRUPTIBLE,
> and lets KVM be the first one to use it. KVM and mm/gup have always been
> able to respond to fatal signals, but not to non-fatal ones until this
> patchset.
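To make the rule in the paragraph above concrete, here is a minimal userspace sketch of the decision it describes: a fatal signal always interrupts the fault, while a non-fatal signal interrupts it only when the caller opted in. This is a model for illustration, not the code in mm/gup.c. Every name prefixed with model_ or MODEL_ is hypothetical, and the flag value is arbitrary rather than the kernel's real FOLL_INTERRUPTIBLE bit.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the opt-in flag; the value is arbitrary. */
#define MODEL_FOLL_INTERRUPTIBLE 0x1u

/* Hypothetical model of the pending-signal state of the faulting task. */
struct model_task {
	bool fatal_pending;     /* e.g. SIGKILL is queued */
	bool nonfatal_pending;  /* e.g. the SIGUSR1 vcpu kick is queued */
};

/*
 * Return true if a slow page fault should be abandoned so the signal
 * can be handled and the fault retried later.
 */
static bool model_gup_should_abort(const struct model_task *t,
				   unsigned int flags)
{
	/* Fatal signals always interrupt, with or without the flag. */
	if (t->fatal_pending)
		return true;

	/* Callers that did not opt in ignore non-fatal signals. */
	if (!(flags & MODEL_FOLL_INTERRUPTIBLE))
		return false;

	/* Opted-in callers (KVM, in this series) also stop here. */
	return t->nonfatal_pending;
}
```

In this model a vcpu thread that has only the SIGUSR1 kick pending keeps waiting when the flag is clear, and returns for a retry when the flag is set, which is the behavior change the series is after.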
>
> One thing to mention is that this does not allow all KVM paths to respond
> to non-fatal signals, only x86 slow page faults. In the future, when more
> code is ready for handling signal interruptions, we can explore the
> possibility of having more gup callers use FOLL_INTERRUPTIBLE.
>
> Tests
> =====
>
> I created a postcopy environment, paused the migration by shutting down the
> network to emulate a network failure (so that handle_userfault() will be
> stuck for a long time), then I tried three things:
>
> (1) Sending QMP command "stop" to QEMU monitor,
> (2) Hitting Ctrl-C from QEMU cmdline,
> (3) GDB attach to the dest QEMU process.
>
> Before this patchset, all three use cases hang. After the patchset, all
> work just like when there's no network failure at all.
>
> Please have a look, thanks.
>
> [1] https://gitlab.com/qemu-project/qemu/-/issues/1052
--
Peter Xu