Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
From: Sean Christopherson
Date: Wed Jan 24 2024 - 14:05:14 EST
On Sat, Jan 20, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> > Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> > completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> > KVM must ensure that none of its workqueue callbacks is running when the
> > last reference to the KVM _module_ is put. Gifting a reference to the
> > associated VM prevents the workqueue callback from dereferencing freed
> > vCPU/VM memory, but does not prevent the KVM module from being unloaded
> > before the callback completes.
> >
> > Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> > async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> > result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
> > finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> >
> > WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> > Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> > CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > Workqueue: events async_pf_execute [kvm]
> > RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> > Call Trace:
> > <TASK>
> > async_pf_execute+0x198/0x260 [kvm]
> > process_one_work+0x145/0x2d0
> > worker_thread+0x27e/0x3a0
> > kthread+0xba/0xe0
> > ret_from_fork+0x2d/0x50
> > ret_from_fork_asm+0x11/0x20
> > </TASK>
> > ---[ end trace 0000000000000000 ]---
> > INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> > Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
> > Workqueue: events async_pf_execute [kvm]
> > Call Trace:
> > <TASK>
> > __schedule+0x33f/0xa40
> > schedule+0x53/0xc0
> > schedule_timeout+0x12a/0x140
> > __wait_for_common+0x8d/0x1d0
> > __flush_work.isra.0+0x19f/0x2c0
> > kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> > kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> > kvm_put_kvm+0x1c1/0x320 [kvm]
> > async_pf_execute+0x198/0x260 [kvm]
> > process_one_work+0x145/0x2d0
> > worker_thread+0x27e/0x3a0
> > kthread+0xba/0xe0
> > ret_from_fork+0x2d/0x50
> > ret_from_fork_asm+0x11/0x20
> > </TASK>
> >
> > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> > then there's no need to gift async_pf_execute() a reference because all
> > invocations of async_pf_execute() will be forced to complete before the
> > vCPU and its VM are destroyed/freed. And that in turn fixes the module
> > unloading bug as __fput() won't do module_put() on the last vCPU reference
> > until the vCPU has been freed, e.g. if closing the vCPU file also puts the
>
> I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
> we just need to say that vCPUs are freed before module_put(KVM the module)
> in kvm_destroy_vm(), then the whole logic for module unloading fix is:
>
> 1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
> 2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
> 3. vCPUs must be freed before module_put(KVM the module).
>
> So all workqueue callbacks complete before module_put(KVM the module).
>
>
> __fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
> makes me distracted from reason of the fix.
My goal was to call out that (a) the vCPU file descriptor is what ensures kvm.ko
is alive at this point and (b) that __fput() very deliberately ensures module_put()
is called after all module function callbacks/hooks complete, as there was quite
a bit of confusion around who/what can safely do module_put().
> > last reference to the KVM module.
> >
> > Note that kvm_check_async_pf_completion() may also take the work item off
> > the completion queue and so also needs to flush the work queue, as the
> > work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting
> > on the workqueue could theoretically delay a vCPU due to waiting for the
> > work to complete, but that's a very, very small chance, and likely a very
> > small delay. kvm_arch_async_page_present_queued() unconditionally makes a
> > new request, i.e. will effectively delay entering the guest, so the
> > remaining work is really just:
> >
> > trace_kvm_async_pf_completed(addr, cr2_or_gpa);
> >
> > __kvm_vcpu_wake_up(vcpu);
> >
> > mmput(mm);
> >
> > and mmput() can't drop the last reference to the page tables if the vCPU is
> > still alive, i.e. the vCPU won't get stuck tearing down page tables.
> >
> > Add a helper to do the flushing, specifically to deal with "wakeup all"
> > work items, as they aren't actually work items, i.e. are never placed in a
> > workqueue. Trying to flush a bogus workqueue entry rightly makes
> > __flush_work() complain (kudos to whoever added that sanity check).
> >
> > Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> > freed") *tried* to fix the module refcounting issue by having VMs grab a
> > reference to the module, but that only made the bug slightly harder to hit
> > as it gave async_pf_execute() a bit more time to complete before the KVM
> > module could be unloaded.
> >
> > Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> > Cc: stable@xxxxxxxxxxxxxxx
> > Cc: David Matlack <dmatlack@xxxxxxxxxx>
> > Cc: Xu Yilun <yilun.xu@xxxxxxxxxxxxxxx>
> > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > ---
> > virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
> > 1 file changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > index e033c79d528e..876927a558ad 100644
> > --- a/virt/kvm/async_pf.c
> > +++ b/virt/kvm/async_pf.c
> > @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
> > __kvm_vcpu_wake_up(vcpu);
> >
> > mmput(mm);
> > - kvm_put_kvm(vcpu->kvm);
> > +}
> > +
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item itself,
> > + * i.e. async_pf_execute(), to run to completion. If KVM is a module,
> > + * KVM must ensure *no* code owned by the KVM (the module) can be run
> > + * after the last call to module_put(), i.e. after the last reference
> > + * to the last vCPU's file is put.
>
> Maybe drop the i.e? It is not exactly true, other components like VFIO
> may also be the last one to put KVM reference?
Ah, yeah, agreed. I'll drop that last snippet, it doesn't and much value.