Re: [PATCH gmem FIXUP] kvm: guestmem: do not use a file system

From: Sean Christopherson
Date: Tue Oct 10 2023 - 19:30:30 EST


On Tue, Oct 10, 2023, Al Viro wrote:
> On Mon, Oct 09, 2023 at 05:27:04PM -0700, Sean Christopherson wrote:
>
> > If the last reference is effectively held by guest_memfd, it would be:
> >
> > kvm_gmem_release(), a.k.a. file_operations.release()
> > |
> > -> kvm_put_kvm()
> > |
> > -> kvm_destroy_vm()
> > |
> > -> module_put(kvm_chardev_ops.owner);
>
> ... and now your thread gets preempted and loses CPU; before you get
> it back, some joker calls delete_module(), and page of code containing
> kvm_gmem_release() is unmapped. Even though an address within that
> page is stored as return address in a frame on your thread's stack.
> That thread gets the timeslice again and proceeds to return into
> unmapped page. Oops...

*sigh*

What an absolute snafu. Sorry for the wall of text. Feel free to stop reading
after the "Back to KVM..." part below, it's just gory details on how we screwed
things up in KVM.

But one question before I dive into the KVM mess: other than error paths, how is
module_put(THIS_MODULE) ever safe without being superfluous? I don't see how a
module can put the last reference to itself without either hitting the above
scenario, or without creating deadlock. Something other than the module must put
the last reference, no?
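
For the sake of discussion, the pattern I can't see being safe is something like
this (hypothetical driver, not real code):

  static int foo_release(struct inode *inode, struct file *file)
  {
          kfree(file->private_data);

          module_put(THIS_MODULE);        /* may drop the last reference */

          /*
           * Preemption window: if delete_module() runs now, this module's
           * text can be unmapped before...
           */
          return 0;                       /* ...we return into that text. */
  }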

The only exceptions I see are:

1. if module_put() is called via async_run_entry_fn(), as delete_module() invokes
async_synchronize_full() before unmapping the module. But IIUC, the async
framework uses workqueues, not the other way around. I.e. delete_module()
doesn't flush arbitrary workqueues.

2. if module_put() is called via module_put_and_kthread_exit(), which uses
THIS_MODULE but does module_put() from a core kernel helper and never returns
to the module's kthread, i.e. doesn't return to module code.

But then this

$ git grep -E "module_put\(THIS_MODULE" | wc -l
132

makes me go "huh"? I've blamed a handful of those calls, and I can't find a single
one that provides any clue as to why the module gets/puts references to itself,
let alone actually justifies the usage.

E.g. drivers/block/loop.c has this gem

/* This is safe: open() is still holding a reference. */
module_put(THIS_MODULE);

in __loop_clr_fd(), which is invoked from a .release() function. So open() quite
clearly doesn't hold a reference, unless the comment is talking about the reference
that was obtained by the core file systems layer and won't be put until after
.release() completes. But then what on earth is the point of doing
module_get(THIS_MODULE) and module_put(THIS_MODULE)?
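
I.e. the pattern boils down to something like this (generic illustration, not
the actual loop code):

  static int foo_configure(struct file *file)
  {
          __module_get(THIS_MODULE);      /* "pin" ourselves */
          return 0;
  }

  static void foo_teardown(struct file *file)       /* called from .release() */
  {
          /* This is safe: open() is still holding a reference. */
          module_put(THIS_MODULE);
  }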


Back to KVM...

Commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") *tried*
to fix a bug where KVM-the-module could be unloaded while a KVM workqueue callback
was still in-flight. The callback had a reference to the VM, but not to the VM's
file representation.

After that commit went in, I suggested dropping the use of .owner for VMs and
vCPUs (each of which is represented by an anon inode file) because keeping the VM
alive would pin KVM-the-module until all VMs went away. But I missed the
obvious-in-hindsight issue Al highlighted above.

Fixing that particular wart is relatively easy: revert commit 70375c2d8fa3 ("Revert
"KVM: set owner of cpu and vm file operations""), and give all of the other KVM-owned
file_operations structures the same treatment by setting .owner correctly. Note,
"correctly" isn't THIS_MODULE in most cases, which is why the code existing is a
bit odd. For most file_operations, on x86 and PPC (and MIPS?), the effective owner
is actually a sub-module, e.g. THIS_MODULE will point at kvm.ko, but on x86 the
effective owner is either kvm-intel.ko or kvm-amd.ko (which holds a reference to
kvm.ko).
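
Concretely, that amounts to something like this (sketch from memory of the code,
not an actual diff):

  /* In kvm_init(), which already receives the arch module pointer: */
  kvm_chardev_ops.owner = module;         /* already done today */
  kvm_vm_fops.owner = module;             /* i.e. kvm-intel.ko or kvm-amd.ko on x86 */
  kvm_vcpu_fops.owner = module;
  kvm_device_fops.owner = module;         /* and so on for the other KVM fops */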

After staring and fiddling for most of today, I finally discovered that grabbing
a reference to the module on behalf of the work item didn't fix the actual bugs,
plural; it just shuffled the deck chairs on the Titanic. And as above, it set us
up to make even bigger mistakes regarding .owner :-(

The problematic code is effectively kvm_clear_async_pf_completion_queue(). That
helper is called for each vCPU when a VM is being destroyed, i.e. when the last
reference to a VM is put via kvm_put_kvm(). Clearing the queue *should* also
flush all work items, except it doesn't when the work is "done", where "done"
just means the page being faulted in is ready, not that the work item itself has
finished running. Using file_operations.owner doesn't solve anything, e.g. even
if async_pf_execute() were gifted a reference to the VM's file and used the
deferred fput(), the same preemption issue exists; it's just slightly harder to hit.

The original async #PF code appears to have fudged around the lack of flushing by
gifting a VM reference to async_pf_execute(). Or maybe it was the other way
around and not flushing was a workaround for the deadlock that occurs if
kvm_clear_async_pf_completion_queue() does flush the workqueue. If kvm_put_kvm()
is called from async_pf_execute() and kvm_put_kvm() flushes the async #PF workqueue,
deadlock occurs because async_pf_execute() can't return until kvm_put_kvm() finishes,
and kvm_put_kvm() can't return until async_pf_execute() finishes.

WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: events async_pf_execute [kvm]
RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
Call Trace:
<TASK>
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace 0000000000000000 ]---
INFO: task kworker/8:1:251 blocked for more than 120 seconds.
Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
Workqueue: events async_pf_execute [kvm]
Call Trace:
<TASK>
__schedule+0x33f/0xa40
schedule+0x53/0xc0
schedule_timeout+0x12a/0x140
__wait_for_common+0x8d/0x1d0
__flush_work.isra.0+0x19f/0x2c0
kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
kvm_put_kvm+0x1c1/0x320 [kvm]
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
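
Spelling out the cycle from that trace (simplified sketch, struct/field names
approximate):

  static void async_pf_execute(struct work_struct *work)
  {
          struct kvm_async_pf *apf = container_of(work, typeof(*apf), work);

          /* ... fault the page in ... */

          kvm_put_kvm(apf->vcpu->kvm);    /* gifted reference, may be the last */
          /*
           *   -> kvm_destroy_vm()
           *     -> kvm_arch_destroy_vm()
           *       -> kvm_clear_async_pf_completion_queue()
           *         -> __flush_work()   waits for this very work item, i.e.
           *                             for async_pf_execute() to return.
           */
  }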

If kvm_clear_async_pf_completion_queue() actually flushes the workqueue, then
there's no need to gift async_pf_execute() a reference because all invocations
of async_pf_execute() will be forced to complete before the vCPU and its VM are
destroyed/freed. And that also fixes the module unloading mess because __fput()
won't do module_put() on the last vCPU reference until the vCPU has been freed.
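
I.e. something along these lines (sketch of the idea only, not the attached
patches; locking and the existing queue cancellation elided):

  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
  {
          struct kvm_async_pf *work;

          /* Cancel and free anything that is still queued, as today... */

          /*
           * ...but also wait for items whose page is already "done" and whose
           * work function may therefore still be executing, instead of relying
           * on the gifted VM reference to keep the vCPU/VM alive.
           */
          list_for_each_entry(work, &vcpu->async_pf.done, link)
                  flush_work(&work->work);
  }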

The attached patches are lightly tested, but I think they fix the KVM mess. I
likely won't post a proper series until next week, as I'm going to be offline for
the next two days.