Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO

From: Kirill A. Shutemov
Date: Fri Mar 11 2022 - 12:18:56 EST


On Thu, Mar 10, 2022 at 09:53:01AM -0800, Dave Hansen wrote:
> On 3/10/22 08:48, Kirill A. Shutemov wrote:
> I think this is good enough:
>
> - All guest code is expected to be in TD-private memory. Being
> private to the TD, VMMs have no way to access TD-private memory and
> no way to read the instruction to decode and emulate it.

Looks good.

One remark: executing from shared memory (or walking page tables in shared
memory) triggers #PF.

>
> We don't have to rehash what private memory is or how it is implemented.
>
> >> This problem was laid out as having three cases:
> >> 1. virtio
> >> 2. x86-specific drivers
> >> 3. random drivers (everything else)
> >>
> >> #1 could be done with paravirt
> >> #2 is unspecified and unknown
> >> #3 use doesn't as far as I know exist in TDX guests today
> >
> > #2 doesn't matter from performance point of view and there is no
> > convenient place where they can be intercepted as they are scattered
> > across the tree. Patching them doesn't bring any benefit, only pain.
>
> I'd feel a lot better if this was slightly better specified. Even
> booting with a:
>
> printf("rip: %lx\n", regs->rip);
>
> in the #VE handler would give some hard data about these. This still
> feels to me like something that Sean got booting two years ago and
> nobody has really reconsidered.

Here the list I see on boot. It highly depends on QEMU setup. Any form of
device filtering will cut the further.

MMIO: ahci_enable_ahci
MMIO: ahci_freeze
MMIO: ahci_init_controller
MMIO: ahci_port_resume
MMIO: ahci_postreset
MMIO: ahci_reset_controller
MMIO: ahci_save_initial_config
MMIO: ahci_scr_read
MMIO: ahci_scr_write
MMIO: ahci_set_em_messages
MMIO: ahci_start_engine
MMIO: ahci_stop_engine
MMIO: ahci_thaw
MMIO: ioapic_set_affinity
MMIO: ioread16
MMIO: ioread32
MMIO: ioread8
MMIO: iowrite16
MMIO: iowrite32
MMIO: iowrite8
MMIO: mask_ioapic_irq
MMIO: mp_irqdomain_activate
MMIO: mp_irqdomain_deactivate
MMIO: native_io_apic_read
MMIO: __pci_enable_msix_range
MMIO: pci_mmcfg_read
MMIO: pci_msi_mask_irq
MMIO: pci_msi_unmask_irq
MMIO: __pci_write_msi_msg
MMIO: restore_ioapic_entries
MMIO: startup_ioapic_irq
MMIO: update_no_reboot_bit_mem

ioread*/iowrite* comes from virtio.

> > #3 some customers already declared that they will use device passthough
> > (yes, it is not safe). CSP may want to emulate random device, depending on
> > setup. Like, a power button or something.
>
> I'm not sure I'm totally on board with that.
>
> But, let's try to make a coherent changelog out of that mess.
>
> This approach is bad for performance. But, it has (virtually)
> no impact on the size of the kernel image and will work for a
> wide variety of drivers. This allows TDX deployments to use
> arbitrary devices and device drivers, including virtio. TDX
> customers have asked for the capability to use random devices in
> their deployments.
>
> In other words, even if all of the work was done to
> paravirtualize all x86 MMIO users and virtio, this approach
> would still be needed. There is essentially no way to get rid
> of this code.
>
> This approach is functional for all in-kernel MMIO users current
> and future and does so with a minimal amount of code and kernel
> image bloat.
>
> Does that summarize it?

I will integrate it in the commit message.

> But, I'd like to see one of two things:
> 1. Change the BUG()s to WARN()s.
> 2. Make it utterly clear that handle_mmio() is for handling kernel MMIO
> only. Definitely change the naming, possibly add a check for
> user_mode(). In other words, make it even _less_ generic.
>
> None of that should be hard.

Okay, I will downgrade BUG() to WARN() and return false for user_mode()
with warning.

> BTW, the BUG()s made me think about how the gp_try_fixup_and_notify()
> code would work for MMIO. For instance, are there any places where
> fixup might be done for MMIO? If so, an earlier BUG() wouldn't allow
> the fixup to occur.

I can be wrong, but I don't think we do fixups for MMIO.

> Do we *WANT* #VE's to be exposed to the #GP fixup machinery?

We need the fixup at least for MSRs.

--
Kirill A. Shutemov