Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

From: Andi Kleen
Date: Sun Oct 10 2021 - 18:11:26 EST



On 10/9/2021 1:39 PM, Dan Williams wrote:
On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
From: Andi Kleen <ak@xxxxxxxxxxxxxxx>

For Confidential VM guests like TDX, the host is untrusted and hence
the devices emulated by the host or any data coming from the host
cannot be trusted. So the drivers that interact with the outside world
have to be hardened by sharing memory with host on need basis
with proper hardening fixes.

For the PCI driver case, to share the memory with the host add
pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.

Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
So I proposed to make all pci mappings shared, eliminating the need
to patch drivers.

To which Andi replied
One problem with removing the ioremap opt-in is that
it's still possible for drivers to get at devices without going through probe.

To which Greg replied:
https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@xxxxxxxxx/
If there are in-kernel PCI drivers that do not do this, they need to be
fixed today.

Can you guys resolve the differences here?
I agree with you and Greg here. If a driver is accessing hardware
resources outside of the bind lifetime of one of the devices it
supports, and in a way that neither modrobe-policy nor
device-authorization -policy infrastructure can block, that sounds
like a bug report.

The 5.15 tree has something like ~2.4k IO accesses (including MMIO and others) in init functions that also register drivers (thanks Elena for the number)

Some are probably old drivers that could be fixed, but it's quite a few legitimate cases. For example for platform or ISA drivers that's the only way they can be implemented because they often have no other enumeration mechanism. For PCI drivers it's rarer, but also still can happen. One example that comes to mind here is the x86 Intel uncore drivers, which support a mix of MSR, ioremap and PCI config space accesses all from the same driver. This particular example can (and should be) fixed in other ways, but similar things also happen in other drivers, and they're not all broken. Even for the broken ones they're usually for some crufty old devices that has very few users, so it's likely untestable in practice.

My point is just that the ecosystem of devices that Linux supports is messy enough that there are legitimate exceptions from the "First IO only in probe call only" rule.

And we can't just fix them all. Even if we could it would be hard to maintain.

Using a "firewall model" hooking into a few strategic points like we're proposing here is much saner for everyone.

Now we can argue about the details. Right now what we're proposing has some redundancies: it has both a device model filter and low level filter for ioremap (this patch and some others). The low level filter is for catching issues that don't clearly fit into the "enumeration<->probe" model. You could call that redundant, but I would call it defense in depth or better safe than sorry. In theory it would be enough to have the low level opt-in only, but that would have the drawback that is something gets enumerated after all you would have all kind of weird device driver failures and in some cases even killed guests. So I think it makes sense to have


Fix those drivers instead of sprinkling
ioremap_shared in select places and with unclear rules about when a
driver is allowed to do "shared" mappings.

Only add it when the driver has been audited and hardened.

But I agree we need on a documented process for this. I will work on some documentation for a proposal. But essentially I think it should be some variant of what Elena has outlined in her talk at Security Summit.

https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf

That is using extra auditing/scrutiny at review time, supported with some static code analysis that points to the interaction points, and code needs to be fuzzed explicitly.

However short term it's only three virtio drivers, so this is not a urgent problem.

Let the new
device-authorization mechanism (with policy in userspace)


Default policy in user space just seems to be a bad idea here. Who should know if a driver is hardened other than the kernel? Maintaining the list somewhere else just doesn't make sense to me.

Also there is the more practical problem that some devices are needed for booting. For example in TDX we can't print something to the console with this mechanism, so you would never get any output before the initrd. Just seems like a nightmare for debugging anything. There really needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to have a sane kernel default that works early.

-Andi