Re: [PATCH] PCI: Allow drivers to claim exclusive access to config regions

From: Dan Williams
Date: Thu Mar 25 2021 - 13:44:33 EST


On Wed, Mar 24, 2021 at 11:55 PM Greg Kroah-Hartman
<gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote:
> > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over
> > configuration cycles. It assumes one initiator at a time is
> > reading/writing the data registers.
>
> That sounds like a horrible protocol for a multi-processor system.
> Where is it described and who can we go complain to for creating such a
> mess?

It appears it was added to the PCIE specification in March of last
year, I don't attend those meetings. I only learned about it since the
CXL specification adopted it.

>
> > If userspace reads from the response
> > data payload it may steal data that a kernel driver was expecting to
> > read. If userspace writes to the request payload it may corrupt the
> > request a driver was trying to send.
>
> Fun! So you want to keep root in userspace from doing this? I thought
> we already do that today?

The only limitation I found was temporary locking via
pci_cfg_access_lock(), and some limitations on max config offset, not
permanent access disable.

>
> > Introduce pci_{request,release}_config_region() for a driver to exclude
> > the possibility of userspace induced corruption while accessing the DOE
> > mailbox. Likely there are other configuration state assumptions that a
> > driver may want to assert are under its exclusive control, so this
> > capability is not limited to any specific configuration range.
>
> As you do not have a user for these functions, it's hard to see how they
> would be used. We also really can't add new apis with no in-tree users,
> so do you have a patch series that requires this functionality
> somewhere?

Whoops, I buried the lead here. This is in reaction to / support of
Jonathan's efforts to use this mailbox to retrieve a blob of memory
characteristics data from CXL devices called the CDAT [1]. That blob
is used to populate / extend ACPI SRAT/HMAT/SLIT data for CXL attached
memory. So while I was reviewing his patches it occurred to me that
the b0rked nature of this interface relative to pci-sysfs needed to be
addressed. This should wait to go in with his series.

[1]: https://lore.kernel.org/linux-acpi/20210310180306.1588376-2-Jonathan.Cameron@xxxxxxxxxx/

>
> > Since writes are targeted and are already prepared for failure the
> > entire request is failed. The same can not be done for reads as the
> > device completely disappears from lspci output if any configuration
> > register in the request is exclusive. Instead skip the actual
> > configuration cycle on a per-access basis and return all f's as if the
> > read had failed.
>
> returning all ff is a huge hint to many drivers that the device is gone,
> not that it just failed. So what happens to code that thinks that and
> then tears stuff down as if the device has been removed?

This is limited to the pci_user_* family of accessors, kernel drivers
should be unaffected. The protection for kernel drivers colliding is
the same as request_mem_region() coordination.

Userspace drivers will of course be horribly confused, but those
should not be binding to devices that are claimed by a kernel driver
in the first place.

> Trying to protect drivers from userspace here feels odd, what userspace
> tools are trying to access these devices while they are under
> "exclusive" control from the kernel? lspci not running as root should
> not be doing anything crazy, but if you want to run it as root,
> shouldn't you be allowed to access it properly?

The main concern is unwanted userspace reads. An inopportune "hexdump
/sys/bus/pci/devices/$device/config" will end up reading the DOE
payload register and advancing the device state machine surprising the
kernel iterator that might be reading the payload.

If root really wants to read that portion of config space it can also
unload the driver similar to the policy for /dev/mem colliding with
exclusive device-mmio... although that raises the question how would
root know. At least for exclusive /dev/mem /proc/iomem can show who is
claiming that resource.

If userspace needs to submit DOE requests then that should probably be
a proper generic driver to submit requests, not raw pci config access.

> What hardware has this problem that we need to claim exclusive ownership
> over that differs from the old hardware we used to have that would do
> crazy things when reading from from userspace? We had this problem a
> long time ago and lived with it, what changed now?

All I can infer from the comments in drivers/pci/access.c is "bad
things happens on some devices if you allow reads past a certain
offset", but no concern for reads for offsets less than
pdev->cfg_size. I think what's changed is that this is the first time
Linux has had to worry about an awkward polled I/O data transfer
protocol over config cycles.

To make matters worse there appears to be a proliferation of protocols
being layered on top of DOE. In addition to CXL Table Access for CDAT
retrieval [2] I'm aware of CXL Compliance Testing [3], Integrity and
Data Encryption (IDE) [4], and Component Measurement and
Authentication (CMA) [5].

I've not read those, but I worry security_locked_down() may want to
prevent even root userspace mucking with "security" interfaces. So
that *might* be a reason to ensure exclusive kernel access beyond the
basic sanity of the kernel being able to have uninterrupted request /
response sessions with this mailbox

[2]: https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
[3]: https://www.computeexpresslink.org/download-the-specification
[4]: https://members.pcisig.com/wg/PCI-SIG/document/15149
[5]: https://members.pcisig.com/wg/PCI-SIG/document/14236