Re: [PATCH 01/12] pci-p2p: Support peer to peer memory

From: Bjorn Helgaas
Date: Thu Jan 04 2018 - 16:40:42 EST


Run "git log --oneline drivers/pci" and follow the convention. I
think it would make sense to add a new tag like "PCI/P2P", although
"P2P" has historically also been used in the "PCI-to-PCI bridge"
context, so maybe there's something less ambiguous. "P2PDMA"?

When you add new files, I guess we're looking for the new SPDX
copyright stuff?

On Thu, Jan 04, 2018 at 12:01:26PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
>
> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.
>
> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
>
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial.

It's more than "non-trivial" or "with good performance" (from Kconfig
help), isn't it? AFAIK, there's no standard way at all to discover
whether P2P DMA is supported between root ports or RCs.

> +config PCI_P2P
> + bool "PCI Peer to Peer transfer support"
> + depends on ZONE_DEVICE
> + select GENERIC_ALLOCATOR
> + help
> + EnableÑ drivers to do PCI peer to peer transactions to and from
> + bars that are exposed to other devices in the same domain.

s/bars/BARs/ (and similarly below, except in C code)

Similarly, s/dma/DMA/ and s/pci/PCI/ below.

And probably also s/p2p/peer-to-peer DMA/ in messages.

Maybe clarify this domain bit. Using "domain" suggests the common PCI
segment/domain usage, but I think you really mean something like the
part of the hierarchy where peer-to-peer DMA is guaranteed by the PCI
spec to work, i.e., anything below a single PCI bridge.

> +
> + Many PCIe root complexes do not support P2P transactions and
> + it's hard to tell which support it with good performance, so
> + at this time you will need a PCIe switch.
> +
> + If unsure, say N.

> + * pci_p2pmem_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to
> + * @bar: PCI bar to add
> + * @size: size of the memory to add, may be zero to use the whole bar
> + * @offset: offset into the PCI bar
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.
> + */
> +int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
> + u64 offset)
> +{
> + struct dev_pagemap *pgmap;
> + void *addr;
> + int error;

Seems like there should be

if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
return -EINVAL;

or similar here?

> + if (WARN_ON(offset >= pci_resource_len(pdev, bar)))
> + return -EINVAL;

Are these WARN_ONs for debugging purposes, or do you think we need
them in production? Granted, hitting it would probably be a kernel
driver bug, but still, not sure if the PCI core needs to coddle the
driver author that much.

> + if (!size)
> + size = pci_resource_len(pdev, bar) - offset;
> +
> + if (WARN_ON(size + offset > pci_resource_len(pdev, bar)))
> + return -EINVAL;
> +
> + if (!pdev->p2p) {
> + error = pci_p2pmem_setup(pdev);
> + if (error)
> + return error;
> + }
> +
> + pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
> + return -ENOMEM;
> +
> + pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> + pgmap->res.end = pgmap->res.start + size - 1;

I'm guessing Christoph's dev_pagemap revamp repo must change
pgmap->res from a pointer to a structure, but I don't see the actual
link in your cover letter.

I think you should set pgmap->res.flags here, too.

> + pgmap->ref = &pdev->p2p->devmap_ref;
> + pgmap->type = MEMORY_DEVICE_PCI_P2P;
> +
> + addr = devm_memremap_pages(&pdev->dev, pgmap);
> + if (IS_ERR(addr))
> + return PTR_ERR(addr);
> +
> + error = gen_pool_add_virt(pdev->p2p->pool, (uintptr_t)addr,
> + pci_bus_address(pdev, bar) + offset,
> + resource_size(&pgmap->res), dev_to_node(&pdev->dev));
> + if (error)
> + return error;
> +
> + error = devm_add_action_or_reset(&pdev->dev, pci_p2pmem_percpu_kill,
> + &pdev->p2p->devmap_ref);
> + if (error)
> + return error;
> +
> + dev_info(&pdev->dev, "added %zdB of p2p memory\n", size);

Can we add %pR and print pgmap->res itself, too?

> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_add_resource);

> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge:
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_switch_port(struct pci_dev *pdev)
> +{
> + struct pci_dev *up1, *up2;
> +
> + if (!pdev)
> + return NULL;
> +
> + up1 = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!up1)
> + return NULL;
> +
> + up2 = pci_dev_get(pci_upstream_bridge(up1));
> + pci_dev_put(up1);
> +
> + return up2;
> +}
> +
> +static bool __upstream_bridges_match(struct pci_dev *upstream,
> + struct pci_dev *client)
> +{
> + struct pci_dev *dma_up;
> + bool ret = true;
> +
> + dma_up = get_upstream_switch_port(client);
> +
> + if (!dma_up) {
> + dev_dbg(&client->dev, "not a pci device behind a switch\n");

You have a bit of a mix of PCI ("pci device", "bridge") and PCIe
("switch", "switch port") terminology. I haven't read the rest of the
patches yet, so I don't know if you intend to restrict this to
PCIe-only, e.g., so you can use ACS, or if you want to make it
available on conventional PCI as well.

If the latter, I would use the generic PCI terminology, i.e., "bridge"
instead of "switch".

> + * pci_virt_to_bus - return the pci bus address for a given virtual
> + * address obtained with pci_alloc_p2pmem
> + * @pdev: the device the memory was allocated from
> + * @addr: address of the memory that was allocated
> + */
> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
> +{
> + if (!addr)
> + return 0;
> + if (!pdev->p2p)
> + return 0;
> +
> + return gen_pool_virt_to_phys(pdev->p2p->pool, (unsigned long)addr);

This doesn't seem right. A physical address is not the same as a PCI
bus address. I expected something like pci_bus_address() or
pcibios_resource_to_bus() here. Am I missing something? If so, a
clarifying comment would be helpful.

> + * pci_p2pmem_publish - publish the p2p memory for use by other devices
> + * with pci_p2pmem_find
> + * @pdev: the device with p2p memory to publish
> + * @publish: set to true to publish the memory, false to unpublish it
> + */
> +void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
> +{
> + if (WARN_ON(publish && !pdev->p2p))
> + return;

Same WARN_ON question -- is this really intended for production?
Doesn't seem like the end user can really do anything with the warning
information.

> diff --git a/include/linux/pci-p2p.h b/include/linux/pci-p2p.h
> new file mode 100644
> index 000000000000..f811c97a5886
> --- /dev/null
> +++ b/include/linux/pci-p2p.h
> @@ -0,0 +1,85 @@
> +#ifndef _LINUX_PCI_P2P_H
> +#define _LINUX_PCI_P2P_H
> +/*
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/pci.h>
> +
> +struct block_device;
> +struct scatterlist;

I've been noticing that we're accumulating PCI-related files in
include/linux: pci.h, pci-aspm.h pci-ats.h, pci-dma.h, pcieport_if.h,
etc. I'm not sure there's value in all those and am thinking maybe
they should just be folded into pci.h. What do you think?

> +#ifdef CONFIG_PCI_P2P
> +int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
> + u64 offset);
> +int pci_p2pmem_add_client(struct list_head *head, struct device *dev);
> ...

Bjorn