Re: [PATCH kernel v8 14/31] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control

From: Alexey Kardashevskiy
Date: Mon Apr 20 2015 - 02:55:44 EST


On 04/20/2015 12:44 PM, David Gibson wrote:
On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote:
On 04/16/2015 04:07 PM, David Gibson wrote:
On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.

The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds a set_ownership() callback to it which is called when an external
user takes control over the IOMMU.

Do you really need separate ops structures at both the single table
and table group level? The different tables in a group will all
belong to the same basic iommu won't they?


IOMMU tables exist alone in VIO. Also, the platform code uses just a table
(or it is in bypass mode) and does not care about table groups. It looked
more clean for myself to keep them separated. Should I still merge
those?

Ok, that sounds like a reasonable argument for keeping them separate,
at least for now.

This renames set_bypass() to set_ownership() as it is not necessarily
just enabling bypassing, it can be something else/more so let's give it
more generic name. The bool parameter is inverted.

The callback is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.

Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
---
arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
3 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index b9e50d3..d1f8c6c 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -92,7 +92,6 @@ struct iommu_table {
unsigned long it_page_shift;/* table iommu page size */
struct iommu_table_group *it_group;
struct iommu_table_ops *it_ops;
- void (*set_bypass)(struct iommu_table *tbl, bool enable);
};

/* Pure 2^n version of get_order */
@@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,

#define IOMMU_TABLE_GROUP_MAX_TABLES 1

+struct iommu_table_group;
+
+struct iommu_table_group_ops {
+ /*
+ * Switches ownership from the kernel itself to an external
+ * user. While onwership is enabled, the kernel cannot use IOMMU
+ * for itself.
+ */
+ void (*set_ownership)(struct iommu_table_group *table_group,
+ bool enable);

The meaning of "enable" in a function called "set_ownership" is
entirely obscure.

Suggest something better please :) I have nothing better...

Well, given it's "set_ownershuip" you could have "owner" - that would
want to be an enum with OWNER_KERNEL and OWNER_VFIO or something
rather than a bool.


It is iommu_take_ownership() in upstream and it is assumed that the owner is anything but the platform code (for now and probably for ever - VFIO). I am not changing this now, just using same naming approach when adding a callback with a similar name.


Or you could leave it a bool but call it "allow_bypass".

Commented below.





+};
+
struct iommu_table_group {
#ifdef CONFIG_IOMMU_API
struct iommu_group *group;
#endif
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+ struct iommu_table_group_ops *ops;
};

#ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a964c50..9687731 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
}

-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
{
- struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
- table_group);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;

@@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
* host side.
*/
if (pe->pdev)
- set_iommu_table_base(&pe->pdev->dev, tbl);
+ set_iommu_table_base(&pe->pdev->dev,
+ &pe->table_group.tables[0]);
else
pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
}
@@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
/* TVE #1 is selected by PCI address bit 59 */
pe->tce_bypass_base = 1ull << 59;

- /* Install set_bypass callback for VFIO */
- pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
-
/* Enable bypass by default */
- pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
+ pnv_pci_ioda2_set_bypass(pe, true);
}

+static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
+ bool enable)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+ if (enable)
+ iommu_take_ownership(table_group);
+ else
+ iommu_release_ownership(table_group);
+
+ pnv_pci_ioda2_set_bypass(pe, !enable);
+}
+
+static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+ .set_ownership = pnv_ioda2_set_ownership,
+};
+
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe)
{
@@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
}
tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
+ pe->table_group.ops = &pnv_pci_ioda2_ops;
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9f38351..d5d8c50 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
goto unlock_exit;
}

- ret = iommu_take_ownership(table_group);
- if (!ret)
- container->grp = iommu_group;
+ if (!table_group->ops || !table_group->ops->set_ownership) {
+ ret = iommu_take_ownership(table_group);
+ } else {
+ /*
+ * Disable iommu bypass, otherwise the user can DMA to all of
+ * our physical memory via the bypass window instead of just
+ * the pages that has been explicitly mapped into the iommu
+ */
+ table_group->ops->set_ownership(table_group, true);

And here to disable bypass you call it with enable=true, so it doesn't
even have the same meaning as it used to.


I do not disable bypass per se (even if it what set_ownership(true) does) as
it is IODA business and VFIO has no idea about it. I do take control over
the group. I am not following you here - what used to have the same
meaning?

Well, in set_bypass, the enable parameter was whether bypass was
enabled. Here you're setting enable to true, when you want to
*disable* bypass (in the existing case). If the "enable" parameter
isn't about enabling bypass, it's meaning is even more confusing than
I thought.


Its meaning is "take ownership over the group". In this patch set_ownership(true) means set_bypass(false).

But later (in 25/31) set_ownership(true) becomes unset(windows0) + free(table0) + set_bypass(false) = clear DMA setup for the group (i.e. invalidate both TVTs) so it is not just about bypass (which is TVT#1 but not TVT#0) anymore.



Plus, you should fold the logic to call the callback if necessary into
iommu_take_ownership().


I really want to keep VFIO stuff out of arch/powerpc/kernel/iommu.c as much
as possible as it is for platform DMA/IOMMU, not VFIO (which got SPAPR
driver for that). ops->set_ownership() is one of these things.

What's VFIO specific about this fragment - it's just if you have the
callback, call it, otherwise fall back to the default.

iommu_take_ownership()/iommu_release_ownership() are helpers for old-style
commercially-unsupported P5IOC2/IODA1, and this is kind of a hack while
ops->set_ownership() is an interface for VFIO to do dynamic windows thing.

Can you put their logic into a set_ownership callback for IODA1 then?

And P5IOC2, and pseries. We know these callbacks will call the same iommu_take_ownership() and iommu_release_ownership() and this is not going to change. Too invasive for such a hack imho.


If it makes sense, I could fold the previous patch into this one and move
iommu_take_ownership()/iommu_release_ownership() to vfio_iommu_spapr_tce.c,
should I? Or leave things are they are now.

That sounds like it might make sense.

This is what will go to v9, looks cleaner. Thanks.





+ ret = 0;
+ }
+
+ if (ret)
+ goto unlock_exit;
+
+ container->grp = iommu_group;

unlock_exit:
mutex_unlock(&container->lock);
@@ -572,7 +585,11 @@ static void tce_iommu_detach_group(void *iommu_data,
table_group = iommu_group_get_iommudata(iommu_group);
BUG_ON(!table_group);

- iommu_release_ownership(table_group);
+ /* Kernel owns the device now, we can restore bypass */
+ if (!table_group->ops || !table_group->ops->set_ownership)
+ iommu_release_ownership(table_group);
+ else
+ table_group->ops->set_ownership(table_group, false);

Likewise fold this if into iommu_release_ownership().

unlock_exit:
mutex_unlock(&container->lock);






--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/