[PATCH] iommu/vt-d: debugfs: Fix race with iommu unmap when traversing

From: Jingqi Liu
Date: Sun Sep 03 2023 - 10:41:07 EST


When traversing page table, it may race with iommu unmap.

For the page table page pointed to by a PDPE/PDE, there are three
scenarios in the iommu unmap path.

1) The page has been freed.

If the page has a refcount of zero, it has been freed. The
debugfs should avoid to traverse it.

In the debugfs, the refcount of a page table page is checked
before traversing it. If its refcount is zero, the page will not
be traversed. If the refcount is not zero, increment its refcount
before traversal and decrement its refcount after traversal.

2) The page is not freed and the PDPE/PDE is not cleared.

When the page is unlinked from its parent, it's not freed
immediately. It will be freed as soon as the IOTLB is flushed.

Before the page is freed, the contents in the page and all the
pages under its level are stale. The debugfs should avoid to
traverse these pages.

According the VT-d specification, bit 3 of the PDPE/PTE pointing
to a page directory/table is ignored by the hardware. The driver
sets it to 0 by default. In the unmap path, set bit 3 to 1 to
indicate that the page pointed to by the PDPE/PDE is stale.

Check bit 3 of the PDPE/PDE before traversing the page pointed to
by the PDPE/PDE in the debugfs. Traversal stops if bit 3 is 1.

3) The page is not freed and the PDPE/PDE has been cleared.

Check the preseent bit of the PDPE/PDE in the debugfs. The
traversal stops if the present bit is cleared.

Suggested-by: Kevin Tian <kevin.tian@xxxxxxxxx>
Signed-off-by: Jingqi Liu <Jingqi.liu@xxxxxxxxx>
---
drivers/iommu/intel/debugfs.c | 47 +++++++++++++++++++++++++++++++++--
drivers/iommu/intel/iommu.c | 16 ++++++++++++
2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index 1f925285104e..e12f1ed88e65 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -333,9 +333,52 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
path[level] = pde->val;
if (dma_pte_superpage(pde) || level == 1)
dump_page_info(m, start, path);
- else
- pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
+ else {
+ /*
+ * The entry references a Page-Directory Table
+ * or a Page Table.
+ */
+ struct page *pg;
+ u64 pte_addr;
+
+retry:
+ pte_addr = dma_pte_addr(pde);
+ pg = pfn_to_page(PFN_DOWN(pte_addr));
+ if (!get_page_unless_zero(pg))
+ /*
+ * If this page has a refcount of zero,
+ * it has been freed, or will be freed.
+ */
+ continue;
+
+ /*
+ * Check if the page that pointed to by the PDE is
+ * stale. Bit 3 of the PDE is ignored by hardware.
+ * If the page is stale, bit 3 is already set to 1
+ * in the unmap path.
+ */
+ if (pde->val & BIT_ULL(3)) {
+ put_page(pg);
+ continue;
+ }
+
+ /* Check if the value of this entry is changed. */
+ if (pde->val != path[level]) {
+ put_page(pg);
+
+ if (!dma_pte_present(pde))
+ /* This entry is invalid. Skip it. */
+ continue;
+
+ /* The entry has been updated. */
+ path[level] = pde->val;
+ goto retry;
+ }
+
+ pgtable_walk_level(m, phys_to_virt(pte_addr),
level - 1, start, path);
+ put_page(pg);
+ }
path[level] = 0;
}
}
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b871a6afd803..6775b6bdfa3d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1103,6 +1103,22 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
list_add_tail(&pg->lru, freelist);

+ /*
+ * When the page pointed to by a PDPE/PDE is unlinked from its parent,
+ * it's not freed immediately. It will be freed as soon as the IOTLB
+ * is flushed. Before the page is freed, another process context (e.g.
+ * debugfs) may still traverse it. But the content of the page is no
+ * longer valid.
+ * According the VT-d specification, bit 3 of the PDPE/PTE pointing to
+ * a page directory/table is ignored by the hardware. The driver sets
+ * it to 0 by default. To avoid the unnecessary page walks, set bit 3
+ * to 1 to indicate that the page pointed to by the PDPE/PDE is stale.
+ * Then in another process context, bit 3 of the PDPE/PDE is checked
+ * before traversing the page pointed to by the PDPE/PDE. Traversal
+ * stops if bit 3 is 1.
+ */
+ pte->val |= BIT_ULL(3);
+
if (level == 1)
return;

--
2.21.3