On Mar 26, 2021, at 7:31 PM, Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx> wrote:
Hi Nadav,
On 3/19/21 12:46 AM, Nadav Amit wrote:
So here is my guess:
Intel probably used as a basis for the IOTLB an implementation of
some other (regular) TLB design.
Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):
"Software wishing to prevent this uncertainty should not write to
a paging-structure entry in a way that would change, for any linear
address, both the page size and either the page frame, access rights,
or other attributes.”
Now the aforementioned uncertainty is a bit different (multiple
*valid* translations of a single address). Yet, perhaps this is
yet another thing that might happen.
From a brief look on the handling of MMU (not IOMMU) hugepages
in Linux, indeed the PMD is first cleared and flushed before a
new valid PMD is set. This is possible for MMUs since they
allow the software to handle spurious page-faults gracefully.
This is not the case for the IOMMU though (without PRI).
Not sure this explains everything though. If that is the problem,
then during a mapping that changes page-sizes, a TLB flush is
needed, similarly to the one Longpeng did manually.
I have been working with Longpeng on this issue these days. It turned
out that your guess is right. The PMD is first cleared but not flushed
before a new valid one is set. The previous entry might be cached in the
paging structure caches hence leads to disaster.
In __domain_mapping():
2352 /*
2353 * Ensure that old small page tables are
2354 * removed to make room for superpage(s).
2355 * We're adding new large pages, so make sure
2356 * we don't remove their parent tables.
2357 */
2358 dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
2359 largepage_lvl + 1);
I guess adding a cache flush operation after PMD switching should solve
the problem.
I am still not clear about this comment:
"
This is possible for MMUs since they allow the software to handle
spurious page-faults gracefully. This is not the case for the IOMMU
though (without PRI).
"
Can you please shed more light on this?
I was looking at the code in more detail, and apparently my concern
is incorrect.
I was under the assumption that the IOMMU map/unmap can merge/split
(specifically split) huge-pages. For instance, if you map 2MB and
then unmap 4KB out of the 2MB, then you would split the hugepage
and keep the rest of the mappings alive. This is the way MMU is
usually managed. To my defense, I also saw such partial unmappings
in Longpeng’s first scenario.
If this was possible, then you would have a case in which out of 2MB
(for instance), 4KB were unmapped, and you need to split the 2MB
hugepage into 4KB pages. If you try to clear the PMD, flush, and then
set the PMD to point to table with live 4KB PTES, you can have
an interim state in which the PMD is not present. DMAs that arrive
at this stage might fault, and without PRI (and device support)
you do not have a way of restarting the DMA after the hugepage split
is completed.
Anyhow, this concern is apparently not relevant. I guess I was too
naive to assume the IOMMU management is similar to the MMU. I now
see that there is a comment in intel_iommu_unmap() saying:
/* Cope with horrid API which requires us to unmap more than the
size argument if it happens to be a large-page mapping. */
Regards,
Nadav