Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
From: Joao Martins
Date: Wed Jun 01 2022 - 08:19:23 EST
On 6/1/22 00:10, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:
>> There are only 3 instances where we'll free a table while the domain is
>> live. The first is the one legitimate race condition, where two map requests
>> targeting relatively nearby PTEs both go to fill in an intermediate level of
>> table; whoever loses that race frees the table they allocated, but it was
>> never visible to anyone else so that's definitely fine. The second is if
>> we're mapping a block entry, and find that there's already a table entry
>> there, wherein we assume the table must be empty, clear the entry,
>> invalidate any walk caches, install the block entry, then free the orphaned
>> table; since we're mapping the entire IOVA range covered by that table,
>> there should be no other operations on that IOVA range attempting to walk
>> the table at the same time, so it's fine.
>
> I saw these two in the Intel driver
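To make sure I'm reading the first case the same way, here's a minimal
sketch of the kind of lock-free table install I have in mind; the
SKETCH_* bits, the address mask and the function name are all made up
for illustration, not the actual io-pgtable or driver code:

#include <linux/atomic.h>
#include <linux/bits.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/types.h>

#define SKETCH_PTE_TYPE_TABLE   BIT_ULL(1)              /* placeholder */
#define SKETCH_PTE_ADDR_MASK    GENMASK_ULL(47, 12)     /* placeholder */

static u64 *sketch_install_table(u64 *ptep, u64 *new_table)
{
        u64 old, new;

        new = __pa(new_table) | SKETCH_PTE_TYPE_TABLE;

        /* Only install if the slot is still empty. */
        old = cmpxchg64(ptep, 0, new);
        if (old) {
                /*
                 * Lost the race: a concurrent map already installed a
                 * table here.  Ours was never visible to any walker,
                 * so freeing it straight away is fine.  (Block-entry
                 * handling is omitted for brevity.)
                 */
                free_page((unsigned long)new_table);
                return __va(old & SKETCH_PTE_ADDR_MASK);
        }

        return new_table;
}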
>
>> The third is effectively the inverse, if we get a block-sized unmap
>> but find a table entry rather than a block at that point (on the
>> assumption that it's de-facto allowed for a single unmap to cover
>> multiple adjacent mappings as long as it does so exactly); similarly
>> we assume that the table must be full, and no other operations
>> should be racing because we're unmapping its whole IOVA range, so we
>> remove the table entry, invalidate, and free as before.
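And the third case as I picture it, reusing the made-up bits from the
sketch above, with the walk-cache invalidation elided:

static void sketch_unmap_whole_table(u64 *ptep)
{
        u64 *table = __va(READ_ONCE(*ptep) & SKETCH_PTE_ADDR_MASK);

        /* Unhook the table; the unmap covers its entire IOVA range. */
        WRITE_ONCE(*ptep, 0);

        /*
         * Invalidate the walk caches for the range here (omitted), and
         * only then free the now-unreachable table, on the assumption
         * that nothing else is operating on this IOVA range.
         */
        free_page((unsigned long)table);
}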
>
> Not sure I noticed this one though
>
> This all makes sense though.
>
I saw all three of these in the AMD IOMMU, /I think/.

AMD has a sort of replicated-PTE concept: a power-of-2 page size is
represented by replicating the entry across contiguous PTEs at the
'default' page sizes (4K, 2M, 1G), and we need to check every single
one of them. While writing the RFC I realized that's not really the
most efficient thing to keep, considering the IOMMU hardware doesn't
guarantee the dirty bit is set on every replicated PTE. And maybe it's
obscuring the fact that we don't attempt to pick the best page size for
the IOVA mapping (like Intel does), and instead allow every power-of-2
page size and let the IOMMU hardware cope as best it can.
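Roughly, the per-IOVA sweep ends up having to do something like the
below; this is illustrative only, not the actual AMD driver code, and
SKETCH_PTE_DIRTY is a placeholder for the real hardware dirty bit:

#include <linux/bits.h>
#include <linux/compiler.h>
#include <linux/types.h>

#define SKETCH_PTE_DIRTY        BIT_ULL(6)      /* placeholder bit */

/*
 * A non-default power-of-2 page size is replicated over 'count'
 * contiguous PTEs, and hardware may set the dirty bit on any one of
 * them, so all of them have to be checked.
 */
static bool sketch_any_replicated_pte_dirty(u64 *ptep, unsigned long count)
{
        unsigned long i;

        for (i = 0; i < count; i++)
                if (READ_ONCE(ptep[i]) & SKETCH_PTE_DIRTY)
                        return true;

        return false;
}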
>> Although we don't have debug dumping for io-pgtable-arm, it's good to be
>> thinking about this, since it's made me realise that dirty-tracking sweeps
>> per that proposal might pose a similar kind of concern, so we might still
>> need to harden these corners for the sake of that.
>
> Let's make sure Joao sees this..
>
> It is interesting because we probably don't want the big latency
> spikes that would come from using locking to block map/unmap while
> dirty reading is happening - eg at the iommfd level.
>
Especially when we might be scanning big IOVA ranges.
> From a consistency POV the only case that matters is unmap and unmap
> should already be doing a dedicated dirty read directly prior to the
> unmap (as per that other long thread)
>
/me nods, yes
FWIW, I've already removed the unmap_read_dirty ops.
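(For the record, the caller-level ordering I have in mind is roughly
the below; the dirty-read helper is a stand-in name, only iommu_unmap()
is the real API.)

#include <linux/iommu.h>

/* Stand-in prototype; the eventual helper/op name is still in flux. */
int sketch_read_and_clear_dirty(struct iommu_domain *domain,
                                unsigned long iova, size_t size,
                                unsigned long *bitmap);

static size_t sketch_unmap_after_dirty_read(struct iommu_domain *domain,
                                            unsigned long iova, size_t size,
                                            unsigned long *bitmap)
{
        /* Harvest dirty bits for the range before tearing it down... */
        sketch_read_and_clear_dirty(domain, iova, size, bitmap);

        /* ...so the plain unmap below cannot lose any dirty state. */
        return iommu_unmap(domain, iova, size);
}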
> So having safe racy reading in the kernel is probably best, and so RCU
> would be good here too.
>
Reading dirty bits ought to be similar to map/unmap, but slightly
simpler, as I supposedly don't need to care about the PTE changing
under the hood (or so I initially thought). At some point I was
wrestling with whether test-and-clear was enough, or whether I should
switch back to cmpxchg to detect that the PTE has changed and mark it
dirty based only on the old value[*]. The latter would align with how
map/unmap perform the PTE updates.

[*] e.g. the new entry potentially hasn't been 'seen' by the IOMMU
walker yet, or might be a smaller size than what got dirtied due to an
IOPTE split, or might be racing with a new map.
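To make that concrete, the two options look roughly like this (the
dirty bit is a placeholder define, not the actual driver code):

#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/bits.h>
#include <linux/compiler.h>
#include <linux/types.h>

#define SKETCH_PTE_DIRTY_BIT    6                       /* placeholder */
#define SKETCH_PTE_DIRTY        BIT_ULL(SKETCH_PTE_DIRTY_BIT)

/*
 * Option 1: atomically test-and-clear just the dirty bit and ignore
 * whatever else may be changing in the PTE (64-bit little-endian
 * assumed for the cast).
 */
static bool sketch_test_and_clear_dirty(u64 *ptep)
{
        return test_and_clear_bit(SKETCH_PTE_DIRTY_BIT,
                                  (unsigned long *)ptep);
}

/*
 * Option 2: full cmpxchg, so a concurrent PTE update is detected and
 * dirtiness is decided from the value we actually replaced.
 */
static bool sketch_cmpxchg_clear_dirty(u64 *ptep)
{
        u64 old, tmp;

        old = READ_ONCE(*ptep);
        for (;;) {
                tmp = cmpxchg64(ptep, old, old & ~SKETCH_PTE_DIRTY);
                if (tmp == old)
                        break;
                old = tmp;      /* PTE changed under us; retry against it */
        }

        return old & SKETCH_PTE_DIRTY;
}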
>> that somewhere I have some half-finished patches making io-pgtable-arm use
>> the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once
>> (perhaps we might even be able to RCU-ify the freelist generically? I'll see
>> how it goes when I get there).
>
> This is all very similar to how the mm's tlb gather stuff works,
> especially on PPC which requires RCU protection of page tables instead
> of the IPI trick.
>
> Currently the rcu_head and the lru overlap in the struct page. To do
> this I'd suggest following what was done for slab - see ca1a46d6f506
> ("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class
> like 'struct slab' for use with iommu page tables and put the
> list_head and rcu_head sequentially in the struct iommu_pt_page.
>
> Continue to use put_pages_list() just with the list_head in the new
> struct not the lru.
>
> If we need it for dirty tracking then it makes sense to start
> progressing parts at least..
>
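If I'm reading the struct suggestion right, the shape would be roughly
the below; the field layout is made up for illustration, and the real
thing needs the same careful struct page overlay (and compile-time
asserts) that struct slab has:

#include <linux/container_of.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

struct iommu_pt_page {
        unsigned long flags;            /* overlays page->flags */
        struct list_head freelist;      /* takes the place of page->lru,
                                           for put_pages_list() */
        struct rcu_head rcu;            /* right after, for deferred free */
};

static void iommu_pt_free_rcu(struct rcu_head *head)
{
        struct iommu_pt_page *pt = container_of(head, struct iommu_pt_page,
                                                rcu);

        /* Relies on iommu_pt_page being an overlay of struct page. */
        __free_page((struct page *)pt);
}

/*
 * A table retired from the live page table would then go through
 * call_rcu(&pt->rcu, iommu_pt_free_rcu), while dirty readers walk the
 * tables under rcu_read_lock().
 */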
The suggestions here seem to apply to map/unmap too, not just
read_dirty() alone, IIUC.

I am not sure yet whether dynamic demote/promote of page sizes changes
this.