Re: [PATCH v2 06/11] iommu/arm-smmu-v3: Introduce arm_smmu_s2_parent_tlb_ invalidation helpers
From: Jason Gunthorpe
Date: Tue Apr 15 2025 - 19:46:34 EST
On Tue, Apr 15, 2025 at 01:10:37PM -0700, Nicolin Chen wrote:
> On Tue, Apr 15, 2025 at 09:50:42AM -0300, Jason Gunthorpe wrote:
> > struct invalidation_op {
> >         struct arm_smmu_device *smmu;
> >         enum {ATS, S2_VMID_IPA, S2_VMID, S1_ASID} invalidation_op;
> >         union {
> >                 u16 vmid;
> >                 u32 asid;
> >                 u32 ats_id;
> >         };
> >         refcount_t users;
> > };
> >
> > Then invalidation would just iterate over this list following each
> > instruction.
> >
> > When things are attached the list is mutated:
> > - Normal S1/S2 attach would reuse an ASID for the same instance or
> > allocate a new list entry, users keeps track of ID sharing
> > - VMID attach would use the VMID of the vSMMU
> > - ATS enabled would add entries for each PCI device instead of the
> >   separate ATS list
>
> Interesting. I can see it generalize all the use cases.
>
> Yet are you expecting a big list combining TLBI and ATC_INV cmds?
It is the idea I had in my head. There isn't really a great reason to
have two lists if one list can handle the required updating and
locking needs. I imagine the IOTLB entries would be sorted first and
the ATC entries last.
> I think the ATC_INV entries doesn't need a refcount?
Probably in almost all cases.
But see below about needing two domains in the list at once and recall
that today we temporarily put the same domain in the list twice
sometimes. So it may make a lot of sense to use the refcount in every
entry to track how many masters are using that entry just to keep the
design simple.
> And finding an SID (to remove the device for example) would take
> long, when there are a lot of entries in the list?
It depends how smart you get; a bisection search on a sorted linear list
would scale fine. But I don't think we care much about attach/detach
performance, or have such high numbers of attachments that this is
worth optimizing for.
> Should the ATS list still be separate, or even an xarray?
I haven't gone through it in any detail to know. If the invalidation
can use the structure above for ATS and nothing else needs the ATS
list, then perhaps it doesn't need to exist.
> I will refer to their driver. Yet, I wonder what we will gain from
> RCU here? Race condition? Would you elaborate with some use case?
The invalidation path was optimized to avoid locking, look at the
stuff in arm_smmu_atc_inv_domain() to try to avoid the spinlock
protecting the ATS invalidations read from the devices list.
So, I imagine a similar lock free scheme would be:

invalidation:
	rcu_read_lock()
	list = READ_ONCE(domain->invalidation_ops);
	[execute invalidation on list]
	rcu_read_unlock()

mutate:
	mutex_lock(domain->lock for attachment)
	new_list = kcalloc()
	copy_and_mutate(domain->invalidation_ops, new_list);
	rcu_assign_pointer(domain->invalidation_ops, new_list);
	mutex_unlock(domain->lock for attachment)
Then because of RCU you have to deal with some races.
1) HW flushing must be synchronous with the domain attach:

   CPU 1                          CPU 2
   change an IOPTE
   release IOPTEs
                                  attach a domain
                                  release invalidation_ops
   invalidation
   acquire READ_ONCE()
                                  acquire IOPTEs
                                  update the STE/CD
Such that the HW is guaranteed to either:

 a) see the new value of the IOPTE before seeing the STE/CD that could
    cause it to be fetched, or
 b) see the invalidation_op for the new STE prior to the STE being
    installed.
IIRC the riscv folks determined that this was a simple smp_mb().
On the detaching side, spurious IOTLB invalidation is OK; that will
just cause some performance anomaly. And I think spurious ATC
invalidation is OK too, though we may need a synchronize_rcu() in
device removal due to friendly hot unplug. IDK.
2) Safe domain replacement
The existing code double-adds devices to the invalidation lists for
safety. So it would need an algorithm like this:
prepare:
	middle_list = copy_and_mutate_add_master(domain->list, new_master);
	final_list = copy_and_mutate_remove_master(middle_list, old_master);

commit:
	// Invalidate both new/old master while we mess with the STE/CD
	rcu_assign_pointer(domain->list, middle_list);
	install_ste()
	// Only invalidate new master
	rcu_assign_pointer(domain->list, final_list);
	kfree_rcu(middle_list);
	kfree_rcu(old_list);
As there is an intrinsic time window after the STE is written to
memory but before the STE invalidation sync has been completed in HW
where we have no idea which of the two domains the HW is fetching
from.
3) IOMMU Device removal
Since the RCU is also protecting the smmu instance memory and queues:

   CPU 1                          CPU 2
   invalidation
   rcu_read_lock()
                                  domain detach
                                  arm_smmu_release_device()
                                  iommu_device_unregister()
   list = READ_ONCE()
   .. list[i]->smmu ..
   rcu_read_unlock()
                                  synchronize_rcu()
                                  kfree(smmu);
But that's easy, and we never hot-unplug SMMUs anyhow.
> > But the end result is we fully disconnect the domain from the smmu
> > instance and all domain types can be shared across all instances if
> > they support the pagetable layout. The invalidation also becomes
> > somewhat simpler as it just sweeps the list and does what it is
> > told. The special ATS list, counter and locking is removed too.
>
> OK. I'd like to give it another try. Or would you prefer to write it
> yourself?
I'd be happy if you can knock it out, or at least determine it is too
hard/bad an idea. I'm trying to push out the io page table stuff this
cycle.
The only thing that gives me pause is the complexity of the list copy
and mutate, but I didn't try to enumerate all the mutations that are
required. Maybe if this is done in a very simple unoptimized way it is
good enough: 'mutate add master' and 'mutate remove master', allocating
a new list copy for each operation.

Scan the list and calculate the new size. Copy the list, discarding
things to delete. Add the new things to the end. Sort.
I'd probably start here, try to write the two mutate functions, check
if those are enough mutate functions, then try to migrate the
invalidation logic over to use the new lists part by part. Building
the new lists can be done first in a series.
From here a future project would be to optimize the invalidation for
multi-SMMU and multi-device... The current code runs everything
serially, but we could push all the invalidation commands to all the
instances, then wait for the syncs to come back from each instance,
allowing the HW invalidation to run in parallel. Then similarly do the
ATC in parallel. It is easy to do if the list is sorted already in
order of required operations. This might make most sense for ATC
invalidation since it is always range based and only needs two command
entries?
Jason