Re: [PATCH] swiotlb: fix the check whether a device has used software IO TLB

From: Catalin Marinas
Date: Fri Sep 15 2023 - 13:10:23 EST


On Fri, Sep 15, 2023 at 11:13:43AM +0200, Petr Tesařík wrote:
> On Thu, 14 Sep 2023 19:28:01 +0100
> Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
> > What do the smp_wmb() barriers in swiotlb_find_slots() and
> > swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
> > the end of the function and the "pairing" comment doesn't help.
>
> By the time swiotlb_find_slots() returns a valid slot index, the new
> value of dev->dma_uses_io_tlb must be visible by all CPUs in
> is_swiotlb_buffer(). The index is used to calculate the bounce buffer
> address returned to device drivers. This address may be passed to
> another CPU and used as an argument to is_swiotlb_buffer().

Ah, I remember now. So the smp_wmb() ensures that dma_uses_io_tlb is
seen by other CPUs before the slot address (presumably passed via other
memory write). It may be worth updating the comment in the code (I'm
sure I'll forget it in a month time). The smp_rmb() before READ_ONCE()
in this patch is also needed for the same reasons (ordering after the
read of the address passed to is_swiotlb_buffer()).

BTW, you may want to use WRITE_ONCE() when setting dma_uses_io_tlb (it
also matches the READ_ONCE() in is_swiotlb_buffer()). Or you can use
smp_store_mb() (but check its semantics first).

> I am not sure that the smp_wmb() in swiotlb_dyn_alloc() is needed, but
> let me explain my thinking. Even if the dev->dma_uses_io_tlb flag is
> set, is_swiotlb_buffer() returns false unless the buffer address is
> found in the RCU list of swiotlb pools, which is walked without taking
> any lock. In some iterations of the patch series, there was a direct
> call to swiotlb_dyn_alloc(), and a smp_wmb() was necessary to make sure
> that the list of swiotlb pools cannot be observed as empty by another
> CPU. You have already explained to me that taking a spin lock in
> add_mem_pool() is not sufficient, because it does not invalidate a
> value that might have been speculatively read by another CPU before
> entering the critical section. OTOH swiotlb_dyn_alloc() is always
> called through schedule_work() in the current version. If that
> implicitly provides necessary barriers, this smp_wmb() can be removed.

Normally you'd need a barrier between pool update and flag update:

list_add_rcu(&pool->node, &mem->pools);
smp_wmb();
WRITE_ONCE(dev->dma_uses_io_tlb, true);

The lock around mem->pools update doesn't help since since it only has
release semantics on the unlock path (which doesn't prevent the
dma_uses_io_tlb write from being reordered before).

On the reader side, you need to make sure that if the dma_uses_io_tlb is
true, the mem->pools access was not speculated and read earlier as
empty, so another dma_rmb():

if (READ_ONCE(dev->dma_uses_io_tlb)) {
dma_rmb();
swiotlb_find_pool(...);
}

That's missing in the code currently (and rcu_read_lock() only has
acquire semantics).

However, what confuses me is that mem->pools is populated asynchronously
via schedule_work(). Not sure how the barriers help since the work can
be scheduled on any CPU at any point after, potentially after
dma_uses_io_tlb has been updated.

On the transient dev->dma_io_tlb_pools updated in swiotlb_find_slots(),
you'd need the barriers as I mentioned above (unless I misunderstood how
this pool searching works; not entirely sure why swiotlb_find_pool()
searches both).

> FTR list_add_rcu() alone is not sufficient. It adds the new element
> using rcu_assign_pointer(), which modifies the forward link with
> smp_store_release(). However, list_for_each_entry_rcu() reads the head
> with list_entry_rcu(), which is a simple READ_ONCE(); there is no
> ACQUIRE semantics.
>
> Actually, I wonder when a newly added RCU list element is guaranteed to
> be visible by all CPUs. Existing documentation deals mainly with
> element removal, explaining that both the old state and the new state
> of an RCU list are valid after addition. Is it assumed that the old
> state of an RCU list after addition is valid indefinitely?

When some write becomes visible to other CPUs, I don't think the
architectures define precisely (more like in vague terms as
"eventually"). That's why we can't talk about barriers in relation to a
single write as the barrier doesn't make a write visible but rather
makes it visible _before_ a subsequent write becomes visible (whenever
that may be).

Anyway, as above, I think the current code is still missing some
barriers between dma_uses_io_tlb update or read and the actual pool
writes. But I haven't figured out how this works with the asynchronous
swiotlb_dyn_alloc().

--
Catalin