Re: [PATCH v2 RESEND 7/7] swiotlb: per-device flag if there are dynamically allocated buffers

From: Petr Tesařík
Date: Tue May 16 2023 - 03:56:22 EST


On Mon, 15 May 2023 17:28:38 +0100
Catalin Marinas <catalin.marinas@xxxxxxx> wrote:

> (some of your replies may have been filtered to various of my mailboxes,
> depending on which lists you cc'ed; replying here)
>
> On Mon, May 15, 2023 at 12:00:54PM +0200, Petr Tesařík wrote:
> > On Mon, 15 May 2023 10:48:47 +0200
> > Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> > > On Sun, 14 May 2023 19:54:27 +0100
> > > Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
> > > > Now, thinking about the list_head access and the flag ordering, since it
> > > > doesn't matter, you might as well not bother with the flag at all and
> > > > rely on list_add() and list_empty() ordering vs the hypothetical 'blah'
> > > > access. Both of these use READ/WRITE_ONCE() for setting
> > > > dma_io_tlb_dyn_slots.next. You only need an smp_wmb() after the
> > > > list_add() and an smp_rmb() before a list_empty() check in
> > ^^^^^^^^^
> > Got it, finally. Well, that's exactly something I don't want to do.
> > For example, on arm64 (seeing your email address), smp_rmb() translates
> > to a "dsb ld" instruction. I would expect that this is more expensive
> > than a "ldar", generated by smp_load_acquire().
>
> It translates to a dmb ishld which is on par with ldar (dsb is indeed a
> lot more expensive but that's not generated here).

You're right, dsb is generated for the non-SMP barrier variants. Thanks
for the correction.
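
For reference, the two reader-side variants being compared would look
roughly like this (a sketch only; the field names approximate this
series, and search_dyn_slots() is a made-up placeholder for the locked
list walk):

	/* Variant 1: acquire load of the per-device flag (ldar on arm64). */
	if (smp_load_acquire(&dev->dma_io_tlb_have_dyn))
		found = search_dyn_slots(dev, paddr);

	/* Variant 2: no flag, a read barrier before the list check
	 * (dmb ishld on arm64), as suggested above.
	 */
	smp_rmb();
	if (!list_empty(&dev->dma_io_tlb_dyn_slots))
		found = search_dyn_slots(dev, paddr);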

> > > > is_swiotlb_buffer(), no dma_iotlb_have_dyn variable.
> > >
> > > Wait, let me check that I understand you right. Do you suggest that I
> > > convert dma_io_tlb_dyn_slots to a lockless list and get rid of the
> > > spinlock?
> > >
> > > I'm sure it can be done for list_add() and list_del(). I'll have
> > > to think about list_move().
> >
> > Hm, even the documentation of llist_empty() says that it is "not
> > guaranteed to be accurate or up to date". If it could be, I'm quite
> > sure the authors would have gladly implemented it as such.
>
> It doesn't, but neither does your flag.

Yes, I have already agreed in another sub-thread.

> If you want a guarantee, you'd
> need locks because a llist_empty() on its own can race with other
> llist_add/del_*() that may not yet be visible to a CPU at exactly that
> moment. BTW, the llist implementation cannot delete a random element, so
> not sure this is suitable for your implementation (it can only delete
> the first element or the whole list).
>
> I do think you need to change your invariants and not rely on an
> absolute list_empty() or some flag:
>
> P0:
> 	list_add(paddr);
> 	WRITE_ONCE(blah, paddr);
>
> P1:
> 	paddr = READ_ONCE(blah);
> 	list_empty();
>
> Your invariant (on P1) should be blah == paddr => !list_empty(). If
> there is another P2 removing paddr from the list, this wouldn't work
> (nor your flag) but the assumption is that a correctly written driver
> that still has a reference to paddr doesn't use it after being removed
> from the list (i.e. it doesn't do a dma_unmap(paddr) and still continue
> to use this paddr for e.g. dma_sync()).

Right. In other words, given any paddr:

a. Either it is on the list, and then the list cannot be made empty by
any concurrent code.

b. Or it is not on the list, but then we may skip the search
regardless of any races with other CPUs.

> For such an invariant, you'd need ordering between list_add() and the
> write of paddr (smp_wmb() would do). On P1, you need an smp_rmb() before
> list_empty() (since the implementation does a READ_ONCE only).

I agree.
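
Spelled out, the pairing would be (a sketch; "blah" stands for whatever
per-device field publishes the paddr to the other CPU, and "slot" is a
hypothetical dynamic-slot descriptor):

P0:
	list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
	smp_wmb();		/* order list_add() before publishing paddr */
	WRITE_ONCE(blah, paddr);

P1:
	paddr = READ_ONCE(blah);
	smp_rmb();		/* order reading paddr before list_empty() */
	if (!list_empty(&dev->dma_io_tlb_dyn_slots))
		/* take the spinlock and search the list */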

> You still need the locks for list modifications and list traversal as I
> don't see how you can use the llist implementation with random element
> removal.

That's right. It might even perform better than a truly non-blocking
list (cf. Valois, Harris, Zhang).

> There is another scenario to take into account on the list_del() side.
> Let's assume that there are other elements on the list, so
> list_empty() == false:
>
> P0:
> 	list_del(paddr);
> 	/* the memory gets freed, added to some slab or page free list */
> 	WRITE_ONCE(slab_free_list, __va(paddr));
>
> P1:
> 	paddr = __pa(READ_ONCE(slab_free_list)); /* re-allocating paddr freed on P0 */
> 	if (!list_empty()) { /* assuming other elements on the list */
> 		/* searching the list */
> 		list_for_each() {
> 			if (pos->paddr == __pa(vaddr))
> 				/* match */
> 		}
> 	}
>
> On P0, you want the list update to be visible before the memory is freed
> (and potentially reallocated on P1). An smp_wmb() on P0 would do. For
> P1, we don't care about list_empty() as there can be other elements
> already. But we do want any reading of list elements during the search
> to be ordered after the slab_free_list read. The smp_rmb() you'd add for
> the case above would suffice.

Yes, but to protect against concurrent insertions and deletions, a
spinlock is held while searching the list, and the spinlock implicitly
provides the necessary memory barriers.
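
For completeness, the locked search would be something like this (a
sketch only; the lock name and the slot fields are made up for
illustration):

	/* Interrupt context may also map/unmap dynamic bounce buffers,
	 * hence the irqsave variant.
	 */
	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	list_for_each_entry(slot, &dev->dma_io_tlb_dyn_slots, node) {
		if (slot->paddr == paddr) {
			found = true;
			break;
		}
	}
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);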

Again, thank you very much for your time!

Petr T