Re: [GIT PULL] (swiotlb) stable/for-linus-5.15

From: Robin Murphy
Date: Thu Sep 02 2021 - 17:55:47 EST


Hi Linus,

On 2021-09-02 19:54, Linus Torvalds wrote:
On Wed, Sep 1, 2021 at 2:09 PM Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:

which has a new feature called restricted DMA pools. It allows SWIOTLB to utilize
per-device (or per-platform) allocated memory pools instead of using the global one.

The first big user of this is ARM Confidential Computing where the memory for DMA
operations can be set per platform.

Hmm.

I was going to merge this, but I had already merged the dma-mapping
updates from Christoph Hellwig first.

And it turns out that the conflicts are trivial to do the
straightforward way, that's not a problem.

But I think that straightforward way is actually buggy, because it
makes the is_swiotlb_for_alloc() type allocations then use that new
global pool for allocations.

Just to check, I looked into what linux-next did, in case this had
caused any discussion there. Nope. Linux-next does the obvious
straightforward merge.

So I'm going to hold off merging until I can clarify what the rules
for these restricted DMA pools are, because I get the feeling that
people didn't actually think about the interaction with the new global
pool very much.

Not to defend the state things have got into - frankly I've found it increasingly hard to follow for some time now too - but I believe the unwritten story is of the one true dma-direct trying to be all things to all users, and there are several aspects at play here which are mutually exclusive in practice.

Or if they did, it didn't get communicated to me. Christoph mentioned
the merge conflict, but he implied that I should just merge it the
obvious way. And the more I look at that code, the less I think that's
the right thing to do.

So just to make this very concrete, look at dma_direct_alloc(). The
straightforward merge is to do this:

if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
!IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
!IS_ENABLED(CONFIG_DMA_GLOBAL_POOL) &&
!dev_is_dma_coherent(dev) &&
!is_swiotlb_for_alloc(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);

if (IS_ENABLED(CONFIG_DMA_GLOBAL_POOL) &&
!dev_is_dma_coherent(dev))
return dma_alloc_from_global_coherent(dev, size, dma_handle);

and that seems to be what linux-next did too.

But that means that a is_swiotlb_for_alloc(dev) allocation can use
that new global coherent pool (see that second case).
>
Maybe that's ok. But I get the feeling that it is not. It sure as heck
isn't obvious. If it's ok to use the global pool, then fine - but I
want to know why (and I want the code to say why).

The global pool is essentially for noMMU, so should not be expected to overlap with SWIOTLB at all, much less systems using restricted DMA.

I think that the second conditional should probably get an all-new added

&& !is_swiotlb_for_alloc(dev)

in it, but that's not what happened in linux-next. And all this code
is messy and illogical and crazy.

Btw, what's the actual logic with the

!dev_is_dma_coherent(dev)

tests anyway? Those are pre-existing, but seem odd and senseless. The
function is called "dma_alloc_from_global_coherent()", but it is *not*
called when "dev_is_dma_coherent()".

Heh, good terminology for this is hard to find. One way to make sense of it is to consider that either the device itself is coherent - i.e. can snoop caches such that we don't have to much in the way of special treatment - or otherwise it's the memory which has to be made coherent, i.e. remapped non-cacheable, such that CPUs and non-snooping devices can still maintain a consistent view of it (note: I'm focusing on caches here for simplicity, but in practice there's other stuff like memory encryption in the mix too). Hence a "coherent pool" is actually a pool of non-cacheably-mapped memory from which to allocate for non-hardware-coherent devices in contexts where we can't remap regular page allocations on-the-fly (also called an "atomic pool" in various places for that reason). For noMMU with caches, that's effectively all the time, which is where the global pool came from - unlike the atomic pool(s) for MMU systems which can just be allocated and remapped at boot, the global pool is often subject to platform-specific restrictions like being at a special physical alias, so starts life in a very different way.

[ And the reason it's called the global coherent pool is from the implementation growing out of per-device coherent pools, which are yet another similar-but-slightly-different thing. For added fun, those may also intersect with restricted DMA, but that all forks off down the dma_alloc_from_dev_coherent() path before we even get to dma-direct. ]

After the initial setup, though, I guess it might be possible with a bit more refactoring to streamline the global pool bits into the dma_alloc_from_pool() path, since by that point it's functionally equivalent to the atomic pool(s), just under different conditions.

Is it just me, or does anybody else also get the feeling that too many
recreational drugs may have been involved in the creation of this code
over the years? People possibly self-medicating to take away the pain?

Anyway - I'm going to punt on this pull request, because I want
clarification that didn't happen despite the conflict in linux-next.

I also have to say, looking at that code, you guys write collectively
some disgusting code.

I think individually each change has been probably fairly small, but
taken together it's just too nasty for words. That code should be
taken out behind a shed and shot in the head, because it's entirely
impossible to read all those nasty conditionals over multiple lines.

So regardless of what the merge resolution should be, I think this
code needs reorganizing. Because those entirely random conditionals
with several different config tests that DO NOT MAKE SENSE TOGETHER
should not exist.

The merge conflict is just a symptom of the disease in this code, not
the problem itself.

That code needs to be split up, and the logic needs to be made
understandable. Preferably *obvious*. Which it isn't now.

I suppose one logical option would be to split out dedicated versions of dma-direct for at least the major differences - noMMU, CONFIG_DMA_DIRECT_REMAP, etc. - but that does run the risk of ending up as a lot of very similar looking code and being a rich source of bugs where things get added/fixed in one place but not the others.

In particular, if the rule is that "is_swiotlb_for_alloc()" means that
you *have* to use swiotlb_alloc() and absolutely nothing else", then
that should be the logic, and it should be at the top, and that should
be all it is. No m ixing with the other tests that are about entirely
different issues.

Sadly it doesn't entirely mean that. A device using restricted DMA *should* usually end up allocating from its own dedicated SWIOTLB pool, but there may be cases where falling back to dma_direct_alloc_from_pool() is still OK, or maybe even necessary.

Because right now, that code is literally set up to use completely
random different allocation choices depending on completely random and
illogical combinations of config rules mixed with random device rules.

Make each conditional be about *one* thing, and one thing only. Not
about three unrelated things that may or may have the same end result
rule.

See what I'm saying?

The solution may be something like

if (is_swiotlb_for_alloc(dev))
goto swiotlb_alloc_only;

at the top of the allocation routines, or whatever. Don't make the
other conditionals - that have nothing to do with this thing - get
mixed up in the fundamental rule.

Also, that's an odd name. "is_swiotlb_for_alloc()"? Those four words
do not make sense together in that order. Shouldn't it be
"use_swiotlb_for_alloc()"?

So make the actual rules individual, and make them make sense. It's ok
to have five (or more) of those kinds of things, but they need to be
individually sensible, and not randomly tied to each other as one big
broken non-sensible thing.

Please no more of this completely crazy "let's mix it all together in
a blender, and write code that makes no sense".

In the immediate short term, please

(a) explain what the rules actually are

I hope some of the background above has helped a bit.

(b) so that I can do a merge that actually does the right thing (ie I
_think_ I should add that extra test above, but I want this thought
about and confirmed or be told why that isn't the case).

and then the rewrite should be a future step. But that rewrite of
those disgusting "rules" absolutely NEEDS to happen, because looking
at that spaghetti of random config options and other random
conditionals, that's simply not acceptable.

FWIW, no argument from me there - I believe it's actually ChromeOS and Android pKVM having the initial use-cases for getting this new SWIOTLB functionality landed, but in terms of Arm CCA we certainly still have some interest in it at the moment, so I might be able to wrangle some time to take on a bit of cleanup from that angle.

Cheers,
Robin.

It's particularly not acceptable since I suspect that it all probably
*works* even if you allocate from the wrong pool, but it then
presumably violates some security rule that it was set up so that
allocations would *not* preferably get mixed up in some global pool.

Pls advise. I'm not asking anybody to do the merge for me - I'm
literally asking for that code and the rules to be clarified.

Linus