Re: [PATCH 00/10] mm, arm64: Reduce ARCH_KMALLOC_MINALIGN below the cache line size

From: Catalin Marinas
Date: Thu Apr 07 2022 - 13:49:14 EST

Next message: Kirill A. Shutemov: "Re: [PATCHv8 00/30] TDX Guest: TDX core support"
Previous message: Sean Christopherson: "Re: [PATCH v2 06/31] KVM: x86: hyper-v: Don't use sparse_set_to_vcpu_mask() in kvm_hv_send_ipi()"
In reply to: Vlastimil Babka: "Re: [PATCH 00/10] mm, arm64: Reduce ARCH_KMALLOC_MINALIGN below the cache line size"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Apr 07, 2022 at 04:40:15PM +0200, Vlastimil Babka wrote:
> On 4/5/22 15:57, Catalin Marinas wrote:
> > This series is beneficial to arm64 even if it's only reducing the
> > kmalloc() minimum alignment to 64. While it would be nice to reduce this
> > further to 8 (or 16) on SoCs known to be fully DMA coherent, detecting
> > this is via arch_setup_dma_ops() is problematic, especially with late
> > probed devices. I'd leave it for an additional RFC series on top of
> > this (there are ideas like bounce buffering for non-coherent devices if
> > the SoC was deemed coherent).
[...]
> - due to ARCH_KMALLOC_MINALIGN and dma guarantees we should return
> allocations aligned to ARCH_KMALLOC_MINALIGN and the prepended size header
> should also not share their ARCH_KMALLOC_MINALIGN block with another
> (shorter) allocation that has a different lifetime, for the dma coherency
> reasons
> - this is very wasteful especially with the 128 bytes alignment, and seems
> we already violate it in some scenarios anyway [2]. Extending this to all
> objects would be even more wasteful.
>
> So this series would help here, especially if we can get to the 8/16 size.

If we get to 8/16 size, it would only be for platforms that are fully
coherent. Otherwise, with non-coherent DMA, the minimum kmalloc()
alignment would still be the cache line size (typically 64) even if
ARCH_KMALLOC_MINALIGN is 8.

IIUC your point is that if ARCH_KMALLOC_MINALIGN is 8, kmalloc() could
return pointers 8-byte aligned only as long as DMA safety is preserved
(like not sharing the rest of the cache line with anything other
writers).

> But now I also wonder if keeping the name and meaning of "MINALIGN" is in
> fact misleading and unnecessarily constraining us? What this is really about
> is "granularity of exclusive access", no?

Not necessarily. Yes, in lots of cases it is about granularity of access
but there are others where the code does need the pointer returned
aligned to ARCH_DMA_MINALIGN (currently via ARCH_KMALLOC_MINALIGN).
Crypto seems to have such requirement (see the sub-thread with Herbert).
Some (all?) callers ask kmalloc() for the aligned size and there's an
expectation that if the size is a multiple of a power of two, kmalloc()
will return a pointer aligned to that power of two. I think we need to
preserve these semantics which may lead to some more wastage if you add
the header (e.g. a size of 3*64 returns a pointer either aligned to 192
or 256).

> Let's say the dma granularity is 64bytes, and there's a kmalloc(56).

In your example, the size is not a power of two (or multiple of), so I
guess there's no expectation for a 64-byte alignment (it can be 8)
unless DMA is involved. See below.

> If SLOB find a 64-bytes aligned block, uses the first 8 bytes for the
> size header and returns the remaining 56 bytes, then the returned
> pointer is not *aligned* to 64 bytes, but it's still aligned enough
> for cpu accesses (which need only e.g. 8), and non-coherent dma should
> be also safe because nobody will be accessing the 8 bytes header,
> until the user of the object calls kfree() which should happen only
> when it's done with any dma operations. Is my reasoning correct and
> would this be safe?

>From the DMA perspective, it's not safe currently. Let's say we have an
inbound DMA transfer, the DMA API will invalidate the cache line prior
to DMA. In arm64 terms, it means that the cache line is discarded, not
flushed to memory. If the first 8 bytes had not been written back to
RAM, they'd be lost. If we can guarantee that no CPU write happens to
the cache line during the DMA transfer, we can change the DMA mapping
operation to do a clean+invalidate (flush the cacheline to RAM) first. I
guess this could be done with an IS_ENABLED(CONFIG_SLOB) check.

--
Catalin

Next message: Kirill A. Shutemov: "Re: [PATCHv8 00/30] TDX Guest: TDX core support"
Previous message: Sean Christopherson: "Re: [PATCH v2 06/31] KVM: x86: hyper-v: Don't use sparse_set_to_vcpu_mask() in kvm_hv_send_ipi()"
In reply to: Vlastimil Babka: "Re: [PATCH 00/10] mm, arm64: Reduce ARCH_KMALLOC_MINALIGN below the cache line size"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]