[PATCH 0/6] riscv: Reduce ARCH_KMALLOC_MINALIGN to 8

From: Jisheng Zhang
Date: Fri May 26 2023 - 13:11:21 EST

Currently, riscv defines ARCH_DMA_MINALIGN as L1_CACHE_BYTES, i.e.
64 bytes, if CONFIG_RISCV_DMA_NONCOHERENT=y. To support a unified kernel
Image, we usually have to enable CONFIG_RISCV_DMA_NONCOHERENT, which
has two bad effects on coherent platforms:

Firstly, it wastes memory: the kmalloc-96, kmalloc-32, kmalloc-16 and
kmalloc-8 slab caches no longer exist; they are replaced with
either kmalloc-128 or kmalloc-64.

Secondly, larger than necessary kmalloc aligned allocations result
in unnecessary cache/TLB pressure.

This issue also exists on arm64 platforms. Since last year, Catalin
has been working on solving it [1] by decoupling ARCH_KMALLOC_MINALIGN
from ARCH_DMA_MINALIGN, limiting kmalloc() minimum alignment to
dma_get_cache_alignment(), replacing ARCH_KMALLOC_MINALIGN usage
in various drivers with ARCH_DMA_MINALIGN, etc.

One fact we can make use of for riscv: if the CPU supports neither
ZICBOM nor T-HEAD CMO, we know the platform is coherent. Based on
Catalin's work and the above fact, we can easily solve the kmalloc
alignment issue for riscv: we override dma_get_cache_alignment() and
let it return ARCH_DMA_MINALIGN at the beginning, then return 1 once we
know the underlying HW supports neither ZICBOM nor T-HEAD CMO.

So what if the CPU supports ZICBOM or T-HEAD CMO, but all the
devices are DMA coherent? Well, we still use ARCH_DMA_MINALIGN as the
kmalloc minimum alignment; nothing changes in this case. This case
can be improved in the future.

After this patch, a simple test of booting to a small buildroot rootfs
on qemu shows:

kmalloc-96          5041   5041     96 ...
kmalloc-64          9606   9606     64 ...
kmalloc-32          5128   5128     32 ...
kmalloc-16          7682   7682     16 ...
kmalloc-8          10246  10246      8 ...

So we save about 1268KB of memory. The saving will be much larger in a
normal OS environment on real HW platforms.

Patches 1-4 are either cleanups or preparation patches.
Patch 5 allows kmalloc() caches to be aligned to the smallest value.
Patch 6 enables DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent.

After this series:

On coherent platforms, i.e. !ZICBOM and !THEAD_CMO, the
kmalloc-{8,16,32,96} caches come back on both RV32 and RV64.

On noncoherent RV32 platforms, nothing changes.

On noncoherent RV64 platforms, i.e. either ZICBOM or THEAD_CMO, the
above kmalloc caches also come back if the system has > 4GB of memory,
or if users pass "swiotlb=mmnn,force" to force swiotlb creation when it
has <= 4GB of memory. How large mmnn should be depends on the specific
platform; it needs to be tried and tested against all possible use
cases on the specific hardware. For example, I can use the minimal
I/O TLB slabs on Sipeed M1S Dock.

[1] Link: https://lore.kernel.org/linux-arm-kernel/20230524171904.3967031-1-catalin.marinas@xxxxxxx/

Jisheng Zhang (6):
  riscv: errata: thead: only set cbom size & noncoherent during boot
  riscv: mm: mark CBO relate initialization funcs as __init
  riscv: mm: mark noncoherent_supported as __ro_after_init
  riscv: mm: pass noncoherent or not to riscv_noncoherent_supported()
  riscv: allow kmalloc() caches aligned to the smallest value
  riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent

 arch/riscv/Kconfig                  |  1 +
 arch/riscv/errata/thead/errata.c    | 22 ++++++++++++++--------
 arch/riscv/include/asm/cache.h      | 14 ++++++++++++++
 arch/riscv/include/asm/cacheflush.h |  4 ++--
 arch/riscv/kernel/setup.c           |  6 +++++-
 arch/riscv/mm/cacheflush.c          |  8 ++++----
 arch/riscv/mm/dma-noncoherent.c     | 16 +++++++++++-----
 7 files changed, 51 insertions(+), 20 deletions(-)