Re: block: DMA alignment of IO buffer allocated from slab

From: Matthew Wilcox
Date: Mon Sep 24 2018 - 23:28:44 EST

On Tue, Sep 25, 2018 at 08:16:16AM +0800, Ming Lei wrote:
> On Mon, Sep 24, 2018 at 11:57:53AM -0700, Matthew Wilcox wrote:
> > On Mon, Sep 24, 2018 at 09:19:44AM -0700, Bart Van Assche wrote:
> > You're not supposed to use kmalloc memory for DMA. This is why we have
> > dma_alloc_coherent() and friends. Also, from DMA-API.txt:
> Please take a look at USB drivers, storage drivers, or the SCSI layer. Lots
> of DMA buffers are allocated via kmalloc.

Then we have lots of broken places. I mean, this isn't new. We used
to have lots of broken places that did DMA to the stack. And then
the stack was changed to be vmalloc'ed and all those places got fixed.
The difference this time is that it's only certain rare configurations
that are broken, and the brokenness is only found by corruption in some
fairly unlikely scenarios.

> Also see the following description in DMA-API-HOWTO.txt:
> If the device supports DMA, the driver sets up a buffer using kmalloc() or
> a similar interface, which returns a virtual address (X). The virtual
> memory system maps X to a physical address (Y) in system RAM. The driver
> can use virtual address X to access the buffer, but the device itself
> cannot because DMA doesn't go through the CPU virtual memory system.

Sure, but that's not addressing the cacheline coherency problem.
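
To make the cacheline concern concrete, here is a minimal userspace sketch
(not kernel code) that checks whether two buffers touch the same cache line.
The 64-byte line size and the names ASSUMED_CACHE_LINE, line_of() and
shares_cache_line() are assumptions made for illustration; the real line
size is architecture-specific. On hardware without cache-coherent DMA, cache
maintenance operates on whole lines, so a small DMA buffer that shares a
line with an unrelated object is where the trouble starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; the real value is architecture-specific. */
#define ASSUMED_CACHE_LINE 64u

/* Index of the cache line that contains addr. */
static uintptr_t line_of(uintptr_t addr)
{
	return addr / ASSUMED_CACHE_LINE;
}

/*
 * True if the byte ranges [a, a + a_len) and [b, b + b_len) touch at
 * least one common cache line. Both lengths must be non-zero.
 */
static bool shares_cache_line(uintptr_t a, size_t a_len,
			      uintptr_t b, size_t b_len)
{
	assert(a_len > 0 && b_len > 0);
	return line_of(a) <= line_of(b + b_len - 1) &&
	       line_of(b) <= line_of(a + a_len - 1);
}
```

For example, a 4-byte buffer at 0x1000 and an 8-byte object at 0x1008 fall
in the same assumed 64-byte line, while two 64-byte objects at 0x1000 and
0x1040 do not.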

Regardless of what the docs did or didn't say, let's try answering
the question: what makes for a more useful system?

A: A kmalloc implementation which always returns an address suitable
for mapping using the DMA interfaces

B: A kmalloc implementation which is more efficient, but requires drivers
to use a different interface for allocating space for the purposes of DMA

I genuinely don't know the answer to this question, and I think there are
various people in this thread who believe A or B quite strongly.

I would also like to ask people who believe in A what should happen in
this situation:

	blocks = kmalloc(4, GFP_KERNEL);
	sg_init_one(&sg, blocks, 4);
	result = ntohl(*blocks);

(this is just one example; there are others). Because if we have to
round all allocations below 64 bytes up to 64 bytes, that's going to be
a memory consumption problem. On my laptop:

    kmalloc-96   11527  15792  96   42  1 : slabdata  376  376  0
    kmalloc-64   54406  62912  64   64  1 : slabdata  983  983  0
    kmalloc-32   80325  84096  32  128  1 : slabdata  657  657  0
    kmalloc-16   26844  30208  16  256  1 : slabdata  118  118  0
    kmalloc-8    17141  21504   8  512  1 : slabdata   42   42  0

I make that an extra 1799 pages (7MB). Not the end of the world, but
not free either.
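
One way to estimate the per-cache cost of such rounding is sketched below in
userspace C. It assumes 4096-byte pages and one page per slab, as in the
slabinfo rows above; ASSUMED_PAGE_SIZE and extra_pages() are hypothetical
names, and this is a rough illustration rather than the exact calculation
behind the figure quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size; matches one page per slab in the rows above. */
#define ASSUMED_PAGE_SIZE 4096u

/*
 * Extra pages a cache would need if every object grew to new_size.
 *   num_objs:  total object slots in the cache (second number in a row)
 *   cur_pages: pages the cache occupies now (slabs * pages per slab)
 */
static size_t extra_pages(size_t num_objs, size_t cur_pages, size_t new_size)
{
	size_t per_page, new_pages;

	assert(new_size > 0 && new_size <= ASSUMED_PAGE_SIZE);
	per_page = ASSUMED_PAGE_SIZE / new_size;
	/* Round up: a partly filled page still costs a whole page. */
	new_pages = (num_objs + per_page - 1) / per_page;
	return new_pages > cur_pages ? new_pages - cur_pages : 0;
}
```

For the kmalloc-32 row (84096 object slots in 657 pages), rounding up to 64
bytes gives extra_pages(84096, 657, 64) == 657, i.e. the cache doubles in
size. A total depends on which caches are included and on whether allocated
slots or active objects are counted.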