Re: Phyr Starter

From: Jason Gunthorpe
Date: Mon Jan 10 2022 - 19:41:33 EST


On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:

> Finally, it may be possible to stop using scatterlist to describe the
> input to the DMA-mapping operation. We may be able to get struct
> scatterlist down to just dma_address and dma_length, with chaining
> handled through an enclosing struct.

Can you talk about this some more? IMHO one of the key properties of
the scatterlist is that it can hold huge numbers of pages without
needing any kind of special allocation, thanks to the chaining.
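
ie today something like this stitches fixed-size chunks together
without ever needing one big contiguous allocation (just a sketch,
error handling omitted):

struct scatterlist *a, *b;

a = kmalloc_array(SG_MAX_SINGLE_ALLOC, sizeof(*a), GFP_KERNEL);
b = kmalloc_array(SG_MAX_SINGLE_ALLOC, sizeof(*b), GFP_KERNEL);

sg_init_table(a, SG_MAX_SINGLE_ALLOC);
sg_init_table(b, SG_MAX_SINGLE_ALLOC);

/* The last entry of 'a' becomes a link pointing at 'b' */
sg_chain(a, SG_MAX_SINGLE_ALLOC, b);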

The same will be true of the phyr idea, right?

> I would like to see phyr replace bio_vec everywhere it's currently used.
> I don't have time to do that work now because I'm busy with folios.
> If someone else wants to take that on, I shall cheer from the sidelines.
> What I do intend to do is:

I wonder if we've mixed things up though..

IMHO there is a lot of optimization to be had by having a
datastructure that is expressly 'the physical pages underlying a
contiguous chunk of va'.

If you limit it to that scenario then we can be more optimal, because
things like byte-granular offsets and sizes in the interior pages
don't need to exist. Every interior chunk is always aligned to its
order, and we only need to record the order.

An overall starting offset and total length allow computing the slice
of the first/last entry.
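
Something like this is roughly what I have in mind for the container
(just a sketch, all the names are invented):

/* Sketch only: the physical pages backing one contiguous VA range */
struct phyr_list {
	u64 start_offset;	/* byte offset into the first chunk */
	u64 total_length;	/* total bytes covered by the list */
	unsigned int nr_entries;
	u64 entries[];		/* one packed word per aligned chunk */
};

The first/last slice falls out of start_offset and total_length, so
the interior entries never need a byte offset or length of their own.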

If the physical address is always aligned then we get 12 free bits
from the min 4k alignment, and we also only need to store the order,
not an arbitrary byte-granular length.

The win is that I think we can meaningfully cover most common cases
using only 8 bytes per physical chunk. The 12 bits can be used to
encode the common orders (4k, 2M, 1G, etc), plus some smart mechanism
to get another 16 bits to cover 'everything'.
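
ie each 8 byte entry could pack the aligned physical address and the
order into a single word, roughly like this (again, just a sketch):

/* The low 12 bits are free because the address is at least 4k
 * aligned, so use them to hold the order of the chunk. */
#define PHYR_ORDER_MASK		((u64)PAGE_SIZE - 1)

static inline phys_addr_t phyr_entry_addr(u64 entry)
{
	return entry & ~PHYR_ORDER_MASK;
}

static inline unsigned int phyr_entry_order(u64 entry)
{
	return entry & PHYR_ORDER_MASK;
}

static inline u64 phyr_entry_len(u64 entry)
{
	/* order 0 = 4k, 9 = 2M, 18 = 1G, ... */
	return (u64)PAGE_SIZE << phyr_entry_order(entry);
}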

IMHO storage density here is quite important; we end up having to
keep this stuff around for a long time.

I say this here because I've always thought bio_vec/etc are more
general than we actually need, being byte-granular at every chunk.

> - Add an interface to gup.c to pin/unpin N phyrs
> - Add a sg_map_phyrs()
> This will take an array of phyrs and allocate an sg for them
> - Whatever else I need to do to make one RDMA driver happy with
> this scheme

I already spent a lot of time cleaning up all the DMA code in RDMA -
it is now nicely uniform and ready for this sort of change. I was
expecting it to be a bio_vec, but this is fine too.

What is needed is a full scatterlist replacement, including the IOMMU
part.

For the IOMMU I would expect the datastructure to be re-used: we
start with a list of physical pages, and then 'dma map' gives us a
list of IOVA pages in another allocation, but using exactly the same
datastructure.

This 'dma map' could return a pointer to the first datastructure if
there is no iommu, allocate a single-entry list if the whole thing can
be linearly mapped with the iommu, and allocate a full array for the
other baroque cases (like pci offset/etc). ie good HW runs fast and is
memory efficient.
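
Roughly, for the map side I'm imagining something like this (sketch
only, iommu_map_phyrs() is a made up name):

/*
 * Sketch: produce the DMA/IOVA side list for a physical list. Good
 * HW (no iommu, or an iommu that can linearize the range) stays
 * cheap in both time and memory.
 */
struct phyr_list *dma_map_phyrs(struct device *dev,
				struct phyr_list *phys)
{
	/* No iommu: IOVA == physical address, just reuse the input */
	if (!device_iommu_mapped(dev))
		return phys;

	/*
	 * With an iommu the whole range can usually be mapped to one
	 * contiguous IOVA, so this would normally return a single
	 * entry list; only the baroque cases allocate a full array.
	 */
	return iommu_map_phyrs(dev, phys);	/* hypothetical helper */
}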

It would be nice to see a patch sketch showing what this
datastructure could look like.

VFIO would like this structure as well, as it currently uses a very
inefficient page-at-a-time loop when it iommu maps things.

Jason