Re: [PATCH 07/13] mm: remove the page_shift member from struct hmm_range

From: Christoph Hellwig
Date: Thu Aug 01 2019 - 02:49:28 EST


On Tue, Jul 30, 2019 at 05:50:16PM +0000, Jason Gunthorpe wrote:
> The way ODP seems to work is once in hugetlb mode the dma addresses
> must give huge pages or the page fault will be failed. I think that is
> a terrible design, but this is how the driver is ..
>
> So, from this HMM perspective if the caller asked for huge pages then
> the results have to be all huge pages or a hard failure.

Which isn't how the page_shift member works at moment. It still
allows non-hugetlb mappings even with the member.

> It is not negotiated as an optimization like you are thinking.
>
> [note, I haven't yet checked carefully how this works in ODP, every
> time I look at parts of it the thing seems crazy]

This seems pretty crazy. Especially as hugetlb use in applications
seems to fade in favour of THP, for which this ODP scheme does not seem
to work at all.

> > The best API for mlx4 would of course be to pass a biovec-style
> > variable length structure that hmm_fault could fill out, but that would
> > be a major restructure.
>
> It would work, but the driver has to expand that into a page list
> right awayhow.
>
> We can't even dma map the biovec with today's dma API as it needs the
> ability to remap on a page granularity.

We can do dma_map_page loops over each biovec entry pretty trivially,
and that won't be any worse than the current loop over each page in
the hmm dma helpers. Once I get around the work to have a better
API for iommu mappings for bio_vecs we could coalesce it similar to
how we do it with scatterlist (but without all the mess of a new
structure). That work is going to take a little longer through, as
it needs the amd and intell iommu drivers to be convered to dma-iommu
which isn't making progress as far as I hoped.

Let me know if you want to keep this code for now despite the issues,
or if we'd rather reimplement it once you've made sense of the ODP
code.