Re: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc()

From: Luis Chamberlain
Date: Tue Mar 28 2023 - 13:19:10 EST

Next message: Mark Greer: "Re: [PATCH] MAINTAINERS: remove the linux-nfc@xxxxxxxxxxxx list"
Previous message: Vipin Sharma: "Re: [Patch v4 13/18] KVM: mmu: Add common initialization logic for struct kvm_mmu_memory_cache{}"
In reply to: Michal Hocko: "Re: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc()"
Next in thread: Matthew Wilcox: "Re: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Mike, please Cc linux-modules if you want modules folks' input as well ;)

On Tue, Mar 28, 2023 at 8:11 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > OK, so you want to reduce that direct map fragmentation?
>
> Yes.
>
> > Is that a real problem?
>
> A while ago Intel folks published report [1] that showed better performance
> with large pages in the direct map for majority of benchmarks.
>
> > My impression is that modules are mostly static thing. BPF
> > might be a different thing though. I have a recollection that BPF guys
> > were dealing with direct map fragmention as well.
>
> Modules are indeed static, but module_alloc() used by anything that
> allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> mentioned that having code in 2M pages reduces iTLB pressure [2], but
> that's not only about avoiding the splits in the direct map but also about
> using large mappings in the modules address space.

It is easily overlooked why such things create direct fragmentation --
it's because of the special permission stuff done, module_alloc()
targets memory which can be executed somehow.

> BPF guys suggested an allocator for executable memory [3] mainly because
> they've seen performance improvement of 0.6% - 0.9% in their setups [4].

The performance metrics were completely opaque to some synthetic
environment and our goal is to showcase real value with reproducible
performance benchmarks. Since now Song is convinced that modules need
to be a first class citizen in order to generalize a special allocator
we may sooner rather than later real reproducible performance data to
show the benefit of such a special allocator. One of the big
differences with eBPF programs is that modules *can* be rather large
in size. What is the average size of modules? Well let's take a look:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| wc -l
9173
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1}
END {print sum/NR/1024}'
175.1

175 MiB is pretty large.

Ignoring the top 5 piggies:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r |head -5
58315248 - ./drivers/gpu/drm/amd/amdgpu/amdgpu.ko
29605592 - ./drivers/gpu/drm/i915/i915.ko
18591256 - ./drivers/gpu/drm/nouveau/nouveau.ko
16867048 - ./fs/xfs/xfs.ko
14209440 - ./fs/btrfs/btrfs.ko

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
160.54

Ignoring the top 10 largest modules:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-10)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
154.656

Ignoring the top 20 piggies:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-20)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
146.384

Ignoring the top 100 bloated modules:
mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-100)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
117.97

Ignoring the top 1000 bloated modules:
mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-1000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
64.4686

Ignoring the top 2000 bloated modules:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-2000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
48.7869

Ignoring top 3000
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-3000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
39.6037

Ignoring top 4000
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-4000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
33.106

Ignoring top 5000
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-5000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
28.0925

But at least on the driver front we know we won't always have loaded
*all* GPU drivers, but *one*. So the math needs to consider what
module sizes are for modules actually used.

Let's see what the average module size is on on a big system:

mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN
{sum=0} {sum+=$1} END {print sum/NR/1024}'
313.432

On a small desktop:
mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN
{sum=0} {sum+=$1} END {print sum/NR/1024}'
292.786

For each finit_module we also do a vmalloc twice actually, one for the
kernel_read_*() which lets us read the file from userspace, and then a
second set of allocations where we copy data over, each step either
using vzalloc() or module_alloc() (vmalloc() with special
permissions), this is more easily reflected and visible now on
linux-next with Song's new module_memory_alloc().

However -- for the context of *this* effort, we're only talking about
the executable sections, the areas we'd use module_alloc() for.

On a big system:

mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename | xargs size --radix=10 | awk '{print
$1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
88.7964

On a small desktop:
mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename | xargs size --radix=10 | awk '{print
$1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
92.1473

Regardless, the amount of memory allocated is pretty significant, to
the point a 400 CPU system easily run out of vmap space (yes the
kernel does some stupid stuff, which we're correcting over time):

https://lkml.kernel.org/r/20221013180518.217405-1-david@xxxxxxxxxx

To help with this we have some recent efforts to help with this
pressure on vmap space:

https://lkml.kernel.org/r/20230311051712.4095040-1-mcgrof@xxxxxxxxxx

> > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > > to implement the part of vmalloc() that reserves the virtual addresses and
> > > maps the allocated memory there in module_alloc().
> >
> > Another option would be to provide an allocator for the backing pages to
> > vmalloc. But I do agree that a gfp flag is a less laborous way to
> > achieve the same. So the primary question really is whether we really
> > need vmalloc support for unmapped memory.
>
> I'm not sure I follow here. module_alloc() is essentially an alias to
> vmalloc(), so to reduce direct map fragmentation caused by code allocations
> the most sensible way IMO is to support unmapped memory in vmalloc().

The big win also would be to use huge pages from a performance point
of view, specially reducing iTLB pressure, however that is *slightly*
orthogonal to your goal of reducing direct map fragmentation. However,
we should not ignore that strategy as it is a special use case which
*could* be leveraged in other ways too. And so I'd prefer to see that
over just a flag to hide those allocations. It would not only reduce
that fragmentation, but reduce iTLB pressure which I think *is* what
experimentation revealed to show more gains.

> I also think vmalloc with unmmapped pages can provide backing pages for
> execmem_alloc() Song proposed.

Only if you consider huge pages. I don't see that here.

Luis

Next message: Mark Greer: "Re: [PATCH] MAINTAINERS: remove the linux-nfc@xxxxxxxxxxxx list"
Previous message: Vipin Sharma: "Re: [Patch v4 13/18] KVM: mmu: Add common initialization logic for struct kvm_mmu_memory_cache{}"
In reply to: Michal Hocko: "Re: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc()"
Next in thread: Matthew Wilcox: "Re: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]