Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations

From: Luis Chamberlain
Date: Fri Mar 24 2023 - 13:55:54 EST


On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
> On 21.03.23 20:32, David Hildenbrand wrote:
> > On 20.03.23 22:27, Luis Chamberlain wrote:
> > > On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> > > > On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > > > > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> > > >
> > > > Oh wow, so to clarify, it boots OK?
> > > >
> > >
> > > Now that we know that tree works, I'm curious also now if you can
> > > confirm just re-ordering the patches still works (it should)
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
> > >
> >
> > So far no vmap errors booting the debug/kasan kernel (2 tries).

<-- snip -->

> > I think we primarily only care about systemd-udev-settle.service.
> >
> > That is fastest without the rcu patch (~6s), compared to with the rcu
> > patch (~6.5s) and with stock (~7.5s -- 8s).
> >
> > Looks like dracut-initqueue also might be a bit faster with your changes, but
> > maybe it's mostly noise (would have to do more runs).
> >
> > So maybe drop that rcu patch? But of course, there could be other scenarios where it's
> > helpful ...

Yes, I can confirm the RCU patch does not help at all, now also
verified using stress-ng.

> Are there any other things you would like me to measure/test? I'll have to
> hand back that test machine soonish.

Yes, please test the below. Perhaps it's not the final form we want, but
it *does* fix OOM'ing when thrashing with stress-ng's module option, and
even with 100 threads it brings max memory consumption down by 259 MiB.
The reason is that we also vmalloc during each kernel_read_file() call
(the finit_module() path) for each module, well before we even get to
layout_and_allocate(), so obviously if we fix the module path but not
this one it will eventually catch up with us. I'm not at all happy with
the current approach: ideally we'd release the counter only when the
user is done with the file, but we don't yet have any such tracking for
users, they just vfree() the memory themselves. So this just tries to
catch heavy immediate abuse on the caller side to fend off vmalloc
abuse in a lightweight manner.
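
To make the idea concrete, a rough sketch of such a lightweight limiter
follows; the wrapper, the counter name, and the limit of 50 are all
made up here for illustration and are not the actual patch:

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/kernel_read_file.h>

/* Illustrative cap on concurrent vmalloc'd file copies. */
#define KREAD_MAX_CONCURRENT	50

static atomic_t kread_in_flight = ATOMIC_INIT(0);

ssize_t kernel_read_file_capped(struct file *file, loff_t offset,
				void **buf, size_t buf_size,
				size_t *file_size,
				enum kernel_read_file_id id)
{
	ssize_t ret;

	/* Fend off a thundering herd of vmalloc'd file copies. */
	if (atomic_inc_return(&kread_in_flight) > KREAD_MAX_CONCURRENT) {
		atomic_dec(&kread_in_flight);
		return -EBUSY;
	}

	ret = kernel_read_file(file, offset, buf, buf_size, file_size, id);

	/*
	 * The limitation discussed above: we drop the count as soon as
	 * the read returns, even though the vmalloc'd buffer lives on
	 * until the caller vfree()s it.
	 */
	atomic_dec(&kread_in_flight);

	return ret;
}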

There's gotta be a better way to do this, but it's just an idea I have
so far. If we *want* to keep tabs until the user is done, we'd have to
modify most users of these APIs and introduce our own free routine. I
don't think we're in a rush to fix this, so maybe that's the better
approach.
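
If we did go that route, it could look roughly like the following;
kernel_read_file_get()/kernel_read_file_put() are hypothetical names
(nothing like this exists today), and every caller that vfree()s the
buffer itself would need converting to the put:

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/kernel_read_file.h>
#include <linux/vmalloc.h>

/* Same illustrative limiter as in the sketch above. */
#define KREAD_MAX_CONCURRENT	50
static atomic_t kread_in_flight = ATOMIC_INIT(0);

/*
 * Hypothetical API: the get side takes a slot in the limiter and only
 * the matching put releases it, so the cap tracks live buffers rather
 * than just in-flight reads.
 */
ssize_t kernel_read_file_get(struct file *file, loff_t offset,
			     void **buf, size_t buf_size,
			     size_t *file_size,
			     enum kernel_read_file_id id)
{
	ssize_t ret;

	if (atomic_inc_return(&kread_in_flight) > KREAD_MAX_CONCURRENT) {
		atomic_dec(&kread_in_flight);
		return -EBUSY;
	}

	ret = kernel_read_file(file, offset, buf, buf_size, file_size, id);
	if (ret < 0) {
		/* kernel_read_file() frees any buffer it allocated on error. */
		atomic_dec(&kread_in_flight);
	}
	return ret;
}

/* Replaces the bare vfree() callers do today. */
void kernel_read_file_put(void *buf)
{
	vfree(buf);
	atomic_dec(&kread_in_flight);
}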

And so I've now managed to reproduce the issues you found with my new
stress-ng module stressor as well.

https://github.com/ColinIanKing/stress-ng.git

Even though you have 400 CPUs, with stress-ng we can likely reproduce
it with the following (use a module not already loaded on your system):

./stress-ng --module 100 --module-name xfs

Without the patch below, using 400 threads still OOMs easily due to
the kread issue. The maximum number of threads allowed is 8192.