Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection

From: Linus Torvalds
Date: Thu May 25 2023 - 15:22:01 EST

On Thu, May 25, 2023 at 11:08 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> Certainly on the track where I wish we could go. Now this goes tested.
> On 255 cores:
> Before:
> vagrant@kmod ~ $ sudo systemd-analyze
> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
> reached after 44.178s in userspace.
> root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
> Virtual mem wasted bytes 1949006968
> ; 1949006968/1024/1024/1024
> ~1.81515418738126754761
> So ~1.8 GiB... of vmalloc space wasted during boot.
> After:
> systemd-analyze
> Startup finished in 24.438s (kernel) + 41.278s (userspace) = 1min 5.717s
> reached after 41.154s in userspace.
> root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
> Virtual mem wasted bytes 354413398
> So still 337.99 MiB of vmalloc space wasted during boot due to
> duplicates.

Ok. I think this will count as 'good enough for mitigation purposes'.

> The reason is the exclusive_deny_write_access() must be
> kept during the life of the module otherwise as soon as it is done
> others can still race to load

Yes. The exclusion only applies while the file is actively being read.

> So with two other hunks added (2nd and 4th), this now matches parity with
> my patch, not suggesting this is right,

Yeah, we can't do that, because user space may quite validly want to
write the file afterwards.

Or, in fact, unload the module and re-load it.

So the "exclusion" really needs to be purely temporary.

That said, I considered moving the exclusion to module/main.c itself,
rather than the reading part. That would get rid of the hacky "id ==
READING_MODULE", and put the exclusion in the place that actually
wants it.

And that would allow us to at least extend that temporary exclusion a
bit - we could keep it until the module has actually been loaded and

So it would probably improve on those numbers a bit more, but you'd
still have the fundamental race where *serial* duplicates end up
always wasting CPU effort and temporary vmalloc space.
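
To make the "purely temporary" point concrete, here is a minimal user-space
sketch of the exclusion. It is not the patch itself: struct model_inode and
all of the model_* helpers are hypothetical names made up for illustration,
and the counter only assumes the usual write-count convention (positive
means writers, negative means writes are denied, and the exclusive variant
succeeds only when the count is exactly zero).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an inode's write count (an assumption):
 *   > 0  : that many writers have the file open
 *   == 0 : idle
 *   < 0  : write access is currently denied */
struct model_inode {
    atomic_int writecount;
};

/* Succeeds only if nobody writes AND nobody else already denies writes. */
static int model_exclusive_deny_write(struct model_inode *inode)
{
    int expected = 0;

    if (atomic_compare_exchange_strong(&inode->writecount, &expected, -1))
        return 0;
    return -ETXTBSY;
}

/* Drop the exclusion again once the reader is done with the file. */
static void model_allow_write(struct model_inode *inode)
{
    atomic_fetch_add(&inode->writecount, 1);
}

/* A writer gets in unless writes are currently denied. */
static int model_get_write_access(struct model_inode *inode)
{
    int v = atomic_load(&inode->writecount);

    while (v >= 0) {
        /* On failure, v is reloaded and the sign is re-checked. */
        if (atomic_compare_exchange_weak(&inode->writecount, &v, v + 1))
            return 0;
    }
    return -ETXTBSY;
}

static void model_put_write_access(struct model_inode *inode)
{
    atomic_fetch_sub(&inode->writecount, 1);
}

/* Walks through the sequence discussed above; returns the final count. */
static int model_demo(void)
{
    struct model_inode inode;

    atomic_init(&inode.writecount, 0);

    /* The first loader takes the exclusion while it works on the file. */
    assert(model_exclusive_deny_write(&inode) == 0);
    /* A concurrent duplicate loader is turned away. */
    assert(model_exclusive_deny_write(&inode) == -ETXTBSY);
    /* The first loader finishes and drops the exclusion. */
    model_allow_write(&inode);
    /* User space may now validly write the file ... */
    assert(model_get_write_access(&inode) == 0);
    model_put_write_access(&inode);
    /* ... and a later, serial duplicate load is not rejected at all. */
    assert(model_exclusive_deny_write(&inode) == 0);
    model_allow_write(&inode);

    return atomic_load(&inode.writecount);
}
```

Calling model_demo() from a main() runs the whole sequence. The last two
steps are the point: once the exclusion is dropped, nothing in this scheme
stops a serial duplicate from doing all the work again.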