Re: [PATCH 1/3] selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB

From: Thomas Weißschuh
Date: Thu Jan 09 2025 - 02:48:01 EST


On Wed, Jan 08, 2025 at 05:46:37PM +0100, David Hildenbrand wrote:
> On 08.01.25 17:13, Thomas Weißschuh wrote:
> > On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
> > > On 08.01.25 09:05, Thomas Weißschuh wrote:
> > > > On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
> > > > >
> > > > > On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
> > > > > > If not enough physical memory is available the kernel may fail mmap();
> > > > > > see __vm_enough_memory() and vm_commit_limit().
> > > > > > In that case the logic in validate_complete_va_space() does not make
> > > > > > sense and will even incorrectly fail.
> > > > > > Instead skip the test if no mmap() succeeded.
> > > > > >
> > > > > > Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
> > > > > > Cc: stable@xxxxxxxxxxxxxxx
> > >
> > > CC stable on tests is ... odd.
> >
> > I thought it was fairly common, but it isn't.
> > Will drop it.
>
> As it's not really a "kernel BUG", it's rather uncommon.

I also used it on patch 2, which is now reproducibly broken on x86
mainline since my commit mentioned in that patch.
But I'll drop it there, too.

> > > Note that with MAP_NORESRVE, most setups we care about will allow mapping as
> > > much as you want, but on access OOM will fire.
> >
> > Thanks for the hint.
> >
> > > So one could require that /proc/sys/vm/overcommit_memory is setup properly
> > > and use MAP_NORESRVE.
> >
> > Isn't the check for lchunks == 0 essentially exactly this?
>
> I assume paired with MAP_NORESERVE?

Yes.

> Maybe, but it could be better to have something that says "if
> overcommit_memory is not setup properly I will SKIP this test", but
> otherwise I expect this to work and will FAIL if it doesn't".

Ok, I'll validate the sysctl value.

> Or would you expect to run into lchunks == 0 even if overcommit_memory is
> setup properly and MAP_NORESERVE is used? (very very low memory that we
> cannot even create all the VMAs?)

No.

> > > Reading from anonymous memory will populate the shared zeropage. To mitigate
> > > OOM from "too many page tables", one could simply unmap the pieces as they
> > > are verified (or MAP_FIXED over them, to free page tables).
> >
> > The code has to figure out if a verified region was created by mmap(),
> > otherwise an munmap() could crash the process.
> > As the entries from /proc/self/maps may have been merged and (I assume)
>
> Yes, and partial unmap (in chunk granularity?) would split them again.
>
> > the ordering of mappings is not guaranteed, some bespoke logic to establish
> > the link will be needed.
>
> My thinking was that you simply process one /proc/self/maps entry in some
> chunks. After processing a chunk, you munmap() it.
>
> So you would process + munmap in chunks.

That is clear. The issue would be to figure which chunks are valid to
unmap. If something critical like the executable file is unmapped,
the process crashes. But see below.

> > Is it fine to rely on CONFIG_ANON_VMA_NAME?
> > That would make it much easier to implement.
>
> Can you elaborate how you would do it?

First set the VMA name after mmap():

for (i = 0; i < NR_CHUNKS_LOW; i++) {
ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (ptr[i] == MAP_FAILED) {
if (validate_lower_address_hint())
ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
break;
}

validate_addr(ptr[i], 0);
if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));
}

During validation:

hop = 0;
while (start_addr + hop < end_addr) {
if (write(fd, (void *)(start_addr + hop), 1) != 1)
return 1;
lseek(fd, 0, SEEK_SET);

if (!strncmp(line + path_offset, "[anon:virtual_address_range]", 28))
munmap((char *)(start_addr + hop), MAP_CHUNK_SIZE);

hop += MAP_CHUNK_SIZE;

}

It is done for each chunk, as all chunks may have been merged into a
single VMA and a per-VMA unmap would not happen before OOM.

> > Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even
> > in very low physical memory conditions.
>
> Cool.