Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings
From: Ingo Molnar
Date: Tue Jan 03 2023 - 13:16:03 EST
* Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> On Tue, Jan 03, 2023 at 11:50:43AM +0100, Ingo Molnar wrote:
> >
> > * Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> >
> > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > new system call that has certain requirements:
> > >
> > > - It shouldn't be written to core dumps.
> > > * Easy: VM_DONTDUMP.
> > > - It should be zeroed on fork.
> > > * Easy: VM_WIPEONFORK.
> > >
> > > - It shouldn't be written to swap.
> > > * Uh-oh: mlock is rlimited.
> > > * Uh-oh: mlock isn't inherited by forks.
> > >
> > > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > > page faulting in memory if none is available
> > > * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > > * Uh-oh: VM_NORESERVE means segfaults.
> > >
> > > It turns out that the vDSO getrandom() function has three really nice
> > > characteristics that we can exploit to solve this problem:
> > >
> > > 1) Due to being wiped during fork(), the vDSO code is already robust to
> > > having the contents of the pages it reads zeroed out midway through
> > > the function's execution.
> > >
> > > 2) In the absolute worst case of whatever contingency we're coding for,
> > > we have the option to fallback to the getrandom() syscall, and
> > > everything is fine.
> > >
> > > 3) The buffers the function uses are only ever useful for a maximum of
> > > 60 seconds -- a sort of cache, rather than a long term allocation.
> > >
> > > These characteristics mean that we can introduce VM_DROPPABLE, which
> > > has the following semantics:
> > >
> > > a) It never is written out to swap.
> > > b) Under memory pressure, mm can just drop the pages (so that they're
> > > zero when read back again).
> > > c) If there's not enough memory to service a page fault, it's not fatal,
> > > and no signal is sent. Instead, writes are simply lost.
> > > d) It is inherited by fork.
> > > e) It doesn't count against the mlock budget, since nothing is locked.
> > >
> > > This is fairly simple to implement, with the one snag that we have to
> > > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > > consumers will probably be 64-bit anyway.
> > >
> > > This way, allocations used by vDSO getrandom() can use:
> > >
> > > VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> > >
> > > And there will be no problem with OOMing, crashing on overcommitment,
> > > using memory when not in use, not wiping on fork(), coredumps, or
> > > writing out to swap.
> > >
> > > At the moment, rather than skipping writes on OOM, the fault handler
> > > just returns to userspace, and the instruction is retried. This isn't
> > > terrible, but it's not quite what is intended. The actual instruction
> > > skipping has to be implemented arch-by-arch, but so does this whole
> > > vDSO series, so that's fine. The following commit addresses it for x86.
> >
> > Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per
> > arch low level work and rarely tested functionality (seriously, whose
> > desktop system touches swap these days?), just so we can add a few pages of
> > per thread vDSO data of a quirky type that in 99.999% of cases won't ever
> > be 'dropped' from under the functionality that is using it and will thus
> > bitrot fast?
>
> It sounds like you've misunderstood the issue.
>
> Firstly, the arch work is +19 lines (in the vdso branch of random.git).
For a single architecture: x86.
And it's only 19 lines because x86 already happens to have a bunch of
complexity implemented, such as a safe instruction decoder that allows the
skipping of an instruction - which relies on thousands of lines of
complexity.
On an architecture where this isn't present, it would have to be
implemented to support the instruction-skipping aspect of VM_DROPPABLE.
Even on x86, it's not common today for the software-decoder to be used in
unprivileged code - primary use was debugging & instrumentation code. So
your patches bring this piece of complexity to a much larger scope of
untrusted user-space functionality.
> That's very small and basic. Don't misrepresent it just to make a point.
I'm not misrepresenting anything.
> Secondly, and more importantly, swapping this data is *not* permissible.
I did not suggest to swap it: my suggestion is to just pin these vDSO data
pages. The per thread memory overhead is infinitesimal on the vast majority
of the target systems, and the complexity trade-off you are proposing is
poorly reasoned IMO.
Anyway:
> Don't misrepresent it just to make a point.
...
> That seems like a ridiculous rhetorical leap.
...
> Did you actually read the commit message?
Frankly, I don't appreciate your condescending discussion style that
borders on the toxic, and to save time I'm nacking this technical approach
until both the patch-set and your reaction to constructive review feedback
improves:
NAcked-by: Ingo Molnar <mingo@xxxxxxxxxx>
I think my core point that it would be much simpler to simply pin those
pages and not introduce rarely-excercised 'discardable memory' semantics in
Linux is a fair one - so it's straightforward to lift this NAK.
I'll re-evaluate the NACK on every new iteration of this patchset I see.
Thanks,
Ingo