Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

From: Suren Baghdasaryan
Date: Tue Nov 24 2020 - 00:45:41 EST


On Wed, Nov 18, 2020 at 4:13 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Wed, Nov 18, 2020 at 11:55 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Wed, Nov 18, 2020 at 11:51 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Nov 18, 2020 at 11:32 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > > On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote:
> > > > > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > > > >
> > > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote:
> > > > > > [...]
> > > > > > > It's all sounding a bit painful (but not *too* painful). But to
> > > > > > > reiterate, I do think that adding the ability for a process to shoot
> > > > > > > down a large amount of another process's memory is a lot more generally
> > > > > > > useful than tying it to SIGKILL, agree?
>
> I was looking into how to work around the limitation of MAX_RW_COUNT
> and the conceptual issue there is the "struct iovec" which has its
> iov_len as size_t that lacks capacity for expressing ranges like
> "entire process memory". I would like to check your reaction to the
> following idea which can be implemented without painful surgery on
> import_iovec and its friends.
>
> process_madvise(pidfd, iovec = [ { range_start_addr, 0 }, {
> range_end_addr, 0 } ], vlen = 2, behavior=MADV_xxx, flags =
> PMADV_FLAG_RANGE)
>
> So, to represent a range we pass a new PMADV_FLAG_RANGE flag and
> construct a 2-element vector to express range start and range end
> using iovec.iov_base members. iov_len member of the iovec elements is
> ignored in this mode. I know it sounds hacky but I think it's the
> simplest way if we want the ability to express an arbitrarily large
> range.
> Another option is to do what Andrew described as "madvise((void *)0,
> (void *)-1, MADV_PAGEOUT)" which means this mode works only with the
> entire mm of the process.
> WDYT?
>

To follow up on this discussion, I posted a patchset to implement
process_madvise(MADV_DONTNEED) supporting the entire mm range at
https://lkml.org/lkml/2020/11/24/21.
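
For readers following the quoted proposal, here is a minimal userspace sketch
of the two-element range encoding, in which only the iov_base members carry
information and iov_len is ignored. PMADV_FLAG_RANGE is the flag name floated
in this thread, not an existing kernel ABI; its value here and the helper
names pmadv_encode_range() and pmadv_decode_range() are assumptions made
purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* HYPOTHETICAL: flag name proposed in this thread; the value is arbitrary
 * and no released kernel defines it. */
#define PMADV_FLAG_RANGE 0x1u

/* Encode [start, end) as two iovec elements. Only iov_base is meaningful;
 * iov_len is set to 0 because the proposal ignores it in this mode.
 * Returns the vlen to pass (2), or -1 if the range is inverted. */
static int pmadv_encode_range(struct iovec vec[2], uintptr_t start,
                              uintptr_t end)
{
    assert(vec != NULL);
    if (end < start)
        return -1;
    vec[0].iov_base = (void *)start;
    vec[0].iov_len = 0;
    vec[1].iov_base = (void *)end;
    vec[1].iov_len = 0;
    return 2;
}

/* Recover start and length from the encoded pair, as a receiver of this
 * encoding might. Returns 0 on success, -1 if the pair is inverted. */
static int pmadv_decode_range(const struct iovec vec[2], uintptr_t *start,
                              size_t *len)
{
    uintptr_t s = (uintptr_t)vec[0].iov_base;
    uintptr_t e = (uintptr_t)vec[1].iov_base;

    if (e < s)
        return -1;
    *start = s;
    *len = (size_t)(e - s);
    return 0;
}
```

With such a pair in hand, the call would look roughly like
process_madvise(pidfd, vec, 2, MADV_PAGEOUT, PMADV_FLAG_RANGE). That call is
only the shape under discussion: process_madvise() as merged requires
flags == 0 and would return EINVAL for it.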

> > > > > >
> > > > > > I am not sure TBH. Is there any reasonable usecase where uncoordinated
> > > > > > memory tear down is OK and the target process is able to see the
> > > > > > unmapped memory?
> > > > >
> > > > > I think uncoordinated memory tear down is a special case which makes
> > > > > sense only when the target process is being killed (and we can enforce
> > > > > that by allowing MADV_DONTNEED to be used only if the target process
> > > > > has pending SIGKILL).
> > > >
> > > > That would be safe but then I am wondering whether it makes sense to
> > > > implement as a madvise call. It is quite strange to expect somebody to call
> > > > a syscall on a killed process. But this is more a detail. I am not a
> > > > great fan of a more generic MADV_DONTNEED on a remote process. This is
> > > > just too dangerous IMHO.
> > >
> > > Agree 100%
> >
> > I assumed here that by "a more generic MADV_DONTNEED on a remote
> > process" you meant "process_madvise(MADV_DONTNEED) applied to a
> > process that is not being killed". Re-reading your comment I realized
> > that you might have meant "process_madvise() with generic support to
> > large memory areas". I hope I understood you correctly.
> >
> > >
> > > >
> > > > > However, the ability to apply other flavors of
> > > > > process_madvise() to large memory areas spanning multiple VMAs can be
> > > > > useful in more cases.
> > > >
> > > > Yes I do agree with that. The error reporting would be more tricky but
> > > > I am not really sure that the exact reporting is really necessary for
> > > > an advice-like interface.
> > >
> > > Andrew's suggestion for this special mode to change return semantics
> > > to the usual "0 or error code" seems to me like the most reasonable
> > > way to deal with the return value limitation.
> > >
> > > >
> > > > > For example in Android we will use
> > > > > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background
> > > > > process.
> > > >
> > > > That makes sense to me.
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs