Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

From: Suren Baghdasaryan
Date: Wed Nov 18 2020 - 19:14:27 EST


On Wed, Nov 18, 2020 at 11:55 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Wed, Nov 18, 2020 at 11:51 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Wed, Nov 18, 2020 at 11:32 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote:
> > > > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > > >
> > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote:
> > > > > [...]
> > > > > > It's all sounding a bit painful (but not *too* painful). But to
> > > > > > reiterate, I do think that adding the ability for a process to shoot
> > > > > > down a large amount of another process's memory is a lot more generally
> > > > > > useful than tying it to SIGKILL, agree?

I was looking into how to work around the MAX_RW_COUNT limitation.
The conceptual issue there is "struct iovec", whose iov_len is a
size_t that lacks the capacity to express ranges like "entire process
memory". I would like to check your reaction to the following idea,
which can be implemented without painful surgery on import_iovec and
its friends.
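
To put a number on that limit: a rough sketch, assuming 4 KiB pages and
that MAX_RW_COUNT is INT_MAX rounded down to a page boundary, as it is
defined in include/linux/fs.h. The SKETCH_* names and calls_needed()
are made up for illustration and are not kernel symbols.

```c
#include <limits.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The real PAGE_SIZE is arch/config dependent. */
#define SKETCH_PAGE_SIZE	UINT64_C(4096)

/* Mirrors the kernel's (INT_MAX & PAGE_MASK) under the assumption above. */
#define SKETCH_MAX_RW_COUNT \
	((uint64_t)INT_MAX & ~(SKETCH_PAGE_SIZE - 1))

/*
 * Hypothetical helper: how many capped calls would be needed to cover
 * span_bytes if each call can advise at most SKETCH_MAX_RW_COUNT bytes.
 */
static uint64_t calls_needed(uint64_t span_bytes)
{
	return (span_bytes + SKETCH_MAX_RW_COUNT - 1) / SKETCH_MAX_RW_COUNT;
}
```

With these assumptions the cap is 0x7ffff000 bytes, just under 2 GiB,
so an 8 GiB span already needs five calls.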

process_madvise(pidfd,
                iovec = [ { range_start_addr, 0 }, { range_end_addr, 0 } ],
                vlen = 2, behavior = MADV_xxx, flags = PMADV_FLAG_RANGE)

So, to represent a range we pass a new PMADV_FLAG_RANGE flag and
construct a 2-element vector that expresses the range start and range
end using the iovec.iov_base members. The iov_len member of each iovec
element is ignored in this mode. I know it sounds hacky, but I think
it's the simplest way if we want the ability to express an arbitrarily
large range.
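
A minimal userspace sketch of that encoding, only to make the idea
concrete. PMADV_FLAG_RANGE does not exist anywhere yet, so its value
here is an assumption, and both helpers are hypothetical. The decode
side is just a guess at the checks the kernel might do. Nothing below
calls the real syscall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Assumption: placeholder value, the flag is only proposed in this mail. */
#define PMADV_FLAG_RANGE	0x1u

/* Encode the range in iov_base of two elements; iov_len stays unused. */
static void pmadv_encode_range(struct iovec iov[2],
			       uintptr_t start, uintptr_t end)
{
	assert(start <= end);
	iov[0].iov_base = (void *)start;
	iov[0].iov_len = 0;
	iov[1].iov_base = (void *)end;
	iov[1].iov_len = 0;
}

/*
 * Decode the way a receiver might: require the flag and exactly two
 * elements, ignore iov_len. Returns 0 on success, -1 on bad input.
 */
static int pmadv_decode_range(const struct iovec *iov, size_t vlen,
			      unsigned int flags,
			      uintptr_t *start, uintptr_t *end)
{
	if (!(flags & PMADV_FLAG_RANGE) || vlen != 2)
		return -1;
	*start = (uintptr_t)iov[0].iov_base;
	*end = (uintptr_t)iov[1].iov_base;
	return *start <= *end ? 0 : -1;
}
```

Encoding start = 0 and end = (uintptr_t)-1 this way would cover the
whole address space, which iov_len alone cannot describe in a single
element.
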
Another option is to do what Andrew described as "madvise((void *)0,
(void *)-1, MADV_PAGEOUT)" which means this mode works only with the
entire mm of the process.
WDYT?

> > > > >
> > > > > I am not sure TBH. Is there any reasonable usecase where uncoordinated
> > > > > memory tear down is OK and a target process which is able to see the
> > > > > unmapped memory?
> > > >
> > > > I think uncoordinated memory tear down is a special case which makes
> > > > sense only when the target process is being killed (and we can enforce
> > > > that by allowing MADV_DONTNEED to be used only if the target process
> > > > has pending SIGKILL).
> > >
> > > That would be safe but then I am wondering whether it makes sense to
> > > implement as a madvise call. It is quite strange to expect somebody call
> > > a syscall on a killed process. But this is more a detail. I am not a
> > > great fan of a more generic MADV_DONTNEED on a remote process. This is
> > > just too dangerous IMHO.
> >
> > Agree 100%
>
> I assumed here that by "a more generic MADV_DONTNEED on a remote
> process" you meant "process_madvise(MADV_DONTNEED) applied to a
> process that is not being killed". Re-reading your comment I realized
> that you might have meant "process_madvise() with generic support to
> large memory areas". I hope I understood you correctly.
>
> >
> > >
> > > > However, the ability to apply other flavors of
> > > > process_madvise() to large memory areas spanning multiple VMAs can be
> > > > useful in more cases.
> > >
> > > Yes I do agree with that. The error reporting would be more tricky but
> > > I am not really sure that the exact reporting is really necessary for
> > > advice like interface.
> >
> > Andrew's suggestion for this special mode to change return semantics
> > to the usual "0 or error code" seems to me like the most reasonable
> > way to deal with the return value limitation.
> >
> > >
> > > > For example in Android we will use
> > > > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background
> > > > process.
> > >
> > > That makes sense to me.
> > > --
> > > Michal Hocko
> > > SUSE Labs