Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

From: Suren Baghdasaryan
Date: Tue Dec 08 2020 - 02:24:44 EST


On Mon, Nov 30, 2020 at 11:01 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > > process_madvise requires a vector of address ranges to be provided for
> > > > > its operations. When advice should be applied to the entire process,
> > > > > the calling process has to obtain the list of VMAs of the target process
> > > > > by reading /proc/pid/maps or in some other way. The cost of this
> > > > > operation grows linearly with the number of VMAs in the target
> > > > > process. Even constructing the input vector can be non-trivial when the
> > > > > target process has several thousand VMAs and the syscall is being
> > > > > issued during a period of high memory pressure, when new allocations for
> > > > > such a vector would only worsen the situation.
> > > > > In the case when advice is being applied to the entire memory space of
> > > > > the target process, this creates an extra overhead.
> > > > > Add a PMADV_FLAG_RANGE flag for process_madvise, enabling the caller to
> > > > > advise a memory range of the target process. For now, to keep it simple,
> > > > > only the entire process memory range is supported; the vec and vlen
> > > > > inputs are ignored in this mode and can be NULL and 0.
> > > > > Instead of returning the number of bytes that advice was successfully
> > > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > > to the fact that the number of bytes would not be useful for the caller
> > > > > that does not know the amount of memory the call is supposed to affect.
> > > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > > bytes affected when the operation is applied to a large memory range.
> > > >
> > > > Can we just use one element in iovec to indicate the entire address
> > > > space rather than using up the reserved flags?
> > > >
> > > > struct iovec iov = {
> > > >         .iov_base = NULL,
> > > >         .iov_len = ~(size_t)0,
> > > > };
> > > >
> > > > Furthermore, it could be applied to other syscalls that support
> > > > iovec, if we agree on it.
> > > >
> > >
> > > The flag also changes the return value semantics. If we follow your
> > > suggestion we should also agree that in this mode the return value
> > > will be 0 on success and negative otherwise instead of the number of
> > > bytes madvise was applied to.
> >
> > Well, the return value will depend on each API. If the operation is
> > disruptive, it should return the size actually affected by the API;
> > otherwise, returning 0 or an error would be okay.
>
> I'm fine with dropping the flag; I just thought with the flag it would
> be more explicit that this is a special mode operating on ranges. This
> way the patch also becomes simpler.
> Andrew, Michal, Christian, what do you think about such an API? Should I
> change the API this way / keep the flag / change it in some other way?


Friendly ping to get some feedback on the proposed API please.