Re: [PATCH] mm: madvise: return correct bytes advised with process_madvise

From: Nadav Amit
Date: Wed Mar 09 2022 - 13:50:31 EST




> On Mar 9, 2022, at 8:47 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> On Wed, Mar 09, 2022 at 10:57:59AM +0530, Charan Teja Kalla wrote:
>> The process_madvise() system call returns error even after processing
>> some VMA's passed in the 'struct iovec' vector list which leaves the
>> user confused to know where to restart the advise next. It is also
>> against this syscall man page[1] documentation where it mentions that
>> "return value may be less than the total number of requested bytes, if
>> an error occurred after some iovec elements were already processed.".
>>
>> Consider a user passed 10 VMA's in the 'struct iovec' vector list of
>> which 9 are processed but one. Then it just returns the error caused on
>> that failed VMA despite the first 9 VMA's processed, leaving the user
>> confused about on which VMA it is failed. Returning the number of bytes
>> processed here can help the user to know which VMA it is failed on and
>> thus can retry/skip the advise on that VMA.
>>
>> [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
>>
>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API"
>> Signed-off-by: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>
>> ---
>> mm/madvise.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 38d0f51..d3b49b3 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>
>> while (iov_iter_count(&iter)) {
>> iovec = iov_iter_iovec(&iter);
>> + /*
>> + * Even when [start, end) passed to do_madvise covers
>> + * some unmapped addresses, it continues processing with
>> + * returning ENOMEM at the end. Thus consider the range
>> + * as processed when do_madvise() returns ENOMEM.
>> + * This makes process_madvise() never returns ENOMEM.
>> + */
>
> Looks like that this patch has two things. first, returns processed
> bytes instead of error in case of error. Second, keep working on
> rest vmas on -ENOMEM due to unmapped hole.
>
> First thing totally makes sense to me(that's exactly I wanted to
> do but somehow missed) so it should go stable tree. However,
> second stuff might be arguble so it would be great if you split
> the patch.

I fully understand and relate to the basic motivation of this
patch.

The ENOMEM that this patch checks for, IIUC, is the ENOMEM that is
returned on unmapped holes. Such ENOMEM does not appear, according to
the man page, to be a valid reason to return ENOMEM to userspace.
Presumably process_madvise() is expected to skip unmapped holes
and not to fail because of them.

Having said that, I do not think that the check that the patch does
is clean or clearly documented.

In addition, this patch (and some work on process_madvise()) raise
in my mind a couple of questions:

1. There are other errors that process_madvise might encounter
and can be propagated back to userspace, but are not
documented. For instance if can_madv_lru_vma() fails on
MADV_COLD, userspace will get EINVAL. EINVAL is not documented
as a valid error-code for such case in either madvise() and
process_madvise() man pages.

2. If an advice fails due to other reason, specifically
split_huge_page(), there might be partial success within a
VMA. I guess for now it is fine to ignore such failures
since there is no guarantee on the OS behavior following
most advices (excluding MADV_DONTNEED).

Regards,
Nadav