Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

From: Jens Axboe
Date: Fri Aug 28 2020 - 15:27:07 EST


On 8/28/20 12:24 PM, Jens Axboe wrote:
> On 8/28/20 11:40 AM, Arnd Bergmann wrote:
>> On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>> So finally, the API is as follows,
>>>
>>> ssize_t process_madvise(int pidfd, const struct iovec *iovec,
>>>                         unsigned long vlen, int advice, unsigned int flags);
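>>>
>>> Purely for illustration (untested, and assuming the headers export
>>> __NR_process_madvise, SYS_pidfd_open and MADV_COLD; the pid and the
>>> address range below are made up), a userspace caller would look
>>> roughly like this:
>>>
>>> 	#include <stdio.h>
>>> 	#include <sys/mman.h>
>>> 	#include <sys/syscall.h>
>>> 	#include <sys/uio.h>
>>> 	#include <unistd.h>
>>>
>>> 	/* thin wrapper until libc grows one */
>>> 	static ssize_t process_madvise(int pidfd, const struct iovec *iovec,
>>> 				       unsigned long vlen, int advice,
>>> 				       unsigned int flags)
>>> 	{
>>> 		return syscall(__NR_process_madvise, pidfd, iovec, vlen,
>>> 			       advice, flags);
>>> 	}
>>>
>>> 	int main(void)
>>> 	{
>>> 		/* get a pidfd for the target and hint one of its ranges cold */
>>> 		int pidfd = syscall(SYS_pidfd_open, 1234, 0);
>>> 		struct iovec iov = {
>>> 			.iov_base = (void *)0x7f0000000000ul,
>>> 			.iov_len  = 4096,
>>> 		};
>>>
>>> 		if (process_madvise(pidfd, &iov, 1, MADV_COLD, 0) < 0)
>>> 			perror("process_madvise");
>>> 		return 0;
>>> 	}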
>>
>> I had not followed the discussion earlier and only now came across
>> the syscall in linux-next, sorry for stirring things up this late.
>>
>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>>> index 94bf4958d114..8f959d90338a 100644
>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>> @@ -364,6 +364,7 @@
>>> 440 common watch_mount sys_watch_mount
>>> 441 common watch_sb sys_watch_sb
>>> 442 common fsinfo sys_fsinfo
>>> +443 64 process_madvise sys_process_madvise
>>>
>>> #
>>> # x32-specific system call numbers start at 512 to avoid cache impact
>>> @@ -407,3 +408,4 @@
>>> 545 x32 execveat compat_sys_execveat
>>> 546 x32 preadv2 compat_sys_preadv64v2
>>> 547 x32 pwritev2 compat_sys_pwritev64v2
>>> +548 x32 process_madvise compat_sys_process_madvise
>>
>> I think we should not add any new x32-specific syscalls. Instead,
>> compat_sys_process_madvise and sys_process_madvise can be merged
>> into a single entry point.
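>>
>> (That is, once a single entry point handles both cases, the 64-bit
>> table would only need one line along the lines of
>>
>> 443	common	process_madvise		sys_process_madvise
>>
>> and the extra 548/x32 row above would not be needed at all.)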
>>
>>> +	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
>>> +	if (IS_ERR_OR_NULL(mm)) {
>>> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>>> +		goto release_task;
>>> +	}
>>
>> Minor point: having to use IS_ERR_OR_NULL() tends to be fragile,
>> and I would try to avoid it. Could mm_access() be changed to return
>> ERR_PTR(-ESRCH) itself instead of NULL, to improve its calling
>> conventions? I see there are only three other callers.
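>>
>> The caller quoted above would then shrink to something like (untested):
>>
>> 	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
>> 	if (IS_ERR(mm)) {
>> 		ret = PTR_ERR(mm);
>> 		goto release_task;
>> 	}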
>>
>>
>>> +	ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
>>> +	if (ret >= 0) {
>>> +		ret = do_process_madvise(pidfd, &iter, behavior, flags);
>>> +		kfree(iov);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +#ifdef CONFIG_COMPAT
>> ...
>>> +
>>> +	ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
>>> +				  &iov, &iter);
>>> +	if (ret >= 0) {
>>> +		ret = do_process_madvise(pidfd, &iter, behavior, flags);
>>> +		kfree(iov);
>>> +	}
>>
>> Every syscall that passes an iovec seems to do this. If we make
>> import_iovec() handle both the native and the compat case directly,
>> this syscall and a number of others can be simplified, and you avoid
>> the x32 entry point I mentioned above.
>>
>> Something like (untested)
>>
>> index dad8d0cfaaf7..0de4ddff24c1 100644
>> --- a/lib/iov_iter.c
>> +++ b/lib/iov_iter.c
>> @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct iovec __user * uvector,
>>  {
>>  	ssize_t n;
>>  	struct iovec *p;
>> -	n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
>> -				  *iov, &p);
>> +
>> +	if (in_compat_syscall())
>> +		n = compat_rw_copy_check_uvector(type, uvector, nr_segs,
>> +						 fast_segs, *iov, &p);
>> +	else
>> +		n = rw_copy_check_uvector(type, uvector, nr_segs,
>> +					  fast_segs, *iov, &p);
>>  	if (n < 0) {
>>  		if (p != *iov)
>>  			kfree(p);
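>>
>> With that in place, the separate compat entry point under #ifdef
>> CONFIG_COMPAT could go away as well; roughly (untested), a single
>> definition along these lines would cover both the native and the
>> compat case:
>>
>> 	SYSCALL_DEFINE5(process_madvise, int, pidfd,
>> 			const struct iovec __user *, vec, unsigned long, vlen,
>> 			int, behavior, unsigned int, flags)
>> 	{
>> 		struct iovec iovstack[UIO_FASTIOV];
>> 		struct iovec *iov = iovstack;
>> 		struct iov_iter iter;
>> 		ssize_t ret;
>>
>> 		/* import_iovec() now picks the compat layout by itself */
>> 		ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
>> 				   &iov, &iter);
>> 		if (ret >= 0) {
>> 			ret = do_process_madvise(pidfd, &iter, behavior, flags);
>> 			kfree(iov);
>> 		}
>> 		return ret;
>> 	}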
>
> Doesn't work for the async case, where you need to hold on to the
> allocated iovec. But in general I think it's a good helper for the sync
> case, which is by far the majority.

Nevermind, I'm an idiot for reading this totally wrong.

--
Jens Axboe