Re: [External] Re: [RFC] mm: add new syscall pidfd_set_mempolicy()

From: Michal Hocko
Date: Wed Oct 12 2022 - 05:03:27 EST


On Wed 12-10-22 15:55:44, Zhongkun He wrote:
> Hi Michal, thanks for your reply and suggestions.
>
> > Please add some explanation why the cpuset interface is not usable for
> > that usecase.
> OK.
>
> > > To solve the issue, this patch introduces a new syscall
> > > pidfd_set_mempolicy(2). It sets the NUMA memory policy of the thread
> > > specified in pidfd.
> > >
> > > In the current process context there is no locking because only the
> > > process accesses its own memory policy, so task_work is used in
> > > pidfd_set_mempolicy() to update the mempolicy of the process specified
> > > in pidfd, which avoids both extra locking and race conditions.
> >
> > Why cannot you alter kernel_set_mempolicy (and do_set_mempolicy) to
> > accept a task rather than operate on current?
>
> I tried that before this patch, but I found a problem. The allocation and
> update of the mempolicy happen in the current context, so they are not
> protected by any lock. But when the mempolicy is modified by another
> process, a race condition appears.
> Consider something like the following:
>
>   pidfd_set_mempolicy           target task stack
>                                 alloc_pages
>                                   mpol = get_task_policy;
>   task_lock(task);
>   old = task->mempolicy;
>   task->mempolicy = new;
>   task_unlock(task);
>   mpol_put(old);
>                                 page = __alloc_pages(mpol);
> Here the old mempolicy can be released while the target task is still
> using it. Suggestions on how to handle this case would be welcome.

Yes, this will require some refactoring and one potential way is to make
mpol ref counting unconditional. The conditional ref. counting has
already caused issues in the past and the code is rather hard to follow
anyway. I am not really sure this optimization is worth it.
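
To illustrate the unconditional ref counting idea, here is a minimal
userspace sketch, not kernel code. It assumes that every reader takes a
reference before using the policy and that the pointer load and the
increment happen under the same lock the writer holds for the swap. All
names (struct policy, task_policy_get, task_policy_set, demo_interleaving)
are hypothetical stand-ins, and a pthread mutex plays the role of
task_lock().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mempolicy. */
struct policy {
    atomic_int refcnt;
    int mode;
};

/* Hypothetical stand-in for task_struct. */
struct task {
    pthread_mutex_t lock;   /* plays the role of task_lock() */
    struct policy *policy;  /* plays the role of task->mempolicy */
};

static atomic_int policies_freed;  /* instrumentation for the demo only */

static struct policy *policy_new(int mode)
{
    struct policy *p = malloc(sizeof(*p));
    if (p) {
        atomic_init(&p->refcnt, 1);  /* reference owned by the creator */
        p->mode = mode;
    }
    return p;
}

static void policy_put(struct policy *p)
{
    assert(atomic_load(&p->refcnt) > 0);
    if (atomic_fetch_sub(&p->refcnt, 1) == 1) {  /* last reference dropped */
        free(p);
        atomic_fetch_add(&policies_freed, 1);
    }
}

/* Reader: always take a reference, never borrow the bare pointer. */
static struct policy *task_policy_get(struct task *t)
{
    pthread_mutex_lock(&t->lock);
    struct policy *p = t->policy;
    atomic_fetch_add(&p->refcnt, 1);
    pthread_mutex_unlock(&t->lock);
    return p;
}

/* Writer: install the new policy, then drop the task's reference to the old. */
static void task_policy_set(struct task *t, struct policy *new_policy)
{
    pthread_mutex_lock(&t->lock);
    struct policy *old = t->policy;
    t->policy = new_policy;
    pthread_mutex_unlock(&t->lock);
    policy_put(old);
}

/* Replays the interleaving from the thread in a single thread of control.
 * Returns the mode the "allocation" saw, or -1 on failure. */
static int demo_interleaving(void)
{
    struct task t;
    struct policy *first = policy_new(1), *second = policy_new(2);
    if (!first || !second) {
        free(first);
        free(second);
        return -1;
    }
    pthread_mutex_init(&t.lock, NULL);
    t.policy = first;
    int freed_before = atomic_load(&policies_freed);

    struct policy *mpol = task_policy_get(&t);  /* target: mpol = get_task_policy */
    task_policy_set(&t, second);                /* remote: swap, then mpol_put(old) */
    int alive = atomic_load(&policies_freed) == freed_before;
    int mode = mpol->mode;                      /* target: __alloc_pages(mpol) */
    policy_put(mpol);                           /* old policy is freed only here */
    policy_put(t.policy);                       /* teardown of the new policy */
    pthread_mutex_destroy(&t.lock);
    return alive ? mode : -1;
}
```

The extra reference held by the reader is what keeps the old policy
alive across the remote mpol_put(). The cost is an atomic increment and
decrement on every lookup, which is the overhead the conditional scheme
tries to avoid.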

Another option would be to block the pidfd side of things on a completion
that is signalled from the task_work context, but I would rather
explore the ref counting approach first and go with hacks like this only
if it proves to be too expensive.
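
For comparison, the blocking alternative can be sketched in userspace
as follows. This is only an analogy under stated assumptions: a pthread
condition variable stands in for a kernel completion, a freshly created
thread stands in for the target task running its task_work callback, and
struct policy_request and set_policy_and_wait are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace analogue of a completion: a flag guarded by a mutex/condvar. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;
};

static void completion_init(struct completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

static void completion_destroy(struct completion *c)
{
    pthread_cond_destroy(&c->cond);
    pthread_mutex_destroy(&c->lock);
}

static void complete(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)  /* loop guards against spurious wakeups */
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Hypothetical request handed from the caller to the target. */
struct policy_request {
    int new_mode;
    int *target_mode;          /* the target's private policy */
    struct completion applied;
};

/* Runs in the target's own context, like a task_work callback would. */
static void *target_thread(void *arg)
{
    struct policy_request *req = arg;
    *req->target_mode = req->new_mode;  /* only the owner writes its policy */
    complete(&req->applied);            /* wake the blocked caller */
    return NULL;
}

/* Caller side: queue the update, then block until the target applied it.
 * Returns the mode observed after the wait, or -1 on failure. */
static int set_policy_and_wait(int new_mode)
{
    int target_mode = 0;
    struct policy_request req = { .new_mode = new_mode,
                                  .target_mode = &target_mode };
    pthread_t target;

    completion_init(&req.applied);
    if (pthread_create(&target, NULL, target_thread, &req) != 0) {
        completion_destroy(&req.applied);
        return -1;
    }
    wait_for_completion(&req.applied);
    int seen = target_mode;  /* ordered after the write by the mutex */
    int rc = pthread_join(target, NULL);
    assert(rc == 0);
    (void)rc;
    completion_destroy(&req.applied);
    return seen;
}
```

The caller cannot return before the target has run the callback, so it
never observes a half-applied update, but it also stalls for as long as
the target takes to reach that point.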
--
Michal Hocko
SUSE Labs