Re: [RFC] proc: Add a new isolated /proc/pid/mempolicy type.

From: Michal Hocko
Date: Fri Sep 30 2022 - 04:54:58 EST


On Wed 28-09-22 11:09:47, Abel Wu wrote:
> On 9/27/22 9:58 PM, Michal Hocko wrote:
> > On Tue 27-09-22 21:07:02, Abel Wu wrote:
> > > On 9/27/22 6:49 PM, Michal Hocko wrote:
> > > > On Tue 27-09-22 11:20:54, Abel Wu wrote:
> > > > [...]
> > > > > > > > Btw, in order to add per-thread-group mempolicy, is it possible to add
> > > > > > > mempolicy in mm_struct?
> > > > > >
> > > > > > I dunno. This would make the mempolicy interface even more confusing.
> > > > > > Per mm behavior makes a lot of sense but we already do have per-thread
> > > > > > semantic so I would stick to it rather than introducing a new semantic.
> > > > > >
> > > > > > Why is this really important?
> > > > >
> > > > > We want soft control on memory footprint of background jobs by applying
> > > > > NUMA preferences when necessary, so the impact on different NUMA nodes
> > > > > can be managed to some extent. These NUMA preferences are given by the
> > > > > control panel, and it might not be suitable to overwrite the tasks with
> > > > > specific memory policies already (or vice versa).
> > > >
> > > > Maybe the answer is somehow implicit but I do not really see any
> > > > argument for the per thread-group semantic here. In other words why a
> > > > new interface has to cover more than the local [sg]et_mempolicy?
> > > > I can see convenience as one potential argument. Also if there is a
> > > > requirement to change the policy in atomic way then this would require a
> > > > single syscall.
> > >
> > > Convenience is not our major concern. A well-tuned workload can have
> > > specific memory policies for different tasks/vmas in one process, and
> > > this can be achieved by set_mempolicy()/mbind() respectively. While
> > > > other workloads are not, they don't care where the memory resides,
> > > so the impact they brought on the co-located workloads might vary in
> > > different NUMA nodes.
> > >
> > > The control panel, which has a full knowledge of workload profiling,
> > > may want to interfere the behavior of the non-mempolicied processes
> > > by giving them NUMA preferences, to better serve the co-located jobs.
> > >
> > > So in this scenario, a process's memory policy can be assigned by two
> > > objects dynamically:
> > >
> > > a) the process itself, through set_mempolicy()/mbind()
> > > b) the control panel, but API is not available right now
> > >
> > > Considering the two policies should not fight each other, it sounds
> > > reasonable to introduce a new syscall to assign memory policy to a
> > > process through struct mm_struct.
> >
> > So you want to allow restoring the original local policy if the external
> > one is disabled?
>
> Pretty much, but the internal policies are expected to have precedence
> over the external ones, since they are set for some reason to meet their
> specific requirements. The external ones are used only when there is no
> internal policy active.

What does this mean in practice exactly? Will pidfd_set_mempolicy fail
if there is a local policy in place? If not, how does the monitoring
know the effect of its call?

TBH I do not think this is a good idea at all. It seems like a very
confusing semantic to me. The external monitoring tool should be careful
to not go against implicit memory policies and query the state before
altering it. Or if this is required to be done atomically then add a flag
to the pidfd call.
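
To make the flag idea concrete, here is a rough userspace model, not kernel
code. Everything in it is made up for illustration: the flag name
MODEL_F_NOREPLACE, struct task_model and model_set_mempolicy() are
hypothetical, and -EEXIST is just one possible error choice. The only point
is that the check and the update happen in one call, so the caller learns
from the return value whether its policy took effect.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a task's memory policy; not a kernel structure. */
enum policy_mode { POL_DEFAULT = 0, POL_PREFERRED, POL_BIND, POL_INTERLEAVE };

struct task_model {
	enum policy_mode mode;   /* POL_DEFAULT means no local policy is set */
	unsigned long nodemask;
};

/* Made-up flag: refuse to replace an existing non-default policy. */
#define MODEL_F_NOREPLACE 0x1u

/*
 * Returns 0 on success, -EINVAL for unknown flags, and -EEXIST when
 * MODEL_F_NOREPLACE is set and the task already has a local policy.
 * A real implementation would do the check and the update under the
 * same lock; this single-threaded model has nothing to lock.
 */
static int model_set_mempolicy(struct task_model *t, enum policy_mode mode,
			       unsigned long nodemask, unsigned int flags)
{
	assert(t != NULL);

	if (flags & ~MODEL_F_NOREPLACE)
		return -EINVAL;
	if ((flags & MODEL_F_NOREPLACE) && t->mode != POL_DEFAULT)
		return -EEXIST;

	t->mode = mode;
	t->nodemask = nodemask;
	return 0;
}
```

With something like this the monitoring tool never has to guess: a zero
return means its policy is now in place, an error means the local policy
was left untouched.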

> > Anyway, pidfd_$FOO behavior should be semantically very similar to the
> > original $FOO. Moving from per-task to per-mm is a major shift in the
> > semantic. I can imagine to have a dedicated flag for the syscall to
> > enforce the policy to the full thread group. But having a different
> > semantic is both tricky and also constrained because per-thread binding
> > is then impossible.
>
> Agreed. What about a syscall that only applies per-mm? There are precedents
> like process_madvise(2).

Different mm operations have different scope. And some of them have
changed their scope over time (e.g. oom_score_adj). If you really need a
per-mm functionality then use a flag for the pidfd syscall.
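
Again only as an illustration, a small userspace model of scope selected by
a flag. MODEL_F_THREAD_GROUP, struct group_model and model_set_policy() are
all hypothetical names, and the plain int mode is a placeholder for a real
policy. The default stays per-task, as with set_mempolicy(2); the wider
scope is opt-in.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Made-up flag: apply the policy to every thread in the group. */
#define MODEL_F_THREAD_GROUP 0x1u

struct thread_model {
	int mode;                     /* placeholder for a per-thread policy */
};

struct group_model {
	struct thread_model *threads; /* all threads sharing one mm */
	size_t nr_threads;
};

/*
 * Without the flag only threads[target] changes, which keeps per-thread
 * binding possible. With the flag every thread in the group is updated.
 * Returns 0 on success, -EINVAL for unknown flags or a bad target index.
 */
static int model_set_policy(struct group_model *g, size_t target, int mode,
			    unsigned int flags)
{
	assert(g != NULL);

	if ((flags & ~MODEL_F_THREAD_GROUP) || target >= g->nr_threads)
		return -EINVAL;

	if (flags & MODEL_F_THREAD_GROUP) {
		for (size_t i = 0; i < g->nr_threads; i++)
			g->threads[i].mode = mode;
	} else {
		g->threads[target].mode = mode;
	}
	return 0;
}
```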

--
Michal Hocko
SUSE Labs