Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
From: 黄杰
Date: Sun Oct 16 2022 - 23:35:32 EST
Mike Kravetz <mike.kravetz@xxxxxxxxxx> 于2022年10月15日周六 00:56写道:
>
> On 10/12/22 12:45, Hugh Dickins wrote:
> > On Wed, 12 Oct 2022, Albert Huang wrote:
> >
> > > From: "huangjie.albert" <huangjie.albert@xxxxxxxxxxxxx>
> > >
> > > implement these two functions so that we can set the mempolicy to
> > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > all processes sharing this huge page file is consistent.
> > >
> > > In some scenarios where huge pages are shared:
> > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > shared memory with the vm, in this case. If the page fault is triggered
> > > by virtiofsd, the allocated memory may go to node1 which depends on
> > > virtiofsd.
> > >
> > > Signed-off-by: huangjie.albert <huangjie.albert@xxxxxxxxxxxxx>
>
> Thanks for the patch Albert, and thank you Hugh for the comments!
>
> > Aha! Congratulations for noticing, after all this time. hugetlbfs
> > contains various little pieces of code that pretend to be supporting
> > shared NUMA mempolicy, but in fact there was nothing connecting it up.
>
> I actually had to look this up to verify it was not supported. However, the
> documentation is fairly clear.
> From admin-guide/mm/numa_memory_policy.rst.
>
> "As of 2.6.22, only shared memory segments, created by shmget() or
> mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
> policy support was added to Linux, the associated data structures were
> added to hugetlbfs shmem segments. At the time, hugetlbfs did not
> support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
> shmem segments were never "hooked up" to the shared policy support.
> Although hugetlbfs segments now support lazy allocation, their support
> for shared policy has not been completed."
>
> It is somewhat embarrassing that this has been known for so long and
> nothing has changed.
>
> > It will be for Mike to decide, but personally I oppose adding
> > shared NUMA mempolicy support to hugetlbfs, after eighteen years.
> >
> > The thing is, it will change the behaviour of NUMA on hugetlbfs:
> > in ways that would have been sensible way back then, yes; but surely
> > those who have invested in NUMA and hugetlbfs have developed other
> > ways of administering it successfully, without shared NUMA mempolicy.
> >
> > At the least, I would expect some tests to break (I could easily be
> > wrong), and there's a chance that some app or tool would break too.
> >
> > I have carried the reverse of Albert's patch for a long time, stripping
> > out the pretence of shared NUMA mempolicy support from hugetlbfs: I
> > wanted that, so that I could work on modifying the tmpfs implementation,
> > without having to worry about other users.
> >
> > Mike, if you would prefer to see my patch stripping out the pretence,
> > let us know: it has never been a priority to send in, but I can update
> > it to 6.1-rc1 if you'd like to see it. (Once upon a time, it removed
> > all need for struct hugetlbfs_inode_info, but nowadays that's still
> > required for the memfd seals.)
> >
> > Whether Albert's patch is complete and correct, I haven't begun to think
> > about: I am not saying it isn't, but shared NUMA mempolicy adds another
> > dimension of complexity, and need for support, that I think hugetlbfs
> > would be better off continuing to survive without.
>
> To be honest, I have not looked into the complexities of shared NUMA
> mempolicy and exactly what is required for it's support. With my limited
> knowledge, it appears that this patch adds some type of support for shared
> policy, but it may not provide all support mentioned in the documentation.
>
> At the very least, this patch should also update documentation to state
> what type of support is provided.
>
> Albert, can you look into what would be required for full support? I can take
> a look as well but have some other higher priority tasks to work first.
>
Lucky to do this job, let me think about it.
> TBH, I like Hugh's idea of removing the 'pretence of shared policy support'.
> We are currently wasting memory carrying around extra unused fields in
> hugetlbfs_inode_info. :(
> --
> Mike Kravetz