On 11/8/24 18:31, Paolo Bonzini wrote:
On 11/7/24 16:10, Matthew Wilcox wrote:
On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
The folio allocation path from guest_memfd typically looks like this...
kvm_gmem_get_folio
  filemap_grab_folio
    __filemap_get_folio
      filemap_alloc_folio
        __folio_alloc_node_noprof
          -> goes to the buddy allocator
Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
It only takes that path if cpuset_do_page_mem_spread() is true. Is the
real problem that you're trying to solve that cpusets are being used
incorrectly?
If it's false, it's not very different: it goes to alloc_pages_noprof().
Then it respects the process's policy, but the policy is not
customizable without mucking with state that is global to the process.
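To illustrate the point about global state, here is a minimal userspace sketch of what steering an allocation through the task policy looks like today. The helper name alloc_with_task_policy() is hypothetical, and the sketch uses an anonymous mapping as a stand-in for the allocation of interest; it is not code from the series. It binds the calling thread to one node with set_mempolicy(), touches the memory, then resets the policy to MPOL_DEFAULT, which also discards whatever policy the thread had before.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/mempolicy.h>   /* MPOL_BIND, MPOL_DEFAULT */
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate and touch `len` bytes while the calling
 * thread is temporarily bound to host NUMA node `node`.
 * Returns the mapping, or NULL if the policy could not be set (for example
 * in a sandbox that forbids set_mempolicy) or the mapping failed.
 */
void *alloc_with_task_policy(size_t len, unsigned int node)
{
    unsigned long mask;
    void *buf;

    assert(len > 0);

    /* This sketch uses a single-word nodemask, so only low node ids fit. */
    if (node >= 8 * sizeof(mask) - 1)
        return NULL;
    mask = 1UL << node;

    /* Changes the policy for every later allocation made by this thread. */
    if (syscall(SYS_set_mempolicy, (long)MPOL_BIND, &mask,
                8UL * sizeof(mask)) != 0)
        return NULL;

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf != MAP_FAILED)
        memset(buf, 0, len);   /* fault the pages in under the policy */

    /* Reset to the default policy; any earlier policy is lost, not restored. */
    syscall(SYS_set_mempolicy, (long)MPOL_DEFAULT, NULL, 0UL);

    return buf == MAP_FAILED ? NULL : buf;
}
```

A VMM with several guest NUMA nodes would have to repeat this set/allocate/reset dance around every allocation, which is the kind of mucking with shared state that a per-file policy is meant to avoid.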
Taking a step back: the problem is that a VM can be configured to have
multiple guest-side NUMA nodes, each of which will pick memory from the
right NUMA node in the host. Without a per-file operation it's not
possible to do this on guest_memfd. The discussion was whether to use
ioctl() or a new system call. The discussion ended with the idea of
posting a *proposal* asking for *comments* as to whether the system call
would be useful in general beyond KVM.
Commenting on the system call itself, I am not sure I like the
file_operations entry, though I understand that it's the simplest way to
implement this in an RFC series. It's a bit surprising that fbind() is
a total no-op for everything except KVM's guest_memfd.
Maybe whatever you pass to fbind() could be stored in the struct file *,
and used as the default when creating VMAs; as if every mmap() was
followed by an mbind(), except that it also does the right thing with
MAP_POPULATE for example. Or maybe that's a horrible idea?
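For reference, the "mmap() followed by an mbind()" pattern that a stored per-file default would emulate can be sketched in userspace as below. The helper name mmap_then_mbind() is hypothetical and nothing here calls the proposed fbind(); it only shows the existing two-step sequence, assuming a single-word nodemask.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/mempolicy.h>   /* MPOL_BIND */
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of `fd` (or anonymous memory when
 * map_flags contains MAP_ANONYMOUS and fd is -1), then bind the new range
 * to host NUMA node `node` with mbind().
 * Returns the mapping, or NULL on any failure.
 */
void *mmap_then_mbind(int fd, size_t len, int map_flags, unsigned int node)
{
    unsigned long mask;
    void *addr;

    assert(len > 0);

    /* Single-word nodemask: only low node ids are representable here. */
    if (node >= 8 * sizeof(mask) - 1)
        return NULL;
    mask = 1UL << node;

    /*
     * If map_flags contained MAP_POPULATE, pages would be faulted in by
     * this call, before the mbind() below has set any policy.
     */
    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flags, fd, 0);
    if (addr == MAP_FAILED)
        return NULL;

    /* Policy applies to pages allocated in this range from now on. */
    if (syscall(SYS_mbind, addr, (unsigned long)len, (unsigned long)MPOL_BIND,
                &mask, 8UL * sizeof(mask), 0UL) != 0) {
        munmap(addr, len);
        return NULL;
    }
    return addr;
}
```

The ordering problem in the comment is why a default kept in the struct file would differ from this sequence: the policy would already be known when mmap() populates the range.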
The mbind() manpage has this:
The specified policy will be ignored for any MAP_SHARED
mappings in the specified memory range. Rather the pages will be allocated
according to the memory policy of the thread that caused the page to be
allocated. Again, this may not be the thread that called mbind().
So that seems like we're not very keen on having one user of a file set a
policy that would affect other users of the file?
Now the next paragraph of the manpage says that shmem is different, and
guest_memfd is more like shmem than a regular file.
My conclusion from that is that fbind() might be too broad and we don't want
this for actual filesystem-backed files? And if it's limited to guest_memfd,
it shouldn't be an fbind()?