[RFC] man/move_phys_pages: migrate pages based on physical address

From: Gregory Price
Date: Tue Sep 19 2023 - 19:10:48 EST


Draft man page for the move_phys_pages syscall proposed in this RFC:

https://lore.kernel.org/all/20230907075453.350554-1-gregory.price@xxxxxxxxxxxx/

Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>
---
man2/move_phys_pages.2 | 180 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 180 insertions(+)
create mode 100644 man2/move_phys_pages.2

diff --git a/man2/move_phys_pages.2 b/man2/move_phys_pages.2
new file mode 100644
index 000000000..4f4b68915
--- /dev/null
+++ b/man2/move_phys_pages.2
@@ -0,0 +1,180 @@
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft-2-para
+.\"
+.\" This manpage is Copyright (C) 2006 Silicon Graphics, Inc.
+.\" Christoph Lameter
+.\" This manpage is Copyright (C) 2023 MemVerge, Inc.
+.\" Gregory Price
+.\"
+.\"
+.TH move_phys_pages 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+move_phys_pages \- move individual physically-addressed pages to another node
+.SH LIBRARY
+NUMA (Non-Uniform Memory Access) policy library
+.RI ( libnuma ", " \-lnuma )
+.SH SYNOPSIS
+.nf
+.B #include <numaif.h>
+.PP
+.BI "long move_phys_pages(unsigned long " count ", \
+uint64_t *" pages [. count ],
+.BI " const int " nodes [. count "], int " status [. count "], \
+int " flags );
+.fi
+.SH DESCRIPTION
+.BR move_phys_pages ()
+moves the specified
+.I physical pages
+to the memory nodes specified by
+.IR nodes .
+The result of the move is reflected in
+.IR status .
+The
+.I flags
+indicate constraints on the pages to be moved.
+.PP
+Calling this interface requires the
+.BR CAP_SYS_ADMIN " capability."
+.PP
+.I count
+is the number of pages to move.
+It defines the size of the three arrays
+.IR pages ,
+.IR nodes ,
+and
+.IR status .
+.PP
+.I pages
+is an array of physical addresses of the pages that should be moved.
+Each address should be aligned to a page boundary.
+.PP
+.I nodes
+is an array of integers that specify the desired location for each page.
+Each element in the array is a node number.
+.I nodes
+can also be NULL, in which case
+.BR move_phys_pages ()
+does not move any pages but instead will return the node
+where each page currently resides, in the
+.I status
+array.
+Obtaining the status of each page may be necessary to determine
+pages that need to be moved.
+.PP
+.I status
+is an array of integers that return the status of each page.
+The array contains valid values only if
+.BR move_phys_pages ()
+did not return an error.
+Preinitializing the array to a value
+that is neither a valid NUMA node number nor a valid status error code
+can help identify which pages were migrated if a partial
+failure occurs.
+.PP
+.I flags
+specifies what types of pages to move.
+.B MPOL_MF_MOVE
+means that only pages that are in exclusive use by a process
+are to be moved.
+.B MPOL_MF_MOVE_ALL
+means that pages shared between multiple processes can also be moved.
+.SS Page states in the status array
+The following values can be returned in each element of the
+.I status
+array.
+.TP
+.B 0..MAX_NUMNODES
+Identifies the node on which the page resides.
+.TP
+.B \-EACCES
+The target node for the page is not in the intersection of the sets of
+allowed nodes of all tasks mapping the address.
+At least one task mapping the address does not allow memory on the target node.
+.TP
+.B \-EBUSY
+The page is currently busy and cannot be moved.
+Try again later.
+This occurs if a page is undergoing I/O or another kernel subsystem
+is holding a reference to the page.
+.TP
+.B \-EFAULT
+This is a zero page, the memory area is not mapped by the process,
+or the memory is not migratable.
+.TP
+.B \-EIO
+Unable to write back a page.
+The page has to be written back
+in order to move it since the page is dirty and the filesystem
+does not provide a migration function that would allow the move
+of dirty pages.
+.TP
+.B \-EINVAL
+A dirty page cannot be moved.
+The filesystem does not
+provide a migration function and has no ability to write back pages.
+
+.TP
+.B \-ENOENT
+The physical page is not online or the page is not present in any VMA.
+.TP
+.B \-ENOMEM
+Unable to allocate memory on target node.
+.SH RETURN VALUE
+On success,
+.BR move_phys_pages ()
+returns zero.
+.\" FIXME . Is the following quite true: does the wrapper in numactl
+.\" do the right thing?
+On error, it returns \-1, and sets
+.I errno
+to indicate the error.
+If a positive value is returned, it is the number of
+nonmigrated pages.
+.SH ERRORS
+.TP
+.B Positive value
+The number of nonmigrated pages if they were the result of nonfatal
+reasons.
+.TP
+.B EFAULT
+Parameter array could not be accessed.
+.TP
+.B EINVAL
+The flag value was not 0 (Linux 6.6), or an attempt was made to
+migrate pages of a kernel thread.
+.TP
+.B ENODEV
+One of the target nodes is not online.
+.TP
+.B EPERM
+The caller has insufficient privileges
+.RB ( CAP_SYS_ADMIN ).
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux X.Y.Z
+.SH NOTES
+For information on library support, see
+.BR numa (7).
+.PP
+Use of this function may result in pages whose location
+(node) violates the memory policy established for the
+specified addresses (see
+.BR mbind (2))
+and/or the specified process (see
+.BR set_mempolicy (2)).
+That is, memory policy does not constrain the destination
+nodes used by
+.BR move_phys_pages ().
+.PP
+The
+.I <numaif.h>
+header is not included with glibc, but requires installing
+.I libnuma\-devel
+or a similar package.
+.SH SEE ALSO
+.BR mbind (2),
+.BR numa (3),
+.BR numa_maps (5),
+.BR cpuset (7),
+.BR numa (7),
+.BR migratepages (8),
+.BR numastat (8)
--
2.34.1