Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information

From: Prakash Sangappa
Date: Fri Sep 14 2018 - 14:08:22 EST




On 9/14/18 5:49 AM, Jann Horn wrote:
On Fri, Sep 14, 2018 at 8:21 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
On Fri 14-09-18 03:33:28, Jann Horn wrote:
On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
<prakash.sangappa@xxxxxxxxxx> wrote:
On 05/09/2018 04:31 PM, Dave Hansen wrote:
On 05/07/2018 06:16 PM, prakash.sangappa wrote:
It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
different with respect to seeking. Output will still be text and
the format will be the same.

I want to get feedback on this approach.
I think it would be really great if you can write down a list of the
things you actually want to accomplish. Dare I say: you need a
requirements list.

The numa_vamaps approach continues down the path of an ever-growing list
of highly-specialized /proc/<pid> files. I don't think that is
sustainable, even if it has been our trajectory for many years.

Pagemap wasn't exactly a shining example of us getting new ABIs right,
but it sounds like something along those lines is what we need.
Just sent out a V2 patch. This patch simplifies the file content. It
only provides VA range to numa node id information.

The requirement is basically observability for performance analysis.

- Need to be able to determine VA range to numa node id information.
Which also gives an idea of which range has memory allocated.

- The proc file /proc/<pid>/numa_vamaps is in text so it is easy to
directly view.

The V2 patch supports seeking to a particular process VA from where
the application could read the VA to numa node id information.
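As a rough illustration of the seek-then-read usage described above, here is a minimal sketch. It assumes the file offset is interpreted as a process virtual address, as the V2 description suggests; the helper name read_from_va is hypothetical, the output format is not specified here so the text is returned unparsed, and /proc/<pid>/numa_vamaps is a proposed interface that stock kernels do not provide.

```python
import os

def read_from_va(path, va, max_bytes=4096):
    """Seek to offset `va` in `path` and return up to `max_bytes` as text.

    Assumption: for the proposed numa_vamaps file, the offset is a
    virtual address. On a regular file it is a plain byte offset.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, va, os.SEEK_SET)   # position at the requested address
        data = os.read(fd, max_bytes)   # one read; may return fewer bytes
    finally:
        os.close(fd)
    return data.decode("ascii", errors="replace")
```

On a kernel carrying the patch, a call might look like read_from_va("/proc/self/numa_vamaps", 0x7f0000000000), where the address is only an example value.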

Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
/proc file, as was indicated by Michal Hocko.
procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.
Out of my curiosity, what is the semantic difference? At least
kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?
No, that's fine. REALCREDS basically means "look at the caller's real
UID for the access check", while FSCREDS means "look at the caller's
filesystem UID". The ptrace access check has historically been using
the real UID, which is sorta weird, but normally works fine. Given
that this is documented, I didn't see any reason to change it for most
things that do ptrace access checks, even if the EUID would IMO be
more appropriate. But things that capture caller credentials at points
like open() really shouldn't look at the real UID; instead, they
should use the filesystem UID (which in practice is basically the same
as the EUID).

So in short, it depends on the interface you're coming through: Direct
syscalls use REALCREDS, things that go through the VFS layer use
FSCREDS.
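To make the distinction between the real UID and the filesystem UID concrete, the following sketch reads both from the "Uid:" line of /proc/<pid>/status, which lists the real, effective, saved-set, and filesystem UIDs in that order. The helper name parse_uids is hypothetical. The sketch only shows where the two identities can be observed from userspace; it does not reproduce the kernel's ptrace access check.

```python
import os

def parse_uids(status_text):
    """Return (real, effective, saved, filesystem) UIDs from status text."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            real, effective, saved, fs = (int(x) for x in line.split()[1:5])
            return real, effective, saved, fs
    raise ValueError("no Uid: line found")

# On Linux, show the calling process's own credentials.
if os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        real, effective, saved, fs = parse_uids(f.read())
    # REALCREDS checks look at `real`; FSCREDS checks look at `fs`.
    print(f"real={real} effective={effective} saved={saved} fs={fs}")
```

For an ordinary process all four values are the same. They differ for a setuid binary, where the real UID stays that of the invoking user while the effective and filesystem UIDs become the file owner's.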

So in this case can the REALCREDS check be done in the read() system call
when reading the /proc file instead of the open call?