Re: [PATCH 26/34] mm: implement new mprotect_key() system call

From: Michael Kerrisk (man-pages)
Date: Wed Dec 09 2015 - 06:09:10 EST


Hi Dave,

On 7 December 2015 at 17:44, Dave Hansen <dave@xxxxxxxx> wrote:
> On 12/04/2015 10:50 PM, Michael Kerrisk (man-pages) wrote:
>> On 12/04/2015 02:15 AM, Dave Hansen wrote:
>>> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>>>
>>> mprotect_key() is just like mprotect, except it also takes a
>>> protection key as an argument. On systems that do not support
>>> protection keys, it still works, but requires that key=0.
>>> Otherwise it does exactly what mprotect does.
>>
>> Is there a man page for this API?
>
> Yep.

Thanks!

> Patch to man-pages source is attached.

Better as inline, for review purposes.

> I actually broke it up in
> to a few separate pages.

Seems the right approach to me.

> I was planning on submitting these after the
> patches themselves go upstream.

Not a good idea. Reading and creating man pages has helped
me (and others) find a heap of design and implementation
bugs in APIs. Best that that happens before things hit
upstream.

Would you be willing to revise your man page (and possibly
your kernel patches) in the light of my comments below?
It would be better to do this sooner than later, since
I suspect I'll have a few more API comments as I review
future drafts of the page.

> commit ebb12643876810931ed23992f92b7c77c2c36883
> Author: Dave Hansen <dave.hansen@xxxxxxxxx>
> Date: Mon Dec 7 08:42:57 2015 -0800
>
> pkeys
>
> diff --git a/man2/mprotect.2 b/man2/mprotect.2
> index ae305f6..a3c1e62 100644
> --- a/man2/mprotect.2
> +++ b/man2/mprotect.2
> @@ -38,16 +38,19 @@
> .\"
> .TH MPROTECT 2 2015-07-23 "Linux" "Linux Programmer's Manual"
> .SH NAME
> -mprotect \- set protection on a region of memory
> +mprotect, mprotect_key \- set protection on a region of memory

Elsewhere in your patch series (in a mail with subject
"mm: implement new mprotect_key() system call") I see:

+SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
+ unsigned long, prot, int, pkey)
+{
+ if (!arch_validate_pkey(pkey))
+ return -EINVAL;
+
+ return do_mprotect_pkey(start, len, prot, pkey);
+}

And lower down in this patch series, I see "mprotect_pkey"!

What is the name of this system call supposed to be?

For what it's worth, I think "mprotect_pkey()" is the best
name (and secretly, you seem to as well, since we have at
the bottom of it all the internal function "do_mprotect_pkey()".
It signifies that this is a modified version of the base
functionality provided my mprotect(), and "pkey" is
consistent with the remainder of the APIs.

But, whatever name you do choose, please fix it in all
of your commit messages, otherwise reading the git
history gets very confusing.

> .SH SYNOPSIS
> .nf
> .B #include <sys/mman.h>
> .sp
> .BI "int mprotect(void *" addr ", size_t " len ", int " prot );
> +.BI "int mprotect_key(void *" addr ", size_t " len ", int " prot , " int " key);
> .fi
> .SH DESCRIPTION
> .BR mprotect ()
> -changes protection for the calling process's memory page(s)
> +and
> +.BR mprotect_key ()
> +change protection for the calling process's memory page(s)
> containing any part of the address range in the
> interval [\fIaddr\fP,\ \fIaddr\fP+\fIlen\fP\-1].
> .I addr
> @@ -74,10 +77,17 @@ The memory can be modified.
> .TP
> .B PROT_EXEC
> The memory can be executed.
> +.PP
> +.I key
> +is the protection or storage key to assign to the memory.

Why "protection or storage key" here? This phrasing seems a
little ambiguous to me, given that we also have a 'prot'
argument. I think it would be clearer just to say
"protection key". But maybe I'm missing something.

> +A key must be allocated with pkey_alloc () before it is

Please format syscall cross references as

.BR pkey_alloc (2)

> +passed to pkey_mprotect ().
> .SH RETURN VALUE
> On success,
> .BR mprotect ()
> -returns zero.
> +and
> +.BR mprotect_key ()
> +return zero.
> On error, \-1 is returned, and
> .I errno
> is set appropriately.

Are there no errors specific to mprotect_key()? Is there
an error if pkey is invalid? I see now that there is. That
EINVAL error needs documenting.

> diff --git a/man2/pkey_alloc.2 b/man2/pkey_alloc.2
> new file mode 100644
> index 0000000..980ce3e
> --- /dev/null
> +++ b/man2/pkey_alloc.2
> @@ -0,0 +1,72 @@
> +.\" Copyright (C) 2007 Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> +.\" and Copyright (C) 1995 Michael Shields <shields@xxxxxxxxxx>.

Michaels have many talents, but documenting kernel APIs
20 years ahead of their creation is not one of them, I believe.
Better replace this with the actual copyright holder and author
name.

> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and author of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\" Modified 2015-12-04 by Dave Hansen <dave@xxxxxxxx>

This info should be in the copyright notice above.

> +.\"
> +.\"
> +.TH PKEY_ALLOC 2 2015-12-04 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +pkey_alloc, pkey_free \- allocate or free a protection key
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mman.h>
> +.sp
> +.BI "int pkey_alloc(unsigned long" flags ", unsigned long " init_val);

If I understand correctly, 'init_val' is a mask of access rights
as per pkey_set(). If so, let's name the argument in this man page
"access_rights" also (or perhaps "init_access_rights" if you must,
but I think the shorter name is better. This helps the reader
to understand that we're talking about the same thing. It would be
good also to make the same change in the kernel code.

> +.BI "int pkey_free(int " pkey);
> +.fi
> +.SH DESCRIPTION
> +.BR pkey_alloc ()
> +and
> +.BR pkey_free ()
> +allow or disallow the calling process's to use the given

s/process's/process/
But should this actually be "thread"?

> +protection key for all protection-key-related operations.
> +
> +.PP
> +.I flags
> +is may contain zero or more disable operation:

s/is may/may/

> +.B PKEY_DISABLE_ACCESS
> +and/or
> +.B PKEY_DISABLE_WRITE

For the above two, please format as
.TP
.B PKEY_DISABLE_ACCESS
<Explanation of this flag>
.TP
.B PKEY_DISABLE_WRITE
<Explanation of this flag>

> +.SH RETURN VALUE
> +On success,
> +.BR pkey_alloc ()
> +and
> +.BR pkey_free ()
> +return zero.

The description of the success return for pkey_alloc() can't
be right. Doesn't it return a protection key?

> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +An invalid protection key, flag, or init_val was specified.

Better to write that last line as:

[[
.IR pkey ,
.IR flags ,
or
.I init_val [Or: access_rights]
is invalid.
]]

> +.TP
> +.B ENOSPC
> +All protection keys available for the current process have
> +been allocated.

So it seems to me that this page needs a discussion of the
limit that is involved here.

> +.SH SEE ALSO
> +.BR mprotect_pkey (2),
> +.BR pkey_get (2),
> +.BR pkey_set (2),

Remove trailing comma.

> diff --git a/man2/pkey_get.2 b/man2/pkey_get.2
> new file mode 100644
> index 0000000..4cfdea9
> --- /dev/null
> +++ b/man2/pkey_get.2
> @@ -0,0 +1,76 @@
> +.\" Copyright (C) 2007 Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> +.\" and Copyright (C) 1995 Michael Shields <shields@xxxxxxxxxx>.

Again, the copyright notice needs fixing.

> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and author of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\" Modified 2015-12-04 by Dave Hansen <dave@xxxxxxxx>
> +.\"
> +.\"
> +.TH PKEY_GET 2 2015-12-04 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +pkey_get, pkey_set \- manage protection key access permissions
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mman.h>
> +.sp
> +.BI "int pkey_get(int " pkey);
> +.BI "int pkey_set(int " pkey ", unsigned long " access_rights);
> +.fi
> +.SH DESCRIPTION
> +.BR pkey_get ()
> +and
> +.BR pkey_set ()
> +query or set the current set of rights for the calling
> +task for the given protection key.

Change "task" to "thread".

> +When rights for a key are disabled, any future access
> +to any memory region with that key set will generate
> +a SIGSEGV. The rights are local to the calling thread and
> +do not affect any other threads.

I think the last sentence could be simpler ("Access rights are
private to each thread."), or even removed, since you already
say above that these operations are per task (should be "per thread").

> +.PP
> +Upon entering any signal handler, the process is given a
> +default set of protection key rights which are separate from
> +the main thread's. Any calls to pkey_set () in a signal

s/signal/signal hander/

Format the reference to this system call as:

.BR pkey_set ()

> +will not persist upon a return to the calling process.

So, the preceding paragraph leaves me confused. And I'm wondering
if that confusion reflects some weirdness in the API design. But
I can't tell until I understand it better. These are my problems:

* You throw "process" and "thread" together in the explanation.
Is this simply a mistake? If it is not, the distinction
you are trying to draw by using the two different terms
is not made clear in the text.

* Your text ("separate from the main thread's") makes it
sound as though a signal handler is somehow invoked in a
different thread, which makes no sense. I suspect you
want to say something like this:

[[
When a signal handler is invoked, the thread is temporarily
given a default set of protection key rights. The thread's
protection key rights are restored when the signal handler
returns.
]]

Is that close to the truth?

* Change "a return to the calling process" to "when the
signal handler returns". Signal handlers are not "called"
by the program.

* There needs to be some explanation in this page of *why*
this special behavior occurs when signal handlers are
invoked.

And I have a question (and the answer probably should
be documented in the manual page). What happens when
one signal handler interrupts the execution of another?
Do pkey_set() calls in the first handler persist into the
second handler? I presume not, but it would be good to
be a little more explicit about this.

> +.PP
> +.I access_rights
> +is may contain zero or more disable operation:

s/is may/may/

> +.B PKEY_DISABLE_ACCESS
> +and/or
> +.B PKEY_DISABLE_WRITE

For the above two, please format as

[[
.TP
.B PKEY_DISABLE_ACCESS
<Explanation of this flag>
.TP
.B PKEY_DISABLE_WRITE
<Explanation of this flag>
]]

In various commit messages you use two alternative names:
PKEY_DENY_ACCESS and PKEY_DENY_WRITE. I assume bit rot here
as the the API has evolved. But please fix all of those
commit messages, so that the git history is more sensible.

> +.SH RETURN VALUE
> +On success,
> +.BR pkey_get ()
> +and
> +.BR pkey_set ()
> +return zero.

The success return value of pkey_get() is not correct.
Doesn't it return an access rights mask?

> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +An invalid protection key or access_rights was specified.

[[
.I access_rights
or
.I pkey
is invalid.
]]

> +.SH SEE ALSO
> +.BR mprotect_pkey (2),
> +.BR pkey_alloc (2),
> +.BR pkey_free (2),

Remove trailing comma.

So at the end of reading these pages, and delving
a little through the commit messages, I still don't
feel convinced that I understand what these APIs are
about. There's several things that I think still need
to be added to these pages:

* A general overview of why this functionality is useful.
* A note on which architectures support/will support
this functionality.
* Explanation of what a protection domain is.
* Explanation of how a process (thread?) changes its
protection domain.
* Explanation of the relationship between page permission
bits (PROT_READ/PROT_WRITE/PROTE_EXEC) and
PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE.
It's still not clear to me. Do the PKEY_* bits
override the PROT_* bits. Or, something else?

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/