Re: Compat 32-bit syscall entry from 64-bit task!?

From: Linus Torvalds
Date: Wed Jan 18 2012 - 15:26:43 EST


Added Peter to the cc, since this is now about some x86-specific
things. Ingo was already cc'd earlier.

On Wed, Jan 18, 2012 at 11:31 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Using the high bits of 'eflags' might work. Hopefully nobody tests
> that. IOW, something like the attached might work. It just sets bit#32
> in eflags if the system call is a compat call.

So that description was bogus: it describes what my original patch
did, not the one I actually sent out (Peter - you can find it on lkml,
although the description below, or the fairly obvious attached strace
patch, is probably sufficient for you to understand what it does).

The one I sent out *unconditionally* sets one bit in the high bits of
the returned value of the eflags register from ptrace(), very much on
purpose. That way you can unambiguously see whether it's an old kernel
(bits clear) or a new kernel that supports the feature. On a new
kernel, bit #32 of eflags will be set for a native 64-bit system call,
and bit #33 will be set for a compat system call.
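
As a rough sketch of that encoding, a tracer could decode the two
marker bits like this. The enum and function names are made up for
illustration and are not from the kernel or strace sources; the "both
bits set" case is not described above and is only labeled here for
completeness.

```c
#include <stdint.h>

/* Hypothetical names for the states encoded in bits #32 and #33. */
enum syscall_mode {
	SYSCALL_MODE_UNKNOWN = 0,	/* old kernel: both marker bits clear */
	SYSCALL_MODE_NATIVE  = 1,	/* bit #32 set: native 64-bit system call */
	SYSCALL_MODE_COMPAT  = 2,	/* bit #33 set: compat system call */
	SYSCALL_MODE_INVALID = 3	/* both set: not covered by the description */
};

/* 'eflags' is the 64-bit value a tracer reads back via ptrace(). */
static enum syscall_mode eflags_syscall_mode(uint64_t eflags)
{
	return (enum syscall_mode)((eflags >> 32) & 3);
}
```

A result of SYSCALL_MODE_UNKNOWN is what tells the tracer it is
running on an old kernel and has to use some other heuristic.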

And some testing says that it works. In particular, I have a patch to
strace-4.6 that is able to correctly decode my mixed-case binary that
uses both the compat system call and the native system calls from
64-bit long mode. Also, it looks like gdb ignores the high bits of
eflags, since it "knows" that eflags is just a 32-bit register even in
64-bit mode, so the fact that we set some random bits in there doesn't
end up being noisy for at least one debugger.

HOWEVER. I'm not going to guarantee that this is the right approach.
It seems to work, and it clearly gives people real information, but
whether this is the best way to do things or not is open.

The reasons I picked 'eflags' were:

(a) it was easy from an implementation standpoint, since we already
have to handle reading of eflags specially in ptrace (we have to fake
out the resume bit)

(b) it "kind of" makes sense to make high bits be "system flags",
with low bits being "cpu flags", so it fits at least *some* kind of
conceptual model.

(c) the other sane places to put it (high bits of CS and/or ORIG_AX)
were being used and compared as 64-bit values at least by strace.
Whether eflags works for all users, I have no idea, but generally you
would never compare eflags for one particular value - you might check
individual bits in eflags, but hopefully setting a few new bits should
not be something that any legacy user would ever really notice.
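
To illustrate the point in (c), here is a small sketch of the two
styles of eflags use. The helper names are hypothetical; the bit
position of the zero flag (bit 6) is the architectural one. A per-bit
test keeps working when new high bits appear, while a whole-value
comparison would not.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_ZF (1ULL << 6)	/* zero flag, architectural bit 6 */

/* Typical legacy use: test one flag bit, ignore everything else. */
static bool zero_flag_set(uint64_t eflags)
{
	return (eflags & EFLAGS_ZF) != 0;
}

/* Fragile use: compare the whole register against a fixed value. */
static bool eflags_equals(uint64_t eflags, uint64_t expected)
{
	return eflags == expected;
}
```

With a marker bit set in the upper half, zero_flag_set() gives the
same answer as before, but eflags_equals() against the old 32-bit
value no longer matches.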

So there are reasons to think that my patch is sane, but...

Here's the strace patch, so people can look. I didn't even test it on
an old kernel, but the fallback case to the old behavior looks
trivial.
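
For readers who want the logic without the diff context, the decision
the strace patch makes can be sketched as one standalone function.
This is an illustration, not the patch itself: the function name and
the -1 "unknown" return are assumptions, and only the CS values (0x33
and 0x23) and the meaning of the two marker bits come from the text
and the patch.

```c
#include <stdint.h>

#define CS_LONG_MODE	0x33	/* 64-bit code segment */
#define CS_COMPAT_MODE	0x23	/* 32-bit compatibility code segment */

/*
 * Returns 0 for the 64-bit personality, 1 for the 32-bit one,
 * and -1 if neither source of information gives an answer.
 */
static int pick_personality(uint64_t eflags, uint64_t cs)
{
	switch ((eflags >> 32) & 3) {
	case 1:
		return 0;	/* marker says native 64-bit system call */
	case 2:
		return 1;	/* marker says compat system call */
	case 0:
		break;		/* old kernel: fall back to checking CS */
	default:
		return -1;	/* both bits set: not defined here */
	}

	switch (cs) {
	case CS_LONG_MODE:
		return 0;
	case CS_COMPAT_MODE:
		return 1;
	default:
		return -1;
	}
}
```

The interesting case is a compat system call made from 64-bit code:
CS is still 0x33, so only the marker bits give the right answer.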

Comments?

Linus

syscall.c | 21 +++++++++++++++++++--
1 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/syscall.c b/syscall.c
index e66ac0a95582..edd9cb804318 100644
--- a/syscall.c
+++ b/syscall.c
@@ -901,14 +901,31 @@ get_scno(struct tcb *tcp)
long val;
int pid = tcp->pid;

+ /* Check the high bits of eflags for processor mode */
+ if (upeek(tcp, 8*EFLAGS, &val) < 0)
+ return -1;
+ val >>= 32;
/* Check CS register value. On x86-64 linux it is:
* 0x33 for long mode (64 bit)
* 0x23 for compatibility mode (32 bit)
* It takes only one ptrace and thus doesn't need
* to be cached.
*/
- if (upeek(tcp, 8*CS, &val) < 0)
- return -1;
+ switch (val & 3) {
+ case 0:
+ /* Legacy case: check CS */
+ if (upeek(tcp, 8*CS, &val) < 0)
+ return -1;
+ break;
+ case 1:
+ /* "Long mode" value */
+ val = 0x33;
+ break;
+ case 2:
+ /* Compatibility mode */
+ val = 0x23;
+ break;
+ }
switch (val) {
case 0x23: currpers = 1; break;
case 0x33: currpers = 0; break;