I think this idea is very attractive in several ways. I've been thinking
about something like this for a long time as part of an implementation of a
security tool like the one MEC described earlier.

The device you're talking about could go in /proc/<n>, for example. The
tracing process could read a line like this from it for each system call:

    write 2, "this is data", 12

(or perhaps a packed version for efficiency). The tracer could then react by
writing a line like

    skip-fail EIO

or

    skip-ok 12

to make the call "fail" or "succeed" (returning 12) without even really being
seen by the kernel. Alternatively, the tracer could write

    write 4, "this is data", 12

to alter the arguments of the call (useful for sandboxing, etc). The tracer
then writes

    cont

or

    step

to actually proceed with the (possibly altered) call, stopping again after the
call if 'step' was specified. In that case, it would then read

    fail EIO

or

    ok 12

after the kernel has processed the call, to see the result of the call, and
then specify the result passed to the tracee by writing

    fail EIO

or

    ok 12

back to the device.
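
To make the exchange concrete, here is a rough Python sketch of the tracer
side of it. Everything beyond the example lines above is my assumption: the
exact line grammar, the names parse_event, format_event and tracer_replies,
and the unlink event are hypothetical, not a defined protocol.

```python
import re

# Assumed grammar: a syscall name, then comma-separated integers or
# double-quoted strings, e.g.  write 2, "this is data", 12
EVENT_RE = re.compile(r'^(\w+)\s*(.*)$')
ARG_RE = re.compile(r'"((?:[^"\\]|\\.)*)"|(-?\d+)')


def parse_event(line):
    """Split an event line into (syscall name, list of int/str arguments)."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        raise ValueError("malformed event line: %r" % line)
    name, rest = m.groups()
    args = []
    for quoted, number in ARG_RE.findall(rest):
        args.append(int(number) if number else quoted)
    return name, args


def format_event(name, args):
    """Inverse of parse_event: rebuild the line the tracer writes back."""
    parts = ['"%s"' % a if isinstance(a, str) else str(a) for a in args]
    return "%s %s" % (name, ", ".join(parts))


def tracer_replies(line):
    """Return the lines a (hypothetical) sandboxing tracer would write."""
    name, args = parse_event(line)
    if name == "unlink":
        # Make the call "fail" without the kernel ever seeing it.
        return ["skip-fail EIO"]
    if name == "write" and args[0] == 2:
        # Alter the arguments (fd 2 becomes fd 4), then let the call run.
        args[0] = 4
        return [format_event(name, args), "cont"]
    return ["cont"]
```

A real tracer would read these lines from the device and write the replies
back; the sketch only covers the string handling.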

You would also want to be able to overwrite the tracee's memory before
returning (e.g., to alter the result of gettimeofday). It would be nice if
the protocol provided commands like

    set 1 "some data"
    set 2 "other data"

but this might be problematic. Just using /proc/<n>/mem is also a
possibility.
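
As a rough illustration of the /proc/<n>/mem route, something like the
following Python sketch would do. It assumes the tracer already holds a
seekable, writable file object for the tracee's memory and that the kernel
permits the write; poke and patch_timeval are hypothetical names, and the
struct timeval layout (two native longs, no padding) is an assumption that
will not hold on every ABI.

```python
import io
import struct


def poke(mem, address, data):
    """Overwrite len(data) bytes at `address` in an open memory file."""
    mem.seek(address)
    mem.write(data)


def patch_timeval(mem, address, seconds, microseconds):
    """Replace a struct timeval result that the kernel stored at `address`.

    Assumed layout: two native longs (tv_sec, tv_usec), no padding.
    """
    poke(mem, address, struct.pack("ll", seconds, microseconds))


def demo():
    """Exercise the helpers on an in-memory stand-in for /proc/<n>/mem."""
    fake_mem = io.BytesIO(bytes(64))
    patch_timeval(fake_mem, 16, 1000, 5)
    return struct.unpack_from("ll", fake_mem.getvalue(), 16)
```

In real use, mem would come from opening /proc/<n>/mem for reading and
writing, and address would be the pointer the tracee passed to gettimeofday.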

Similar commands would be provided for intercepting and handling signals.

The beauty of an approach like this is that it could be relatively generic.
Right now, in order to do this sort of thing with ptrace, you have to know
endless ugly details about exactly how to parse the stack, etc. This device,
though, could have a generic protocol.

It's also fairly powerful. I think you could (portably) implement tools like
strace or MEC's trace-and-replay debugger or the security tool we're
discussing using just this interface. And by eliminating ptrace, you
eliminate its problems (altered parent/child semantics).

This device could also be reentrant, able to be opened by more than one
tracer. Then each call would possibly be screened/altered by several
processes, in a nested fashion. This would be difficult or impossible with
ptrace.
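
The nesting could work roughly like the Python sketch below: each tracer
sees the call on the way in (and may alter it or short-circuit it), and
sees the result on the way back out. The before/after interface, the two
example tracers, and fake_kernel are all made up for illustration; nothing
here is an existing API.

```python
def dispatch(call, tracers, kernel):
    """Pass `call` through nested tracers, outermost first, then the kernel."""
    if not tracers:
        return kernel(call)
    outer, inner = tracers[0], tracers[1:]
    call, verdict = outer.before(call)
    if verdict is not None:
        # skip-ok / skip-fail: inner tracers and the kernel never see the call.
        return verdict
    return outer.after(dispatch(call, inner, kernel))


class DenyUnlink:
    """Hypothetical tracer that fails every unlink with EIO."""

    def before(self, call):
        if call[0] == "unlink":
            return call, ("fail", "EIO")
        return call, None

    def after(self, result):
        return result


class Redirect:
    """Hypothetical tracer that rewrites writes to fd 2 into writes to fd 4."""

    def before(self, call):
        name, args = call
        if name == "write" and args[0] == 2:
            call = (name, [4] + args[1:])
        return call, None

    def after(self, result):
        return result


seen = []  # calls that actually reached the stand-in kernel


def fake_kernel(call):
    """Stand-in for the real kernel: record the call, report success."""
    seen.append(call)
    name, args = call
    return ("ok", args[2] if name == "write" else 0)
```

With the tracer list [DenyUnlink(), Redirect()], a write to fd 2 reaches
fake_kernel as a write to fd 4, while an unlink never reaches it at all.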

As far as performance goes, though, it's not clear at all that this would be
an improvement over ptrace. In the uniprocessor (SP) case, you still end up
suspending the calling process, waking up the tracer (which is presumably
doing a select on the device), and then switching back to the tracee. I'm not
very knowledgeable about multiprocessor (MP) systems, but I don't see why this
would be a win in the MP case either.

Another problem you run into is dealing with system call arguments. Suppose
the call is a write. Are the contents of the buffer to be written also to be
passed through this syscall device, so that the tracer can examine them? That
would be nice, but how do you do it efficiently?

A naive way to do it would be to simply copy the arguments out of user space.
But that's really expensive if you're doing 10MB writes. You also have to
know exactly what data each system call will possibly read or write from user
memory, which MEC has previously pointed out is a rat's nest for ioctl, etc.
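
To show what "knowing exactly what data each call touches" would mean in
practice, here is a small Python sketch of a per-call table. The table
format and the names BUFFER_ARGS and eager_copy_cost are my invention, and
only read and write are filled in; ioctl is deliberately absent because no
single entry could describe it.

```python
# Hypothetical description of user-memory buffers per system call:
# (index of the pointer argument, index of the length argument, direction).
# "in" means the kernel reads the buffer, "out" means the kernel fills it.
BUFFER_ARGS = {
    "write": [(1, 2, "in")],
    "read": [(1, 2, "out")],
}


def eager_copy_cost(name, args):
    """Bytes a naive device would copy out of user space before the call.

    Returns None when the call's memory layout is not described, which is
    the situation for ioctl and friends.
    """
    spec = BUFFER_ARGS.get(name)
    if spec is None:
        return None
    return sum(args[length_idx]
               for _, length_idx, direction in spec
               if direction == "in")
```

For a 10MB write this reports 10485760 bytes copied per call, and for
ioctl it can only answer None.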

Another way you could do it would be to fake the copy by having reads on the
syscall device transparently pull data out of memory, depending on the
semantics of the particular system call, values of other arguments, etc. This
still has the ioctl problem, though, and if you don't copy the data, you have
to worry about locking it down (making sure it doesn't get overwritten) if
shared memory is involved.

I'm not trying to discourage you. I think some variation of this is a real
winner, and I was planning to try my hand at it myself.

--Mike
-- Any sufficiently adverse technology is indistinguishable from Microsoft.