Re: [PATCH 1/1] tracing, bpf: Implement function bpf_probe_write

From: Alexei Starovoitov
Date: Fri Jul 15 2016 - 22:30:48 EST


On Fri, Jul 15, 2016 at 07:16:01PM -0700, Sargun Dhillon wrote:
>
>
> On Thu, 14 Jul 2016, Alexei Starovoitov wrote:
>
> >On Wed, Jul 13, 2016 at 01:31:57PM -0700, Sargun Dhillon wrote:
> >>
> >>
> >>On Wed, 13 Jul 2016, Alexei Starovoitov wrote:
> >>
> >>>On Wed, Jul 13, 2016 at 03:36:11AM -0700, Sargun Dhillon wrote:
> >>>>Provides BPF programs, attached to kprobes a safe way to write to
> >>>>memory referenced by probes. This is done by making probe_kernel_write
> >>>>accessible to bpf functions via the bpf_probe_write helper.
> >>>
> >>>not quite :)
> >>>
> >>>>Signed-off-by: Sargun Dhillon <sargun@xxxxxxxxx>
> >>>>---
> >>>> include/uapi/linux/bpf.h | 3 +++
> >>>> kernel/trace/bpf_trace.c | 20 ++++++++++++++++++++
> >>>> samples/bpf/bpf_helpers.h | 2 ++
> >>>> 3 files changed, 25 insertions(+)
> >>>>
> >>>>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>>>index 406459b..355b565 100644
> >>>>--- a/include/uapi/linux/bpf.h
> >>>>+++ b/include/uapi/linux/bpf.h
> >>>>@@ -313,6 +313,9 @@ enum bpf_func_id {
> >>>> */
> >>>> BPF_FUNC_skb_get_tunnel_opt,
> >>>> BPF_FUNC_skb_set_tunnel_opt,
> >>>>+
> >>>>+ BPF_FUNC_probe_write, /* int bpf_probe_write(void *dst, void *src,
> >>>>int size) */
> >>>>+
> >>>
> >>>the patch is against some old kernel.
> >>>Please always make the patch against net-next tree and cc netdev list.
> >>>
> >>Sorry, I did this against Linus's tree, not net-next. Will fix.
> >>
> >>>>+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> >>>>+{
> >>>>+ void *dst = (void *) (long) r1;
> >>>>+ void *unsafe_ptr = (void *) (long) r2;
> >>>>+ int size = (int) r3;
> >>>>+
> >>>>+ return probe_kernel_write(dst, unsafe_ptr, size);
> >>>>+}
> >>>
> >>>the patch is whitepsace mangled. Please see Documentation/networking/netdev-FAQ.txt
> >>Also will fix.
> >>
> >>>
> >>>the main issue though that we cannot simply allow bpf to do probe_write,
> >>>since it may crash the kernel.
> >>>What might be ok is to allow writing into memory of current
> >>>user space process only. This way bpf prog will keep kernel safety guarantees,
> >>>yet it will be able to modify user process memory when necessary.
> >>>Since bpf+tracing is root only, it doesn't pose security risk.
> >>>
> >>>
> >>
> >>Doesn't probe_write prevent you from writing to protected memory and
> >>generate an EFAULT? Or are you worried about the situation where a bpf
> >>program writes to some other chunk of kernel memory, or writes bad data
> >>to said kernel memory?
> >>
> >>I guess when I meant "safe" -- it's safer than allowing arbitrary memcpy.
> >>I don't see a good way to ensure safety otherwise as we don't know
> >>which registers point to memory that it's reasonable for probes to
> >>manipulate. It's not like skb_store_bytes where we can check the pointer
> >>going in is the same pointer that's referenced, and with a super
> >>restricted datatype.
> >
> >exactly. probe_write can write anywhere in the kernel and that
> >will cause crashes. If we allow that bpf becomes no different than
> >kernel module.
> >
> >>Perhaps, it would be a good idea to describe an example where I used this:
> >>#include <uapi/linux/ptrace.h>
> >>#include <net/sock.h>
> >>#include <bcc/proto.h>
> >>
> >>
> >>int trace_inet_stream_connect(struct pt_regs *ctx)
> >>{
> >> if (!PT_REGS_PARM2(ctx)) {
> >> return 0;
> >> }
> >> struct sockaddr uaddr = {};
> >> struct sockaddr_in *addr_in;
> >> bpf_probe_read(&uaddr, sizeof(struct sockaddr), (void *)PT_REGS_PARM2(ctx));
> >> if (uaddr.sa_family == AF_INET) {
> >> // Simple cast causes LLVM weirdness
> >> addr_in = &uaddr;
> >> char fmt[] = "Connecting on port: %d\n";
> >> bpf_trace_printk(fmt, sizeof(fmt), ntohs(addr_in->sin_port));
> >> if (ntohs(addr_in->sin_port) == 80) {
> >> addr_in->sin_port = htons(443);
> >> bpf_probe_write((void *)PT_REGS_PARM2(ctx), &uaddr, sizeof(uaddr));
> >> }
> >> }
> >> return 0;
> >>};
> >>
> >>There are two reasons I want to do this:
> >>1) Debugging - sometimes, it makes sense to divert a program's syscalls in
> >>order to allow for better debugging
> >>2) Network Functions - I wrote a load balancer which intercepts
> >>inet_stream_connect & tcp_set_state. We can manipulate the destination
> >>address as neccessary at connect time. This also has the nice side effect
> >>that getpeername() returns the real IP that a server is connected to, and
> >>the performance is far better than doing "network load balancing"
> >>
> >>(I realize this is a total hack, better approaches would be appreciated)
> >
> >nice. interesting idea.
> >Have you considered ld_preload hack to do port rewrite?
> >
> We've thought about it. It wont really work for us, because we're doing this
> to manipulate 3rd party runtimes, many of which are written in languages
> that don't play nice with LD_PRELOAD. Go is the primary problem child in
> this case. We also looked at using SECCOMP + ptrace, but again, not all
> runtimes play nice with ptrace.

interesting!
I was about to suggest to hack write support into seccomp, since few
folks were interested in it as well. Why seccomp won't work?
You cannot have a centralized daemon that launches all the processes?

> >>If we allowed manipulation of the current task's user memory by exposing
> >>copy_to_user, that could also work if I attach the probe to sys_connect,
> >>I could overwrite the address there before it gets copied into
> >>kernel space, but that could lead to its own weirdness.
> >
> >we cannot simply call copy_to_user from the bpf either,
> >but yeah, something semantically equivalent to copy_to_user should
> >solve your port rewriting case, right?
> >Could you explain little bit more on 'syscall divert' ideas?
> >
> >
> If we had a "safe" copy_to_user which checked if BPF programs were running
> in the user context, that would work right? I mean, you could still make
> user programs crash, but that's better than making the kernel fall over. We
> would need both copy_from_user, and copy_to_user. If you look at the example
> program, it first checks what the user is connecting to -- so it'd have to
> check the address the user is passing to the syscall.

yep. imo 'safe' copy_to_user is no different in user impact from seccomp.
Both can make user process inoperable if there is a bug in bpf program,
but both are always safe from kernel point of view.
there is no need in copy_from_user. bpf_probe_read can read user memory just
fine or you actually want to make even safer version of probe_read that
can only read current task memory instead of any memory?
Makes sense to me.

> Syscall Divert:
> The idea here is to try to "lie" to the program about the environment. A
> specific example is where I want to bypass certain calls like prctl during
> debugging, to allow certain instructions to execute. I don't always have
> access to the source of the containerizer I'm running, and it's nice to turn
> them off during debugging.

makes sense too.
so seccomp with write support is also not option here, because you actually
want to divert seccomp itself?