Re: [PATCH 1/1] selftests/bpf: add cls_redirect classifier

From: Alexei Starovoitov
Date: Sun Apr 26 2020 - 13:33:32 EST


On Fri, Apr 24, 2020 at 07:55:55PM +0100, Lorenz Bauer wrote:
> cls_redirect is a TC clsact based replacement for the glb-redirect iptables
> module available at [1]. It enables what GitHub calls "second chance"
> flows [2], similarly proposed by the Beamer paper [3]. In contrast to
> glb-redirect, it also supports migrating UDP flows as long as connected
> sockets are used. cls_redirect is in production at Cloudflare, as part of
> our own L4 load balancer.

This is awesome. Thank you for contributing!
Applied.

> There are two major distinctions from glb-redirect: first, cls_redirect
> relies on receiving encapsulated packets directly from a router. This is
> because we don't have access to the neighbour tables from BPF, yet. See

Let's make it available then :)

> The code base started it's life on v4.19, so there are most likely still
> hold overs from old workarounds. In no particular order:
>
> - The function buf_off is required to defeat a clang optimization
> that leads to the verifier rejecting the program due to pointer
> arithmetic in the wrong order.
>
> - The function pkt_parse_ipv6 is force inlined, because it would
> otherwise be rejected due to returning a pointer to stack memory.
>
> - The functions fill_tuple and classify_tcp contain kludges, because
> we've run out of function arguments.
>
> - The logic in general is rather nested, due to verifier restrictions.
> I think this is either because the verifier loses track of constants
> on the stack, or because it can't track enum like variables.

Have you tried undoing these workarounds to check the latest verifier?
If they're still needed we certainly have to improve the verifier.

> +#include "test_cls_redirect.skel.h"

Do you use skeleton internally as well or was it just for selftests? ;)

> +_Static_assert(
> + sizeof(flow_ports_t) !=
> + offsetofend(struct bpf_sock_tuple, ipv4.dport) -
> + offsetof(struct bpf_sock_tuple, ipv4.sport) - 1,
> + "flow_ports_t must match sport and dport in struct bpf_sock_tuple");

Nice. I didn't realize clang supports it. Of course it should.

> +/* Linux packet pointers are either aligned to NET_IP_ALIGN (aka 2 bytes),
> + * or not aligned if the arch supports efficient unaligned access.
> + *
> + * Since the verifier ensures that eBPF packet accesses follow these rules,
> + * we can tell LLVM to emit code as if we always had a larger alignment.
> + * It will yell at us if we end up on a platform where this is not valid.
> + */
> +typedef uint8_t *net_ptr __attribute__((align_value(8)));

Wow. I didn't know about this attribute.
I wonder whether it can help Daniel's memcpy hack.

> +
> +typedef struct buf {
> + struct __sk_buff *skb;
> + net_ptr head;
> + /* NB: tail musn't have alignment other than 1, otherwise
> + * LLVM will go and eliminate code, e.g. when checking packet lengths.
> + */
> + uint8_t *const tail;
> +} buf_t;
> +
> +static size_t buf_off(const buf_t *buf)
> +{
> + /* Clang seems to optimize constructs like
> + * a - b + c
> + * if c is known:
> + * r? = c
> + * r? -= b
> + * r? += a
> + *
> + * This is a problem if a and b are packet pointers,
> + * since the verifier allows subtracting two pointers to
> + * get a scalar, but not a scalar and a pointer.
> + *
> + * Use inline asm to break this optimization.
> + */
> + size_t off = (size_t)buf->head;
> + asm("%0 -= %1" : "+r"(off) : "r"(buf->skb->data));
> + return off;
> +}

We need to look into this one.
This part is not gated by allow_ptr_leaks.
if (dst_reg == off_reg) {
/* scalar -= pointer. Creates an unknown scalar */
verbose(env, "R%d tried to subtract pointer from scalar\n",
dst);
return -EACCES;
}
Hopefully not too hard to fix.

> +
> +static bool pkt_skip_ipv6_extension_headers(buf_t *pkt,
> + const struct ipv6hdr *ipv6,
> + uint8_t *upper_proto,
> + bool *is_fragment)
> +{
> + /* We understand five extension headers.
> + * https://tools.ietf.org/html/rfc8200#section-4.1 states that all
> + * headers should occur once, except Destination Options, which may
> + * occur twice. Hence we give up after 6 headers.
> + */
> + struct {
> + uint8_t next;
> + uint8_t len;
> + } exthdr = {
> + .next = ipv6->nexthdr,
> + };
> + *is_fragment = false;
> +
> +#pragma clang loop unroll(full)
> + for (int i = 0; i < 6; i++) {

unroll is still needed even with bounded loop support in the verifier?


> +/* This function has to be inlined, because the verifier otherwise rejects it
> + * due to returning a pointer to the stack. This is technically correct, since
> + * scratch is allocated on the stack. However, this usage should be safe since
> + * it's the callers stack after all.
> + */

Interesting. The verifier can recognize that ptr to stack can be safe in some cases.
When I added that check I didn't think that the code can be as tricky as this :)

> +static verdict_t process_ipv4(buf_t *pkt, metrics_t *metrics)
> +{
> + switch (ipv4->protocol) {
> + case IPPROTO_ICMP:
> + return process_icmpv4(pkt, metrics);
> +
> + case IPPROTO_TCP:
> + return process_tcp(pkt, ipv4, sizeof(*ipv4), metrics);
> +
> + case IPPROTO_UDP:
> + return process_udp(pkt, ipv4, sizeof(*ipv4), metrics);
> +
> + default:
> + metrics->errors_total_unknown_l4_proto++;
> + return INVALID;
> + }
> +}
> +
> +static verdict_t process_ipv6(buf_t *pkt, metrics_t *metrics)
> + if (is_fragment) {
> + metrics->errors_total_fragmented_ip++;
> + return INVALID;
> + }
> +
> + switch (l4_proto) {
> + case IPPROTO_ICMPV6:
> + return process_icmpv6(pkt, metrics);
> +
> + case IPPROTO_TCP:
> + return process_tcp(pkt, ipv6, sizeof(*ipv6), metrics);
> +
> + case IPPROTO_UDP:
> + return process_udp(pkt, ipv6, sizeof(*ipv6), metrics);
> +
> + default:
> + metrics->errors_total_unknown_l4_proto++;
> + return INVALID;
> + }
> +}
> +
> +SEC("classifier/cls_redirect")
> +int cls_redirect(struct __sk_buff *skb)
> +{
...
> + verdict_t verdict;
> + switch (encap->gue.proto_ctype) {
> + case IPPROTO_IPIP:
> + verdict = process_ipv4(&pkt, metrics);
> + break;
> +
> + case IPPROTO_IPV6:
> + verdict = process_ipv6(&pkt, metrics);
> + break;

The code flow looks pretty clean.
I didn't find the copy-paste of ipv[46] -> tcp/udp
you were mentioning.
So that old issue is now gone?