Re: [PATCH net-next v2 2/6] tunnels: PMTU discovery support for directly bridged IP packets

From: Naresh Kamboju
Date: Wed Aug 05 2020 - 15:48:37 EST


On Tue, 4 Aug 2020 at 11:24, Stefano Brivio <sbrivio@xxxxxxxxxx> wrote:
>
> It's currently possible to bridge Ethernet tunnels carrying IP
> packets directly to external interfaces without assigning them
> addresses and routes on the bridged network itself: this is the case
> for UDP tunnels bridged with a standard bridge or by Open vSwitch.
>
> PMTU discovery is currently broken with those configurations, because
> the encapsulation effectively decreases the MTU of the link, and
> while we are able to account for this using PMTU discovery on the
> lower layer, we don't have a way to relay ICMP or ICMPv6 messages
> needed by the sender, because we don't have valid routes to it.
>
> On the other hand, as a tunnel endpoint, we can't fragment packets
> as a general approach: this is for instance clearly forbidden for
> VXLAN by RFC 7348, section 4.3:
>
> VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may
> fragment encapsulated VXLAN packets due to the larger frame size.
> The destination VTEP MAY silently discard such VXLAN fragments.
>
> The same paragraph recommends that the MTU over the physical network
> accomodates for encapsulations, but this isn't a practical option for
> complex topologies, especially for typical Open vSwitch use cases.
>
> Further, it states that:
>
> Other techniques like Path MTU discovery (see [RFC1191] and
> [RFC1981]) MAY be used to address this requirement as well.
>
> Now, PMTU discovery already works for routed interfaces, we get
> route exceptions created by the encapsulation device as they receive
> ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
> we already rebuild those messages with the appropriate MTU and route
> them back to the sender.
>
> Add the missing bits for bridged cases:
>
> - checks in skb_tunnel_check_pmtu() to understand if it's appropriate
> to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
> RFC 4443 section 2.4 for ICMPv6. This function is already called by
> UDP tunnels
>
> - a new function generating those ICMP or ICMPv6 replies. We can't
> reuse icmp_send() and icmp6_send() as we don't see the sender as a
> valid destination. This doesn't need to be generic, as we don't
> cover any other type of ICMP errors given that we only provide an
> encapsulation function to the sender
>
> While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
> we might receive GSO buffers here, and the passed headroom already
> includes the inner MAC length, so we don't have to account for it
> a second time (that would imply three MAC headers on the wire, but
> there are just two).
>
> This issue became visible while bridging IPv6 packets with 4500 bytes
> of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
> bytes of encapsulation headroom, we would advertise MTU as 3950, and
> we would reject fragmented IPv6 datagrams of 3958 bytes size on the
> wire. We're exclusively dealing with network MTU here, though, so we
> could get Ethernet frames up to 3964 octets in that case.
>
> v2:
> - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
> - split IPv4/IPv6 functions (David Ahern)
>
> Signed-off-by: Stefano Brivio <sbrivio@xxxxxxxxxx>
> ---
> drivers/net/bareudp.c | 5 +-
> drivers/net/geneve.c | 5 +-
> drivers/net/vxlan.c | 4 +-
> include/net/dst.h | 10 --
> include/net/ip_tunnels.h | 2 +
> net/ipv4/ip_tunnel_core.c | 244 ++++++++++++++++++++++++++++++++++++++
> 6 files changed, 254 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
> index 3b6664c7e73c..841910f1db65 100644
> --- a/drivers/net/bareudp.c
> +++ b/drivers/net/bareudp.c
> @@ -308,7 +308,7 @@ static int bareudp_xmit_skb(struct sk_buff *skb, struct net_device *dev,
> return PTR_ERR(rt);
>
> skb_tunnel_check_pmtu(skb, &rt->dst,
> - BAREUDP_IPV4_HLEN + info->options_len);
> + BAREUDP_IPV4_HLEN + info->options_len, false);
>
> sport = udp_flow_src_port(bareudp->net, skb,
> bareudp->sport_min, USHRT_MAX,
> @@ -369,7 +369,8 @@ static int bareudp6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
> if (IS_ERR(dst))
> return PTR_ERR(dst);
>
> - skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len);
> + skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len,
> + false);
>
> sport = udp_flow_src_port(bareudp->net, skb,
> bareudp->sport_min, USHRT_MAX,
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 017c13acc911..de86b6d82132 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -894,7 +894,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
> return PTR_ERR(rt);
>
> skb_tunnel_check_pmtu(skb, &rt->dst,
> - GENEVE_IPV4_HLEN + info->options_len);
> + GENEVE_IPV4_HLEN + info->options_len, false);
>
> sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
> if (geneve->cfg.collect_md) {
> @@ -955,7 +955,8 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
> if (IS_ERR(dst))
> return PTR_ERR(dst);
>
> - skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len);
> + skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len,
> + false);
>
> sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
> if (geneve->cfg.collect_md) {
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 77658425db8a..1432544da507 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -2720,7 +2720,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
> }
>
> ndst = &rt->dst;
> - skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM);
> + skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM, false);
>
> tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
> ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
> @@ -2760,7 +2760,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
> goto out_unlock;
> }
>
> - skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM);
> + skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM, false);
>
> tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
> ttl = ttl ? : ip6_dst_hoplimit(ndst);
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 852d8fb36ab7..6ae2e625050d 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -535,14 +535,4 @@ static inline void skb_dst_update_pmtu_no_confirm(struct sk_buff *skb, u32 mtu)
> dst->ops->update_pmtu(dst, NULL, skb, mtu, false);
> }
>
> -static inline void skb_tunnel_check_pmtu(struct sk_buff *skb,
> - struct dst_entry *encap_dst,
> - int headroom)
> -{
> - u32 encap_mtu = dst_mtu(encap_dst);
> -
> - if (skb->len > encap_mtu - headroom)
> - skb_dst_update_pmtu_no_confirm(skb, encap_mtu - headroom);
> -}
> -
> #endif /* _NET_DST_H */
> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
> index 36025dea7612..02ccd32542d0 100644
> --- a/include/net/ip_tunnels.h
> +++ b/include/net/ip_tunnels.h
> @@ -420,6 +420,8 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
> u8 tos, u8 ttl, __be16 df, bool xnet);
> struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md,
> gfp_t flags);
> +int skb_tunnel_check_pmtu(struct sk_buff *skb, struct dst_entry *encap_dst,
> + int headroom, bool reply);
>
> int iptunnel_handle_offloads(struct sk_buff *skb, int gso_type_mask);
>
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> index f8b419e2475c..9ddee2a0c66d 100644
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -184,6 +184,250 @@ int iptunnel_handle_offloads(struct sk_buff *skb,
> }
> EXPORT_SYMBOL_GPL(iptunnel_handle_offloads);
>
> +/**
> + * iptunnel_pmtud_build_icmp() - Build ICMP error message for PMTUD
> + * @skb: Original packet with L2 header
> + * @mtu: MTU value for ICMP error
> + *
> + * Return: length on success, negative error code if message couldn't be built.
> + */
> +static int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu)
> +{
> + const struct iphdr *iph = ip_hdr(skb);
> + struct icmphdr *icmph;
> + struct iphdr *niph;
> + struct ethhdr eh;
> + int len, err;
> +
> + if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct iphdr)))
> + return -EINVAL;
> +
> + skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN);
> + pskb_pull(skb, ETH_HLEN);
> + skb_reset_network_header(skb);
> +
> + err = pskb_trim(skb, 576 - sizeof(*niph) - sizeof(*icmph));
> + if (err)
> + return err;
> +
> + len = skb->len + sizeof(*icmph);
> + err = skb_cow(skb, sizeof(*niph) + sizeof(*icmph) + ETH_HLEN);
> + if (err)
> + return err;
> +
> + icmph = skb_push(skb, sizeof(*icmph));
> + *icmph = (struct icmphdr) {
> + .type = ICMP_DEST_UNREACH,
> + .code = ICMP_FRAG_NEEDED,
> + .checksum = 0,
> + .un.frag.__unused = 0,
> + .un.frag.mtu = ntohs(mtu),
> + };
> + icmph->checksum = ip_compute_csum(icmph, len);
> + skb_reset_transport_header(skb);
> +
> + niph = skb_push(skb, sizeof(*niph));
> + *niph = (struct iphdr) {
> + .ihl = sizeof(*niph) / 4u,
> + .version = 4,
> + .tos = 0,
> + .tot_len = htons(len + sizeof(*niph)),
> + .id = 0,
> + .frag_off = htons(IP_DF),
> + .ttl = iph->ttl,
> + .protocol = IPPROTO_ICMP,
> + .saddr = iph->daddr,
> + .daddr = iph->saddr,
> + };
> + ip_send_check(niph);
> + skb_reset_network_header(skb);
> +
> + skb->ip_summed = CHECKSUM_NONE;
> +
> + eth_header(skb, skb->dev, htons(eh.h_proto), eh.h_source, eh.h_dest, 0);
> + skb_reset_mac_header(skb);
> +
> + return skb->len;
> +}
> +
> +/**
> + * iptunnel_pmtud_check_icmp() - Trigger ICMP reply if needed and allowed
> + * @skb: Buffer being sent by encapsulation, L2 headers expected
> + * @mtu: Network MTU for path
> + *
> + * Return: 0 for no ICMP reply, length if built, negative value on error.
> + */
> +static int iptunnel_pmtud_check_icmp(struct sk_buff *skb, int mtu)
> +{
> + const struct icmphdr *icmph = icmp_hdr(skb);
> + const struct iphdr *iph = ip_hdr(skb);
> +
> + if (mtu <= 576 || iph->frag_off != htons(IP_DF))
> + return 0;
> +
> + if (ipv4_is_lbcast(iph->daddr) || ipv4_is_multicast(iph->daddr) ||
> + ipv4_is_zeronet(iph->saddr) || ipv4_is_loopback(iph->saddr) ||
> + ipv4_is_lbcast(iph->saddr) || ipv4_is_multicast(iph->saddr))
> + return 0;
> +
> + if (iph->protocol == IPPROTO_ICMP && icmp_is_err(icmph->type))
> + return 0;
> +
> + return iptunnel_pmtud_build_icmp(skb, mtu);
> +}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +/**
> + * iptunnel_pmtud_build_icmpv6() - Build ICMPv6 error message for PMTUD
> + * @skb: Original packet with L2 header
> + * @mtu: MTU value for ICMPv6 error
> + *
> + * Return: length on success, negative error code if message couldn't be built.
> + */
> +static int iptunnel_pmtud_build_icmpv6(struct sk_buff *skb, int mtu)
> +{
> + const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> + struct icmp6hdr *icmp6h;
> + struct ipv6hdr *nip6h;
> + struct ethhdr eh;
> + int len, err;
> + __wsum csum;
> +
> + if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct ipv6hdr)))
> + return -EINVAL;
> +
> + skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN);
> + pskb_pull(skb, ETH_HLEN);
> + skb_reset_network_header(skb);
> +
> + err = pskb_trim(skb, IPV6_MIN_MTU - sizeof(*nip6h) - sizeof(*icmp6h));
> + if (err)
> + return err;
> +
> + len = skb->len + sizeof(*icmp6h);
> + err = skb_cow(skb, sizeof(*nip6h) + sizeof(*icmp6h) + ETH_HLEN);
> + if (err)
> + return err;
> +
> + icmp6h = skb_push(skb, sizeof(*icmp6h));
> + *icmp6h = (struct icmp6hdr) {
> + .icmp6_type = ICMPV6_PKT_TOOBIG,
> + .icmp6_code = 0,
> + .icmp6_cksum = 0,
> + .icmp6_mtu = htonl(mtu),
> + };
> + skb_reset_transport_header(skb);
> +
> + nip6h = skb_push(skb, sizeof(*nip6h));
> + *nip6h = (struct ipv6hdr) {
> + .priority = 0,
> + .version = 6,
> + .flow_lbl = { 0 },
> + .payload_len = htons(len),
> + .nexthdr = IPPROTO_ICMPV6,
> + .hop_limit = ip6h->hop_limit,
> + .saddr = ip6h->daddr,
> + .daddr = ip6h->saddr,
> + };
> + skb_reset_network_header(skb);
> +
> + csum = csum_partial(icmp6h, len, 0);
> + icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr, &nip6h->daddr, len,
> + IPPROTO_ICMPV6, csum);

Linux next build breaks for riscv architecture defconfig build.

git_repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
target_arch: riscv
toolchain: gcc-9
git_short_log: d15fe4ec0435 (\Add linux-next specific files for 20200805\)
git_describe: next-20200805

Reported-by: Naresh Kamboju <naresh.kamboju@xxxxxxxxxx>

make -sk KBUILD_BUILD_USER=TuxBuild -C/linux ARCH=riscv
CROSS_COMPILE=riscv64-linux-gnu- HOSTCC=gcc CC="sccache
riscv64-linux-gnu-gcc" O=build defconfig
#
#
# make -sk KBUILD_BUILD_USER=TuxBuild -C/linux -j16 ARCH=riscv
CROSS_COMPILE=riscv64-linux-gnu- HOSTCC=gcc CC="sccache
riscv64-linux-gnu-gcc" O=build
#
../net/ipv4/ip_tunnel_core.c: In function ‘iptunnel_pmtud_build_icmpv6’:
../net/ipv4/ip_tunnel_core.c:335:24: error: implicit declaration of
function ‘csum_ipv6_magic’; did you mean ‘csum_tcpudp_magic’?
[-Werror=implicit-function-declaration]
335 | icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr,
&nip6h->daddr, len,
| ^~~~~~~~~~~~~~~
| csum_tcpudp_magic
cc1: some warnings being treated as errors
make[3]: *** [../scripts/Makefile.build:283: net/ipv4/ip_tunnel_core.o] Error 1

Full build log link,
https://gitlab.com/Linaro/lkft/kernel-runs/-/jobs/671651045


--
Linaro LKFT
https://lkft.linaro.org