Re: [PATCH] netfilter: nf_conntrack: Add conntrack helper for ESP/IPsec

From: Florian Westphal
Date: Mon Apr 26 2021 - 07:54:19 EST


Cole Dishington <Cole.Dishington@xxxxxxxxxxxxxxxxxxx> wrote:
> @@ -90,6 +90,8 @@ enum ctattr_l4proto {
> CTA_PROTO_ICMPV6_ID,
> CTA_PROTO_ICMPV6_TYPE,
> CTA_PROTO_ICMPV6_CODE,
> + CTA_PROTO_SRC_ESP_ID,
> + CTA_PROTO_DST_ESP_ID,
> __CTA_PROTO_MAX
> };


> diff --git a/net/netfilter/nf_conntrack_proto_esp.c b/net/netfilter/nf_conntrack_proto_esp.c
> new file mode 100644
> index 000000000000..f17ce8a9439f
> --- /dev/null
> +++ b/net/netfilter/nf_conntrack_proto_esp.c
> @@ -0,0 +1,736 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * <:copyright-gpl
> + * Copyright 2008 Broadcom Corp. All Rights Reserved.
> + * Copyright (C) 2021 Allied Telesis Labs NZ
> + *
> + * This program is free software; you can distribute it and/or modify it
> + * under the terms of the GNU General Public License (Version 2) as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
> + * for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program.
> + * :>
> + */
> +/******************************************************************************
> + * Filename: nf_conntrack_proto_esp.c
> + * Author: Pavan Kumar
> + * Creation Date: 05/27/04
> + *

Can you remove this changelog? The history isn't relevant for upstream.
You can add credits to the commit message if you like.

> + struct rhash_head lnode;
> + struct rhash_head rnode;
> + struct rhlist_head incmpl_rlist;
> +
> + u16 esp_id;
> +
> + u32 l_spi;
> + u32 r_spi;
> +
> + u16 l3num;

Minor nit: you can save a few bytes by placing the two u16 next to each
other.
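
I.e. something like this (field order only, not compile-tested):

	u16 esp_id;
	u16 l3num;	/* now shares one 4-byte slot with esp_id */

	u32 l_spi;
	u32 r_spi;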

> + union nf_inet_addr l_ip;
> + union nf_inet_addr r_ip;
> +
> + u32 alloc_time_jiffies;
> + struct net *net;
> +};
> +
> +struct _esp_hkey {
> + u16 l3num;

Nit: l3num can be u8.

> +static inline void esp_ip_addr_set_any(int af, union nf_inet_addr *a)
> +{
> + if (af == AF_INET6)
> + ipv6_addr_set(&a->in6, 0, 0, 0, 0);

An alternative is a->in6 = in6addr_any, up to you.

You could also remove the if (af == AF_INET6) conditional and just zero
everything.

Also, with very few exceptions, we try to avoid the 'inline' keyword in
.c files.
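
For illustration, a minimal sketch of the simplified helper (untested;
the af argument then becomes unused and could be dropped as well):

static void esp_ip_addr_set_any(int af, union nf_inet_addr *a)
{
	/* zero the whole union; covers both the IPv4 and IPv6 case */
	memset(a, 0, sizeof(*a));
}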

> +static inline void esp_ip_addr_copy(int af, union nf_inet_addr *dst,
> + const union nf_inet_addr *src)
> +{
> + if (af == AF_INET6)
> + ipv6_addr_prefix_copy(&dst->in6, &src->in6, 128);

An alternative is a plain struct assignment: dst->in6 = src->in6.

> +static inline void calculate_key(const u32 net_hmix, const u32 spi,
> + const u16 l3num,

l3num can be u8.

> +int nf_conntrack_esp_init(void)
> +{
> + int i;
> + int ret = 0;
> +
> + spin_lock_bh(&esp_table_lock);

This lock isn't needed. There is no way this function
can be executed concurrently.

> + /* Check if esphdr already associated with a pre-existing connection:
> + * if no, create a new connection, missing the r_spi;
> + * if yes, check if we have seen the source IP:
> + * if no, fill in r_spi in the pre-existing connection.
> + */
> + spin_lock_bh(&esp_table_lock);

Can you remove this lock?

It would be very unfortunate to lose the rhashtable's ability to do
parallel inserts and lockless lookups.

> + esp_entry = search_esp_entry_by_spi(net, spi, tuple->src.l3num,
> + &tuple->src.u3, &tuple->dst.u3);
> + if (!esp_entry) {
> + struct _esp_hkey key = {};
> + union nf_inet_addr any;
> + u32 net_hmix = net_hash_mix(net);
> + int err;
> +
> + esp_entry = alloc_esp_entry(net);
> + if (!esp_entry) {
> + pr_debug("All esp connection slots in use\n");
> + spin_unlock_bh(&esp_table_lock);
> + return false;
> + }
> + esp_entry->l_spi = spi;
> + esp_entry->l3num = tuple->src.l3num;
> + esp_ip_addr_copy(esp_entry->l3num, &esp_entry->l_ip, &tuple->src.u3);
> + esp_ip_addr_copy(esp_entry->l3num, &esp_entry->r_ip, &tuple->dst.u3);
> +
> + /* Add entries to the hash tables */
> +
> + err = rhashtable_insert_fast(&ltable, &esp_entry->lnode, ltable_params);
> + if (err) {

... without the lock, this can fail with -EEXIST.

You could remove the esp_table_lock and change the above
rhashtable_insert_fast() to something like:

	esp_entry_old = rhashtable_lookup_get_insert_fast(&ltable, &esp_entry->lnode,
							   ltable_params);
	if (esp_entry_old) {
		if (IS_ERR(esp_entry_old)) {
			esp_table_free_entry_by_esp_id(net, esp_entry->esp_id);
			return false;
		}

		esp_table_free_entry_by_esp_id(net, esp_entry->esp_id);
		/* insertion raced, use existing entry */
		esp_entry = esp_entry_old;
	}
	/* esp_entry_old == NULL -- insertion successful */

This should allow removal of the esp_table_lock spinlock.

> +#ifdef CONFIG_NF_CONNTRACK_PROCFS
> +/* print private data for conntrack */
> +static void esp_print_conntrack(struct seq_file *s, struct nf_conn *ct)
> +{
> + seq_printf(s, "l_spi=%x, r_spi=%x ", ct->proto.esp.l_spi, ct->proto.esp.r_spi);

Thanks, this looks good.

> + nf_conntrack_event_cache(IPCT_ASSURED, ct);
> +
> + /* Retrieve SPIs of original and reply from esp_entry.
> + * Both directions should contain the same esp_entry,
> + * so just check the first one.
> + */
> + tuple = nf_ct_tuple(ct, IP_CT_DIR_ORIGINAL);
> + esp_id = tuple->src.u.esp.id;
> + if (esp_id >= TEMP_SPI_START && esp_id <= TEMP_SPI_MAX) {
> + spin_lock_bh(&esp_table_lock);
> + esp_entry = esp_table[esp_id - TEMP_SPI_START];

This esp_table[] has to be removed.

1. It breaks netns isolation
2. It forces contention on a single spinlock.

As far as I understand, this table also serves as a resource limiter to
avoid eating up too many resources.

So, how about adding an espid bitmap to struct nf_conntrack_net?

Something like this:

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -63,6 +63,9 @@ struct nf_conntrack_net {
 	struct delayed_work ecache_dwork;
 	struct netns_ct *ct_net;
 #endif
+#ifdef CONFIG_NF_CT_PROTO_ESP
+	DECLARE_BITMAP(esp_id_map, 1024);
+#endif
 };
 
 #include <linux/types.h>
diff --git a/net/netfilter/nf_conntrack_proto_esp.c b/net/netfilter/nf_conntrack_proto_esp.c
index f17ce8a9439f..ce4d5864c480 100644
--- a/net/netfilter/nf_conntrack_proto_esp.c
+++ b/net/netfilter/nf_conntrack_proto_esp.c
@@ -341,24 +340,28 @@ static void esp_table_free_entry_by_esp_id(struct net *net, u16 esp_id)
  */
 struct _esp_table *alloc_esp_entry(struct net *net)
 {
-	int idx;
+	struct nf_conntrack_net *cnet = net_generic(net, nf_conntrack_net_id);
 	struct _esp_table *esp_entry = NULL;
+	int idx;
 
-	/* Find the first unused slot */
-	for (idx = 0; idx < ESP_MAX_CONNECTIONS; idx++) {
-		if (esp_table[idx])
-			continue;
+again:
+	idx = find_first_zero_bit(cnet->esp_id_map, 1024);
+	if (idx >= 1024)
+		return NULL;
 
-		esp_table[idx] = kmalloc(sizeof(*esp_entry), GFP_ATOMIC);
-		if (!esp_table[idx])
-			return NULL;
-		memset(esp_table[idx], 0, sizeof(struct _esp_table));
-		esp_table[idx]->esp_id = idx + TEMP_SPI_START;
-		esp_table[idx]->alloc_time_jiffies = nfct_time_stamp;
-		esp_table[idx]->net = net;
-		esp_entry = esp_table[idx];
-		break;
+	if (test_and_set_bit(idx, cnet->esp_id_map))
+		goto again; /* raced */
+
+	esp_entry = kzalloc(sizeof(*esp_entry), GFP_ATOMIC);
+	if (!esp_entry) {
+		clear_bit(idx, cnet->esp_id_map);
+		return NULL;
 	}
+
+	esp_entry->esp_id = idx + TEMP_SPI_START;
+	esp_entry->alloc_time_jiffies = nfct_time_stamp;
+	esp_entry->net = net;
+
 	return esp_entry;
 }


I have a few more concerns:

AFAICS there is no guarantee that an allocated esp table entry is backed
by a conntrack entry.

So, there must be a way to reap all allocated esp_entry structs
when a network namespace goes down.

Perhaps you could add a pernet (nf_conntrack_net) spinlock+list head
that appends each allocated entry to that list.

Then, on conntrack removal, in addition to removal from the rhash
tables, add a list_del().

On network namespace destruction, walk this list and remove all
remaining entries (those that are still around after removal of all
the conntrack objects).
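
Untested sketch of that idea (all names below are made up, and struct
_esp_table would need a struct list_head member, here 'list_node'):

	/* in struct nf_conntrack_net: */
	spinlock_t		esp_lock;
	struct list_head	esp_entries;	/* all allocated esp entries */

	/* in alloc_esp_entry(), once the entry is set up: */
	spin_lock_bh(&cnet->esp_lock);
	list_add_tail(&esp_entry->list_node, &cnet->esp_entries);
	spin_unlock_bh(&cnet->esp_lock);

	/* on conntrack removal, next to the rhashtable removals: */
	spin_lock_bh(&cnet->esp_lock);
	list_del(&esp_entry->list_node);
	spin_unlock_bh(&cnet->esp_lock);

	/* in the pernet exit handler, reap whatever is left: */
	struct _esp_table *esp_entry, *tmp;

	spin_lock_bh(&cnet->esp_lock);
	list_for_each_entry_safe(esp_entry, tmp, &cnet->esp_entries, list_node) {
		list_del(&esp_entry->list_node);
		kfree(esp_entry);
	}
	spin_unlock_bh(&cnet->esp_lock);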

Does that make sense to you?

> +static int esp_tuple_to_nlattr(struct sk_buff *skb,
> + const struct nf_conntrack_tuple *t)
> +{
> + if (nla_put_be16(skb, CTA_PROTO_SRC_ESP_ID, t->src.u.esp.id) ||
> + nla_put_be16(skb, CTA_PROTO_DST_ESP_ID, t->dst.u.esp.id))
> + goto nla_put_failure;

This exposes the 16-bit kernel-generated IDs, right?
Should this dump the real on-wire SPIs instead?

Or is there a reason why the internal IDs need exposure?
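
If the on-wire SPIs are the right thing to expose, I'd expect something
roughly like this instead (attribute names invented for illustration;
the SPIs live in ct->proto.esp, so the tuple alone isn't enough):

	/* hypothetical attributes, assuming l_spi/r_spi are kept in host order */
	if (nla_put_be32(skb, CTA_PROTO_ESP_L_SPI, htonl(ct->proto.esp.l_spi)) ||
	    nla_put_be32(skb, CTA_PROTO_ESP_R_SPI, htonl(ct->proto.esp.r_spi)))
		goto nla_put_failure;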