Re: Regression: kernel 4.14 and later very slow with many ipsec tunnels

From: Wolfgang Walter
Date: Fri Sep 14 2018 - 07:49:17 EST


On Friday, 14 September 2018, 07:54:37, Florian Westphal wrote:
> Steffen Klassert <steffen.klassert@xxxxxxxxxxx> wrote:
> > On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > > David Miller <davem@xxxxxxxxxxxxx> wrote:
> > > > From: Florian Westphal <fw@xxxxxxxxx>
> > > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > >
> > > > > Wolfgang Walter <linux@xxxxxxx> wrote:
> > > > >> What I can say is that it depends mainly on number of policy rules
> > > > >> and SA.
> > > > >
> > > > > > That's already a good hint, I guess we're hitting long hash chains in
> > > > > xfrm_policy_lookup_bytype().
> > > >
> > > > I don't really see how recent changes can influence that.
> > >
> > > I don't think there is a recent change that did this.
> > >
> > > Walter says < 4.14 is ok, so this is likely related to flow cache
> > > removal.
> > >
> > > F.e. it looks like all prefixed policies end up in a linked list
> > > (net->xfrm.policy_inexact) and are not even in a hash table.
> > >
> > > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > > but can't figure out how to configure that away from the
> > > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in first place :/
> >
> > The hash threshold can be configured like this:
> >
> > ip x p set hthresh4 0 0
> >
> > This sets the hash threshold to local /0 and remote /0 netmasks.
> > With this configuration, all policies should go to the hashtable.
>
> Yes, but won't they all be hashed to same bucket?
>
> [ jhash(addr & 0, addr & 0) ] ?
>
> > Default hash thresholds are local /32 and remote /32 netmasks, so
> > all prefixed policies go to the inexact list.
>
> Yes.
>
> Wolfgang, before having to work on getting perf into your router image
> can you perhaps share a bit of info about the policies you're using?
>
> How many are there? Are they prefixed or not ("10.1.2.1")?

All rules are tunnel rules. That is, they are rules like this (in strongswan
notation):

conn A-to-B
    left=111.111.111.111
    leftsubnet=10.148.32.0/24
    leftsigkey=....
    right=111.111.111.222
    rightsubnet=10.148.13.224/29
    rightsigkey=....
    esp=aes128ctr-sha1-ecp256-esn!
    ike=aes128ctr-sha1-ecp256!
    mobike=no
    type=tunnel
    ....

(... other options not important here).


leftsubnet and rightsubnet may have any prefix length from /30 to /16 here (we
do not yet use ipv6 but will do so next year).

We have about 3000 of them.

strongswan installs IN, FWD and OUT rules for each of them in the kernel
security policy database with automatically generated priorities (SAs are
created when strongswan actually establishes a tunnel).

Also, some of the rules overlap in range, which means ordering is important.

With IKEv2 this may happen automatically for SAs even if you avoid it in your
rule set, as IKEv2 allows narrowing.

In policies you most often get this if you want to exempt a certain network
or host. We have about 70 of these at the moment.

We do not use other possible selectors besides src-addr-range and
dst-addr-range (you could additionally select by protocol (icmp, udp, tcp) and
by src- and dst-port-range). So theoretically you could have a ruleset with a
rule which exempts all connections to dst port 22 for several networks, or
which applies different encryption options, and so on.

A rule determines what has to be done with a packet (sent or received)
from an ipsec point of view: allow it without ipsec transformation, block it
completely, or require a certain ipsec transformation (use this or that
encryption scheme, use header compression, use transport or tunnel mode, ...).

So for any packet the kernel sends, it has to look up whether there are SAs
which match and, from these, choose the one with the highest priority (which is
the one with the lowest priority field). If there is none, it has to look up
whether there is a matching policy, again choosing the one with the highest
priority (and then let the IKE daemon actually establish an SA). For tunnel
mode it actually has to do this twice, I think, as the tunnel packet passes
ipsec again.

For every packet it receives which is not an ipsec packet, it has to do a
lookup in the policy database to see if it should have been one (or if it is
allowed or blocked). If no rule is found, it is allowed without encryption. We
have 29,000 allow rules. I deactivated them for the tests with 4.14 and
4.18, as they make things horrible. They are automatically generated from our
declarative network description and we actually don't need them, as they do not
overlap with the remote networks tunneled via ipsec. They did not impose any
burden with 4.9 and earlier.

We do sometimes need such rules (say, if 10.10.0.0/16 is remote but 10.10.1.0
is local).

So this is basically the multidimensional packet classification problem: from a
set of m-dimensional blocks, find the one with the highest priority which
contains a certain point.

The dimensions here are src-addr-range, dst-addr-range, protocol,
src-port-range and dst-port-range.
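
To make the problem statement concrete, here is a rough Python sketch of the
naive solution, a linear scan over all rules. The Rule fields, the classify()
helper and the sample rules are made up for illustration and are not kernel
data structures; the only thing taken from above is that the numerically
lowest priority field wins.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """One m-dimensional block (hypothetical layout, not the kernel's)."""
    name: str
    priority: int                       # lower value = higher priority
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    proto: Optional[int] = None         # None matches any protocol
    sport: Tuple[int, int] = (0, 65535)
    dport: Tuple[int, int] = (0, 65535)

    def contains(self, src, dst, proto, sport, dport):
        # The point lies inside the block only if it matches in every dimension.
        return (src in self.src and dst in self.dst
                and (self.proto is None or self.proto == proto)
                and self.sport[0] <= sport <= self.sport[1]
                and self.dport[0] <= dport <= self.dport[1])


def classify(rules, src, dst, proto, sport, dport):
    """Return the matching rule with the lowest priority value, or None.

    Every rule is visited, so the cost per packet is O(len(rules)).
    """
    point = (ipaddress.ip_address(src), ipaddress.ip_address(dst),
             proto, sport, dport)
    best = None
    for rule in rules:
        if rule.contains(*point) and (best is None or rule.priority < best.priority):
            best = rule
    return best


net = ipaddress.ip_network
RULES = [
    Rule("tunnel", 200, net("10.148.32.0/24"), net("10.148.13.224/29")),
    # Overlapping exemption: same networks, but only TCP to dst port 22.
    Rule("ssh-exempt", 100, net("10.148.32.0/24"), net("10.148.13.224/29"),
         proto=6, dport=(22, 22)),
]

hit = classify(RULES, "10.148.32.5", "10.148.13.225", 6, 40000, 22)
print(hit.name if hit else None)
```

With 3000 tunnels and an IN, FWD and OUT rule for each, such a scan touches
thousands of rules for every single packet.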

If your rule is itself a point, you may hash it. But you can only rely on the
hashed result if you are sure that no non-point rule with higher priority also
matches that point, because there is no principle that a more specific rule
beats a less specific one (that would be ill-defined).
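
A small Python sketch of that caveat, restricted to the two address
dimensions: host-to-host rules go into a hash table, everything else stays in
a list, and a hashed hit is only a candidate until the list has been checked
for a better priority. PolicyDB and its methods are hypothetical names for
illustration; I am not claiming this is how the kernel code is written.

```python
import ipaddress


class PolicyDB:
    """Hash table for point rules plus a list for prefixed rules (sketch)."""

    def __init__(self):
        self.exact = {}     # (src_addr, dst_addr) -> (priority, name)
        self.inexact = []   # [(priority, src_net, dst_net, name)]

    def add(self, priority, src, dst, name):
        s, d = ipaddress.ip_network(src), ipaddress.ip_network(dst)
        if s.prefixlen == s.max_prefixlen and d.prefixlen == d.max_prefixlen:
            # Both selectors are single hosts: the rule is a point, hash it.
            key = (s.network_address, d.network_address)
            old = self.exact.get(key)
            if old is None or priority < old[0]:
                self.exact[key] = (priority, name)
        else:
            self.inexact.append((priority, s, d, name))

    def lookup(self, src, dst):
        s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
        best = self.exact.get((s, d))   # O(1), but only a candidate
        for prio, snet, dnet, name in self.inexact:
            # A prefixed rule only matters if its priority value is lower
            # (that is, better) than the best hit found so far.
            if best is not None and prio >= best[0]:
                continue
            if s in snet and d in dnet:
                best = (prio, name)
        return best[1] if best else None


db = PolicyDB()
db.add(200, "10.0.0.1/32", "10.0.1.1/32", "host-rule")
db.add(100, "10.0.0.0/16", "10.0.1.0/24", "net-rule")
db.add(300, "10.0.0.0/8", "10.0.0.0/8", "wide-rule")

# The hashed host rule exists, but the prefixed rule has the better priority.
print(db.lookup("10.0.0.1", "10.0.1.1"))
```

So the hash only saves work for the point rules themselves. With a rule set
like ours, where practically every rule is prefixed, the list still has to be
walked for every lookup.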

Here is an example of how strongswan lets you use all of the above selectors
in your rules. For leftsubnet you could write:

leftsubnet=10.0.0.1[tcp/http],10.0.0.2[6/80]
leftsubnet=fec1::1[udp],10.0.0.0/16[/53]
leftsubnet=fec1::1[udp/%any],10.0.0.0/16[%any/53]
leftsubnet=fec1::1[udp/%any],10.0.0.0/16[%any/1024-32000]

So ipsec with a large policy database and without the xfrm flow cache is
comparable to a large netfilter ruleset (with only one chain) without conntrack.
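
To illustrate the effect, here is a toy Python model of a per-flow cache in
front of a linear policy scan. CountingPolicyDB, FlowCache and the 3000
synthetic rules are assumptions for the sake of the example and have nothing
to do with the real xfrm code; the sketch only counts how many rules get
compared.

```python
import ipaddress


class CountingPolicyDB:
    """Linear policy scan that counts how many rules it compares."""

    def __init__(self, rules):
        self.rules = [(prio, ipaddress.ip_network(s), ipaddress.ip_network(d), name)
                      for prio, s, d, name in rules]
        self.comparisons = 0

    def lookup(self, src, dst):
        s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
        best = None
        for prio, snet, dnet, name in self.rules:
            self.comparisons += 1
            if s in snet and d in dnet and (best is None or prio < best[0]):
                best = (prio, name)
        return best[1] if best else None


class FlowCache:
    """Remembers the lookup result per (src, dst) flow."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def lookup(self, src, dst):
        key = (src, dst)
        if key not in self.cache:
            self.cache[key] = self.db.lookup(src, dst)   # slow path, once per flow
        return self.cache[key]

    def flush(self):
        # A real cache has to be invalidated whenever the policies change.
        self.cache.clear()


# 3000 synthetic tunnel rules, one /24 source network each (made-up addresses).
rules = [(1000 + i, f"10.{i // 250}.{i % 250}.0/24", "192.168.0.0/16", f"conn-{i}")
         for i in range(3000)]

db = CountingPolicyDB(rules)
cached = FlowCache(db)
for _ in range(1000):                    # 1000 packets of the same flow
    cached.lookup("10.11.249.7", "192.168.1.1")
print(db.comparisons)                    # 3000: only the first packet scans

uncached = CountingPolicyDB(rules)
for _ in range(1000):
    uncached.lookup("10.11.249.7", "192.168.1.1")
print(uncached.comparisons)              # 3000000: every packet scans
```

In this model only the first packet of a flow pays for the scan; without the
cache every packet does.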

Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts