Re: [RFC 2/4] net/fib: Provide fib_balance_budget sysctl

From: Dmitry Safonov
Date: Thu Apr 04 2019 - 14:31:57 EST


On 4/1/19 7:09 PM, Alexander Duyck wrote:
> On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
>> Unfortunately, MAX_WORK at this moment is broken: during
>> trie_rebalance() climbing upwards resize() is being called for each
>> tnode. Which makes the limit useless the bigger trie gets: at each level
>> there are 10 attempts to balance tnode and childs, resulting in
>> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is
>> the level where we changed the trie originally (adding/removing
>> alias/route).
>>
>> Which results in the following price of removing one route under big
>> trie (similar for some other single routes that results in reallocating
>> tnodes by hitting the threshold limits for halve/inflate resize):
>
> The info below doesn't really tell us the price. It might be useful if
> you could just time the removal of that one route and provide that
> information.

Yes, well, that was an attempt to point that MAX_WORK in the current
state doesn't limit much complexity to balance trie.

So, most of the time (up to a couple of seconds) is spent on
synchronise_rcu() (which was addressed by 4/4 and a patch from David
Ahern in -next).

But I thought fixing max_work is also worth attention, regardless the
fix for synchronising rcu. And playing with the sysctl limit that is
provided by the patch, I've found out that even with 4/4 applied one can
tune this sysctl knob the way that results in dividing the work between
consecutive route manipulations. So that on a 4-core switch the time
spent to add a route in maximum was more than a second and adjusting
work limit can bring it down to 300msec in max. Though, the total time
spent to add the whole route table was the same.

I will provide more details for v2.

>
>> Before:
>> Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>> Main:
>> Aver depth: 1.99
>> Max depth: 2
>> Leaves: 77825
>> Prefixes: 77825
>> Internal nodes: 77
>> 9: 1 10: 76
>> Pointers: 78336
>> Null ptrs: 435
>> Total size: 7912 kB
>> Local:
>> [omitted]
>>
>> After:
>> Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>> Main:
>> Aver depth: 3.00
>> Max depth: 3
>> Leaves: 77824
>> Prefixes: 77824
>> Internal nodes: 20491
>> 1: 2048 2: 18432 6: 1 11: 10
>> Pointers: 98368
>> Null ptrs: 54
>> Total size: 8865 kB
>> Local:
>> [omitted]
>>
>
> Is this before/after the patch, or just before/after the removal of the
> one route? I assume you are are just talking about the one route since
> the number of prefixes has dropped by 1.

Yes, just for removing one route. Should have mentioned that explicitly,
sorry.

>
>> Provide a sysctl to control amount of pending balancing work.
>> (by default unlimited as it was)
>>
>> Cc: linux-doc@xxxxxxxxxxxxxxx
>> Cc: Jonathan Corbet <corbet@xxxxxxx>
>> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down
>> into inflate/halve")
>> Signed-off-by: Dmitry Safonov <dima@xxxxxxxxxx>
>> ---
>> Documentation/networking/ip-sysctl.txt | 6 +++
>> include/net/ip.h | 1 +
>> net/ipv4/fib_trie.c | 60 +++++++++++++++-----------
>> net/ipv4/sysctl_net_ipv4.c | 7 +++
>> 4 files changed, 49 insertions(+), 25 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index acdfb5d2bcaa..fb71dacff4dd 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
>> 0 - Layer 3
>> 1 - Layer 4
>>
>> +fib_balance_budget - UNSIGNED INTEGER
>> + Limits the number of resize attempts during balancing fib trie
>> + on adding/removing new routes.
>> + Possible values:
>> + Default: UINT_MAX (0xFFFFFFFF)
>> +
>
> So I am not really a fan of the use of UNSIGNED_INTEGER here. I would
> much rather see this be a signed integer and the test be for the value
> going negative instead of having to test in so many places if the value
> is 0.

Ok, sounds good to me - will change the type for v2.

[..]
>> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>> break;
>> }
>>
>> - max_work--;
>> + (*budget)--;
>
> One thing we might explore here would be to look at pulling out the
> check of budget at the start of the loop and instead combine the
> decrement with the check for < 0 here so we have something like:
> if (--*budget < 0)
> break;
>
> That way what should happen is that we will achieve maximum depth in
> any resize and make at least one pass optimizing from the bottom up
> instead of the top down.

Yeah, sounds reasonable - will do this way and retest.

>
>> + inflated = true;
>> tn = get_child(tp, cindex);
>> }
>>
>> /* update parent in case inflate failed */
>> tp = node_parent(tn);
>>
>> - /* Return if at least one inflate is run */
>> - if (max_work != MAX_WORK)
>> + /* Return if at least one inflate is run:
>> + * microoptimization to not recalculate thresholds
>> + */
>> + if (inflated)
>> return tp;
>
> An alternative to having to add another variable would be to look at
> switching the loop to a combination of an if and a do/while like so:
> if (should_inflate(tp, tn)) {
> do {
> ...
> } while (should_inflate(tp, tn));
>
> /* update parent in case inflate failed */
> return node_parent(tn);
> }

That's neat! Will drop the variable.

>
>
>> /* Halve as long as the number of empty children in this
>> * node is above threshold.
>> */
>> - while (should_halve(tp, tn) && max_work) {
>> - tp = halve(t, tn);
>> + while (should_halve(tp, tn) && *budget) {
>> + tp = halve(t, tn, budget);
>> if (!tp) {
>> #ifdef CONFIG_IP_FIB_TRIE_STATS
>> this_cpu_inc(stats->resize_node_skipped);
>> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>> break;
>> }
>>
>> - max_work--;
>> + (*budget)--;
>
> Same here. It might make sense to pull the budget check out of the
> start of the loop and move it here so that it becomes a depth first
> optimization instead of being a shallow one.

Makes sense, will rework.

>
>> tn = get_child(tp, cindex);
>> }
>>
>> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
>>
>> static void trie_rebalance(struct trie *t, struct key_vector *tn)
>> {
>> - while (!IS_TRIE(tn))
>> - tn = resize(t, tn);
>> + unsigned int budget = fib_balance_budget;
>> +
>> + while (budget && !IS_TRIE(tn))
>> + tn = resize(t, tn, &budget);
>
> I would not have the budget check here. At a minimum we should be
> replacing trie nodes with leaves in the case that we are down to one
> leaf node in the trie.

Yeah, I see.

>
>> }
>>
>> static int fib_insert_node(struct trie *t, struct key_vector *tp,
>> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
>> void fib_table_flush_external(struct fib_table *tb)
>> {
>> struct trie *t = (struct trie *)tb->tb_data;
>> + unsigned int budget = fib_balance_budget;
>> struct key_vector *pn = t->kv;
>> unsigned long cindex = 1;
>> struct hlist_node *tmp;
>> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
>> update_suffix(pn);
>>
>> /* resize completed node */
>> - pn = resize(t, pn);
>> + pn = resize(t, pn, &budget);
>> cindex = get_index(pkey, pn);
>>
>> continue;
>
> So we would probably want a completely different budget here, or at
> least we would want the budget to be multiplied by the leaves we are
> removing. The problem is we aren't pulling single addresses, but are
> literally splitting the tree into two separate tries. The same budget
> isn't going to work for both cases.

Which will multiply the work (and the time spent).
Hmm, well, I guess we can multiply here and relax rtnl_mutex pressure if
needed by introducing fib_lock. Does that sounds good?

>
>> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
>> int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>> {
>> struct trie *t = (struct trie *)tb->tb_data;
>> + unsigned int budget = fib_balance_budget;
>> struct key_vector *pn = t->kv;
>> unsigned long cindex = 1;
>> struct hlist_node *tmp;
>> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>> update_suffix(pn);
>>
>> /* resize completed node */
>> - pn = resize(t, pn);
>> + pn = resize(t, pn, &budget);
>> cindex = get_index(pkey, pn);
>>
>> continue;
>
> Similar issues here, though I don't know what the typical amount of
> change is we expect to see from a fib_table_flush call versus something
> like the fib_table_flush_external call.

Yes. Well, I was also curious if it's possible to avoid resize() call
under (flush_all == true)?
So, that exiting a namespace wouldn't result in rebalancing a possibly
huge routing trie.

Thanks for your notes and review,
Dmitry