Re: [PATCH net-next 0/5] rtnetlink: add IFA_IF_NETNSID for RTM_GETADDR

From: Christian Brauner
Date: Fri Aug 31 2018 - 21:34:57 EST


On Thu, Aug 30, 2018 at 04:45:45PM +0200, Christian Brauner wrote:
> On Thu, Aug 30, 2018 at 11:49:31AM +0300, Kirill Tkhai wrote:
> > On 29.08.2018 21:13, Christian Brauner wrote:
> > > Hi Kirill,
> > >
> > > Thanks for the question!
> > >
> > > On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
> > >> Hi, Christian,
> > >>
> > >> On 29.08.2018 02:18, Christian Brauner wrote:
> > >>> From: Christian Brauner <christian@xxxxxxxxxx>
> > >>>
> > >>> Hey,
> > >>>
> > >>> A while back we introduced and enabled IFLA_IF_NETNSID in
> > >>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> > >>> to signficant performance increases since it allows userspace to avoid
> > >>> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> > >>> interfaces from the netns associated with the netns_fd. Especially when a
> > >>> lot of network namespaces are in use, using setns() becomes increasingly
> > >>> problematic when performance matters.
> > >>
> > >> could you please give a real example, when setns()+socket(AF_NETLINK) cause
> > >> problems with the performance? You should do this only once on application
> > >> startup, and then you have created netlink sockets in any net namespaces you
> > >> need. What is the problem here?
> > >
> > > So we have a daemon (LXD) that is often running thousands of containers.
> > > When users issue a lxc list request against the daemon it returns a list
> > > of all containers including all of the interfaces and addresses for each
> > > container. To retrieve those addresses we currently rely on setns() +
> > > getifaddrs() for each of those containers. That has horrible
> > > performance.
> >
> > Could you please provide some numbers showing that setns()
> > introduces signify performance decrease in the application?
>
> Sure, might take a few days++ though since I'm traveling.

Hey Kirill,

As promised here's some code [1] that compares performance. I basically
did a setns() to the network namespace and called getifaddrs() and
compared this to the scenario where I use the newly introduced property.
I did this 1 million times and calculated the mean getifaddrs()
retrieval time based on that.
My patch cuts the time in half.

brauner@wittgenstein:~/netns_getifaddrs$ ./getifaddrs_perf 0 1178
Mean time in microseconds (netnsid): 81
Mean time in microseconds (setns): 162

Christian

I'm only appending the main file since the netsns_getifaddrs() code I
used is pretty long:

[1]:

#define _GNU_SOURCE
#define __STDC_FORMAT_MACROS
#include <fcntl.h>
#include <inttypes.h>
#include <linux/types.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#include "netns_getifaddrs.h"
#include "print_getifaddrs.h"

#define ITERATIONS 1000000
#define SEC_TO_MICROSEC(x) (1000000 * (x))

int main(int argc, char *argv[])
{
int i, ret;
__s32 netns_id;
pid_t netns_pid;
char path[1024];
intmax_t times[ITERATIONS];
struct timeval t1, t2;
intmax_t time_in_mcs;
int fret = EXIT_FAILURE;
intmax_t sum = 0;
int host_netns_fd = -1, netns_fd = -1;

struct ifaddrs *ifaddrs = NULL;

if (argc != 3)
goto on_error;

netns_id = atoi(argv[1]);
netns_pid = atoi(argv[2]);
printf("%d\n", netns_id);
printf("%d\n", netns_pid);

for (i = 0; i < ITERATIONS; i++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto on_error;

ret = netns_getifaddrs(&ifaddrs, netns_id);
freeifaddrs(ifaddrs);
if (ret < 0)
goto on_error;

ret = gettimeofday(&t2, NULL);
if (ret < 0)
goto on_error;

time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
(SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
times[i] = time_in_mcs;
}

for (i = 0; i < ITERATIONS; i++)
sum += times[i];

printf("Mean time in microseconds (netnsid): %ju\n",
sum / ITERATIONS);

ret = snprintf(path, sizeof(path), "/proc/%d/ns/net", netns_pid);
if (ret < 0 || (size_t)ret >= sizeof(path))
goto on_error;

netns_fd = open(path, O_RDONLY | O_CLOEXEC);
if (netns_fd < 0)
goto on_error;

host_netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
if (host_netns_fd < 0)
goto on_error;

memset(times, 0, sizeof(times));
for (i = 0; i < ITERATIONS; i++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto on_error;

ret = setns(netns_fd, CLONE_NEWNET);
if (ret < 0)
goto on_error;

ret = getifaddrs(&ifaddrs);
freeifaddrs(ifaddrs);
if (ret < 0)
goto on_error;

ret = gettimeofday(&t2, NULL);
if (ret < 0)
goto on_error;

ret = setns(host_netns_fd, CLONE_NEWNET);
if (ret < 0)
goto on_error;

time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
(SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
times[i] = time_in_mcs;
}

for (i = 0; i < ITERATIONS; i++)
sum += times[i];

printf("Mean time in microseconds (setns): %ju\n",
sum / ITERATIONS);

fret = EXIT_SUCCESS;

on_error:
if (netns_fd >= 0)
close(netns_fd);

if (host_netns_fd >= 0)
close(host_netns_fd);

exit(fret);
}

>
> >
> > I worry about all this just because of netlink interface is
> > already overloaded, while this patch makes it even less modular.
>
> Introducing the IFA_IF_NETNSID property will not make the netlink
> interface less modular. It is a clean, RTM_*ADDR-request specific
> property using network namespace identifiers which we discussed in prior
> patches are the way to go forward.
>
> You can already get interfaces via GETLINK from another network
> namespaces than the one you reside in (Which we enabled just a few
> months back.) but you can't do the same for GETADDR. Those two are
> almost always used together. When you want to get the links you usually
> also want to get the addresses associated with it right after.
> In a prior discussion we agreed that network namespace identifiers are
> the way to go forward but that any other propery, i.e. PIDs and fds
> should never be ported into other parts of the codebase and that is
> indeed something I agree with.
>
> > In case of one day we finally reach rtnl unscalability trap,
> > every common interface like this may be a decisive nail in
> > a coffin lid of possibility to overwrite everything.
> >
> > > The problem with what you're proposing is that the daemon would need to
> > > cache a socket file descriptor for each container which is something
> > > that we unfortunately cannot do since we can't excessively cache file
> > > descriptors because we can easily hit the open file limit. We also
> > > refrain from caching file descriptors for a long time for security
> > > reasons.
> > >
> > > For the case where users just request a list of the interfaces we
> > > can already use RTM_GETLINK + IFLA_IF_NETNS which has way better
> > > performance. But we can't do the same with RTM_GETADDR requests which
> > > was an oversight on my part when I wrote the original patchset for the
> > > RTM_*LINK requests. This just rectifies this and aligns RTM_GETLINK +
> > > RTM_GETADDR.
> > > Based on this patchset I have written a userspace POC that is basically
> > > a netns namespace aware getifaddr() or - as I like to call it -
> > > netns_getifaddr().
> > >
> > >>
> > >>> Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
> > >>> getifaddrs() style functions and friends). But currently, RTM_GETADDR
> > >>> requests do not support a similar property like IFLA_IF_NETNSID for
> > >>> RTM_*LINK requests.
> > >>> This is problematic since userspace can retrieve interfaces from another
> > >>> network namespace by sending a IFLA_IF_NETNSID property along but
> > >>> RTM_GETLINK request but is still forced to use the legacy setns() style of
> > >>> retrieving interfaces in RTM_GETADDR requests.
> > >>>
> > >>> The goal of this series is to make it possible to perform RTM_GETADDR
> > >>> requests on different network namespaces. To this end a new IFA_IF_NETNSID
> > >>> property for RTM_*ADDR requests is introduced. It can be used to send a
> > >>> network namespace identifier along in RTM_*ADDR requests. The network
> > >>> namespace identifier will be used to retrieve the target network namespace
> > >>> in which the request is supposed to be fulfilled. This aligns the behavior
> > >>> of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
> > >>>
> > >>> Security:
> > >>> - The caller must have assigned a valid network namespace identifier for
> > >>> the target network namespace.
> > >>> - The caller must have CAP_NET_ADMIN in the owning user namespace of the
> > >>> target network namespace.
> > >>>
> > >>> Thanks!
> > >>> Christian
> > >>>
> > >>> [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
> > >>> [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
> > >>> [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
> > >>> [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
> > >>> [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
> > >>>
> > >>> Christian Brauner (5):
> > >>> rtnetlink: add rtnl_get_net_ns_capable()
> > >>> if_addr: add IFA_IF_NETNSID
> > >>> ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
> > >>> ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
> > >>> rtnetlink: move type calculation out of loop
> > >>>
> > >>> include/net/rtnetlink.h | 1 +
> > >>> include/uapi/linux/if_addr.h | 1 +
> > >>> net/core/rtnetlink.c | 15 +++++---
> > >>> net/ipv4/devinet.c | 38 +++++++++++++++-----
> > >>> net/ipv6/addrconf.c | 70 ++++++++++++++++++++++++++++--------
> > >>> 5 files changed, 97 insertions(+), 28 deletions(-)
> > >>>