Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc

From: Andy Lutomirski
Date: Mon Aug 25 2014 - 12:13:58 EST

On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
<nicolas.dichtel@xxxxxxxxx> wrote:
> On 25/08/2014 16:04, Andy Lutomirski wrote:
>> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@xxxxxxxxx>
>> wrote:
>>>> CRIU wants to save the complete state of a namespace and then restore
>>>> it. For that to work, any information exposed to things in the
>>>> namespace *cannot* be globally unique or unique per boot, since CRIU
>>>> needs to arrange for that information to match whatever it was when
>>>> CRIU saved it.
>>> How are the ifindexes of network devices managed? These ifindexes are
>>> unique per boot, so they can change depending on the order in which
>>> netdevs are created, and they are exposed to userspace ...
>> This does not appear to be true.
>> $ sudo unshare --net
>> # ip link add veth0 type veth peer name veth1
>> # ip link
>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
>> default
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>> DEFAULT group default qlen 1000
>> link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
>> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>> DEFAULT group default qlen 1000
>> link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
>> # logout
>> $ ip link
>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
>> state DOWN qlen 1000
> I've probably misunderstood what you're trying to say. ifindexes are unique
> per boot and per netns.

I think we both misunderstood each other. The ifindexes are unique
*per netns*, which means that, if you're unprivileged in a netns,
global information doesn't leak to you. I think this is good.
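
To make the per-netns point concrete, here is a toy model of the
allocation (a sketch only, not the kernel's implementation; the NetNS
class and its method names are hypothetical). Each namespace keeps its
own counter, so the same ifindex names different devices in different
namespaces, as in the transcript above:

```python
# Toy model only: NOT kernel code. NetNS and add_device are hypothetical
# names used to illustrate per-namespace ifindex allocation.
class NetNS:
    def __init__(self, name):
        self.name = name
        self.next_ifindex = 1      # counter is private to this namespace
        self.devices = {}          # ifindex -> device name
        self.add_device("lo")      # every netns starts with a loopback

    def add_device(self, devname):
        ifindex = self.next_ifindex
        self.next_ifindex += 1
        self.devices[ifindex] = devname
        return ifindex


host = NetNS("host")
host.add_device("eth0")            # assumed placeholder for ifindex 2
host.add_device("em1")

child = NetNS("child")             # stands in for 'unshare --net'
child.add_device("veth1")
child.add_device("veth0")

for ns in (host, child):
    print(ns.name, sorted(ns.devices.items()))
```

Index 3 is em1 in one namespace and veth0 in the other, so an ifindex
seen from inside a netns says nothing about how many devices exist
elsewhere on the host.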

>> Let me try again, with emphasis in the right place.
>> I think that *code running in a namespace* has no business even
>> knowing a unique identity of *that namespace* from the perspective of
>> the host.
>> In your example, if there's a veth device between netns A and netns B,
>> then code *in netns A* has no business knowing the identity of its
>> veth peer if its peer (B) is a sibling or ancestor. It also IMO has
>> no business knowing the identity of its own netns (A) other than as
>> "my netns".
> I do not agree (see the example below).
>> If A and B are siblings, then their parent needs to know where that
>> veth device goes, but I think this is already the case to a sufficient
>> extent today.
> I'm not aware of a hierarchy between netns. A daemon should be able to
> get the full network configuration, even if it's started when this
> configuration is already applied, i.e. even if it doesn't know what
> happened before it started.

I don't know exactly which namespaces have an explicit hierarchy, but
there is certainly a hierarchy of *user* namespaces, and network
namespaces live in user namespaces, so they at least have somewhat of
a hierarchy.

>> I feel like this discussion is falling into a common trap of new API
>> discussions. Can one of you who wants this API please articulate,
>> with a reasonably precise example, what it is that you want to do, why
>> you can't easily do it already, and how this API helps? I currently
>> understand how the API creates problems, but I don't understand how it
>> solves any problems, and I will NAK it (and I suspect that Eric will,
>> too, which is pretty much fatal) unless that changes.
> What I'm trying to solve is to have full info in netlink messages sent by
> the kernel, thus being able to identify a peer netns (and this is close to
> what the audit guys are trying to have). Theoretically, messages sent by the
> kernel can be reused as is to recreate the same configuration. This is not
> the case with x-netns devices. Here is an example, with ip tunnels:
> $ ip netns add 1
> $ ip link add ipip1 type ipip remote local dev eth0
> $ ip -d link ls ipip1
> 8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode
> DEFAULT group default
> link/ipip peer promiscuity 0
> ipip remote local dev eth0 ttl inherit pmtudisc
> $ ip link set ipip1 netns 1
> $ ip netns exec 1 ip -d link ls ipip1
> 8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
> mode DEFAULT group default
> link/ipip peer promiscuity 0
> ipip remote local dev tunl0 ttl inherit pmtudisc
> Now the information obtained with 'ip link' is wrong and incomplete:
> - the link dev is now tunl0 instead of eth0, because we only got an ifindex
>   from the kernel without any netns information.
> - the encapsulation addresses are not part of this netns, but the user
>   doesn't know that (again because netns info is missing). These IPv4
>   addresses may exist in this netns.
> - it's not possible to create the same netdevice from this information.

Aha. That's a genuine problem.

Perhaps we need a concept of which netnses should be able to see each other.

I think I would be okay with a somewhat different outcome from your example:

$ ip netns exec 1 ip -d link ls ipip1
8: ipip1@[unknown device in another namespace]:
<POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN

I think this outcome is mandatory if netns 1 lives in a subsidiary
user namespace.

Certainly, if you do the 'ip link' in the original namespace, I agree
that this should work.
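
Roughly, the difference between today's behaviour and the outcome I'm
describing can be sketched like this. It is a toy model under made-up
assumptions: the ifindex values, the Tunnel type, and both helper
functions are hypothetical and are not a netlink or iproute2 API.

```python
from dataclasses import dataclass

UNKNOWN = "[unknown device in another namespace]"


@dataclass
class Tunnel:
    name: str
    link_ns: str        # namespace that owns the underlying device
    link_ifindex: int   # ifindex of that device *within link_ns*


# ifindex -> device name, per namespace (numbers are invented)
namespaces = {
    "host": {1: "lo", 2: "eth0"},
    "1": {1: "lo", 2: "tunl0"},
}


def naive_link_name(tun, viewer_ns):
    # A bare ifindex gets resolved in whatever namespace the viewer is
    # in, which is how eth0 turns into tunl0 in the example.
    return namespaces[viewer_ns].get(tun.link_ifindex, "?")


def scoped_link_name(tun, viewer_ns):
    # Resolve only when the link device lives in the viewer's namespace;
    # otherwise admit that it cannot be named from here.
    if tun.link_ns != viewer_ns:
        return UNKNOWN
    return namespaces[viewer_ns][tun.link_ifindex]


ipip1 = Tunnel("ipip1", link_ns="host", link_ifindex=2)
print(naive_link_name(ipip1, "1"))
print(scoped_link_name(ipip1, "1"))
print(scoped_link_name(ipip1, "host"))
```

The scoped lookup gives the wrong answer nowhere: it names eth0 from the
original namespace and declines to guess from netns 1.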

For most namespace types, this all works transparently, since
everything has a real identity all the way up the hierarchy. Network
namespaces are different.

I don't think that exposing serial numbers in /proc is a good
solution, both for the reasons already described and because I don't
think that iproute2 should need to muck around with /proc to function
correctly. Eric, any clever ideas here? Do we need fancier netlink
messages for this?
