Re: [PATCH] net: Provide sysctl to tune local port range to IANA specification

From: Eric Dumazet
Date: Wed Jul 24 2024 - 06:00:10 EST


On Wed, Jul 24, 2024 at 8:04 AM <jiang.kun2@xxxxxxxxxx> wrote:
>
> From: Fan Yu <fan.yu9@xxxxxxxxxx>
>
> The Importance of Following IANA Standards
> ========================================
> IANA specifies User ports as 1024-49151, and it just so happens
> that my application uses port 33060 (reserved for MySQL Database Extended),
> which conflicts with the Linux default dynamic port range (32768-60999)[1].
>
> In fact, IANA assigns numbers in the port range from 32768 to 49151,
> and these assignments are uniformly accepted by the industry. To avoid
> such conflicts, the kernel needs to follow the IANA specification.
>
> Drawbacks of existing implementations
> ========================================
> In past discussions, following the IANA specification by modifying the
> system defaults has been discouraged, because it would greatly affect
> existing users[2].
>
> Theoretically, this can be done by tuning net.ipv4.ip_local_port_range,
> but there are inconveniences such as:
> (1) For cloud-native scenarios, each container is expected to follow
> the IANA specification uniformly, so it is necessary to do sysctl
> configuration in each container individually, which increases the user's
> resource management costs.
> (2) For new applications, since sysctl(net.ipv4.ip_local_port_range) is
> isolated across namespaces, the container cannot inherit the host's value,
> so after startup, it remains at the kernel default value of 32768-60999,
> which reduces the ease of use of the system.
>
> Solution
> ========================================
> In order to maintain compatibility, we provide a sysctl interface in
> host namespace, which makes it easy to tune local port range to
> IANA specification.
>
> When ip_local_port_range_use_iana=1, the local port range of all network
> namespaces is tuned to IANA specification (49152-60999), and IANA
> specification is also used for newly created network namespaces. Therefore,
> each container does not need to do sysctl settings separately, which
> improves the convenience of configuration.
> When ip_local_port_range_use_iana=0, the local port range of all network
> namespaces is tuned to the original kernel defaults (32768-60999).
> For example:
> # cat /proc/sys/net/ipv4/ip_local_port_range
> 32768 60999
> # echo 1 > /proc/sys/net/ipv4/ip_local_port_range_use_iana
> # cat /proc/sys/net/ipv4/ip_local_port_range
> 49152 60999
>
> # unshare -n
> # cat /proc/sys/net/ipv4/ip_local_port_range
> 49152 60999
>
> Notes
> ========================================
> The lower limit (49152) is consistent with the IANA dynamic port lower limit.
> The upper limit (60999) differs from the IANA dynamic upper limit because
> Linux uses 61000-65535 for masquerading/NAT, but this does not conflict
> with the IANA specification[3].
>
> Note that following the above specification reduces the number of ephemeral
> ports by more than half (from 28232 to 11848), increasing the risk of port
> exhaustion[2].
>
> [1]:https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt
> [2]:https://lore.kernel.org/all/bf42f6fd-cd06-02d6-d7b6-233a0602c437@xxxxxxxxx/
> [3]:https://lore.kernel.org/all/20070512210830.514c7709@xxxxxxxxxxxxxxxxx/
>
> Co-developed-by: Kun Jiang <jiang.kun2@xxxxxxxxxx>
> Signed-off-by: Fan Yu <fan.yu9@xxxxxxxxxx>
> Signed-off-by: Kun Jiang <jiang.kun2@xxxxxxxxxx>
> Reviewed-by: xu xin <xu.xin16@xxxxxxxxxx>
> Reviewed-by: Yunkai Zhang <zhang.yunkai@xxxxxxxxxx>
> Reviewed-by: Qiang Tu <tu.qiang35@xxxxxxxxxx>
> Reviewed-by: Peilin He <he.peilin@xxxxxxxxxx>
> Cc: Yang Yang <yang.yang29@xxxxxxxxxx>
> ---
> Documentation/networking/ip-sysctl.rst | 13 +++++++++++++
> net/ipv4/af_inet.c | 7 ++++++-
> net/ipv4/sysctl_net_ipv4.c | 31 +++++++++++++++++++++++++++++++
> 3 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index bd50df6a5a42..27f4928c2a1d 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1320,6 +1320,19 @@ ip_local_port_range - 2 INTEGERS
> Must be greater than or equal to ip_unprivileged_port_start.
> The default values are 32768 and 60999 respectively.
>
> +ip_local_port_range_use_iana - BOOLEAN
> + Tune ip_local_port_range to IANA specification easily.
> + When ip_local_port_range_use_iana=1, the local port range of
> + all network namespaces is tuned to IANA specification (49152-60999),
> + and IANA specification is also used for newly created network namespaces.
> + Therefore, each container does not need to do sysctl settings separately,
> + which improves the convenience of configuration.
> + When ip_local_port_range_use_iana=0, the local port range of
> + all network namespaces are tuned to the original kernel
> + defaults (32768-60999).
> +

IANA means: Internet Assigned Numbers Authority

It is quite possible that a future RFC will change the actual ranges.

I would have used RFC 6335, because when a new RFC comes in 2030, we
will have to add a new sysctl, right?

> + Default: 0
> +
> ip_local_reserved_ports - list of comma separated ranges
> Specify the ports which are reserved for known third-party
> applications. These ports will not be used by automatic port
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index b24d74616637..42b6bc58dc45 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -123,6 +123,8 @@
>
> #include <trace/events/sock.h>
>
> +extern u8 sysctl_ip_local_port_range_use_iana;
> +
> /* The inetsw table contains everything that inet_create needs to
> * build a new socket.
> */
> @@ -1802,7 +1804,10 @@ static __net_init int inet_init_net(struct net *net)
> /*
> * Set defaults for local port range
> */
> - net->ipv4.ip_local_ports.range = 60999u << 16 | 32768u;
> + if (sysctl_ip_local_port_range_use_iana)
> + net->ipv4.ip_local_ports.range = 60999u << 16 | 49152u;
> + else
> + net->ipv4.ip_local_ports.range = 60999u << 16 | 32768u;
>
> seqlock_init(&net->ipv4.ping_group_range.lock);
> /*
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 162a0a3b6ba5..a38447889072 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -45,6 +45,8 @@ static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;
> static unsigned int udp_child_hash_entries_max = UDP_HTABLE_SIZE_MAX;
> static int tcp_plb_max_rounds = 31;
> static int tcp_plb_max_cong_thresh = 256;
> +u8 sysctl_ip_local_port_range_use_iana;
> +EXPORT_SYMBOL(sysctl_ip_local_port_range_use_iana);
>
> /* obsolete */
> static int sysctl_tcp_low_latency __read_mostly;
> @@ -95,6 +97,26 @@ static int ipv4_local_port_range(struct ctl_table *table, int write,
> return ret;
> }
>
> +static int ipv4_local_port_range_use_iana(struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct net *net;
> + int ret;
> +
> + ret = proc_dou8vec_minmax(table, write, buffer, lenp, ppos);
> +
> + if (write && ret == 0) {
> + for_each_net(net) {

This is quite buggy.

for_each_net() can only be used with care: without proper protection,
the list can be corrupted and a netns can disappear under you.