netfilter: iptables-restore: setsockopt(3, SOL_IP,IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

From: Dexuan Cui
Date: Wed May 12 2021 - 18:26:58 EST


Hi,
I'm debugging an iptables-restore failure that happens about 5% of the
time when I repeatedly stop and start the Linux VM. The VM has only 1
CPU and runs kernel 4.15.0-1098-azure, but I suspect the issue may
also exist in the mainline Linux kernel.

When the failure happens, it's always reported for line 27 of the rule
file, the COMMIT of the 'security' table:

1 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
2 *raw
3 :PREROUTING ACCEPT [0:0]
4 :OUTPUT ACCEPT [0:0]
5 -A PREROUTING ! -s 168.63.129.16/32 -p tcp -j NOTRACK
6 -A OUTPUT ! -d 168.63.129.16/32 -p tcp -j NOTRACK
7 COMMIT
8 # Completed on Fri Apr 23 09:22:59 2021
9 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
10 *filter
11 :INPUT ACCEPT [2407:79190058]
12 :FORWARD ACCEPT [0:0]
13 :OUTPUT ACCEPT [1648:2190051]
14 -A OUTPUT -d 169.254.169.254/32 -m owner --uid-owner 33 -j DROP
15 COMMIT
16 # Completed on Fri Apr 23 09:22:59 2021
17 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
18 *security
19 :INPUT ACCEPT [2345:79155398]
20 :FORWARD ACCEPT [0:0]
21 :OUTPUT ACCEPT [1504:2129015]
22 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
23 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
24 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
25 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
26 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
27 COMMIT

The related part of the strace log is:

socket(PF_INET, SOCK_RAW, IPPROTO_RAW) = 3
getsockopt(3, SOL_IP, IPT_SO_GET_INFO, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., [84]) = 0
getsockopt(3, SOL_IP, IPT_SO_GET_ENTRIES, "security\0\357B\16Z\177\0\0Pg\355\0\0\0\0\0Pg\355\0\0\0\0\0"..., [880]) = 0
setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = -1 EAGAIN (Resource temporarily unavailable)
close(3) = 0
write(2, "iptables-restore: line 27 failed"..., 33) = 33

The -EAGAIN error comes from line 1240 of xt_replace_table(), which is
reached through this call chain:

do_ipt_set_ctl
  do_replace
    __do_replace
      xt_replace_table

1216 xt_replace_table(struct xt_table *table,
1217                  unsigned int num_counters,
1218                  struct xt_table_info *newinfo,
1219                  int *error)
1220 {
1221         struct xt_table_info *private;
1222         unsigned int cpu;
1223         int ret;
1224
1225         ret = xt_jumpstack_alloc(newinfo);
1226         if (ret < 0) {
1227                 *error = ret;
1228                 return NULL;
1229         }
1230
1231         /* Do the substitution. */
1232         local_bh_disable();
1233         private = table->private;
1234
1235         /* Check inside lock: is the old number correct? */
1236         if (num_counters != private->number) {
1237                 pr_debug("num_counters != table->private->number (%u/%u)\n",
1238                          num_counters, private->number);
1239                 local_bh_enable();
1240                 *error = -EAGAIN;
1241                 return NULL;
1242         }

When the function returns -EAGAIN, the 'num_counters' is 5 while
'private->number' is 6.

If I re-run the iptables-restore program upon the failure, the program
will succeed.

I checked the function xt_replace_table() in the recent mainline kernel and it
looks like the function is the same.

Is there a race condition between iptables-restore calling getsockopt()
to get the number of table entries and iptables calling setsockopt() to
replace the entries? That would mean some other program is concurrently
calling getsockopt()/setsockopt(), but this does not seem to be the case
according to the messages I print via trace_printk() around do_replace()
in do_ipt_set_ctl():

- When the -EAGAIN error happens, there is no other program calling
  do_replace().
- The table entry number was changed to 5 by another program, 'iptables',
  about 1.3 milliseconds earlier.
- Then this program, 'iptables-restore', calls setsockopt(), and the kernel
  sees 'num_counters' being 5 and 'private->number' being 6 (how can this
  happen??).
- The next setsockopt() call for the same 'security' table happens about
  1 minute later, with both numbers being 6.
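
To make the check in xt_replace_table() concrete, here is a minimal
userspace sketch of the snapshot-then-replace pattern. It is not kernel
code: 'struct model_table', model_get_info() and model_replace() are
hypothetical names made up for illustration, and the model only shows how
a stale count produces -EAGAIN. It does not explain why 'private->number'
is 6 in the trace described above.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel table; only the entry count matters. */
struct model_table {
	unsigned int number;	/* plays the role of private->number */
};

/* Models the getsockopt() step: snapshot the current entry count. */
static unsigned int model_get_info(const struct model_table *t)
{
	return t->number;
}

/*
 * Models the check quoted from xt_replace_table(): the caller passes the
 * count it saw earlier, and the replace is refused if the table has
 * changed in the meantime.
 */
static int model_replace(struct model_table *t, unsigned int num_counters,
			 unsigned int new_number)
{
	if (num_counters != t->number)
		return -EAGAIN;	/* snapshot is stale */

	t->number = new_number;
	return 0;
}
```

In this model, a caller that snapshots the count, loses a race to another
writer, and then calls model_replace() with the old count gets -EAGAIN;
taking a fresh snapshot and trying again succeeds.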

Can you please shed some light on the issue? Thanks!

BTW, iptables does have a retry mechanism for getsockopt():
2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
(https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10&ignorews=0&dt=0)

But it looks like this is not enough? E.g. here getsockopt() returns 0, but
setsockopt() returns -EAGAIN.
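
For illustration, a retry that also covers the setsockopt() side could
look roughly like the sketch below. This is not how iptables or libiptc
is implemented: retry_on_eagain(), commit_fn and flaky_commit() are
hypothetical names, and the sketch assumes that one "attempt" redoes the
whole getsockopt()/setsockopt() cycle so that each retry starts from a
fresh entry count.

```c
#include <errno.h>

/* Hypothetical callback type: one full read-and-replace attempt.
 * Returns 0 on success or a negative errno value on failure. */
typedef int (*commit_fn)(void *ctx);

/* Re-run the attempt while it fails with -EAGAIN, up to max_attempts
 * times. Any other result (success or a different error) is returned
 * immediately. */
static int retry_on_eagain(commit_fn attempt, void *ctx, int max_attempts)
{
	int ret = -EAGAIN;

	for (int i = 0; i < max_attempts && ret == -EAGAIN; i++)
		ret = attempt(ctx);

	return ret;
}

/* Hypothetical attempt used only to exercise the loop: it fails with
 * -EAGAIN until the counter pointed to by ctx reaches zero. */
static int flaky_commit(void *ctx)
{
	int *remaining_failures = ctx;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return -EAGAIN;
	}
	return 0;
}
```

With a counter of 2 and max_attempts of 5, retry_on_eagain() returns 0 on
the third attempt. A bounded attempt count keeps a persistent mismatch
from looping forever.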

Thanks,
Dexuan