Fwd: AF-XDP program in multi-process/multi-threaded configuration IO_PAGEFAULT
From: Bagas Sanjaya
Date: Mon Aug 14 2023 - 20:40:47 EST
Hi,
I noticed a bug report on Bugzilla [1]. Quoting from it:
> Hello,
>
> I am currently doing research on AF_XDP and I encountered a bug that is present in multi-process and multi-threaded configurations of AF_XDP programs. I believe there is a race condition that causes an IO_PAGEFAULT and crashes the entire OS when it is hit. This bug can be reproduced using Suricata release 7.0.0-rc1, or any other program that creates multiple user space processes, each with its own AF_XDP socket.
>
> I have attached some sample code that should be able to reproduce the bug. This code creates n processes, where n is the number of RX queues specified by the user. In my experience, the higher the number of processes/RX queues used, the higher the likelihood of triggering the crash.
>
> To change the number of RX queues, use ethtool to set the number of combined RX queues (this may vary depending on the network card):
> sudo ethtool -L <interface> combined <number of RX queues>
>
> Compile the code using make and run the code as such:
> sudo -E ./xdp_main.o <interface> <number of child processes> consec
>
> To get the crash to show up, lots of traffic needs to be sent to the network interface. In our experimental setup, a machine using Pktgen is sending traffic to the machine running the AF_XDP code at max line rate. Using Pktgen, vary the IP/MAC addresses of each packet to make sure the packets are somewhat evenly distributed across each RX queue. This may help with reproducing the bug. Also be sure the interface is set to promiscuous mode.
>
> While sending traffic at max line rate, send a SIGINT to the AF_XDP program receiving the traffic to terminate it. More often than not, an IO_PAGEFAULT will occur. Also attached are some screenshots of the terminal and of the output our server gives.
>
> The bug occurs because each process has the same STDIN file descriptor and, as a result, each child process gets the same SIGINT signal at the same time, causing them all to terminate at once. During this, I believe a race condition is reached where the XDP program is still receiving packets and is trying to write them to a UMEM that no longer exists. The order of operations to cause this would be:
> 1. XDP program looks up AF_XDP socket in XSKS_MAP
> 2. User space program deletes UMEM and/or AF_XDP socket
> 3. XDP program tries to write packet to UMEM
>
> This can also be reproduced with Suricata as stated earlier with a similar traffic load as described for my personal program.
>
> If more clarification is needed, please reach out to me. I would also like to know whether this behavior is intended and, if not, what the cause of this bug is. I look forward to hearing from you!
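To illustrate the simultaneous termination the reporter describes, here is a minimal sketch in plain POSIX C. It is not the attached reproducer and does not touch AF_XDP at all. It assumes the SIGINT comes from Ctrl-C on the controlling terminal, in which case the terminal delivers it to every process in the foreground process group. The helper name interrupted_children() is hypothetical, and a private process group signalled with kill() stands in for the terminal's foreground group so that the caller itself is not interrupted.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the attached reproducer): fork n idle
 * children into one private process group, send a single SIGINT to that
 * group, and return how many children were terminated by it.
 * Returns -1 on setup failure.
 */
int interrupted_children(int n)
{
    int ready[2];
    pid_t pgid = 0;
    int killed = 0;
    int status;

    if (n <= 0)
        return 0;
    if (pipe(ready) < 0)
        return -1;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            /* Clean up any children already started. */
            if (pgid > 0) {
                kill(-pgid, SIGKILL);
                while (waitpid(-pgid, NULL, 0) > 0)
                    ;
            }
            close(ready[0]);
            close(ready[1]);
            return -1;
        }
        if (pid == 0) {
            /* Child: pgid == 0 means "found a new group" (first child). */
            setpgid(0, pgid);
            signal(SIGINT, SIG_DFL);   /* make sure SIGINT is not ignored */
            close(ready[0]);
            if (write(ready[1], "x", 1) != 1)
                _exit(1);
            close(ready[1]);
            for (;;)
                pause();               /* idle until a signal arrives */
        }
        if (pgid == 0)
            pgid = pid;
        /* Also set the group from the parent to avoid a startup race. */
        setpgid(pid, pgid);
    }
    close(ready[1]);

    /* Wait until every child has joined the group and reported in. */
    for (int i = 0; i < n; i++) {
        char c;
        if (read(ready[0], &c, 1) != 1)
            break;
    }
    close(ready[0]);

    /* One kill() call: every member of the group gets SIGINT. */
    kill(-pgid, SIGINT);

    while (waitpid(-pgid, &status, 0) > 0) {
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGINT)
            killed++;
    }
    return killed;
}
```

Calling interrupted_children(4) from a small main() should report that all four children died from the same signal. In the reporter's setup, each of those children would be tearing down its AF_XDP socket and UMEM at roughly the same moment while traffic is still arriving.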
See Bugzilla for the full thread and attached reproducer code.
Thanks.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217712
--
An old man doll... just what I always wanted! - Clara