Re: [RFC PATCH 0/5] net: low latency Ethernet device polling

From: Stephen Hemminger
Date: Mon Mar 04 2013 - 12:19:35 EST

In reply to: Eliezer Tamir: "Re: [RFC PATCH 0/5] net: low latency Ethernet device polling"
Next in thread: Ben Hutchings: "Re: [RFC PATCH 0/5] net: low latency Ethernet device polling"

On Wed, 27 Feb 2013 09:55:49 -0800
Eliezer Tamir <eliezer.tamir@xxxxxxxxxxxxxxxxxx> wrote:

> This patchset adds the ability for the socket layer code to poll directly
> on an Ethernet device's RX queue. This eliminates the cost of the interrupt
> and context switch and with proper tuning allows us to get very close
> to the HW latency.
>
> This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from last year
> http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf
>
> Patch 1 adds ndo_ll_poll and the IP code to use it.
> Patch 2 is an example of how TCP can use ndo_ll_poll.
> Patch 3 shows how this method would be implemented for the ixgbe driver.
> Patch 4 adds statistics to the ixgbe driver for ndo_ll_poll events.
> (Optional) Patch 5 is a handy kprobes module to measure detailed latency
> numbers.
>
> This patchset is also available in the following git branch:
> git://github.com/jbrandeb/lls.git rfc
>
> Performance numbers:
> Kernel   Config     C3/6  rx-usecs  TCP  UDP
> 3.8rc6   typical    off   adaptive  37k  40k
> 3.8rc6   typical    off   0*        50k  56k
> 3.8rc6   optimized  off   0*        61k  67k
> 3.8rc6   optimized  on    adaptive  26k  29k
> patched  typical    off   adaptive  70k  78k
> patched  optimized  off   adaptive  79k  88k
> patched  optimized  off   100       84k  92k
> patched  optimized  on    adaptive  83k  91k
> *rx-usecs=0 is usually not useful in a production environment.
>
> Notice that the patched kernel gives good results even with no tweaking.
> Performance for the default configuration is up by almost 100%;
> tuning will get you another 14%. Comparing best-case performance,
> patched vs. unpatched, we are up 36%.
>
> Test setup details:
> Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical NICs
> Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
> Kernel: unmodified 3.8rc6 and patched 3.8rc6
> Config: typical is derived from RH6.2, optimized is a stripped down config
> Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive, 100 us
> C3/6 states were turned on and off through BIOS.
> When C states were on, the performance governor was used.
>
> Design:
> Pointers to a napi_struct were added to both struct sk_buff and struct sock.
> These are used to track which NAPI we need to poll for a specific socket.
> (more about this in the open issues section)
> The device driver marks every incoming skb.
> This info is propagated to the sk when an skb is added to the socket queue.
> When the socket code does not find any more data on the socket queue,
> it now may call ndo_ll_poll which will crank the device's rx queue and feed
> incoming packets to the stack directly from the context of the socket.
> A sysctl value (net.ipv4.ip_low_latency_poll) controls how many cycles we
> busy-wait before giving up. (setting to 0 globally disables busy-polling)
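
The receive path described above can be modeled in user space roughly as
follows. This is only a sketch under stated assumptions, not code from the
patchset: every name in it (model_dev, model_sock, model_ll_poll,
model_busy_wait) is hypothetical, the budget counts poll attempts rather
than CPU cycles, and the "device" is a counter that starts delivering
packets after a fixed number of polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names come from the patchset. */
struct model_dev {
	int pending;       /* packets waiting in the device RX ring */
	int arrive_after;  /* polls that must elapse before the ring has data */
};

struct model_sock {
	int queued;             /* packets on the socket receive queue */
	struct model_dev *dev;  /* device recorded from the last skb, may be NULL */
};

/* Stand-in for ndo_ll_poll: move ring packets to the socket queue.
 * Returns the number of packets moved by this call. */
static int model_ll_poll(struct model_sock *sk)
{
	struct model_dev *dev;
	int moved;

	assert(sk != NULL && sk->dev != NULL);
	dev = sk->dev;
	if (dev->arrive_after > 0) {
		dev->arrive_after--;   /* nothing has arrived yet */
		return 0;
	}
	moved = dev->pending;
	dev->pending = 0;
	sk->queued += moved;
	return moved;
}

/* Busy-wait for data on the socket queue.
 * budget == 0 disables polling, mirroring the sysctl semantics.
 * Returns true if data is available, false if the caller should sleep. */
static bool model_busy_wait(struct model_sock *sk, unsigned long budget)
{
	unsigned long spent = 0;

	assert(sk != NULL);
	if (sk->queued > 0)
		return true;
	if (budget == 0 || sk->dev == NULL)
		return false;

	while (spent < budget) {
		model_ll_poll(sk);
		if (sk->queued > 0)
			return true;
		spent++;
	}
	return false;  /* budget exhausted: fall back to the normal sleep path */
}
```

The point of the sketch is the ordering: check the socket queue first, poll
the device only when the queue is empty and polling is enabled, and give up
once the budget is spent so the caller can sleep as it does today.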
>
> Locking:
> Since what needs to be locked between a device's NAPI poll and ndo_ll_poll
> is highly device- and configuration-dependent, we do this inside the
> Ethernet driver. For example, when packets for high priority connections
> are sent to separate rx queues, you might not need locking at all.
> For ixgbe we only lock the RX queue.
> ndo_ll_poll does not touch the interrupt state or the TX queues.
> (earlier versions of this patchset did touch them,
> but this design is simpler and works better.)
> ndo_ll_poll is called with local BHs disabled.
>
> If a queue is actively polled by a socket (on another CPU) napi poll
> will not service it, but will wait until the queue can be locked
> and cleaned before doing a napi_complete().
> If a socket can't lock the queue because another CPU has it,
> either from NAPI or from another socket polling on it,
> the socket code can busy wait on the socket's skb queue.
> ndo_ll_poll does not have preferential treatment for the data from the
> calling socket vs. data from others, so if another CPU is polling,
> you will see your data on this socket's queue when it arrives.
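
One way to picture the ownership rules above is a per-queue try-lock with
three states. The following is a user-space sketch using C11 atomics, under
the assumption that a single owner field is enough; it is not the ixgbe
implementation, and all names (q_owner, model_queue, q_try_lock,
model_napi_poll, model_sock_poll) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of who currently owns an RX queue. */
enum q_owner { Q_IDLE = 0, Q_NAPI = 1, Q_SOCK = 2 };

struct model_queue {
	atomic_int owner;    /* one of enum q_owner */
	bool napi_deferred;  /* NAPI found the queue busy and must retry */
};

/* Try to take the queue; succeeds only if nobody owns it. */
static bool q_try_lock(struct model_queue *q, enum q_owner who)
{
	int expected = Q_IDLE;

	assert(who != Q_IDLE);
	return atomic_compare_exchange_strong(&q->owner, &expected, (int)who);
}

static void q_unlock(struct model_queue *q)
{
	atomic_store(&q->owner, Q_IDLE);
}

/* NAPI side: returns true if the queue was cleaned and NAPI may complete.
 * If a socket holds the queue, NAPI does not service it and stays
 * scheduled so it can retry later. */
static bool model_napi_poll(struct model_queue *q)
{
	if (!q_try_lock(q, Q_NAPI)) {
		q->napi_deferred = true;
		return false;
	}
	/* ... clean the RX ring here ... */
	q->napi_deferred = false;
	q_unlock(q);
	return true;
}

/* Socket side: returns true if this caller polled the ring itself,
 * false if someone else holds it and the caller should just watch
 * its own socket queue for data delivered by the current owner. */
static bool model_sock_poll(struct model_queue *q)
{
	if (!q_try_lock(q, Q_SOCK))
		return false;
	/* ... feed packets from the RX ring to the stack here ... */
	q_unlock(q);
	return true;
}
```

Neither side ever blocks on the other in this model: the loser of the
try-lock either defers its completion (NAPI) or falls back to watching the
socket queue (socket poll), which matches the behaviour described in the
cover letter.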
>
> Open issues:
> 1. Find a way to avoid the need to change the sk and skb structs.
> One big disadvantage of how we do this right now is that when a device is
> removed, it's hard to prevent it from getting polled by a socket
> which holds a stale reference.
>
> 2. How do we decide which sockets are eligible to do busy polling?
> Do we add a socket option to control this?
> How do we provide sane defaults while allowing flexibility and performance?
>
> 3. Andi Kleen and HPA pointed out that using get_cycles() is not portable.
>
> 4. How and where do we call ndo_ll_poll from the socket code?
> One good place seems to be wherever the kernel puts the process to sleep,
> waiting for more data, but this makes doing something intelligent about
> poll (the system call) hard. From the perspective of how ndo_ll_poll
> itself is implemented this does not seem to matter.
>
> 5. I would like to hear suggestions on naming conventions and where
> to put the code that for now I have put in include/net/ll_poll.h
>
> How to test:
> 1. The patchset should apply cleanly to either net or Linux 3.8
> (don't forget to configure INET_LL_RX_POLL and INET_LL_TCP_POLL).
>
> 2. The ethtool -c setting for rx-usecs should be on the order of 100.
>
> 3. Sysctl value net.ipv4.ip_low_latency_poll controls how long
> (in cycles) to busy-wait for more data. You are encouraged to play
> with this and see what works for you. (Setting it to 0 would
> globally disable the new mechanism altogether.)
>
> 4. The benchmark thread and the IRQ should be bound to separate cores.
> Both cores should be on the same CPU NUMA node as the NIC.
> When the app and the IRQ run on the same CPU you get a ~5% penalty.
> If interrupt coalescing is set to a low value this penalty
> can be very large.
>
> 5. If you suspect that your machine is not configured properly,
> use numademo to make sure that the CPU to memory BW is OK.
> numademo 128m memcpy local copy numbers should be more than
> 8GB/s on a properly configured machine.
>
> Credit:
> Jesse Brandeburg, Arun Chekhov Ilango, Alexander Duyck, Eric Geisler,
> Jason Neighbors, Yadong Li, Mike Polehn, Anil Vasudevan, Don Wood
> Special thanks for finding bugs in earlier versions:
> Willem de Bruijn and Andi Kleen

This is not a criticism of this patch, but it seems that we have gradually
gotten worse and worse at making network devices generic. More and more
features require special-case code in every device driver. This is fine for
Intel devices, but it makes the feature less generally usable and more
error-prone.