Re: Why the number of /proc/interrupts doesn't change when nic is under heavy workload?

From: Eric Dumazet
Date: Sun Jan 15 2012 - 17:09:30 EST


On Sunday, January 15, 2012 at 15:53 -0500, Yuehai Xu wrote:
> Hi All,
>
> The NIC on my server is an Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller, the driver is e1000e, and my Linux version is
> 3.1.4. I have a Memcached server running on this 8-core box, and the
> weird thing is that when the server is under heavy workload, the
> numbers in /proc/interrupts don't change at all. Below are some details:
> =======
> cat /proc/interrupts | grep eth0
> 68:  330887  330861  331432  330544  330346  330227  330830  330575  PCI-MSI-edge  eth0
> =======
> cat /proc/irq/68/smp_affinity
> ff
>
> I know that when the network is under heavy load, NAPI disables the
> NIC interrupt and polls the ring buffer in the NIC. My question is,
> when is the NIC interrupt enabled again? It seems it will never be
> re-enabled as long as the heavy workload continues, simply because the
> numbers shown in /proc/interrupts don't change at all. In my case, one
> of the cores is saturated by ksoftirqd, because lots of softirqs are
> pending on that core. I just want to distribute these softirqs to the
> other cores. Even with RPS enabled, that core is still occupied by
> ksoftirqd at nearly 100%.
>
> I dug into the code and found these statements:
> __napi_schedule ==>
> local_irq_save(flags);
> ____napi_schedule(&__get_cpu_var(softnet_data), n);
> local_irq_restore(flags);
>
> here "local_irq_save" actually invokes "cli" which disable interrupt
> for the local core, is this the one that used in NAPI to disable nic
> interrupt? Personally I don't think it is because it just disables
> local cpu.
>
> I also found "enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable"
> under drivers/net/e1000e. Are these what NAPI uses to disable the NIC
> interrupt? I couldn't find any sign that they are called on the NAPI
> code path.

This is done in the device driver itself, not in generic NAPI code.

When the NAPI poll() handler gets fewer packets than its budget, it
re-enables the chip's interrupts.
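
Roughly, a driver's poll handler follows this pattern (just a sketch:
example_adapter, example_clean_rx_ring() and example_irq_enable() are
placeholder names, not the real e1000e symbols):

	static int example_poll(struct napi_struct *napi, int budget)
	{
		struct example_adapter *adapter =
			container_of(napi, struct example_adapter, napi);
		int work_done;

		/* Drain up to 'budget' packets from the RX ring. */
		work_done = example_clean_rx_ring(adapter, budget);

		if (work_done < budget) {
			/* Ring is quiet: leave polling mode... */
			napi_complete(napi);
			/* ...and let the chip raise interrupts again,
			 * typically by writing its interrupt mask register.
			 */
			example_irq_enable(adapter);
		}

		/* Returning 'budget' keeps the device in polling mode. */
		return work_done;
	}

If I remember the e1000e code correctly, this happens at the end of its
poll routine in netdev.c: napi_complete() followed by e1000_irq_enable()
(or an IMS write in MSI-X mode). Under sustained load work_done never
drops below the budget, so the interrupt stays masked and the
/proc/interrupts counters stay flat.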


>
> My current situation is that the other 7 cores are idle almost 60% of
> the time, while the single core occupied by ksoftirqd is 100% busy.
>

You could post some info, like the output of "cat /proc/net/softnet_stat".

If you use RPS under a very high workload on a single-queue NIC, it is
best to dedicate one cpu (for example cpu0) to packet dispatching, and
the other cpus to IP/UDP handling:

echo 01 >/proc/irq/68/smp_affinity
echo fe >/sys/class/net/eth0/queues/rx-0/rps_cpus
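
(These are hex cpu bitmasks: 01 means cpu0 takes the NIC interrupt,
fe means cpus 1-7 do the RPS packet processing.)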

Please keep in mind that if your memcached uses a single UDP socket, you
probably hit a lot of contention on the socket spinlock and various
counters. So it might be better to _reduce_ the number of cpus handling
the network load, to limit false sharing:

echo 0e >/sys/class/net/eth0/queues/rx-0/rps_cpus

Really, if you have a single UDP queue, the best would be not to use RPS
at all and only have:

echo 01 >/proc/irq/68/smp_affinity

Then you could post the result of "perf top -C 0" so that we can spot
obvious problems on the hot path for this particular cpu.


