[patch V2 00/20] timer: Refactor the timer wheel

From: Thomas Gleixner
Date: Fri Jun 17 2016 - 09:28:35 EST


This is the second version of the timer wheel rework series. The first series
can be found here:

http://lkml.kernel.org/r/20160613070440.950649741@xxxxxxxxxxxxx

The series is also available in git:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers

Changes vs. V1:

- Addressed the review comments of V1

- Fixed the fallout in tty/metag (noticed by Arjan)
- Renamed the hlist helper (noticed by Paolo/George)
- Used the proper mask in get_timer_base() (noticed by Richard)
- Fixed the inverse state check in internal_add_timer() (noticed by Richard)
- Simplified the macro maze, removed wrapper (noticed by George)
- Reordered data retrieval in run_timer() (noticed by George)

- Removed cascading completely

We now have a hard cutoff of expiry times at the capacity of the last wheel
level. Timers which request timeouts longer than that, i.e. ~6 days,
will expire at the cutoff, i.e. after ~6 days. From our data gathering, the
largest timeouts are 5 days (networking conntrack), which is well within
that capacity.

To achieve this capacity with HZ=1000 without increasing the storage size
by another level, we reduced the granularity of the first wheel level from
1ms to 4ms. According to our data, there is no user which relies on that
1ms granularity and 99% of those timers are canceled before expiry.
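
The capacity arithmetic behind the two paragraphs above can be sketched as
follows. This is a rough illustration, not the code from the series: the
geometry (64 buckets per level, each level 8 times coarser than the previous
one, 8 levels) is an assumption that is not stated in this mail, and all
sketch_* names are hypothetical. The real cutoff may differ by up to one
bucket of the last level.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed wheel geometry; none of these values are stated in this mail. */
#define SKETCH_LVL_BITS		6	/* 64 buckets per level */
#define SKETCH_LVL_SHIFT	3	/* each level is 8x coarser than the one below */
#define SKETCH_LVL_DEPTH	8	/* number of levels */

/* Width of one bucket at @level in ms, given the first-level granularity. */
static uint64_t sketch_lvl_gran_ms(unsigned int level, uint64_t base_gran_ms)
{
	assert(level < SKETCH_LVL_DEPTH);
	return base_gran_ms << (SKETCH_LVL_SHIFT * level);
}

/* Approximate reach of the last level in ms: bucket width times bucket count. */
static uint64_t sketch_wheel_capacity_ms(uint64_t base_gran_ms)
{
	return sketch_lvl_gran_ms(SKETCH_LVL_DEPTH - 1, base_gran_ms)
		<< SKETCH_LVL_BITS;
}

/* Hard cutoff: timeouts beyond the capacity expire at the capacity. */
static uint64_t sketch_clamp_timeout_ms(uint64_t timeout_ms,
					uint64_t base_gran_ms)
{
	uint64_t cap = sketch_wheel_capacity_ms(base_gran_ms);

	return timeout_ms > cap ? cap : timeout_ms;
}
```

Under these assumptions a 4ms first level reaches 536870912 ms (about 6.2
days), while a 1ms first level with the same number of levels reaches only
134217728 ms (about 1.55 days), which is why the 5 day timeouts would
otherwise need another level.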

As a side effect there is the benefit of better batching in the first level
which helps networking to avoid rearming timers in the hotpath.
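
A minimal sketch of the batching effect, assuming that an expiry time is
rounded up to the next multiple of the first-level granularity so that a
timer never fires early. The helper names are hypothetical and the rounding
rule is an assumption, not taken from the patches.

```c
#include <stdint.h>

/*
 * Round an expiry time up to the next multiple of the bucket granularity.
 * Assumption: timers may fire late by up to one bucket, but never early.
 */
static uint64_t sketch_round_up_expiry(uint64_t expires_ms, uint64_t gran_ms)
{
	return (expires_ms + gran_ms - 1) / gran_ms * gran_ms;
}

/* Two expiry times batch together if they round to the same bucket. */
static int sketch_same_bucket(uint64_t a_ms, uint64_t b_ms, uint64_t gran_ms)
{
	return sketch_round_up_expiry(a_ms, gran_ms) ==
	       sketch_round_up_expiry(b_ms, gran_ms);
}
```

With 4ms granularity, expiry times of 1001ms and 1003ms both round to 1004ms
and share a bucket; with 1ms granularity they do not. A re-arm whose new
expiry rounds to the bucket the timer already occupies would presumably not
need to move the timer at all.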

We gathered more data about performance and batching. Compared to mainline the
following changes have been observed:

- The bad outliers in mainline when the timer wheel needs to be forwarded
after a long idle sleep are completely gone.

- The total CPU time used for timer softirq processing is significantly
reduced. Depending on the HZ setting and workload, the reduction ranges
from a factor of 2 to 6.

- The average invocation period of the timer softirq on an idle system
increases significantly. Depending on the HZ setting and workload, the
increase ranges from a factor of 1.5 to 5. That means that the residency
in deep C-states should be improved. I have not yet had time to verify
this with the power tools.

Thanks,

tglx

---
arch/x86/kernel/apic/x2apic_uv_x.c | 4
arch/x86/kernel/cpu/mcheck/mce.c | 4
block/genhd.c | 5
drivers/cpufreq/powernv-cpufreq.c | 5
drivers/mmc/host/jz4740_mmc.c | 2
drivers/net/ethernet/tile/tilepro.c | 4
drivers/power/bq27xxx_battery.c | 5
drivers/tty/metag_da.c | 4
drivers/tty/mips_ejtag_fdc.c | 4
drivers/usb/host/ohci-hcd.c | 1
drivers/usb/host/xhci.c | 2
include/linux/list.h | 10
include/linux/timer.h | 30
kernel/time/tick-internal.h | 1
kernel/time/tick-sched.c | 46 -
kernel/time/timer.c | 1099 +++++++++++++++++++++---------------
lib/random32.c | 1
net/ipv4/inet_connection_sock.c | 7
net/ipv4/inet_timewait_sock.c | 5
19 files changed, 725 insertions(+), 514 deletions(-)