[patch 00/20] timer: Refactor the timer wheel

From: Thomas Gleixner
Date: Mon Jun 13 2016 - 04:47:21 EST


The current timer wheel has some drawbacks:

1) Cascading

Cascading can be an unbound operation and is completely pointless in most
cases because the vast majority of the timer wheel timers are canceled or
rearmed before expiration.

2) No fast lookup of the next expiring timer

In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
must fast forward the base time to current jiffies. As we have no way to
find the next expiring timer fast, the code loops and increments the base
time by one and checks for expired timers in each step. I've observed loops
lasting 1 ms!

There are some other issues caused by the above, but they are minor compare to
those.

After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.

The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kind of timeouts the
accuracy is not really a concern. In case the timeout fires, performance is
down the drain already.

The few timers which actually expire can be split into two categories:

1) Short expiry times which expect halfways accurate expiry

2) Long term expiry times are inaccurate today already due to the batching
which is done for NOHZ.

So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or cancelation.

Contrary to the classic wheel the granularities of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel. So for
HZ=250 we end up with the following granularity levels:

Level Offset Granularity Range
0 0 4 ms 0 ms - 252 ms
1 64 32 ms 256 ms - 2044 ms (256ms - ~2s)
2 128 256 ms 2048 ms - 16380 ms (~2s - ~16s)
3 192 2048 ms (~2s) 16384 ms - 131068 ms (~16s - ~2m)
4 256 16384 ms (~16s) 131072 ms - 1048572 ms (~2m - ~17m)
5 320 131072 ms (~2m) 1048576 ms - 8388604 ms (~17m - ~2h)

That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level.

So the new wheel concept addresses the old issues:

1) Cascading is avoided (except for extreme long time timers)

2) By keeping the timers in the bucket until expiry/cancelation we can track
the buckets which have timers enqueued in a bucket bitmap and therefor can
lookup the next expiring timer fast and time bound.

A further benefit of the concept is, that the slack calculation which is done
on every timer start is not longer necessary because the granularity levels
provide natural batching already.

Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.

This series has also preparatory patches for changing the NOHZ timer handling
from the current push to a pull model. Currently we decide at timer enqueue
time on which cpu we queue the timer. This is exceptionally silly because
there is no way to predict at enqueue time which cpu will be idle when the
timer expires. Given the fact that most timers are canceled or rearmed before
expiry this is even more silly. We trade a expensive decision and cross cpu
access for a very doubtful benefit.

The solution to this is to store the migrateable timers in a seperate storage
space on the local cpu and when the cpu goes idle tell the others about its
next expiring timer and let the other cpu pull the timer in case of expiry. We
have a proof of concept implementation for this, but it's not yet ready for
posting.

Thanks,

tglx
---
arch/x86/kernel/apic/x2apic_uv_x.c | 4
arch/x86/kernel/cpu/mcheck/mce.c | 4
block/genhd.c | 5
drivers/cpufreq/powernv-cpufreq.c | 5
drivers/mmc/host/jz4740_mmc.c | 2
drivers/net/ethernet/tile/tilepro.c | 4
drivers/power/bq27xxx_battery.c | 5
drivers/tty/metag_da.c | 4
drivers/tty/mips_ejtag_fdc.c | 4
drivers/usb/host/ohci-hcd.c | 1
drivers/usb/host/xhci.c | 2
include/linux/list.h | 10
include/linux/timer.h | 30
include/trace/events/timer.h | 11
kernel/time/tick-internal.h | 1
kernel/time/tick-sched.c | 46 -
kernel/time/timer.c | 1090 +++++++++++++++++++++---------------
lib/random32.c | 1
net/ipv4/inet_connection_sock.c | 7
net/ipv4/inet_timewait_sock.c | 5
20 files changed, 731 insertions(+), 510 deletions(-)