[patch 0/7] timers: Footprint diet and NOHZ overhead mitigation

From: Thomas Gleixner
Date: Tue May 26 2015 - 18:52:10 EST


I still have a couple of patches against the timer code in my review
list, but the more I look at them the more horrible I find them.

All these patches are related to the NOHZ stuff and try to mitigate
shortcomings with even more bandaids. And of course these bandaids
cost cycles and are a burden for timer heavy use cases like
networking.

Sadly enough the NOHZ crowd is happy to throw more and more crap at
the kernel and none of these people is even thinking about whether
this stuff could be solved in a different way.

After Eric mentioned in one of the recent discussions that the
timer_migration sysctl is not a lot of gain, I tried to mitigate at
least that issue. That caused me to look deeper and I came up with the
following patch series which:

- Clean up the sloppy catchup timer jiffies stuff

- Replace the hash bucket lists by hlists, which shrinks the wheel
footprint by 50%

- Replace the timer base pointer in struct timer_list by a cpu
index, which shrinks struct timer_list by 4 - 8 bytes depending on
alignment and architecture.

- Cache nohz_active and timer_migration in the per cpu timer bases,
so we can avoid jumping through hoops to figure that out.

This series applies against

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core

With a pretty naive timer test module and a SCHED_OTHER cpu hog on an
isolated core I verified that I did not wreck anything. The
following table shows the resulting CPU time of the hog for the
various scenarios.

                nohz=on             nohz=on              nohz=off
                timer_migration=on  timer_migration=off

Unpatched:      46.57%              46.87%               46.64%

Patched:        47.76%              48.20%               48.73%

Though, I do not trust my numbers, so I really hope that Eric or any
other power user of the timer wheel can throw a bunch of tests at it.


Now some more rant about the whole NOHZ issues. The timer wheel is
fundamentally unfriendly for NOHZ because:

- it's impossible to keep reliably track of the next expiring timer

- finding the next expiring timer is in the worst case O(N) and
involves touching a gazillion of cache lines

The other disaster area is the nohz timer migration mechanism. I think
it's fundamentally broken.

- We decide at timer enqueue time on which CPU we queue the timer,
based on cpu idle heuristics which even fail to recognize that a
cpu is really busy with interrupt and softirq processing (reported
by Eric).

- When moving a timer to some "non-idle" CPU we speculate about the
system situation in the future (at timer expiry time). This kinda
works for limited scenarios (NOHZ_FULL and constrained power saving
on mobile devices). But it is pretty much nonsensical for
everything else. For network heavy workloads it can even be
outright wrong and dangerous as Eric explained.

So we really need to put a full stop to all this NOHZ tinkering and
think about proper solutions. I'm not going to merge any NOHZ related
features^Whacks until the existing problems are solved. In
hindsight I should have done that earlier, but it's never too late.

So we need to address two issues:

1) Keeping track of the first expiring timer in the wheel.

Don't even think about rbtrees or such, it just won't work, but I'm
willing to accept proof of the contrary.

One of the ideas I had a long time ago is a wheel implementation
which never cascades and therefore provides different levels of
granularity per wheel level.

LVL0 1 jiffy granularity
LVL1 8 jiffies granularity
LVL2 64 jiffies granularity
....

This requires more levels than the classic timer wheel, so it's not
feasible from a memory consumption POV.

But we can have a compromise and put all 'soon' expiring timers
into these fancy new wheels and stick all 'far out' timers into the
last level of the wheel and cascade them occasionally.

That should work because:

- Most timers are short term expiry (< 1h)
- Most timers are canceled before expiry

So we need a sensible upper bound of levels and get the following
properties:

- Natural batching (no more slack magic). This might also help
networking to avoid rearming of timers.

- Far out timers are queued in the slowest wheel and
occasionally cascaded with the granularity of the slowest
wheel

- Tracking the next expiring timer can be done with a bitmap,
so 'get next expiring timer' becomes O(1) without
touching anything other than the bitmap, if we accept that
the upper limit of what we can retrieve in O(1) is the
granularity of the last level, i.e. we treat timers which
need recascading simply as normal inhabitants of the last
wheel level.

- The expiry code (run_timers) gets more complicated as it
needs to walk through the different levels more often, but
with the bitmap that shouldn't be too big an issue.

- Separate storage for non-deferrable and deferrable timers.

I spent some time coding that up. It barely boots and certainly
has a few bugs left and right, but I will send it out
nevertheless as a reply to this mail to get the discussion going.

2) The timer migration problem

I think we need to address this from a different angle.

Joonwoo tried to solve the deferrable timer issue for non pinned
timers by introducing a global timer wheel for those. That sucks
and obviously adds more overhead to the fast paths. But the idea
itself that the non idle cpus consume events from the idle ones is
not the worst. A global wheel might work for the few deferrable
timers, but it won't work for anything else.

So instead of deciding at enqueue time, we should come up with
different wheels for the different types of timers. That way we
could keep the locality for the networking scenario, and at the
same time have a way to delegate all non bound timers of a cpu to
some other cpu at idle time or pull them from the other cpu when
requested.

I haven't thought that through fully yet, but it's something we
should at least investigate thoroughly. I won't do that for the next
10 days as I'm about to leave for vacation and conferencing.

Thanks,

tglx
---
 include/linux/hrtimer.h      |    4
 include/linux/sched.h        |    6
 include/linux/sched/sysctl.h |   12 -
 include/linux/timer.h        |   56 ++++----
 include/trace/events/timer.h |   13 --
 kernel/rcu/tree_plugin.h     |    2
 kernel/sched/core.c          |    9 -
 kernel/sysctl.c              |   18 +-
 kernel/time/hrtimer.c        |   38 ++++--
 kernel/time/tick-internal.h  |   14 ++
 kernel/time/tick-sched.c     |   25 ++-
 kernel/time/timer.c          |  271 ++++++++++++++++++++-----------------
 kernel/time/timer_list.c     |    2
 kernel/time/timer_stats.c    |   10 -
 14 files changed, 244 insertions(+), 236 deletions(-)


