Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count

From: Chris Metcalf
Date: Wed Nov 16 2016 - 17:49:40 EST


On 11/16/2016 2:59 PM, John Stultz wrote:
On Wed, Nov 16, 2016 at 11:35 AM, Chris Metcalf <cmetcalf@xxxxxxxxxxxx> wrote:
For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits. For example,
the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
we have mult = 853, and after 208.5 days, we overflow 64 bits.

Since clocksource_cyc2ns() is intended to be used for relative
cycle counts, not absolute cycle counts, performance is more
importance than accepting a wider range of cycle values.
So, just use mult_frac() directly in tile's sched_clock().

Signed-off-by: Chris Metcalf <cmetcalf@xxxxxxxxxxxx>
---
Blackfin should make a similar change in their sched_clock().

arch/tile/kernel/time.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..ea960d660917 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -218,8 +218,8 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num)
*/
unsigned long long sched_clock(void)
{
- return clocksource_cyc2ns(get_cycles(),
- sched_clock_mult, SCHED_CLOCK_SHIFT);
+ return mult_frac(get_cycles(),
+ sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT);
}
So... looking closer at mult_frac(), its a really slow implementation,
doing 2 divs and a mod and a mult. Hopefully the compiler can sort out
the divs are power of two, and optimize it out, but I'm still
hesitant.

sched_clock() is normally a very hot-path call, so this might have a
real performance impact, especially compared to what its replacing.

In your earlier patch, you mentioned this was similar to 4cecf6d401a0
("sched, x86: Avoid unnecessary overflow in
sched_clock"). It might be better to actually try to use similar logic
there, to make sure the performance impact is minimal.

This was the first thing I looked at when I saw the mult_frac()
implementation. The modulus operations are indeed converted to
bitmasks and the divides to shifts. We do have to do two multiplies
instead of one, but that's basically the worst of the cost.

Change 4cecf6d401a0 results in essentially identical code for x86 as
this proposed change does for tile. In fact a follow-on change by
Salman introduced mult_frac() and switched to using it, so it was
identical at that point.

PeterZ (cc'ed) then improved it to use __int128 math via
mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
instead of two, but the multiply is handled by an out-of-line call to
__multi3, and the sched_clock() function ends up about 2.5x slower as
a result.

Thanks for thinking about this!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com