Re: [patch V2 00/20] timer: Refactor the timer wheel

From: Eric Dumazet
Date: Fri Jun 17 2016 - 10:25:29 EST


On Fri, Jun 17, 2016 at 6:57 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Fri, 17 Jun 2016, Eric Dumazet wrote:
>> >
>> > To achieve this capacity with HZ=1000 without increasing the storage size
>> > by another level, we reduced the granularity of the first wheel level from
>> > 1ms to 4ms. According to our data, there is no user which relies on that
>> > 1ms granularity and 99% of those timers are canceled before expiry.
>> >
>>
>> Ah... This might be a problem for people using small TCP RTO timers in
>> datacenters (order of 5 ms)
>> (and small delay ack timers as well, in the order of 4 ms)
>>
>> TCP/pacing uses high resolution timer in sch_fq.c so no problem there.
>>
>> If we arm a timer for 5 ms, what are the exact consequences ?
>
> The worst case expiry time is 8ms on HZ=1000 as it is on HZ=250
>
>> I fear we might trigger lot more of spurious retransmits.
>>
>> Or maybe I should read the patch series. I'll take some time today.
>
> Maybe just throw it at such a workload and see what happens :)

Well, when network congestion happens in a cluster and hundreds of
millions of RTO timers fire, adding fuel to the fire, it is a
nightmare already ;)

To avoid increasing the probability of such events, we would need to
have at least 4 ms difference between the RTO timer and the delack timer.

That means we would have to increase both of them, which increases the
P99 latencies of RPC workloads.
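
To make the arithmetic concrete, here is a rough toy model (plain
Python, not kernel code). It assumes a timer's absolute expiry is simply
rounded up to the next 4 ms boundary and that both timers are armed at
the same instant, which ignores network delay. The function names are
made up for illustration and do not appear in the patch series.

```python
# Toy model of a 4 ms first-level wheel granularity (HZ=1000 case).
# Assumption: the absolute expiry is rounded UP to the next multiple of
# the granularity. All names here are hypothetical.

GRANULARITY_MS = 4


def expiry_ms(armed_at_ms, timeout_ms, gran_ms=GRANULARITY_MS):
    """Absolute time at which the timer fires in this model."""
    deadline = armed_at_ms + timeout_ms
    # Ceiling division, then scale back to milliseconds.
    return -(-deadline // gran_ms) * gran_ms


def worst_case_delay_ms(timeout_ms, gran_ms=GRANULARITY_MS):
    """Largest arm-to-fire delay over every arming phase in a bucket."""
    return max(expiry_ms(t, timeout_ms, gran_ms) - t for t in range(gran_ms))


def can_share_bucket(delack_ms, rto_ms, gran_ms=GRANULARITY_MS):
    """True if some arming phase makes both timers fire at the same time."""
    return any(
        expiry_ms(t, delack_ms, gran_ms) == expiry_ms(t, rto_ms, gran_ms)
        for t in range(gran_ms)
    )


if __name__ == "__main__":
    print("5 ms timer, worst-case delay:", worst_case_delay_ms(5), "ms")
    print("delack 4 ms / RTO 5 ms can collide:", can_share_bucket(4, 5))
    print("delack 4 ms / RTO 8 ms can collide:", can_share_bucket(4, 8))
```

In this model a 5 ms timer can take up to 8 ms, matching the worst case
quoted above, and a 4 ms delack timer and a 5 ms RTO timer can fire in
the same slot unless they are at least 4 ms apart.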

Maybe a switch to hrtimer would be less risky.
But I do not know yet if it is doable without a big performance penalty.