Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

From: Lorenzo Colitti
Date: Thu Apr 22 2021 - 10:20:27 EST


On Thu, Apr 22, 2021 at 9:08 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> Just for comparison. In a VM I'm experimenting with right now the
> reprogramming time is ~500ns which is still a lot of cycles, but
> compared to 5us faster by an order of magnitude. And on the sane
> machine bare metal its way faster and therefore less noticeable.

FWIW, on this hardware, frtrace says that arming the arm64 architected
timer takes 0.7us. Definitely better than 2-3us, but still not free.
This is not a high-end desktop or server, but it's also not super
slow, low-power hardware.

> * The transmit should only be run if no skb data has been sent for a
> * certain duration.
>
> which is useless word salad.

You're the one who wrote that comment - see b1a31a5f5f27. You'll
forgive me for being amused. :-)

Thanks for the history/analysis/suggestions. I think it's a fact that
this is a regression in performance: this particular code has
performed well for a couple of years now. The fact that the good
performance only existed due to a correctness bug in the hrtimer code
definitely does make it harder to argue that the regression should be
reverted.

That said: if you have a fix for the double reprogram, then that fix
should probably be applied? 0.5us is not free, and even if hrtimers
aren't designed for frequent updates, touching the hardware twice as
often does seem like a bad idea, since, as you point out, there's a
*lot* of hardware that is slow.

Separately, we're also going to look at making ncm better. (In defense
of the original author, in 2014 I don't think anyone would even have
dreamed of USB being fast enough for this to be a problem.) The first
thing we're going to try to do is set the timer once per NTB instead
of once per packet (so, 10x less). My initial attempt to do that
causes the link to die after a while and I need to figure out why
before I can send a patch up. I'm suspicious of the threading, which
uses non-atomic variables (timer_force_tx, ncm->skb_tx_data) to
synchronize control flow between the timer and the transmit function,
which can presumably run on different CPUs. That seems wrong since
either core could observe stale variables. But perhaps there are
memory barriers that I'm not aware of.

The idea of getting rid of the timer by doing aggregation based on
transmit queue lengths seems like a much larger effort, but probably
one that is required to actually improve performance substantially
beyond what it is now.