Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs

From: Pawan Gupta

Date: Wed Dec 03 2025 - 20:40:33 EST

On Tue, Nov 25, 2025 at 11:34:07AM +0000, david laight wrote:
> On Mon, 24 Nov 2025 11:31:26 -0800
> Pawan Gupta <pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
>
> > On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> ...
> > > For subtle reasons one of the mitigations that slows kernel entry caused
> > > a doubling of the execution time of a largely single-threaded task that
> > > spends almost all its time in userspace!
> > > (I thought I'd disabled it at compile time - but the config option
> > > changed underneath me...)
> >
> > That is surprising. If it's okay, could you please share more details about
> > this application? Or any other way I can reproduce this?
>
> The 'trigger' program is a multi-threaded program that wakes up every 10ms
> to process RTP and TDM audio data.
> So we have a low RT priority process with one thread per cpu.
> Since they are RT they usually get scheduled on the same cpu as last time.
> I think this simple program will have the desired effect:
> A main process that does:
>     syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
>     start_time += 1sec;
>     for (n = 1; n < num_cpu; n++)
>         pthread_create(thread_code, start_time);
>     thread_code(start_time);
> with:
>     thread_code(ts)
>     {
>         for (;;) {
>             ts += 10ms;
>             syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
>             do_work();
>         }
>     }
>
> So all the threads wake up at exactly the same time every 10ms.
> (You need to use syscall(); don't look at what glibc does.)
>
> On my system the program wasn't doing anything, so do_work() was empty.
> What matters is whether all the threads end up running at the same time.
> I managed that using pthread_cond_broadcast(), but the clock code above
> ought to be worse (and I've since changed the daemon to work that way
> to avoid all these issues with pthread_cond_broadcast() being sequential
> and threads not running because the target cpu is running an ISR or
> just looping in kernel).
>
> The process that gets 'hit' is anything cpu bound.
> Even a shell loop (e.g. while :; do :; done) with a counter added will do.
>
> Without the 'trigger' program, it will (mostly) sit on one cpu and the
> clock frequency of that cpu will increase to (say) 3GHz while the others
> all run at 800MHz.
> But the 'trigger' program runs threads on all the cpus at the same time.
> So the 'hit' program is pre-empted and is later rescheduled on a
> different cpu - running at 800MHz.
> The cpu speed increases, but 10ms later it gets bounced again.
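A C sketch of the 'trigger' program above, filling in the pseudocode under some assumptions. run_trigger(), struct trigger_arg and PERIOD_NS are hypothetical names. Each thread runs a fixed number of periods instead of forever, so the sketch terminates. "Low RT priority" is approximated as SCHED_FIFO priority 1, which is silently skipped without CAP_SYS_NICE. It keeps the two details the report stresses: raw syscall() for both clock calls, and one absolute TIMER_ABSTIME deadline shared by all threads so they wake together.

```c
#define _GNU_SOURCE             /* for syscall() */
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define PERIOD_NS (10 * 1000 * 1000L)   /* 10ms between wakeups */

/* Hypothetical per-thread state for this sketch. */
struct trigger_arg {
	struct timespec start;  /* shared absolute start time */
	int periods;            /* bounded so the sketch terminates */
	int wakeups;            /* successful wakeups counted */
};

static void ts_add_ns(struct timespec *ts, long ns)
{
	ts->tv_nsec += ns;
	while (ts->tv_nsec >= 1000000000L) {
		ts->tv_nsec -= 1000000000L;
		ts->tv_sec++;
	}
}

static void do_work(void)
{
	/* Empty, as in the report: only the simultaneous wakeups matter. */
}

static void *thread_code(void *p)
{
	struct trigger_arg *arg = p;
	struct timespec ts = arg->start;
	struct sched_param sp = { .sched_priority = 1 };

	/* Assumed "low RT priority"; the call fails harmlessly without
	 * CAP_SYS_NICE. If it succeeds, the calling thread keeps it. */
	pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

	for (int i = 0; i < arg->periods; i++) {
		ts_add_ns(&ts, PERIOD_NS);
		/* Absolute deadline: every thread wakes at the same instant. */
		if (syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC,
			    TIMER_ABSTIME, &ts, NULL) == 0)
			arg->wakeups++;
		do_work();
	}
	return NULL;
}

/* Run nthreads threads (one per CPU in the report) for 'periods' wakeups
 * each; return the total number of wakeups, or -1 on allocation failure. */
int run_trigger(int nthreads, int periods)
{
	struct timespec start;
	pthread_t *tids;
	struct trigger_arg *args;
	int created = 1;        /* thread 0 is the calling thread */
	int total = 0;

	if (nthreads < 1)
		return 0;
	tids = calloc(nthreads, sizeof(*tids));
	args = calloc(nthreads, sizeof(*args));
	if (!tids || !args) {
		free(tids);
		free(args);
		return -1;
	}

	syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start);
	start.tv_sec += 1;      /* time for every thread to be created */
	for (int n = 0; n < nthreads; n++) {
		args[n].start = start;
		args[n].periods = periods;
	}

	while (created < nthreads &&
	       pthread_create(&tids[created], NULL, thread_code, &args[created]) == 0)
		created++;
	thread_code(&args[0]);

	for (int n = 0; n < created; n++) {
		if (n > 0)
			pthread_join(tids[n], NULL);
		total += args[n].wakeups;
	}
	free(tids);
	free(args);
	return total;
}
```

For the actual experiment, something like run_trigger(sysconf(_SC_NPROCESSORS_ONLN), 100000) called from main() would keep it running for roughly 17 minutes alongside the 'hit' program.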

Sorry, I haven't tried creating this test yet.

> The real issue is that the cpu speed is associated with the cpu, not
> the process running on it.

So if the 'hit' program gets scheduled on a CPU that is already running at
3GHz, then we wouldn't expect a dramatic performance drop? Setting
scaling_governor to "performance" would be an interesting test.
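One way to probe that question, as a hedged C sketch: a CPU-bound 'hit' loop that counts work done. On each migration reported by sched_getcpu(), it logs the destination CPU's scaling_cur_freq from the cpufreq sysfs interface. run_hit(), struct hit_stats, cpu_cur_khz() and the 100000-iteration chunk are hypothetical choices for illustration. scaling_cur_freq may be missing (e.g. in some VMs), in which case -1 is reported.

```c
#define _GNU_SOURCE             /* for sched_getcpu() */
#include <sched.h>
#include <stdio.h>
#include <time.h>

/* Current frequency of 'cpu' in kHz from the cpufreq sysfs interface,
 * or -1 if it is not exposed or cannot be parsed. */
static long cpu_cur_khz(int cpu)
{
	char path[96];
	long khz = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &khz) != 1)
		khz = -1;
	fclose(f);
	return khz;
}

/* Monotonic time in seconds. */
static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

struct hit_stats {
	unsigned long long iterations;  /* busy-work chunks completed */
	unsigned migrations;            /* times sched_getcpu() changed */
};

/* Spin for 'seconds', counting chunks of busy work and noting each time
 * the scheduler moves us to another CPU (and that CPU's frequency). */
struct hit_stats run_hit(double seconds, int verbose)
{
	struct hit_stats st = { 0, 0 };
	int cpu = sched_getcpu();
	double end = now_sec() + seconds;

	while (now_sec() < end) {
		for (volatile int i = 0; i < 100000; i++)
			;       /* busy work, like the shell counter loop */
		st.iterations++;

		int c = sched_getcpu();
		if (c != cpu) {
			st.migrations++;
			if (verbose)
				printf("cpu%d -> cpu%d, now at %ld kHz\n",
				       cpu, c, cpu_cur_khz(c));
			cpu = c;
		}
	}
	return st;
}
```

Running it verbosely next to the trigger program twice, once with the default governor and once after `cpupower frequency-set -g performance` (as root), would show whether the slowdown tracks landing on a CPU that is still at a low frequency.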