Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

From: Ingo Molnar
Date: Tue Sep 12 2023 - 03:38:56 EST



* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Mon, Sep 11, 2023 at 02:16:18PM -0700, Linus Torvalds wrote:
> > On Mon, 11 Sept 2023 at 13:50, Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Except we've actually been *adding* to this whole mess, rather than
> > > removing it. So we have actively *expanded* on that preemption choice
> > > with PREEMPT_DYNAMIC.
> >
> > Actually, that config option makes no sense.
> >
> > It makes the sched_cond() behavior conditional with a static call.
> >
> > But all the *real* overhead is still there and unconditional (ie all
> > the preempt count updates and the "did it go down to zero and we need
> > to check" code).
> >
> > That just seems stupid. It seems to have all the overhead of a
> > preemptible kernel, just not doing the preemption.
> >
> > So I must be mis-reading this, or just missing something important.
> >
> > The real cost seems to be
> >
> > PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT
> >
> > and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both
> > will end up with that, and thus both cases will have all the spinlock
> > preempt count stuff.
> >
> > There must be some non-preempt_count cost that people worry about.
> >
> > Or maybe I'm just mis-reading the Kconfig stuff entirely. That's
> > possible, because this seems *so* pointless to me.
> >
> > Somebody please hit me with a clue-bat to the noggin.
>
> Well, I was about to reply to your previous email explaining this, but
> this one time I did read more email..
>
> Yes, PREEMPT_DYNAMIC has all the preempt count twiddling and only nops
> out the schedule()/cond_resched() calls where appropriate.
>
> This work was done by a distro (SuSE) and if they're willing to ship this
> I'm thinking the overheads are acceptable to them.
>
> For a significant number of workloads the real overhead is the extra
> preemptions themselves more than the counting -- but yes, the counting is
> measurable, but probably in the noise compared to some of the other
> horrible things we have done in the past years.
>
> Anyway, if distros are fine shipping with PREEMPT_DYNAMIC, then yes,
> deleting the other options is definitely an option.

Yes, so my understanding is that distros generally worry more about
macro-overhead, for example material changes to a random subset of key
benchmarks that specific enterprise customers care about, and that they are
not nearly as sensitive to the micro-overhead that preempt_count
maintenance causes.

PREEMPT_DYNAMIC is basically a reflection of that: the desire to ship only
a single kernel image, with a boot-time toggle that selects between
CONFIG_PREEMPT behavior (desktop loads) and PREEMPT_VOLUNTARY behavior
(server loads).
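
As a rough illustration of the split discussed above, here is a userspace
model, not kernel code. Every model_* name is made up for this sketch, a
plain function pointer stands in for the static call, and only the
"voluntary" and "full" modes are modelled. It shows that the count update
and the zero test run in both modes, while the mode only selects where a
reschedule may actually happen:

```c
/* Userspace model only: all model_* names are hypothetical, not kernel APIs. */
#include <stdbool.h>
#include <string.h>

static int  model_preempt_count;     /* maintained in every mode */
static bool model_need_resched;      /* stand-in for a "need resched" flag */
static int  model_schedule_calls;    /* counts simulated reschedules */
static bool model_preempt_on_enable; /* true only in "full" mode */

static void model_schedule(void)
{
	model_schedule_calls++;
	model_need_resched = false;
}

/* Target used in "voluntary" mode: reschedule at explicit points. */
static int cond_resched_real(void)
{
	if (model_need_resched && model_preempt_count == 0) {
		model_schedule();
		return 1;
	}
	return 0;
}

/* Target used in "full" mode: the explicit points become no-ops. */
static int cond_resched_nop(void)
{
	return 0;
}

/* A function pointer stands in for the static call (assumption). */
static int (*model_cond_resched)(void) = cond_resched_real;

/* Boot-time style selection; returns 0 on success, -1 on unknown mode. */
static int model_set_preempt_mode(const char *mode)
{
	if (strcmp(mode, "voluntary") == 0) {
		model_cond_resched = cond_resched_real;
		model_preempt_on_enable = false;
	} else if (strcmp(mode, "full") == 0) {
		model_cond_resched = cond_resched_nop;
		model_preempt_on_enable = true;
	} else {
		return -1;
	}
	return 0;
}

static void model_preempt_disable(void)
{
	model_preempt_count++;
}

static void model_preempt_enable(void)
{
	/* The decrement and the zero test are paid regardless of mode. */
	if (--model_preempt_count == 0 &&
	    model_need_resched && model_preempt_on_enable)
		model_schedule();
}
```

In this model, switching modes never removes the increment/decrement pair
in model_preempt_disable()/model_preempt_enable(), which is the
micro-overhead being discussed. It only changes whether the reschedule
happens at the enable point or at the explicit model_cond_resched() call.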

There's also the view that PREEMPT kernels are a bit more QA-friendly,
because atomic code sequences are much better defined & enforced via kernel
warnings. Without preempt_count we only have irqs-off warnings, which cover
only a small fraction of all critical sections in the kernel.
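
To make the QA point concrete, here is a small userspace sketch, again
with made-up dbg_* names and not the kernel's actual debug code. It models
a "sleeping while atomic" check that can see lock-held sections only when
a preempt count is available, and otherwise catches only irqs-off
sections:

```c
/* Userspace model only: all dbg_* names are hypothetical, not kernel APIs. */
#include <stdbool.h>

static int  dbg_preempt_count; /* bumped by every modelled lock section */
static bool dbg_irqs_off;      /* modelled interrupt-disabled state */
static int  dbg_warnings;      /* number of warnings the checker raised */

static void dbg_spin_lock(void)
{
	dbg_preempt_count++;
}

static void dbg_spin_unlock(void)
{
	dbg_preempt_count--;
}

/*
 * Modelled "might sleep" check. have_preempt_count says whether the
 * checker is allowed to look at the count, i.e. whether the build
 * maintains one at all (an assumption of this sketch).
 */
static void dbg_might_sleep(bool have_preempt_count)
{
	bool atomic = dbg_irqs_off;

	if (have_preempt_count && dbg_preempt_count > 0)
		atomic = true;

	if (atomic)
		dbg_warnings++;
}
```

With the lock held and interrupts on, dbg_might_sleep(false) stays silent
while dbg_might_sleep(true) warns, which is the extra coverage described
above.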

Ideally we'd be able to patch out most of the preempt_count maintenance
overhead too - OTOH these days it's little more than noise on most CPUs,
considering the kind of horrible security-workaround overhead we have on
almost all x86 CPU types ... :-/

Thanks,

Ingo