Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

From: Ankur Arora
Date: Sun Sep 10 2023 - 06:01:52 EST



Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>>
>> I think we can keep these checks, but with this fixed definition of
>> resched_allowed(). This might be better:
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
>>
>> static __always_inline bool resched_allowed(void)
>> {
>> - return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>> + return unlikely(!preempt_count() &&
>> + test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>> }
>
> I'm not convinced (at all) that the preempt count is the right thing.
>
> It works for interrupts, yes, because interrupts will increment the
> preempt count even on non-preempt kernels (since the preempt count is
> also the interrupt context level).
>
> But what about any synchronous trap handling?
>
> In other words, just something like a page fault? A page fault doesn't
> increment the preemption count (and in fact many page faults _will_
> obviously re-schedule as part of waiting for IO).
>
> A page fault can *itself* say "feel free to preempt me", and that's one thing.
>
> But a page fault can also *interupt* something that said "feel free to
> preempt me", and that's a completely *different* thing.
>
> So I feel like the "tsk_thread_flag" was sadly completely the wrong
> place to add this bit to, and the wrong place to test it in. What we
> really want is "current kernel entry context".

So, what we want allow_resched() to say is: feel free to reschedule
if in a reschedulable context.

The problem with doing that with an allow_resched tsk_thread_flag is
that the flag is really only valid while it is executing in the context
it was set.
And, trying to validate the flag by checking the preempt_count() makes
it pretty fragile, given that now we are tying it with the specifics of
whether the handling of arbitrary interrupts bumps up the
preempt_count() or not.

> So the right thing to do would basically be to put it in the stack
> frame at kernel entry - whether that kernel entry was a system call
> (which is doing some big copy that should be preemptible without us
> having to add "cond_resched()" in places), or is a page fault (which
> will also do things like big page clearings for hugepages)

Seems to me that associating an allow_resched flag with the stack also
has similar issue. Couldn't the context level change while we are on the
same stack?

I guess the problem is that allow_resched()/disallow_resched() really
need to demarcate a section of code having some property, but instead
set up state that has much wider scope.

Maybe code that allows resched can be in a new .section ".text.resched"
or whatever, and we could use something like this as a check:

int resched_allowed(regs) {
return !preempt_count() && in_resched_function(regs->rip);
}

(allow_resched()/disallow_resched() shouldn't be needed except for
debug checks.)

We still need the !preempt_count() check, but now both the conditions
in the test express two orthogonal ideas:

- !preempt_count(): preemption is safe in the current context
- in_resched_function(regs->rip): okay to reschedule here

So in this example, it should allow scheduling inside both the
clear_pages_reschedulable() calls:

-> page_fault()
clear_page_reschedulable();
-> page_fault()
clear_page_reschedulable();

Here though, rescheduling could happen only in the first call to
clear_page_reschedulable():

-> page_fault()
clear_page_reschedulable();
-> hardirq()
-> page_fault()
clear_page_reschedulable();

Does that make any sense, or I'm just talking through my hat?

--
ankur