Re: [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes

From: Thomas Gleixner

Date: Sat Jun 27 2026 - 19:11:31 EST

On Sat, Jun 27 2026 at 00:46, Thomas Gleixner wrote:
> On Fri, Apr 24 2026 at 12:14, David Stevens wrote:
>> Store the task pointer in the ptes of the unpopulated pages of dynamic
>> stacks, to allow the vm_struct pointer to be retrieved without relying
>> on any locks or current.
>
> You fail to explain why you can't use current. Changelogs have to
> describe the WHY and not the WHAT.

I obviously know why you can't use it. But the absence of a proper
explanation and my disgust for the implementation bothered me enough to
look deeper into it.

Let's look at the only problematic case:

    schedule()
      ....
      switch_to(prev, next)
        switch_to_asm(prev, next)
1)        switch(RSP)
          __switch_to(prev, next)
2)          this_cpu_write(current_task, next);

There is obviously a hole between #1 and #2 where 'current_task' is not
giving the right answer. You work around that with this PTE storage
magic which is admittedly smart, but completely overengineered and not
necessary at all.

Why?

If you look at the above condensed context switch logic related to this
problem thoroughly, you'll notice that there are three sources of
information:

- prev: the task being scheduled out
- next: the task being scheduled in
- RSP: the stack pointer

Between #1 and #2 it cannot be determined whether RSP belongs to 'prev'
or 'next' because 'next' is not exposed to the fault handler. But if it
were exposed, that would answer the question of which task RSP belongs
to, no?

So the obvious _and_ simple solution _is_ to expose 'next':

    schedule()
      ....
      switch_to(prev, next)
1)      raw_cpu_write(next_task, next);
        switch_to_asm(prev, next)
2)        switch(RSP)
          __switch_to(prev, next)
3)          raw_cpu_write(current_task, next);

With that the stack fault handler logic becomes:

    curr = raw_cpu_read(current_task);
    addr = fred_event_data(regs);

    if (within_task_stack(addr, curr))
        return handle_stack_fault(regs, curr);

    next = raw_cpu_read(next_task);
    if (curr != next && within_task_stack(addr, next))
        return handle_stack_fault(regs, next);

    return 0;

Which is correct at any point in time and that pattern works on _all_
supported architectures because it's all CPU local. All you need is one
extra store. When done right, that store ends up in the same cache line
which the context switch dirties anyway (i.e. the one holding
current_task), so you won't even be able to measure the overhead.
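
To make the lookup order concrete, here is a minimal user space sketch of
the same pattern. It is not kernel code: 'struct task', 'struct cpu_hot',
STACK_SIZE and stack_fault_owner() are hypothetical stand-ins, the body
of within_task_stack() is an assumption about what such a helper checks,
and the 64 byte cache line size is assumed as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_SIZE 0x4000UL	/* assumed stack size, illustration only */

/* Hypothetical stand-in for task_struct: only the stack base matters. */
struct task {
	uintptr_t stack_base;
};

/* Stand-in for the per-CPU data: both pointers sit next to each other. */
struct cpu_hot {
	struct task *current_task;
	struct task *next_task;
};

/* Both pointers fit into one (assumed) 64 byte cache line. */
static_assert(sizeof(struct cpu_hot) <= 64, "pointers span cache lines");

/* Assumed semantics: addr lies inside the stack range of task t. */
static bool within_task_stack(uintptr_t addr, const struct task *t)
{
	return t && addr >= t->stack_base &&
	       addr - t->stack_base < STACK_SIZE;
}

/* Return the task which owns the faulting stack address, or NULL. */
static struct task *stack_fault_owner(const struct cpu_hot *cpu,
				      uintptr_t addr)
{
	struct task *curr = cpu->current_task;
	struct task *next = cpu->next_task;

	if (within_task_stack(addr, curr))
		return curr;

	/* Window between the RSP switch and the current_task update. */
	if (curr != next && within_task_stack(addr, next))
		return next;

	return NULL;
}
```

With current_task still pointing at 'prev' and next_task already set, an
address on the stack of 'next' resolves to 'next'. Once current_task is
updated, both pointers are equal and the second check is skipped.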

Thanks,

tglx

---
Everything should be made as simple as possible, but not simpler. - Einstein