Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

From: Nicholas Piggin
Date: Mon Jun 14 2021 - 00:48:06 EST


Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>> when it is context switched. This can be disabled by architectures that
>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>
>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>
>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>> with the resulting code if task->active_mm were at least better
>>>> documented and possibly even guarded by ifdefs.
>>>
>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>> I don't think anything has changed in 20 years, I don't know what more
>>> is needed, but if you can add to documentation that would be nice. Maybe
>>> moving a bit of that into .c and .h files?
>>>
>>
>> Quoting from that file:
>>
>> - however, we obviously need to keep track of which address space we
>> "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>> which shows what the currently active address space is.
>>
>> This isn't even true right now on x86.
>
> From the perspective of core code, it is. x86 might do something crazy
> with it, but it has to make it appear this way to non-arch code that
> uses active_mm.
>
> Is x86's scheme documented?
>
>> With your patch applied:
>>
>> To support all that, the "struct mm_struct" now has two counters: a
>> "mm_users" counter that is how many "real address space users" there are,
>> and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>> users) plus one if there are any real users.
>>
>> isn't even true any more.
>
> Well yeah but the active_mm concept hasn't changed. The refcounting
> change is hopefully reasonably documented?
>
>>
>>
>>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>>> x86 bare metal *also* does not need ->active_mm. Under the x86 scheme,
>>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>>> dangling pointer, and this makes me extremely uncomfortable.
>>>>
>>>> So I tend to think that, depending on config, the core code should
>>>> either keep ->active_mm [1] alive or get rid of it entirely.
>>>
>>> I don't actually know what you mean.
>>>
>>> core code needs the concept of an "active_mm". This is the mm that your
>>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>>> active_mm still points to init_mm for kernel threads.
>>
>> Core code does *not* need this concept. First, it's wrong on x86 since
>> at least 4.15. Any core code that actually assumes that ->active_mm is
>> "active" for any sensible definition of the word active is wrong.
>> Fortunately there is no such code.
>>
>> I looked through all active_mm references in core code. We have:
>>
>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>> with membarrier.
>>
>> kernel/kthread.c: same. refcounting and membarrier stuff.
>>
>> kernel/exit.c: exit_mm() a BUG_ON().
>>
>> kernel/fork.c: initialization code and a warning.
>>
>> kernel/cpu.c: cpu offline stuff. wouldn't be needed if active_mm went away.
>>
>> fs/exec.c: nothing of interest
>
> I might not have been clear. Core code doesn't need active_mm if
> active_mm somehow goes away. I'm saying active_mm can't go away because
> it's needed to support (most) archs that do lazy tlb mm switching.
>
> The part I don't understand is when you say it can just go away. How?
>
>> I didn't go through drivers, but I maintain my point. active_mm is
>> there for refcounting. So please don't just make it even more confusing
>> -- do your performance improvement, but improve the code at the same
>> time: get rid of active_mm, at least on architectures that opt out of
>> the refcounting.
>
> powerpc opts out of the refcounting and can not "get rid of active_mm".
> Not even in theory.

That is to say, it does do a type of reference management that requires
active_mm so you can argue it has not entirely opted out of refcounting.
But we're not just doing refcounting for the sake of refcounting! That
would make no sense.

active_mm is required because that's the mm that we have switched to
(from core code's perspective), and it is integral to know when to
switch to a different mm. See how active_mm is a fundamental concept
in core code? It's part of the contract between core code and the
arch mm context management calls. reference counting follows from there
but it's not the _reason_ for this code.

Pretend the reference problem does not exit (whether by refcounting or
shootdown or garbage collection or whatever). We still can't remove
active_mm! We need it to know how to call into arch functions like
switch_mm.

I don't know if you just forgot that critical requirement in your above
list, or you actually are entirely using x86's mental model for this
code which is doing something entirely different that does not need it
at all. If that is the case I really don't mind some cleanup or wrapper
functions for x86 do entirely do its own thing, but if that's the case
you can't criticize core code's use of active_mm due to the current
state of x86. It's x86 that needs documentation and cleaning up.

Thanks,
Nick