Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit

From: Nicholas Piggin
Date: Fri Jun 18 2021 - 22:53:45 EST


Excerpts from Andy Lutomirski's message of June 18, 2021 9:49 am:
> On 6/16/21 11:51 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
>>> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
>>>>
>>>>
>>>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
>>>>> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
>>>>>> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>>>>>>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>>>>>>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>>>>>>>> membarrier() needs a barrier after any CPU changes mm. There is currently
>>>>>>>>> a comment explaining why this barrier probably exists in all cases. This
>>>>>>>>> is very fragile -- any change to the relevant parts of the scheduler
>>>>>>>>> might get rid of these barriers, and it's not really clear to me that
>>>>>>>>> the barrier actually exists in all necessary cases.
>>>>>>>>
>>>>>>>> The comments and barriers in the mmdrop() hunks? I don't see what is
>>>>>>>> fragile or maybe-buggy about this. The barrier definitely exists.
>>>>>>>>
>>>>>>>> And any change can change anything, that doesn't make it fragile. My
>>>>>>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>>>>>>>> replaces it with smp_mb for example.
>>>>>>>
>>>>>>> I'm with Nick again, on this. You're adding extra barriers for no
>>>>>>> discernible reason, that's not generally encouraged, seeing how extra
>>>>>>> barriers is extra slow.
>>>>>>>
>>>>>>> Both mmdrop() itself, as well as the callsite have comments saying how
>>>>>>> membarrier relies on the implied barrier, what's fragile about that?
>>>>>>>
>>>>>>
>>>>>> My real motivation is that mmgrab() and mmdrop() don't actually need to
>>>>>> be full barriers. The current implementation has them being full
>>>>>> barriers, and the current implementation is quite slow. So let's try
>>>>>> that commit message again:
>>>>>>
>>>>>> membarrier() needs a barrier after any CPU changes mm. There is currently
>>>>>> a comment explaining why this barrier probably exists in all cases. The
>>>>>> logic is based on ensuring that the barrier exists on every control flow
>>>>>> path through the scheduler. It also relies on mmgrab() and mmdrop() being
>>>>>> full barriers.
>>>>>>
>>>>>> mmgrab() and mmdrop() would be better if they were not full barriers. As a
>>>>>> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
>>>>>> could use a release on architectures that have these operations.
>>>>>
>>>>> I'm not against the idea, I've looked at something similar before (not
>>>>> for mmdrop but a different primitive). Also my lazy tlb shootdown series
>>>>> could possibly take advantage of this, I might cherry pick it and test
>>>>> performance :)
>>>>>
>>>>> I don't think it belongs in this series though. Should go together with
>>>>> something that takes advantage of it.
>>>>
>>>> I’m going to see if I can get hazard pointers into shape quickly.
>>>
>>> Here it is. Not even boot tested!
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
>>>
>>> Nick, I think you can accomplish much the same thing as your patch by:
>>>
>>> #define for_each_possible_lazymm_cpu while (false)
>>
>> I'm not sure what you mean? For powerpc, other CPUs can be using the mm
>> as lazy at this point. I must be missing something.
>
> What I mean is: if you want to shoot down lazies instead of doing the
> hazard pointer trick to track them, you could do:
>
> #define for_each_possible_lazymm_cpu while (false)
>
> which would promise to the core code that you don't have any lazies left
> by the time exit_mmap() is done. You might need a new hook in
> exit_mmap() depending on exactly how you implement the lazy shootdown.

Oh for configuring it away entirely. I'll have to see how it falls out,
I suspect we'd want to just no-op that entire function and avoid the 2
atomics if we are taking care of our lazy mms with shootdowns.

The more important thing would be the context switch fast path, but even
there, there's really no reason why the two approaches couldn't be made
to both work with some careful helper functions or structuring of the
code.

Thanks,
Nick