Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit

From: Andy Lutomirski
Date: Wed Jun 16 2021 - 14:41:26 EST


On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>> membarrier() needs a barrier after any CPU changes mm. There is currently
>>> a comment explaining why this barrier probably exists in all cases. This
>>> is very fragile -- any change to the relevant parts of the scheduler
>>> might get rid of these barriers, and it's not really clear to me that
>>> the barrier actually exists in all necessary cases.
>>
>> The comments and barriers in the mmdrop() hunks? I don't see what is
>> fragile or maybe-buggy about this. The barrier definitely exists.
>>
>> And any change can change anything, that doesn't make it fragile. My
>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>> replaces it with smp_mb for example.
>
> I'm with Nick again, on this. You're adding extra barriers for no
> discernible reason, that's not generally encouraged, seeing how extra
> barriers is extra slow.
>
> Both mmdrop() itself, as well as the callsite have comments saying how
> membarrier relies on the implied barrier, what's fragile about that?
>

My real motivation is that mmgrab() and mmdrop() don't actually need to
be full barriers. The current implementation makes them full barriers,
and it is quite slow. So let's try that commit message again:

membarrier() needs a barrier after any CPU changes mm. There is currently
a comment explaining why this barrier probably exists in all cases. The
logic is based on ensuring that the barrier exists on every control flow
path through the scheduler. It also relies on mmgrab() and mmdrop() being
full barriers.

mmgrab() and mmdrop() would be better if they were not full barriers. As a
trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
could use a release on architectures that have these operations. Larger
optimizations are also in the works. Doing any of these optimizations
while preserving an unnecessary barrier will complicate the code and
penalize non-membarrier-using tasks.
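
To make the "relaxed grab, release drop" idea concrete, here is a minimal
userspace sketch using C11 atomics. It is not kernel code: struct mm_sketch,
mm_sketch_init(), mm_sketch_grab() and mm_sketch_drop() are hypothetical
stand-ins, and the acquire fence on the final drop is the usual refcount
pattern rather than anything taken from this patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mm refcount; not the kernel's mm_struct. */
struct mm_sketch {
	atomic_int count;
	bool freed;	/* set instead of freeing, so the effect is observable */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
	atomic_init(&mm->count, 1);
	mm->freed = false;
}

/*
 * Analogue of mmgrab(): the caller already holds a reference, so the
 * increment needs atomicity but no ordering.
 */
static void mm_sketch_grab(struct mm_sketch *mm)
{
	atomic_fetch_add_explicit(&mm->count, 1, memory_order_relaxed);
}

/*
 * Analogue of mmdrop(): release ordering on the decrement, plus an
 * acquire fence only on the path that tears the object down.
 */
static void mm_sketch_drop(struct mm_sketch *mm)
{
	if (atomic_fetch_sub_explicit(&mm->count, 1,
				      memory_order_release) == 1) {
		atomic_thread_fence(memory_order_acquire);
		mm->freed = true;
	}
}
```

Neither operation is a full barrier here, which is exactly why anything
that needs a full barrier after an mm switch cannot rely on them.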

Simplify the logic by adding an explicit barrier, and allow architectures
to override it as an optimization if they want to.
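
As a rough illustration of "explicit barrier, overridable by the arch",
the sketch below uses the common #ifndef default pattern in userspace C11.
All names (struct sketch_mm, sketch_finish_switch_mm,
sketch_full_barriers) are hypothetical, and the check that skips the
fence for an mm that never registered for membarrier is my assumption
about how non-users would avoid the cost, not a description of the patch.

```c
#include <stdatomic.h>

/* Hypothetical per-mm state: nonzero once the mm uses membarrier. */
struct sketch_mm {
	atomic_int membarrier_state;
};

/* Instrumentation for the sketch: counts full fences actually issued. */
static int sketch_full_barriers;

/*
 * Generic default. An architecture whose mm switch already implies a
 * full barrier could define sketch_finish_switch_mm itself (as a macro
 * or inline) before this point to make it a no-op.
 */
#ifndef sketch_finish_switch_mm
static inline void sketch_finish_switch_mm(struct sketch_mm *next)
{
	/* Assumption: only membarrier users pay for the fence. */
	if (atomic_load_explicit(&next->membarrier_state,
				 memory_order_relaxed)) {
		atomic_thread_fence(memory_order_seq_cst);
		sketch_full_barriers++;
	}
}
#endif
```

The point of the shape is that the ordering requirement lives in one named
place instead of being spread across comments on unrelated refcount
helpers.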

One of the deleted comments in this patch said "It is therefore
possible to schedule between user->kernel->user threads without
passing through switch_mm()". It is possible to do this without, say,
writing to CR3 on x86, but the core scheduler does call
switch_mm_irqs_off() to tell the arch code to go back from lazy mode
to non-lazy mode.
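
A toy model of the user->kernel->user case, in case it helps. This is a
sketch under simplifying assumptions, not the x86 implementation: struct
cpu_sketch and both functions are hypothetical, and "page table write"
here just stands for something like a CR3 write.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state for the model. */
struct cpu_sketch {
	const void *loaded_mm;	/* mm whose page tables are installed */
	bool lazy;		/* a kernel thread is borrowing loaded_mm */
	int page_table_writes;	/* simulated CR3-style writes */
	int arch_switch_calls;	/* calls into the arch switch hook */
};

/* Kernel thread scheduled in: keep the page tables, enter lazy mode. */
static void sketch_enter_lazy(struct cpu_sketch *cpu)
{
	cpu->lazy = true;
}

/*
 * Stand-in for the arch hook the scheduler calls when switching to a
 * user task. It always runs, but it only rewrites the page table
 * pointer when the mm actually changes.
 */
static void sketch_switch_mm(struct cpu_sketch *cpu, const void *next_mm)
{
	cpu->arch_switch_calls++;
	if (cpu->loaded_mm != next_mm) {
		cpu->loaded_mm = next_mm;
		cpu->page_table_writes++;
	}
	cpu->lazy = false;
}
```

Running user task A, then a kernel thread, then A again gives two hook
calls but only one page table write in this model: the hook is still
reached on the way back out of lazy mode.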