Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully

From: Mathieu Desnoyers
Date: Tue Dec 01 2020 - 13:31:12 EST


----- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski luto@xxxxxxxxxx wrote:

> On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>>
>> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> >> other CPUs, but there are two issues.
>> >>
>> >> - membarrier() does not sync_core() or rseq_preempt() the calling
>> >> CPU. Aside from the logic being mind-bending, this also means
>> >> that it may not be safe to modify user code through an alias,
>> >> call membarrier(), and then jump to a different executable alias
>> >> of the same code.
>> >
>> > I always understood this to be on purpose. The calling CPU can fix up
>> > itself just fine. The pain point is fixing up the other CPUs, and that's
>> > where membarrier() helps.
>>
>> Indeed, as documented in the man page:
>>
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>> In addition to providing the memory ordering guarantees de‐
>> scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return from
>> system call the calling thread has a guarantee that all its run‐
>> ning thread siblings have executed a core serializing instruc‐
>> tion. This guarantee is provided only for threads in the same
>> process as the calling thread.
>>
>> membarrier sync core guarantees a core serializing instruction on the siblings,
>> not on the caller thread. This has been done on purpose given that the caller
>> thread can always issue its core serializing instruction from user-space on
>> its own.
>>
>> >
>> > That said, I don't mind including self, these aren't fast calls by any
>> > means.
>>
>> I don't mind including self either, but this would require documentation
>> updates, including man pages, to state that starting from kernel Y this
>> is the guaranteed behavior. It's then tricky for user-space to query what
>> the behavior is unless we introduce a new membarrier command for it. So this
>> could introduce issues if software written for the newer kernels runs on older
>> kernels.
>
> For rseq at least, if we do this now we don't have this issue -- I
> don't think any released kernel has the rseq mode.

But for rseq, there is no core-sync. And considering that it is invalid
to issue a system call within an rseq critical section (including membarrier),
I don't see what we gain by doing a rseq barrier on self ?

The only case where it really changes the semantic is for core-sync I think.
And in this case, it would be adding an additional core-sync on self. I
am OK with doing that considering that it will simplify use of the system
call. I'm just wondering how we should document this change in the man page.

>
>>
>> >
>> >> - membarrier() does not explicitly sync_core() remote CPUs either;
>> >> instead, it relies on the assumption that an IPI will result in a
>> >> core sync. On x86, I think this may be true in practice, but
>> >> it's not architecturally reliable. In particular, the SDM and
>> >> APM do not appear to guarantee that interrupt delivery is
>> >> serializing.
>> >
>> > Right, I don't think we rely on that, we do rely on interrupt delivery
>> > providing order though -- as per the previous email.
>> >
>> >> On a preemptible kernel, IPI return can schedule,
>> >> thereby switching to another task in the same mm that was
>> >> sleeping in a syscall. The new task could then SYSRET back to
>> >> usermode without ever executing IRET.
>> >
>> > This; I think we all overlooked this scenario.
>>
>> Indeed, this is an issue which needs to be fixed.
>>
>> >
>> >> This patch simplifies the code to treat the calling CPU just like
>> >> all other CPUs, and explicitly sync_core() on all target CPUs. This
>> >> eliminates the need for the smp_mb() at the end of the function
>> >> except in the special case of a targeted remote membarrier(). This
>> >> patch updates that code and the comments accordingly.
>>
>> I am not confident that removing the smp_mb at the end of membarrier is
>> an appropriate change, nor that it simplifies the model.
>
> Ah, but I didn't remove it. I carefully made sure that every possible
> path through the function does an smp_mb() or stronger after all the
> cpu_rq reads. ipi_func(), on_each_cpu(), and the explicit smp_mb()
> cover the three cases.
>
> That being said, if you prefer, I can make the change to skip the
> calling CPU, in which case I'll leave the smp_mb() at the end alone.

For the memory barrier commands, I prefer skipping self and leaving the
smp_mb at the very beginning/end of the system call. Those are the key
before/after points we are synchronizing against, and those are simple
to document.

For rseq, I would not mind calling the ipi callback on self, but I don't
see how useful it would be considering that issuing system calls within
rseq c.s. is invalid.

For core-sync, I agree that doing a core-sync on self would be an actual
simplification for users, but we should think about how we plan to update
the man page to convey that behavior change to end users.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com