Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully

From: Peter Zijlstra
Date: Tue Dec 01 2020 - 05:17:40 EST


On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
> other CPUs, but there are two issues.
>
> - membarrier() does not sync_core() or rseq_preempt() the calling
> CPU. Aside from the logic being mind-bending, this also means
> that it may not be safe to modify user code through an alias,
> call membarrier(), and then jump to a different executable alias
> of the same code.

I always understood this to be on purpose. The calling CPU can fix up
itself just fine. The pain point is fixing up the other CPUs, and that's
where membarrier() helps.

That said, I don't mind including self, these aren't fast calls by any
means.

> - membarrier() does not explicitly sync_core() remote CPUs either;
> instead, it relies on the assumption that an IPI will result in a
> core sync. On x86, I think this may be true in practice, but
> it's not architecturally reliable. In particular, the SDM and
> APM do not appear to guarantee that interrupt delivery is
> serializing.

Right, I don't think we rely on that, we do rely on interrupt delivery
providing order though -- as per the previous email.

> On a preemptible kernel, IPI return can schedule,
> thereby switching to another task in the same mm that was
> sleeping in a syscall. The new task could then SYSRET back to
> usermode without ever executing IRET.

This; I think we all overlooked this scenario.

> This patch simplifies the code to treat the calling CPU just like
> all other CPUs, and explicitly sync_core() on all target CPUs. This
> eliminates the need for the smp_mb() at the end of the function
> except in the special case of a targeted remote membarrier(). This
> patch updates that code and the comments accordingly.
>
> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>

> @@ -228,25 +258,33 @@ static int membarrier_private_expedited(int flags, int cpu_id)
> rcu_read_unlock();
> }
>
> - preempt_disable();
> - if (cpu_id >= 0)
> - smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> - else
> - smp_call_function_many(tmpmask, ipi_func, NULL, 1);
> - preempt_enable();
> + if (cpu_id >= 0) {
> + int cpu = get_cpu();
> +
> + if (cpu_id == cpu) {
> + ipi_func(NULL);
> + } else {
> + smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> + /*
> + * This is analogous to the smp_mb() at the beginning
> + * of the function -- exit from a system call is not a
> + * barrier. We only need this if we're targeting a
> + * specific remote CPU, though -- otherwise ipi_func()
> + * would serves the same purpose.
> + */
> + smp_mb();

smp_call_function_single(.wait=1) already orders against completion of
the IPI. Do we really need more?

> + }
> +
> + put_cpu();
> + } else {
> + on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
> + }
>
> out:
> if (cpu_id < 0)
> free_cpumask_var(tmpmask);
> cpus_read_unlock();
>
> - /*
> - * Memory barrier on the caller thread _after_ we finished
> - * waiting for the last IPI. Matches memory barriers around
> - * rq->curr modification in scheduler.
> - */
> - smp_mb(); /* exit from system call is not a mb */
> -
> return 0;
> }
>
> --
> 2.28.0
>