Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

From: Mathieu Desnoyers
Date: Tue May 26 2015 - 17:04:09 EST


----- Original Message -----
> On May 25, 2015 11:54 AM, "Andy Lutomirski" <luto@xxxxxxxxxxxxxx> wrote:
[...]
> I might be guilty of being too x86-centric here. On x86, as long as
> the lock and unlock primitives are sufficiently atomic, everything
> should be okay. On other architectures, though, a primitive that
> gives lock, unlock, and abort of a per-cpu lock without checking that
> you're still on the right cpu at unlock time may not be sufficient.
> If the primitive is implemented purely with loads and stores, then
> even if you take the lock, migrate, finish your work, and unlock
> without anyone else contending from the lock (and hence don't abort),
> the next thread to take the same lock will end up unsynchronized
> unless there's appropriate memory ordering. For example, if taking
> the lock were an acquire and unlocking were a release, we'd be fine.

If we look at x86-TSO, where stores reordered after loads is pretty
much the only thing we need to care about, we have:

CPU 0 CPU 1

load, test lock[1]
store lock[1]
loads/stores to data[1]
[preempted, migrate to cpu 0]
[run other thread]
load, test lock[1] (loop)
loads/stores to data[1] (A)
store 0 to lock[1] (B)
load, test lock[1] (C)
store lock[1]
loads/stores to data[1] (D)

What we want to ensure is that load/stores to data[1] from CPU 0 and
1 don't get interleaved.

On CPU 0 doing unlock, loads and stores to data[1] (A) cannot be reordered
against the store to lock[1] (B) due to TSO (all we care about is stores
reordered after loads).

On CPU 1 grabbing the 2nd lock, we care about loads/store from/to
data[1] (C) being reordered before load, test of lock[1] (C). Again,
thanks to TSO, loads/stores from/to data[1] (D) cannot be reordered
wrt (C).

So on x86-TSO we should be OK, but as you say, we need to be careful
to have the right barriers in place on other architectures.

>
> Your RFC design certainly works (in principle -- I haven't looked at
> the code in detail), but I can't shake the feeling that it's overkill

If we can find a simpler way to achieve the same result, I'm all ears :)

> and that it could be improved to avoid unnecessary aborts every time
> the lock holder is scheduled out.

The restart is only performed if preemption occurs when the thread is
trying to grab the lock. Once the thread has the lock (it's now the
lock holder), it will not be aborted even if it is migrated.

>
> This isn't a problem in your RFC design, but if we wanted to come up
> with tighter primitives, we'd have to be quite careful to document
> exactly what memory ordering guarantees they come with.

Indeed.

>
> It may be that all architectures for which you care about the
> performance boost already have efficient acquire and release
> operations. Certainly x86 does, and I don't know how fast the new ARM
> instructions are, but I imagine they're pretty good.

We may even need memory barriers around lock and unlock on ARM. :/ ARM
smp_store_release() and smp_load_acquire() currently implements those with
smp_mb().

>
> It's too bad that not all architectures have a single-instruction
> unlocked compare-and-exchange.

Based on my benchmarks, it's not clear that single-instruction
unlocked CAS is actually faster than doing the same with many
instructions.

Thanks,

Mathieu

>
> --Andy
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/