Re: Regression in SYS_membarrier expedited

From: Rich Felker
Date: Tue Feb 19 2019 - 11:36:22 EST


On Tue, Feb 19, 2019 at 11:02:41AM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 18, 2019, at 4:55 PM, Rich Felker dalias@xxxxxxxx wrote:
>
> > On Mon, Feb 18, 2019 at 10:22:32AM -0500, Mathieu Desnoyers wrote:
> >> ----- On Feb 17, 2019, at 5:08 PM, Rich Felker dalias@xxxxxxxx wrote:
> >>
> >> > On Sun, Feb 17, 2019 at 04:52:35PM -0500, Rich Felker wrote:
> >> >> On Sun, Feb 17, 2019 at 04:34:45PM -0500, Mathieu Desnoyers wrote:
> >> >> > ----- On Feb 17, 2019, at 1:48 PM, Rich Felker dalias@xxxxxxxx wrote:
> >> >> >
> >> >> > > commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
> >> >> > > MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
> >> >> > > registering intent to use it. However, registration is an expensive
> >> >> > > operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
> >> >> > > added synchronize_sched() to it; this means it's no longer possible to
> >> >> > > lazily register intent at first use, and it's unreasonably expensive
> >> >> > > to preemptively register intent for possibly extremely-short-lived
> >> >> > > processes that will never use it. (My usage case is in libc (musl),
> >> >> > > where I can't know if the process will be short- or long-lived;
> >> >> > > unnecessary and potentially expensive syscalls can't be made
> >> >> > > preemptively, only lazily at first use.)
> >> >> > >
> >> >> > > Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
> >> >> > > to work even without registration? The motivation of requiring
> >> >> > > registration seems to be:
> >> >> > >
> >> >> > > "Registering at this time removes the need to interrupt each and
> >> >> > > every thread in that process at the first expedited
> >> >> > > sys_membarrier() system call."
> >> >> > >
> >> >> > > but interrupting every thread in the process is exactly what I expect,
> >> >> > > and is not a problem. What does seem like a big problem is waiting for
> >> >> > > synchronize_sched() to synchronize with an unboundedly large number of
> >> >> > > cores (vs only a few threads in the process), especially in the
> >> >> > > presence of nohz_full, where it seems like latency would be at least a
> >> >> > > few ms and possibly unbounded.
> >> >> > >
> >> >> > > Short of a working SYS_membarrier that doesn't require expensive
> >> >> > > pre-registration, I'm stuck just implementing it in userspace with
> >> >> > > signals...
> >> >> >
> >> >> > Hi Rich,
> >> >> >
> >> >> > Let me try to understand the scenario first.
> >> >> >
> >> >> > musl libc support for using membarrier private expedited
> >> >> > would require first registering membarrier private expedited for
> >> >> > the process at musl library init (typically after exec). At that
> >> >> > stage, the process is still single-threaded, right? So there is no
> >> >> > reason to issue a synchronize_sched() (or now synchronize_rcu() in
> >> >> > newer kernels):
> >> >> >
> >> >> > membarrier_register_private_expedited():
> >> >> >
> >> >> > 	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
> >> >> > 		/*
> >> >> > 		 * Ensure all future scheduler executions will observe the
> >> >> > 		 * new thread flag state for this process.
> >> >> > 		 */
> >> >> > 		synchronize_rcu();
> >> >> > 	}
> >> >> >
> >> >> > So considering that pre-registration, carefully done before the process
> >> >> > becomes multi-threaded, just costs a system call (and not a
> >> >> > synchronize_sched()), does that make the pre-registration approach more
> >> >> > acceptable?
> >> >>
> >> >> It does get rid of the extreme cost, but I don't think it would be
> >> >> well-received by users who don't like random unnecessary syscalls at
> >> >> init time (each adding a few us of startup time cost). If it's so
> >> >> cheap, why isn't it just the default at kernel-side process creation?
> >> >> Why is there any requirement of registration to begin with? Reading
> >> >> the code, it looks like all it does is set a flag, and all this flag
> >> >> is used for is erroring-out if it's not set.
> >> >
> >> > On further thought, pre-registration could be done at first
> >> > pthread_create rather than process entry, which would probably be
> >> > acceptable. But the question remains why it's needed at all, and
> >> > neither of these approaches is available to code that doesn't have the
> >> > privilege of being part of libc. For example, library code that might
> >> > be loaded via dlopen can't safely use SYS_membarrier without
> >> > introducing unbounded latency before the first use.
> >>
> >> For membarrier private expedited, the need for pre-registration is currently
> >> there because of powerpc not wanting to slow down switch_mm() for processes
> >> not needing that command.
> >>
> >> That's the only reason I see for it. If we had accepted adding an
> >> smp_mb() to the powerpc switch_mm() scheduler path, we could have done
> >> so without registration for the private expedited membarrier command.
> >
> > I don't understand why the barrier is needed at all; the ipi ping
> > should suffice to execute a barrier instruction on all cores on which
> > a thread of the process is running, and if any other core subsequently
> > picks up a thread of the process to run, it must necessarily perform a
> > barrier just to synchronize with whatever core the thread was
> > previously running on (not to mention synchronizing the handoff
> > itself). Is this just to optimize out ipi pinging cores that threads
> > of the process are not currently running on, but were last running on
> > and could become running on again without migration?
>
> See this comment in context_switch():
>
> * If mm is non-NULL, we pass through switch_mm(). If mm is
> * NULL, we will pass through mmdrop() in finish_task_switch().
> * Both of these contain the full memory barrier required by
> * membarrier after storing to rq->curr, before returning to
> * user-space.
>
> So the full memory barrier we are discussing here in switch_mm() orders
> the store to rq->curr before following memory accesses (including those
> performed by user-space).
>
> This pairs with the first smp_mb() in membarrier_private_expedited(), which
> orders the memory accesses performed before the membarrier system call before
> the following loads of each cpu_rq(cpu)->curr.
>
> This guarantees that if we happen to skip the IPI for a given CPU that
> is just about to schedule in a thread belonging to our process (say, just before
> storing rq->curr in the scheduler), the memory accesses performed by that thread
> after it starts running are ordered after the memory accesses performed prior
> to membarrier.
>
> As we are not grabbing the runqueue locks for each CPU in membarrier because
> it would be too costly, we need to ensure the proper barriers are there within
> the scheduler. We cannot just rely on the memory barrier present at the very
> start of the scheduler code to order with respect to memory accesses happening
> in the newly scheduled-in thread _after_ scheduler execution. This is why we
> need to ensure the proper barriers are present both before and after the
> scheduler stores to rq->curr.

OK, this all makes sense.
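
So in rough pseudocode (my paraphrase, not the actual kernel source --
cpus_read_lock() and the single-cpu fast path elided), the syscall side
is:

	/* membarrier_private_expedited(), simplified sketch */
	smp_mb();	/* A: accesses before the syscall vs. curr loads */
	for_each_online_cpu(cpu) {
		p = rcu_dereference(cpu_rq(cpu)->curr);
		if (p && p->mm == current->mm)
			__cpumask_set_cpu(cpu, tmpmask);
	}
	smp_call_function_many(tmpmask, ipi_mb, NULL, 1); /* smp_mb() on targets */
	smp_mb();	/* B: curr loads vs. accesses after the syscall */

where A pairs with the scheduler's barrier after the rq->curr store, and
B pairs with the barrier before it.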

> >> commit a961e40917fb hints at the sync_core private expedited membarrier
> >> command (which was being actively designed at that time), which was
> >> expected to require pre-registration. However, things did not turn out
> >> that way: we ended up adding the required core-serializing barriers
> >> unconditionally into each architecture.
> >>
> >> Considering that sync_core private expedited membarrier ended up not needing
> >> pre-registration, I think this pre-registration optimization may have been
> >> somewhat counter-productive, since I doubt the overhead of an smp_mb() in
> >> switch_mm() on powerpc is that high, but I don't have the hardware handy
> >> to provide numbers. So we end up slowing down everyone by requiring a
> >> registration system call after exec. :-(
> >
> > I'm surprised it's even possible to do switch_mm correctly with no
> > barrier...
>
> There is a barrier at the beginning of the scheduler code, which is typically
> enough for all use-cases that rely on the runqueue lock to synchronize with the
> scheduler. Membarrier does not take the rq locks, for performance reasons, so
> we end up having to make sure the proper barriers are in place both before and
> after the store to rq->curr. Those pair with the two barriers at the
> end/beginning of the membarrier system call (e.g. in
> membarrier_private_expedited()).

This too.
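
(I.e., on the scheduler side, schematically -- again my paraphrase, with
locking details and the mm==NULL/mmdrop() case collapsed:

	/* __schedule(), simplified sketch */
	rq_lock(rq, &rf);
	smp_mb__after_spinlock();	/* barrier before the store */
	...
	rq->curr = next;
	...
	switch_mm() or mmdrop();	/* full barrier after the store */
	...				/* return to user-space */

so the pairing works out whether membarrier observes the old or the new
value of rq->curr.)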

> >> One possible way out of this would be to make MEMBARRIER_CMD_PRIVATE_EXPEDITED
> >> and MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE work fine without
> >> pre-registration in future kernels. The application could then try using them
> >> without registration. If it works, all is fine; else it would handle the error
> >> however it sees fit, either through explicit registration and trying again, or
> >> by returning the error to the caller.
> >
> > This would be nice. I think the register-at-first-pthread_create works
> > fine for my case now, but as noted it doesn't work when you can't
> > control libc internals.
>
> It seems to be a transient problem then: once all libcs start supporting this,
> there will be fewer and fewer "early adopter" libraries that need to register
> while multithreaded. Therefore, I wonder if it's really worthwhile to introduce
> a change at this point, considering that it will take a while before newer
> kernels get rolled out, and people will have to support legacy behavior anyway.

It's slightly more complicated than that. At present, the code I have
queued for push to musl only performs the syscall in pthread_create if
the code using membarrier is linked (it's a weak reference, or rather,
as we implement those, a weak alias for a dummy function), and the
only code using it is in the dynamic linker, so static-linked programs
never register. This would be fixed for applications if we provided a
public membarrier() function declared in some header, and applications
used it rather than doing syscall(SYS_membarrier, ...) themselves,
which I'd kinda like to do -- it would also let us offer a fallback on
old kernels via the signaling mechanism musl already has. But I don't
think we have any consensus with glibc about whether or where to
declare such a function, and we usually try to get that before adding
new syscall wrappers. I can work on this.

> >> The only change I see we would need to make this work is to turn
> >> membarrier_arch_switch_mm() in arch/powerpc/include/asm/membarrier.h
> >> into an unconditional smp_mb().
> >
> > I finally found this just now; before I was mistakenly grepping for
> > MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY rather than
> > MEMBARRIER_STATE_PRIVATE_EXPEDITED, and couldn't find anywhere it was
> > actually used.
> >
> >> Thoughts ?
> >
> > I'm all for making it work without pre-registration. I don't care
> > whether this is via heavier barriers in ppc or via pinging more cores,
> > or some other mechanism, but maybe others have opinions on this.
>
> If from a libc perspective the current pre-registration scheme is acceptable,
> I don't see a strong argument for introducing a behavior change to an already
> exposed system call (making it no longer require pre-registration). It may
> introduce user confusion, and would require documenting per-kernel-version-range
> system call error behavior, which seems to add an unwanted amount of
> complexity for users of the system call.
>
> >
> > One alternative might be auto-registration: ipi-pinging all cores the
> > process has threads on (even ones not currently running) for the first
> > membarrier, so that they see the membarrier_state updated without
> > having to synchronize with the rcu/scheduler, then doing the same as
> > now after the first call. I'm not sure if this creates opportunities
> > for abuse; probably no worse than what the process could do just by
> > actually running lots of threads.
>
> We try really hard not to IPI a core that happens to be running
> a realtime thread at high priority. I think this scheme would not
> meet this requirement.

That's a very reasonable requirement.
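
For reference, since it came up a few times above: the signal-based
fallback I keep mentioning amounts to something like the sketch below.
It's heavily simplified, with made-up names (thread_list, thread_count;
musl actually walks its internal thread list under lock and uses its
reserved SIGSYNCCALL signal rather than SIGRTMIN), and the sigaction
and sem_init setup is omitted:

	#include <pthread.h>
	#include <semaphore.h>
	#include <signal.h>

	extern pthread_t thread_list[];	/* hypothetical: all live threads */
	extern int thread_count;

	static sem_t ack_sem;

	static void barrier_handler(int sig)
	{
		/* Entering and leaving the handler provides the ordering we
		 * need on the interrupted thread; just acknowledge. */
		(void)sig;
		sem_post(&ack_sem);	/* async-signal-safe */
	}

	static void membarrier_fallback_sketch(void)
	{
		pthread_t self = pthread_self();
		int sent = 0;
		for (int i = 0; i < thread_count; i++) {
			if (pthread_equal(thread_list[i], self))
				continue;
			if (!pthread_kill(thread_list[i], SIGRTMIN))
				sent++;
		}
		while (sent--)
			sem_wait(&ack_sem);	/* wait for every handler */
	}

As I said at the top of the thread, interrupting every thread this way
is expected and acceptable for my purposes; the kernel path is just
preferable when it's available without expensive registration.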

Rich