Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

From: Paul E. McKenney
Date: Fri Oct 13 2017 - 16:54:35 EST


On Fri, Oct 13, 2017 at 11:30:29AM -0700, Linus Torvalds wrote:
> On Fri, Oct 13, 2017 at 2:35 AM, Ben Maurer <bmaurer@xxxxxx> wrote:
> >
> > I'm really excited to hear that you're open to this patch set and totally understand the desire for some more numbers.
>
> So the patch-set actually looks very reasonable today. I looked
> through it (ok, I wasn't cc'd on the ppc-only patches so I didn't look
> at those, but I don't think they are likely objectionable either), and
> everything looked fine from a patch standpoint.
>
> But it's not _just_ numbers for real loads I'm looking for, it's
> actually an _existence proof_ for a real load too. I'd like to know
> that the suggested interface _really_ works in practice too for all
> the expected users.
>
> In particular, it's easy to make test-cases to show basic
> functionality, but that does not necessarily show that the interface
> then works in "real life".
>
> For example, if this is supposed to work for a malloc library, it's
> important that people show that yes, this can really work in a
> *LIBRARY*.
>
> That sounds so obvious and stupid that you might go "What do you
> mean?", but for things to work for libraries, they have to work
> together with *other* users, and with *independent* users.
>
> For example, say that you're some runtime that wants to use the percpu
> thing for percpu counters - because you want to avoid cache ping-pong,
> and you want to avoid per-thread allocation overhead (or per-thread
> scaling for just summing up the counters) when you have potentially
> tens of thousands of threads.
>
> Now, how does this runtime work *together* with
>
> - CPU hotplug adding new cpu's while you are running (and after you
> allocated your percpu areas)
>
> - libraries and system admins that limit - or extend - you to a
> certain set of CPUs
>
> - another library (like the malloc library) that wants to use the
> same interface for its percpu allocation queues.
>
> maybe all of this "just works", but I really want to see an existence
> proof. Not just a "dedicated use of the interface for one benchmark".
>
> So yes, I want to see numbers, but I really want to see something much
> more fundamental. I want to feel like there is a good reason to
> believe that the interface really is sufficient and that it really
> does work, even when a single thread may have multiple *different*
> uses for this. Statistics, memory allocation queues, RCU, per-cpu
> locking, yadda yadda. All these things may want to use this, but they
> want to use it *together*, and without you having to write special
> code where every user needs to know about every other user statically.
>
> Can you load two different *dynamic* libraries that each independently
> uses this thing for their own use, without having to be built together
> for each other?
>
> >> A "increment percpu value" simply isn't relevant.
> >
> > While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads.
>
> Oh, I'm not saying that it's not relevant to have high-performance
> statistics gathering using percpu data structures. Of _course_ that is
> important, we do that very much in the kernel itself.
>
> But a benchmark that does nothing else really isn't relevant. If the
> *only* thing somebody uses this for is statistics, it's simply not
> good enough.
>
>
> >> Because without real-world uses, it's not obvious that there won't be
> >> somebody who goes "oh, this isn't quite enough for us, the semantics
> >> are subtly incompatible with our real-world use case".
> >
> > Is your concern mainly this question (is this patchset a good way to
> > bring per-cpu algorithms to userspace)? I'm hoping that given the
> > variety of ways that per-cpu data structures are used in the kernel
> > the concerns around this patch set are mainly around what approach we
> > should take rather than if per-cpu algorithms are a good idea at all.
> > If this is your main concern perhaps our focus should be around
> > demonstrating that a number of useful per-cpu algorithms can be
> > implemented using restartable sequences.
>
> The important thing for me is that it should demonstrate that you can
> have users co-exists, and that the interface is sufficient for that.
>
> So I do want to see "just numbers" in the sense that I would want to
> see that people have actually written code that takes advantage of the
> percpu nature to do real things (like an allocator). But more than
> that, I want to see *use*.
>
> > Ultimately I'm worried there's a chicken and egg problem here.
>
> This patch-set has been around for *years* in some form. It's improved
> over the years, but the basic approaches are not new.
>
> Honestly, if people still don't have any actual user-level code that
> really _uses_ this, I'm not interested in merging it.
>
> There's no chicken-and-egg here. Anybody who wants to push this
> patch-set needs to write the user level code to validate that the
> patch-set makes sense. That's not chicken-and-egg, that's just
> "without the user-space code, the kernel code has never been tested,
> validated or used".
>
> And if nobody can be bothered to write the user-level code and test
> this patch-series, then obviously it's not important enough for the
> kernel to merge it.

My guess is that it will take some time, probably measured in months,
to carry out this level of integration and testing to. But agreed, it
is needed -- as you know, I recently removed some code from RCU that
was requested but then never used. Not fun. Even worse would be a
case where the requested code was half-used in a inefficient way, but
precluding improvements. And we actually had that problem some years
back with the userspace-accessible ring-buffer code.

So if it would help the people doing the testing, I would be happy to
maintain a out-of-tree repository for this series. That way, if the
testing showed that kernel-code changes were required, these changes
could be easily made without worrying about backwards compatibility
(you don't get the backwards-compatibility guarantee until the code
hits mainline). This repository would be similar in some ways to the
-rt tree, however, given the small size of this patchset, I cannot
justify a separate git tree. My thought is to provide (for example)
v4.14-rc4-rseq tags within my -rcu tree.

If there are problems with this approach, or if someone has a better idea,
please let me know.

Thanx, Paul