Re: [RFC PATCH v8 1/9] Restartable sequences system call

From: Ben Maurer
Date: Thu Aug 25 2016 - 13:59:08 EST


On 8/25/16, 10:08 AM, "Mathieu Desnoyers" <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> The most appealing application we have so far is Dave Watson's Facebook
> services using jemalloc as a memory allocator. It would be nice if he
> could re-run those benchmarks with my rseq implementation. The trade-offs
> here are about speed and memory usage:

One piece of context Iâd like to provide about our investigation into using rseq for malloc â I think that itâs really important to keep in mind that weâve optimized the software that we write to survive well in a world where thereâs a high memory cost for jemalloc for each thread and jemalloc is unable to have allocation caches as large as we would like. Weâre not going to have real world benchmarks that show a magical improvement with rseq because over time weâve baked the constraints of our environment into the design of our programs and optimized for the current set of APIs the kernel provides. I do think rseq provides a benefit even for applications optimized for todayâs malloc implementations. But the real benefit is the new types of application designed that rseq enables and the ability for rseq to provide predictable performance for low-level APIs with much less investment from users. Iâll illustrate the costs that rseq would let us avoid with two examples of design choices weâve made:

1) Because jemalloc uses a per-thread cache, threads that are sleeping have a high memory cost. For example, if you have a thread-pool with 100 threads but only 10 are used most of the time the other 90 threads will still have a dormant cache consuming memory. In order to combat this we have an abstraction called MemoryIdler (https://github.com/facebook/folly/blob/master/folly/detail/MemoryIdler.h) which is essentially a wrapper around futex that signals jemalloc to release its caches when the thread is idle. From what I can tell this is a practice that isnât widely adopted even though it can save a substantial amount of memory â rseq makes this a moot point since caches can be per-cpu and the memory allocator does not need to worry about an idle thread hogging the cache.
2) The per-thread nature of malloc implementations has generally led people to avoid thread-per-request designs. Things like MemoryIdler can help you if a thread is going to be idle for seconds before it is used again, but if your thread makes a 100 ms RPC to another server clearing the cache is far too expensive to justify. But you still have a bunch of memory sitting around unused for 100ms. Multiply that by many concurrent requests and you are consuming a lot of memory. This has forced people to handle multiple requests in a single thread â this leads to problems of its own like a contested lock in one request impacting many other requests on the same thread.

rseq opens up a whole world of algorithms to userspace â algorithms that are O(num CPUs) and where one can have an extremely fast fastpath at the cost of a slower slow path. Many of these algorithms are in use in the kernel today â per-cpu allocators, RCU, light-weight reader writer locks, etc. Even in cases where these APIs can be implemented today, a rseq implementation is often superior in terms of predictability and usability (eg per-thread counters consume more memory and are more expensive to read than per-cpu counters).

Isnât the large number of uses of rseq-like algorithms in the kernel a pretty substantial sign that there would be demand for similar algorithms by user-space systems programmers?

-b