Re: [PATCH 0/3] mm,vdso: preallocate new vmas

From: Andy Lutomirski
Date: Wed Oct 23 2013 - 17:43:27 EST


On Wed, Oct 23, 2013 at 3:13 AM, Michel Lespinasse <walken@xxxxxxxxxx> wrote:
> On Tue, Oct 22, 2013 at 10:54 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On 10/22/2013 08:48 AM, walken@xxxxxxxxxx wrote:
>>> Generally the problems I see with mmap_sem are related to long latency
>>> operations. Specifically, the mmap_sem write side is currently held
>>> during the entire munmap operation, which iterates over user pages to
>>> free them, and can take hundreds of milliseconds for large VMAs.
>>
>> This is the leading cause of my "egads, something that should have been
>> fast got delayed for several ms" detector firing.
>
> Yes, I'm seeing such issues relatively frequently as well.
>
>> I've been wondering:
>>
>> Could we replace mmap_sem with some kind of efficient range lock? The
>> operations would be:
>>
>> - mm_lock_all_write (drop-in replacement for down_write(&...->mmap_sem))
>> - mm_lock_all_read (same for down_read)
>> - mm_lock_write_range(mm, start, end)
>> - mm_lock_read_range(mm, start, end)
>>
>> and corresponding unlock functions (which might take a cookie returned
>> by the lock functions, or a pointer to some small on-stack data
>> structure).
>
> That seems doable; however, I believe we can get rid of the latencies
> in the first place, which seems like a better direction. As I briefly
> mentioned, I would like to tackle the munmap problem sometime; Jan
> Kara also has a project to remove places where blocking FS functions
> are called with mmap_sem held (he's doing it for lock-ordering
> purposes, so that FS can call into MM functions that take mmap_sem,
> but there are latency benefits as well if we can avoid blocking in FS
> with mmap_sem held).

There will still be scalability issues if there are enough threads,
but maybe this isn't so bad. (My workload may also have priority
inversion problems -- there's a thread that runs on its own core and
needs the mmap_sem read lock, and a thread that runs on a highly
contended core and needs the write lock.)
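
To make that concrete, the interface I'm imagining would look
something like the following. This is only a sketch: the names, the
cookie layout, and the idea of queuing cookies on a per-mm list are
all made up.

struct mm_range_cookie {
	unsigned long start, end;
	bool write;
	struct list_head list;	/* queued on a per-mm list of held ranges */
};

/* Whole-address-space locks, drop-in for down_write()/down_read(): */
void mm_lock_all_write(struct mm_struct *mm);
void mm_unlock_all_write(struct mm_struct *mm);
void mm_lock_all_read(struct mm_struct *mm);
void mm_unlock_all_read(struct mm_struct *mm);

/*
 * Range locks: the cookie lives on the caller's stack and carries
 * whatever bookkeeping the unlock side needs.
 */
void mm_lock_write_range(struct mm_struct *mm, unsigned long start,
			 unsigned long end, struct mm_range_cookie *cookie);
void mm_lock_read_range(struct mm_struct *mm, unsigned long start,
			unsigned long end, struct mm_range_cookie *cookie);
void mm_unlock_range(struct mm_struct *mm, struct mm_range_cookie *cookie);

munmap could then write-lock just the range being torn down for its
long page-freeing walk, and faults elsewhere in the address space
would proceed unimpeded.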

>
>> The easiest way to implement this that I can think of is a doubly-linked
>> list or even just an array, which should be fine for a handful of
>> threads. Beyond that, I don't really know. Creating a whole trie for
>> these things would be expensive, and fine-grained locking on rbtree-like
>> things isn't so easy.
>
> Jan also had an implementation of range locks using interval trees. To
> take a range lock, you add the range you want to the interval tree,
> count the conflicting range-lock requests that were already there, and
> (if that count is nonzero) block until it drops to 0. When releasing
> the range lock, you look for any conflicting requests in the interval
> tree and decrement their conflict counts, waking up any whose count
> drops to 0.
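
If I'm reading that right, it would be roughly the following -- a
sketch on top of <linux/interval_tree.h>, where the per-mm tree,
spinlock, and field names are all made up, and every overlap is
treated as a conflict (no read/write distinction):

struct range_lock {
	struct interval_tree_node node;	/* caller fills node.start/.last */
	unsigned int blocking_count;	/* conflicting requests ahead of us */
	struct task_struct *task;
};

static void range_lock(struct mm_struct *mm, struct range_lock *rl)
{
	struct interval_tree_node *it;

	rl->blocking_count = 0;
	rl->task = current;

	spin_lock(&mm->range_lock_lock);
	/* every conflicting range already in the tree goes ahead of us */
	for (it = interval_tree_iter_first(&mm->range_lock_tree,
					   rl->node.start, rl->node.last);
	     it;
	     it = interval_tree_iter_next(it, rl->node.start, rl->node.last))
		rl->blocking_count++;
	interval_tree_insert(&rl->node, &mm->range_lock_tree);

	while (rl->blocking_count) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock(&mm->range_lock_lock);
		schedule();
		spin_lock(&mm->range_lock_lock);
	}
	spin_unlock(&mm->range_lock_lock);
}

static void range_unlock(struct mm_struct *mm, struct range_lock *rl)
{
	struct interval_tree_node *it;

	spin_lock(&mm->range_lock_lock);
	interval_tree_remove(&rl->node, &mm->range_lock_tree);
	/*
	 * Anything still overlapping us arrived after we did (and so
	 * counted us): decrement it, and wake it when it hits zero.
	 */
	for (it = interval_tree_iter_first(&mm->range_lock_tree,
					   rl->node.start, rl->node.last);
	     it;
	     it = interval_tree_iter_next(it, rl->node.start, rl->node.last)) {
		struct range_lock *waiter =
			container_of(it, struct range_lock, node);
		if (--waiter->blocking_count == 0)
			wake_up_process(waiter->task);
	}
	spin_unlock(&mm->range_lock_lock);
}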

Yuck. Now we're taking a per-mm lock on the rbtree, doing some
cacheline-bouncing rbtree operations, and dropping the lock to
serialize access to something that probably only has a small handful
of accessors at a time. I bet that an O(num locks) array or linked
list will end up being faster in practice.
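
Roughly this, say -- the same count-and-sleep protocol as the sketch
above, just with a dumb list for conflict detection (again, the
per-mm names are invented):

struct range_lock {
	unsigned long start, end;	/* [start, end) */
	unsigned int blocking_count;
	struct task_struct *task;
	struct list_head list;
};

static void range_lock(struct mm_struct *mm, struct range_lock *rl)
{
	struct range_lock *old;

	rl->blocking_count = 0;
	rl->task = current;

	spin_lock(&mm->range_lock_lock);
	/* O(locks held) overlap walk, no tree rebalancing */
	list_for_each_entry(old, &mm->range_lock_list, list)
		if (old->start < rl->end && rl->start < old->end)
			rl->blocking_count++;
	list_add_tail(&rl->list, &mm->range_lock_list);
	while (rl->blocking_count) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock(&mm->range_lock_lock);
		schedule();
		spin_lock(&mm->range_lock_lock);
	}
	spin_unlock(&mm->range_lock_lock);
}

static void range_unlock(struct mm_struct *mm, struct range_lock *rl)
{
	struct range_lock *waiter;

	spin_lock(&mm->range_lock_lock);
	list_del(&rl->list);
	list_for_each_entry(waiter, &mm->range_lock_list, list)
		if (waiter->start < rl->end && rl->start < waiter->end &&
		    --waiter->blocking_count == 0)
			wake_up_process(waiter->task);
	spin_unlock(&mm->range_lock_lock);
}

With only a handful of concurrent lockers, the whole list fits in a
cacheline or two and the critical section is one short walk under a
single spinlock.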

I think the ideal solution would be to shove these things into the page
tables somehow, but that seems impossibly complicated.

--Andy

>
> But as I said earlier, I would prefer if we could avoid holding
> mmap_sem during long-latency operations rather than working around
> this issue with range locks.
>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.



--
Andy Lutomirski
AMA Capital Management, LLC