Re: expand_stack fix [was Re: 2.4.9aa3]

From: Andrea Arcangeli (andrea@suse.de)
Date: Sat Sep 08 2001 - 11:04:16 EST


On Fri, Sep 07, 2001 at 11:47:09AM -0700, Linus Torvalds wrote:
>
> On Mon, 3 Sep 2001, Andrea Arcangeli wrote:
> >
> > Linus please include the attached patch to the next kernel, expand_stack
> > is totally broken at the moment, we cannot mess with the mm vma layout
> > if we don't hold the mmap_sem in write mode.
>
> I disagree with the diagnosis..
>
> expand_stack() has _never_ messed with the vma layout, and never should.
> As such, from a vma list integrity standpoint it is fine.
>
> Do we mess with the contents? Yes. But I'd much rather see a much more
> minimal approach to the problem, on the order of:
> - make sure we only accept GROWSDOWN for anonymous areas (which don't
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> care about the offset)
> - make the vm_start update atomic (possibly by just getting the pagetable
> spinlock).

We just take the pagetable spinlock there, the race is all about the
pgoff.

In short you agree that the current locking is broken but you propose to
limit the usability of GROWSDOWN and GROWSUP solely to the anonymous
vmas instead of fixing the pgoff race with proper locking as I did.

My fix for the race doesn't drop the usability of GROWSDOWN that could
otherwise break userspace programs. I guess at least uml uses growsdown
vma file backed. Jeff?

Assuming it's acceptable to break GROWSDOWN on file backed vmas as you
suggested, the new locking rules with your approch would be:

1) pgoff is garbage as usual for any anon vma so don't even try to
        mess with it in expand_stack
2) with only the read semaphore acquired the vm_start/vm_end of the
        vmas can expand under us but the vma tree not
3) with only the read semaphore acquired to get a consistent
        vm_start/vm_end of all vmas in the tree (like we need to in
        expand_stack) also the page_table_lock of the mm must be
        acquired
4) with the write lock vm_start/vm_end are consistent always as
        well as tree so no change here

> > I considered implementing a read->write semaphore upgrade primitive but
> > it cannot be reliable
>
> There is no such thing. Never has been. It's a fundamentally impossible
> operation. We may, at some point, decide to have a "read_for_write()" and

Yes, of course with "it cannot be reliable" I meant it cannot work out
always, it would be an optimization for the common case:

        spin_lock(&rwsem->lock)
        if (only one reader)
                gain write privilegies
                err = success
        else
                err = faliure
        spin_unlock(&rwsem->lock)

        return err

it would still require the fail path in case of the faliure (multiple
readers potentially all trying to upgrade the lock) so I ignored the
optimization (expand_stack isn't a very fast path anyways).

> However, having a finer-granularity spinlock _inside_ the semaphore (see
> above suggestion) is a perfectly valid approach.

if you are 100% sure it's acceptable to break the kernel API for the
GROWSDOWN file backed vmas (which I don't think it's the case) I can
switch to your suggested fix (otherwise I prefer to keep upgrading the
semaphore in expand_stack rather to have to spinlocking at every nopage
private/shared page fault).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Sep 15 2001 - 21:00:14 EST