Re: [PATCH] ARC: Improve cmpxchng syscall implementation

From: Alexey Brodkin
Date: Wed Apr 04 2018 - 04:56:28 EST


Hi Vineet, Peter,

On Wed, 2018-03-21 at 14:54 +0300, Alexey Brodkin wrote:
> Hi Vineet,
>
> On Mon, 2018-03-19 at 11:29 -0700, Vineet Gupta wrote:
> > On 03/19/2018 04:00 AM, Alexey Brodkin wrote:
> > > The arc_usr_cmpxchg syscall is meant to be used on platforms
> > > that lack hardware support for Load-Locked/Store-Conditional
> > > instructions. In that case we mimic the missing hardware feature
> > > with a kernel syscall that "atomically" checks the current
> > > value in memory and, if it matches the caller's expectation,
> > > writes the new value to that same location.
> > >
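A minimal user-space model of the contract described above. It assumes, as in the quoted code, that the syscall returns the value it read and reports success through the Z flag in STATUS32. The struct, field and function names are hypothetical, and faults and preemption are deliberately not modelled:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: z_flag stands in for STATUS_Z_MASK. */
struct cmpxchg_result {
	int old;	/* value read from *uaddr, returned to the caller */
	bool z_flag;	/* set only if the new value was stored */
};

/*
 * Model of arc_usr_cmpxchg on a UP core: with preemption disabled no
 * other task can run between the read and the write, which is what
 * makes the sequence "atomic". Faults are out of scope here.
 */
static struct cmpxchg_result model_usr_cmpxchg(int *uaddr, int expected,
						int new_val)
{
	struct cmpxchg_result res = { .old = 0, .z_flag = false };

	assert(uaddr);		/* a bogus pointer is not modelled */

	res.old = *uaddr;	/* __get_user() in the kernel */
	if (res.old == expected) {
		*uaddr = new_val;	/* __put_user() in the kernel */
		res.z_flag = true;
	}
	return res;
}
```

A caller compares the returned value with what it expected (or checks Z) to learn whether its store happened.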
> >
> > ...
> > ...
> >
> > >
> > > 2. What's worse, if we're dealing with data from a not yet allocated
> > > page (think of the pre-copy-on-write state), we'll successfully
> > > read the data, but on write we'll silently return to user-space
> > > with the correct result
> >
> > This is technically incorrect: even for reading you need a page, which could be
> > the common zero page in certain cases.
>
> OK, I'll reword it then.
>
> >
> > > (which we had really read just before). That leads
> > > to very strange problems in the user-space app further down the line
> > > because the new value was never written to the destination.
> > >
> > > 3. Regardless of what went wrong, we'll return from the syscall
> > > and the user-space application will continue to execute,
> > > even if the user's pointer was completely bogus.
> >
> > Again we are exaggerating (from a technical correctness POV): if the user pointer was
> > bogus, the read would not have worked in the first place, etc. So let's tone down the
> > rhetoric.
>
> OK, here I may rephrase it like this:
> ------------------------------->8-----------------------------
> 3. Regardless of what went wrong, we'll return from the syscall
> and the user-space application will continue to execute.
> ------------------------------->8-----------------------------
>
> >
> > > With hardware LL/SC, that app would have been killed
> > > by the kernel.
> > >
> > > With this change we attempt to improve on all 3 items above:
> > >
> > > 1. We still disable preemption around the read-and-write of
> > > the user's data, but if either of them fails
> > > we enable preemption and try to force a page fault so
> > > that we have a correct mapping in the TLB. Then we retry
> > > in "atomic" context.
> > >
> > > 2. If the real page fault fails, or if access_ok() returns false,
> > > we send SIGSEGV to the user-space process, so if something goes
> > > seriously wrong we'll know about it much earlier.
> > >
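A rough user-space model of the retry flow in items 1 and 2. It assumes a single user word that can be readable but not writable (the read-only, pre-COW page) or entirely bogus. fake_uword, atomic_get(), atomic_put() and fault_in() are hypothetical stand-ins for __get_user(), __put_user() and a real page fault taken with preemption enabled; they are not kernel APIs:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user word: flags say which accesses succeed right now. */
struct fake_uword {
	int value;
	bool readable;
	bool writable;
	bool bogus;	/* no fault can ever make this address valid */
};

/* With preemption disabled a missing mapping cannot be fixed up. */
static int atomic_get(const struct fake_uword *w, int *val)
{
	if (!w->readable)
		return -EFAULT;
	*val = w->value;
	return 0;
}

static int atomic_put(struct fake_uword *w, int val)
{
	if (!w->writable)
		return -EFAULT;
	w->value = val;
	return 0;
}

/* Stand-in for a real (write) page fault, e.g. breaking COW. */
static bool fault_in(struct fake_uword *w)
{
	if (w->bogus)
		return false;
	w->readable = true;
	w->writable = true;
	return true;
}

/* Returns the value read; *z reports success, *sigsegv a kill. */
static int cmpxchg_with_retry(struct fake_uword *w, int expected,
			      int new_val, bool *z, bool *sigsegv)
{
	int val = -EFAULT;

	assert(w && z && sigsegv);
	*z = false;
	*sigsegv = false;
again:
	/* preempt_disable() would be here */
	if (atomic_get(w, &val) == -EFAULT)
		goto fault;
	if (val == expected) {
		if (atomic_put(w, new_val) == -EFAULT)
			goto fault;
		*z = true;
	}
	/* preempt_enable() */
	return val;

fault:
	/* preempt_enable(), then try to establish the mapping and re-read */
	if (fault_in(w))
		goto again;
	*sigsegv = true;
	return -EFAULT;
}
```

The retry re-reads the value after faulting in, since another task may have run while preemption was enabled.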
> >
> >
> > >
> > > /*
> > > * This is only for old cores lacking LLOCK/SCOND, which by defintion
> > > @@ -60,23 +62,48 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new)
> > > /* Z indicates to userspace if operation succeded */
> > > regs->status32 &= ~STATUS_Z_MASK;
> > >
> > > - if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
> > > - return -EFAULT;
> > > + ret = access_ok(VERIFY_WRITE, uaddr, sizeof(*uaddr));
> > > + if (!ret)
> > > + goto fail;
> > >
> > > +again:
> > > preempt_disable();
> > >
> > > - if (__get_user(uval, uaddr))
> > > - goto done;
> > > -
> > > - if (uval == expected) {
> > > - if (!__put_user(new, uaddr))
> > > + ret = __get_user(val, uaddr);
> > > + if (ret == -EFAULT) {
> >
> >
> > Let's see if this warrants adding complexity! This implies that a TLB entry with
> > Read permissions didn't exist for reading the var, and the page fault handler could not
> > wire up even a zero page due to preempt_disable(), meaning it was something not
> > already touched by userspace: sort of an uninitialized variable in user code.
>
> OK, I completely missed the fact that the fast-path TLB miss handler is
> executed even if we have preemption disabled. So, given that the mapping exists,
> we do not need to retry with preemption enabled.
>
> Still, maybe I'm a bit paranoid here, but IMHO it's good to be ready for the corner case
> when the pointer is completely bogus and there's no mapping for it.
> I understand that today we only expect this syscall to be used from libc's
> internals, but as long as the syscall exists nothing stops anybody from using it
> directly without libc. So maybe instead of doing get_user_pages_fast() we should just
> send a SIGSEGV to the process? At least the user will realize there's a problem
> at an earlier stage.
>
> > Otherwise it is extremely unlikely to start with a TLB entry with Read
> > permissions, followed by the syscall Trap, only to find the entry missing, unless a
> > global TLB flush came from other cores right in the middle. But this syscall is
> > not guaranteed to work with SMP anyway, so let's ignore any SMP misdoings here.
>
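To illustrate why the emulation is UP-only, here is a hand-replayed interleaving. It assumes two CPUs execute the read-and-compare half and the write half of the emulation concurrently; preempt_disable() only keeps the local CPU from switching tasks, so nothing orders the two CPUs. All names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* State carried between the two halves of one emulated cmpxchg. */
struct half_done {
	int old;
	bool will_write;
};

/* First half: __get_user() plus the comparison. */
static struct half_done read_and_compare(const int *uaddr, int expected)
{
	struct half_done h;

	assert(uaddr);
	h.old = *uaddr;
	h.will_write = (h.old == expected);
	return h;
}

/* Second half: __put_user(); the return value would set the Z flag. */
static bool finish_write(int *uaddr, struct half_done h, int new_val)
{
	if (h.will_write)
		*uaddr = new_val;
	return h.will_write;
}
```

If both CPUs read 0 before either writes 1, both report success although only one increment survives.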
> Well, but that's exactly the situation I was debugging: we start with data from a read-only
> page, and on an attempt to write back the modified value the COW machinery gets involved.
>
> That was on a UP platform.
>
> > Now, in case it was *an* uninitialized var, do we have to guarantee any well-defined
> > semantics for the kernel emulation of cmpxchg? IMO it should be fine to
> > return 0 or -EFAULT etc. In fact -EFAULT is better, as it will force a retry loop on
> > the user side, given the typical cmpxchg usage pattern.
>
> The problem is that libc only expects to get a value read from memory,
> and in theory the expected value might be -14, which is exactly -EFAULT.
> Returning 0 is out of the question entirely, because in some cases that's exactly
> what user-space expects.
>
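A sketch of that ambiguity, assuming a caller that, like the typical cmpxchg loop, treats "returned value == expected" as success and ignores the Z flag. emu_cmpxchg() is a hypothetical emulation that returns -EFAULT on a fault instead of the value read, and the faults parameter simply forces that path:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical emulation: on a fault, -EFAULT replaces the value read. */
static int emu_cmpxchg(int *uaddr, bool faults, int expected, int new_val)
{
	assert(uaddr);
	if (faults)
		return -EFAULT;	/* nothing read, nothing written */

	int old = *uaddr;

	if (old == expected)
		*uaddr = new_val;
	return old;
}

/* Caller-side check in the style of a libc compare-and-swap loop. */
static bool caller_thinks_it_succeeded(int *uaddr, bool faults,
				       int expected, int new_val)
{
	return emu_cmpxchg(uaddr, faults, expected, new_val) == expected;
}
```

With any other expected value, the -EFAULT return just makes the caller retry; with expected == -14, a failed call is indistinguishable from a successful one.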
> So if we read an unexpected value, we'll just return it without even attempting
> to write.
>
> If we read the expected data but fail to write, we'll send a SIGSEGV and
> return whatever... let it be -EFAULT; the app will be killed on exit from
> this syscall anyway.
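
A user-space sketch of the outcomes proposed above. fake_uword and outcome are hypothetical; "readable" and "writable" stand for whether the access succeeds with preemption disabled, and the sigsegv flag marks where a SIGSEGV would be sent to the task:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user word and syscall outcome. */
struct fake_uword {
	int value;
	bool readable;
	bool writable;
};

struct outcome {
	int ret;	/* syscall return value */
	bool z_flag;	/* success indicator */
	bool sigsegv;	/* task killed on return to user space */
};

static struct outcome proposed_cmpxchg(struct fake_uword *w, int expected,
				       int new_val)
{
	struct outcome o = { .ret = -EFAULT, .z_flag = false, .sigsegv = false };

	assert(w);
	if (!w->readable) {		/* bogus pointer, no mapping */
		o.sigsegv = true;
		return o;
	}
	o.ret = w->value;
	if (o.ret != expected)		/* unexpected: return it, no write */
		return o;
	if (!w->writable) {		/* expected value, but write failed */
		o.sigsegv = true;
		o.ret = -EFAULT;	/* irrelevant, the task is killed */
		return o;
	}
	w->value = new_val;
	o.z_flag = true;
	return o;
}
```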

Any comments on my comments above?

-Alexey