Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults
From: Andreas Gruenbacher
Date: Sat Nov 27 2021 - 07:43:30 EST
On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
> On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
> > On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote:
> > > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
> > > > As per Linus' reply, we can work around this by doing
> > > > a sub-page fault_in_writable(point_of_failure, align) where 'align'
> > > > should cover the copy_to_user() impreciseness.
> > > >
> > > > (of course, fault_in_writable() takes the full size argument but behind
> > > > the scene it probes the 'align' prefix at sub-page fault granularity)
> > >
> > > That doesn't make sense; we don't want fault_in_writable() to fail or
> > > succeed depending on the alignment of the address range passed to it.
> >
> > If we know that the arch copy_to_user() has an error of say maximum 16
> > bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
> > to probe the first 16 bytes rather than 1.
>
> That isn't going to help one bit: [raw_]copy_to_user() is allowed to
> copy as little or as much as it wants as long as it follows the rules
> documented in include/linux/uaccess.h:
>
> [] If copying succeeds, the return value must be 0. If some data cannot be
> [] fetched, it is permitted to copy less than had been fetched; the only
> [] hard requirement is that not storing anything at all (i.e. returning size)
> [] should happen only when nothing could be copied. In other words, you don't
> [] have to squeeze as much as possible - it is allowed, but not necessary.
>
> When fault_in_writeable() tells us that an address range is accessible
> in principle, that doesn't mean that copy_to_user() will allow us to
> access it in arbitrary chunks. It's also not the case that
> fault_in_writeable(addr, size) is always followed by
> copy_to_user(addr, ..., size) for the exact same address range, not
> even in this case.
>
> These alignment restrictions have nothing to do with page or sub-page faults.
>
> I'm also fairly sure that passing in an unaligned buffer will send
> search_ioctl into an endless loop on architectures with copy_to_user()
> alignment restrictions; there don't seem to be any buffer alignment
> checks.
Let me retract that ...
The description in include/linux/uaccess.h leaves out permissible
reasons for fetching/storing less than requested. Thinking about it, if
the address range passed to one of the copy functions includes an
address that faults, it kind of makes sense to allow the copy function
to stop short instead of copying every last byte right up to the address
that fails.
If that's the only reason, then it would be great to have that included
in the description. And then we can indeed deal with the alignment
effects in fault_in_writeable().
> > > copy_to_user_nofault_unaligned() should be further optimized, maybe as
> > > mm/maccess.c:copy_from_kernel_nofault() and/or per architecture
> > > depending on the actual alignment rules; I'm not sure.
> > [...]
> > > --- a/fs/btrfs/ioctl.c
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key,
> > > return 1;
> > > }
> > >
> > > +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size)
> > > +{
> > > + size_t rest = copy_to_user_nofault(to, from, size);
> > > +
> > > + if (rest) {
> > > + size_t n;
> > > +
> > > + for (n = size - rest; n < size; n++) {
> > > + if (copy_to_user_nofault(to + n, from + n, 1))
> > > + break;
> > > + }
> > > + rest = size - n;
> > > + }
> > > + return rest;
> >
> > That's what I was trying to avoid. That's basically a fall-back to byte
> > at a time copy (we do this in copy_mount_options(); at some point we
> > even had a copy_from_user_exact() IIRC).
>
> We could try 8/4/2 byte chunks if both buffers are 8/4/2-byte aligned.
> It's just not clear that it's worth it.
>
> > Linus' idea (if I got it correctly) was instead to slightly extend the
> > probing in fault_in_writeable() for the beginning of the buffer from 1
> > byte to some per-arch range.
> >
> > I attempted the above here and works ok:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix
> >
> > but too late to post it this evening, I'll do it in the next day or so
> > as an alternative to this series.
I've taken a quick look. Under the assumption that alignment effects
are tied to page / sub-page faults, I think we can really solve this
generically as Willy has proposed. Maybe as shown below; no need for
arch-specific code.
Thanks,
Andreas
diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..a9b3d916b625 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
}
#endif /* !CONFIG_MMU */
+#define SUBPAGE_FAULT_SIZE 16
+
/**
* fault_in_writeable - fault in userspace address range for writing
* @uaddr: start of address range
@@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
if (unlikely(size == 0))
return 0;
if (!PAGE_ALIGNED(uaddr)) {
+ if (SUBPAGE_FAULT_SIZE &&
+ !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) {
+ end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE);
+ if (end - uaddr < size) {
+ if (unlikely(__put_user(0, uaddr) != 0))
+ return size;
+ uaddr = end;
+ if (unlikely(!end))
+ goto out;
+ }
+ }
if (unlikely(__put_user(0, uaddr) != 0))
- return size;
+ goto out;
uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
}
end = (char __user *)PAGE_ALIGN((unsigned long)start + size);