Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults

From: Catalin Marinas
Date: Sat Nov 27 2021 - 10:23:15 EST


On Sat, Nov 27, 2021 at 01:39:58PM +0100, Andreas Gruenbacher wrote:
> On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
> > On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
> > > If we know that the arch copy_to_user() has an error of say maximum 16
> > > bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
> > > to probe the first 16 bytes rather than 1.
> >
> > That isn't going to help one bit: [raw_]copy_to_user() is allowed to
> > copy as little or as much as it wants as long as it follows the rules
> > documented in include/linux/uaccess.h:
> >
> > [] If copying succeeds, the return value must be 0.  If some data cannot be
> > [] fetched, it is permitted to copy less than had been fetched; the only
> > [] hard requirement is that not storing anything at all (i.e. returning size)
> > [] should happen only when nothing could be copied.  In other words, you don't
> > [] have to squeeze as much as possible - it is allowed, but not necessary.
> >
> > When fault_in_writeable() tells us that an address range is accessible
> > in principle, that doesn't mean that copy_to_user() will allow us to
> > access it in arbitrary chunks. It's also not the case that
> > fault_in_writeable(addr, size) is always followed by
> > copy_to_user(addr, ..., size) for the exact same address range, not
> > even in this case.
> >
> > These alignment restrictions have nothing to do with page or sub-page faults.
> >
> > I'm also fairly sure that passing in an unaligned buffer will send
> > search_ioctl into an endless loop on architectures with copy_to_user()
> > alignment restrictions; there don't seem to be any buffer alignment
> > checks.
>
> Let me retract that ...
>
> The description in include/linux/uaccess.h leaves out permissible
> reasons for fetching/storing less than requested. Thinking about it, if
> the address range passed to one of the copy functions includes an
> address that faults, it kind of makes sense to allow the copy function
> to stop short instead of copying every last byte right up to the address
> that fails.
>
> If that's the only reason, then it would be great to have that included
> in the description. And then we can indeed deal with the alignment
> effects in fault_in_writeable().

Ah, I started replying last night, sent it today without seeing your
follow-up.

> > > I attempted the above here and works ok:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix
> > >
> > > but too late to post it this evening, I'll do it in the next day or so
> > > as an alternative to this series.
>
> I've taken a quick look. Under the assumption that alignment effects
> are tied to page / sub-page faults, I think we can really solve this
> generically as Willy has proposed.

I think Willy's proposal stopped at the page boundary, it should go
beyond.

> Maybe as shown below; no need for arch-specific code.
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 2c51e9748a6a..a9b3d916b625 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> }
> #endif /* !CONFIG_MMU */
>
> +#define SUBPAGE_FAULT_SIZE 16
> +
> /**
> * fault_in_writeable - fault in userspace address range for writing
> * @uaddr: start of address range
> @@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
> if (unlikely(size == 0))
> return 0;
> if (!PAGE_ALIGNED(uaddr)) {
> + if (SUBPAGE_FAULT_SIZE &&
> + !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) {
> + end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE);
> + if (end - uaddr < size) {
> + if (unlikely(__put_user(0, uaddr) != 0))
> + return size;
> + uaddr = end;
> + if (unlikely(!end))
> + goto out;
> + }
> + }
> if (unlikely(__put_user(0, uaddr) != 0))
> - return size;
> + goto out;
> uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
> }
> end = (char __user *)PAGE_ALIGN((unsigned long)start + size);

That's similar, somehow, to the arch-specific probing in one of my
patches: [1]. We could do the above if we can guarantee that the maximum
error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For
arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever
need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes
even if aligned: reads of large blocks are done in 4 * 16 loads, and if
one of them fails e.g. because of the 16-byte sub-page fault, no write
is done, hence such larger than 16 delta).

If you want something in the generic fault_in_writeable(), we probably
need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE
increments. But I thought I'd rather keep this in the arch-specific
code.

Of course, the above fault_in_writeable() still needs the btrfs
search_ioctl() counterpart to change the probing on the actual fault
address or offset.

In the general case (uaccess error margin larger), I'm not entirely
convinced we can skip the check if PAGE_ALIGNED(uaddr). I should
probably get this logic through CBMC (or TLA+), I can't think it
through.

Thanks.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=devel/btrfs-live-lock-fix&id=af7e96d9e9537d9f9cc014f388b7b2bb4a5bc343

--
Catalin