Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults

From: Catalin Marinas
Date: Thu Nov 25 2021 - 15:46:07 EST


On Thu, Nov 25, 2021 at 10:13:25AM -0800, Linus Torvalds wrote:
> On Thu, Nov 25, 2021 at 3:10 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
> > For this specific btrfs case, if we want to go with tuning the offset based
> > on the fault address, we'd need copy_to_user_nofault() (or a new
> > function) to be exact.
>
> I really don't see why you harp on the exactness.

I guess because I always thought we either fix fault_in_writeable() to
probe the whole range (this series) or we change the loops to take the
copy_to_user() return value into account when re-faulting.

> I really believe that the fix is to make the read/write probing just
> be more aggressive.
>
> Make the read/write probing require that AT LEAST <n> bytes be
> readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and
> ALIGN is whatever size that copy_from/to_user_xyz() might require just
> because it might do multi-byte accesses.
>
> In fact, make ALIGN be perhaps something reasonable like 512 bytes or
> whatever, and then you know you can handle the btrfs "copy a whole
> structure and reset if that fails" case too.

IIUC what you are suggesting, we still need changes to the btrfs loop
similar to willy's, but that should work fine together with a slightly
more aggressive fault_in_writeable().

Probing at least sizeof(struct btrfs_ioctl_search_key) should suffice
without any loop changes. 512 would cover it, but a hard-coded 512
doesn't look generic enough. We could pass a 'probe_prefix' argument to
fault_in_exact_writeable() so that only this prefix is probed at
sub-page granularity, and btrfs would just specify the above sizeof().
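
To make the 'probe_prefix' idea concrete, here is a rough user-space
sketch, not kernel code. Everything prefixed sim_/SIM_ is made up for
illustration: the toy memory model, the 16-byte granule, and the
convention of returning the number of bytes not faulted in. The prefix
is probed granule by granule, the rest of the range only once per page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_GRANULE   16u                    /* assumed sub-page fault granule */
#define SIM_MEM_SIZE  (2u * SIM_PAGE_SIZE)

/* Toy "user address space": one poison flag per granule (hypothetical). */
struct sim_mem {
    bool poisoned[SIM_MEM_SIZE / SIM_GRANULE];
};

/* A single probe: true if the granule containing addr is accessible. */
static bool sim_probe_ok(const struct sim_mem *m, size_t addr)
{
    assert(addr < SIM_MEM_SIZE);
    return !m->poisoned[addr / SIM_GRANULE];
}

/*
 * Sketch of fault_in_exact_writeable(uaddr, size, probe_prefix):
 * probe every granule in the first min(size, probe_prefix) bytes,
 * then only one byte per page for the remainder.
 * Returns the number of bytes NOT faulted in (0 on success).
 */
static size_t sim_fault_in_exact_writeable(const struct sim_mem *m,
                                           size_t uaddr, size_t size,
                                           size_t probe_prefix)
{
    size_t end = uaddr + size;
    size_t prefix_end = uaddr + (size < probe_prefix ? size : probe_prefix);
    size_t addr = uaddr;

    while (addr < end) {
        size_t next;

        if (!sim_probe_ok(m, addr))
            return end - addr;

        /* Inside the prefix: step to the next granule boundary. */
        next = (addr / SIM_GRANULE + 1) * SIM_GRANULE;
        /* Past the prefix: step to the next page boundary instead. */
        if (next >= prefix_end)
            next = (addr / SIM_PAGE_SIZE + 1) * SIM_PAGE_SIZE;
        addr = next;
    }
    return 0;
}
```

With a poisoned granule at offset 64, a call with probe_prefix == 0
reports success (the page-granular probe never touches that granule),
while a prefix covering offset 64 reports the failure up front. A
caller that rewinds on a failed copy would pass the size of whatever it
rewinds over.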

> Don't require that the fundamental copying routines (and whatever
> fixup the code might need) be some kind of byte-precise - it's the
> error case that should instead be made stricter.
>
> If the user gave you a range that triggered a pointer color mismatch,
> then returning an error is fine, rather than say "we'll do as much as
> we can and waste time and effort on being byte-exact too".

Yes, I'm fine with not copying anything at all, I just want to avoid the
infinite loop.

I think we are down to three approaches:

1. Probe the whole range in fault_in_*() for sub-page faults, no need to
   worry about copy_*_user() failures.

2. Get fault_in_*() to probe a sufficient prefix to cover the uaccess
   inexactness. In addition, change the btrfs loop to fault-in from
   where the copy_to_user() failed (willy's suggestion combined with
   the more aggressive probing in fault_in_*()).

3. Implement fault_in_exact_writeable(uaddr, size, probe_prefix) where
   loops that do some rewind would pass an appropriate value.

(1) is this series, (2) requires changing the loop logic, (3) looks
pretty simple.

I'm happy to attempt either (2) or (3) with a preference for the latter.
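
For reference, the live-lock and the effect of (2) can be reproduced
with a small user-space model. This is only a sketch under simplifying
assumptions: the sim_* helpers are invented, the simulated copy reports
the failing byte exactly (real uaccess routines may not, which is why
(2) also wants the prefix probing), and a retry cap stands in for "loops
forever".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_GRANULE   16u          /* assumed sub-page fault granule */
#define SIM_MEM_SIZE  SIM_PAGE_SIZE
#define SIM_EFAULT    14
#define SIM_MAX_TRIES 1000         /* cap so the simulated live-lock ends */

/* Toy "user buffer": one poison flag per granule (hypothetical). */
struct sim_mem {
    bool poisoned[SIM_MEM_SIZE / SIM_GRANULE];
};

/* Page-granular fault-in: probes the first byte, then one byte per page. */
static bool sim_fault_in_ok(const struct sim_mem *m, size_t addr, size_t len)
{
    size_t end = addr + len;

    assert(end <= SIM_MEM_SIZE);
    while (addr < end) {
        if (m->poisoned[addr / SIM_GRANULE])
            return false;
        addr = (addr / SIM_PAGE_SIZE + 1) * SIM_PAGE_SIZE;
    }
    return true;
}

/* Simulated copy: returns the number of bytes NOT copied. */
static size_t sim_copy_to_user(const struct sim_mem *m, size_t addr, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (m->poisoned[(addr + i) / SIM_GRANULE])
            return len - i;
    return 0;
}

/*
 * Copy 'nr' items of 'item' bytes; a partially copied item is rewound.
 * refault_at_failure == false: retry faults in from the item start.
 * refault_at_failure == true:  retry faults in from where the copy failed.
 * Returns 0 on success, -SIM_EFAULT on a detected fault, and 1 if the
 * retry cap was hit (the stand-in for a live-lock).
 */
static int sim_search_loop(const struct sim_mem *m, size_t item, size_t nr,
                           bool refault_at_failure)
{
    size_t off = 0, fault_from = 0, total = item * nr;
    int tries = 0;

    while (off < total) {
        size_t left;

        if (++tries > SIM_MAX_TRIES)
            return 1;
        if (!sim_fault_in_ok(m, fault_from, total - fault_from))
            return -SIM_EFAULT;

        left = sim_copy_to_user(m, off, item);
        if (left) {
            /* Rewind: 'off' stays put, only the fault-in start changes. */
            fault_from = refault_at_failure ? off + item - left : off;
            continue;
        }
        off += item;
        fault_from = off;
    }
    return 0;
}
```

With 32-byte items and a poisoned granule at offset 80 (the middle of
the third item), the loop that re-faults from the item start hits the
retry cap, while the variant that re-faults from the failing address
returns -SIM_EFAULT on the second pass.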

--
Catalin