Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()

From: Lorenzo Stoakes
Date: Sun Mar 26 2023 - 10:21:11 EST


On Sun, Mar 26, 2023 at 01:26:57PM +0000, David Laight wrote:
> From: Baoquan He
> > Sent: 23 March 2023 13:32
> ...
> > > > > If this fails, then we fault in, and try again. We loop because there could
> > > > > be some extremely unfortunate timing with a race on e.g. swapping out or
> > > > > migrating pages between faulting in and trying to write out again.
> > > > >
> > > > > This is extremely unlikely, but to avoid any chance of breaking userland we
> > > > > repeat the operation until it completes. In nearly all real-world
> > > > > situations it'll either work immediately or loop once.
> > > >
> > > > Thanks a lot for these helpful details and your patience. I get it
> > > > now. I was mainly confused by the while(true) loop in the
> > > > KCORE_VMALLOC case of read_kcore_iter().
> > > >
> > > > Now, is there any chance the faulted-in memory is swapped out or
> > > > migrated again before vread_iter()? Does fault_in_iov_iter_writeable()
> > > > pin the memory? I couldn't find that in the code or documentation;
> > > > it seems to only fault the memory in. If so, there's a window between
> > > > faulting in and copy_to_user_nofault().
> > > >
> > >
> > > See the documentation of fault_in_safe_writeable():
> > >
> > > "Note that we don't pin or otherwise hold the pages referenced that we fault
> > > in. There's no guarantee that they'll stay in memory for any duration of
> > > time."
> >
> > Thanks for the info. Then swap-out/migration could happen again; that's
> > why the while(true) loop is meaningful.
>
> One of the problems is that if the system is under severe memory
> pressure and you try to fault in (say) 20 pages, the first page
> might get unmapped in order to map the last one in.
>
> So it is quite likely better to retry 'one page at a time'.

If you look at the kcore code, it is in fact only faulting in one page at a
time: tsz never exceeds PAGE_SIZE, so we never attempt to fault in or copy
more than a single page per iteration, e.g.:-

	if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
		tsz = buflen;

...

tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);

It might be a good idea to make this totally explicit in vread_iter()
(perhaps making it vread_page_iter() or such), but I think that might be
good for another patch series.
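
For reference, here is roughly what the loop under discussion looks like -
a simplified sketch of the KCORE_VMALLOC case in read_kcore_iter(). The
function names are the ones from the patch, but the surrounding details
are trimmed and may differ from the posted code:

	const char *src = (char *)start;
	size_t left = tsz;

	while (true) {
		/* Optimistically copy out; vmalloc uses spinlocks, so we
		 * must not fault while they are held. */
		size_t copied = vread_iter(iter, src, left);

		src += copied;
		left -= copied;
		if (!left)
			break;

		/* Fault the remaining destination pages in and retry.
		 * Nothing pins them, so in theory they could be swapped
		 * out or migrated again before the retry - hence the
		 * loop. */
		if (fault_in_iov_iter_writeable(iter, left)) {
			ret = -EFAULT;
			goto out;
		}
	}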

>
> There have also been cases where the instruction to copy data
> has faulted for reasons other than 'page fault'.
> ISTR an infinite loop being caused by misaligned accesses failing
> due to 'bad instruction choice' in the copy code.
> While this is really a bug, an infinite retry in a file read/write
> didn't make it easy to spot.

I'm not sure it's reasonable to avoid writing code like this just in case
an arch implements buggy user copy code (do correct me if I'm
misunderstanding you about this!). By that token, wouldn't a lot more be
broken in that situation? I don't imagine other areas of the kernel would
make it explicitly clear to you that this was the problem.

>
> So maybe there are cases where a dropping back to a 'bounce buffer'
> may be necessary.

One approach could be to reinstate the kernel bounce buffer, set up an
iterator that points to it and pass that in after one attempt with
userland.
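
Something like the below, perhaps. This is entirely illustrative - the
bounce/kiter names and the one-attempt-then-fallback structure are just a
sketch, not anything from the patch:

	/* Hypothetical fallback path, taken after one failed attempt
	 * against the user iterator. */
	char *bounce = kmalloc(tsz, GFP_KERNEL);
	struct kvec kvec = { .iov_base = bounce, .iov_len = tsz };
	struct iov_iter kiter;

	if (!bounce)
		return -ENOMEM;

	iov_iter_kvec(&kiter, ITER_DEST, &kvec, 1, tsz);

	/* Reading into a kernel buffer cannot fault, so this always
	 * completes. */
	vread_iter(&kiter, (char *)start, tsz);

	/* Copy to userspace with no vmalloc locks held; a fault here
	 * simply becomes an error rather than a retry. */
	if (copy_to_iter(bounce, tsz, iter) != tsz) {
		kfree(bounce);
		return -EFAULT;
	}
	kfree(bounce);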

But it feels a bit like overkill, as in the case of an alignment issue,
surely that would still occur and it'd just error out anyway? Again, I'm
not sure bending over backwards to account for possibly buggy arch code is
sensible.

Ideally the iterator code would explicitly pass back the EFAULT error,
which we could then handle explicitly, but that'd probably require quite
significant rework there, which feels out of scope for this change.

We could implement some maximum number of attempts, which statistically
must reduce the odds of repeated faults in the tiny window between
fault-in and copy to effectively zero. But I'm not sure the other David
would be happy with that!

If we were to make a change to be extra careful, I'd opt for simply trying
X times and then giving up. Given we're doing this a page at a time, I
don't think X need be very large before any swap-out/migrate bad luck
becomes so unlikely that we're competing with heat-death-of-the-universe
timescales before it could happen (again, I may be missing some common
scenario where the same single page swaps out/migrates over and over -
please correct me if so).
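
For illustration, a bounded variant might look like the below; the retry
limit, its name, and the loop structure are made up for the example:

	/* Hypothetical cap on fault-in/retry rounds; not in the patch. */
	#define MAX_FAULT_RETRIES	10

	int tries = 0;

	while (left) {
		size_t copied = vread_iter(iter, src, left);

		src += copied;
		left -= copied;

		/* Give up rather than loop forever if the destination
		 * pages keep going away between fault-in and copy, or
		 * if fault-in itself fails (a genuine -EFAULT). */
		if (left && (++tries > MAX_FAULT_RETRIES ||
			     fault_in_iov_iter_writeable(iter, left)))
			return -EFAULT;
	}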

However, I think there's a case to be made that it's fine as-is, unless
there's another scenario we should be particularly concerned about?

>
> David
>