Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

From: Mike Rapoport
Date: Mon May 03 2021 - 07:34:00 EST


On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote:
> On 03.05.21 11:28, Mike Rapoport wrote:
> > On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
> > > On 02.05.21 08:34, Mike Rapoport wrote:
> > > > On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> > > > > Let's properly synchronize with drivers that set PageOffline(). Unfreeze
> > > > > every now and then, so drivers that want to set PageOffline() can make
> > > > > progress.
> > > > >
> > > > > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> > > > > ---
> > > > > fs/proc/kcore.c | 15 +++++++++++++++
> > > > > 1 file changed, 15 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > > > index 92ff1e4436cb..3d7531f47389 100644
> > > > > --- a/fs/proc/kcore.c
> > > > > +++ b/fs/proc/kcore.c
> > > > > @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
> > > > > static ssize_t
> > > > > read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > > {
> > > > > + size_t page_offline_frozen = 0;
> > > > > char *buf = file->private_data;
> > > > > size_t phdrs_offset, notes_offset, data_offset;
> > > > > size_t phdrs_len, notes_len;
> > > > > @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > > pfn = __pa(start) >> PAGE_SHIFT;
> > > > > page = pfn_to_online_page(pfn);
> > > >
> > > > Can't this race with page offlining for the first time we get here?
> > >
> > >
> > > To clarify, we have three types of offline pages in the kernel ...
> > >
> > > a) Pages part of an offline memory section; the memmap is stale and not
> > > trustworthy. pfn_to_online_page() checks that. We *can* protect against
> > > memory offlining using get_online_mems()/put_online_mems(), but we usually
> > > avoid doing so as the race window is very small (a problem that exists all
> > > over the kernel and that we basically never hit) and the locking is rather
> > > expensive. In the future, we might switch to RCU to handle this more
> > > efficiently and avoid these possible races.
> > >
> > > b) PageOffline(): logically offline pages contained in an online memory
> > > section with a sane memmap. virtio-mem calls these pages "fake offline";
> > > something like a "temporary" memory hole. The new mechanism I propose will
> > > be used to handle the synchronization, as the races can be more severe
> > > here, e.g., when reading actual page content.
> > >
> > > c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
> > > but could become harmful in the future. So we'd better try to remove the
> > > page from the page allocator and try to migrate away existing users.
> > >
> > >
> > > So page_offline_* handles "b) PageOffline()" only. There is a tiny race
> > > between pfn_to_online_page(pfn) and looking at the memmap, as we already
> > > have in many places throughout the kernel; it is to be tackled in the
> > > future.
> >
> > Right, but here you anyway add locking, so why exclude the first iteration?
>
> What we're protecting is PageOffline() below. If I didn't mess up, we should
> always be calling page_offline_freeze() before calling PageOffline(). Or am
> I missing something?

Somehow I was under the impression that we are protecting both
pfn_to_online_page() and PageOffline().
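
I.e., I read the KCORE_RAM path as covering both checks, roughly (a
simplified sketch; the names are taken from the quoted hunk, the branch
body is elided):

	pfn = __pa(start) >> PAGE_SHIFT;
	page = pfn_to_online_page(pfn);	/* your case a): racy vs. section offlining */
	if (!page || PageOffline(page))	/* your case b): what the new lock protects */
		/* skip reading the actual page contents */

but as you say, only the PageOffline() test needs page_offline_freeze().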

> > BTW, did you consider something like
>
> Yes, I played with something like that. We'd have to handle the first
> page_offline_freeze() differently, though, and that's where things
> got a bit ugly in my attempts.
>
> >
> > if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> > page_offline_unfreeze();
> > cond_resched();
> > page_offline_freeze();
> > }
> >
> > We don't seem to care about page_offline_frozen overflows here, do we?
>
> No, the buffer length is also a size_t, and page_offline_frozen gets
> incremented at most once per processed byte, so it cannot overflow.
> The variant I have right now looked the cleanest to me. Happy to hear
> simpler alternatives.

Well, taking the lock for the first time before the while() loop and doing
the resched-relock outside the switch() would definitely be nicer, and it
would make the last unlock unconditional.
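
Something along these lines (just a sketch; I'm assuming the counter
starts at 1 so that the very first iteration does not immediately
unfreeze):

	size_t page_offline_frozen = 1;

	...

	page_offline_freeze();

	while (buflen) {
		/*
		 * Every MAX_ORDER_NR_PAGES iterations drop the lock, so
		 * drivers that want to set PageOffline() can make
		 * progress, then take it again.
		 */
		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
			page_offline_unfreeze();
			cond_resched();
			page_offline_freeze();
		}

		switch (m->type) {
		/* ... KCORE_VMALLOC, KCORE_RAM, etc. ... */
		}
	}

	page_offline_unfreeze();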

The cost of preventing memory offline during reads of the !KCORE_RAM parts
does not seem that significant to me, but I may be missing something.

--
Sincerely yours,
Mike.