On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote:
On 03.05.21 11:28, Mike Rapoport wrote:
On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
On 02.05.21 08:34, Mike Rapoport wrote:
On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
Let's properly synchronize with drivers that set PageOffline(). Unfreeze
every now and then, so drivers that want to set PageOffline() can make
progress.
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
fs/proc/kcore.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 92ff1e4436cb..3d7531f47389 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
static ssize_t
read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
{
+	size_t page_offline_frozen = 0;
 	char *buf = file->private_data;
 	size_t phdrs_offset, notes_offset, data_offset;
 	size_t phdrs_len, notes_len;
@@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			pfn = __pa(start) >> PAGE_SHIFT;
 			page = pfn_to_online_page(pfn);
Can't this race with page offlining the first time we get here?
To clarify, we have three types of offline pages in the kernel ...
a) Pages part of an offline memory section; the memmap is stale and not
trustworthy. pfn_to_online_page() checks that. We *can* protect against
memory offlining using get_online_mems()/put_online_mems(), but we usually
avoid doing so as the race window is very small (a problem all over the
kernel that we basically never hit) and the locking is rather expensive. In
the future, we might switch to RCU to handle that more efficiently and avoid
these possible races.
b) PageOffline(): logically offline pages contained in an online memory
section with a sane memmap. virtio-mem calls these pages "fake offline";
something like a "temporary" memory hole. The new mechanism I propose will
be used to handle synchronization as races can be more severe, e.g., when
reading actual page content here.
c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
but could become harmful in the future. So we better try to remove the page
from the page allocator and migrate away existing users.
So page_offline_* handle "b) PageOffline()" only. There is a tiny race
between pfn_to_online_page(pfn) and looking at the memmap, as we already
have in many cases throughout the kernel, to be tackled in the future.
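To illustrate what the checks boil down to on the read side, here is a rough
sketch (illustration only; kcore_pfn_readable() is a made-up helper name, and
the real read_kcore() code is structured differently):

#include <linux/memory_hotplug.h>	/* pfn_to_online_page() */
#include <linux/page-flags.h>		/* PageOffline() */

/*
 * Sketch: can the page backing @pfn be read? The caller is assumed to
 * hold the PageOffline() freeze via page_offline_freeze(), so drivers
 * like virtio-mem cannot flip PageOffline() under us.
 */
static bool kcore_pfn_readable(unsigned long pfn)
{
	struct page *page;

	/* case a): offline section, the memmap is stale */
	page = pfn_to_online_page(pfn);
	if (!page)
		return false;

	/* case b): logically ("fake") offline page, don't touch its contents */
	if (PageOffline(page))
		return false;

	return true;
}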
Right, but here you anyway add locking, so why exclude the first iteration?
What we're protecting is PageOffline() below. If I didn't mess up, we should
always be calling page_offline_freeze() before calling PageOffline(). Or am
I missing something?
Somehow I was under the impression that we are protecting both
pfn_to_online_page() and PageOffline().
BTW, did you consider something like
Yes, I played with something like that. We'd have to handle the first
page_offline_freeze() differently, though, and that's where things got a bit
ugly in my attempts.
if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
	page_offline_unfreeze();
	cond_resched();
	page_offline_freeze();
}
We don't seem to care about page_offline_frozen overflows here, do we?
No, the buffer size is also size_t and gets incremented on a per-byte basis.
The variant I have right now looked the cleanest to me. Happy to hear
simpler alternatives.
Well, locking for the first time before the while() loop and doing the
resched-relock outside the switch() would definitely be nicer, and it makes
the last unlock unconditional.

The cost of preventing memory offline during reads of the !KCORE_RAM parts
does not seem that significant to me, but I may be missing something.
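Something like the following is what I had in mind (only a sketch; the
copy-out, error handling and the rest of read_kcore() are omitted):

page_offline_freeze();

while (buflen) {
	/*
	 * Every now and then, let drivers that want to set PageOffline()
	 * make progress, and give the scheduler a chance.
	 */
	if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
		page_offline_unfreeze();
		cond_resched();
		page_offline_freeze();
	}

	switch (m->type) {
	case KCORE_RAM:
		/* pfn_to_online_page() / PageOffline() checks go here */
		break;
	default:
		/* !KCORE_RAM parts simply keep the freeze held */
		break;
	}

	/* ... copy the data and advance buffer/buflen/fpos ... */
}

page_offline_unfreeze();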