Re: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing
From: Chen Yucong
Date: Tue May 27 2014 - 22:49:32 EST
> From: Borislav Petkov <bp@xxxxxxx>
>
> Hi all,
>
> this is something Tony and I have been working on behind the curtains
> recently. Here it is in a RFC form, it passes quick testing in kvm. Let
> me send it out before I start hammering on it on a real machine.
>
> More indepth info about what it is and what it does is in patch 1/3.
>
> As always, comments and suggestions are most welcome.
>
> Thanks.
What's the point of this patch set?
My understanding is that once a specific page frame has accumulated a
certain number (COUNT_MASK) of corrected DRAM ECC errors, we can assume
the page frame is so ill that it should be isolated as soon as possible.
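
To make the counting idea above concrete, here is a minimal userspace
sketch, not the code from the patch set. The table layout, the names
ce_table/ce_record and the values CE_TABLE_SIZE and CE_THRESHOLD are
all hypothetical; CE_THRESHOLD only stands in for the limit that
COUNT_MASK implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CE_TABLE_SIZE 64	/* hypothetical table capacity */
#define CE_THRESHOLD  4		/* hypothetical stand-in for the count limit */

struct ce_entry {
	uint64_t pfn;		/* page frame number being tracked */
	unsigned int count;	/* corrected errors seen so far */
	bool used;		/* slot currently holds a pfn */
};

struct ce_table {
	struct ce_entry slot[CE_TABLE_SIZE];
};

/*
 * Record one corrected error for @pfn.
 *
 * Returns true when the frame has reached CE_THRESHOLD errors and
 * should be isolated; its slot is released at that point. Returns
 * false otherwise, including when the table is full and the error
 * cannot be tracked.
 */
static bool ce_record(struct ce_table *t, uint64_t pfn)
{
	struct ce_entry *hit = NULL, *free_slot = NULL;
	size_t i;

	assert(t != NULL);

	for (i = 0; i < CE_TABLE_SIZE; i++) {
		struct ce_entry *e = &t->slot[i];

		if (e->used && e->pfn == pfn) {
			hit = e;
			break;
		}
		if (!e->used && !free_slot)
			free_slot = e;
	}

	if (!hit) {
		if (!free_slot)
			return false;	/* table full, error is dropped */
		hit = free_slot;
		hit->pfn = pfn;
		hit->count = 0;
		hit->used = true;
	}

	if (++hit->count >= CE_THRESHOLD) {
		hit->used = false;	/* caller isolates the frame now */
		return true;
	}
	return false;
}
```

With these values, the fourth corrected error reported for the same pfn
makes ce_record() return true; errors on other frames are counted
independently.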
The problem is that memory_failure() cannot be used to isolate a page
frame that is in use by the kernel, because in that case it just poisons
the page and the error is IGNORED. memory_failure() is mostly used for
handling AR/AO (action required/action optional) type errors on page
frames that userspace tasks are currently using.
Although the affected page frame is very ill, it is not dead and can
still work. However, memory_failure() may kill the userspace tasks,
especially when the page frame holds dynamic data rather than
file-backed (file/swap) data.
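
The concern can be summarized as a small decision table. The sketch
below is only a simplified model of the outcomes described in this
mail, not kernel code; the enum names and model_hard_poison() are
hypothetical, and the real result depends on page state that is
ignored here.

```c
#include <assert.h>
#include <stdbool.h>

/* Who is using the ill page frame (hypothetical, simplified). */
enum page_user {
	PAGE_KERNEL,		/* in use by the kernel itself */
	PAGE_USER_DYNAMIC,	/* userspace page holding dynamic data */
	PAGE_USER_FILE_BACKED,	/* userspace page with a file/swap copy */
};

/* Outcome of hard-poisoning the frame, as described in this mail. */
enum mf_outcome {
	OUTCOME_IGNORED,	/* page is poisoned, error is ignored */
	OUTCOME_TASK_KILLED,	/* the task using the page may be killed */
	OUTCOME_DROPPED,	/* page can be dropped and re-read later */
};

static enum mf_outcome model_hard_poison(enum page_user who)
{
	switch (who) {
	case PAGE_KERNEL:
		return OUTCOME_IGNORED;
	case PAGE_USER_DYNAMIC:
		return OUTCOME_TASK_KILLED;
	case PAGE_USER_FILE_BACKED:
		return OUTCOME_DROPPED;
	}
	assert(0 && "unknown page user");
	return OUTCOME_IGNORED;
}

/* True only when poisoning isolates the frame without collateral damage. */
static bool isolation_is_clean(enum page_user who)
{
	return model_hard_poison(who) == OUTCOME_DROPPED;
}
```

In this model only one of the three cases isolates the frame cleanly,
even though the memory behind it still works after a corrected error.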
So I do not think that it is a good idea to directly use memory_failure
in this patch set.
thx!
cyc