Re: [PATCH 5/9] HWPoison: add memory_failure_queue()
From: Ingo Molnar
Date: Wed May 25 2011 - 10:09:20 EST
* Luck, Tony <tony.luck@xxxxxxxxx> wrote:
> In your proposed solution, we'd generate an event that would be
> handled by some process/daemon ... but how would we ensure that the
> affected process does not run in the mean time? Could we create
> some analogous method to the ptrace stopped state, and hand control
> of the affected process to the daemon that gets the event?
Ok, I think there is a bit of a misunderstanding here - which is not
a surprise really: we made generic arguments all along with very few
specifics.
The RAS daemon would deal with 'slow' policy action: fully recovered
events. It would also log various events so that people can do
post-mortem analysis etc.
The main point of defining events here is so that there's a single
method of transport and a single flexible method of defining and
extracting events.
Some of the event processing would occur in the kernel: in code that
knows about memory_failure() and calls it while making sure we do not
execute any user-space instruction.
Some of the code would execute *very* early and in a very atomic way,
still in NMI context: panicking the box if the error is that severe.
Neither of these is a step that the RAS daemon can or wants to
handle.
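To make the split above concrete, here is a minimal user-space sketch of the
routing decision. Every name in it (ras_severity, ras_handler, ras_event,
ras_route_event) is hypothetical and invented for illustration; none of it is
existing kernel API, and the three severity levels are an assumption about
how the cases described above might be classified.

```c
/* Hypothetical severity classes, one per case described in the mail. */
enum ras_severity {
    RAS_SEV_CORRECTED,   /* fully recovered: only logging/policy is left */
    RAS_SEV_RECOVERABLE, /* needs memory_failure()-style in-kernel handling */
    RAS_SEV_FATAL        /* too severe to continue: panic from NMI context */
};

/* Who acts on the event first (hypothetical names). */
enum ras_handler {
    RAS_HANDLE_NMI_PANIC, /* very early, atomic, still in NMI context */
    RAS_HANDLE_KERNEL,    /* kernel code, before any user-space instruction */
    RAS_HANDLE_DAEMON     /* 'slow' policy action in the RAS daemon */
};

/* Illustrative event record; the real event layout is not specified here. */
struct ras_event {
    enum ras_severity severity;
    unsigned long pfn; /* affected page frame, assumed for illustration */
};

/* Pick the first responder for an event based purely on its severity. */
enum ras_handler ras_route_event(const struct ras_event *ev)
{
    switch (ev->severity) {
    case RAS_SEV_FATAL:
        return RAS_HANDLE_NMI_PANIC;
    case RAS_SEV_RECOVERABLE:
        return RAS_HANDLE_KERNEL;
    case RAS_SEV_CORRECTED:
    default:
        return RAS_HANDLE_DAEMON;
    }
}
```

In this sketch the daemon only ever becomes the first responder for events
that are already fully recovered; the other two classes are decided before
user-space gets to run, and would merely be logged through the same event
transport afterwards.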
The RAS tools would interact with the regular perf facilities, setting
up and configuring the various RAS-related events. They'd handle the
'severity' config bits, they'd initiate testing (injection), etc.
Ideally the RAS daemon and tools would do what syslog does (and
more), with more structured events. At the end of the day most of the
'policy action' is taken by humans anyway, who want to take a look at
some ASCII output. So printk() integration and obvious ASCII output
for everything is important along the way.
> 2) The memory error was found in certain special sections of the
> kernel for which recovery is possible (e.g. while copying to/from
> user memory, perhaps also page copy and page clear).
>
> Here I don't have a solution. TIF_MCE_NOTIFY isn't checked when
> returning from do_machine_check() to kernel code.
Well, since we are already in interrupt context (albeit in a very
atomic NMI context), sending a self-IPI is not strictly necessary. We
could fix up the return address and jump to the right handler
straight away during the IRET.
A self-IPI might also not execute *immediately* - there's always the
chance of APIC related delays.
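A rough user-space model of the "fix up the return address" idea, in the
spirit of the kernel's exception tables but not taken from them. The names
fake_regs, mce_fixup_entry and mce_fixup_return are hypothetical, and the
assumption that recoverable kernel regions can be described as address ranges
with one recovery handler each is mine, for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the saved register frame; only the return address matters. */
struct fake_regs {
    unsigned long ip;
};

/* One recoverable code region and the handler to resume in (hypothetical). */
struct mce_fixup_entry {
    unsigned long start;   /* first address of the recoverable region */
    unsigned long end;     /* one past the last address of the region */
    unsigned long handler; /* where execution should continue instead */
};

/*
 * If the interrupted address lies in a known recoverable region, rewrite
 * the saved return address so that the return from the exception lands in
 * the recovery handler. Returns false when no region matches, in which
 * case the caller would treat the error as fatal.
 */
bool mce_fixup_return(struct fake_regs *regs,
                      const struct mce_fixup_entry *table, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (regs->ip >= table[i].start && regs->ip < table[i].end) {
            regs->ip = table[i].handler;
            return true;
        }
    }
    return false;
}
```

Because the saved frame is modified before returning, the handler starts
running as soon as the exception returns, with no dependency on interrupt
delivery.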
> In a CONFIG_PREEMPT=y kernel, all of the recoverable cases ought to
> be in places where pre-emption is allowed ... so perhaps we can
> also use the stop-and-switch option here?
Yes, these are generally preemptible cases - and if they are not we
can make the error fatal (we do not have to handle *every* complex
case, giving up is a fair answer as well - we do not want rare code
to be complex really).
But you don't need to stop-and-switch: just stack-nesting on top of
whatever preemptible code was running there would be enough, wouldn't
it? That stops a task from executing until the decision has been made
whether it can continue or not.
Thanks,
Ingo