Is this patch in addition to, or instead of, the earlier core dump patch?
This is an addition. The previous coredump patch manually calls
memory_failure_queue() to cope with the corrupted page, similar to your
"Copy-on-write poison recovery"[1]. After some discussion, I think we
could add MCE_IN_KERNEL_COPYIN to all MC-safe copies, so that the
corrupted page is handled centrally in do_machine_check() rather than
one call site at a time.
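
To make the "central versus one-by-one" point concrete, here is a rough
userspace model of the idea. It is only a sketch and not the kernel code:
the enum, the flag values, and the functions classify() and
handle_machine_check() are hypothetical stand-ins, chosen to mirror the
names used in this thread.

```c
#include <stdint.h>

/* Hypothetical stand-ins: names echo the thread, values are illustrative. */
enum fixup_type { FIXUP_NONE, FIXUP_UACCESS, FIXUP_COPY, FIXUP_MCE_SAFE };

#define KFLAG_IN_KERNEL_RECOV  (1u << 0)
#define KFLAG_IN_KERNEL_COPYIN (1u << 1)

/* Classify a fault by the fixup type attached to the faulting instruction. */
static uint32_t classify(enum fixup_type t)
{
    switch (t) {
    case FIXUP_UACCESS:
    case FIXUP_COPY:
    case FIXUP_MCE_SAFE: /* the proposal: tag MC-safe copies the same way */
        return KFLAG_IN_KERNEL_COPYIN | KFLAG_IN_KERNEL_RECOV;
    default:
        return 0;
    }
}

/*
 * Central handler: every tagged copy gets page recovery queued here,
 * so individual callers no longer have to queue it themselves.
 * Returns 1 if recovery was queued for pfn, 0 otherwise.
 */
static int handle_machine_check(enum fixup_type t, unsigned long pfn,
                                unsigned long *queued_pfn)
{
    uint32_t kflags = classify(t);

    if (kflags & KFLAG_IN_KERNEL_COPYIN) {
        *queued_pfn = pfn; /* stands in for queueing memory-failure work */
        return 1;
    }
    return 0;
}
```

In this model a caller of an MC-safe copy does nothing special; the
decision to recover the page lives in one place.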
Thanks for the context. I see how this all fits together now.
Your patch looks good.
Reviewed-by: Tony Luck <tony.luck@xxxxxxxxx>
-Tony
One small observation from testing. I injected an error into an application,
which consumed the poisoned data and was sent a SIGBUS.
Kernel did not crash (hurrah!)
Console log said:
[ 417.610930] mce: [Hardware Error]: Machine check events logged
[ 417.618372] Memory failure: 0x89167f: recovery action for dirty LRU page: Recovered
... EDAC messages
[ 423.666918] MCE: Killing testprog:4770 due to hardware memory corruption fault at 7f8eccf35000
A core file was generated and saved in /var/lib/systemd/coredump
But my shell (/bin/bash) only said:
Bus error
not
Bus error (core dumped)
-Tony