Re: [PATCH v2] mm/slub: introduce SLAB_WARN_ON_ERROR

From: Miles Chen
Date: Tue Jan 29 2019 - 20:43:41 EST


On Tue, 2019-01-29 at 19:46 +0000, Christopher Lameter wrote:
> On Tue, 29 Jan 2019, Miles Chen wrote:
>
> > a) classic slub issue. e.g., use-after-free, redzone overwritten. It's
> > more efficient to report an issue as soon as slub detects it (compared
> > to monitoring the log, setting a breakpoint, and reproducing the issue).
> > With the coredump file, we can analyze the issue.
>
> What usually happens is that the system fails with a strange error
> message. Then the system is rebooted using slub_debug options and the
> issue is reproduced yielding more information about the problem.
>
> Then you run the scenario again with additional debugging in the subsystem
> that caused the problem.

Thanks for your comments and patience.

I now understand the difference between our workflows.
I usually enable CONFIG_SLUB_DEBUG=y and CONFIG_SLUB_DEBUG_ON=y, set
slub_debug by default, and run all tests that way (eng mode).
I do not hit an issue first and then set up slub_debug to reproduce
it.

CONFIG_SLUB_DEBUG is disabled in product builds.
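
As a rough illustration, an eng-mode setup along these lines could look
like the fragment below. This is a sketch, not necessarily the exact
configuration; the slub_debug flag letters (F = sanity checks, Z = red
zoning, P = poisoning, U = user tracking) are an assumption picked for
the example.

    # kernel .config (engineering builds only)
    CONFIG_SLUB_DEBUG=y
    CONFIG_SLUB_DEBUG_ON=y

    # kernel command line
    slub_debug=FZPU panic_on_warn=1

Note that panic_on_warn=1 turns a slab error into a panic (and, with a
crash-dump mechanism set up, a coredump) only if the slub error path
issues a warning, which is what the proposed option is meant to do.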

>
> So you are already reproducing the issue because you need to activate
> debugging to get more information. Doing it for the 3rd time is not that
> much more difficult.
>
> None of your modifications will be active in a production kernel.
> slub_debug must be activated to use it and thus you are already
> reproducing the issue.
>
> > b) memory corruption issues caused by h/w write. e.g., memory
> > overwritten by a DMA engine. Memory corruptions may or may not be
> > related to the slab cache that reports an error. For example:
> > kmalloc-256 or dentry may report the same errors. If we can preserve
> > the coredump file without any restore/reset processing in slub, we
> > could have more information about this memory corruption.
>
> If debugging is active then reporting will include the accurate slab cache
> affected. The memory layout is already changing when you enable the
> existing debugging code. None of your code runs without that and thus it
> cannot add a coredump for the prod case without debugging.

I usually set slub_debug by default and get the coredump file.

> > c) memory corruption issues caused by unstable h/w. e.g., bit flipping
> > because of xxxx DRAM die or applying new power settings. It's hard to
> > reproduce this kind of issue and it is much easier to identify this kind
> > of issue in the coredump file without any restore/reset processing.
>
> But then your patch does not help in this situation because the code has to
> be enabled by special slub debug options.
>
>
> > Users can set the option by slub_debug. We can still have the original
> > behavior (keep the system alive) if the option is not set. We can turn on
> > the option when we need the coredump file (with panic_on_warn set,
> > of course).
>
> I think we would need to turn on debugging by default and have your patch
> for this to make sense. We are already reproducing the issue multiple
> times for debugging. This patch does not change that.
>
Yes, I turn on debugging by default. Does that make sense now?

Thanks again for your comments.