Re: [PATCH] mm/page_owner: print largest memory consumer when OOM panic occurs
From: Miles Chen
Date: Wed Dec 25 2019 - 04:31:31 EST
On Tue, 2019-12-24 at 08:47 -0500, Qian Cai wrote:
>
> > On Dec 24, 2019, at 1:45 AM, Miles Chen <miles.chen@xxxxxxxxxxxx> wrote:
> >
> > We use kmemleak too, but a memory leakage which is caused by
> > alloc_pages() in a kernel device driver cannot be caught by kmemleak.
> > We have fought against these kinds of real problems for a few years
> > and found a way to make the debugging easier.
> >
> > We currently have information during OOM: process Node, zone, swap,
> > process (pid, rss, name), slab usage, and the backtrace, order, and
> > gfp flags of the OOM backtrace.
> > We can tell many different types of OOM problems by the information
> > above except the alloc_pages() leakage.
> >
> > The patch does work and saves a lot of debugging time.
> > Could we consider the "greatest memory consumer" as another useful
> > OOM information?
>
> This is rather situational considering there are memory leaks here and there, but it is not necessarily as straightforward as a single greatest consumer.
Agreed, but having the greatest memory consumer information does no harm
here.
Maybe you can share some cases with me?
The greatest memory consumer provides a strong clue to a memory
leakage.
I have seen several different types of OOM issues:
1. Task leakage: we can observe this from the kernel_stack numbers.
2. Memory fragmentation: check the zone memory status and the allocation
order.
3. kmalloc leakage: check the slab numbers. Let's say the kmalloc-512
number is abnormal; we can then enable kmemleak and reproduce the issue.
Most of the time, I saw a single backtrace for that leak.
It's helpful to have the greatest memory consumer in this case.
4. vmalloc leakage: we have no vmalloc numbers now. I saw a case where
a large number was passed into vmalloc() in a fuzzing test and it caused
an OOM kernel panic.
It is hard to reproduce the issue, and kmemleak is of little help here
because it is an OOM kernel panic.
That is the issue which inspired me to create this patch. We found the
root cause with this approach.
5. OOM due to running out of normal memory (in a 32-bit kernel): we can
check the allocation flags and the zone memory status. In this case, we
can check the memory allocations and see if they can use highmem.
Knowing the greatest memory consumer may or may not be useful here.
6. OOM caused by 2 or more different backtraces: I saw this once. We
enabled PAGE_OWNER, got the complete memory usage information, and
located the root cause. Again, knowing the greatest memory consumer
still helps in this case.
7. OOM caused by alloc_pages(): there is no existing useful information
for this issue.
CONFIG_PAGE_OWNER is useful, and we can do more based on
CONFIG_PAGE_OWNER (this patch).
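
To illustrate the "greatest memory consumer" idea from items 6 and 7,
here is a minimal userspace sketch in Python. It is not the patch
itself. It assumes a page_owner dump (for example, text saved from
/sys/kernel/debug/page_owner) in which records are separated by blank
lines, each record starts with "Page allocated via order N", may have a
"PFN ..." line, and then lists the backtrace. The exact format differs
between kernel versions, and the names pages_by_stack() and
largest_consumer() are made up for this example.

```python
import re
from collections import Counter

# Assumed header format of one page_owner record; newer kernels append
# more fields (pid, timestamps) after the order, which this still matches.
ORDER_RE = re.compile(r"^Page allocated via order (\d+)")


def pages_by_stack(dump_text):
    """Sum the number of allocated pages per unique backtrace."""
    totals = Counter()
    # Records are assumed to be separated by one or more blank lines.
    for record in re.split(r"\n\s*\n", dump_text.strip()):
        lines = record.strip().splitlines()
        if not lines:
            continue
        match = ORDER_RE.match(lines[0])
        if not match:
            continue  # not an allocation record, skip it
        # Everything after the header except the "PFN ..." line is
        # treated as the backtrace and used as the grouping key.
        stack = tuple(
            line.strip()
            for line in lines[1:]
            if not line.strip().startswith("PFN")
        )
        # An order-N allocation covers 2**N pages.
        totals[stack] += 1 << int(match.group(1))
    return totals


def largest_consumer(dump_text):
    """Return (backtrace, pages) for the biggest consumer, or None."""
    totals = pages_by_stack(dump_text)
    if not totals:
        return None
    return totals.most_common(1)[0]
```

Grouping by the full backtrace means two call paths into the same
allocator are counted separately, which is usually what we want when
looking for a single leaking caller.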
>
> The other question is why the offending drivers use alloc_pages() repeatedly without using any object allocator. Do you have examples of drivers where this could happen?
For example, we're implementing our iommu driver and there are many
alloc_pages() calls in drivers/iommu.
This approach helped us locate some memory leakages in our
implementation.
Thanks again for your comments.
It's Christmas now, so I think we can continue the discussion after the
Christmas break?
I have posted the number of issues addressed by this approach (7 real
problems since 2019/5).
I think this approach can help people. :)
Miles