Re: [PATCH v3] lib/stackdepot: Double DEPOT_POOLS_CAP if KASAN is enabled

From: Andrey Konovalov
Date: Thu Aug 08 2024 - 15:22:28 EST


On Thu, Aug 8, 2024 at 8:32 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
>
> On 8/8/24 12:12, Andrey Konovalov wrote:
> > On Thu, Aug 8, 2024 at 2:54 PM Waiman Long <longman@xxxxxxxxxx> wrote:
> >> As said in commit 02754e0a484a ("lib/stackdepot.c: bump stackdepot
> >> capacity from 16MB to 128MB"), KASAN uses stackdepot to memorize stacks
> >> for all kmalloc/kfree calls. So stackdepot capacity was increased 8
> >> times to accommodate KASAN usage, even though it was claimed at the
> >> time that 4X should be enough.
> >>
> >> With commit fc60e0caa94d ("lib/stackdepot: use fixed-sized slots
> >> for stack records"), all stackdepot records use a fixed size with
> >> CONFIG_STACKDEPOT_MAX_FRAMES (default=64) entries. This was merged to
> >> support evictable KASAN stack records. Commit 31639fd6cebd ("stackdepot:
> >> use variable size records for non-evictable entries") re-enabled
> >> the use of variable size records for non-KASAN use cases, but KASAN
> >> (generic mode) still uses the large fixed size stack records.
> > No, since commit 711d349174fd ("kasan: revert eviction of stack traces
> > in generic mode") Generic KASAN doesn't use fixed-sized slots.
> >
> >> With the default CONFIG_STACKDEPOT_MAX_FRAMES of 64, KASAN's use of
> >> stackdepot space has more than doubled compared to before. Assuming an
> >> average stack trace depth of 16 frames, a KASAN stack record is almost
> >> 4X the size of a non-KASAN one.
> > So this is not correct.
> Yes, you are right. I missed the KASAN commit 711d349174fd in my
> analysis. I will update the commit log in the next version.
> >> When a wide variety of workloads are run on a debug kernel with KASAN
> >> enabled, the following warning may sometimes be printed.
> >>
> >> [ 6818.650674] Stack depot reached limit capacity
> >> [ 6818.650730] WARNING: CPU: 1 PID: 272741 at lib/stackdepot.c:252 depot_alloc_stack+0x39e/0x3d0
> >> :
> >> [ 6818.650907] Call Trace:
> >> [ 6818.650909] [<00047dd453d84b92>] depot_alloc_stack+0x3a2/0x3d0
> >> [ 6818.650916] [<00047dd453d85254>] stack_depot_save_flags+0x4f4/0x5c0
> >> [ 6818.650920] [<00047dd4535872c6>] kasan_save_stack+0x56/0x70
> >> [ 6818.650924] [<00047dd453587328>] kasan_save_track+0x28/0x40
> >> [ 6818.650927] [<00047dd45358a27a>] kasan_save_free_info+0x4a/0x70
> >> [ 6818.650930] [<00047dd45358766a>] __kasan_slab_free+0x12a/0x1d0
> >> [ 6818.650933] [<00047dd45350deb4>] kmem_cache_free+0x1b4/0x580
> >> [ 6818.650938] [<00047dd452c520da>] __put_task_struct+0x24a/0x320
> >> [ 6818.650945] [<00047dd452c6aee4>] delayed_put_task_struct+0x294/0x350
> >> [ 6818.650949] [<00047dd452e9066a>] rcu_do_batch+0x6ea/0x2090
> >> [ 6818.650953] [<00047dd452ea60f4>] rcu_core+0x474/0xa90
> >> [ 6818.650956] [<00047dd452c780c0>] handle_softirqs+0x3c0/0xf90
> >> [ 6818.650960] [<00047dd452c76fbe>] __irq_exit_rcu+0x35e/0x460
> >> [ 6818.650963] [<00047dd452c79992>] irq_exit_rcu+0x22/0xb0
> >> [ 6818.650966] [<00047dd454bd8128>] do_ext_irq+0xd8/0x120
> >> [ 6818.650972] [<00047dd454c0ddd0>] ext_int_handler+0xb8/0xe8
> >> [ 6818.650979] [<00047dd453589cf6>] kasan_check_range+0x236/0x2f0
> >> [ 6818.650982] [<00047dd453378cf0>] filemap_get_pages+0x190/0xaa0
> >> [ 6818.650986] [<00047dd453379940>] filemap_read+0x340/0xa70
> >> [ 6818.650989] [<00047dd3d325d226>] xfs_file_buffered_read+0x2c6/0x400 [xfs]
> >> [ 6818.651431] [<00047dd3d325dfe2>] xfs_file_read_iter+0x2c2/0x550 [xfs]
> >> [ 6818.651663] [<00047dd45364710c>] vfs_read+0x64c/0x8c0
> >> [ 6818.651669] [<00047dd453648ed8>] ksys_read+0x118/0x200
> >> [ 6818.651672] [<00047dd452b6cf5a>] do_syscall+0x27a/0x380
> >> [ 6818.651676] [<00047dd454bd7e74>] __do_syscall+0xf4/0x1a0
> >> [ 6818.651680] [<00047dd454c0db58>] system_call+0x70/0x98
> >>
> >> With all the recent changes in stackdepot to support new KASAN features,
> >> it is apparent that the current DEPOT_POOLS_CAP of 8192 may not be
> >> enough when KASAN is enabled. Fix this stackdepot capacity issue
> >> by doubling DEPOT_POOLS_CAP if KASAN is enabled. With 4k pages, the
> >> maximum stackdepot capacity is doubled to 256 MB with KASAN enabled.
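
(For reference, as I read the description above, the change itself
would be roughly the following -- just a sketch, the exact form in
your patch may of course differ:)

/* lib/stackdepot.c sketch: double the pool cap when KASAN is enabled. */
#ifdef CONFIG_KASAN
#define DEPOT_POOLS_CAP		(2 * 8192)
#else
#define DEPOT_POOLS_CAP		8192
#endif
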
> > It is possible that the stack depot runs out of space due to a truly
> > large number of unique stack traces, but I would first make sure that
> > is indeed the case. The one thing to check would be to dump all the
> > stack traces from the stack depot when it overflows, and check whether
> > they make sense. There have been cases in the past, when e.g. the task
> > context part of a stack trace from an interrupt didn't get stripped
> > properly, and thus almost every stack trace from an interrupt was
> > considered unique by the stack depot. Perhaps, something similar
> > started happening again.
>
> It could be.
>
> Anyway, I got a bug report from our QE team and they have seen it once
> so far. So it is not an easily reproducible problem. I need to do
> further investigation first. Thanks for your quick response.

If you have access to the environment where the issue is happening,
you can make the stack depot dump every new stack trace it saves and
monitor the output for some time. If there's a problem with interrupt
contexts, you should be able to see it even without waiting for the
stack depot to overflow.
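
Something along these lines should do (untested, just a debugging
sketch; the helper name is made up, and you would call it from
depot_alloc_stack() or wherever a new record gets created, passing in
the local entries/nr_entries/hash values):

#include <linux/printk.h>
#include <linux/stacktrace.h>
#include <linux/types.h>

/*
 * Debugging hack: print every stack trace that results in a new stack
 * depot record, so near-duplicates (e.g. unstripped interrupt entry
 * frames) become visible in the log.
 */
static void stack_depot_dump_new_record(const unsigned long *entries,
					unsigned int nr_entries, u32 hash)
{
	pr_info("stackdepot: new record, hash 0x%x, %u entries\n",
		hash, nr_entries);
	stack_trace_print(entries, nr_entries, 1);
}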

Thanks!