Re: [PATCH v3] lockdep: Allow tuning tracing capacity constants.

From: Dmitry Vyukov
Date: Thu Nov 19 2020 - 13:09:47 EST


On Thu, Nov 19, 2020 at 3:36 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> On Thu, Nov 19, 2020 at 3:30 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Nov 19, 2020 at 2:45 PM Tetsuo Handa
> > > <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On 2020/11/19 22:06, Dmitry Vyukov wrote:
> > > > >>>>
> > > > >>>> I am trying to reproduce this locally first. syzbot caims it can
> > > > >>>> reproduce it with a number of very simpler reproducers (like spawn
> > > > >>>> process, unshare, create socket):
> > > > >>>> https://syzkaller.appspot.com/bug?id=8a18efe79140782a88dcd098808d6ab20ed740cc
> > > > >>>>
> > > > >>>> I see a very slow drift, but it's very slow, so get only to:
> > > > >>>> direct dependencies: 22072 [max: 32768]
> > > > >>>>
> > > > >>>> But that's running a very uniform workload.
> > > > >>>>
> > > > >>>> However when I tried to cat /proc/lockdep to see if there is anything
> > > > >>>> fishy already,
> > > > >>>> I got this (on c2e7554e1b85935d962127efa3c2a76483b0b3b6).
> > > > >>>>
> > > > >>>> Some missing locks?
> > > >
> > > > Not a TOMOYO's bug. Maybe a lockdep's bug.
> > > >
> > > > >
> > > > > But I don't know if it's enough to explain the overflow or not...
> > > > >
> > > >
> > > > Since you can't hit the limit locally, I guess we need to ask syzbot to
> > > > run massive testcases.
> > >
> > > I am trying to test the code that will do this. Otherwise we will get
> > > days-long round-trips for stupid bugs. These files are also quite
> > > huge, I afraid that may not fit into storage.
> > >
> > > So far I get to at most:
> > >
> > > lock-classes: 2901 [max: 8192]
> > > direct dependencies: 25574 [max: 32768]
> > > dependency chains: 40605 [max: 65536]
> > > dependency chain hlocks used: 176814 [max: 327680]
> > > stack-trace entries: 258590 [max: 524288]
> > >
> > > with these worst offenders:
> > >
> > > # egrep "BD: [0-9]" /proc/lockdep
> > > 00000000df5b6792 FD: 2 BD: 1235 -.-.: &obj_hash[i].lock
> > > 000000005dfeb73c FD: 1 BD: 1236 ..-.: pool_lock
> > > 00000000b86254b1 FD: 14 BD: 1111 -.-.: &rq->lock
> > > 00000000866efb75 FD: 1 BD: 1112 ....: &cfs_b->lock
> > > 000000006970cf1a FD: 2 BD: 1126 ----: tk_core.seq.seqcount
> > > 00000000f49d95b0 FD: 3 BD: 1180 -.-.: &base->lock
> > > 00000000ba3f8454 FD: 5 BD: 1115 -.-.: hrtimer_bases.lock
> > > 00000000fb340f16 FD: 16 BD: 1030 -.-.: &p->pi_lock
> > > 00000000c9f6f58c FD: 1 BD: 1114 -.-.: &per_cpu_ptr(group->pcpu, cpu)->seq
> > > 0000000049d3998c FD: 1 BD: 1112 -.-.: &cfs_rq->removed.lock
> > > 00000000fdf7f396 FD: 7 BD: 1112 -...: &rt_b->rt_runtime_lock
> > > 0000000021aedb8d FD: 1 BD: 1113 -...: &rt_rq->rt_runtime_lock
> > > 000000004e34c8d4 FD: 1 BD: 1112 ....: &cp->lock
> > > 00000000b2ac5d96 FD: 1 BD: 1127 -.-.: pvclock_gtod_data
> > > 00000000c5df4dc3 FD: 1 BD: 1031 ..-.: &tsk->delays->lock
> > > 00000000fe623698 FD: 1 BD: 1112 -...:
> > > per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)
> > >
> > >
> > > But the kernel continues to crash on different unrelated bugs...
> >
> >
> > Here is one successful sample. How do we debug it? What should we be
> > looking for?
> >
> > p.s. it's indeed huge, full log was 11MB, this probably won't be
> > chewed by syzbot.
> > Peter, are these [hex numbers] needed? Could we strip them during
> > post-processing? At first sight they look like derivatives of the
> > name.
>
> The worst back-edge offenders are:
>
> 00000000b445a595 FD: 2 BD: 1595 -.-.: &obj_hash[i].lock
> 0000000055ae0468 FD: 1 BD: 1596 ..-.: pool_lock
> 00000000b1336dc4 FD: 2 BD: 1002 ..-.: &zone->lock
> 000000009a0cabce FD: 1 BD: 1042 ...-: &____s->seqcount
> 000000001f2849b5 FD: 1 BD: 1192 ..-.: depot_lock
> 00000000d044255b FD: 1 BD: 1038 -.-.: &n->list_lock
> 000000005868699e FD: 17 BD: 1447 -.-.: &rq->lock
> 00000000bb52ab59 FD: 1 BD: 1448 ....: &cfs_b->lock
> 000000004f442fff FD: 2 BD: 1469 ----: tk_core.seq.seqcount
> 00000000c908cc32 FD: 3 BD: 1512 -.-.: &base->lock
> 00000000478677cc FD: 5 BD: 1435 -.-.: hrtimer_bases.lock
> 00000000b5b65cb1 FD: 19 BD: 1255 -.-.: &p->pi_lock
> 000000007f313bd5 FD: 1 BD: 1451 -.-.: &per_cpu_ptr(group->pcpu, cpu)->seq
> 00000000bac5d8ed FD: 1 BD: 1004 ...-: &____s->seqcount#2
> 000000000f57e411 FD: 1 BD: 1448 -.-.: &cfs_rq->removed.lock
> 0000000013c1ab65 FD: 7 BD: 1449 -.-.: &rt_b->rt_runtime_lock
> 000000003bdf78f4 FD: 1 BD: 1450 -.-.: &rt_rq->rt_runtime_lock
> 00000000975d5b80 FD: 1 BD: 1448 ....: &cp->lock
> 000000002586e81b FD: 1 BD: 1471 -.-.: pvclock_gtod_data
> 00000000d03aed24 FD: 1 BD: 1275 ..-.: &tsk->delays->lock
> 000000001119414f FD: 1 BD: 1448 -...:
> per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)
> 000000006f3d793b FD: 6 BD: 1449 -.-.: &ctx->lock
> 00000000f3f0190c FD: 9 BD: 1448 -...: &rq->lock/1
> 000000007410cf1a FD: 1 BD: 1448 -...: &rd->rto_lock
>
> There are 19 with ~1500 incoming edges. So that's 20K.
>
> In my local testing I was at around 20-something K and these worst
> offenders were at ~1000 back edges.
> Now they got to 1500, so that is what got us over the 32K limit, right?
>
> Does this analysis make sense?
>
> Any ideas what to do with these?

And here is a full dump for:
BUG: MAX_LOCKDEP_CHAINS too low!

Attachment: log29.gz
Description: application/gzip