Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
From: Dongliang Mu
Date: Tue Jun 15 2021 - 22:03:26 EST
On Tue, Jun 15, 2021 at 9:35 PM Pavel Skripkin <paskripkin@xxxxxxxxx> wrote:
>
> On Tue, 15 Jun 2021 18:37:14 +0800
> Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote:
>
> > On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@xxxxxxxxx>
> > wrote:
> > >
> > > On Mon, 14 Jun 2021 23:04:03 +0800
> > > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote:
> > >
> > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin
> > > > <paskripkin@xxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, 14 Jun 2021 22:40:55 +0800
> > > > > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote:
> > > > >
> > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > > > > > <paskripkin@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > > > > > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > > > > > <paskripkin@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > > > > > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > Dear kernel developers,
> > > > > > > > > >
> > > > > > > > > > I was trying to debug the crash - memory leak in
> > > > > > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert
> > > > > > > > > > in the functions that will be surely executed do not
> > > > > > > > > > work. The stack trace is in the following. I wrote
> > > > > > > > > > this email to ask for some suggestions on how to
> > > > > > > > > > debug such cases?
> > > > > > > > > >
> > > > > > > > > > Thanks very much. Looking forward to your reply.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi, Dongliang!
> > > > > > > > >
> > > > > > > > > This bug is not similar to others on the dashboard. I
> > > > > > > > > spent some time debugging it a week ago. The main
> > > > > > > > > problem here, that memory allocation happens in the
> > > > > > > > > boot time:
> > > > > > > > >
> > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7
> > > > > > > > > > init/main.c:1447
> > > > > > > > >
> > > > > > > >
> > > > > > > > Oh, nice catch. No wonder why my debugging does not work.
> > > > > > > > :(
> > > > > > > >
> > > > > > > > > and reproducer simply tries to
> > > > > > > > > free this data. You can use ftrace to look at it. Smth
> > > > > > > > > like this:
> > > > > > > > >
> > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > > > > > >
> > > > > > > > Thanks for your suggestion.
> > > > > > > >
> > > > > > > > Do you have any conclusions about this case? If you have
> > > > > > > > found out the root cause and start writing patches, I
> > > > > > > > will turn my focus to other cases.
> > > > > > >
> > > > > > > No, I had some busy days and I have nothing about this bug
> > > > > > > for now. I've just traced the reproducer execution and
> > > > > > > that's all :)
> > > > > > >
> > > > > > > I guess, some error handling paths are broken, but Im not
> > > > > > > sure
> > > > > >
> > > > > > In the beginning, I agreed with you. However, after I manually
> > > > > > checked functions: hwsim_probe (initialization) and
> > > > > > hwsim_remove (cleanup), then things may be different. The
> > > > > > cleanup looks correct to me. I would like to debug but stuck
> > > > > > with the debugging process.
> > > > > >
> > > > > > And there is another issue: the cleanup function also does not
> > > > > > output anything or hit the breakpoint. I don't quite
> > > > > > understand it since the cleanup is not at the boot time.
> > > > > >
> > > > > > Any idea?
> > > > > >
> > > > >
> > > > > Output from ftrace (syzkaller repro):
> > > > >
> > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace
> > > > > # tracer: function_graph
> > > > > #
> > > > > # CPU DURATION FUNCTION CALLS
> > > > > # | | | | | | |
> > > > > 1) | hwsim_del_radio_nl() {
> > > > > 1) | hwsim_del() {
> > > > > 1) | hwsim_edge_unsubscribe_me() {
> > > > > 1) ! 310.041 us | hwsim_free_edge();
> > > > > 1) ! 665.221 us | }
> > > > > 1) * 52999.05 us | }
> > > > > 1) * 53035.38 us | }
> > > > >
> > > > > Cleanup function is not the case, I think :)
> > > >
> > > > It seems like I spot the incorrect cleanup function (hwsim_remove
> > > > is the right one is in my mind). Let me learn how to use ftrace
> > > > to log the executed functions and then discuss this case with you
> > > > guys.
> > > >
> > >
> > > Hmmm, I think, there is a mess with lists.
> > >
> > > I just want to share my debug results, I have no idea about the fix
> > > for now.
> > >
> > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces
> > > sends a request to delete phy with idx == 0, so this check in
> > > hwsim_edge_unsubscribe_me():
> > >
> > > if (e->endpoint->idx == phy->idx) {
> > > ... clean up code ...
> > > }
> > >
> > > won't be passed and edge won't be freed (because it was allocated
> > > for phy with idx == 1). Allocated edge for phy 1 becomes leaked
> > > after hwsim_del(). I can't really see the code where phy with idx
> > > == 1 can be deleted from list...
> >
> > Thanks for sharing your debugging result.
> >
> > hwsim_phys
> > |
> > ---------------------------------
> > | |
> > sub0 (edges) sub1 (edges)
> > ----> e (idx = 1) ----> e (idx = 0)
> >
> > hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1).
> > However, in this function, it only deletes the e in the edge list of
> > sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it
> > leaves the e in the edge list of sub0 non-free.
> >
> > I proposed a patch and test it successfully in the syzbot dashboard.
> >
>
> Cool! I thougth about similar fix before going to bed, but I had really
> busy morning today :)
>
> > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
> > b/drivers/net/ieee802154/mac802154_hwsim.c
> > index da9135231c07..b05159cff33a 100644
> > --- a/drivers/net/ieee802154/mac802154_hwsim.c
> > +++ b/drivers/net/ieee802154/mac802154_hwsim.c
> > @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info,
> > struct device *dev,
> > static void hwsim_del(struct hwsim_phy *phy)
> > {
> > struct hwsim_pib *pib;
> > + struct hwsim_edge *e;
> >
> > hwsim_edge_unsubscribe_me(phy);
> >
> > + // remove the edges in the list
> > + list_for_each_entry_rcu(e, &phy->edges, list) {
> > + list_del_rcu(&e->list);
> > + hwsim_free_edge(e);
> > + }
> > +
>
> I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in
> hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting
> phy node from global list, then, i guess, rcu locking won't be needed
> here.
Yes, you're right. rcu_read_lock is needed here. However, from the
code below list_del(&phy->list), I think we'd better still add
rcu_read_lock for those statements.
How do you think about the following patch? BTW, I've sent a patch
with the prefix PATCH. Maybe we can discuss this patch there.
diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
b/drivers/net/ieee802154/mac802154_hwsim.c
index da9135231c07..cf659361a3fb 100644
--- a/drivers/net/ieee802154/mac802154_hwsim.c
+++ b/drivers/net/ieee802154/mac802154_hwsim.c
@@ -824,12 +824,17 @@ static int hwsim_add_one(struct genl_info *info,
struct device *dev,
static void hwsim_del(struct hwsim_phy *phy)
{
struct hwsim_pib *pib;
+ struct hwsim_edge *e;
hwsim_edge_unsubscribe_me(phy);
list_del(&phy->list);
rcu_read_lock();
+ list_for_each_entry_rcu(e, &phy->edges, list) {
+ list_del_rcu(&e->list);
+ hwsim_free_edge(e);
+ }
pib = rcu_dereference(phy->pib);
rcu_read_unlock();
>
> > list_del(&phy->list);
> >
> > rcu_read_lock();
> >
> > I will send a patch later.
> >
> >
> > >
> > > Maybe, it's kmemleak bug. Similar strange case was with this one
> > > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584.
> > > I find it strange, that I could reach leaked pointers after
> > > kmemleak reported a leak. Im not familiar with kmemleak internals
> > > and I might be wrong
> > >
> > >
> > > With regards,
> > > Pavel Skripkin
>
>
>
>
> With regards,
> Pavel Skripkin