Re: [syzbot] [xfs?] INFO: task hung in __fdget_pos (4)

From: Dave Chinner
Date: Sun Sep 03 2023 - 19:09:59 EST


On Mon, Sep 04, 2023 at 12:47:53AM +0200, Mateusz Guzik wrote:
> On 9/4/23, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Sun, Sep 03, 2023 at 10:33:57AM +0200, Mateusz Guzik wrote:
> >> On Sun, Sep 03, 2023 at 03:25:28PM +1000, Dave Chinner wrote:
> >> > On Sat, Sep 02, 2023 at 09:11:34PM -0700, syzbot wrote:
> >> > > Hello,
> >> > >
> >> > > syzbot found the following issue on:
> >> > >
> >> > > HEAD commit: b97d64c72259 Merge tag
> >> > > '6.6-rc-smb3-client-fixes-part1' of..
> >> > > git tree: upstream
> >> > > console output:
> >> > > https://syzkaller.appspot.com/x/log.txt?x=14136d8fa80000
> >> > > kernel config:
> >> > > https://syzkaller.appspot.com/x/.config?x=958c1fdc38118172
> >> > > dashboard link:
> >> > > https://syzkaller.appspot.com/bug?extid=e245f0516ee625aaa412
> >> > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for
> >> > > Debian) 2.40
> >> > >
> >> > > Unfortunately, I don't have any reproducer for this issue yet.
> >> >
> >> > Been happening for months, apparently, yet for some reason it now
> >> > thinks a locking hang in __fdget_pos() is an XFS issue?
> >> >
> >> > #syz set subsystems: fs
> >> >
> >>
> >> The report does not have info necessary to figure this out -- no
> >> backtrace for whichever thread which holds f_pos_lock. I clicked on a
> >> bunch of other reports and it is the same story.
> >
> > That's true, but there's nothing that points at XFS in *any* of the
> > bug reports. Indeed, log from the most recent report doesn't have
> > any of the output from the time stuff hung. i.e. the log starts
> > at kernel time 669.487771 seconds, and the hung task report is at:
> >
>
> I did not mean to imply this is an xfs problem.
>
> You wrote reports have been coming in for months so it is pretty clear
> nobody is investigating.

Which is pretty much the case for all filesystem bug reports from
syzbot except for those reported against XFS. Almost nobody else is
doing immediate triage of syzbot reports, so they just sit there
gathering dust.

This reflects the reality that syzbot is doing stuff that just
doesn't happen to filesystems in production systems. Users will
almost never see these issues in real life because they aren't
corrupting the crap out of their filesystems and running randomly
generated syscalls on them.

> I figured I'm going to change that bit.

I wish you the best of luck.

> >> Can the kernel be configured to dump backtraces from *all* threads?
> >
> > It already is (sysrq-t), but I'm not sure that will help - if it is
> > a leaked unlock then nothing will show up at all.
>
> See the other part of the thread, I roped in someone from syzkaller
> along with the question if they can use sysrq t.

Yes, I saw that. I'm just saying that there's a real good chance it
won't actually help and will just generate a heap more report noise.
Stack traces from all the idle tasks don't really tell us anything
about what went wrong....

Lockdep reports and logs are difficult enough to parse at the best
of times, and adding more noise won't help anyone. We already have the
output of sysrq-l (from the NMI sent to all CPUs before panic), so
the only really useful debugging output still missing is sysrq-w
(blocked tasks).

If something is blocked holding the f_pos_lock long enough to
trigger the hung task timer, then whatever task is holding it must
be blocked itself on something else. The sysrq-w output will dump
that without adding all the idle task noise....
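
For reference, a minimal sketch of requesting that dump by hand. It
assumes root and a kernel built with CONFIG_MAGIC_SYSRQ; the helper
name trigger_sysrq() is purely illustrative, and how syzbot would
actually wire this in is not something this sketch tries to answer:

```python
# Sketch only: request a sysrq dump by writing the command character
# to the procfs trigger file. Needs root on a real system.

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_sysrq(key: str, path: str = SYSRQ_TRIGGER) -> None:
    """Write a single sysrq command character to the trigger file.

    'w' dumps tasks in uninterruptible (blocked) state,
    't' dumps every task, 'l' backtraces all active CPUs.
    The `path` parameter exists only so the helper can be exercised
    against an ordinary file (hypothetical convenience).
    """
    if len(key) != 1:
        raise ValueError("sysrq command must be a single character")
    with open(path, "w") as f:
        f.write(key)


# Typical use, as root; the output lands in the kernel log (dmesg):
#   trigger_sysrq("w")
```

The blocked-task dump is the one that would include whatever is
sitting on f_pos_lock, provided the holder is itself blocked.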

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx