RE: Any known soft lockup issue with vfs_write()->fsnotify()?

From: Haiyang Zhang
Date: Thu Mar 08 2018 - 10:08:07 EST


There was another report of the same issue on CoreOS, 4.14.11-coreos. The host/guest is AWS G4, so the problem is not limited to Azure VMs. It doesn't happen on older kernels such as 4.4. Maybe the problem is related to some recent changes in fsnotify or other fs code?

Soft lockup kernel panic reboot on AWS instance on fsnotify and vfs_write #2356
https://github.com/coreos/bugs/issues/2356

Thanks,
- Haiyang

> -----Original Message-----
> From: Jan Kara <jack@xxxxxxx>
> Sent: Monday, March 5, 2018 3:49 PM
> To: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx; Jan Kara <jack@xxxxxxx>; Amir Goldstein
> <amir73il@xxxxxxxxx>; Miklos Szeredi <mszeredi@xxxxxxxxxx>; Haiyang
> Zhang <haiyangz@xxxxxxxxxxxxx>; 'linux-kernel@xxxxxxxxxxxxxxx'
> <linux-kernel@xxxxxxxxxxxxxxx>; Jork Loeser <Jork.Loeser@xxxxxxxxxxxxx>
> Subject: Re: Any known soft lockup issue with vfs_write()->fsnotify()?
>
> Hi!
>
> On Fri 02-03-18 22:28:50, Dexuan Cui wrote:
> > Recently people have been hitting a soft lockup issue with vfs_write()->fsnotify().
> > The detailed calltrace is available at:
> > https://github.com/coreos/bugs/issues/2356
> > https://github.com/coreos/bugs/issues/2364
>
> I haven't seen them yet.
>
> > The kernel versions showing the issue are:
> > 4.14.11-coreos
> > 4.14.19-coreos
> > 4.13.0-1009 -- this is the kernel with which I'm personally seeing the lockup.
> >
> > I have not got a chance to try the latest mainline kernel yet.
>
> It would be good to try a 4.15 kernel to see whether the recent fixes from
> Miklos fix your problem. They should be present in the 4.14.11/19 kernels as
> well, but one never knows...
>
> > Before the lockup error message suddenly appears, Linux has been
> > running fine for many hours. I have NOT found a consistent way to
> > reproduce the lockup yet.
> >
> > It looks like the kernel is stuck in fsnotify(), when it tries to take
> > the fsnotify_mark_srcu lock.
>
> It is not possible that we would 'hang' in srcu_read_lock() - that is just a read of
> one variable and increment of another. We'd have to be looping somewhere
> and watchdog would have to happen to hit us always at that place. Weird. Are
> you sure RIP points to srcu_read_lock?
>
> > "git log fs/notify/fsnotify.c" on the latest mainline shows that some
> > recent patches might help.
> >
> > I'd like to check if this is a known issue.
>
> As I've mentioned above, so far I haven't seen reports like this...
>
> Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
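
A note on the srcu_read_lock() point in the quoted text: the description
"a read of one variable and increment of another" can be sketched in
userspace as below. This is only a rough model under stated assumptions,
not the kernel implementation. The struct, its fields, and the model_*
function names are all hypothetical, and C11 atomics stand in for whatever
counters the kernel actually uses. The point it illustrates is that the
read-side entry has no loop and no wait, so there is nothing in it to hang on.

```c
/*
 * Userspace model of an SRCU-style read-side lock. Not kernel code:
 * every name here is hypothetical and mirrors only the description
 * "read one variable, increment another".
 */
#include <assert.h>
#include <stdatomic.h>

struct model_srcu {
    atomic_uint cur_idx;           /* selects which counter slot readers use */
    atomic_ulong lock_count[2];    /* readers that entered, per slot */
    atomic_ulong unlock_count[2];  /* readers that left, per slot */
};

/* Enter a read-side section: one load, one increment, no loop. */
int model_srcu_read_lock(struct model_srcu *sp)
{
    int idx = (int)(atomic_load(&sp->cur_idx) & 1u);

    atomic_fetch_add(&sp->lock_count[idx], 1);
    return idx;
}

/* Leave a read-side section using the index returned by the lock call. */
void model_srcu_read_unlock(struct model_srcu *sp, int idx)
{
    assert(idx == 0 || idx == 1);
    atomic_fetch_add(&sp->unlock_count[idx], 1);
}

/* Number of readers currently inside a section for the given slot. */
unsigned long model_srcu_readers(struct model_srcu *sp, int idx)
{
    return atomic_load(&sp->lock_count[idx]) -
           atomic_load(&sp->unlock_count[idx]);
}
```

In this model a reader calls model_srcu_read_lock(), keeps the returned
index, and passes it back to model_srcu_read_unlock(). Any waiting would
happen on the writer side, which would have to wait for
model_srcu_readers() to drop to zero; that side is deliberately not
modelled here.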