Re: Analyzed/Solved/Bisected: Booting 2.6.30-rc2-git7 very slow

From: Martin Knoblauch
Date: Wed May 27 2009 - 07:26:18 EST


----- Original Message ----

> From: Martin Knoblauch <knobi@xxxxxxxxxxxx>
> To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>; viro@xxxxxxxxxxxxxxxxxx; rjw@xxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; tigran@xxxxxxxxxxxxxxxxxxxxx; Kay Sievers <kay.sievers@xxxxxxxx>; shemminger@xxxxxxxxxx; Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>; Matthew Wilcox <matthew@xxxxxx>
> Sent: Wednesday, May 27, 2009 11:14:06 AM
> Subject: Re: Analyzed/Solved/Bisected: Booting 2.6.30-rc2-git7 very slow
>
> ----- Original Message ----
>
> > From: Andrew Morton
> > To: Martin Knoblauch
> > Cc: Mike Galbraith ; viro@xxxxxxxxxxxxxxxxxx; rjw@xxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; tigran@xxxxxxxxxxxxxxxxxxxxx; Kay Sievers
> ; shemminger@xxxxxxxxxx; Jesse Barnes
> ; Matthew Wilcox
> > Sent: Wednesday, May 27, 2009 8:31:02 AM
> > Subject: Re: Analyzed/Solved/Bisected: Booting 2.6.30-rc2-git7 very slow
> >
> > On Wed, 20 May 2009 03:22:28 -0700 (PDT) Martin Knoblauch
> > wrote:
> >
> > >
> > > ----- Original Message ----
> > >
> > > > From: Mike Galbraith
> > > > To: Martin Knoblauch
> > > > Cc: Andrew Morton ; viro@xxxxxxxxxxxxxxxxxx;
> > rjw@xxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; tigran@xxxxxxxxxxxxxxxxxxxx
> > > > Sent: Wednesday, May 6, 2009 10:37:45 AM
> > > > Subject: Re: Analyzed/Solved: Booting 2.6.30-rc2-git7 very slow
> > > >
> > > > On Wed, 2009-05-06 at 00:55 -0700, Martin Knoblauch wrote:
> > > >
> > > > > just to bring this back to my problem :-)
> > > >
> > > > Good idea :-)
> > > >
> > > > > Last week I reported that the "new" sysfs entry in /proc/mounts already
>
> > comes
> > > > out of initrd. Does this ring a bell?
> > > > >
> > > > > http://lkml.indiana.edu/hypermail/linux/kernel/0904.3/03048.html
> > > >
> > > > Nope, no bells.
> > > >
> > > > The only thing I can suggest is that you try a bisection.
> > > >
> > > > -Mike
> > >
> > > OK, so I finally managed to bisect the issue down to the following commit.
> > Not much that I can say about it. Someone else suggested that it might all be
> a
> > question of timing. Might very well be. I will try it out on a system with a
> > different SCSI/RAID controller. The failing system has an "Smart Array 6i"
> > (cciss). "cciss", "ext3" and "jbd" are all modules coming from initrd.
> > >
> > > |commit 1120f8b8169fb2cb51219d326892d963e762edb6
> > > |Author: Stephen Hemminger
> > > |Date: Thu Dec 18 09:17:16 2008 -0800
> > > |
> > > | PCI: handle long delays in VPD access
> > > |
> > > | Accessing the VPD area can take a long time. The existing
> > > | VPD access code fails consistently on my hardware. There are comments
> > > |
> > > | Change the access routines to:
> > > | * use a mutex rather than spinning with IRQ's disabled and lock held
> > > | * have a much longer timeout
> > > | * call cond_resched while spinning
> > > |
> > > | Signed-off-by: Stephen Hemminger
> > > | Reviewed-by: Matthew Wilcox
> > > | Signed-off-by: Jesse Barnes
> > >
> >
> >
> >
> > So afacit what's happening is that the above change caused one of your
> > PCI devices to take a very long time to initialise, yes? Was it the
> > CCISS driver?
> >
>
> the whole thing is not understood. I mentioned CCISS only because it is the most
> visible difference difference between my two test platforms.
>
> > If you add "printk.time=y" to the kernel boot command line then you'll
> > get timestamped boot messages which will make it easier to determine
> > where the time was consumed. Adding `initcall_debug' to the boot line
> > will help us delve further into the delay, assuming that the offending
> > driver is build into vmlinux (which it might not be).
> >
>
> added both options. "dmesg" output from both is appended. The initcall timings
> vary in both directions. For CCISS, they are actually faster for the "bad" case.
>
> > Either way, it would be useful to know which driver the above change
> > broke.
> >
>
> agreed.
>
> > Once we know that, the questions is: doe sthe driver still work? If
> > so, then presumably the hardware if behaving unexpectedly, or in a way
> > which we're failing to cope with.
> >
>
> if it is CCISS, I can definitely say that it does work OK. As far as I can see,
> the whole system works OK, besides the duplicate sysfs line coming out of
> initrd.
>
> > Or perhaps that patch was simply buggy.
> >
> > btw, I don't agree that this report should be closed for "fuzziness"!
> > AFACIT the regression clearly and reproducibly occurs on one of your
> > machines, yes? That ain't fuzzy!
>
> frankly, I will stop caring about the DL380s before 2.6.31 gets released. My
> production kernels are not affected and the hotplug scripts can easily be fixed
> for testting. So, my interest is more curiosity. And the day-job does not really
> justify spending much more time on it. Aren't day-jobs annoying ...
>
> Cheers
> Martin

FWIW, I compiled the CCISS driver into the kernel. This makes the second "/sys" line in /proc/mounts go away, dmesg attached. But does it prove anything? The initialization of the CCISS hardware now happens about 2 seconds earlier in the bootup sequence. Does this hint to a problem with CCISS, or just confirms that the whole issue is really timing dependent? Anyway, I add Mike to CC.

Cheers
Martin

Attachment: dmesg-g904d6a3-dirty-modscsi-nomodcciss-good
Description: Binary data