"Albert D. Cahalan" wrote:
>
> Robert Redelmeier writes:
> > Daniel Phillips wrote in part:
>
> >> One thing to keep in mind in all of this is: nobody is testing the
> >> reliability of their journalling or any other kind of filesystem just by
> >> running it. To test these things you have to crash/interrupt the system
> >> *a lot*. Otherwise, you are just fooling yourself and everybody else.
> >> How many crashes does it take to find that one little window of
> >> vulnerability that comes up every 10,000 crashes normally but suddenly
> >> starts coming up every time just because your customer uses their system
> >> a different way? You're doing the right thing by crash-testing it, now
> >> instead of doing it 5 times do it 1,000 times. Here's one of my
> >> favorite tests: unzip a kernel source tree and wait until the disk light
> >> goes out. A second or so after it comes on again (kflushd) hit the
> >> reset button.
> >
> > Good idea. I certainly believe in stressing hardware (see .sig),
> > but I'm not sure this test is rigorous enough. The problem is
> > the reset button is only connected to the CPU and the hard disk
> > will probably continue to write out sectors from it's hw buffer.
> > OTOH, I don't like the idea of pulling the plug too often. It's
> > very hard on the hardware. I'd expect a mechanical disk failure
> > before 10,000 cycles.
>
> The nice way to develop this code is with a block device that
> discards all writes after a timer goes off.
And it has to stop the filesystem processing somehow and convince the hd
to discard its write cache.
Somebody should write a crash-simulator along those lines and do some
real head-to-head comparisons amongst the various candidates for the
"most reliable filesystem" crown. Doing a million simulated crashes
under controlled conditions might not catch every possible thing that
could go wrong but it would sure help remove the subjective opinion
element. I think it would spur on development too because we'd have a
real yardstick to measure progress against.
The tricky part of the crash simulator would be recovering the resources
the filesystem was using and convincing the VFS to let go of the the
partition. If you could return the system to a stable state you could
do many, many more test runs in the same time. Maybe VMWare could help
here.
-- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sat Oct 07 2000 - 21:00:09 EST