out of order execution of shell commands

From: Olaf Hering
Date: Thu Oct 07 2004 - 16:21:37 EST



we have a weird bug where the open of the shell output redirection
happens before the mkdir command returns.
As a result, the redirection will fail. Unexpected data loss.

This bug was seen on ppc64 (dual ppc970) and also on i386, single cpu
3GHz xeon with smp kernel. Running kernel was our SLES9 2.6.5. It did
also happen on ppc64 with 2.6.9-rc2-mm4, but I havent verified that.
It was not seen with just one script running, but with 2 (or more).

The inner loop fails after 2-3 hours, sometimes earlier, sometimes
never. Some tracing shows:

sys32_open(2076) fs-stress-olh.s(2736):c1,j3193150 S /dev/zero
sys32_open(2076) fs-stress-olh.s(2736):c1,j3193154 S 01/11/59/8/data
sys32_open(2076) fs-stress-olh.s(2736):c1,j3193157 S ./01/11/59/9/DIRNAME
sys_mkdir(1696) fs-stress-olh.s(2736):c1,j3193162 E 01/11/59/9

That means, a gap of up 5 jiffies betwen the start of the open
syscall and the end of the mkdir syscall.
In this case, everything happend on cpu1.
The dd command does never fail.

If I add more printk (start/end of sys_open, sys_mkdir), the bug will
disappear.

A quick look at the bash2 sources shows a wait(-1,..), I believe the
bash does nothing unusual.

Any ideas whats going on here, and how to debug it further?


testcase:

fs-stress-olh.sh
#!/bin/bash
export PATH=
/bin/echo $$ > /dev/shm/bug42232-pid-$$
/bin/mkdir stress-root
cd stress-root

/bin/echo Setting up ...
for i in 0 1 2 3 4 5 6 7 8 9; do
for j in 0 1 2 3 4 5 6 7 8 9; do
for k in 0 1 2 3 4 5 6 7 8 9; do
/bin/echo -n "$i$j$k " >> onethousand
done
/bin/echo -n "$i$j " >> onehundret
done
/bin/echo -n "$i " >> ten
done
/bin/echo -n "0 1 2 3 4 5 6 7 8 9" > onethousand
for h in 0 1 ; do
for i in `/bin/cat onethousand`; do
/bin/mkdir $h$i
/bin/echo $h$i > ./$h$i/DIRNAME
for j in `/bin/cat onehundret`; do
/bin/mkdir $h$i/$j
/bin/echo $h$i/$j > ./$h$i/$j/DIRNAME
for k in `/bin/cat onehundret`; do
/bin/mkdir $h$i/$j/$k
/bin/echo $h$i/$j/$k > ./$h$i/$j/$k/DIRNAME
for l in `/bin/cat ten`; do
/bin/mkdir $h$i/$j/$k/$l
/bin/echo $h$i/$j/$k/$l > ./$h$i/$j/$k/$l/DIRNAME || /bin/kill -STOP `/bin/cat /dev/shm/bug42232-pid-*`
/bin/dd if=/dev/zero of=$h$i/$j/$k/$l/data bs=1024 count=1 &> /dev/zero
done
done
done
done
done

cd ..
/bin/rm -rf stress-root /dev/shm/bug42232-pid-$$


start_test.sh
#!/bin/bash
set -ex
cd /bug42232
test -x ~olh/bugs/42232/fs-stress-olh.sh
numcpus=`grep -Ec '^cpu[0-9]' /proc/stat`
for i in `seq 1 $numcpus`
do
(rm -rf $i ; mkdir -v $i; cd $i || exit 1 ; pwd ; nohup sh -c 'while date ; do ~olh/bugs/42232/fs-stress-olh.sh ; done' < /dev/null > logfile 2>&1 & )
done
sleep $numcpus
( tail --follow=name --retry --lines=42 ?/logfile & )
--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÃRNBERG
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/