Re: inode_unused list corruption in 2.4.26 - spin_lock problem?

From: Chris Caputo
Date: Tue Jul 27 2004 - 11:11:57 EST


On Tue, 27 Jul 2004, Marcelo Tosatti wrote:
> On Mon, Jul 26, 2004 at 10:41:24AM -0700, Chris Caputo wrote:
> > On Sat, 3 Jul 2004, Arjan van de Ven wrote:
> > > On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote:
> > > > On Fri, 25 Jun 2004, Marcelo Tosatti wrote:
> > > > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote:
> > > > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> > > > > > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt
> > > > > > to ask.
> > > > >
> > > > > Pretty sure it is, used all over. You can try to use non-optimize version
> > > > > at lib/dec_and_lock.c for a test.
> > > >
> > > > My current theory is that occasionally when irqbalance changes CPU
> > > > affinities that the resulting set_ioapic_affinity() calls somehow cause
> > > > either inter-CPU locking or cache coherency or ??? to fail.
> > >
> > > or.... some spinlock is just incorrect and having the irqbalance irqlayout
> > > unhides that.. irqbalance only balances very very rarely so I doubt it's the
> > > cause of anything...
>
> Hi Chris,
>
> > It has been a while since I have been able to follow up on this but I want
> > to let you know that I _have been able_ to reproduce the problem (believed
> > to be IRQ twiddling resulting in failed spinlock protection) with a stock
> > kernel.
>
> Well, no manipulation of the inode lists are done under IRQ context.
>
> What you think might be happening is that an IRQ comes in __refile_inode()
> (and other paths) and that causes a problem?
>
> Thats perfectly fine. Again, no manipulation of the lists are done in
> IRQ context.

I may have phrased the question poorly. I am trying to get at whether the
x86 "lock ; dec $x" operation can somehow be affected by changes to IRQ
affinity, since that is what appears to happen. I assume SMP inter-cpu
locking doesn't generate IRQs at the linux kernel level (right?), but
there seems to be something at a low level which gets affected by changes
to which CPU handles which IRQs.

> > I would like to come up with a more reliable way to reproduce the problem
> > with a stock kernel (2.4.26), since it is presently very rare (less than
> > once per week) in the way I presently get it to happen, but as yet have
> > not done so.
>
> What is your workload? I'm more than willing to use the SMP boxes I have
> access to try to reproduce this.

Here is the basic repro so far:

-1) SMP x86 machine (PIII or PIV), 4 gigs of RAM, kernel 2.4.26. Here are
some possibly relevant configs:

CONFIG_X86=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_F00F_WORKS_OK=y
CONFIG_X86_MCE=y
CONFIG_HIGHMEM4G=y
CONFIG_HIGHMEM=y
CONFIG_HIGHIO=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_X86_TSC=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y

0) Start irqbalance.
1) Have 10+ gigs of space available for testing preferably on two separate
filesystems, mounted on say /test_0 and /test_1.
2) As root do a "find /" to fill up the dcache/inode caches with a bunch
of data.
3) Compile the attached writer2.c: "gcc -O2 -o writer2 writer2.c"
4) Run "writer2 /test_0 0" and "writer2 /test_1 0" in the background.
5) Run "emerge sync" a few times per day.
6) Wait for crash.

Step 5 is unique to Gentoo, so it would be nice to have another test
program which does the equivalent in terms of opening/created/deleting a
bunch of times. Intent to be inspiring that a bunch of items are added to
and removed from the inode_unused list regularly.

> You said you also reproduced the same inode_unused corruption with
> 2.4.24, yes?

I have reproduced the same symptoms (inode_unused list corruption) on
2.4.24 with a modified kernel (tux patches plus my filesystem) but I have
not tried to do so with a stock 2.4.24. So far only stock kernel I have
tried has been 2.4.26.

> > My plan of attack is to remove irqbalance from the equation and repeatedly
> > change with random intervals /proc/irq entries directly from one user mode
> > program while another user mode program does things which inspire a lot of
> > fs/inode.c spinlock activity (since that is where I continue to see list
> > corruption).
> >
> > A few questions which could help me with this:
> >
> > - Which IRQ (if any) is used by CPU's to coordinate inter-CPU locking?
>
> None as far as spinlocks are concerned. On x86 spinlock just does "lock
> ; dec $x" operation which guarantees the atomicity of the "dec".
>
> I feel I'm not answering your question, still. What do you mean?

What I seem to be seeing is that changes in IRQ affinity are at some level
screwing up the guaranteed atomicity of the "lock ; dec $x". I realize
that might be crazy to think, but maybe it is something known about the
x86 spec. The two CPU's I have seen this with, by the way, are a PIV-Xeon
(CPUID F29) and a PIII (CPUID 06B1), so I don't imagine a microcode-type
bug unique to a certain CPU release.

> > - What does it mean if a stack trace is incomplete? For example, one I
> > have gotten is simply the tail end of the code snippet:
> >
> > 0b 9a 00 5d c8
> >
> > And so I have wondered if the failure to make a full stack trace
> > indicates something in of itself.
>
> Dont know the answer. I also usually see incomplete stack traces
> that I can make no sense of.
>
> > Thanks for any assistance. I hope to find more time to work on this in
> > the coming weeks.
>
> I'm willing to help. Just please do not mix report data which includes
> your own filesystem/modified kernel. Lets work with stock kernels and
> only that, to make it as simple as possible.

Agreed.

Chris







#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/statvfs.h>
#include <sys/resource.h>
#include <signal.h>

#define NUM_DIFF_START_LOCS 10000
#define MAX_WRITE_LEN (10 * 1048576) // 10 megs

#define NUM_FILES 1
#define FILE_SIZE (10 * 1024 * 1048576LL) // 10 gigs

typedef signed long long s64;

int g_fSigTermReceived = 0;


static void sig_term(int signo)
{
g_fSigTermReceived = 1;
}


int write_wrapper(int fd,
void *buf,
int len)
{
int total = 0;
int subtotal = 0;
char *targetBuf = buf;

while (total < len)
{
subtotal = write(fd, targetBuf, len - total);

if (0 > subtotal)
{
return -1;
}
else
{
total += subtotal;
targetBuf += subtotal;
}
}

return total;
}


int main(int argc,
char *argv[])
{
unsigned int uIterationCount, uCurIteration;
unsigned int uLength;
char *srcBuffer;
struct timeval tvStart, tvEnd;
s64 s64_usec;
char szDirectory[256];
char szFilename[256];
int fds[NUM_FILES], fd;
unsigned int i;
off_t lseekRes;

//
// Usage.
//

if (2 != argc &&
3 != argc)
{
fprintf(stderr,
"Usage: %s <test dir> [# of iterations]\n",
argv[0]);
fprintf(stderr,
" -if iteration count is not specified, defaults to 1. If 0\n");
fprintf(stderr,
" it means run for a long long time (2^32 - 1).\n");
exit(1);
}

//
// Consume parameters.
//
argc--; // eat program name
strncpy(szDirectory, argv[1], sizeof(szDirectory));
argc--; // eat test directory
if (0 < argc)
{
uIterationCount = strtoul(argv[2], NULL, 10);
}
else
{
uIterationCount = 1;
}

// make it so a supplied iteration count of 0 runs for a long long time.
if (0 == uIterationCount)
{
uIterationCount--;
}

// seed random()
srandom(getpid());

// allocate space for the srcBuffer
// add long int extra so we can do a rapid random fill below
srcBuffer = malloc(MAX_WRITE_LEN +
NUM_DIFF_START_LOCS +
sizeof(long int));
if (NULL == srcBuffer)
{
fprintf(stderr,
"%s: was unable to malloc %u bytes. Exiting.\n",
argv[0],
MAX_WRITE_LEN +
NUM_DIFF_START_LOCS +
sizeof(long int));
exit(1);
}

// fill srcBuffer with random data
{
long int *pli;
long int *end;
int len = (MAX_WRITE_LEN +
NUM_DIFF_START_LOCS +
sizeof(long int)) / sizeof(long int);

end = (long int *)srcBuffer + len;

for (pli = (long int *)srcBuffer; pli < end; pli++)
{
*pli = random();
}
}

// create the files
for (i = 0; i < NUM_FILES; i++)
{
if (0 > snprintf(szFilename,
sizeof(szFilename),
"%s/%u",
szDirectory,
i))
{
fprintf(stderr, "snprintf(szFilename) failed\n");
exit(1);
}
fds[i] = open(szFilename, O_CREAT | O_WRONLY | O_LARGEFILE, 0666);
if (-1 == fds[i])
{
perror("open");
goto exit;
}
if (-1 == lseek(fds[i], FILE_SIZE - 1, SEEK_SET))
{
perror("lseek");
goto exit;
}
if (-1 == write(fds[i], "x", 1))
{
perror("write");
goto exit;
}
if (-1 == lseek(fds[i], 0, SEEK_SET)) // set back to start
{
perror("lseek");
goto exit;
}
}

signal(SIGINT, sig_term);

//
// Main loop.
//
for (uCurIteration = 1;
uCurIteration <= uIterationCount && 0 == g_fSigTermReceived;
uCurIteration++)
{
printf("%u:W:",
uCurIteration);

uLength = random() % MAX_WRITE_LEN;
printf("len = %u",
uLength);

fd = fds[random() % NUM_FILES];

lseekRes = lseek(fd,
0,
SEEK_CUR);

if (-1 == lseekRes)
{
perror("lseek");
goto exit;
}

if (lseekRes + uLength >= FILE_SIZE)
{
if (-1 == lseek(fd, 0, SEEK_SET)) // set back to start
{
perror("lseek");
goto exit;
}
}

gettimeofday(&tvStart, NULL);

// Do networking.
if (-1 == write_wrapper(fd,
srcBuffer + random() % NUM_DIFF_START_LOCS,
uLength))
{
perror("write_wrapper");
goto exit;
}

gettimeofday(&tvEnd, NULL);

s64_usec = (s64)tvEnd.tv_sec * 1000000 + tvEnd.tv_usec;
s64_usec -= (s64)tvStart.tv_sec * 1000000 + tvStart.tv_usec;
printf(" %Ld usecs %.2f mbits/sec",
s64_usec,
7.62939453125 * (float)uLength /
(float)s64_usec);

printf("\n");
fflush(stdout);
usleep(random() % 100000);
}

exit:
free(srcBuffer);

if (1 == g_fSigTermReceived)
{
printf("SIGINT received!\n");
}

#if 0
printf("cleaning up %u files...\n", NUM_FILES);
fflush(stdout);

// close/unlink the files
for (i = 0; i < NUM_FILES; i++)
{
printf("%u ", i);
fflush(stdout);
if (0 > snprintf(szFilename,
sizeof(szFilename),
"%s/%u",
szDirectory,
i))
{
fprintf(stderr, "snprintf(szFilename) failed\n");
exit(1);
}
close(fds[i]);
unlink(szFilename);
}

printf("\n");
#endif

exit(0);
}