Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386,bisected, reproducable)
From: Bron Gondwana
Date: Tue Jun 17 2008 - 23:15:12 EST
On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
> >
> > Hmm. Something like this *may* salvage it.
> >
> > Untested, so far (I'll reboot and test soon enough), but even if it fixes
> > things, it's not really very good.
>
> Ok, so I just rebooted with this, and it does indeed fix the bug.
>
> I'd be happier with a more complete fix (ie being byte-accurate and
> actually doing the partial copy when it hits a fault in the middle), but
> this seems to be the minimal fix, and at least fixes the totally bogus
> return values from the x86-64 __copy_user*() functions.
>
> Not that I checked that I got _all_ cases correct (and maybe there are
> other versions of __copy_user that I missed entirely), but Bron's
> test-case at least seems to work properly for me now.
>
> Bron? If you have a more complete test-suite (ie the real-world case that
> made you find this), it would be good to verify the whole thing.

Ok - I pulled the latest linus-2.6 git, and discovered
the patch was already in there, so I just built and
rebooted (git 952f4a0a9b27e6dbd5d32e330b3f609ebfa0b061).

Confirmed - fixed in both the test code and the cyr_dbtool
test case I was using previously. (I would have posted that
instead, but building Cyrus is a bit of a pain. You need
bdb and sasl and all sorts of extraneous crap - and
cyrusdb_skiplist.c depends on about half of Cyrus'
infrastructure, so I couldn't just pull it out by itself.)

For my sins, I appear to be becoming the world expert on
that particular file. I've debugged skiplist bugs many
times over, and completely rewritten the locking code.
It really does some pretty evil things - the memory accesses
look something like this:
[file...................]
[mmap^....^.^........^^..................................]
[file...................++++++++++++]
[mmap^....^.^........^^.^^ ^ ^^.....................]
Where (^) marks the bits that get accessed. All reads are via
the mmap, all writes are done with retry_write or
retry_writev (Cyrus library functions that keep hammering
until all the bytes are written).
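To make the "keep hammering until all the bytes are written" idea concrete, here is a minimal sketch of such a loop. It is not the Cyrus source: the name retry_write_sketch and its signature are assumptions made for illustration, and only the POSIX write() call is real.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a "write until done" helper.
 * NOT the Cyrus retry_write() implementation; the name and
 * signature are assumptions.
 * Returns nbyte on success, -1 on error (errno set by write()).
 */
ssize_t retry_write_sketch(int fd, const void *buf, size_t nbyte)
{
    const char *p = buf;
    size_t written = 0;

    while (written < nbyte) {
        ssize_t n = write(fd, p + written, nbyte - written);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before any data: retry */
            return -1;         /* real error: give up */
        }
        written += (size_t)n;  /* short write: loop for the remainder */
    }
    return (ssize_t)written;
}
```

A loop of this shape trusts the byte count the kernel reports, so it cannot detect a case where the kernel claims more bytes were copied than actually were.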

I was suspecting as early as Friday night (we've been
debugging this one for a few days now!) that it was page
boundary related, because the bug only seemed to appear
on seen databases with really long seen lists (they're in
ranged integer format like 1:5,7:9,12,14:22,24:...).

It didn't help that at first we were only finding out about
cases where the corruption hit exactly on the "navigational
components", hence breaking the skiplist logic. And then
the backpointer writes would scribble all over the corrupt
area as well, so that made it even stranger to debug!

OK - so I'll report this issue to the Cyrus mailing list
and warn people not to run x86_64 kernels from 2.6.23
through 2.6.25.7. At least not without the skanky little
patch that I'm planning to post:

    /* Read every byte of the mapping so each page gets touched. */
    int magic = 0;
    for (i = 0; i < maplen; i++) magic ^= mapbase[i];

Since I've tested that as a viable workaround!
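For anyone wanting to try the same trick outside Cyrus, the fragment can be wrapped as a standalone helper. This is only a sketch: the function name and signature are assumptions, and the volatile qualifier is an addition so that the compiler cannot drop the reads. The apparent point, going by Linus' description of a fault "in the middle" of the copy, is that the pages have already been touched before a later write reads from the mapping.

```c
#include <stddef.h>

/*
 * Hypothetical helper wrapping the workaround: read every byte of a
 * mapping so that each page is touched before the mapping is used.
 * The name and signature are assumptions, not Cyrus code.
 * Returns the XOR of all bytes so the reads have an observable result.
 */
unsigned char touch_mapping_sketch(const volatile unsigned char *mapbase,
                                   size_t maplen)
{
    unsigned char magic = 0;
    size_t i;

    for (i = 0; i < maplen; i++)
        magic ^= mapbase[i];   /* volatile read: cannot be optimised away */
    return magic;
}
```

Touching one byte per page would presumably be enough, but reading every byte stays closest to the two-line version that was actually tested.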
Bron.