Apologies for the enormous quote, but I wanted to get the lspci and
dmesg in here, in case someone else has some insight. [enormous quote snipped]
Yes, I'll do those tests tonight.
So, in the oddball-config department, you've got an ISAPnP modem over
which you're running PPP, and CONFIG_PREEMPT is on.
That corruption size really does make me think of network packets, so
I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
link? "ifconfig ppp0" or something like that.
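For reference, on a reasonably recent Linux the MTU is also exposed through sysfs, which is easy to script; a sketch using `lo` as a stand-in interface, since `ppp0` only exists on the affected box:

```shell
#!/bin/sh
# Read the MTU straight from sysfs; substitute ppp0 for lo on the
# machine that actually has the PPP link up.
iface=lo
mtu=$(cat /sys/class/net/$iface/mtu)
echo "$iface MTU: $mtu"
```
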
Can you try doing something like:

    x=0
    while true; do
        bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
        (cd foo; bk pull -q)
        rm -rf foo
        x=`expr $x + 1`
        echo -n "$x "
    done
On Fri, May 14, 2004 at 07:46:17AM -0700, Larry McVoy wrote:

My instinct is that this is a file system or VM problem. Here's why: BK
wraps its data in multiple integrity checks. When you are doing a pull,
the data that is sent across the wire is wrapped both at the individual
diff level (each delta has a check) as well as a CRC around the whole
set of diffs and metadata sent. Since Steven is pulling (I believe,
please confirm) from bkbits.net, we know that the data being generated
is fine - if it wasn't the whole world would be on our case.
On the receiving side, BK splats the entire set of diffs and metadata
down on disk, checking the CRC, and it doesn't proceed to the apply-patch
part until the entire thing is on disk and checked. Then, when the patches
are applied, each per-patch checksum is verified (except, as we recently
found out, in the case of the changeset file: we added a fast-path
optimization there and dropped the check on the floor. Oops).
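As a rough sketch of that receive-side discipline (the file name and the use of `cksum` as the CRC are stand-ins for BK's internal format, which I'm only assuming here): everything lands on disk first, the CRC is verified, and only then does the apply step run.

```shell
#!/bin/sh
# Toy version of "splat the diffs to disk, check the CRC, only then apply".
# The patch file name and cksum-as-CRC are illustrative, not BK's real format.
set -e
patch=/tmp/patch.$$
printf 'delta 1\ndelta 2\n' > "$patch"            # pretend these arrived over the wire
expected=$(cksum < "$patch" | awk '{print $1}')   # CRC that would accompany the data
actual=$(cksum < "$patch" | awk '{print $1}')     # recomputed from what hit the disk
if [ "$actual" != "$expected" ]; then
    echo "CRC mismatch - refusing to apply" >&2
    rm -f "$patch"
    exit 1
fi
echo "CRC ok, applying"
rm -f "$patch"
```

The point of the sketch is only the ordering: nothing gets applied until the on-disk copy re-verifies, which is why in-flight corruption should have been caught long before `apply`.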
I don't think pppd can be part of the problem because of the way BK is
designed - you shouldn't have gotten to the place you did if the data
was corrupted in transit.
I agree, I don't see how it could be an in-flight corruption.
If pppd or the kernel is corrupting in-memory pages, that's a
different matter entirely; that could cause these problems.
This is my current top suspect. Well, that, or rotted hardware with the
most bizarre symptoms I've ever seen. A 16550, or ISA DMA controller,
that just happens to stomp on buffer cache pages?
The fact that Steven is the only guy seeing this really makes me
suspicious that it is something with his kernel. I don't think this
is a memory-error problem; those never look like this, they look like
a few bits being flipped. Blocks of nulls are always a file-system/VM problem.
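One quick way to tell those two signatures apart on a suspect file is to count how many lines of an `od` dump are nothing but zero bytes: long runs of all-zero lines point at the file-system/VM failure mode, scattered single-byte differences at bad RAM. A sketch against a made-up sample file:

```shell
#!/bin/sh
# Build a sample: real data followed by a 512-byte run of nulls, then
# count od output lines that contain only zero bytes.
sample=/tmp/sample.$$
printf 'good data' > "$sample"
dd if=/dev/zero bs=512 count=1 >> "$sample" 2>/dev/null
zero_lines=$(od -An -v -tx1 "$sample" | grep -cE '^( 00)+ *$')
echo "zero-only od lines: $zero_lines"
rm -f "$sample"
```

A handful of zero-only lines in the middle of otherwise-good data is exactly the "blocks of nulls" pattern; a memory error would instead show up as one or two bytes differing from a known-good copy.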
Yeah, I'm sure it's some function of his hardware and config. But
really, how many people do you suppose are running 2.6 with PPP and
PREEMPT? And how many of them would notice a few pages per day of
partial buffer cache trashing? I had a machine with one byte of memory
that gave you "x" 50% of the time, and "x & 0xdf" the other 50% of the
time; it took several months and a fairly serious filesystem blowup
before I noticed enough problems to go run memtest86.
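For what it's worth, that 0xdf mask is no accident: it's ~0x20, and bit 5 is the ASCII case bit, so one bit stuck at zero in that byte flips lowercase to uppercase. A quick check with plain shell arithmetic:

```shell
#!/bin/sh
# 'x' is ASCII 0x78; clearing bit 5 (& 0xdf) gives 0x58, which is 'X'.
v=$(( 0x78 & 0xdf ))
printf 'x (0x78) & 0xdf = 0x%x\n' "$v"   # prints 0x58
```
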