Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

From: Jon Nelson
Date: Tue Dec 07 2010 - 16:01:33 EST


On Tue, Dec 7, 2010 at 1:35 PM, Ted Ts'o <tytso@xxxxxxx> wrote:
> On Tue, Dec 07, 2010 at 01:22:43PM -0500, Mike Snitzer wrote:
>> > 1. create a database (from bash):
>> >
>> > createdb test
>> >
>> > 2. place the following contents in a file (I used 't.sql'):
>> >
>> > begin;
>> > create temporary table foo as select x as a, ARRAY[x] as b FROM
>> > generate_series(1, 10000000 ) AS x;
>> > create index foo_a_idx on foo (a);
>> > create index foo_b_idx on foo USING GIN (b);
>> > rollback;
>> >
>> > 3. execute that sql:
>> >
>> > psql -f t.sql --echo-all test
>> >
>> > With 2.6.34.7 I can re-run [3] all day long, as many times as I want,
>> > without issue.
>> >
>> > With 2.6.37-rc4-13 (the currently-installed KOTD kernel) it fails
>> > pretty frequently.
>
> So I just tried to reproduce this on an Ubuntu 10.04 system running
> 2.6.37-rc5 (completely stock except for a few apparmor patches that I
> needed to keep the apparmor userspace from complaining).  I'm using
> Postgres 8.4.5-0ubuntu10.04.
>
> Using the above procedure, I wasn't able to reproduce.  Then I
> realized this might have been because I was using an SSD root file
> system (which is secured using LUKS/dm-crypt, with LVM on top of
> dm-crypt).  So I mounted a file system on a 5400 rpm laptop disk, which
> is also protected using LUKS/dm-crypt with LVM on top.  I then
> executed the PostgreSQL commands:
>
> CREATE TABLESPACE test LOCATION '/kbuild/postgres';
> SET default_tablespace = test;
> COMMIT
> \quit
>
> I then re-ran the above procedure, and verified that all of the I/O
> was going to the 5400rpm laptop disk.
>
> I then ran the above procedure a half-dozen times, and I still haven't
> been able to reproduce any PostgreSQL errors or kernel errors.
>
> Jon, can you help me identify what might be different with your run
> and mine?  What version of Postgres are you using?

I am using Postgres 8.4.5 on openSUSE 11.3 x86_64.
The problems were observed both on "real" hardware (a ThinkPad T61p)
and in VirtualBox, where all current testing is taking place. The
current kernel is a "vanilla" (unpatched) kernel. I *did* set
wal_sync_method to fdatasync, if that is relevant; otherwise, the pg
config is stock. With no crypt involved, I had to iterate the tests
to observe the issue: a half-dozen runs or more were necessary. When
crypt was involved, the issue typically manifested much more rapidly.
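
Since the failure can take several iterations to show up, re-running
step [3] by hand gets tedious. A minimal sketch for automating it,
assuming a POSIX shell; repro_loop is a hypothetical helper, not
something anyone in this thread used. It re-runs a command up to a
given number of times and stops at the first non-zero exit status.

```shell
#!/bin/sh
# repro_loop MAX_RUNS COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND up to MAX_RUNS times and stop at
# the first run that exits non-zero, reporting which run failed.
repro_loop() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # "$@" keeps the command's arguments intact (no re-splitting)
        if ! "$@"; then
            echo "run $i of $max_runs failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max_runs runs succeeded"
    return 0
}
```

For example, "repro_loop 20 psql -v ON_ERROR_STOP=1 -f t.sql
--echo-all test". The ON_ERROR_STOP setting matters here: without it,
psql keeps going after an SQL error in a -f script and still exits 0,
so the loop would never notice the failure.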

--
Jon