Re: dm-crypt barrier support is effective

From: Matt
Date: Wed Dec 01 2010 - 12:35:22 EST


On Wed, Dec 1, 2010 at 5:52 PM, Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> On Wed, Dec 01 2010 at 11:05am -0500,
> Matt <jackdachef@xxxxxxxxx> wrote:
>
>> On Mon, Nov 15, 2010 at 12:24 AM, Matt <jackdachef@xxxxxxxxx> wrote:
>> > On Sun, Nov 14, 2010 at 10:54 PM, Milan Broz <mbroz@xxxxxxxxxx> wrote:
>> >> On 11/14/2010 10:49 PM, Matt wrote:
>> >>> only with the dm-crypt scaling patch I could observe the data-corruption
>> >>
>> >> even with v5 I sent on Friday?
>> >>
>> >> Are you sure that it is not related to some fs problem in 2.6.37-rc1?
>> >>
>> >> If it works on 2.6.36 without problems, it is probably problems somewhere
>> >> else (flush/fua conversion was trivial here - DM is still doing full flush
>> >> and there are no other changes in code IMHO.)
>> >>
>> >> Milan
>> >>
>> >
>> > Hi Milan,
>> >
>> > I'm aware of your new v5 patch (which should include several
>> > improvements (or potential fixes in my case) over the v3 patch)
>> >
>> > as I already wrote my schedule unfortunately currently doesn't allow
>> > me to test it
>> >
>> > * in the case of no corruption it would be nice to have 2.6.37-rc* running :)
>> >
>> > * in the case of data corruption that would mean restoring my system -
>> > since it's my production box and right now I don't have a fallback at
>> > reach
>> > at earliest I could give it a shot at the beginning of December. Then
>> > I could also test reiserfs and ext4 as a system partition to rule out
>> > that it's an ext4-specific thing (currently I'm running reiserfs on my
>> > system-partition).
>> >
>> > Thanks !
>> >
>> > Matt
>> >
>>
>>
>> OK guys,
>>
>> I've updated my system to latest glibc 2.12.1-r3 (on gentoo) and gcc
>> hardened 4.5.1-r1 with 1.4 patchset which also uses pie (that one
>> should fix problems with graphite)
>>
>> not much system changes besides that,
>>
>> with those it worked fine with 2.6.36 and I couldn't observe any
>> filesystem corruption
>
> So dm-crypt cpu scalability v5 with 2.6.36 worked fine.
>
>> the bad news is: I'm again seeing corruption (!) [on ext4, on the /
>> (root) partition]:
>
> ...
>
>> ===> so the No.1 trigger of this kind of corruption where files are
>> empty, missing or the content gets corrupted (at least for me) is
>> compiling software which is part of the system (e.g. emerge -e
>> system);
>>
>> the system is Gentoo ~amd64; with binutils 2.20.51.0.12 (afaik this
>> one has changed from 2.20.51.0.10 to 2.20.51.0.12 from my last
>> report); gcc 4.5.1 (Gentoo Hardened 4.5.1-r1 p1.4, pie-0.4.5) <--
>> works fine with 2.6.36 and 2.6.36.1
>>
>> I'm not sure whether benchmarks would have the same "impact"
>
> Seems this emerge is a good test if it reliably induces the corruption.
>
>> the kernel currently running is 2.6.37-rc4 with the [PATCH v5] dm
>> crypt: scale to multiple CPUs
>>
>> besides that additional patchsets are applied (I apologize that it's
>> not only plain vanilla with the dm-crypt patch):
>> * Prevent kswapd dumping excessive amounts of memory in response to
>> high-order allocation
>> * ext4: coordinate data-only flush requests sent by fsync
>> * vmscan: protect executable page from inactive list scan
>> * writeback livelock fixes v2
>
> Have you actually experienced any of the issues the above patches are
> meant to address?  Seems you're applying patches guessing/hoping
> that they'll fix the dm-crypt corruption.
>
>> I originally had hoped that the mentioned patch in "ext4: coordinate
>> data-only flush requests sent by fsync", namely: "md: Call
>> blk_queue_flush() to establish flush/fua" and additional changes &
>> fixes to 2.6.37-rc4 would once and for all fix problems but it didn't
>
> That md patch doesn't help DM at all.  And the ext4 coordination patch
> is completely bleeding and actually broken (especially as it relates to
> DM -- but that breakage is only a concern for request-based DM,
> e.g. DM-mpath), anyway see:
> https://www.redhat.com/archives/dm-devel/2010-November/msg00185.html
>
> I'm not sure which patches you're using for the ext4 fsync changes but
> please don't use them at all.  It is purely an optimization for
> extremely heavy fsync workloads and is only getting in the way at this
> point.
>
>> I'm also using the writeback livelock fixes and the dm-crypt scale
>> to multiple CPUs with 2.6.36 so those generally work fine
>>
>> so it has to be something that changed from 2.6.36->2.6.37 within
>> dm-crypt or other parts that gets stressed and breaks during usage of
>> the "[PATCH v5] dm crypt: scale to multiple CPUs" patch
>>
>> the other included patches surely won't be the cause for that (100%).
>>
>> Filesystem corruption only seems to occur on the / (root) where the
>> system resides -
>
> We need better fault isolation; you've introduced enough change that it
> isn't helping zero in on what your particular problem is.  Milan has
> tested the latest version of the dm-crypt cpu scalability patch quite a
> bit and hasn't seen any corruption -- but clearly the corruption you're
> seeing is a real concern and we need to get to the bottom of it.
>
> I'd really appreciate it if you could just use Linus' latest linux-2.6
> tree plus Milan's latest patch (technically v6 even though it wasn't
> labeled as such): https://patchwork.kernel.org/patch/365542/
>
> Porting that same v6 patch to 2.6.36 would also be nice (to verify you
> still don't see any corruption there).
>
> Mike
>

Hi Mike,

those other patches were for unrelated problems I was seeing, e.g.
interactivity/latency problems during heavy flushing (among others),
and I speculated that they would improve things there


OK, enough of that additional stuff which distracts from this issue -
I'll leave them out for now ...

Thanks for the info !

To reassure you: I have been using v5 of Milan's patch up until now and
I haven't noticed any corruption with it in conjunction with 2.6.36

I modified it according to Milan's mail:

>On 11/15/2010 08:25 AM, Heinz Diehl wrote:
>> On 15.11.2010, Milan Broz wrote:
>
>> drivers/md/dm-crypt.c: In function 'crypt_ctr':
>> drivers/md/dm-crypt.c:1408: error: 'WQ_MEM_RECLAIM' undeclared (first use in this function)
>> drivers/md/dm-crypt.c:1408: error: (Each undeclared identifier is reported only once
>> drivers/md/dm-crypt.c:1408: error: for each function it appears in.)

>It should be enough to just replace WQ_MEM_RECLAIM with WQ_RESCUER for 2.6.36.
>(that define is new in 2.6.37)

>Milan

http://www.redhat.com/archives/dm-devel/2010-November/msg00099.html

and that worked fine
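
For anyone repeating the 2.6.36 backport, the substitution Milan describes
can be applied to the patch text mechanically. The following is only a
rough sketch: the helper name backport_wq_flag and the sample line are made
up for illustration and are not taken from the actual patch.

```python
import re


def backport_wq_flag(patch_text: str) -> str:
    """Replace the 2.6.37-only WQ_MEM_RECLAIM token with WQ_RESCUER."""
    # \b keeps longer identifiers that merely contain the name untouched.
    return re.sub(r"\bWQ_MEM_RECLAIM\b", "WQ_RESCUER", patch_text)


if __name__ == "__main__":
    # Hypothetical line, NOT copied from the real dm-crypt patch.
    sample = "+\tflags = WQ_MEM_RECLAIM;\n"
    print(backport_wq_flag(sample), end="")
```

The word-boundary match is the only design choice here: it avoids rewriting
an identifier that happens to contain the flag name as a substring.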

Thanks for pointing to v6 ! I hadn't noticed that there was a new one :)

Well, so I'll restore my box to a working/productive state and will
try out v6 (I'm pretty confident that it'll work without problems).

After that I'll see what info Tejun and the others need when the next
corruption might occur with vanilla 2.6.37-rc* and v6 so that there's
something to investigate
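
To have something concrete to report, one possible approach (just a sketch,
not something anybody on the list asked for) is to record a checksum
manifest of a directory tree before a test run and compare it afterwards,
sorting differences into the three symptoms I described: missing, empty and
changed files. The function names build_manifest and compare_manifests are
hypothetical. Files that emerge legitimately rebuilds will of course show
up as changed, so this is mostly useful for a tree that is not expected to
change, or for the missing/empty categories.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file under root to (size, sha256 hex digest)."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(path, root)
            manifest[rel] = (os.path.getsize(path), digest.hexdigest())
    return manifest


def compare_manifests(before, after):
    """Classify files from `before` as missing, newly empty, or changed."""
    report = {"missing": [], "empty": [], "changed": []}
    for path, (size, digest) in sorted(before.items()):
        if path not in after:
            report["missing"].append(path)
        elif after[path][0] == 0 and size > 0:
            report["empty"].append(path)
        elif after[path][1] != digest:
            report["changed"].append(path)
    return report
```

Typical use would be to call build_manifest() on the tree before the test,
again after it, and pass both results to compare_manifests().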

Thanks & Regards

Matt