Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer

From: Michal Hocko
Date: Mon Aug 14 2017 - 09:59:28 EST


On Sat 12-08-17 00:46:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > > Will you explain the mechanism by which random values are written instead of zeros
> > > > > so that this patch can actually fix the race problem?
> > > >
> > > > I am not sure what you mean here. Were you able to see a write with an
> > > > unexpected content?
> > >
> > > Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@xxxxxxxxxxxxxxxxxxx .
> >
> > Ahh, I've missed that random part of your output. That is really strange
> > because AFAICS the oom reaper shouldn't really interact here. We are
> > only unmapping anonymous memory and even if a refault slips through we
> > should always get zeros.
> >
> > Your test case doesn't use a MAP_PRIVATE mapping of a file so we
> > shouldn't even get any uninitialized data from a file by missing CoWed
> > content. The only possible explanations would be that a page fault
> > returned non-zero data, which would be a bug on its own, or that a file
> > write extended the file without actually writing to it, which smells
> > like a fs bug to me.
>
> As I wrote at http://lkml.kernel.org/r/201708112053.FIG52141.tHJSOQFLOFMFOV@xxxxxxxxxxxxxxxxxxx ,
> I don't think it is a fs bug.

Were you able to reproduce with other filesystems? I wonder what is
different in my testing because I cannot reproduce this at all. Well, I
had to reduce the number of competing writer threads to 128 because I
quickly hit thrashing behavior with more of them (and 4 CPUs). I will
try on a larger machine.
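
To illustrate the zero-fill expectation quoted above, here is a rough
userspace analogy. It is only a sketch and not the oom reaper itself: it
assumes madvise(MADV_DONTNEED) on a private anonymous mapping is a fair
stand-in for the reaper unmapping anonymous memory, and
anon_refault_is_zero() is a made-up helper name.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not kernel code): dirty a private anonymous
 * mapping, drop its pages, then read it back through a refault.
 * Returns 1 if every byte reads as zero, 0 if not, -1 on error.
 */
int anon_refault_is_zero(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Fill the pages with a non-zero pattern. */
    memset(p, 0xaa, len);

    /* Assumption: this stands in for the reaper's unmap of anon memory. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* Each read below refaults a page; check what comes back. */
    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```

On Linux I would expect anon_refault_is_zero(4 * 4096) to return 1, which
is why random (rather than zero) content in the written file is so
surprising.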
--
Michal Hocko
SUSE Labs