Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer

From: Andrea Arcangeli
Date: Fri Aug 11 2017 - 06:23:10 EST


On Fri, Aug 11, 2017 at 04:54:36PM +0900, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > +/*
> > > > + * Checks whether a page fault on the given mm is still reliable.
> > > > + * This is no longer true if the oom reaper started to reap the
> > > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > > + * the mm. At that moment any !shared mapping would lose the content
> > > > + * and could cause a memory corruption (zero pages instead of the
> > > > + * original content).
> > > > + *
> > > > + * User should call this before establishing a page table entry for
> > > > + * a !shared mapping and under the proper page table lock.
> > > > + *
> > > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > > + */
> > > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > > +{
> > > > + if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > > + return VM_FAULT_SIGBUS;
> > > > + return 0;
> > > > +}
> > > > +
> > >
> > > Will you explain the mechanism why random values are written instead of zeros
> > > so that this patch can actually fix the race problem?
> >
> > I am not sure what you mean here. Were you able to see a write with an
> > unexpected content?
>
> Yes. See http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@xxxxxxxxxxxxxxxxxxx .

The oom reaper depends on userland not possibly running anymore in any
thread associated with the reaped "mm" by the time wake_oom_reaper is
called and I'm not sure do_send_sig_info is anything close to provide
such guarantee. Problem is the reschedule seems async see
native_smp_send_reschedule invoked by kick_process. So perhaps the
thread is running with a corrupted stack for a little while until the
IPI arrives to destination. I guess it wouldn't be reproducible
without a large NUMA system.

Said that I looked the assembly of your program and I don't see
anything in the file_writer that could load data from the stack by the
time it starts to write() and clearly the sigkill and
smp_send_reschedule() will happen after it's already in the write()
tight loop. The only thing it loads from the user stack after it
reaches the tight loop is the canary which should then crash it if it
breaks out of the write loop which still wouldn't cause a write.

So I don't see much explanation on the VM side, but perhaps it's
possible this is a filesystem bug that enlarges the i_size before
issuing the write that SIGBUS in copy_from_user, because of
MMF_UNSTABLE is set at first access? And then leaves i_size enlarged
and what you're seeing in od -b is leaked content from an unintialized
disk block? This would happen on ext4 as well if mounted with -o
journal=data instead of -o journal=ordered in fact, perhaps you simply
have a filesystem that isn't mounted with journal=oredered semantics
and this isn't the OOM killer.

Also why you're using octal output? -x would be more intuitive for the
0xff (377) which is to be expected (should be all zeros or 0xff, and
some zero showup too).

Assuming those values not-zeros and not-0xff are simply lack of
ordered journaling mode and it's deleted file data (you clearly must
not have a ssd with -o discard or it'd be zero there), even if you
would only see zeroes it wouldn't concern me any bit less.

The non zeroes and non-0xff if they happen beyond the end of the
previous i_size they concern me less becuase they're at least less
obviously going to create sticky data corruption in a OOM killed
database. The database could handle it by recording the valid i_size
it successfully expanded the file to, with userland journaling in its
own user metadata.

Those expected zeroes that showup in your dump, are the real major
issue here and they showup as well. A database that hits OOM would
then generate persistent sticky memory corruption in user data that
could break the entire userland journaling and you could notice only
much later too.

OOM deadlock is certainly preferable here. Rebooting on a OOM hang is
totally ok and very minor issue as the user journaling is guaranteed
to be preserved. Writing random zeroes on shared storage may break the
whole thing instead and you may notice at next reboot to upgrade the
kernel that the db journaling fails and nothing starts and you could
have lost data too.

Back to your previous xfs OOM reaper timeout failure, one way around
it, is to implement a down_read_trylock_unfair, that will obtain a
read lock ignoring any write waiter breaking fairness but if done only
in the OOM reaper that would be not a
concern. down_read_trylock_unfair should solve this xfs lockup
involving khugepaged without the need to remove the mmap_sem from the
OOM reaper while mm_users > 0. Problem would then remain if the OOM
selected task is allocating memory and stuck on a xfs lock taken by
shrink_slabs while holding the mmap_sem for writing. This is why my
preference would be to dig in xfs and solve the source of the OOM
lockup at its core, as the OOM reaper is kicking the can down the
road, and ultimately if the process runs on pure
MAP_ANONYMOUS|MAP_SHARED kicking the can won't move it one bit, unless
OOM reaper starts to reap shmem too by expanding even more with more
checks and stuff when the fix for xfs ultimately will become simpler
and more self contained and targeted.

I would like if it would be possible to tell which kernel thread has
to be allowed to make progress lowering the wmark to unstuck the
TIF_MEMDIE task. For kernel threads this could involve adding a
pf_memalloc_pid dependency that is accessible at OOM time. Workqueues
submitted in PF_MEMALLOC context could set this pf_memalloc_pid
dependency in the worker threads themselves, fs kernel threads would
need the filesystem to set this pid dependency. So if TIF_MEMDIE pid
matches the current kernel thread pf_memalloc_pid, the kernel thread
allocation would inherit PF_MEMALLOC wmark privileges, by artificially
lowering the wmark for the TIF_MEMDIE task.

Or simply we could stop calling shrink_slab for fs dependent slab
caches with a per shrinker flag, in direct reclaim and offload those
to kswapd only. That would be a real simple change, much simpler than
the current unsafe but simpler OOM reaper.

There are several dozen of mbytes of RAM available when the system
hangs and fails to get rid of the TIF_MEMDIE task, problem they must
be given to the kernel thread that the TIF_MEMDIE task is waiting for
and we can't rely on lockdep to sort it out or it's too slow.

Refusal to fix the fs hangs and relying solely on the OOM reaper
ultimately causes the OOM reaper to keep escalating, to the point not
even down_read_trylock_unfair would suffice anymore and it would need
to zap pagetables without holding the mmap_sem at all (for example in
order to solve your same xfs OOM hang that would still remain if
shrink_slabs runs in direct reclaim under a mmap_sem-for-writing
section like while allocating a vma in mmap).

Thanks,
Andrea