Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
From: Michal Hocko
Date: Mon Jul 24 2017 - 12:12:01 EST
On Mon 24-07-17 17:51:42, Kirill A. Shutemov wrote:
> On Mon, Jul 24, 2017 at 04:15:26PM +0200, Michal Hocko wrote:
[...]
> > What kind of scalability implication do you have in mind? There is
> > basically zero contention on the mmap_sem that late in the exit path
> > so this should be pretty much a fast path of the down_write. I agree it
> > is not 0 cost but the cost of the address space freeing should basically
> > make it noise.
>
> Even in the fast path case, it adds two atomic operations per process. If the
> cache line is not exclusive to the core by the time of exit(2) it can be
> noticeable.
>
> ... but I guess it's not a very hot scenario.
>
> I guess I'm just too cautious here. :)
I definitely did not want to handwave your concern away. I just thought we
could rule out the slow path, and I didn't think about the fast path overhead.
> > > Should we do performance/scalability evaluation of the patch before
> > > getting it applied?
> >
> > What kind of test(s) would you be interested in?
>
> Can we at least check that the number of /bin/true we can spawn per second
> wouldn't be harmed by the patch? ;)
OK, measuring a single /bin/true doesn't tell us anything, so I have used
the following script
root@test1:~# cat a.sh
#!/bin/sh
NR=$1
for i in $(seq $NR)
do
	/bin/true
done
in my virtual machine (on an otherwise idle host) with 4 CPUs and 2GB of
RAM.
Unpatched kernel
root@test1:~# /usr/bin/time -v ./a.sh 100000
Command being timed: "./a.sh 100000"
User time (seconds): 53.57
System time (seconds): 26.12
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.46
root@test1:~# /usr/bin/time -v ./a.sh 100000
Command being timed: "./a.sh 100000"
User time (seconds): 53.90
System time (seconds): 26.23
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.77
root@test1:~# /usr/bin/time -v ./a.sh 100000
Command being timed: "./a.sh 100000"
User time (seconds): 54.02
System time (seconds): 26.18
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.92
Patched kernel
root@test1:~# /usr/bin/time -v ./a.sh 100000
Command being timed: "./a.sh 100000"
User time (seconds): 53.81
System time (seconds): 26.55
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.99
root@test1:~# /usr/bin/time -v ./a.sh 100000
Command being timed: "./a.sh 100000"
User time (seconds): 53.78
System time (seconds): 26.15
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.67
root@test1:~# /usr/bin/time -v ./a.sh 100000
Command being timed: "./a.sh 100000"
User time (seconds): 54.08
System time (seconds): 26.87
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:20.52
The results vary quite a lot (have a look at the user time, which
should have no reason to vary at all; maybe that is the virtual machine
aspect?). I would say that we are still reasonably close to noise
here. Considering that /bin/true would be close to the worst case, I think
this looks reasonable. What do you think?
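
For a rough sense of scale, the system times quoted above can be averaged
per kernel and compared. The following is only a sketch: the helper names
mean and pct_diff are made up for illustration and are not part of the
original test, and three runs per kernel is too small a sample to draw firm
conclusions from.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original benchmark.

# mean: print the arithmetic mean of the arguments, to three decimals.
mean() {
	printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.3f\n", s / NR }'
}

# pct_diff: print how much larger $2 is than $1, in percent.
pct_diff() {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (b - a) / a * 100 }'
}

# System time (seconds) copied from the three runs on each kernel above.
unpatched=$(mean 26.12 26.23 26.18)
patched=$(mean 26.55 26.15 26.87)

echo "unpatched mean system time: $unpatched"
echo "patched mean system time:   $patched"
echo "difference in percent:      $(pct_diff "$unpatched" "$patched")"
```

On these numbers the difference in mean system time comes out at about
1.3%, while a single patched run (26.15) is below every unpatched run, which
fits the reading that the difference is within run-to-run variation.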
If you absolutely insist, I can make the lock conditional only for oom
victims. That would still mean current->signal->oom_mm pointer fetches
and 2 branches.
--
Michal Hocko
SUSE Labs