Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

From: Lee Schermerhorn
Date: Tue Feb 03 2009 - 16:50:29 EST


On Tue, 2009-02-03 at 17:10 +0000, Hugh Dickins wrote:
> On Tue, 3 Feb 2009, Lee Schermerhorn wrote:
> > On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote:
> > > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
> > > just need it in the explicit shmem_zero_setup() case, we also need it
> > > for the (probably rare nowadays) case when mmap() is working on file
> > ^^^^^^^^^^^^^^^^^^^^^^^^
> > > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.
> >
> >
> > This reminded me of something I'd seen recently looking
> > at /proc/<pid>/[numa]_maps for <a large commercial database> on
> > Linux/x86_64:
> >...
> > 2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072 /dev/zero
> > 2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072 /dev/zero
> > 2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072 /dev/zero
> >
> > <and so on, for another 90 lines until>
> >
> > 7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0 [stack]
> > ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
> >
> > For portability between Linux and various Unix-like systems that don't
> > support MAP_ANON*, perhaps?
> >
> > Anyway, from the addresses and permissions, these all look potentially
> > mergeable. The offset is preventing merging, right? I guess that's one
> > of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS?
> >
> > Makes one wonder whether it would be worthwhile [not to mention
> > possible] to rework mmap_zero() to mimic MAP_ANONYMOUS...
>
> That's certainly an interesting observation, and thank you for sharing
> it with us (hmm, I sound like a self-help group leader or something).
>
> I don't really have anything to add to what Linus said (and hadn't
> got around to realizing the significance of the "p" there before I
> saw his reply).
>
> Mmm, it's interesting, but I fear to add more hacks in there just
> for this - I guess we could, but I'd rather not, unless it becomes
> a serious issue.
>
> Let's just tuck away the knowledge of this case for now.

Right. And a bit more info to tuck away...
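
[An aside on the merging question above first: a toy test along the lines
of the one below makes it easy to see which adjacent mappings the kernel
will coalesce. This is just an illustrative sketch--nothing from the
benchmark--and whether the /dev/zero pair ends up as one vma depends on
how mmap_zero() treats private mappings on the kernel at hand.]

/* toy sketch: map two adjacent chunks of /dev/zero, then two adjacent
 * MAP_ANONYMOUS chunks, and dump /proc/self/maps to see which pairs
 * were merged into a single vma.  Error checking omitted. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 16 * 4096;
	int fd = open("/dev/zero", O_RDWR);
	char cmd[64];

	/* reserve a contiguous region, then overlay its two halves with
	 * private mappings of /dev/zero (each with file offset 0) */
	char *z = mmap(NULL, 2 * len, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mmap(z, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED, fd, 0);
	mmap(z + len, len, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_FIXED, fd, 0);

	/* same thing with MAP_ANONYMOUS: adjacent anonymous vmas with
	 * identical flags normally do merge */
	char *a = mmap(NULL, 2 * len, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mmap(a, len, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	mmap(a + len, len, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", getpid());
	system(cmd);
	close(fd);
	return 0;
}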

I routinely grab the proc maps and numa_maps from our largish servers
running various "industry standard benchmarks". Prompted by Linus'
comment that "if it's just a hundred segments, nobody really cares", I
went back and looked a bit further at the maps for a recent run.

Below are some segment counts for the run. The benchmark involved 32
"instances" of the application--a technique used to reduce contention on
application-internal resources as the user count increases--along with
its database task[s]. Each instance spawns a few processes [5-6 on
average, up to ~14, for this run] that share a few instance-specific
SYSV segments between them--roughly the sharing pattern sketched below.
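
[The sharing itself is just the classic SYSV pattern; here's a made-up
sketch of the shape of it--hypothetical key and size, minimal error
handling, not the application's actual code. A key of 0x277a, for
example, shows up in maps as /SYSV0000277a, as in the snippet further
down.]

/* sketch: how a handful of tasks in one "instance" come to share an
 * instance-specific SYSV shm segment */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#define INSTANCE_KEY	0x277a		/* hypothetical per-instance key */
#define SEG_SIZE	(64UL << 20)	/* hypothetical 64MB segment */

int main(void)
{
	int shmid = shmget(INSTANCE_KEY, SEG_SIZE, IPC_CREAT | 0600);
	if (shmid < 0) {
		perror("shmget");
		return 1;
	}

	for (int i = 0; i < 6; i++) {		/* ~5-6 tasks per instance */
		if (fork() == 0) {
			/* every child attaches the same segment, so the
			 * same /SYSVxxxxxxxx inode shows up in each
			 * task's maps */
			char *p = shmat(shmid, NULL, 0);
			if (p == (char *)-1)
				return 1;
			sleep(60);	/* time to go look at the maps */
			shmdt(p);
			return 0;
		}
	}

	sleep(1);
	/* marking the segment for removal while tasks are still attached
	 * is what makes it show up as "(deleted)" in the maps */
	shmctl(shmid, IPC_RMID, NULL);

	while (wait(NULL) > 0)
		;
	return 0;
}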

In each instance, one of those shmem segments exhibits a similar pattern
to the /dev/zero segments from the prior mail. Many, altho' not all, of
the individual vmas are adjacent with the same permissions: 'r--s'.
E.g., a small snippet:

2ac0e3cf0000-2ac0e40f5000 r--s 00d26000 00:08 15695938 /SYSV0000277a (deleted)
2ac0e40f5000-2ac0e4101000 r--s 0112b000 00:08 15695938 /SYSV0000277a (deleted)
2ac0e4101000-2ac0e4102000 r--s 01137000 00:08 15695938 /SYSV0000277a (deleted)
2ac0e4102000-2ac0e4113000 r--s 01138000 00:08 15695938 /SYSV0000277a (deleted)
2ac0e4113000-2ac0e4114000 r--s 01149000 00:08 15695938 /SYSV0000277a (deleted)
2ac0e4114000-2ac0e4115000 r--s 0114a000 00:08 15695938 /SYSV0000277a (deleted)
2ac0e4115000-2ac0e4116000 r--s 0114b000 00:08 15695938 /SYSV0000277a (deleted)
2ac0e4116000-2ac0e4117000 r--s 0114c000 00:08 15695938 /SYSV0000277a (deleted)

I counted 2000-3600+ of these for a couple of tasks. How they got like
this--one vma per page?--I'm not sure. Perhaps a sequence of mprotect()
calls or such after attaching the segment; there's a purely speculative
sketch of that sort of sequence further below. [I'll try to get an strace
sometime.] Then I counted the occurrences of the pattern
'^2.*r--s.*/SYSV' in each of the instances, since, again, each instance
uses a different shmem segment among its tasks; a minimal version of that
scan is sketched right after the table. For good measure, I counted the
'/dev/zero' segments as well:

                 SYSV shm   /dev/zero
instance 00          5771         217
instance 01          6025         183
instance 02          5738         176
instance 03          5798         177
instance 04          5709         182
instance 05          5423         915
instance 06          5513         929
instance 07          5915         180
instance 08          5802         182
instance 09          5690         177
instance 10          5643         177
instance 11          5647         180
instance 12          5656         182
instance 13          5672         181
instance 14          5522         180
instance 15          5497         180
instance 16          5594         179
instance 17          4922         906
instance 18          6956         935
instance 19          5769         181
instance 20          5771         180
instance 21          5712         180
instance 22          5711         184
instance 23          5631         179
instance 24          5586         180
instance 25          5640         180
instance 26          5614         176
instance 27          5523         176
instance 28          5600         179
instance 29          5473         177
instance 30          5581         180
instance 31          5470         180
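
For reference, the counts above come from nothing fancier than scanning
the maps files for those patterns. A minimal, hypothetical version of the
per-task scan--pid on the command line, patterns hard-wired--looks about
like this:

/* minimal scan: count the table's two patterns in one task's maps file */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[512];
	int sysv = 0, devzero = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/maps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		/* same test as the '^2 ... r--s ... SYSV' pattern above */
		if (line[0] == '2' && strstr(line, "r--s") &&
		    strstr(line, "/SYSV"))
			sysv++;
		if (strstr(line, "/dev/zero"))
			devzero++;
	}
	fclose(f);

	printf("SYSV shm: %d   /dev/zero: %d\n", sysv, devzero);
	return 0;
}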

A total of ~180K vmas mapping shmem segments, not counting the /dev/zero
mappings. Good thing we have a lot of memory :). A couple of the vmas per
instance map different shmem segments--just 2 or 3 out of the 5k-6k in
the cases that I looked at.
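
As for how a segment ends up split one-vma-per-page in the first place:
until I get that strace this is pure speculation, but the kind of
sequence I have in mind looks roughly like the sketch below--attach the
segment, then flip protections on individual pages, which forces the
kernel to split the vma around every page touched. Whether the pieces
ever coalesce again once the protections end up uniform is up to the
kernel.

/* speculative sketch: one way a single attached segment could end up as
 * one vma per page.  Pure guesswork; sizes and protections are made up. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t nr_pages = 16;		/* the real segments are far larger */
	int shmid = shmget(IPC_PRIVATE, nr_pages * pagesz, IPC_CREAT | 0600);
	char *base = shmat(shmid, NULL, 0);	/* one big rw-s vma to start */
	char cmd[64];

	/* drop write permission on every other page: each mprotect() that
	 * differs from its neighbours splits the vma around that page, so
	 * this leaves one vma per page */
	for (size_t i = 0; i < nr_pages; i += 2)
		mprotect(base + i * pagesz, pagesz, PROT_READ);

	snprintf(cmd, sizeof(cmd), "grep SYSV /proc/%d/maps", getpid());
	system(cmd);

	/* later, drop write permission on the remaining pages too, so that
	 * everything ends up r--s; whether the per-page pieces now merge
	 * back into one vma depends on the kernel's vma merging */
	for (size_t i = 1; i < nr_pages; i += 2)
		mprotect(base + i * pagesz, pagesz, PROT_READ);

	system(cmd);

	shmdt(base);
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}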

The benchmark seems to run fairly well, so I'm not saying we have a
problem here--with the Linux kernel, anyway. Just some raw data from a
pseudo-real-world application load. ['pseudo' because I'm told no real
user would ever set up the app quite this way :)]

Also, this is on a vintage 2.6.16+ kernel [not my choice]. Soon I'll
have data from a much more recent release.

Lee



