Re: [PATCH v2 1/1] Documentation: update pagemap with shmem exceptions
From: Tiberiu Georgescu
Date: Tue Sep 21 2021 - 04:52:42 EST
> On 20 Sep 2021, at 20:07, Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Hi, Tiberiu,
>
> Thanks for the patch! Yes it would still be nice to comment on this behavior,
> some trivial nitpicks below.
>
> On Mon, Sep 20, 2021 at 04:49:31PM +0000, Tiberiu A Georgescu wrote:
>> +In user space, whether the page is swapped or none can be deduced with the
>> +lseek system call. For a single page, the algorithm is:
>> +
>> +0. If the pagemap entry of the page has bit 63 (page present) set, the page
>> + is present.
>> +1. Otherwise, get an fd to the file where the page is backed. For anonymous
>> + shared pages, the file can be found in ``/proc/pid/map_files/``.
>> +2. Call lseek with LSEEK_DATA flag and seek to the virtual address of the page
>
> s/LSEEK_DATA/SEEK_DATA/
Oops, my bad. Will change that.
>
>> + you wish to inspect. If it overshoots the PAGE_SIZE, the page is NONE.
>> +3. Otherwise, the page is in swap.
>
> It could also not be in swap, right?
>
> Example 1: this process mmap()ed an existing shmem file with data filled in,
> but without accessing it yet. Then the page cache exists, not in swap, but
> pgtables will be empty.
>
> Example 2: this process has mapped this shmem with 2M thp, all data filled in,
> then due to some reason thp splits, then the pgtable can also be none but lseek
> will succeed, I think.
>
Ok, those are a lot of exceptions. So the pagemap entry can be empty while the
page itself is actually present in the page cache. When that happens, the
current algorithm mistakenly reports the page as being in swap.
Thanks a lot for pointing that out!
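
To make the ambiguity concrete, here is a rough sketch of the lookup from the
patch. It is only an illustration, not the documented method: the helper names
(open_backing_file, page_has_data, classify_page) are made up for this mail,
and it assumes the lseek offset is the page's offset within the backing file.

```c
#define _GNU_SOURCE             /* for SEEK_DATA */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT (1ULL << 63) /* pagemap bit 63: page present */

enum page_state {
    PAGE_PRESENT,       /* step 0: pagemap says the page is mapped */
    PAGE_NONE,          /* step 2: hole in the backing file */
    PAGE_CACHE_OR_SWAP, /* data exists, but not mapped: page cache OR swap */
};

/* Step 1: open the backing file. The caller supplies the path, e.g. an
 * entry under /proc/pid/map_files/ for anonymous shared memory. */
int open_backing_file(const char *path)
{
    return open(path, O_RDONLY);
}

/* Step 2: 1 if [off, off + page_size) contains data, 0 if it is a hole,
 * -1 on any other lseek error. */
int page_has_data(int fd, off_t off, long page_size)
{
    off_t data = lseek(fd, off, SEEK_DATA);

    if (data == (off_t)-1)
        return errno == ENXIO ? 0 : -1; /* ENXIO: no data at or after off */
    return data < off + page_size;      /* overshoot means a hole here */
}

/* Combine the pagemap entry with the result of page_has_data() (0 or 1;
 * the caller is expected to handle -1 before calling this). */
enum page_state classify_page(uint64_t entry, int has_data)
{
    if (entry & PM_PRESENT)
        return PAGE_PRESENT;
    if (!has_data)
        return PAGE_NONE;
    return PAGE_CACHE_OR_SWAP;
}
```

The last state is the problem case: with only pagemap and lseek, a page that
sits in the page cache without a page table entry looks the same as a swapped
one.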
> So to further identify whether that's in swap, we need a step 5 with mincore()
> system call, perhaps?
I tested it some more, and it still looks like the mincore() syscall considers pages
in the swap cache as "in memory". This is how I tested:
1. Create a cgroup with 1M limit_in_bytes, and allow swapping
2. mmap 1024 pages (shared and private mappings show the same behaviour)
3. write to all pages in order
4. compare mincore output with pagemap output
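
Step 4 can be sketched roughly as below. This is not the exact program I ran,
just a minimal reconstruction: the function names are hypothetical, and the
bit positions (63 = present, 62 = swapped) are taken from the pagemap
documentation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT (1ULL << 63)
#define PM_SWAP    (1ULL << 62)

/* Byte offset in /proc/pid/pagemap of the 64-bit entry describing vaddr. */
off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
    assert(page_size > 0);
    return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* 'P' = present, 'S' = swapped, '-' = neither bit set. */
char pagemap_state(uint64_t entry)
{
    if (entry & PM_PRESENT)
        return 'P';
    if (entry & PM_SWAP)
        return 'S';
    return '-';
}

/*
 * For each page in [addr, addr + npages * page_size), print the mincore()
 * residency bit next to the pagemap state of the calling process.
 * addr must be page aligned. Returns 0 on success, -1 on error.
 */
int compare_mincore_pagemap(void *addr, size_t npages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    unsigned char *vec = malloc(npages);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    int ret = -1;

    if (!vec || fd < 0)
        goto out;
    if (mincore(addr, npages * (size_t)page_size, vec) != 0)
        goto out;

    for (size_t i = 0; i < npages; i++) {
        uintptr_t vaddr = (uintptr_t)addr + i * (size_t)page_size;
        uint64_t entry;

        if (pread(fd, &entry, sizeof(entry),
                  pagemap_offset(vaddr, page_size)) != (ssize_t)sizeof(entry))
            goto out;
        /* Only the lowest bit of each mincore() byte is defined. */
        printf("%4zu mincore=%d pagemap=%c\n",
               i, vec[i] & 1, pagemap_state(entry));
    }
    ret = 0;
out:
    if (fd >= 0)
        close(fd);
    free(vec);
    return ret;
}
```

Calling compare_mincore_pagemap() on the 1024-page mapping after step 3,
from inside the memory-limited cgroup, gives the two views side by side.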
This is an example of typical mincore output in this scenario, shortened for
brevity (4x8 instead of 16x64):
00000000
00000000
00001110 <- this bugs me
01111111
The last 7 bits definitely mark pages present in memory, but a few other bits
are set a little earlier. Comparing this output with the pagemap, there are
indeed 7 consecutive pages present, and the rest are swapped, including the 3
that mincore marks as present.
At this point, I can only assume the pages behind the bits in between are in
the swap cache. If you have another explanation, please share it with me. In
the meantime, I will rework the doc patch and see if there is any other way to
differentiate clearly between the three types of pages. If not, I guess we'll
stick to mincore() and a best-effort 5th step.
--
Kind regards,
Tibi