Re: [PATCH v5 4/4] zram: introduce zram memory tracking

From: Andrew Morton
Date: Tue Apr 17 2018 - 17:59:26 EST


On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:

> zram as swap is useful for small-memory devices. However, swap means
> the pages on zram are mostly cold pages, due to the VM's LRU algorithm.
> In particular, once an application's init data has been touched during
> launch, it tends not to be accessed any more and is eventually swapped out.
> zram can store such cold pages in compressed form, but it's pointless
> to keep them in memory. A better idea is for app developers to free them
> directly rather than leaving them on the heap.
>
> This patch tells us the last access time of each zram block via
> "cat /sys/kernel/debug/zram/zram0/block_state".
>
> The output is as follows,
> 300 75.033841 .wh
> 301 63.806904 s..
> 302 63.806919 ..h
>
> The first column is zram's block index and the third one is a symbol
> (s: same page, w: page written to backing store, h: huge page) for the
> block state. The second column is the time at which the block was last
> accessed, with usec resolution. So the above example means the 300th
> block was accessed at 75.033841 seconds and it was huge, so it was
> written to the backing store.
>
> Admin can leverage this information, together with *pagemap*, to catch
> the cold or incompressible pages of a process once part of its heap has
> been swapped out.
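
For illustration, the block_state line format shown in the quoted sample could be parsed roughly as follows. This is a minimal sketch based only on the three sample lines above; the names BlockState and parse_block_state_line are hypothetical, and the real debugfs output may differ.

```python
from typing import NamedTuple


class BlockState(NamedTuple):
    index: int          # zram block index (first column)
    last_access: float  # last access timestamp (second column)
    same: bool          # 's': same page
    written: bool       # 'w': page written to the backing store
    huge: bool          # 'h': huge page


def parse_block_state_line(line: str) -> BlockState:
    """Parse one line such as '300 75.033841 .wh' (format assumed from the sample)."""
    index, stamp, flags = line.split()
    return BlockState(
        index=int(index),
        last_access=float(stamp),
        same="s" in flags,
        written="w" in flags,
        huge="h" in flags,
    )
```

Calling parse_block_state_line("300 75.033841 .wh") gives block 300 with the written and huge flags set. Note that the timestamp alone does not say how old the access is; that still needs a "current time" on the same clock.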

A few things..

- Terms like "Admin can" and "Admin could" are worrisome. How do we
know that admins *will* use this? How do we know that we aren't
adding a bunch of stuff which nobody will find to be (sufficiently)
useful? For example, is there some userspace tool to which you are
contributing which will be updated to use this feature?

- block_state's second column is in microseconds since some
undocumented time. But how is userspace to know how much time has
elapsed since the access? i.e., what is the "current time" on that
same clock?

- Is the sched_clock() return value suitable for exporting to
userspace? Is it monotonic? Is it consistent across CPUs, across
CPU hotadd/remove, across suspend/resume, etc? Does it run all the
way up to 2^64 on all CPU types, or will some processors wrap it at
(say) 32 bits? etcetera. Documentation/timers/timekeeping.txt
points out that suspend/resume can mess it up and that the counter
can drift between cpus.
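
To put the wrap-around question in numbers: sched_clock() reports nanoseconds, so the width of the underlying counter decides how soon the value repeats. The rough calculation below assumes a hypothetical free-running counter that ticks once per nanosecond and wraps at the given width. The helper name wrap_period_seconds is made up for illustration, and real hardware counters typically tick at other rates.

```python
def wrap_period_seconds(bits: int, tick_ns: int = 1) -> float:
    """Seconds until a free-running counter of `bits` width wraps around."""
    return (2 ** bits) * tick_ns / 1e9


SECONDS_PER_YEAR = 365.25 * 24 * 3600

if __name__ == "__main__":
    # Compare a 32-bit wrap (the "say, 32 bits" case) with the full 64-bit range.
    print(f"32-bit: {wrap_period_seconds(32):.2f} s")
    print(f"64-bit: {wrap_period_seconds(64) / SECONDS_PER_YEAR:.0f} years")
```

Under these assumptions a 32-bit counter wraps after a little over four seconds, while a 64-bit one lasts several centuries, which is why the answer matters for a timestamp exported to userspace.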