Re: [PATCH v5 4/4] zram: introduce zram memory tracking

From: Andrew Morton
Date: Wed Apr 18 2018 - 17:07:22 EST


On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:

> Hi Andrew,
>
> On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > > zRAM as swap is useful for small-memory devices. However, swap means
> > > those pages on zram are mostly cold pages due to the VM's LRU algorithm.
> > > In particular, once an application's init data has been touched at launch,
> > > it tends not to be accessed any more and is eventually swapped out.
> > > zRAM can store such cold pages in compressed form, but it's pointless
> > > to keep them in memory. A better idea is for app developers to free them
> > > directly rather than leaving them on the heap.
> > >
> > > This patch tells us the last access time of each block of zram via
> > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > >
> > > The output is as follows,
> > > 300 75.033841 .wh
> > > 301 63.806904 s..
> > > 302 63.806919 ..h
> > >
> > > First column is zram's block index and the 3rd one represents the symbol
> > > (s: same page, w: page written to backing store, h: huge page) of the
> > > block state. Second column represents the time, in seconds with usec
> > > resolution, at which the block was last accessed. So the above example
> > > means the 300th block was accessed at 75.033841 seconds and it was huge,
> > > so it was written to the backing store.
> > >
> > > Admin can leverage this information to catch cold|incompressible pages
> > > of a process with *pagemap* once parts of its heap are swapped out.
> >
> > A few things..
> >
> > - Terms like "Admin can" and "Admin could" are worrisome. How do we
> > know that admins *will* use this? How do we know that we aren't
> > adding a bunch of stuff which nobody will find to be (sufficiently)
> > useful? For example, is there some userspace tool to which you are
> > contributing which will be updated to use this feature?
>
> Actually, I used this feature two years ago to find memory hogs,
> although it was only a very quick prototype at the time. It was very
> useful for reducing memory cost in the embedded space.
>
> The reason I am trying to upstream the feature is I need the feature
> again. :)
>
> Yup, I have a userspace tool that uses the feature, although it is
> not compatible with this new version. It should be updated for the
> new format. I will find time to submit the tool.

hm, OK, can we get this info into the changelog?

> >
> > - block_state's second column is in microseconds since some
> > undocumented time. But how is userspace to know how much time has
> > elapsed since the access? ie, "current time".
>
> It's sched_clock, so it should be the elapsed time since system boot.
> I should have written that explicitly.
> I will fix it.
>
> >
> > - Is the sched_clock() return value suitable for exporting to
> > userspace? Is it monotonic? Is it consistent across CPUs, across
> > CPU hotadd/remove, across suspend/resume, etc? Does it run all the
> > way up to 2^64 on all CPU types, or will some processors wrap it at
> > (say) 32 bits? etcetera. Documentation/timers/timekeeping.txt
> > points out that suspend/resume can mess it up and that the counter
> > can drift between cpus.
>
> Good point!
>
> I just borrowed it from ftrace because I thought the goal was similar:
> "no need to be exact unless the drift is frequent, but it should be fast".
>
> AFAIK, ftrace/printk are active users of the function, so if the problem
> happened frequently, it would be serious. :)

It could be that ktime_get() is a better fit here - especially if
sched_clock() goes nuts after resume. Unfortunately ktime_get()
appears to be totally undocumented :(
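
For reference, a minimal sketch of how a userspace consumer might read the block_state format quoted above. This is not the tool Minchan mentions; the names BlockState, parse_block_state_line and cold_blocks are hypothetical, and the caller is assumed to supply "now" from whatever clock the second column ends up being based on, which is exactly the open question in this thread.

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    index: int          # first column: zram block index
    last_access: float  # second column: seconds, as printed by the kernel
    same: bool          # 's' flag: same-filled page
    written: bool       # 'w' flag: written to the backing store
    huge: bool          # 'h' flag: huge (incompressible) page


def parse_block_state_line(line):
    """Parse one line such as '300 75.033841 .wh' (hypothetical helper)."""
    index, stamp, flags = line.split()
    if len(flags) != 3:
        raise ValueError("unexpected flags field: %r" % flags)
    return BlockState(
        index=int(index),
        last_access=float(stamp),
        same="s" in flags,
        written="w" in flags,
        huge="h" in flags,
    )


def cold_blocks(lines, now, min_age):
    """Return blocks whose last access is at least min_age seconds before now.

    Assumption: 'now' comes from the same clock as the second column.
    """
    blocks = [parse_block_state_line(l) for l in lines if l.strip()]
    return [b for b in blocks if now - b.last_access >= min_age]


sample = [
    "300 75.033841 .wh",
    "301 63.806904 s..",
    "302 63.806919 ..h",
]
cold = cold_blocks(sample, now=100.0, min_age=30.0)
```

With the sample above, blocks 301 and 302 are reported as cold and block 300 is not. Passing "now" in explicitly keeps the sketch independent of whether the kernel side ends up using sched_clock() or ktime_get().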