Re: [PATCH v5 4/4] zram: introduce zram memory tracking

From: Minchan Kim
Date: Thu Apr 19 2018 - 22:09:34 EST


On Wed, Apr 18, 2018 at 02:07:15PM -0700, Andrew Morton wrote:
> On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> > Hi Andrew,
> >
> > On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > >
> > > > zRAM as swap is useful for small-memory devices. However, swap means
> > > > those pages on zram are mostly cold pages due to the VM's LRU algorithm.
> > > > In particular, once an application's init data has been touched during
> > > > launch, it tends not to be accessed any more and is finally swapped out.
> > > > zRAM can store such cold pages in compressed form, but it's pointless
> > > > to keep them in memory. A better idea is for app developers to free them
> > > > directly rather than leaving them on the heap.
> > > >
> > > > This patch tells us the last access time of each block of zram via
> > > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > > >
> > > > The output is as follows,
> > > > 300 75.033841 .wh
> > > > 301 63.806904 s..
> > > > 302 63.806919 ..h
> > > >
> > > > First column is zram's block index and the third one represents the
> > > > symbol (s: same page, w: written page to backing store, h: huge page)
> > > > of the block state. Second column represents the time, with usec
> > > > resolution, at which the block was last accessed. So the above example
> > > > means the 300th block was accessed at 75.033841 seconds and it was
> > > > huge, so it was written to the backing store.
> > > >
> > > > Admin can leverage this information to catch cold|incompressible pages
> > > > of process with *pagemap* once part of heaps are swapped out.
> > >
> > > A few things..
> > >
> > > - Terms like "Admin can" and "Admin could" are worrisome. How do we
> > > know that admins *will* use this? How do we know that we aren't
> > > adding a bunch of stuff which nobody will find to be (sufficiently)
> > > useful? For example, is there some userspace tool to which you are
> > > contributing which will be updated to use this feature?
> >
> > Actually, I used this feature two years ago to find memory hoggers,
> > although the feature was a quick prototype at the time. It was very
> > useful for reducing memory cost in the embedded space.
> >
> > The reason I am trying to upstream the feature is I need the feature
> > again. :)
> >
> > Yup, I have a userspace tool that uses the feature, although it is
> > not compatible with this new version. It should be updated for the
> > new format. I will find time to submit the tool.
>
> hm, OK, can we get this info into the changelog?

No problem. I will add as follows,

"I used the feature a few years ago to find memory hoggers in userspace
and to notify them of memory they had wasted by leaving it untouched for a
long time. With it, they could reduce unnecessary memory usage. However, at
that time, I hacked up zram for the feature; now that I need the feature
again, I decided it would be better to upstream it rather than keep it to
myself. I hope to submit the userspace tool that uses the feature soon."
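
For illustration only, a minimal sketch of how such a userspace tool might
read block_state, assuming the three-column format from the changelog quoted
above (block index, last-access time, flag symbols). This is not the actual
tool; the names BlockState and parse_block_state are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class BlockState(NamedTuple):
    """One line of block_state output (hypothetical record type)."""
    index: int          # zram block index
    last_access: float  # last access time in seconds, as printed
    same: bool          # 's': same-filled page
    written_back: bool  # 'w': written to the backing store
    huge: bool          # 'h': huge (incompressible) page


def parse_block_state(lines: Iterable[str]) -> Iterator[BlockState]:
    """Yield one BlockState per well-formed line, skipping anything else."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or unexpected lines
        index, stamp, flags = fields
        yield BlockState(
            index=int(index),
            last_access=float(stamp),
            same="s" in flags,
            written_back="w" in flags,
            huge="h" in flags,
        )


# Sample lines copied from the changelog above.
sample = """\
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
"""
blocks = list(parse_block_state(sample.splitlines()))
```

On a real system the input would come from
/sys/kernel/debug/zram/zram0/block_state instead of the sample string.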

>
> > >
> > > - block_state's second column is in microseconds since some
> > > undocumented time. But how is userspace to know how much time has
> > > elapsed since the access? ie, "current time".
> >
> > It's sched_clock, so it should be the elapsed time since system boot.
> > I should have written that explicitly.
> > I will fix it.
> >
> > >
> > > - Is the sched_clock() return value suitable for exporting to
> > > userspace? Is it monotonic? Is it consistent across CPUs, across
> > > CPU hotadd/remove, across suspend/resume, etc? Does it run all the
> > > way up to 2^64 on all CPU types, or will some processors wrap it at
> > > (say) 32 bits? etcetera. Documentation/timers/timekeeping.txt
> > > points out that suspend/resume can mess it up and that the counter
> > > can drift between cpus.
> >
> > Good point!
> >
> > I just borrowed it from ftrace because I thought the goal was similar:
> > "no need to be exact unless the drift is frequent but wanted to be fast"
> >
> > AFAIK, ftrace/printk is active user of the function so if the problem
> > happens frequently, it might be serious. :)
>
> It could be that ktime_get() is a better fit here - especially if
> sched_clock() goes nuts after resume. Unfortunately ktime_get()
> appears to be totally undocumented :(
>

I will use ktime_get_boottime(). With it, zram is not affected by
suspend/resume and the code would be simpler and clearer. For users, it
would be more straightforward to parse the time.
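
As a rough sketch of what that could look like from userspace: if the second
column ends up being seconds on the boot-time clock (an assumption, since the
final output format is not settled in this mail), "current time" can be read
with CLOCK_BOOTTIME, which is the userspace counterpart of
ktime_get_boottime(). The helper names below are hypothetical.

```python
import time
from typing import Iterable, List, Tuple


def boottime_now() -> float:
    """Seconds since boot, including time spent suspended (Linux).

    Falls back to CLOCK_MONOTONIC where CLOCK_BOOTTIME is unavailable;
    that fallback is an assumption made only for portability.
    """
    clock_id = getattr(time, "CLOCK_BOOTTIME", None)
    if clock_id is None:
        clock_id = time.CLOCK_MONOTONIC
    return time.clock_gettime(clock_id)


def block_age(last_access: float, now: float) -> float:
    """Seconds elapsed since the block was last accessed, never negative."""
    return max(0.0, now - last_access)


def cold_blocks(records: Iterable[Tuple[int, float]],
                now: float, threshold: float) -> List[int]:
    """Return indexes of blocks untouched for at least `threshold` seconds."""
    return [index for index, last_access in records
            if block_age(last_access, now) >= threshold]


# Timestamps taken from the sample output earlier in the thread.
records = [(300, 75.033841), (301, 63.806904), (302, 63.806919)]
cold = cold_blocks(records, now=80.0, threshold=10.0)
```

A real tool would pass now=boottime_now() rather than a fixed value.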

Thanks for the good suggestion, Andrew!