Re: [PATCH v5 4/4] zram: introduce zram memory tracking

From: Minchan Kim
Date: Fri Apr 20 2018 - 02:35:47 EST


On Fri, Apr 20, 2018 at 11:09:21AM +0900, Minchan Kim wrote:
> On Wed, Apr 18, 2018 at 02:07:15PM -0700, Andrew Morton wrote:
> > On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > > Hi Andrew,
> > >
> > > On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > > > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > > >
> > > > > > zram as swap is useful for small-memory devices. However, swap means
> > > > > > those pages on zram are mostly cold pages due to the VM's LRU algorithm.
> > > > > > In particular, once an application's init data has been touched during
> > > > > > launch, it tends not to be accessed any more and is finally swapped out.
> > > > > > zram can store such cold pages in compressed form, but it's pointless
> > > > > > to keep them in memory. A better idea is for app developers to free them
> > > > > > directly rather than leaving them on the heap.
> > > > >
> > > > > > This patch tells us the last access time of each block of zram via
> > > > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > > > >
> > > > > The output is as follows,
> > > > > 300 75.033841 .wh
> > > > > 301 63.806904 s..
> > > > > 302 63.806919 ..h
> > > > >
> > > > > > The first column is zram's block index and the 3rd one is the block
> > > > > > state (s: same page, w: page written to backing store, h: huge page).
> > > > > > The second column is the time at which the block was last accessed,
> > > > > > with microsecond resolution. So the above example means the 300th
> > > > > > block was accessed at 75.033841 seconds and it was huge, so it was
> > > > > > written to the backing store.
> > > > >
> > > > > > Admin can leverage this information to catch cold or incompressible
> > > > > > pages of a process with *pagemap* once part of its heap is swapped out.
> > > >
> > > > A few things..
> > > >
> > > > - Terms like "Admin can" and "Admin could" are worrisome. How do we
> > > > know that admins *will* use this? How do we know that we aren't
> > > > adding a bunch of stuff which nobody will find to be (sufficiently)
> > > > useful? For example, is there some userspace tool to which you are
> > > > contributing which will be updated to use this feature?
> > >
> > > > Actually, I used this feature two years ago to find memory hoggers,
> > > > although the feature was a very quick prototype. It was very useful
> > > > for reducing memory cost in the embedded space.
> > >
> > > The reason I am trying to upstream the feature is I need the feature
> > > again. :)
> > >
> > > > Yup, I have a userspace tool that uses the feature, although it is
> > > > not compatible with this new version. It needs to be updated for the
> > > > new format. I will find time to submit the tool.
> >
> > hm, OK, can we get this info into the changelog?
>
> No problem. I will add as follows,
>
> "I used the feature a few years ago to find memory hoggers in userspace
> and to show them what memory they had left untouched for a long time.
> With it, they could reduce unnecessary memory usage. At that time I
> hacked up zram for the feature, but now I need the feature again, so
> I decided it would be better to upstream it rather than keep it private.
> I hope to submit the userspace tool that uses the feature soon."
>
> >
> > > >
> > > > - block_state's second column is in microseconds since some
> > > > undocumented time. But how is userspace to know how much time has
> > > > elapsed since the access? ie, "current time".
> > >
> > > > It's sched_clock(), so it should be the elapsed time since system boot.
> > > > I should have written that explicitly.
> > > > I will fix it.
> > >
> > > >
> > > > - Is the sched_clock() return value suitable for exporting to
> > > > userspace? Is it monotonic? Is it consistent across CPUs, across
> > > > CPU hotadd/remove, across suspend/resume, etc? Does it run all the
> > > > way up to 2^64 on all CPU types, or will some processors wrap it at
> > > > (say) 32 bits? etcetera. Documentation/timers/timekeeping.txt
> > > > points out that suspend/resume can mess it up and that the counter
> > > > can drift between cpus.
> > >
> > > Good point!
> > >
> > > > I just borrowed it from ftrace because I thought the goal was similar:
> > > > "no need to be exact unless the drift is frequent, but it has to be fast".
> > > >
> > > > AFAIK, ftrace and printk are active users of the function, so if the
> > > > problem happened frequently, it would be serious. :)
> >
> > It could be that ktime_get() is a better fit here - especially if
> > sched_clock() goes nuts after resume. Unfortunately ktime_get()
> > appears to be totally undocumented :(
> >
>
> I will use ktime_get_boottime(). With it, zram is not affected by
> suspend/resume and the code will be simpler and clearer. For users, it
> will be more straightforward to parse the time.
>
> Thanks for the good suggestion, Andrew!
>

Hey Andrew,

This is the updated patch for 4/4.
If you want to replace the full patchset, please tell me and I will
resend it.
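
For reference, here is a rough sketch of how a userspace tool might consume
block_state. It is only an illustration, not the tool mentioned above: it
assumes the three-column format quoted in the changelog (index, timestamp in
seconds, three-character state field), and the names BlockState,
parse_block_state and cold_blocks are made up for this example.

```python
from typing import List, NamedTuple


class BlockState(NamedTuple):
    """One line of block_state (hypothetical container for this sketch)."""
    index: int          # zram block index
    last_access: float  # last access time, in seconds
    same: bool          # 's': same-filled page
    written: bool       # 'w': written to the backing store
    huge: bool          # 'h': huge (incompressible) page


def parse_block_state(text: str) -> List[BlockState]:
    """Parse lines of the form '<index> <seconds> <flags>'."""
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, stamp, flags = fields
        blocks.append(BlockState(int(index), float(stamp),
                                 "s" in flags, "w" in flags, "h" in flags))
    return blocks


def cold_blocks(blocks: List[BlockState], now: float,
                min_idle: float) -> List[BlockState]:
    """Return blocks not accessed for at least min_idle seconds before now."""
    return [b for b in blocks if now - b.last_access >= min_idle]


# Sample taken from the changelog example.
SAMPLE = """\
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
"""

if __name__ == "__main__":
    # 'now' is passed in explicitly; 100.0 is an arbitrary value for the demo.
    for b in cold_blocks(parse_block_state(SAMPLE), now=100.0, min_idle=30.0):
        print(b.index, b.last_access)
```

On a live system, and assuming the timestamps end up based on
ktime_get_boottime() as discussed, "now" could be read with
time.clock_gettime(time.CLOCK_BOOTTIME) so that both values share the same
clock.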