Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread

From: Hugh Dickins
Date: Tue Sep 09 2014 - 16:16:55 EST


On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> >
> > --- 3.17-rc4/include/linux/ksm.h 2014-03-30 20:40:15.000000000 -0700
> > +++ linux/include/linux/ksm.h 2014-09-07 11:54:41.528003316 -0700
>
> > @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
> > {
> > }
> >
> > +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
>
> s/ksm_switch/__&/

I don't think so (and I did try building with CONFIG_KSM off too).

>
> > +{
> > + return NULL;
> > +}
> > +
> > static inline int PageKsm(struct page *page)
> > {
> > return 0;
>
> > --- 3.17-rc4/kernel/sched/core.c 2014-08-16 16:00:54.062189063 -0700
> > +++ linux/kernel/sched/core.c 2014-09-07 11:54:41.528003316 -0700
>
> > @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
> > struct task_struct *next)
> > {
> > struct mm_struct *mm, *oldmm;
> > + wait_queue_head_t *wake_ksm = NULL;
> >
> > prepare_task_switch(rq, prev, next);
> >
> > @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
> > next->active_mm = oldmm;
> > atomic_inc(&oldmm->mm_count);
> > enter_lazy_tlb(oldmm, next);
> > - } else
> > + } else {
> > switch_mm(oldmm, mm, next);
> > + wake_ksm = ksm_switch(mm);
>
> Is this the right mm?

It's next->mm, that's the one I intended (though the patch might
be equally workable using prev->mm instead: given free rein, I'd
have opted for hooking into both prev and next, but free rein is
definitely not what should be granted around here!).

> We've just switched the stack,

I thought that came in switch_to() a few lines further down,
but don't think it matters for this.

> so we're looing at next->mm when we switched away from current.
> That might not exist anymore.

I fail to see how that can be. Looking at the x86 switch_mm(),
I can see it referencing (unsurprisingly!) both old and new mms
at this point, and no reference to an mm is dropped before the
ksm_switch(). oldmm (there called mm) is mmdropped later in
finish_task_switch().

>
> > + }
> >
> > if (!prev->mm) {
> > prev->active_mm = NULL;
> > @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
> > * frame will be invalid.
> > */
> > finish_task_switch(this_rq(), prev);
> > +
> > + if (wake_ksm)
> > + wake_up_interruptible(wake_ksm);
> > }
>
> Quite horrible for sure. I really hate seeing KSM cruft all the way down

Yes, I expected that, and I would certainly feel the same way.

And even worse, imagine if this were successful, we might come along
and ask to do something similar for khugepaged. Though if it comes to
that, I'm sure we would generalize into one hook which does not say
"ksm" or "khugepaged" on it, but would still a present a single unlikely
flag to be tested at this level. Maybe you would even prefer the
generalized version, but I don't want to complicate the prototype yet.

If it weren't for the "we already have the mm cachelines here" argument,
I by now feel fairly sure that I would be going for hooking into timer
tick instead (where Thomas could then express his horror!).

Do you think I should just forget about cacheline micro-optimizations
and go in that direction instead?

> here. Can't we create a new (timer) infrastructure that does the right
> thing? Surely this isn't the only such case.

A sleep-walking timer, that goes to sleep in one bed, but may wake in
another; and defers while beds are empty? I'd be happy to try using
that for KSM if it already existed, and no doubt Chintan would too

But I don't think KSM presents a very good case for developing it.
I think KSM's use of a sleep_millisecs timer is really just an apology
for the amount of often wasted work that it does, and dates from before
we niced it down 5. I prefer the idea of a KSM which waits on activity
amongst the restricted set of tasks it is tracking: as this patch tries.

But my preference may be naive: doing lots of unnecessary work doesn't
matter as much as waking cpus from deep sleep.

>
> I know both RCU and some NOHZ_FULL muck already track when the system is
> completely idle. This is yet another case of that.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/