Re: [Approach #2] [RFC][PATCH] Remove cgroup member from structpage

From: KAMEZAWA Hiroyuki
Date: Tue Sep 09 2008 - 22:41:49 EST


On Tue, 09 Sep 2008 19:11:09 -0700
Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
> > 1. This is nonsense...do you know the memory map of IBM's (maybe ppc) machine ?
> > Node's memory are splitted into several pieces and not ordered by node number.
> > example)
> > Node 0 | Node 1 | Node 2 | Node 1 | Node 2 |
> >
> > This seems special but when I helped SPARSEMEM and MEMORY_HOTPLUG,
> > I saw mannnny kinds of memory map. As you wrote, this should be re-designed.
> >
>
> Thanks, so that means that we cannot before hand predict the size of pcg_map[n],
> we'll need to do an incremental addition to pcg_map?
Or use some "allocate a chunk of page_cgroup for a chunk of continuous pages".
(This is the reason I mentioned SPARSEMEM.)

>
> > 2. If pre-allocating all is ok, I stop my work. Mine is of-no-use.
>
> One of the goals of this patch is refinement, it is a starting piece, something
> I shared very early. I am not asking you to stop your work. While I think
> pre-allocating is not the best way to do this, the trade off is the sparseness
> of the machine. I don't mind doing it in other ways, but we'll still need to do
> some batch'ed preallocation (of a smaller size maybe).
>
Hmm, maybe clarifying trade-off and comapring them is the first step.
I'll post my idea if it comes.

> > But you have to know that by pre-allocationg, we can't use avoid-lru-lock
> > by batch like page_vec technique. We can't delay uncharge because a page
> > can be reused soon.
> >
> >
>
> Care to elaborate on this? Why not? If the page is reused, we act on the batch
> and sync it up
>
And touch vec on other cpu ? The reason "vec" is fast is because it's per-cpu.
If we want to use "delaying", we'll have to make page_cgroup unused and not-on-lru
when the page of page_cgroup is added to free queue.

> >
> >
> >> + pcg_map[n] = alloc_bootmem_node(pgdat, size);
> >> + /*
> >> + * We can do smoother recovery
> >> + */
> >> + BUG_ON(!pcg_map[n]);
> >> + return 0;
> >> }
> >>
> >> -static int try_lock_page_cgroup(struct page *page)
> >> +void page_cgroup_init(int nid, unsigned long pfn)
> >> {
> >> - return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> >> + unsigned long node_pfn;
> >> + struct page_cgroup *pc;
> >> +
> >> + if (mem_cgroup_subsys.disabled)
> >> + return;
> >> +
> >> + node_pfn = pfn - NODE_DATA(nid)->node_start_pfn;
> >> + pc = &pcg_map[nid][node_pfn];
> >> +
> >> + BUG_ON(!pc);
> >> + pc->flags = PAGE_CGROUP_FLAG_VALID;
> >> + INIT_LIST_HEAD(&pc->lru);
> >> + pc->page = NULL;
> > This NULL is unnecessary. pc->page = pnf_to_page(pfn) always.
> >
>
> OK
>
> >
> >> + pc->mem_cgroup = NULL;
> >> }
> >>
> >> -static void unlock_page_cgroup(struct page *page)
> >> +struct page_cgroup *__page_get_page_cgroup(struct page *page, bool lock,
> >> + bool trylock)
> >> {
> >> - bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> >> + struct page_cgroup *pc;
> >> + int ret;
> >> + int node = page_to_nid(page);
> >> + unsigned long pfn;
> >> +
> >> + pfn = page_to_pfn(page) - NODE_DATA(node)->node_start_pfn;
> >> + pc = &pcg_map[node][pfn];
> >> + BUG_ON(!(pc->flags & PAGE_CGROUP_FLAG_VALID));
> >> + if (lock)
> >> + lock_page_cgroup(pc);
> >> + else if (trylock) {
> >> + ret = trylock_page_cgroup(pc);
> >> + if (!ret)
> >> + pc = NULL;
> >> + }
> >> +
> >> + return pc;
> >> +}
> >> +
> >> +/*
> >> + * Should be called with page_cgroup lock held. Any additions to pc->flags
> >> + * should be reflected here. This might seem ugly, refine it later.
> >> + */
> >> +void page_clear_page_cgroup(struct page_cgroup *pc)
> >> +{
> >> + pc->flags &= ~PAGE_CGROUP_FLAG_INUSE;
> >> }
> >>
> >> static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
> >> @@ -377,17 +443,15 @@ void mem_cgroup_move_lists(struct page *
> >> * safely get to page_cgroup without it, so just try_lock it:
> >> * mem_cgroup_isolate_pages allows for page left on wrong list.
> >> */
> >> - if (!try_lock_page_cgroup(page))
> >> + pc = page_get_page_cgroup_trylock(page);
> >> + if (!pc)
> >> return;
> >>
> >> - pc = page_get_page_cgroup(page);
> >> - if (pc) {
> >> - mz = page_cgroup_zoneinfo(pc);
> >> - spin_lock_irqsave(&mz->lru_lock, flags);
> >> - __mem_cgroup_move_lists(pc, lru);
> >> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> - }
> >> - unlock_page_cgroup(page);
> >> + mz = page_cgroup_zoneinfo(pc);
> >> + spin_lock_irqsave(&mz->lru_lock, flags);
> >> + __mem_cgroup_move_lists(pc, lru);
> >> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> + unlock_page_cgroup(pc);
> >
> > This lock/unlock_page_cgroup is against what ?
> >
>
> We use page_cgroup_zoneinfo(pc), we want to make sure pc does not disappear or
> change from underneath us.
>
> >> }
> >>
> >> /*
> >> @@ -521,10 +585,6 @@ static int mem_cgroup_charge_common(stru
> >> unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >> struct mem_cgroup_per_zone *mz;
> >>
> >> - pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> >> - if (unlikely(pc == NULL))
> >> - goto err;
> >> -
> >> /*
> >> * We always charge the cgroup the mm_struct belongs to.
> >> * The mm_struct's mem_cgroup changes on task migration if the
> >> @@ -567,43 +627,40 @@ static int mem_cgroup_charge_common(stru
> >> }
> >> }
> >>
> >> + pc = page_get_page_cgroup_locked(page);
> >> + if (pc->flags & PAGE_CGROUP_FLAG_INUSE) {
> >> + unlock_page_cgroup(pc);
> >> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> + css_put(&mem->css);
> >> + goto done;
> >> + }
> >> +
> > Can this happen ? Our direction should be
> > VM_BUG_ON(pc->flags & PAGE_CGROUP_FLAG_INUSE)
> >
>
> Yes, it can.. several people trying to map the same page at once. Can't we race
> doing that?
>
I'll dig this. My version(lockless) already removed this and use VM_BUG_ON()

> >
> >
> >> pc->mem_cgroup = mem;
> >> pc->page = page;
> >> + pc->flags |= PAGE_CGROUP_FLAG_INUSE;
> >> +
> >> /*
> >> * If a page is accounted as a page cache, insert to inactive list.
> >> * If anon, insert to active list.
> >> */
> >> if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> >> - pc->flags = PAGE_CGROUP_FLAG_CACHE;
> >> + pc->flags |= PAGE_CGROUP_FLAG_CACHE;
> >> if (page_is_file_cache(page))
> >> pc->flags |= PAGE_CGROUP_FLAG_FILE;
> >> else
> >> pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> >> } else
> >> - pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> >> -
> >> - lock_page_cgroup(page);
> >> - if (unlikely(page_get_page_cgroup(page))) {
> >> - unlock_page_cgroup(page);
> >> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> - css_put(&mem->css);
> >> - kmem_cache_free(page_cgroup_cache, pc);
> >> - goto done;
> >> - }
> >> - page_assign_page_cgroup(page, pc);
> >> + pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> >>
> >> mz = page_cgroup_zoneinfo(pc);
> >> spin_lock_irqsave(&mz->lru_lock, flags);
> >> __mem_cgroup_add_list(mz, pc);
> >> spin_unlock_irqrestore(&mz->lru_lock, flags);
> >> -
> >> - unlock_page_cgroup(page);
> >> + unlock_page_cgroup(pc);
> >
> > Is this lock/unlock_page_cgroup is for what kind of race ?
>
> for setting pc->flags and for setting pc->page and pc->mem_cgroup.
>

Hmm...there is a confustion, maybe.

The page_cgroup is now 1:1 to struct page. Then, we can guarantee that

- There is no race between charge v.s. uncharge.

Only problem is force_empty. (But it's difficult..)

This means pc->mem_cgroup is safe here.
And pc->flags should be atomic flags, anyway. I believe we have to record
"Dirty bit" at el, later.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/