Re: [0/2] memrlimit improve error handling

From: Balbir Singh
Date: Tue Jun 24 2008 - 21:01:30 EST


Hugh Dickins wrote:
> Hi Balbir,
>
> Below I've a few comments on this particular patch, then report other
> problems I've seen with memrlimits configured into my 2.6.25-rc5-mm3
> kernel - I've not tried _using_ them. My own opinion would be that
> -mm already contains enough breakage, we don't need memrlimits yet.
>

Thanks for the review and feedback, I'll take a look and fix

> On Fri, 20 Jun 2008, Balbir Singh wrote:
>> memrlimit cgroup does not handle error cases after may_expand_vm(). This
>> BUG was reported by Kamezawa, with the test case below to reproduce it
>>
>> This patch adds better handling support to fix the reported problem.
>>
>> Reported-By: kamezawa.hiroyu@xxxxxxxxxxxxxx
>>
>> Signed-off-by: Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx>
>
> I wrote a similar patch when I saw Kame's report, because I was
> interested in getting that accounting right. Comparing the two,
> a couple of notes on the details...
>
> In mm/mmap.c:
>> @@ -1982,6 +1992,7 @@ unsigned long do_brk(unsigned long addr,
>> struct rb_node ** rb_link, * rb_parent;
>> pgoff_t pgoff = addr >> PAGE_SHIFT;
>> int error;
>> + int ret = -ENOMEM;
>
> I think we don't want int error returned from some parts of do_brk()
> and int ret returned from other parts: please use int error throughout.
> It would probably be nicer to add int error rather than int ret in
> acct_stack_growth() too, but that doesn't matter much.
>

Fair enough, will fix

> In mm/mremap.c:
>> @@ -256,6 +257,7 @@ unsigned long do_mremap(unsigned long ad
>> struct vm_area_struct *vma;
>> unsigned long ret = -EINVAL;
>> unsigned long charged = 0;
>> + int vm_expanded = 0;
>>
>> if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
>> goto out;
>> @@ -349,6 +351,7 @@ unsigned long do_mremap(unsigned long ad
>> goto out;
>> }
>>
>> + vm_expanded = 1;
>> if (vma->vm_flags & VM_ACCOUNT) {
>> charged = (new_len - old_len) >> PAGE_SHIFT;
>> if (security_vm_enough_memory(charged))
>> @@ -411,6 +414,9 @@ out:
>> if (ret & ~PAGE_MASK)
>> vm_unacct_memory(charged);
>> out_nc:
>> + if (vm_expanded)
>> + memrlimit_cgroup_uncharge_as(mm,
>> + (new_len - old_len) >> PAGE_SHIFT);
>> return ret;
>> }
>
> See how vm_unacct_memory(charged) is only called if (ret & ~PAGE_MASK)?
> If ret is a valid address being returned, we do not want to uncharge.
> So I believe you need to do likewise with your uncharge_as().
>
> And please handle them both in the same way: either follow the same
> "charged" style as is being used for vm_unacct_memory, rather than a
> boolean; or convert vm_unacct_memory over to use your boolean style:
> but it's unhelpful to have the two using different techniques.
>

I did look at that and use that code, but we initialize ret = -EINVAL and then
we have the checks for

if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
goto out;

if (addr & ~PAGE_MASK)
goto out;

In out, we check for
ret & ~PAGE_MASK

That's OK for vm_unacct_memory, since charged is set to 0, but not for
mem_cgroup_uncharge_as as we uncharge (new_len - old_len) >> PAGE_SHIFT);

I'll find a way to see if we can convert both over.

> In kernel/fork.c:
> nothing, but when I had a quick look there, again the error handling
> appeared to be broken e.g. if allocate_mm fails, where's the uncharge?
> But be careful: there's a particular point where enough of the new mm
> is set up that it gets torn down by normal exit_mmap.
>

Yes, that needs fixing

> And while looking at copy_mm, do I see an extra down_write and
> up_write of mmap_sem, just to guard some memrlimit charging?

Yes, we need to prevent charging and migration race.

> Don't add overhead when memrlimits are CONFIGed off; but can't
> it be moved into dup_mm() where mmap_sem is already down_write?
>

Yes, it can be moved there

> Please go through all your charges, again, to double check
> that you've got the uncharging right in the error cases.
>

Will do


> I was interested in these because I find that after running kernel
> build on tmpfs swapping loads in a 700M memcg (but you may well hit
> other rc5-mm3 bugs if you try that, I've fixes to send out) for some

I've never tried that before

> hours on x86_64, my shutdown hits the kernel/res_counter.c:49
> res_counter_uncharge_locked() WARN_ON(counter->usage < val) several
> times, called from memrlimit_cgroup_uncharge_as called from exit_mmap.
>
> I have no idea what the cause is; but I've not seen it on i386,
> and I'm not seeing it after shorter runs. It could even be some
> error introduced by other patches in what I'm testing: so don't
> worry too much about it yet, but please keep an eye out for it.
>

Sure, will do

> (If I'd thought harder, I'd have been less interested in the
> charge_as leaks: those WARN_ONs come when too much is being
> uncharged, not when too little.)
>
> But the issue which worries me most, because it's mislocking
> which affects anyone with CONFIG_MM_OWNER=y, is seen when I
> have CONFIG_DEBUG_SPINLOCK_SLEEP=y:
>
> BUG: sleeping function called from invalid context at kernel/rwsem.c:48
> in_atomic():1, irqs_disabled():0
> 1 lock held by blogd/830:
> #0: (tasklist_lock){..--}, at: [<78125869>] mm_update_next_owner+0x1dd/0x1fa
> Pid: 830, comm: blogd Not tainted 2.6.26-rc5-mm2 #2
> [<78120ea2>] __might_sleep+0xed/0xf2
> [<78367947>] down_write+0x13/0x43
> [<781257b6>] mm_update_next_owner+0x12a/0x1fa
> [<7812595a>] exit_mm+0xd4/0xe5
> [<78125f9d>] do_exit+0x1e7/0x2bc
> [<781260fb>] do_group_exit+0x61/0x8a
> [<78126134>] sys_exit_group+0x10/0x13
> [<78102dd9>] sysenter_past_esp+0x6a/0xa5
>
> Your memrlimit-cgroup-mm-owner-callback-changes-to-add-task-info.patch
> thinks it can acquire mmap_sem while it has read_lock(&tasklist_lock).
> Sorry, no: you'll have to rework that somehow.
>

Yes, that is nasty and a bad miss. It should be easy to fix, but will require
extensive testing.


> (In passing, I'll add that I'm not a great fan of these memrlimits:
> to me it's loony to be charging people for virtual address space,
> it's _virtual_, and process A can have as much as it likes without
> affecting process B in any way. You're following the lead of RLIMIT_AS,
> but I've always thought RLIMIT_AS a lame attempt to move into the mmap
> decade, after RLIMIT_DATA and RLIMIT_STACK no longer made sense.
>
> Taking Alan Cox's Committed_AS as a limited resource charged per mm makes
> much more sense to me: but yes, it's not perfect, and it is a lot harder
> to get its accounting right, and to maintain that down the line. Okay,
> you've gone for the easier option of tracking total_vm, getting that
> right is a more achievable target. And I accept that I may be too
> pessimistic about it: total_vm may often enough give a rough
> approximation to something else worth limiting.)

You seem to have read my mind, my motivation for memrlimits is

1. Administrators to set a limit and be sure that a cgroup cannot consume more
swap + RSS than the assigned virtual memory limit
2. It allows applications to fail gracefully or decide what parts to free up
to get more memory or change their allocation pattern (a scientific application
deciding what size of matrix to allocate for example).

Excellent feedback, thanks for the review!

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/