Re: [PATCH v1] ucounts: Fix race condition between alloc_ucounts and put_ucounts

From: Eric W. Biederman
Date: Wed Jul 28 2021 - 13:05:23 EST


Alexey Gladkov <legion@xxxxxxxxxx> writes:

> On Wed, Jul 28, 2021 at 10:58:37AM +0800, Hillf Danton wrote:
>> On Tue, 27 Jul 2021 17:24:18 +0200 Alexey Gladkov wrote:
>> > +++ b/kernel/ucount.c
>> > @@ -160,6 +160,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
>> > {
>> > struct hlist_head *hashent = ucounts_hashentry(ns, uid);
>> > struct ucounts *ucounts, *new;
>> > + long overflow;
>> >
>> > spin_lock_irq(&ucounts_lock);
>> > ucounts = find_ucounts(ns, uid, hashent);
>> > @@ -184,8 +185,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
>> > return new;
>> > }
>> > }
>> > + overflow = atomic_add_negative(1, &ucounts->count);
>> > spin_unlock_irq(&ucounts_lock);
>> > - ucounts = get_ucounts(ucounts);
>> > + if (overflow) {
>> > + put_ucounts(ucounts);
>>
>> Given if (atomic_add_unless(atomic, -1, 1))
>> return 0;
>>
>> put can not help roll back overflow.
>
> In case of overflow, we don't try to rollback overflow. We return an
> error.

Unfortunately I don't see the email with the original comment, but let
me see if I can clarify a little.

The code in get_ucounts explicitly uses atomic_add_negative as a
performance optimization. Testing whether the count has gone negative
is enough to tell whether the count is larger than is supported. This
matters because atomic_add_negative can be cheaper than cmpxchg, so it
is faster to reserve all of the negative numbers to catch the case
where the counter grows too large than to precisely bound the count at
a specific cutoff.
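
A rough userspace sketch of the two approaches, written with C11
atomics instead of the kernel's atomic_t API. The names add_negative,
get_ref and get_ref_bounded are made up for illustration and are not
the kernel functions; the rollback on failure is only an approximation
of what get_ucounts does by calling put_ucounts.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for atomic_add_negative(): add i to *v and
 * report whether the result is negative. C11 atomic arithmetic on
 * signed types wraps silently, so INT_MAX + 1 becomes INT_MIN.
 */
static bool add_negative(atomic_int *v, int i)
{
	int old = atomic_fetch_add(v, i);

	/* Recompute the wrapped result without signed-overflow UB
	 * (the unsigned-to-int conversion wraps on gcc and clang). */
	return (int)((unsigned int)old + (unsigned int)i) < 0;
}

/*
 * Take a reference with one unconditional add. Every negative value
 * means "too many references", so no exact limit has to be compared.
 */
static bool get_ref(atomic_int *count)
{
	if (add_negative(count, 1)) {
		atomic_fetch_sub(count, 1);	/* give the reference back */
		return false;
	}
	return true;
}

/*
 * The alternative: bound the count at an exact limit. This needs a
 * compare-and-exchange loop that may retry under contention.
 */
static bool get_ref_bounded(atomic_int *count, int limit)
{
	int old = atomic_load(count);

	do {
		if (old >= limit)
			return false;
	} while (!atomic_compare_exchange_weak(count, &old, old + 1));
	return true;
}
```

With get_ref the fast path is a single atomic add with no retry loop,
at the cost of giving up half of the integer range to the "too large"
indication.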

This particular code path can not use atomic_add_unless(.., -1,...)
because get_ucounts may have already hit the limit, so the count may be
a negative value other than -1.

>> BTW can you specify a bit on the real workloads with the risk of count overflow?

One way I think it is possible to reach a count of 2^31 is to set the
rlimit for pending signals to unlimited and post a bunch of realtime
signals to a process that simply does not read them.

As pointed out in Alex's link below, this code notices when the maximum
count is reached and fails gracefully, unlike refcount_t, which would
leak memory.

The point is to handle unrealistic workloads gracefully from a reference
counting perspective. If real workloads start reaching the maximum
count, something probably needs to change (larger counts, or changing
what gets counted).

> For example, one user has too many processes in one namespace.
>
> It is necessary to check and handle the possibility of counter overflow
> in this case. Linus described it here:
>
> https://lore.kernel.org/lkml/CAHk-%3dwjYOCgM%2bmKzwTZwkDDg12DdYjFFkmoFKYLim7NFmR9HBg@xxxxxxxxxxxxxx/
>
>> > + return NULL;
>> > + }
>> > return ucounts;
>> > }
>> >
>> > @@ -193,8 +198,7 @@ void put_ucounts(struct ucounts *ucounts)
>> > {
>> > unsigned long flags;
>> >
>> > - if (atomic_dec_and_test(&ucounts->count)) {
>> > - spin_lock_irqsave(&ucounts_lock, flags);
>> > + if (atomic_dec_and_lock_irqsave(&ucounts->count, &ucounts_lock, flags)) {
>> > hlist_del_init(&ucounts->node);
>> > spin_unlock_irqrestore(&ucounts_lock, flags);
>> > kfree(ucounts);
>> > --
>> > 2.29.3

Eric