Re: [PATCH 1/3] ipc: convert ipc_namespace.count from atomic_t to refcount_t

From: Kees Cook
Date: Thu Jul 20 2017 - 11:12:52 EST


On Thu, Jul 20, 2017 at 5:34 AM, Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
> Ingo Molnar <mingo@xxxxxxxxxx> writes:
>
>> * Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>>> On Wed, 19 Jul 2017 15:54:27 -0700 Davidlohr Bueso <dave@xxxxxxxxxxxx> wrote:
>>>
>>> > On Wed, 19 Jul 2017, Andrew Morton wrote:
>>> >
>>> > >I do rather dislike these conversions from the point of view of
>>> > >performance overhead and general code bloat. But I seem to have lost
>>> > >that struggle and I don't think any of these are fastpath(?).
>>> >
>>> > Well, since we now have fd25d19 (locking/refcount: Create unchecked atomic_t
>>> > implementation), performance is supposed to be ok.
>>>
>>> Sure, things are OK for people who disable the feature.
>>
>> So with the WIP fast-refcount series from Kees:
>>
>> [PATCH v6 0/2] x86: Implement fast refcount overflow protection
>>
>> I believe the robustness difference between optimized-refcount_t and
>> full-refcount_t will be marginal.
>>
>> I.e. we'll be able to have both higher API safety _and_ performance.
>>
>>> But for people who want to enable the feature we really should minimize the cost
>>> by avoiding blindly converting sites which simply don't need it: simple, safe,
>>> old, well-tested code. Why go and slow down such code? Need to apply some
>>> common sense here...
>>
>> It's old, well-tested code _for existing, sane parameters_, until someone finds a
>> decade-old bug in one of these with insane parameters no one stumbled upon so
>> far, and builds an exploit on top of it.
>>
>> Only by touching all these places do we have a chance to improve things measurably
>> in terms of reducing the probability of bugs.
>
> The more I hear people pushing the upsides of refcount_t without
> considering the downsides the more I dislike it.
>
> - refcount_t is really the wrong thing because it uses saturation
> semantics. So by definition it includes a bug.

This is a feature, not a bug. :) If the kernel has a refcount overflow
flaw (which, in the pantheon of exploitable kernel bugs, is
_common_[1], as I've referenced earlier), then we're downgrading an
exploitable use-after-free to a harmless memory allocation leak. Even
if you don't include malicious attackers in the consideration, this
changes a memory corruption of unknown results into a memory leak.
That's actually an _improvement_ to availability and integrity.
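
To make the difference concrete, here is a rough userspace sketch of
wrapping versus saturating counters. It is only a model under stated
assumptions, not the kernel's refcount_t code: the names wrap_ref,
sat_ref and SAT_REF_SATURATED are made up for illustration, and the
real implementation uses atomic operations and emits warnings.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model types; NOT the kernel's atomic_t or refcount_t. */
struct wrap_ref { unsigned int count; };  /* wraps on overflow */
struct sat_ref  { unsigned int count; };  /* sticks at a ceiling */

#define SAT_REF_SATURATED UINT_MAX        /* assumed saturation value */

/* Plain counter: unsigned arithmetic silently wraps UINT_MAX -> 0. */
static void wrap_ref_get(struct wrap_ref *r) { r->count++; }

/* Returns true when the caller should free the object. */
static bool wrap_ref_put(struct wrap_ref *r) { return --r->count == 0; }

/* Saturating counter: once the ceiling is reached it never moves again. */
static void sat_ref_get(struct sat_ref *r)
{
	if (r->count != SAT_REF_SATURATED)
		r->count++;
}

/* A saturated counter never reports "free": the object is leaked. */
static bool sat_ref_put(struct sat_ref *r)
{
	if (r->count == SAT_REF_SATURATED)
		return false;
	return --r->count == 0;
}

static void saturation_demo(void)
{
	struct wrap_ref w = { .count = UINT_MAX };
	struct sat_ref s = { .count = UINT_MAX - 1 };

	/* One leaked increment too many wraps the plain counter to 0... */
	wrap_ref_get(&w);
	assert(w.count == 0);
	/* ...so the next get/put pair frees an object that still has users. */
	wrap_ref_get(&w);
	assert(wrap_ref_put(&w));

	/* The saturating counter pins itself instead, and is never freed. */
	sat_ref_get(&s);
	sat_ref_get(&s);
	assert(s.count == SAT_REF_SATURATED);
	assert(!sat_ref_put(&s));
}
```

In the wrapping case the object gets freed while other holders still
point at it; in the saturating case the worst outcome is that the
object is never freed.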

> - refcount_t will only really prevent something if there is an extra
> increment. That is not the kind of bug people are likely to make.

Like I've said, this is common. This is usually a mistake in error
handling which forgets (or misplaces) a "put".
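
A minimal sketch of the pattern I mean, with entirely hypothetical
names (struct obj, obj_get, obj_put, use_obj_buggy, use_obj_fixed); it
is not taken from any real kernel path, and a real refcount would be
atomic:

```c
#include <assert.h>

/* Hypothetical object with a plain, non-atomic count for illustration. */
struct obj { unsigned int refs; };

static void obj_get(struct obj *o) { o->refs++; }
static void obj_put(struct obj *o) { o->refs--; }

/* Buggy: the early return on failure skips the matching put. */
static int use_obj_buggy(struct obj *o, int fail)
{
	obj_get(o);
	if (fail)
		return -1;	/* leaked reference */
	obj_put(o);
	return 0;
}

/* Fixed: every exit path drops the reference it took. */
static int use_obj_fixed(struct obj *o, int fail)
{
	int ret = fail ? -1 : 0;

	obj_get(o);
	obj_put(o);
	return ret;
}

static void error_path_demo(void)
{
	struct obj o = { .refs = 1 };

	/* Each failing call through the buggy path leaks one increment. */
	use_obj_buggy(&o, 1);
	use_obj_buggy(&o, 1);
	assert(o.refs == 3);

	/* The fixed path leaves the count unchanged, failing or not. */
	use_obj_fixed(&o, 1);
	use_obj_fixed(&o, 0);
	assert(o.refs == 3);
}
```

If an attacker can trigger the failing path at will, each call adds
one to the count, and enough calls will overflow an unprotected
counter.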

> - refcount_t won't help if you have an extra decrement. The bad
> use-after-free will still happen.

Yes, and not having a protected refcount_t will also allow a
use-after-free. There is no change here, so it's not a "downside" of
refcount_t. In fact, having gained the implicit annotation of
refcount_t being a refcounter (rather than a simple atomic_t) means
that auditing users is easier and more focused. This could reduce the
chance people make mistakes in the first place, especially since the
API is more constrained than atomic_t.

> - refcount_t won't help if there is a memory stomp. As with an extra
> decrement the bad use-after-free will still happen.

A stomp of the refcount_t value itself? Sure, and this remains as
vulnerable as atomic_t. This isn't a downside to refcount_t. And
again, since there _is_ checking of the value in places, it's possible
an actionable warning will be produced (though, yes, after the
use-after-free has been exposed), which is a benefit over simple
atomic_t. I mention this in the commit log ("better to maybe produce
the warning than be universally silent").
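
As a rough illustration of that kind of value checking, here is a
userspace sketch of a "put" that notices it was called on a count
that is already zero. The names chk_ref, chk_ref_put and
chk_ref_warnings are hypothetical, and the real refcount_t checks and
warning text differ:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical checked counter; NOT the kernel's refcount_t. */
struct chk_ref { unsigned int count; };

/* Number of warnings emitted so far, so a caller can observe them. */
static unsigned int chk_ref_warnings;

/* Returns true when the caller should free the object. */
static bool chk_ref_put(struct chk_ref *r)
{
	if (r->count == 0) {
		/* Too late to prevent the bug, but at least it is reported. */
		chk_ref_warnings++;
		fprintf(stderr, "chk_ref: put on zero count\n");
		return false;
	}
	return --r->count == 0;
}

static void checked_put_demo(void)
{
	struct chk_ref r = { .count = 1 };
	unsigned int before = chk_ref_warnings;

	assert(chk_ref_put(&r));	/* legitimate final put */
	assert(!chk_ref_put(&r));	/* extra put: warns, does not wrap */
	assert(r.count == 0);
	assert(chk_ref_warnings == before + 1);
}
```

A plain counter would wrap to UINT_MAX on the extra put and say
nothing; the checked version leaves the count at zero and produces
something a developer can act on.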

> So all I see is a huge amount of code churn to implement a buggy (by
> definition) refcounting API, that risks adding new bugs and only truly
> helps with bugs that are unlikely in the first place.

Given that the conversions alone have been uncovering refcount bugs
and that the implementation isn't "buggy" (it provides a specific set
of protections), I strongly disagree with your assessment.

> I really don't think this is an obvious slam dunk.

It entirely blocks a commonly exploitable flaw in the kernel. This
isn't a probabilistic mitigation, either. While I'm not sure I'd ever
describe a security protection as a slam dunk, I think this is up
there. :)

-Kees

[1] When I say "common", I'm speaking from the perspective of security
flaw frequency. The kernel sees about 1-2 high severity security flaws
a year (with an average lifetime of 5 years), and the
refcount-overflow use-after-free class of flaw is normally reliable
for attackers (and I'd classify it as high severity). With 2016 seeing
two known separate refcount-overflow use-after-free flaws, this could
be better described as an epidemic, but I'll try to be less
inflammatory and just say "common".

--
Kees Cook
Pixel Security