RE: [PATCH 1/3] ipc: convert ipc_namespace.count from atomic_t to refcount_t

From: Reshetova, Elena
Date: Mon Jul 10 2017 - 08:11:46 EST


> "Reshetova, Elena" <elena.reshetova@xxxxxxxxx> writes:
>
> >> "Reshetova, Elena" <elena.reshetova@xxxxxxxxx> writes:
> >>
> >> 2>> Elena Reshetova <elena.reshetova@xxxxxxxxx> writes:
> >> >>
> >> >> > refcount_t type and corresponding API should be
> >> >> > used instead of atomic_t when the variable is used as
> >> >> > a reference counter. This allows to avoid accidental
> >> >> > refcounter overflows that might lead to use-after-free
> >> >> > situations.
> >> >>
> >> >> In this patch you can see all of the uses of the count.
> >> >> What accidental refcount overflows are possible?
> >> >
> >> > Even if one can guarantee and prove that in the current implementation
> >> > there are no overflows possible, we can't say that for
> >> > sure for any future implementation. Bugs might always happen
> >> > unfortunately, but if we convert the refcounter to a safer
> >> > type we can be sure that overflows are not possible.
> >> >
> >> > Does it make sense to you?
> >>
> >> Not for code that is likely to remain unchanged for a decade no.
> >
> > Can we really be sure for any kernel code about this? And does it make
> > sense to trust our security on a fact like this?
>
> But refcount_t doesn't fix anything. At best it changes a bad bug to a
> less bad bug. So now my machine OOMS instead of allows a memory
> overwrite. It still doesn't work.

Well, it is a step forward from security standpoint. OOMS is really hard
to exploit vs. memory overwrites. Pretty much all exploits need either
memory write or memory read, out of memory is really much harder to
exploit.

>
> Plus refcount_t does not provide any safety on the architectures where
> it is a noop.

Not sure I understood this. What do you mean by "noop"?
refcount_t is currently architecture independent.


>
> >> This looks like a large set of unautomated changes without any real
> >> thought put into it.
> >
> > We are soon into the end of the first year that we started to look into
> > refcounter overflow/underflow problem and coming up this far was
> > not easy enough (just check all the millions of emails on kernel-hardening
> > mailing list). Each refcount_t conversion candidate was first found by Coccinelle
> > analysis and then manually checked and converted. The story of
> > refcount_t API and all discussions go even further.
> > So you can't really claim that there is no " thought put into it " :)
>
> But the conversion of the instance happens without thought and manually.
> Which is a good recipe for typos. Which is what I am saying.
>
> There have been lots of conversions like that in the kernel and
> practically every one has introduced at least one typo.

What do you exactly mean by "typo"? Typos should be detected at these stages:

1) typos like wrong function name etc. can be found at compile time
(trust me I have found a number of these on the very first iteration with patches)
2) "typos" (not sure if it is correct to call them typos) like usage of wrong refcount_t
API vs. original atomic_t API can be found during internal reviews or reviews by maintainers
3) much bigger problem is actually not any typos, but hidden issues that show up only
in run-time that detect underflows/overflows or inability to increment from zero.
These only are nasty, but given that refcount_t WARNs left and right about them,
we can detect them fast.

I don't know what is a better recipe for doing API changes like this?
Do you have any suggestions?

>
> So from an engineering standpoint it is a very valid question to ask
> about. And I find the apparent insistence that you don't make typos
> very disturbing.
>
> > That almost always results in a typo somewhere
> >> that breaks things.
> >>
> >> So there is no benefit to the code, and a non-zero chance that there
> >> will be a typo breaking the code.
> >
> > The code is very active on issuing WARNs when anything goes wrong.
> > Using this feature we have not only found errors in conversions, but
> > sometimes errors in code itself. So, any bug would be actually much
> > faster visible than using old atomic_t interface.
> >
> > In addition by default refcount_t equals to atomic, which also gives a
> > possibility to make a softer transition and catch all related bugs in couple
> > of cycles when enabling CONFIG_REFCOUNT_FULL.
>
> But if you make a typo and change one operation for another I don't see
> how any of that applies.

It is hard to make a "typo" to change one operation to another. It is not a
one-two char mismatch error. When doing these patches we followed the
logic of "less code changes - better" (since less chances of making mistake),
so if in some cases functions are changed (like from atomic_sub to
refcount_sub_and_test(), or from atomic_inc_not_zero() to atomic_inc() etc.)
there was a reason for making it and the change wasn't automatic and without
thinking at all. Again, we do have our maintainers also to catch if a change that
we did doesn't actually work for them right?

Best Regards,
Elena.
>
> And that is what it looks like I we are looking at here.
>
> Eric