Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL

From: Al Viro
Date: Mon Aug 27 2018 - 20:03:36 EST

On Mon, Aug 27, 2018 at 02:31:41PM -0700, Cong Wang wrote:
> > I cant think of any challenges. Cong/Jiri? Would it require development
> > time classifiers/actions/qdiscs to sit in that directory (I suspect you
> > dont want them in include/net).
> > BTW, the idea of improving grep-ability of the code by prefixing the
> > ops appropriately makes sense. i.e we should have ops->cls_init,
> > ops->act_init etc.
> Hmm? Isn't struct tcf_proto_ops used and must be provided
> by each tc filter module? How does it work if you move it into
> net/sched/* for out-of-tree modules? Are they supposed to
> include "..../net/sched/tcf_proto.h"?? Or something else?

If you care about out-of-tree modules, that could easily live in
include/net/tcf_proto.h, provided that it's not pulled by indirect
includes into hell knows how many places. Try
	make allmodconfig
	make >/dev/null 2>&1
	find -name '.*.cmd'|xargs grep sch_generic.h

That finds 2977 files here, most of them having nothing to do with

> BTW, we need some grep tool that really understands C syntax,
> not making each variable friendly to plain grep.

This isn't a matter of C syntax; it needs to handle C typing,
and you really can't do that anywhere near reliably without looking
at preprocessor output. Which very much depends upon .config...

BTW, something odd in cls_u32.c: what happens if we have the following
	tcf_proto <tp>, its ->data being <c0> and ->root being <ht0>
	tc_u_common <c0>, with <ht1> followed by <ht0> on its ->hlist
	<ht1>, with <knode> in its ->ht[0]
and set ->ht_down in <knode> to <ht0>? AFAICS,
there's nothing to prevent that - TCA_U32_LINK being
0x80000000 will do just that. What happens upon u32_destroy()
in that case? Unless I'm misreading that code, refcounts will be
	<c0>: 1
	<ht0>: 2
	<ht1>: 1
and in u32_destroy() we'll get this:
	root_ht = <ht0>
	tp_c = <c0>
	if (root_ht && --root_ht->refcnt == 0)
		u32_destroy_hnode(tp, root_ht, extack);
decrements refcnt to 1 and does nothing else.
	if (--tp_c->refcnt == 0) {
is satisfied
	<c0> unhashed
		while ((ht = rtnl_dereference(tp_c->hlist)) != NULL) {
we take ht = <ht1>
			u32_clear_hnode(tp, ht, extack);
which does
	for (h = 0; h <= ht->divisor; h++) {
		while ((n = rtnl_dereference(ht->ht[h])) != NULL) {
n = <knode>
	remove <knode> from <ht1>->ht[0]
			tcf_unbind_filter(tp, &n->res);
			u32_remove_hw_knode(tp, n, extack);
			idr_remove(&ht->handle_idr, n->handle);
			if (tcf_exts_get_net(&n->exts))
				tcf_queue_work(&n->rwork, u32_delete_key_freepf_work);
			else
				u32_destroy_key(n->tp, n, true);
... and we hit u32_destroy_key(<tp>, <knode>, true), which does
	struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
ht = <ht0>
	if (ht && --ht->refcnt == 0)
*NOW* <ht0>->refcnt is 0, and we free the damn thing.
<knode> is freed and we return to u32_clear_hnode() where we
see that there's nothing else left in <ht1>->ht[...] and return
to u32_destroy(). Where
			RCU_INIT_POINTER(tp_c->hlist, ht->next);
sets <c0>->hlist to <ht1>->next, aka <ht0>. Which is already freed.

			/* u32_destroy_key() will later free ht for us, if it's
			 * still referenced by some knode
			 */
			if (--ht->refcnt == 0)
				kfree_rcu(ht, rcu);
<ht1>->refcnt reaches 0 and we free it (RCU-delayed)
... and we go for the next iteration, this time with ht = <ht0>.
Doing all kinds of unsanitary things to the memory it used to occupy...
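
For illustration only, the walk above can be replayed in a userspace toy
model. Everything here is an assumption made for the sketch: the struct
layouts are cut down to the fields the walk touches, there is one bucket
per hnode, the model_* names are hypothetical, and a "freed" flag stands in
for kfree()/kfree_rcu() so that the state can be inspected afterwards. It
is not kernel code and says nothing about how the problem should be fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct knode;

/* Cut-down stand-in for tc_u_hnode. */
struct hnode {
	struct hnode *next;	/* link on the tc_u_common ->hlist */
	int refcnt;
	bool freed;		/* set where the kernel would free the hnode */
	struct knode *key;	/* single-bucket model of ->ht[0] */
};

/* Cut-down stand-in for tc_u_knode. */
struct knode {
	struct hnode *ht_down;
};

/* Model of u32_destroy_key(): drop the reference held via ->ht_down. */
static void model_destroy_key(struct knode *n)
{
	struct hnode *ht = n->ht_down;

	if (ht && --ht->refcnt == 0)
		ht->freed = true;
}

/* Model of u32_clear_hnode(): unlink and destroy the only key, if any. */
static void model_clear_hnode(struct hnode *ht)
{
	struct knode *n = ht->key;

	if (n) {
		ht->key = NULL;
		model_destroy_key(n);
	}
}

/*
 * Replays u32_destroy() with <ht1> then <ht0> on ->hlist and
 * <knode>->ht_down == <ht0>.  Returns true if the ->hlist loop
 * reaches an hnode that has already been marked freed.
 */
static bool model_destroy_hits_freed_hnode(void)
{
	struct hnode ht0 = { .next = NULL, .refcnt = 2 };
	struct knode kn = { .ht_down = &ht0 };
	struct hnode ht1 = { .next = &ht0, .refcnt = 1, .key = &kn };
	struct hnode *hlist = &ht1, *root_ht = &ht0, *ht;
	int c0_refcnt = 1;

	if (root_ht && --root_ht->refcnt == 0)
		root_ht->freed = true;		/* not taken: 2 -> 1 */

	if (--c0_refcnt == 0) {
		while ((ht = hlist) != NULL) {
			if (ht->freed)
				return true;	/* use-after-free in real life */
			model_clear_hnode(ht);
			hlist = ht->next;
			if (--ht->refcnt == 0)
				ht->freed = true;
		}
	}
	return false;
}
```

With the refcounts listed above (1, 2, 1), model_destroy_hits_freed_hnode()
returns true: the second pass of the loop picks up <ht0> after
model_destroy_key() has already dropped its last reference.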

Incidentally, if we hit
				tcf_queue_work(&n->rwork, u32_delete_key_freepf_work);
instead of u32_destroy_key(), things don't seem to be any better - we
won't do anything to <knode> until rtnl is dropped, so u32_destroy() won't
break on the second pass through the loop - it'll free <ht0> there and
return. Setting us up for trouble, since when u32_delete_key_freepf_work()
finally gets to u32_destroy_key() we'll have <knode>->ht_down pointing
to freed memory and decrementing its contents...
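
The deferred variant can be sketched the same way. Again a userspace toy
under stated assumptions: trimmed structs, a "freed" flag instead of real
freeing, a single pointer standing in for the workqueue, and a hypothetical
function name. It only replays the order of events described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct knode;

/* Cut-down stand-in for tc_u_hnode. */
struct hnode {
	struct hnode *next;	/* link on ->hlist */
	int refcnt;
	bool freed;		/* set where the kernel would free the hnode */
	struct knode *key;	/* single-bucket model of ->ht[0] */
};

/* Cut-down stand-in for tc_u_knode. */
struct knode {
	struct hnode *ht_down;
};

/*
 * Replays u32_destroy() when the knode goes through tcf_queue_work()
 * rather than u32_destroy_key().  Returns true if, by the time the
 * queued work would run, <knode>->ht_down points at a freed hnode.
 */
static bool model_deferred_key_touches_freed_hnode(void)
{
	struct hnode ht0 = { .next = NULL, .refcnt = 2 };
	struct knode kn = { .ht_down = &ht0 };
	struct hnode ht1 = { .next = &ht0, .refcnt = 1, .key = &kn };
	struct hnode *hlist = &ht1, *ht;
	struct knode *queued = NULL;	/* stand-in for the work queue */

	--ht0.refcnt;			/* root reference dropped: 2 -> 1 */

	while ((ht = hlist) != NULL) {
		if (ht->key) {		/* model of tcf_queue_work() */
			queued = ht->key;
			ht->key = NULL;
		}
		hlist = ht->next;
		if (--ht->refcnt == 0)
			ht->freed = true;
	}

	/* rtnl is dropped here; only now does the queued work get to run. */
	return queued != NULL && queued->ht_down->freed;
}
```

Here the loop itself never trips over freed memory; the second pass takes
<ht0> from 1 to 0 and the function returns true because the queued knode
is left holding a pointer to it.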

What am I missing in there? Is it just "we should never have ->ht_down
pointing to anyone's ->root"? If so, I'm not sure how to detect that;
if not... what should happen to the orphaned root_ht? Should it
remain on the list? We might have two tcf_proto sharing tp->data,
so tp_c and its list might very well survive the u32_destroy()...

Note, BTW, that if we do leave the orphan on the list and later
change the tc_u_knode so that ->ht_down doesn't point to that
thing anymore, we'll get its refcount incremented to 2 in
u32_init_knode(), then decremented to 1 by u32_set_parms() and
then arrange for u32_delete_key_work() to be run. Which will
drive the refcount to 0 and free the damn thing. While it's
still in the middle of ->hlist...
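
A toy replay of that last sequence, under the same kind of assumptions as
any userspace model of this code: trimmed structs, a "freed" flag in place
of kfree(), hypothetical helper names, and the three kernel functions
reduced to the one refcount operation each is described as doing above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-in for tc_u_hnode. */
struct hnode {
	struct hnode *next;	/* link on ->hlist */
	int refcnt;
	bool freed;		/* set where the kernel would free the hnode */
};

/* Cut-down stand-in for tc_u_knode. */
struct knode {
	struct hnode *ht_down;
};

static void get_hnode(struct hnode *ht)
{
	if (ht)
		ht->refcnt++;
}

static void put_hnode(struct hnode *ht)
{
	if (ht && --ht->refcnt == 0)
		ht->freed = true;
}

static bool on_list(const struct hnode *head, const struct hnode *ht)
{
	for (; head; head = head->next)
		if (head == ht)
			return true;
	return false;
}

/*
 * The orphaned former root stays on ->hlist with refcnt 1, held only
 * through <knode>->ht_down.  Returns true if changing the knode frees
 * the orphan while it is still linked on the list.
 */
static bool model_relink_frees_listed_hnode(void)
{
	struct hnode orphan = { .next = NULL, .refcnt = 1 };
	struct hnode *hlist = &orphan;
	struct knode old = { .ht_down = &orphan };
	struct knode copy;

	/* as in u32_init_knode(): the copy shares ->ht_down, 1 -> 2 */
	copy = old;
	get_hnode(copy.ht_down);

	/* as in u32_set_parms(): the copy stops pointing there, 2 -> 1 */
	put_hnode(copy.ht_down);
	copy.ht_down = NULL;

	/* as in u32_delete_key_work() on the old knode: 1 -> 0 */
	put_hnode(old.ht_down);

	return orphan.freed && on_list(hlist, &orphan);
}
```

model_relink_frees_listed_hnode() returns true: nothing in the sequence
ever unlinks the orphan, so the final put frees an hnode that ->hlist
still reaches.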