Re: [PATCH 1/2] dccp: ccid: move timers to struct dccp_sock

From: Thadeu Lima de Souza Cascardo
Date: Mon Nov 09 2020 - 16:31:49 EST


On Mon, Nov 09, 2020 at 01:15:54PM -0800, Jakub Kicinski wrote:
> On Mon, 9 Nov 2020 18:09:09 -0300 Thadeu Lima de Souza Cascardo wrote:
> > On Mon, Nov 09, 2020 at 09:49:38AM -0800, Jakub Kicinski wrote:
> > > On Mon, 9 Nov 2020 08:48:28 -0300 Thadeu Lima de Souza Cascardo wrote:
> > > > On Fri, Oct 16, 2020 at 03:30:16PM -0700, Jakub Kicinski wrote:
> > > > > On Tue, 13 Oct 2020 19:18:48 +0200 Kleber Sacilotto de Souza wrote:
> > > > > > From: Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxxxxx>
> > > > > >
> > > > > > When dccps_hc_tx_ccid is freed, ccid timers may still trigger. The reason
> > > > > > del_timer_sync can't be used is because this relies on keeping a reference
> > > > > > to struct sock. But as we keep a pointer to dccps_hc_tx_ccid and free that
> > > > > > during disconnect, the timer should really belong to struct dccp_sock.
> > > > > >
> > > > > > This addresses CVE-2020-16119.
> > > > > >
> > > > > > Fixes: 839a6094140a (net: dccp: Convert timers to use timer_setup())
> > > > > > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxxxxx>
> > > > > > Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@xxxxxxxxxxxxx>
> > > > >
> > > > > I've been mulling over this fix.
> > > > >
> > > > > The layering violation really doesn't sit well.
> > > > >
> > > > > We're reusing the timer object. What if we are really unlucky, the
> > > > > fires and gets blocked by a cosmic ray just as it's about to try to
> > > > > lock the socket, then user manages to reconnect, and timer starts
> > > > > again. Potentially with a different CCID algo altogether?
> > > > >
> > > > > Is disconnect ever called under the BH lock? Maybe plumb a bool
> > > > > argument through to ccid*_hc_tx_exit() and do a sk_stop_timer_sync()
> > > > > when called from disconnect()?
> > > > >
> > > > > Or do refcounting on ccid_priv so that the timer holds both the socket
> > > > > and the priv?
> > > >
> > > > Sorry about too late a response. I was on vacation, then came back and spent a
> > > > couple of days testing this further, and had to switch to other tasks.
> > > >
> > > > So, while testing this, I had to resort to tricks like having a very small
> > > > expire and enqueuing on a different CPU. Then, after some minutes, I hit a UAF.
> > > > That's with or without the first of the second patch.
> > > >
> > > > I also tried to refcount ccid instead of the socket, keeping the timer on the
> > > > ccid, but that still hit the UAF, and that's when I had to switch tasks.
> > >
> > > Hm, not instead, as well. I think trying cancel the timer _sync from
> > > the disconnect path would be the simplest solution, tho.
> >
> > I don't think so. On other paths, we would still have the possibility that:
> >
> > CPU1: timer expires and is about to run
> > CPU2: calls stop_timer (which does not stop anything) and frees ccid
> > CPU1: timer runs and uses freed ccid
> >
> > And those paths, IIUC, may be run under a SoftIRQ on the receive path, so would
> > not be able to call stop_timer_sync.
>
> Which paths are those (my memory of this code is waning)? I thought
> disconnect is only called from the user space side (shutdown syscall).
> The only other way to terminate the connection is to close the socket,
> which Eric already fixed by postponing the destruction of ccid in that
> case.

dccp_v4_do_rcv -> dccp_rcv_established -> dccp_parse_options ->
dccp_feat_parse_options -> dccp_feat_handle_nn_established ->
dccp_feat_activate -> __dccp_feat_activate -> dccp_hdlr_ccid ->
ccid_hc_tx_delete

>
> > > > Oh, and in the meantime, I found one or two other fixes that we
> > > > should apply, will send them shortly.
> > > >
> > > > But I would argue that we should apply the revert as it addresses the
> > > > CVE, without really regressing the other UAF, as I argued. Does that
> > > > make sense?
> > >
> > > We can - it's always a little strange to go from one bug to a different
> > > without a fix - but the justification being that while the previous UAF
> > > required a race condition the new one is actually worst because it can
> > > be triggered reliably?
> >
> > Well, I am arguing here that commit 2677d20677314101293e6da0094ede7b5526d2b1
> > ("dccp: don't free ccid2_hc_tx_sock struct in dccp_disconnect()") doesn't
> > really fix anything. Whenever ccid_hx_tx_delete is called, that UAF might
> > happen, because the timer might trigger right after we free the ccid struct.
> >
> > And, yes, on the other hand, we can reliably launch the DoS attack that is
> > fixed by the revert of that commit.
>
> OK.
>