Re: [PATCH V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.

From: Paul E. McKenney
Date: Mon Apr 03 2023 - 18:58:35 EST


On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote:
> In kfree_rcu_monitor(), new free business at krcp is attached to any free
> channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
> business is handled after the rcu grace period. But if there is any none-free
> channel at krwp already, that means there is an on-going rcu work,
> which will cause the kvfree_call_rcu()-triggered free business is done
> before the wanted rcu grace period ends.
>
> This commit ignore krwp which has non-free channel at kfree_rcu_monitor(),
> to fix the issue that kvfree_call_rcu() loses effectiveness.
>
> Below is the css_set obj "from_cset" use-after-free case caused by
> kvfree_call_rcu() losing effectiveness.
> CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes,
> the task is schedule out.
> CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp.
> But "from_cset" is freed right after current gp end. "from_cset" is reallocated.
> CPU 0 's task arrives back, references "from_cset"'s member, which causes crash.
>
> CPU 0 CPU 1
> count_memcg_event_mm()
> |rcu_read_lock() <---
> |mem_cgroup_from_task()
> |// css_set_ptr is the "from_cset" mentioned on CPU 1
> |css_set_ptr = rcu_dereference((task)->cgroups)
> |// Hard irq comes, current task is scheduled out.
>
> cgroup_attach_task()
> |cgroup_migrate()
> |cgroup_migrate_execute()
> |css_set_move_task(task, from_cset, to_cset, true)
> |cgroup_move_task(task, to_cset)
> |rcu_assign_pointer(.., to_cset)
> |...
> |cgroup_migrate_finish()
> |put_css_set_locked(from_cset)
> |from_cset->refcount return 0
> |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp
> |add_ptr_to_bulk_krc_lock()
> |schedule_delayed_work(&krcp->monitor_work, ..)
>
> kfree_rcu_monitor()
> |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
> |queue_rcu_work(system_wq, &krwp->rcu_work)
> |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
> |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp
>
> // There is a perious call_rcu(.., rcu_work_rcufn)
> // gp end, rcu_work_rcufn() is called.
> rcu_work_rcufn()
> |__queue_work(.., rwork->wq, &rwork->work);
>
> |kfree_rcu_work()
> |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
> |The "from_cset" is freed before new gp end.
>
> // the task is scheduled in after many ms.
> |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.
>
> v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
>
> Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> Signed-off-by: Ziwei Dai <ziwei.dai@xxxxxxxxxx>

Good catch, thank you!!!

How difficult was this to trigger? If it can be triggered easily,
this of course needs to go into mainline sooner rather than later.

Longer term, would it make sense to run the three channels through RCU
separately, in order to avoid one channel refraining from starting a grace
period just because some other channel has callbacks waiting for a grace
period to complete? One argument against might be energy efficiency, but
perhaps the ->gp_snap field could be used to get the best of both worlds.

Either way, this fixes only one bug of two. The second bug is in the
kfree_rcu() tests, which should have caught this bug. Thoughts on a
good fix for those tests?

I have applied Uladzislau's and Mukesh's tags, and done the usual
wordsmithing as shown at the end of this message. Please let me know
if I messed anything up.

> ---
> kernel/rcu/tree.c | 27 +++++++++++++++++++--------
> 1 file changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8e880c0..7b95ee9 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work)
> return !!READ_ONCE(krcp->head);
> }
>
> +static bool
> +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
> +{
> + int i;
> +
> + for (i = 0; i < FREE_N_CHANNELS; i++)
> + if (!list_empty(&krwp->bulk_head_free[i]))
> + return true;
> +
> + return !!krwp->head_free;

This is fixed from v1, good!

> +}
> +
> static int krc_count(struct kfree_rcu_cpu *krcp)
> {
> int sum = atomic_read(&krcp->head_count);
> @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
> for (i = 0; i < KFREE_N_BATCHES; i++) {
> struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
>
> - // Try to detach bulk_head or head and attach it over any
> - // available corresponding free channel. It can be that
> - // a previous RCU batch is in progress, it means that
> - // immediately to queue another one is not possible so
> - // in that case the monitor work is rearmed.
> - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
> - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
> - (READ_ONCE(krcp->head) && !krwp->head_free)) {
> + // Try to detach bulk_head or head and attach it, only when
> + // all channels are free. Any channel is not free means at krwp
> + // there is on-going rcu work to handle krwp's free business.
> + if (need_wait_for_krwp_work(krwp))
> + continue;
>
> + // kvfree_rcu_drain_ready() might handle this krcp, if so give up.
> + if (need_offload_krc(krcp)) {
> // Channel 1 corresponds to the SLAB-pointer bulk path.
> // Channel 2 corresponds to vmalloc-pointer bulk path.
> for (j = 0; j < FREE_N_CHANNELS; j++) {
> --
> 1.9.1

------------------------------------------------------------------------

commit e222f9a512539c3f4093a55d16624d9da614800b
Author: Ziwei Dai <ziwei.dai@xxxxxxxxxx>
Date: Fri Mar 31 20:42:09 2023 +0800

rcu: Avoid freeing new kfree_rcu() memory after old grace period

Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.

On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.

Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.

To see this, consider the following sequence of events, in which:

o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.

o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is freed
right after the previous grace period ended, so that "from_cset"
is immediately freed. Task A resumes and references "from_cset"'s
member, after which nothing good happens.

In full detail:

CPU 0 CPU 1
---------------------- ----------------------
count_memcg_event_mm()
|rcu_read_lock() <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.

cgroup_attach_task()
|cgroup_migrate()
|cgroup_migrate_execute()
|css_set_move_task(task, from_cset, to_cset, true)
|cgroup_move_task(task, to_cset)
|rcu_assign_pointer(.., to_cset)
|...
|cgroup_migrate_finish()
|put_css_set_locked(from_cset)
|from_cset->refcount return 0
|kfree_rcu(cset, rcu_head) // free from_cset after new gp
|add_ptr_to_bulk_krc_lock()
|schedule_delayed_work(&krcp->monitor_work, ..)

kfree_rcu_monitor()
|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
|queue_rcu_work(system_wq, &krwp->rcu_work)
|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp

// There is a perious call_rcu(.., rcu_work_rcufn)
// gp end, rcu_work_rcufn() is called.
rcu_work_rcufn()
|__queue_work(.., rwork->wq, &rwork->work);

|kfree_rcu_work()
|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
|The "from_cset" is freed before new gp end.

// the task resumes some time later.
|css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.

This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.

v2: Use helper function instead of inserted code block at kfree_rcu_monitor().

Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
Reported-by: Mukesh Ojha <quic_mojha@xxxxxxxxxxx>
Signed-off-by: Ziwei Dai <ziwei.dai@xxxxxxxxxx>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 859ee02f6614..e2dbea6cee4b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
return !!READ_ONCE(krcp->head);
}

+static bool
+need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
+{
+ int i;
+
+ for (i = 0; i < FREE_N_CHANNELS; i++)
+ if (!list_empty(&krwp->bulk_head_free[i]))
+ return true;
+
+ return !!krwp->head_free;
+}
+
static int krc_count(struct kfree_rcu_cpu *krcp)
{
int sum = atomic_read(&krcp->head_count);
@@ -3134,15 +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
for (i = 0; i < KFREE_N_BATCHES; i++) {
struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);

- // Try to detach bulk_head or head and attach it over any
- // available corresponding free channel. It can be that
- // a previous RCU batch is in progress, it means that
- // immediately to queue another one is not possible so
- // in that case the monitor work is rearmed.
- if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
- (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
- (READ_ONCE(krcp->head) && !krwp->head_free)) {
+ // Try to detach bulk_head or head and attach it, only when
+ // all channels are free. Any channel is not free means at krwp
+ // there is on-going rcu work to handle krwp's free business.
+ if (need_wait_for_krwp_work(krwp))
+ continue;

+ // kvfree_rcu_drain_ready() might handle this krcp, if so give up.
+ if (need_offload_krc(krcp)) {
// Channel 1 corresponds to the SLAB-pointer bulk path.
// Channel 2 corresponds to vmalloc-pointer bulk path.
for (j = 0; j < FREE_N_CHANNELS; j++) {