Re: [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.

From: 代子为 (Ziwei Dai)
Date: Thu Mar 30 2023 - 05:48:08 EST


Hi Uladzislau and all,

Sorry for the disclaimer in the original mail.
Please comment in this new thread.

We found this issue on k5.15 and are trying to fix it there.
Mainline appears to have the same issue.

Below is the first debug patch, applied on a k5.15 device. It is under stress test, and the issue has not reproduced so far.
============================================================
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 66951e130c2fc..44759641f7234 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3342,15 +3342,21 @@ static void kfree_rcu_monitor(struct work_struct *work)
// Attempt to start a new batch.
for (i = 0; i < KFREE_N_BATCHES; i++) {
struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
+ bool rcu_work_pending;

// Try to detach bkvhead or head and attach it over any
// available corresponding free channel. It can be that
// a previous RCU batch is in progress, it means that
// immediately to queue another one is not possible so
// in that case the monitor work is rearmed.
- if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
- (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
- (krcp->head && !krwp->head_free)) {
+ rcu_work_pending = test_bit(
+ WORK_STRUCT_PENDING_BIT,
+ work_data_bits(&krwp->rcu_work.work));
+ // If there is on-going rcu work, continue.
+ if (rcu_work_pending || krwp->bkvhead_free[0] ||
+ krwp->bkvhead_free[1] || krwp->head_free)
+ continue;
+ if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
// Channel 1 corresponds to the SLAB-pointer bulk path.
// Channel 2 corresponds to vmalloc-pointer bulk path.
for (j = 0; j < FREE_N_CHANNELS; j++) {

Since the "rcu_work_pending" check seems redundant, I made the second patch below on k5.15. We will stress-test it as well.
============================================================
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 66951e130c2fc..f219c60a8ec30 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3343,14 +3343,13 @@ static void kfree_rcu_monitor(struct work_struct *work)
for (i = 0; i < KFREE_N_BATCHES; i++) {
struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);

- // Try to detach bkvhead or head and attach it over any
- // available corresponding free channel. It can be that
- // a previous RCU batch is in progress, it means that
- // immediately to queue another one is not possible so
- // in that case the monitor work is rearmed.
- if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
- (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
- (krcp->head && !krwp->head_free)) {
+ // Try to detach bkvhead or head and attach it only when all
+ // channels are free. A non-free channel means that krwp still
+ // has on-going rcu work handling its free business.
+ if (krwp->bkvhead_free[0] || krwp->bkvhead_free[1] ||
+ krwp->head_free)
+ continue;
+ if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
// Channel 1 corresponds to the SLAB-pointer bulk path.
// Channel 2 corresponds to vmalloc-pointer bulk path.
for (j = 0; j < FREE_N_CHANNELS; j++) {


> -----Original Message-----
> From: 代子为 (Ziwei Dai) <Ziwei.Dai@xxxxxxxxxx>
> Sent: March 30, 2023 17:27
> To: paulmck@xxxxxxxxxx; frederic@xxxxxxxxxx;
> quic_neeraju@xxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx; rostedt@xxxxxxxxxxx;
> mathieu.desnoyers@xxxxxxxxxxxx; jiangshanlai@xxxxxxxxx;
> joel@xxxxxxxxxxxxxxxxx; rcu@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx; 王双 (Shuang Wang)
> <shuang.wang@xxxxxxxxxx>; 辛依凡 (Yifan Xin) <Yifan.Xin@xxxxxxxxxx>;
> 王科 (Ke Wang) <Ke.Wang@xxxxxxxxxx>; 闫学文 (Xuewen Yan)
> <Xuewen.Yan@xxxxxxxxxx>; 牛志国 (Zhiguo Niu) <Zhiguo.Niu@xxxxxxxxxx>;
> 代子为 (Ziwei Dai) <Ziwei.Dai@xxxxxxxxxx>; 黄朝阳 (Zhaoyang Huang)
> <zhaoyang.huang@xxxxxxxxxx>
> Subject: [PATCH] rcu: Make sure new krcp free business is handled after the
> wanted rcu grace period.
>
> From: 代子为 (Ziwei Dai) <ziwei.dai@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
>
> In kfree_rcu_monitor(), new free business at krcp is attached to any free
> channel at krwp. kfree_rcu_monitor() is responsible for making sure that new
> free business is handled only after the rcu grace period. But if krwp already
> has a non-free channel, there is on-going rcu work, and the free business
> triggered by kvfree_call_rcu() gets done before the wanted rcu grace period
> ends.
>
> This commit makes kfree_rcu_monitor() skip any krwp that has a non-free
> channel, to fix the issue that kvfree_call_rcu() loses effectiveness.
>
> Below is the css_set obj "from_cset" use-after-free issue caused by
> kvfree_call_rcu() losing effectiveness.
> Core 0 calls rcu_read_lock(), then uses "from_cset", then a hard irq comes.
> Core 1 calls kfree_rcu(cset, rcu_head), intending to free "from_cset" after a
> new gp.
> Core 2 frees "from_cset" after the current gp ends. "from_cset" is reallocated.
> Core 0 references a member of "from_cset", which causes the crash.
>
> Core 0:
> count_memcg_event_mm()
> |rcu_read_lock() <---
> |mem_cgroup_from_task()
> |// <css_set ptr> is the "from_cset" mentioned on core 1
> |<css_set ptr> = rcu_dereference((task)->cgroups)
> |// Hard irq comes, current task is scheduled out.
>
> Core 1:
> cgroup_attach_task()
> |cgroup_migrate()
> |cgroup_migrate_execute()
> |css_set_move_task(task, from_cset, to_cset, true)
> |cgroup_move_task(task, to_cset)
> |rcu_assign_pointer(.., to_cset)
> |...
> |cgroup_migrate_finish()
> |put_css_set_locked(from_cset)
> |from_cset->refcount drops to 0
> |kfree_rcu(cset, rcu_head) <--- means to free from_cset after new gp
> |add_ptr_to_bulk_krc_lock()
> |schedule_delayed_work(&krcp->monitor_work, ..)
>
> kfree_rcu_monitor()
> |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
> |queue_rcu_work(system_wq, &krwp->rcu_work)
> |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
> |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp
>
> Core 2:
> // There is a previous call_rcu(.., rcu_work_rcufn)
> // gp end, rcu_work_rcufn() is called.
> rcu_work_rcufn()
> |__queue_work(.., rwork->wq, &rwork->work);
> // or there is a pending kfree_rcu_work() work called.
> |kfree_rcu_work()
> |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
> |The "from_cset" mentioned on core 1 is freed before new gp end.
>
> Core 0:
> // the task is scheduled in after many ms.
> |<css_set ptr>->subsys[(subsys_id)] <--- causes kernel crash, because
> <css_set ptr>="from_cset" is freed.
>
> Signed-off-by: Ziwei Dai <ziwei.dai@xxxxxxxxxx>
>
> :# modified: tree.c
> ---
> kernel/rcu/tree.c | 19 ++++++++++---------
> 1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8e880c0..f6451a8 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct work_struct *work)
> for (i = 0; i < KFREE_N_BATCHES; i++) {
> struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
>
> - // Try to detach bulk_head or head and attach it over any
> - // available corresponding free channel. It can be that
> - // a previous RCU batch is in progress, it means that
> - // immediately to queue another one is not possible so
> - // in that case the monitor work is rearmed.
> - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
> - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
> - (READ_ONCE(krcp->head) && !krwp->head_free)) {
> -
> + // Try to detach bulk_head or head and attach it only when all
> + // channels are free. A non-free channel means that krwp still
> + // has on-going rcu work handling its free business.
> + if (!list_empty(&krwp->bulk_head_free[0]) ||
> + !list_empty(&krwp->bulk_head_free[1]) ||
> + krwp->head_free)
> + continue;
> + if (!list_empty(&krcp->bulk_head[0]) ||
> + !list_empty(&krcp->bulk_head[1]) ||
> + READ_ONCE(krcp->head)) {
> // Channel 1 corresponds to the SLAB-pointer bulk path.
> // Channel 2 corresponds to vmalloc-pointer bulk path.
> for (j = 0; j < FREE_N_CHANNELS; j++) {
> --
> 1.9.1