Re: [External] Re: [PATCH v2 2/2] memcg: Don't trigger hung task warnings when memcg is releasing resources.

From: Julian Sun

Date: Wed Sep 24 2025 - 06:37:10 EST

On 9/24/25 4:28 PM, Peter Zijlstra wrote:

Hi,

On Wed, Sep 24, 2025 at 03:50:41PM +0800, Julian Sun wrote:

On 9/24/25 2:32 PM, Peter Zijlstra wrote:

On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote:

Hung task warning in mem_cgroup_css_free() is undesirable and
unnecessary since the behavior of waiting for a long time is
expected.

Use touch_hung_task_detector() to eliminate the possible
hung task warning.

Signed-off-by: Julian Sun <sunjunchao@xxxxxxxxxxxxx>

Still hate this. It is not tied to progress. If progress really stalls,
no warning will be given.

Hi, peter

Thanks for your review and comments.

I did take a look at your solution provided yesterday, and get your point.
However AFAICS it can't resolve the unexpected warnings here. Because it
only works after we reach the finish_writeback_work(), and the key point
here is, it *already* takes a long time before we reach
finish_writeback_work(), and there is true progress before finish the
writeback work that hung task detector still can not know.

But wb_split_bdi_pages() should already split things into smaller chunks
if there is low bandwidth, right? And we call finish_writeback_work()
for each chunk.

AFAICS, wb_split_bdi_pages() will only be invoked in the sync scenarios, and not in the background writeback scenarios - which is exactly the case here.

And I noticed that there's something similar in background writeback, where writeback_chunk_size() will split all pages into several chunks, the min chunk size is 1024(MIN_WRITEBACK_PAGES) pages. The difference from wb_split_bdi_pages() is that we don't report progress after finishing the writeback of a chunk.

If a chunk is still taking too long, surely the solution is to use
smaller chunks?

Yeah it still takes a long time, I checked the write_bandwidth and avg_write_bandwidth when warning was triggered:

>>> wb.write_bandwidth
(unsigned long)24
>>> wb.avg_write_bandwidth
(unsigned long)24
>>> wb.write_bandwidth
(unsigned long)13
>>> wb.write_bandwidth
(unsigned long)13

At this bandwidth, it will still takes a lot of seconds to write back MIN_WRITEBACK_PAGES pages.

So it might be a solution, but given the fact that the current minimum chunk size (1024) has been in place for over ten years, and that making it smaller would probably have a negative impact on performance. I'm afraid the filesystem maintainers will not accept this change.
If we don’t modify this part but can report progress after finishing the chunk writeback, it should probably eliminate most of the unexpected warnings.

If we want to make the hung task detector to known the progress of writeback
work, we need to add some code within do_writepages(): after each finish of
a_ops->writepages(), we need to make detector to known there's progress.
Something like this:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3e248d1c3969..49572a83c47b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping,
struct writeback_control *wbc)
else
/* deal with chardevs and other special files */
ret = 0;
+ /* Make hung task detector to known there's progress. */
+ if (force_wake)
+ wake_up_all(waitq);
+
if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL)
break;

which has a big impact on current code - I don't want to introduce this.

You sure? It looks to me like the next level down is wb_writeback() and
writeback_sb_inodes(), and those already have time based breaks in and
still have access to wb_writeback_work::done, while do_writepages() no
longer has that context.

Yeah, exactly. What I mean is report progress within the whole writeback work, either writeback_sb_inodes() or do_writepages() is ok.

Yes, the behavior in this patch does have the possibility to paper cover the
real warnings, and what I want to argue is that the essence of this patch is
the same as the current touch_nmi_watchdog() and touch_softlockup_watchdog()
- these functions are used only in specific scenarios we known and only
affect a single event. And there seems no report that
touch_nmi/softlockup_watchdog() will paper cover the real warnings (do we?).

Correct me if there's anything I'm missing or misunderstanding.

The thing with touch_nmi_watchdog() is that you need to keep doing it.
The moment you stop calling touch_nmi_watchdog(), you will cause it to
fire.

That is very much in line with the thing I proposed, and rather unlike
your proposal that blanket kill reporting for the task, irrespective of
how long it sits there waiting.

Thanks for clarification. So how about the following solution?

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a07b8cf73ae2..e0698fd3f9ab 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -14,6 +14,7 @@
* Additions for address_space-based writeback
*/

+#include <linux/sched/sysctl.h>
#include <linux/kernel.h>
#include <linux/export.h>
#include <linux/spinlock.h>
@@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
void wb_wait_for_completion(struct wb_completion *done)
{
atomic_dec(&done->cnt); /* put down the initial count */
- wait_event(*done->waitq, !atomic_read(&done->cnt));
+ wait_event(*done->waitq, (done->stamp = jiffies; !atomic_read(&done->cnt)));
}

#ifdef CONFIG_CGROUP_WRITEBACK
@@ -1975,6 +1976,11 @@ static long writeback_sb_inodes(struct super_block *sb,
*/
__writeback_single_inode(inode, &wbc);

+ /* Report progress to make hung task detector know it. */
+ if (jiffies - work->done->stamp >
+ HZ * sysctl_hung_task_timeout_secs / 2)
+ wake_up_all(work->done->waitq);
+
wbc_detach_inode(&wbc);
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped;

Instead of waking up all waiting threads every half second, we only wake them up if the writeback work lasts for the value of sysctl_hung_task_timeout_secs / 2 seconds to reduce possible overhead.

Hi, Jan, Christian, how do you think about it?

Please correct me if there's anything I'm missing or misunderstanding.

Thanks,
--
Julian Sun <sunjunchao@xxxxxxxxxxxxx>