Re: kernel BUG at mm/page-writeback.c:LINE!

From: Hugh Dickins
Date: Mon Jan 04 2021 - 22:30:03 EST


On Mon, 4 Jan 2021, Linus Torvalds wrote:
> On Mon, Jan 4, 2021 at 12:41 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > Linus, how confident are you in those wait_on_page_bit_common()
> > changes?
>
> Pretty confident. The atomicity of the bitops themselves is fairly simple.
>
> But in the writeback bit? No. The old code would basically _loop_ if
> it was woken up and the writeback bit was set again, and would hide
> any problems with it.
>
> The new code basically goes "ok, the writeback bit was clear at one
> point, so I've waited enough".
>
> We could easily turn the "if ()" in wait_on_page_writeback() into a "while()".
>
> But honestly, it does smell to me like the bug is always in the caller
> not having serialized with whatever actually starts writeback. Hugh
> figured out one such case.
>
> This code holds the page lock, but I don't see where
> set_page_writeback() would always be done with the page lock held. So
> what really protects against PG_writeback simply being set again?
>
> The whole BUG_ON() seems entirely buggy to me.
>
> In fact, even if you hold the page lock while doing
> set_page_writeback(), since the actual IO does *NOT* hold the page
> lock, the unlock happens without it. So even if every single case of
> setting the page writeback were to hold the page lock,

I did an audit when this came up before, and though not 100% confident
in my diligence, it certainly looked as if set_page_writeback() is
always done with the page lock held; and others looked too (IIRC
Matthew had a patch to add a WARN_ON_ONCE or whatever, but that didn't
go upstream).
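
For illustration, such a check would amount to a single assertion at
the top of __test_set_page_writeback() - that's just my sketch of the
idea, not Matthew's actual patch:

	/* Sketch: writeback should only ever be started on a locked page */
	WARN_ON_ONCE(!PageLocked(page));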

> what keeps this from happening:
>
> CPU1 = end previous writeback
> CPU2 = start new writeback under page lock
> CPU3 = write_cache_pages()
>
>      CPU1                     CPU2                     CPU3
>      ----                     ----                     ----
>
> end_page_writeback()
>   test_clear_page_writeback(page)
>   ... delayed...
>
>
>                        lock_page();
>                        set_page_writeback()
>                        unlock_page()
>
>
>                                                 lock_page()
>                                                 wait_on_page_writeback();
>
>   wake_up_page(page, PG_writeback);
>   .. wakes up CPU3 ..
>
>                                                 BUG_ON(PageWriteback(page));
>
> IOW, that BUG_ON() really feels entirely bogus to me. Notice how it
> wasn't actually serialized with the waking up of the _previous_ bit.

Well. That looks so obvious now that you suggest it that I feel very
stupid for not seeing it before, so I have tried hard to disprove you.
But I think you're right.

>
> Could we make the wait_on_page_writeback() just loop if it sees the
> page under writeback again? Sure.
>
> Could we make the wait_on_page_bit_common() code say "if this is
> PG_writeback, I won't wake it up after all, because the bit is set
> again?" Sure.
>
> But I feel it's really that end_page_writeback() itself is
> fundamentally buggy, because the "wakeup" is not atomic with the bit
> clearing _and_ it doesn't actually hold the page lock that is
> allegedly serializing this all.

And we won't be adding a lock_page() into end_page_writeback()!

>
> That raciness was what caused the "stale wakeup from previous owner"
> thing too. And I think that Hugh fixed the page re-use case, but the
> fundamental problem of end_page_writeback() kind of remained.
>
> And yes, I think this was all hidden by wait_on_page_writeback()
> effectively looping over the "PageWriteback(page)" test because of how
> wait_on_page_bit() worked.
>
> So the one-liner of changing the "if" to "while" in
> wait_on_page_writeback() should get us back to what we used to do.

I think that is the realistic way to go.
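
For concreteness, that makes wait_on_page_writeback() in
mm/page-writeback.c look like this (sketch, compile-tested only in my
head):

	void wait_on_page_writeback(struct page *page)
	{
		while (PageWriteback(page)) {
			trace_wait_on_page_writeback(page, page_mapping(page));
			wait_on_page_bit(page, PG_writeback);
		}
	}

A waiter woken by the previous writeback's belated wake_up_page() then
re-tests PG_writeback, sees the new writeback, and sleeps again instead
of returning to hit the BUG_ON(PageWriteback(page)).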

>
> Except I still get the feeling that the bug really is not in
> wait_on_page_writeback(), but in the end_page_writeback() side.
>
> Comments? I'm perfectly happy doing the one-liner. I would just be
> _happier_ with end_page_writeback() having the serialization..
>
> The real problem is that "wake_up_page(page, bit)" is not the thing
> that actually clears the bit. So there's a fundamental race between
> clearing the bit and waking something up.
>
> Which makes me think that the best option would actually be to move
> the bit clearing _into_ wake_up_page(). But that looks like a very big
> change.

I'll be surprised if that direction is even possible without unpleasant
extra locking. If there were only one wakeup to be done, perhaps, but
potentially there are many. When I looked before, the requirements
seemed contradictory: the clear bit needs to come before the wakeup
(or a woken waiter can recheck and find PG_writeback still set), yet
the wakeup needs to come before the clear bit (once the bit is clear,
nothing stops the page from being freed and reused under the waker).
And the BOOKMARK case drops q->lock partway through the wakeups.

It should be possible to rely on the XArray's i_pages lock rather than
the page lock for serialization, much as I did in one variant of the
patch I sent originally. An updated version is appended below for show
(most of it rearrangement+cleanup rather than functional change); but I
think it's slightly incomplete (__test_set_page_writeback() should take
the i_pages lock even in the !mapping_use_writeback_tags case - see the
sketch after this paragraph); and last time around you were worried by
the NULL mapping case - which I believe just comes from an abundance of
caution (who starts writeback on a page without any backing? and
truncation/eviction wait_on_page_writeback() before the page can lose
its mapping).
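
To show what I mean by "slightly incomplete", the else branch of
__test_set_page_writeback() would need something like this (sketch
only, with flags hoisted to function scope; not part of the patch
below):

	} else {
		/* Sketch: set the bit under the same lock that clears it */
		if (mapping)
			xa_lock_irqsave(&mapping->i_pages, flags);
		ret = TestSetPageWriteback(page);
		if (mapping)
			xa_unlock_irqrestore(&mapping->i_pages, flags);
	}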

But I expect you to take one look and quickly opt instead for the "while"
in wait_on_page_writeback(), which is not so bad now that you've shown a
scenario that justifies it.

--- 5.11-rc2/include/linux/page-flags.h	2020-12-27 20:39:36.819923814 -0800
+++ linux/include/linux/page-flags.h	2021-01-04 17:27:32.771920607 -0800
@@ -549,7 +549,6 @@ static __always_inline void SetPageUptod
 
 CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
 
-int test_clear_page_writeback(struct page *page);
 int __test_set_page_writeback(struct page *page, bool keep_write);
 
 #define test_set_page_writeback(page)			\
--- 5.11-rc2/include/linux/pagemap.h	2020-12-13 14:41:30.000000000 -0800
+++ linux/include/linux/pagemap.h	2021-01-04 17:27:32.775920636 -0800
@@ -660,6 +660,7 @@ static inline int lock_page_or_retry(str
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern void wake_up_page_bit(struct page *page, int bit_nr);
 
 /*
  * Wait for a page to be unlocked.
--- 5.11-rc2/kernel/sched/wait_bit.c	2020-03-29 15:25:41.000000000 -0700
+++ linux/kernel/sched/wait_bit.c	2021-01-04 17:27:32.779920665 -0800
@@ -90,9 +90,8 @@ __wait_on_bit_lock(struct wait_queue_hea
 		ret = action(&wbq_entry->key, mode);
 		/*
 		 * See the comment in prepare_to_wait_event().
-		 * finish_wait() does not necessarily takes wwq_head->lock,
-		 * but test_and_set_bit() implies mb() which pairs with
-		 * smp_mb__after_atomic() before wake_up_page().
+		 * finish_wait() does not necessarily take
+		 * wwq_head->lock, but test_and_set_bit() implies mb().
 		 */
 		if (ret)
 			finish_wait(wq_head, &wbq_entry->wq_entry);
--- 5.11-rc2/mm/filemap.c	2020-12-27 20:39:37.659932212 -0800
+++ linux/mm/filemap.c	2021-01-04 17:28:59.464560914 -0800
@@ -1093,7 +1093,7 @@ static int wake_page_function(wait_queue
 	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
 }
 
-static void wake_up_page_bit(struct page *page, int bit_nr)
+void wake_up_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
 	struct wait_page_key key;
@@ -1147,13 +1147,6 @@ static void wake_up_page_bit(struct page
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 
-static void wake_up_page(struct page *page, int bit)
-{
-	if (!PageWaiters(page))
-		return;
-	wake_up_page_bit(page, bit);
-}
-
 /*
  * A choice of three behaviors for wait_on_page_bit_common():
  */
@@ -1466,40 +1459,6 @@ void unlock_page(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page);
 
-/**
- * end_page_writeback - end writeback against a page
- * @page: the page
- */
-void end_page_writeback(struct page *page)
-{
-	/*
-	 * TestClearPageReclaim could be used here but it is an atomic
-	 * operation and overkill in this particular case. Failing to
-	 * shuffle a page marked for immediate reclaim is too mild to
-	 * justify taking an atomic operation penalty at the end of
-	 * ever page writeback.
-	 */
-	if (PageReclaim(page)) {
-		ClearPageReclaim(page);
-		rotate_reclaimable_page(page);
-	}
-
-	/*
-	 * Writeback does not hold a page reference of its own, relying
-	 * on truncation to wait for the clearing of PG_writeback.
-	 * But here we must make sure that the page is not freed and
-	 * reused before the wake_up_page().
-	 */
-	get_page(page);
-	if (!test_clear_page_writeback(page))
-		BUG();
-
-	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
-	put_page(page);
-}
-EXPORT_SYMBOL(end_page_writeback);
-
 /*
  * After completing I/O on a page, call this routine to update the page
  * flags appropriately
--- 5.11-rc2/mm/page-writeback.c	2020-12-13 14:41:30.000000000 -0800
+++ linux/mm/page-writeback.c	2021-01-04 17:28:59.464560914 -0800
@@ -589,7 +589,7 @@ static void wb_domain_writeout_inc(struc
 
 /*
  * Increment @wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
+ * completion count. Called from end_page_writeback().
  */
 static inline void __wb_writeout_inc(struct bdi_writeback *wb)
 {
@@ -2719,49 +2719,69 @@ int clear_page_dirty_for_io(struct page
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
-int test_clear_page_writeback(struct page *page)
+/**
+ * end_page_writeback - end writeback against a page
+ * @page: the page
+ */
+void end_page_writeback(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
-	int ret;
+	unsigned long flags;
+	int writeback;
+
+	/*
+	 * TestClearPageReclaim could be used here but it is an atomic
+	 * operation and overkill in this particular case. Failing to
+	 * shuffle a page marked for immediate reclaim is too mild to
+	 * justify taking an atomic operation penalty at the end of
+	 * every page writeback.
+	 */
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
+		rotate_reclaimable_page(page);
+	}
 
+	mapping = page_mapping(page);
+	WARN_ON_ONCE(!mapping);
 	memcg = lock_page_memcg(page);
 	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+	dec_lruvec_state(lruvec, NR_WRITEBACK);
+	dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+	inc_node_page_state(page, NR_WRITTEN);
+
+	if (mapping)
+		xa_lock_irqsave(&mapping->i_pages, flags);
+
+	writeback = TestClearPageWriteback(page);
+	/* No need for smp_mb__after_atomic() after TestClear */
+	if (PageWaiters(page))
+		wake_up_page_bit(page, PG_writeback);
+
 	if (mapping && mapping_use_writeback_tags(mapping)) {
 		struct inode *inode = mapping->host;
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
-		unsigned long flags;
 
-		xa_lock_irqsave(&mapping->i_pages, flags);
-		ret = TestClearPageWriteback(page);
-		if (ret) {
-			__xa_clear_mark(&mapping->i_pages, page_index(page),
+		__xa_clear_mark(&mapping->i_pages, page_index(page),
						PAGECACHE_TAG_WRITEBACK);
-			if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
-				struct bdi_writeback *wb = inode_to_wb(inode);
+		if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
+			struct bdi_writeback *wb = inode_to_wb(inode);
 
-				dec_wb_stat(wb, WB_WRITEBACK);
-				__wb_writeout_inc(wb);
-			}
+			dec_wb_stat(wb, WB_WRITEBACK);
+			__wb_writeout_inc(wb);
 		}
+		if (inode && !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+			sb_clear_inode_writeback(inode);
+	}
 
-		if (mapping->host && !mapping_tagged(mapping,
-						     PAGECACHE_TAG_WRITEBACK))
-			sb_clear_inode_writeback(mapping->host);
-
+	if (mapping)
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
-	} else {
-		ret = TestClearPageWriteback(page);
-	}
-	if (ret) {
-		dec_lruvec_state(lruvec, NR_WRITEBACK);
-		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-		inc_node_page_state(page, NR_WRITTEN);
-	}
 	__unlock_page_memcg(memcg);
-	return ret;
+	BUG_ON(!writeback);
 }
+EXPORT_SYMBOL(end_page_writeback);
 
 int __test_set_page_writeback(struct page *page, bool keep_write)
 {