Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros
From: Petr Mladek
Date: Mon Jan 15 2024 - 11:21:50 EST
On Mon 2024-01-15 11:58:39, John Ogness wrote:
> On 2024-01-12, Petr Mladek <pmladek@xxxxxxxx> wrote:
> >> > Reported-by: Francesco Dolcini <francesco@xxxxxxxxxx>
> >> > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> >> > Closes: https://lore.kernel.org/oe-lkp/202311171611.78d41dbe-oliver.sang@xxxxxxxxx
> >> > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> >> > Closes: https://lore.kernel.org/oe-lkp/202311161555.3ee16fc9-oliver.sang@xxxxxxxxx
> >> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> >> > Signed-off-by: John Ogness <john.ogness@xxxxxxxxxxxxx>
> >>
> >> Great catch! It must have been complicated to debug this.
> >>
> >> Reviewed-by: Petr Mladek <pmladek@xxxxxxxx>
> >
> > That said, I am a bit nervous that a bug like this might cause
> > workqueue stall and panic() the kernel.
> >
> > At least, this is how I read
> > https://lore.kernel.org/oe-lkp/202311171611.78d41dbe-oliver.sang@xxxxxxxxx/
>
> Note that the bug is reported for code that mainline will never
> have. This patch "fixes" the problem before it is a problem. From the
> perspective of mainline, the problem never existed. Perhaps it is
> inappropriate to list these with the Closes tag.
I think that it is better to keep the reference. It might help
archaeologists to understand the problem ;-)
> > It looks like it caused some loop and refcout overlow or so.
> > But I might be wrong.
> >
> > I would like to better understand this and check if we could prevent
> > it somehow.
>
> The code in question can be found here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/printk/printk_ringbuffer.c?id=768c33033e41ad195a9304ebb183afb730d5ae98#n2031
>
> The URL above points to prb_next_reserve_seq(), which is in an infinite
> loop.
>
> desc_last_finalized_seq() is returning a huge garbage value due to the
> 32bit sequence value 1 being incorrectly mapped to 0xffffffff00000001
> (which is 18446744069414584321). When desc_read_finalized_seq() is
> called (line 2082), it obviously returns -EINVAL because there is no
> record with this huge sequence number.
>
> The code interrupts -EINVAL to mean that the record has been
> overwritten, and so it tries again with an updated last finalized
> sequence number => infinite loop.
>
> With this patch applied, the 32bit value 1 is correctly mapped to
> sequence number 1 and all is OK.
>
> The problem was that desc_last_finalized_seq() returned a sequence
> number that was higher than the highest record. That was the bug. As
> long as desc_last_finalized_seq() always returns a sequence number that
> exists or used to exist, the retry loop is fine. And since the tail
> record is always in the finalized state, there will always be a
> finalized record available.
Thank you both for explanation. You are right that this was kind
of special.
You know, the code around reading the messages is getting more
and more complex. Also the u32 -> u64 transition is not 100% safe.
So some sanity checks might be useful.
That said, I do not see any obvious trivial one. Just the following
came to my mind. prb_first_seq() is reliable and we could do:
void next_seq_sanity_check(struct printk_ringbuffer *rb, u64 next_seq)
{
struct prb_desc_ring *desc_ring = &rb->desc_ring;
u64 first_seq = prb_first_seq(rb);
u64 desc_count = DESC_COUNT(&rb->desc_ring)).
WARN_ONCE(seq > first_seq + DESC_COUNT(desc_ring));
}
Well, I am not sure if it is worth it. Also the WARN() won't
be printed on consoles when the reading is broken.
Best Regards,
Petr