Re: [RFC PATCH] char: misc: Init misc->list in a safe way

From: Orson Zhai
Date: Tue Jun 27 2017 - 21:58:05 EST


Hi Greg & Arnd,

On Tue, Jun 27, 2017 at 08:29:17AM +0200, Greg Kroah-Hartman wrote:
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
>
> A: No.
> Q: Should I include quotations after my reply?
>
> http://daringfireball.net/2007/07/on_top

We are so sorry for any troubles to you. I will take the role of Zhongping to
continue discussion here.

>
> On Tue, Jun 27, 2017 at 02:02:13AM +0000, Zhongping Tan (èäå) wrote:
> > Ok, firstly we need to discuss the list usage, for list head we need
> > do initialization, but for list node we don't need initialization at
> > all.
>
> I don't understand, why is your misc driver touching that field at all?
> Do I need to go and make it "private" somehow?

There maybe some mis-understanding caused by not very well english expression.
I'll clarify what had happended to us.

>
> > And for misc_list head, we use LIST_HEAD to define and initialize
> > it. So I don't know why we put INIT_LIST_HEAD(&misc->list) in function
> > misc_register, any bugs when without it?
>
> Again, what is wrong with the code today? What driver is this causing
> problems for?

It maybe a little bit long story.

In recent, our test team report a crash of non-stop reboot test on a mobile phone after 46 hours.

After some investigation, we got the information as following,

&misc_list = 0xFFFFFFFF822AC780 -> (
next_=_0xFFFFFFFFA0087158 -> (
next = 0xFFFFFFFFA0087158 -> (
next = 0xFFFFFFFFA0087158,
prev = 0xFFFFFFFFA0087158 -> (
next = 0xFFFFFFFFA0087158,
prev = 0xFFFFFFFFA0087158)),
prev = 0xFFFFFFFFA0087158 -> (
next = 0xFFFFFFFFA0087158,
prev = 0xFFFFFFFFA0087158)),
prev = 0xFFFF88007BF55618)

it seems that misc_list fall into a loop which cause surfaceflinger (a service from Android) dead.

And we got futher more information after that,

(struct miscdevice*)0xFFFFFFFFA0087140 = 0xFFFFFFFFA0087140 -> (
minor = 20,
name = 0xFFFFFFFFA008606D -> "fm",
fops = 0xFFFFFFFFA0086700 -> (
owner = 0xFFFFFFFFA0087480,
llseek = 0x0,
read = 0xFFFFFFFFA0081860,
write = 0x0,
read_iter = 0x0,
write_iter = 0x0,
iterate = 0x0,
poll = 0x0,
unlocked_ioctl = 0xFFFFFFFFA00832C0,
compat_ioctl = 0xFFFFFFFFA0083870,
mmap = 0x0,
open = 0xFFFFFFFFA0081B80,
flush = 0x0,
release = 0xFFFFFFFFA00838C0,
fsync = 0x0,
aio_fsync = 0x0,
fasync = 0x0,
lock = 0x0,
sendpage = 0x0,
get_unmapped_area = 0x0,
check_flags = 0x0,
flock = 0x0,
splice_write = 0x0,
splice_read = 0x0,
setlease = 0x0,
fallocate = 0x0,
show_fdinfo = 0x0),
list = (
next = 0xFFFFFFFFA0087158,
prev = 0xFFFFFFFFA0087158),
parent = 0x0,
this_device = 0xFFFF88007BD32000,
groups = 0x0,
nodename = 0x0,
mode = 0)

We found the device is "fm". We highly suspect that fm driver call misc_register twice and reinitialize list to make ->pre & ->next pointing to himself.

Meanwhile, we checked fm driver and found nothing obviously wrong in the code.
Consider that this is a crash after 46 hours continuous power-on/off, it maybe caused by some special cases we are hard to know for now.
We think it might make some sence to add protection code into misc_register() at first.

Hope this could help to understand our situation.

We'll try to provide any detail inforamtion about this if necessary.

Thanks,
Orson

>
> thanks,
>
> greg k-h