Hi Konrad/Miquel,
Tried running a KASAN enabled image on IPQ board, but no luck. Nothing came out.
On 2/1/2022 9:21 PM, Konrad Dybcio wrote:
+ few more who have access to the board.
On 01/02/2022 14:52, Miquel Raynal wrote:
Hi Konrad,
konrad.dybcio@xxxxxxxxxxxxxx wrote on Mon, 31 Jan 2022 20:54:12 +0100:
On 31/01/2022 15:13, Sricharan Ramabadhran wrote:
This does look like a pointer error at some point and some kernel data
Hi Konrad,
I'm sorry I have so few details on hand, and no kernel tree (no access to that machine either, for now).
On 1/31/2022 3:39 PM, Konrad Dybcio wrote:
On 28/01/2022 18:50, Sricharan Ramabadhran wrote:
Ok sure. So was the READID command itself failing (or) the subsequent one ?
Hi Konrad,
I won't have access to the board for about two weeks, sorry.
On 1/28/2022 9:55 AM, Sricharan Ramabadhran wrote:
Hi Miquel,
While we could not reproduce this issue on our ipq boards (do not have a mdm9607 right now) and
On 1/26/2022 4:12 PM, Miquel Raynal wrote:
Hi Mani,
Sorry Miquel, somehow we did not get this email in our inbox.
mani@xxxxxxxxxx wrote on Wed, 26 Jan 2022 16:03:16 +0530:
On Wed, Jan 26, 2022 at 11:16:13AM +0100, Miquel Raynal wrote:
Oh, ok, I didn't know. Thanks!
Hello,
Sorry. I was hoping that Qcom folks would chime in as I don't have any idea
miquel.raynal@xxxxxxxxxxx wrote on Fri, 14 Jan 2022 08:27:18 +0100:
Hi Konrad,
Sadre, I've spent a significant amount of time reviewing your patches,
konrad.dybcio@xxxxxxxxxxxxxx wrote on Thu, 13 Jan 2022 19:44:26 +0100:
While I have absolutely 0 idea why and how, running clear_bam_transaction
I'm adding two people from codeaurora who worked a lot on this driver.
when READID is issued makes the DMA totally clog up and refuse to function
at all on mdm9607. In fact, it is so bad that all the data gets garbled
and after a short while in the nand probe flow, the CPU decides that
seppuku is the only option.
Removing _READID from the if condition makes it work like a charm, I can
read data and mount partitions without a problem.
Signed-off-by: Konrad Dybcio <konrad.dybcio@xxxxxxxxxxxxxx>
---
This is totally just an observation which took me an inhumane amount of
debug prints to find.. perhaps there's a better reason behind this, but
I can't seem to find any answers.. Therefore, this is a BIG RFC!
Hopefully they will have an idea :)
now it's your turn to not take a month to answer to your peers'
proposals.
Please help reviewing this patch.
about the mdm9607 platform. It could be that the mail server migration from
codeaurora to quicinc put a barrier here.
Let me ping them internally.
Thanks to Mani for pinging us, we will test this up today and get back.
issue does not look any obvious.
can you please give the debug logs that you did for the above stage by stage ?
When I get to it, I'll surely try to send you the logs, though there
wasn't much more than just something jumping to who-knows-where
after clear_bam_transaction was called, resulting in values associated with
the NAND being all zeroed out in pr_err/_debug/etc.
We can check which parameter reset by the clear_bam_transaction is causing the
failure. Meanwhile, looping in Pradeep who has access to the board, so in a better
position to debug.
I will try to describe to the best of my abilities what I recall.
My methodology of making sure things don't go haywire was to print the oob size
of our NAND basically every two lines of code (yes, i was very desperate at one point),
as that was zeroed out when *the bug* happened,
has been corrupted very badly by the driver.
leading to a kernel bug/panic/stall
Do you remember if this function was called for the first time when
(can't recall what exactly it was, but it said something along the lines of "no support for
oob size 0" and then it didn't fail gracefully, leading to some bad jumps and ultimately
a dead platform..)
after hours of digging, I found out that everything goes fine until clear_bam_transaction is called,
this happened?
I think so, if I recall correctly there are no more callers in this path, as readid is the first nand command executed in flash probe flow.
after that gets executed every nand op starts reading all zeroes (for example in JEDEC ID check)
I don't see it in the list of supported devices, what's the exact
so I added the changes from this patch, and things magically started working... My suspicion is
that the underlying FIFO isn't fully drained (is it a FIFO on 9607? bah, i work on too many socs at once)
compatible used?
qcom,ipq4019-nand
and this function only makes Linux think it is, without actually draining it, and the leftover
I would bet for a non allocated bam-ish pointer that is reset to zero
commands get executed with some parts of them getting overwritten, resulting in the
famous garbage in - garbage out situation, but that's only a guesstimate..
in the clear_bam_transaction() helper.
Can you get your hands on the board again?
Sure, but as I mentioned previously, only in about 2 weeks, I can't really do any dev before then.. :(
It would be nice to check if the allocation always occurs before use,
and if yes, on how many bytes.
If the pointer is not dangling, then perhaps something else smashes
that pointer.
Konrad
Do note this somehow worked fine on 5.11 and then broke on 5.12/13. I went as far as replacing most
of the kernel with the updated/downgraded parts via git checkout (i tried many combinations),
to no avail.. I even tried different compilers and optimization levels, thinking it could have been
a codegen issue, but no luck either.
I.. do understand this email is a total mess to read, as much as it was to write, but
without access to my code and the machine itself I can't give you solid details, and
the fact this situation is far from ordinary doesn't help either..
The latest (ancient, not quite pretty, but probably working if my memory is correct) version of my patches
for the mdm9607 is available at [1], I will push the new revision after I get access to the workstation.
Going by the description, for kernel corruption, we can try out a KASAN build.
Since you have mentioned it worked till 5.11, you bisected the driver till 5.11 head and it worked ?