Re: [PATCH v8 2/3] CMDQ: Mediatek CMDQ driver

From: Matthias Brugger
Date: Fri Jun 17 2016 - 11:58:16 EST




On 17/06/16 10:28, Horng-Shyang Liao wrote:
Hi Matthias,

On Tue, 2016-06-14 at 20:07 +0800, Horng-Shyang Liao wrote:
Hi Matthias,

On Tue, 2016-06-14 at 12:17 +0200, Matthias Brugger wrote:

On 14/06/16 09:44, Horng-Shyang Liao wrote:
Hi Matthias,

On Wed, 2016-06-08 at 17:35 +0200, Matthias Brugger wrote:

On 08/06/16 14:25, Horng-Shyang Liao wrote:
Hi Matthias,

On Wed, 2016-06-08 at 12:45 +0200, Matthias Brugger wrote:

On 08/06/16 07:40, Horng-Shyang Liao wrote:
Hi Matthias,

On Tue, 2016-06-07 at 18:59 +0200, Matthias Brugger wrote:

On 03/06/16 15:11, Matthias Brugger wrote:


[...]

+
+ smp_mb(); /* modify jump before enable thread */
+ }
+
+ cmdq_thread_writel(thread, task->pa_base +
task->command_size,
+ CMDQ_THR_END_ADDR);
+ cmdq_thread_resume(thread);
+ }
+ list_move_tail(&task->list_entry, &thread->task_busy_list);
+ spin_unlock_irqrestore(&cmdq->exec_lock, flags);
+}
+
+static void cmdq_handle_error_done(struct cmdq *cmdq,
+ struct cmdq_thread *thread, u32 irq_flag)
+{
+ struct cmdq_task *task, *tmp, *curr_task = NULL;
+ u32 curr_pa;
+ struct cmdq_cb_data cmdq_cb_data;
+ bool err;
+
+ if (irq_flag & CMDQ_THR_IRQ_ERROR)
+ err = true;
+ else if (irq_flag & CMDQ_THR_IRQ_DONE)
+ err = false;
+ else
+ return;
+
+ curr_pa = cmdq_thread_readl(thread, CMDQ_THR_CURR_ADDR);
+
+ list_for_each_entry_safe(task, tmp, &thread->task_busy_list,
+ list_entry) {
+ if (curr_pa >= task->pa_base &&
+ curr_pa < (task->pa_base + task->command_size))

What are you checking here? It seems as if you make some implcit
assumptions about pa_base and the order of execution of
commands in the
thread. Is it save to do so? Does dma_alloc_coherent give any
guarantees
about dma_handle?

1. Check what is the current running task in this GCE thread.
2. Yes.
3. Yes, CMDQ doesn't use iommu, so physical address is continuous.


Yes, physical addresses might be continous, but AFAIK there is no
guarantee that the dma_handle address is steadily growing, when
calling
dma_alloc_coherent. And if I understand the code correctly, you
use this
assumption to decide if the task picked from task_busy_list is
currently
executing. So I think this mecanism is not working.

I don't use dma_handle address, and just use physical addresses.
From CPU's point of view, tasks are linked by the busy list.
From GCE's point of view, tasks are linked by the JUMP command.

In which cases does the HW thread raise an interrupt.
In case of error. When does CMDQ_THR_IRQ_DONE get raised?

GCE will raise interrupt if any task is done or error.
However, GCE is fast, so CPU may get multiple done tasks
when it is running ISR.

In case of error, that GCE thread will pause and raise interrupt.
So, CPU may get multiple done tasks and one error task.


I think we should reimplement the ISR mechanism. Can't we just read
CURR_IRQ_STATUS and THR_IRQ_STATUS in the handler and leave
cmdq_handle_error_done to the thread_fn? You will need to pass
information from the handler to thread_fn, but that shouldn't be an
issue. AFAIK interrupts are disabled in the handler, so we should stay
there as short as possible. Traversing task_busy_list is expensive, so
we need to do it in a thread context.

Actually, our initial implementation is similar to your suggestion,
but display needs CMDQ to return callback function very precisely,
else display will drop frame.
For display, CMDQ interrupt will be raised every 16 ~ 17 ms,
and CMDQ needs to call callback function in ISR.
If we defer callback to workqueue, the time interval may be larger than
32 ms.sometimes.


I think the problem is, that you implemented the workqueue as a ordered
workqueue, so there is no parallel processing. I'm still not sure why
you need the workqueue to be ordered. Can you please explain.

The order should be kept.
Let me use mouse cursor as an example.
If task 1 means move mouse cursor to point A, task 2 means point B,
and task 3 means point C, our expected result is A -> B -> C.
If the order is not kept, the result could become A -> C -> B.


Got it, thanks for the clarification.


I think a way to get rid of the workqueue is to use a timer, which gets
programmed to the time a timeout in the first task in the busy list
would happen. Everytime we update the busy list (e.g. because of task
got finished by the thread), we update the timer. When the timer
triggers, which hopefully won't happen too often, we return timeout on
the busy list elements, until the time is lower then the actual time.

At least with this we can reduce the data structures in this driver and
make it more lightweight.

From my understanding, your proposed method can handle timeout case.

However, the workqueue is also in charge of releasing tasks.
Do you take releasing tasks into consideration by using the proposed
timer method?
Furthermore, I think the code will become more complex if we also use
timer to implement releasing tasks.


Can't we call
clk_disable_unprepare(cmdq->clock);
cmdq_task_release(task);
after invoking the callback?

After I put clk_disable_unprepare(cmdq->clock) into ISR, I encounter
another BUG.

(Quote some Linux 4.7 source code.)

605 void clk_unprepare(struct clk *clk)
606 {
607 if (IS_ERR_OR_NULL(clk))
608 return;
609
610 clk_prepare_lock(); // <-- Here
611 clk_core_unprepare(clk->core);
612 clk_prepare_unlock();
613 }
614 EXPORT_SYMBOL_GPL(clk_unprepare);

91 static void clk_prepare_lock(void)
92 {
93 if (!mutex_trylock(&prepare_lock)) { // <-- Here
94 if (prepare_owner == current) {
95 prepare_refcnt++;
96 return;
97 }
98 mutex_lock(&prepare_lock);
99 }
100 WARN_ON_ONCE(prepare_owner != NULL);
101 WARN_ON_ONCE(prepare_refcnt != 0);
102 prepare_owner = current;
103 prepare_refcnt = 1;
104 }

So, 'unprepare' can sleep and cannot be put into ISR.
I also try to put it into a timer, but the error is the same
since timer callback is executed by softirq.

We need clk_disable_unprepare() since it can save power consumption
in idle.

We can call clk_prepare in probe and then use clk_enable/clk_disable, which don't sleep.

Regards,
Matthias

Therefore, I plan to
(1) move releasing buffer and task into ISR,
(2) move timeout into timer, and
(3) keep workqueue for clk_disable_unprepare().

What do you think?

Thanks,
HS


Do you mean just call these two functions in ISR?
My major concern is dma_free_coherent() and kfree() in
cmdq_task_release(task).

Why do we need the dma calls at all? Can't we just calculate the
physical address using __pa(x)?

I prefer to use dma_map_single/dma_unmap_single.


Can you please elaborate why you need this. We don't do dma, so we
should not use dma memory for this.

We need a buffer to share between CPU and GCE, so we do need DMA.
CPU is in charge of writing GCE commands into this buffer.
GCE is in charge of reading and running GCE commands from this buffer.
When we chain CMDQ tasks, we also need to modify GCE JUMP command.
Therefore, I prefer to use dma_alloc_coherent and dma_free_coherent.

However, if we want to use timer to handle timeout, we need to release
memory in ISR.
In this case, using kmalloc/kfree + dma_map_single/dma_unmap_single
instead of dma_alloc_coherent/dma_free_coherent is an alternative
solution, but taking care the synchronization between cache and memory
is the expected overhead.

Therefore, your suggestion is to use GFP_ATOMIC for both
dma_alloc_coherent() and kzalloc(). Right?

I don't think we need GFP_ATOMIC, the critical path will just free the
memory.

I tested these two functions, and kfree was safe.
However, dma_free_coherent raised BUG.
BUG: failure at
/mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:1514/vunmap()!

Just a general hint. Please try to evaluate on a recent kernel. It looks
like as if you tried this on a v3.18 based one.

This driver should be backward compatible to v3.18 for a MTK project.

Best regards,
Matthias

Thanks,
HS

1512 void vunmap(const void *addr)
1513 {
1514 BUG_ON(in_interrupt()); // <-- here
1515 might_sleep();
1516 if (addr)
1517 __vunmap(addr, 0);
1518 }
1519 EXPORT_SYMBOL(vunmap);

Therefore, I plan to use kmalloc + dma_map_single instead of
dma_alloc_coherent, and dma_unmap_single + kfree instead of
dma_free_coherent.

What do you think about the function replacement?

If so, I can try to implement timeout by timer, and discuss with you
if I have further questions.


Sounds good :)

Thanks,
Matthias

Thanks,
HS

Regrading the clock, wouldn't it be easier to handle the clock
enable/disable depending on the state of task_busy_list? I suppose we
can't as we would need to check the task_busy_list of all threads, right?

Regards,
Matthias

Thanks,
HS