On 14.01.2020 11:54, Shile Zhang wrote:
This time SMP is enabled. You have many threads are running. Interrupts are enabled
On 2020/1/13 16:11, Kirill Tkhai wrote:
On 13.01.2020 03:54, Shile Zhang wrote:Yes, I'd tried this for issue confirm, before I sent this patch. Likes I mentioned the debug log in cover letter, I also add mdelay between local_irq_enable/disable, this system jiffies is stuck without update.
On 2020/1/10 19:42, Kirill Tkhai wrote:The thing is we simply shouldn't introduce such the primitives since the thread
On 10.01.2020 12:30, Shile Zhang wrote:Hi Kirill,
When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt
will be called inside 'pgdatinit' kthread to initialise the deferred
pages with local interrupts disabled. Which is introduced by
commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
pages").
But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
default), especially in small system with NRCPUS <= 2. In this case, the
interrupts are disabled on boot CPU during memory initialising, which
caused the tick_sched timer be blocked, leading to wall clock stuck.
Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
deferred pages")
Signed-off-by: Shile Zhang <shile.zhang@xxxxxxxxxxxxxxxxx>
---
ÂÂ include/linux/memory_hotplug.h | 16 ++++++++++++++--
ÂÂ 1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ba0dca6aac6e..be69a6dc4fee 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -6,6 +6,8 @@
ÂÂ #include <linux/spinlock.h>
ÂÂ #include <linux/notifier.h>
ÂÂ #include <linux/bug.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
ÂÂ Â struct page;
ÂÂ struct zone;
@@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
ÂÂ static inline
ÂÂ void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
ÂÂ {
-ÂÂÂ spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+ÂÂÂ /*
+ÂÂÂÂ * Disable local interrupts on boot CPU will stop the tick_sched
+ÂÂÂÂ * timer, which will block jiffies(wall clock) update.
+ÂÂÂÂ */
+ÂÂÂ if (current->cpu != get_boot_cpu_id())
+ÂÂÂÂÂÂÂ spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+ÂÂÂ else
+ÂÂÂÂÂÂÂ spin_lock(&pgdat->node_size_lock);
ÂÂ }
ÂÂ static inline
ÂÂ void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
ÂÂ {
-ÂÂÂ spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+ÂÂÂ if (current->cpu != get_boot_cpu_id())
+ÂÂÂÂÂÂÂ spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+ÂÂÂ else
+ÂÂÂÂÂÂÂ spin_unlock(&pgdat->node_size_lock);
ÂÂ }
ÂÂ static inline
ÂÂ void pgdat_resize_init(struct pglist_data *pgdat)
*kernel* code in the middle of function. When you are executing a code containing
pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu
between them.
bool cpuÂÂÂÂÂÂÂÂÂÂÂÂÂÂ another cpu
----------------------------------
pgdat_resize_lock()
ÂÂÂ spin_lock()
ÂÂÂ --> migrate to another cpu
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ pgdat_resize_unlock()
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ spin_unlock_irqrestore(<uninitialized flags>)
(Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call).
This looks like a bad helpers, and we should not introduce such the design.
Thanks for your comments!
Sorry for I'm not very clear about this lock/unlock, but I encountered this issue
with "CONFIG_PREEMPT is not set".
may migrate to another cpu, while you own the lock. This looks like a buggy design.
Have you tried temporary enabling interrupts in the middle of cycle after a huge enough2)I think there is no the problem this patch solves. Do we really this statistics?Sorry for I've not put this issue very clearly. It's *not* just one statistics log
Can't we simple remove print message from deferred_init_memmap() and solve this?
with wrong time calculate, but the wall clock is stuck.
So the 'systemd-analyze' command also give a wrong time as I mentioned in the cover
letter. I don't think is OK just remove the log, it cannot solve the wall clock latency.
memory block is initialized? Something like:
deferred_init_memmap()
{
ÂÂÂÂwhile (spfn < epfn) {
ÂÂÂÂÂÂÂ nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
ÂÂÂÂÂÂÂ local_irq_enable();
ÂÂÂÂÂÂÂ local_irq_disable();
ÂÂÂÂ}
}
So I think there must be problem to use spin_lock_irqsave in early boot path on boot CPU.
and they occur. So, it's OK to disable interrupts for some time.
My opinion is we should try to enable interrupts in the cycle after some fixed
amount of memory is initialized. Say, every 1GB. This should resolve two problems:
handling timer interrupt with update jiffies at time, and keeping the fix for the issue,
that Pavel fixed in 3a2d7fa8a3d5.
Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxorder().
Also, you may try to check that sched_clock() gives better results with interrupts
disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to
jiffies-based clock in some hardware cases, and they also won't go with interrupts
disabled).