Re: [PATCH] watchdog: nohz: don't run watchdog on nohz_full cores

From: Chris Metcalf
Date: Thu Apr 02 2015 - 11:43:08 EST

In reply to: Frederic Weisbecker: "Re: [PATCH] watchdog: nohz: don't run watchdog on nohz_full cores"
Next in thread: Don Zickus: "Re: [PATCH] watchdog: nohz: don't run watchdog on nohz_full cores"

On 04/02/2015 11:38 AM, Frederic Weisbecker wrote:

On Thu, Apr 02, 2015 at 10:15:27AM -0400, Don Zickus wrote:

On Thu, Apr 02, 2015 at 09:49:45AM -0400, Chris Metcalf wrote:

Can I ask how the NO_HZ_FULL technology works from userspace? Is there a
system command that has to be sent? How does the kernel know to turn off
ticks and trust userspace to do the right thing?

The NO_HZ_FULL option, when configured into the kernel, lets
you boot with "nohz_full=1-15" (or whatever cpumask you like),
typically in conjunction with "isolcpus=1-15". At this point no tasks
will run on those cores until explicitly placed there by affinity, and
once there and running in userspace, the kernel will automatically
get out of their way and not interrupt at all. This lets those tasks
run with 100.000% of the cpu, which is a requirement for many
user-space device drivers running high throughput devices.
(This is typically the use case for the tile architecture customers.)

So, other than a boot flag, there are no system commands or
other APIs to deal with.
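
As a rough userspace illustration of the "placed there by affinity" step above, the sketch below parses a CPU list written like the "1-15" boot argument and binds the calling process to one CPU. The helper names parse_cpulist and pin_current_task are made up for this example; os.sched_setaffinity is the standard Python wrapper for the Linux affinity call. Only plain numbers and ranges are handled.

```python
import os


def parse_cpulist(text):
    """Parse a simple CPU list such as "1-15" or "0,2,4-7" into a set of ints.

    Hypothetical helper: only plain numbers and inclusive ranges are handled.
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            lo, hi = int(first), int(last)
            if lo > hi:
                raise ValueError(f"bad range: {part!r}")
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_task(cpu):
    """Bind the calling process to a single CPU (Linux only).

    Returns the affinity set the kernel reports after the change.
    """
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)
```

On a machine booted with "nohz_full=1-15 isolcpus=1-15", calling pin_current_task(3) from the one task meant to own CPU 3 would be the explicit placement described above; nothing else is required from userspace.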

Ah, I am starting to understand your approach in the original patch better.

Part of the requirement, though, is that there can be only one task
bound and runnable on that cpu, otherwise the kernel has to be
involved to do the context-switching off of the scheduler tick.
This is why having the standard watchdog kernel thread doesn't
work in this context.

So, there is no preemption happening, which means the softlockup is rather
pointless.

Still useful, actually, because nohz full only takes effect when a single task runs
on the CPU. There can still be more than one task running; nohz full will just
be disabled in that case. It all happens dynamically.

Can interrupts be disabled or handled on that cpu? I am trying
to see if the hardlockup detector becomes rather silly on those cpus too.

No, interrupts aren't disabled on these CPUs. Now the goal is to avoid them:
migrate irqs, nohz full, etc...

But there can be irqs. And actually there is at least 1 tick every second in
order to keep the scheduler stats moving forward. We plan to get rid of it, but
anyway the point is that IRQs can happen on nohz full CPUs.
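
One way to observe this from userspace is to total the per-CPU columns of /proc/interrupts and compare two readings taken a few seconds apart. The sketch below is only an illustration; the function name irq_counts_per_cpu is an assumption, and it relies on the usual layout of that file (a header row of CPUn names, then one row per interrupt source).

```python
def irq_counts_per_cpu(text):
    """Sum the interrupt counts in /proc/interrupts-style text for each CPU.

    Hypothetical helper. Rows that do not carry one numeric column per CPU
    listed in the header are skipped.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    # Header row looks like: "CPU0 CPU1 ..."
    cpus = [int(name[3:]) for name in lines[0].split() if name.startswith("CPU")]
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        counts = fields[1:1 + len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals
```

Reading the file with open("/proc/interrupts").read() twice and subtracting the totals for a nohz_full CPU should show a small but nonzero delta while the residual tick is still present.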

I continue to suspect that the right model here is to disable the
watchdog specifically on the cores that the user has tagged with
the nohz_full boot argument. I agree that there might be a case
to be made for leaving the watchdog conditionally (as suggested
by Ingo) but it should be possible to have the watchdogs on
the nohz_full cores be turned off completely if desired.

I think I might be slowly coming around to your thoughts. I might request a
different patch though based on the answers above. Maybe even create a
subset of the online cpus for the watchdog to work off of. The watchdog
would copy the online cpu mask, mask off the nohz cpus and just function
that way. It would print loud messages for each nohz cpu it was masking
off.
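
The masking idea in the paragraph above can be sketched with plain sets standing in for kernel cpumasks. This is a userspace model, not kernel code; the function name and the message text are assumptions for illustration.

```python
def watchdog_cpus(online, nohz_full):
    """Return (mask, messages): the CPUs the watchdog would cover and
    one loud message per nohz_full CPU that was masked off.

    Sets of CPU numbers stand in for cpumasks in this sketch.
    """
    online = set(online)
    nohz_full = set(nohz_full)
    messages = [
        f"watchdog: not monitoring nohz_full cpu {cpu}"
        for cpu in sorted(online & nohz_full)
    ]
    for message in messages:
        print(message)
    # Copy of the online mask with the nohz_full CPUs cleared.
    return online - nohz_full, messages
```

For example, watchdog_cpus(range(16), range(1, 16)) leaves only CPU 0 covered and reports the other fifteen.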

All agreed with that! We should at least keep the watchdog running on
non-nohz-full CPUs. We should also allow re-enabling it everywhere when needed,
in case we have a lockup to chase on nohz full CPUs.

Then perhaps as a debug aid, expose a /proc/sys/kernel/watchdog_cpumask for
folks to modify in case they want to enable the watchdog on the nohz cpus.

That sounds like a good idea.

OK, I will respin v2 of the patch as follows:

- Provide a watchdog_cpumask as suggested by Don.
- On a non-NO_HZ_FULL build, it defaults to cpu_possible as normal.
- On a NO_HZ_FULL build, it defaults to the housekeeping cpus.
- If the mask is modified, we disable and then re-enable the watchdog,
  so that the watchdog init code can exit() the appropriate threads as
  they start up.

This should address the various concerns that have been raised.
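
The plan above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the v2 patch: sets stand in for cpumasks, the function names are invented, and the housekeeping cpus are assumed to be the possible cpus minus the nohz_full ones.

```python
def default_watchdog_mask(possible, nohz_full, nohz_full_build):
    """Pick the default mask described in the list above.

    Assumption: housekeeping cpus = possible cpus minus nohz_full cpus.
    """
    possible = set(possible)
    if not nohz_full_build:
        return possible
    return possible - set(nohz_full)


def restart_watchdog(online, mask):
    """Model the disable/re-enable cycle after the mask is modified.

    Returns the set of CPUs whose watchdog thread keeps running.
    """
    running = set()  # step 1: disable, so no watchdog thread is running
    for cpu in sorted(online):  # step 2: re-enable, a thread starts per CPU
        if cpu in mask:
            running.add(cpu)  # thread stays
        # otherwise the thread exits again as it starts up
    return running
```

With sixteen cpus and nohz_full covering 1-15, the default mask on a NO_HZ_FULL build is just {0}; writing a wider mask and restarting would bring the watchdog back on the chosen nohz_full cpus for debugging.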

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

