motherboards with recent Intel chipsets: please test this (iTCO-wdt)

From: David Madore
Date: Sat Sep 10 2016 - 10:40:20 EST


TL;DR: On some motherboards with an Intel chipset, at least from Asus
and Asrock, the hardware watchdog (Linux driver iTCO-wdt) fails to
reboot the system correctly (POST fails and leaves the system unusable).
Looking for people willing to test, in order to pinpoint the problem.


Background:

I am looking for users of a desktop with a fairly recent Intel
chipset, especially if one or several of the following conditions are
satisfied: (1) the BIOS is written by AMI (American Megatrends), (2) the
chipset is of the Intel 100 series or C230 series (a.k.a. "Sunrise
Point", used for "Skylake" processors with an LGA1151 socket), and
(3) the system is booting under UEFI (as opposed to legacy BIOS).

The point of this test is to check whether the hardware watchdog
included in these chipsets (known in Intel parlance as the "TCO
watchdog", where "TCO" stands for "Total Cost of Ownership") reboots
the system properly or, as on my motherboard,
places it in a broken state (POST fails, even when the reset button is
later pressed, or even if the power button is pressed twice; the power
supply needs to be disconnected for a few minutes to restore the
system to a working state). This is a very serious bug, which could
be due to the BIOS, the hardware, or Linux (I suspect the BIOS, but
it is conceivable that Linux could work around it).

Do not perform this test unless you can disconnect the power supply!


How to test:

Boot a recent Linux kernel. Load the i2c-i801 and i2c-smbus modules.
Then load the iTCO-wdt module. This should cause lines such as the
following to appear in the kernel log (dmesg), indicating that Linux
has detected the device:

iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
iTCO_wdt: Found a Intel PCH TCO device (Version=4, TCOBASE=0x0400)
iTCO_wdt: initialized. heartbeat=120 sec (nowayout=0)
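
As a rough sketch of the steps above (not a tested procedure), the
module loading and the log check could be wrapped in two small shell
functions. The names load_itco and itco_heartbeat are made up for
illustration, and the pattern assumes the log line has the same shape
as the "initialized" line quoted above.

```shell
# Load the modules named above. Needs root.
# (modprobe treats "-" and "_" in module names as equivalent.)
load_itco() {
    modprobe i2c-i801 && modprobe i2c-smbus && modprobe iTCO-wdt
}

# Read kernel log text on stdin and print the heartbeat, in seconds,
# from the last "iTCO_wdt: initialized" line. Prints nothing if the
# driver did not report initialization.
itco_heartbeat() {
    sed -n 's/.*iTCO_wdt: initialized\. heartbeat=\([0-9][0-9]*\) sec.*/\1/p' | tail -n 1
}
```

Possible usage, as root: load_itco && dmesg | itco_heartbeat
An empty result suggests the device was not detected.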

Make sure all your filesystems are unmounted or mounted read-only. On
a systemd system, for example, run

systemctl isolate emergency.target ; sync ; echo u >> /proc/sysrq-trigger ; sync

and make sure "Emergency Remount complete" appears at the end of
dmesg. A /dev/watchdog device should have appeared. Then run

cat >> /dev/watchdog

and press Enter twice. Do not interrupt (do not press Control-C or
Control-D); just wait for a few minutes. After a certain time (twice
the "heartbeat" value indicated by the kernel), the system will try to
reboot. What interests me is whether the reboot succeeds (POST
proceeds as normal, and OS restarts) or whether the system locks up
(in which case you will need to power cycle it at the power supply
unit level in order to restore it to normal).
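
Before arming the watchdog, two checks from the text above can be
scripted. This is only a sketch: rw_mounts and expected_reboot_delay
are hypothetical helper names, the first assumes input in the format
of /proc/mounts and only looks at devices under /dev/, and the second
simply applies the "twice the heartbeat" rule stated above.

```shell
# Read /proc/mounts-style text on stdin and print the mount point of
# every /dev/... filesystem that is still mounted read-write.
rw_mounts() {
    awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)rw(,|$)/ { print $2 }'
}

# Print the expected delay before the reboot attempt, in seconds,
# given the heartbeat value reported by the kernel.
expected_reboot_delay() {
    echo $(( 2 * $1 ))
}
```

Possible usage: rw_mounts < /proc/mounts (no output means nothing on
a /dev/ device is still writable), then expected_reboot_delay 120 to
know roughly how long to wait.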

Please report (to me, to avoid spamming this list - I will post a
summary) results along with information as to the hardware used:
motherboard brand and model, BIOS vendor and date (dmidecode should
give this information), UEFI or legacy boot, and any extension cards
that might be used on the system (in particular, whether the system
uses an integrated GPU or a separate graphics card). I am interested
in both positive and negative results.
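
To gather the information requested above in one go, something along
these lines might help. It is a sketch under assumptions: boot_mode
and collect_report are illustrative names, UEFI boot is inferred from
the presence of /sys/firmware/efi, and extension cards are simply
listed with lspci. dmidecode needs root.

```shell
# Print "UEFI" if the given directory (default /sys/firmware/efi)
# exists, "legacy" otherwise.
boot_mode() {
    if [ -d "${1:-/sys/firmware/efi}" ]; then
        echo UEFI
    else
        echo legacy
    fi
}

# Print the hardware details asked for in the report. Run as root.
collect_report() {
    echo "Board vendor: $(dmidecode -s baseboard-manufacturer)"
    echo "Board model:  $(dmidecode -s baseboard-product-name)"
    echo "BIOS vendor:  $(dmidecode -s bios-vendor)"
    echo "BIOS date:    $(dmidecode -s bios-release-date)"
    echo "Boot mode:    $(boot_mode)"
    echo "PCI devices (graphics and extension cards):"
    lspci
}
```

Possible usage, as root: collect_report > report.txt, then add
whether the watchdog reboot succeeded or locked up the system.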

Thanks in advance to all who are willing to test this!


Xref:

https://lkml.org/lkml/2016/9/8/641

https://www.reddit.com/r/linuxquestions/comments/51xad5/users_of_a_desktop_with_an_intel_chipset_could/


--
David A. Madore
( http://www.madore.org/~david/ )