I recently encountered a system crash that required power cycling one of my machines. At that point I decided to have a look at hardware watchdogs, which trigger an automatic reboot when the system no longer responds.
Fortunately the system involved had such a hardware watchdog in place:
linux # dmesg | grep -i watchdog
[ 0.330901] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 392.504661] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver
[ 392.525720] sp5100-tco sp5100-tco: Using 0xfeb00000 for watchdog MMIO address
linux # wdctl
Device: /dev/watchdog0
Identity: SP5100 TCO timer [version 0]
Timeout: 60 seconds
Timeleft: 60 seconds
Pre-timeout: 0 seconds
FLAG           DESCRIPTION               STATUS  BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1            0
MAGICCLOSE     Supports magic close char      0            0
SETTIMEOUT     Set timeout (in seconds)       0            0
However, there was no software in place to make use of the watchdog. So I installed the watchdog package and modified some settings:
linux # apt install watchdog
<...>
linux # vi /etc/watchdog.conf
<...>
watchdog-device = /dev/watchdog
watchdog-timeout = 60
interval = 5
<...>
So the watchdog daemon will check the system and ping the watchdog device every 5 seconds. If no ping arrives for 60 seconds, the hardware reboots the machine.
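The relation between the two values matters: the interval has to be comfortably shorter than the timeout, otherwise a slightly delayed ping could already trigger a reboot. Here is a minimal sketch of a sanity check, assuming the "key = value" syntax shown above. The function name check_watchdog_conf and the rule of thumb (at least two pings per timeout window) are my own assumptions, not something the watchdog package provides:

```shell
#!/bin/sh
# Hypothetical helper: verify that "interval" is well below "watchdog-timeout".
# Usage: check_watchdog_conf /etc/watchdog.conf
check_watchdog_conf() {
    conf="$1"
    # Take the last uncommented "key = value" assignment for each key.
    timeout=$(sed -n 's/^[[:space:]]*watchdog-timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    interval=$(sed -n 's/^[[:space:]]*interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)

    if [ -z "$timeout" ] || [ -z "$interval" ]; then
        echo "missing watchdog-timeout or interval in $conf"
        return 2
    fi
    # Assumed rule of thumb: at least two pings must fit into one timeout window.
    if [ $((interval * 2)) -le "$timeout" ]; then
        echo "ok: interval=${interval}s timeout=${timeout}s"
        return 0
    fi
    echo "too tight: interval=${interval}s timeout=${timeout}s"
    return 1
}
```

With the settings from above (interval 5, timeout 60) the check passes. It only looks at values that are explicitly set in the file and does not know about the daemon's built-in defaults.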
If you call wdctl now, you’ll see the “Timeout” of 60 seconds and the “Timeleft” hopefully somewhere between 55 and 60 seconds:
linux # wdctl
Device: /dev/watchdog0
Identity: SP5100 TCO timer [version 0]
Timeout: 60 seconds
Timeleft: 57 seconds
FLAG           DESCRIPTION               STATUS  BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          0            0
MAGICCLOSE     Supports magic close char      0            0
SETTIMEOUT     Set timeout (in seconds)       0            0
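The difference between “Timeout” and “Timeleft” roughly tells you how long ago the timer was last reset, so it can serve as a quick check that the daemon is really pinging. Here is a small sketch that reads wdctl-style output on stdin. The function name watchdog_recently_fed is hypothetical, and it assumes the “Timeout:” and “Timeleft:” lines look like the ones above (not every driver may report a time left):

```shell
#!/bin/sh
# Hypothetical helper: read wdctl-style output on stdin and check that the
# timer was reset no more than MAX_AGE seconds ago.
# Usage: wdctl | watchdog_recently_fed 5
watchdog_recently_fed() {
    max_age="$1"
    awk -v max_age="$max_age" '
        $1 == "Timeout:"  { timeout = $2 }
        $1 == "Timeleft:" { timeleft = $2 }
        END {
            if (timeout == "" || timeleft == "") {
                print "no timeout/timeleft found"
                exit 2
            }
            # Seconds elapsed since the timer was last reset.
            age = timeout - timeleft
            if (age <= max_age) {
                print "ok: last ping about " age "s ago"
                exit 0
            }
            print "stale: last ping about " age "s ago"
            exit 1
        }'
}
```

Passing the configured interval (5 in my case) as the argument makes the function return non-zero as soon as more than one interval has gone by without a ping.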
Normally I hate having to reboot systems that got stuck. Right now, however, I’m curious to see whether this watchdog really works …