I recently encountered a system crash that required power cycling one of my machines. That prompted me to have a look at hardware watchdogs, which should trigger an automatic reboot when the system no longer responds.
Fortunately, the system involved had such a hardware watchdog in place:
linux # dmesg | grep -i watchdog
[ 0.330901] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 392.504661] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver
[ 392.525720] sp5100-tco sp5100-tco: Using 0xfeb00000 for watchdog MMIO address
linux # wdctl
Device: /dev/watchdog0
Identity: SP5100 TCO timer [version 0]
Timeout: 60 seconds
Timeleft: 60 seconds
Pre-timeout: 0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
However, there was no software in place to make use of the watchdog. So I installed the package watchdog and modified some settings:
linux # apt install watchdog
<...>
linux # vi /etc/watchdog.conf
<...>
watchdog-device = /dev/watchdog
watchdog-timeout = 60
interval = 5
<...>
So the daemon will check the system every 5 seconds and ping the watchdog device each time; if no ping arrives for 60 seconds, the hardware reboots the machine.
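To make the interval/timeout relationship a bit more concrete, here is a minimal sketch of the kind of loop such a daemon runs, as far as I understand it: keep the device open and write to it once per interval. The function name feed_watchdog and its parameters are made up for illustration and are not part of the watchdog package. The final 'V' is the kernel's "magic close" convention, which asks the driver to stop the timer on close where the driver supports it. Be careful: opening a real watchdog device arms the timer, so try this against a scratch file first.

```shell
#!/bin/sh
# feed_watchdog DEVICE INTERVAL COUNT
# Hypothetical helper (not part of the watchdog package): keeps DEVICE
# open and writes one byte every INTERVAL seconds, COUNT times, then
# attempts a "magic close".
feed_watchdog() {
    dev=$1
    interval=$2
    count=$3

    {
        i=0
        while [ "$i" -lt "$count" ]; do
            # Any write counts as a keepalive ping and restarts the countdown.
            printf '.' >&3
            i=$((i + 1))
            # Sleep between pings, but not after the last one.
            if [ "$i" -lt "$count" ]; then
                sleep "$interval"
            fi
        done
        # Magic close: 'V' as the last byte asks the driver to disarm the
        # timer on close (only honoured if the driver supports it).
        printf 'V' >&3
    } 3>"$dev"   # the device stays open on fd 3 for the whole loop
}
```

A harmless dry run against a temporary file would be feed_watchdog /tmp/fake-watchdog 5 3. With interval = 5 and watchdog-timeout = 60, roughly a dozen pings in a row have to be missed before the hardware steps in.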
If you call wdctl while the daemon is running, you’ll see the “Timeout” of 60 seconds and the “Timeleft” hopefully somewhere between 60 and 55 seconds:
linux # wdctl
Device: /dev/watchdog0
Identity: SP5100 TCO timer [version 0]
Timeout: 60 seconds
Timeleft: 57 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          0           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
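The "between 60 and 55 seconds" expectation can also be checked mechanically. The following is only a sketch: timeleft_ok is a hypothetical helper name, and it assumes the wdctl output contains "Timeout:" and "Timeleft:" lines in the form shown above.

```shell
#!/bin/sh
# timeleft_ok INTERVAL
# Hypothetical helper: reads wdctl-style output on stdin and succeeds if
# Timeleft has not dropped more than INTERVAL seconds below Timeout.
timeleft_ok() {
    awk -v interval="$1" '
        $1 == "Timeout:"  { timeout  = $2 + 0; have_timeout  = 1 }
        $1 == "Timeleft:" { timeleft = $2 + 0; have_timeleft = 1 }
        END {
            # Exit 2 if either field was missing from the input.
            if (!have_timeout || !have_timeleft) exit 2
            # A regularly pinged timer should stay within one interval
            # of the full timeout.
            exit ((timeleft >= timeout - interval) ? 0 : 1)
        }
    '
}
```

Usage would be something like wdctl | timeleft_ok 5 && echo "watchdog is being fed". In practice a second or two of slack on top of the interval is probably sensible.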
Normally I hate rebooting systems that got stuck, but right now I’m curious to see whether this watchdog really works …
Update
So tonight the machine crashed again, and the watchdog had no effect: the machine did not reboot automatically. According to the logs, however, the watchdog wasn’t active at all (mind the “No such file or directory” error):
linux # journalctl -u watchdog -b -1
Jun 27 08:28:43 srv.mydomain.de systemd[1]: Starting watchdog.service - watchdog daemon...
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: starting daemon (5.16):
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: int=5s realtime=yes sync=no load=0,0,0 soft=no
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: memory not checked
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: ping: no machine to check
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: file: no file to check
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: pidfile: no server process to check
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: interface: no interface to check
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: temperature: no sensors to check
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: no test binary files
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: no repair binary files
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: error retry time-out = 60 seconds
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: repair attempts = 1
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
Jun 27 08:28:43 srv.mydomain.de watchdog[27804]: cannot open /dev/watchdog (errno = 2 = 'No such file or directory')
Jun 27 08:28:43 srv.mydomain.de systemd[1]: Started watchdog.service - watchdog daemon.
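Note that systemd still reported the service as started, so the unit status alone would not have revealed the problem. A small pre-flight check on the device node would have caught it. This is a minimal sketch; watchdog_device_present is a hypothetical helper name.

```shell
#!/bin/sh
# watchdog_device_present PATH
# Hypothetical helper: succeeds only if PATH exists and is a character
# device, which is what a watchdog device node should be.
watchdog_device_present() {
    if [ -c "$1" ]; then
        echo "ok: $1 is a character device"
        return 0
    fi
    echo "missing: $1 does not exist or is not a character device" >&2
    return 1
}
```

Called as watchdog_device_present /dev/watchdog, it returns non-zero in exactly the situation logged above.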
The log of the current boot looks more promising:
linux # journalctl -u watchdog -b 0
Jun 28 06:16:01 srv.mydomain.de systemd[1]: Starting watchdog.service - watchdog daemon...
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: starting daemon (5.16):
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: int=5s realtime=yes sync=no load=0,0,0 soft=no
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: memory not checked
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: ping: no machine to check
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: file: no file to check
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: pidfile: no server process to check
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: interface: no interface to check
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: temperature: no sensors to check
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: no test binary files
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: no repair binary files
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: error retry time-out = 60 seconds
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: repair attempts = 1
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: watchdog now set to 60 seconds
Jun 28 06:16:01 srv.mydomain.de watchdog[46885]: hardware watchdog identity: SP5100 TCO timer
Jun 28 06:16:01 srv.mydomain.de systemd[1]: Started watchdog.service - watchdog daemon.
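The difference between the two logs boils down to two lines: "cannot open ..." in the failed start versus "hardware watchdog identity: ..." in the working one. A rough sketch for telling them apart automatically could look like this; watchdog_log_status is a hypothetical helper, and the message texts are simply taken from the logs above (other daemon versions may word them differently).

```shell
#!/bin/sh
# watchdog_log_status
# Hypothetical helper: reads watchdog daemon log lines on stdin and
# prints "inactive", "active" or "unknown" based on the messages seen.
watchdog_log_status() {
    log=$(cat)
    # The daemon could not open its device: the watchdog is not in use.
    if printf '%s\n' "$log" | grep -q 'watchdog\[[0-9]*\]: cannot open '; then
        echo inactive
        return 1
    fi
    # The daemon reported the hardware identity: the device was opened.
    if printf '%s\n' "$log" | grep -q 'hardware watchdog identity:'; then
        echo active
        return 0
    fi
    echo unknown
    return 2
}
```

For example: journalctl -u watchdog -b 0 | watchdog_log_status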