Linux Watchdog Service on Vultr
Introduction
To recover automatically from guest lockup events, Vultr supports the Linux watchdog service. When you install, configure, and run the watchdog service on your instance, it interacts with a special virtual device that Vultr monitors. If the service stops servicing that device, as happens when the guest locks up, the Vultr control plane automatically reboots your instance. This document focuses on the wd_keepalive service, a simplified daemon whose only job is to service the watchdog hardware device.
These instructions use Debian 11 as an example.
Prerequisites
To use the watchdog, you need:
- A Vultr cloud server instance with the appropriate watchdog driver and kernel module loaded
- The publicly maintained watchdog software suite installed
- The wd_keepalive daemon configured and running
Preliminary Setup
Verify your instance recognizes the watchdog device:
ls -al /dev/watchdog*
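The same check can be sketched in script form, for example for use in a provisioning script. The helper name check_watchdog_device is hypothetical and not part of the watchdog package; it defaults to the /dev/watchdog path used above.

```shell
# Hypothetical helper (not part of the watchdog package): succeed only if
# the given path exists and is a character device.
check_watchdog_device() {
    dev="${1:-/dev/watchdog}"
    if [ -c "$dev" ]; then
        echo "ok: $dev is a character device"
        return 0
    fi
    echo "missing: $dev is absent or not a character device" >&2
    return 1
}
```

Calling check_watchdog_device with no argument tests /dev/watchdog. A non-zero exit status suggests the watchdog driver is not loaded.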
Install the watchdog software from standard repositories:
apt-get install watchdog
Edit /etc/watchdog.conf and uncomment the watchdog-device line by removing the leading # character, so that this part of the file reads:
# Uncomment this to use the watchdog device driver access "file".
watchdog-device = /dev/watchdog
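If you prefer to script this edit, the following is a minimal sketch. It assumes the file still contains the commented-out watchdog-device line as shipped. The function name enable_watchdog_device and the .bak backup suffix are illustrative choices, not part of the watchdog package.

```shell
# Hypothetical helper: uncomment the watchdog-device line in a
# watchdog.conf-style file, keeping a backup copy next to it.
enable_watchdog_device() {
    conf="${1:-/etc/watchdog.conf}"
    # Back up first; the backup is also the input for sed, so the original
    # file is rewritten in place and keeps its owner and permissions.
    cp -p "$conf" "$conf.bak" || return 1
    # Strip the leading "#" only from the watchdog-device assignment;
    # ordinary comment lines are left untouched.
    sed 's/^[[:space:]]*#[[:space:]]*\(watchdog-device[[:space:]]*=\)/\1/' \
        "$conf.bak" > "$conf"
}
```

Run enable_watchdog_device as root with no argument to edit /etc/watchdog.conf, then inspect the result with grep watchdog-device /etc/watchdog.conf.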
Edit the systemd unit file /lib/systemd/system/wd_keepalive.service and add the following lines under the [Install] section:
[Install]
WantedBy=multi-user.target
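This change can also be applied from a script. The sketch below appends the section only when the unit file does not already contain one, so it is safe to run more than once. The function name add_install_section is hypothetical.

```shell
# Hypothetical helper: append an [Install] section to a unit file unless
# one is already present.
add_install_section() {
    unit="${1:-/lib/systemd/system/wd_keepalive.service}"
    if grep -q '^\[Install\]' "$unit"; then
        echo "[Install] section already present in $unit"
        return 0
    fi
    # The leading blank line separates the new section from the previous one.
    printf '\n[Install]\nWantedBy=multi-user.target\n' >> "$unit"
    echo "added [Install] section to $unit"
}
```

Run add_install_section as root with no argument to modify the wd_keepalive unit file named above.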
Reload the systemd configuration, start the wd_keepalive service, enable it at boot, and check its status:
systemctl daemon-reload; systemctl start wd_keepalive; systemctl enable wd_keepalive; systemctl status wd_keepalive
Testing
To observe the behavior of the watchdog, you can trigger a system hang as follows. Your instance should reboot automatically about a minute after you run this command:
sync; sleep 2; sync; echo c > /proc/sysrq-trigger
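One way to confirm afterwards that the instance really rebooted is to compare the kernel boot ID before and after the test, since /proc/sys/kernel/random/boot_id changes on every boot. This is only a sketch: the helper names, the BOOT_ID_SOURCE variable, and the /root/boot_id.before path mentioned below are hypothetical.

```shell
# Hypothetical helpers: record the current boot ID, then check later
# whether it has changed (that is, whether the machine has rebooted).
BOOT_ID_SOURCE="${BOOT_ID_SOURCE:-/proc/sys/kernel/random/boot_id}"

record_boot_id() {
    # $1: a file that survives a reboot, for example /root/boot_id.before
    cat "$BOOT_ID_SOURCE" > "$1" && sync
}

rebooted_since() {
    # Succeeds when the recorded boot ID differs from the current one.
    [ "$(cat "$1")" != "$(cat "$BOOT_ID_SOURCE")" ]
}
```

Run record_boot_id /root/boot_id.before before triggering the hang, then run rebooted_since /root/boot_id.before && echo rebooted once the instance is reachable again.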
Troubleshooting
Verify wd_keepalive is running:
root@vultr:~# systemctl status wd_keepalive
wd_keepalive.service - watchdog keepalive daemon
Loaded: loaded (/lib/systemd/system/wd_keepalive.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-05-30 16:36:49 UTC; 5s ago
Process: 1831 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
Process: 1832 ExecStartPre=/bin/systemctl reset-failed watchdog.service (code=exited, status=0/SUCCESS)
Process: 1833 ExecStart=/usr/sbin/wd_keepalive $watchdog_options (code=exited, status=0/SUCCESS)
Process: 1836 ExecStartPost=/bin/sh -c ln -s /var/run/wd_keepalive.pid /run/sendsigs.omit.d/wd_keepalive.pid (code=exited, status=0/SUCCESS)
Main PID: 1835 (wd_keepalive)
Tasks: 1 (limit: 2233)
Memory: 540.0K
CPU: 14ms
CGroup: /system.slice/wd_keepalive.service
|--1835 /usr/sbin/wd_keepalive
May 30 16:36:49 vultr systemd[1]: Starting watchdog keepalive daemon...
May 30 16:36:49 vultr wd_keepalive[1835]: starting watchdog keepalive daemon (5.16):
May 30 16:36:49 vultr wd_keepalive[1835]: int=1 alive=/dev/watchdog realtime=yes
May 30 16:36:49 vultr wd_keepalive[1835]: watchdog now set to 60 seconds
May 30 16:36:49 vultr wd_keepalive[1835]: hardware watchdog identity: i6300ESB timer
May 30 16:36:49 vultr systemd[1]: Started watchdog keepalive daemon.
root@vultr:~#
Confirm the watchdog device is present:
root@vultr:~# ls -al /dev/watchdog*
crw------- 1 root root 10, 130 May 30 15:57 /dev/watchdog
crw------- 1 root root 244, 0 May 30 15:57 /dev/watchdog0
root@vultr:~#
Identify the watchdog device type:
root@vultr:~# lspci -v | grep -i watch
02:01.0 System peripheral: Intel Corporation 6300ESB Watchdog Timer
root@vultr:~#
Load the appropriate watchdog kernel module:
root@vultr:~# modprobe i6300esb
root@vultr:~#
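Because modprobe prints nothing on success, a follow-up check can be useful. A minimal sketch, assuming the usual /proc/modules format in which the module name is the first field on each line. The helper name module_loaded is hypothetical.

```shell
# Hypothetical helper: succeed if the named module appears in the
# kernel's list of loaded modules.
module_loaded() {
    mod="$1"
    list="${2:-/proc/modules}"
    # Compare the first field exactly so that similarly named modules
    # do not produce a false match.
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$list"
}
```

For example, module_loaded i6300esb && echo loaded. A driver compiled directly into the kernel does not appear in /proc/modules, so a failed check does not by itself prove the device is unsupported.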
Note: only the i6300esb watchdog device is supported. The iTCO watchdog is not supported, and the softdog type is not recommended.
If your instance was deployed a long time ago, you may need to reboot it from the Vultr API or control panel so that it picks up the updated underlying configuration.