Linux Watchdog Service on Vultr

Updated on May 30, 2023

Introduction

To recover automatically from guest lockups, Vultr supports the Linux watchdog service. When you install, configure, and run the watchdog software on your instance, it periodically writes to a special virtual watchdog device that Vultr monitors. If those updates stop, for example because the guest has locked up, the Vultr control plane automatically reboots your instance. This document focuses on wd_keepalive, a simplified daemon whose only job is to keep the watchdog hardware device alive.

These instructions use Debian 11 as an example.

Prerequisites

To use the watchdog, you need:

  • A Vultr cloud server instance with the appropriate watchdog driver (kernel module) loaded
  • The publicly maintained watchdog software suite installed
  • The wd_keepalive daemon configured and running

Preliminary Setup

Verify your instance recognizes the watchdog device: ls -al /dev/watchdog*

Install the watchdog software from standard repositories: apt-get install watchdog

Edit /etc/watchdog.conf and uncomment the watchdog-device line:

# Uncomment this to use the watchdog device driver access "file".

#watchdog-device                = /dev/watchdog
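
The uncommenting step can also be scripted. The following is a minimal sketch, assuming the line appears exactly as shown above (a "#" directly in front of watchdog-device). The function name enable_watchdog_device is hypothetical, and the optional file argument is there only so the function can be tried on a copy of the file first.

```shell
# Hypothetical helper: uncomment the watchdog-device line in a watchdog.conf file.
# Usage: enable_watchdog_device [path]   (path defaults to /etc/watchdog.conf)
enable_watchdog_device() {
    conf="${1:-/etc/watchdog.conf}"
    # Strip a leading "#" (and any spaces after it) in front of "watchdog-device =".
    # The original file is kept as <path>.bak.
    sed -i.bak 's/^#[[:space:]]*\(watchdog-device[[:space:]]*=\)/\1/' "$conf"
}
```

Running enable_watchdog_device as root with no argument edits /etc/watchdog.conf in place. Other commented lines are left untouched because the pattern only matches a "#" that is followed by watchdog-device.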

Edit the systemd unit file /lib/systemd/system/wd_keepalive.service and add the following [Install] section if the file does not already contain one:

[Install]
WantedBy=multi-user.target
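
If you prefer to script this edit, the sketch below appends the two lines only when the unit file has no [Install] header yet, so it is safe to run more than once. The function name add_install_section is hypothetical, and the optional file argument is there so the function can be tried on a copy.

```shell
# Hypothetical helper: append an [Install] section to a unit file if it has none.
# Usage: add_install_section [unit-file]
add_install_section() {
    unit="${1:-/lib/systemd/system/wd_keepalive.service}"
    # Do nothing when the file already contains an [Install] header.
    if grep -q '^\[Install\]' "$unit"; then
        return 0
    fi
    # Append the section, separated from the previous one by a blank line.
    printf '\n[Install]\nWantedBy=multi-user.target\n' >> "$unit"
}
```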

Reload the systemd unit files, start the watchdog service, and set it to run at system boot:

	systemctl daemon-reload; systemctl start wd_keepalive; systemctl enable wd_keepalive; systemctl status wd_keepalive
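
For scripted deployments, the result of these commands can be checked without reading the full status output. This is a minimal sketch that relies on the exit codes of systemctl is-active and systemctl is-enabled. The function name check_unit is hypothetical.

```shell
# Hypothetical helper: report whether a systemd unit is both running and enabled.
# Usage: check_unit [unit-name]   (defaults to wd_keepalive)
check_unit() {
    unit="${1:-wd_keepalive}"
    # Both subcommands exit with status 0 only when the answer is "yes".
    if systemctl is-active --quiet "$unit" && systemctl is-enabled --quiet "$unit"; then
        echo "$unit: active and enabled"
    else
        echo "$unit: not active or not enabled" >&2
        return 1
    fi
}
```

Calling check_unit with no argument checks wd_keepalive. A non-zero return value means at least one of the two conditions is not met.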

Testing

To observe the watchdog in action, you can deliberately hang the instance by crashing its kernel as follows. Your instance should reboot automatically about a minute after you run this command: sync; sleep 2; sync; echo c > /proc/sysrq-trigger
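
The delay of about a minute matches the 60-second timeout that wd_keepalive reports at startup (see the log lines under Troubleshooting). As a rough sketch, and assuming the running kernel exposes watchdog attributes through sysfs (this depends on the kernel configuration and is not confirmed by this document), the configured timeout can be read before testing. The function name watchdog_timeout and the default path are assumptions.

```shell
# Hypothetical helper: print the timeout (in seconds) of a watchdog device.
# Usage: watchdog_timeout [sysfs-dir]
# Assumption: the kernel provides /sys/class/watchdog/watchdog0/timeout.
watchdog_timeout() {
    dir="${1:-/sys/class/watchdog/watchdog0}"
    if [ -r "$dir/timeout" ]; then
        cat "$dir/timeout"
    else
        echo "no timeout attribute under $dir" >&2
        return 1
    fi
}
```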


Troubleshooting

Verify that wd_keepalive is running:

	root@vultr:~# systemctl status wd_keepalive
	wd_keepalive.service - watchdog keepalive daemon
	     Loaded: loaded (/lib/systemd/system/wd_keepalive.service; enabled; vendor preset: enabled)
	     Active: active (running) since Tue 2023-05-30 16:36:49 UTC; 5s ago
	    Process: 1831 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
	    Process: 1832 ExecStartPre=/bin/systemctl reset-failed watchdog.service (code=exited, status=0/SUCCESS)
	    Process: 1833 ExecStart=/usr/sbin/wd_keepalive $watchdog_options (code=exited, status=0/SUCCESS)
	    Process: 1836 ExecStartPost=/bin/sh -c ln -s /var/run/wd_keepalive.pid /run/sendsigs.omit.d/wd_keepalive.pid (code=exited, status=0/SUCCESS)
	   Main PID: 1835 (wd_keepalive)
	      Tasks: 1 (limit: 2233)
	     Memory: 540.0K
	        CPU: 14ms
	     CGroup: /system.slice/wd_keepalive.service
	             |--1835 /usr/sbin/wd_keepalive

	May 30 16:36:49 vultr systemd[1]: Starting watchdog keepalive daemon...
	May 30 16:36:49 vultr wd_keepalive[1835]: starting watchdog keepalive daemon (5.16):
	May 30 16:36:49 vultr wd_keepalive[1835]:  int=1 alive=/dev/watchdog realtime=yes
	May 30 16:36:49 vultr wd_keepalive[1835]: watchdog now set to 60 seconds
	May 30 16:36:49 vultr wd_keepalive[1835]: hardware watchdog identity: i6300ESB timer
	May 30 16:36:49 vultr systemd[1]: Started watchdog keepalive daemon.
	root@vultr:~#

Confirm the watchdog device is present:

root@vultr:~# ls -al /dev/watchdog*
crw------- 1 root root  10, 130 May 30 15:57 /dev/watchdog
crw------- 1 root root 244,   0 May 30 15:57 /dev/watchdog0
root@vultr:~#
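
The same check can be scripted. The listing above shows /dev/watchdog as a character device (the leading "c" in the permissions column), so this minimal sketch tests for exactly that. The function name has_watchdog_device is hypothetical.

```shell
# Hypothetical helper: succeed only if the given path is a character device.
# Usage: has_watchdog_device [path]   (defaults to /dev/watchdog)
has_watchdog_device() {
    dev="${1:-/dev/watchdog}"
    if [ -c "$dev" ]; then
        echo "$dev: present"
    else
        echo "$dev: missing or not a character device" >&2
        return 1
    fi
}
```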

Identify the watchdog device type:

root@vultr:~# lspci -v | grep -i watch
02:01.0 System peripheral: Intel Corporation 6300ESB Watchdog Timer
root@vultr:~#

Load the appropriate watchdog kernel module:

root@vultr:~# modprobe i6300esb
root@vultr:~#
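
A module loaded with modprobe does not stay loaded across reboots. The ExecStartPre line in the status output above shows that the unit runs modprobe on $watchdog_module whenever that variable is set to something other than "none". The sketch below sets the variable in the unit's environment file. The path /etc/default/watchdog is an assumption based on the Debian packaging and is not stated in this document, and the function name set_watchdog_module is hypothetical.

```shell
# Hypothetical helper: set watchdog_module="<name>" in an environment file.
# Usage: set_watchdog_module [module] [env-file]
set_watchdog_module() {
    module="${1:-i6300esb}"
    envfile="${2:-/etc/default/watchdog}"   # assumed location, verify on your system
    if grep -q '^watchdog_module=' "$envfile"; then
        # Replace the existing assignment; the original is kept as <file>.bak.
        sed -i.bak "s/^watchdog_module=.*/watchdog_module=\"$module\"/" "$envfile"
    else
        # No assignment yet: append one.
        printf 'watchdog_module="%s"\n' "$module" >> "$envfile"
    fi
}
```

After changing the file, restart wd_keepalive so that the ExecStartPre step runs again with the new value.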

Note: only the i6300esb watchdog device is supported; the iTCO watchdog is not. The softdog software watchdog is also not recommended.

If your instance was deployed a long time ago, you may need to reboot it from the Vultr API or control panel so that the underlying configuration is updated.