Feature request: watchdog / self recovery

Problem

The RM-1 is only useful when it is reachable. If its Ethernet uplink stops working, whether due to a physical link issue, DHCP failure, driver glitch, or network stack problem, the device effectively disappears from the operator’s perspective.

In my experience (having almost lost control of my KVM + machine setup), this is a serious weakness. The device itself is still powered and functional, but because it depends on external connectivity, it cannot recover on its own. In practice, a simple reboot often restores connectivity. Unfortunately, without a built-in recovery mechanism, the device can remain offline indefinitely until someone intervenes physically, which defeats the purpose of having remote management hardware in the first place.

The result is a single point of failure:
A transient network fault can permanently remove remote access.

Proposed behavior

The device should periodically check whether its primary wired interface is still operational and attempt to recover if it is not.

Rather than immediately rebooting, recovery could happen in stages. Many failures are software or negotiation-related and can be resolved without restarting the whole system.

A simple recovery ladder could look like this:

First, determine whether the interface is usable. This can be done locally by checking link carrier and address state:

$ cat /sys/class/net/eth0/carrier   # "1" = link up (the read fails while the interface is administratively down)
$ ip -4 addr show eth0              # should list an inet address

If carrier is down or no usable address exists, attempt a soft reset of the interface:

$ ip link set eth0 down
$ sleep 3
$ ip link set eth0 up

After a short wait, the system should re-evaluate status.
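The re-evaluation can reuse the same two checks wrapped in a single helper, so every stage of the ladder shares one definition of "healthy". A minimal sketch (requiring both link carrier and an IPv4 address is my assumption of "usable"; the interface name is passed as a parameter to keep the helper generic):

```sh
#!/bin/sh
# Sketch: succeed (exit 0) only if the interface has link carrier
# and an IPv4 address. Interface name is taken as an argument.
link_ok() {
  iface="$1"
  # carrier reads "1" when the link is up; the read fails if the
  # interface is down or missing, which we treat as "not healthy".
  [ "$(cat "/sys/class/net/$iface/carrier" 2>/dev/null)" = "1" ] || return 1
  ip -4 addr show dev "$iface" 2>/dev/null | grep -q 'inet '
}
```

Each recovery stage then becomes: act, sleep, `link_ok eth0`, and stop if healthy.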

If the interface is still not functional, the next step would be re-applying the network configuration via connman, for example:

$ connmanctl enable ethernet

If that still doesn’t recover, a stronger step is to restart connman itself. On my RM-1, connmand is managed by inittab respawn (and S45connman doesn’t actually stop/start the daemon), so the practical “restart” is to kill it and let init respawn it:

$ killall connmand
# init’s respawn entry brings connmand back automatically

Only if these recovery attempts fail should the system escalate to a reboot:

$ reboot

Safeguards & Practical Implementation

Reboot escalation needs to be guarded to avoid two failure modes: reboot loops, and disruption of an active virtual USB (ISO) session.

Reboot rate limiting: the watchdog should not reboot repeatedly when the upstream network (router/ISP) is down. Store the last reboot timestamp and only allow a recovery reboot once per window (e.g., 3600 seconds).

LAST_REBOOT_FILE=/etc/kvmd/user/state/net-selfheal/last_reboot
NOW=$(date +%s)
LAST=$(cat "$LAST_REBOOT_FILE" 2>/dev/null || echo 0)

can_reboot() {
  [ $((NOW - LAST)) -gt 3600 ]
}

# Note: write $NOW back to $LAST_REBOOT_FILE just before actually
# rebooting, otherwise the rate limit never takes effect.

Do not reboot when USB mass-storage is active: the RM-1 can act as a USB thumb drive to the controlled PC. If an ISO is mounted/presented, rebooting can break an OS install or corrupt an operation on the remote machine. A practical test is to look at USB gadget configfs: if any mass_storage LUN has a backing file configured, treat it as “active” and suppress reboot.

On the RM-1 I’m using, the gadget is under /sys/kernel/config/usb_gadget/rockchip/...:

usb_storage_active() {
  for f in /sys/kernel/config/usb_gadget/rockchip/functions/mass_storage.*/lun.*/file; do
    [ -e "$f" ] || continue
    backing="$(cat "$f" 2>/dev/null | tr -d '\r\n')"
    case "$backing" in
      ""|"none") continue ;;
      *) return 0 ;;
    esac
  done

  return 1
}

If usb_storage_active returns true, the watchdog may still attempt non-disruptive recovery (interface bounce, network service restart), but must not reboot.
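Combined, the escalation step could look like the sketch below. usb_storage_active and can_reboot are the guard functions described above; do_reboot is a hypothetical thin wrapper around reboot, added here only so the gate can be dry-run with a stub:

```sh
#!/bin/sh
# Sketch of the final escalation gate: only reboot when no guard objects.
do_reboot() { reboot; }  # wrapper so tests/dry runs can stub it

maybe_reboot() {
  if usb_storage_active; then
    echo "net-selfheal: mass-storage active; reboot suppressed" >&2
    return 1
  fi
  if ! can_reboot; then
    echo "net-selfheal: reboot suppressed by rate limit" >&2
    return 1
  fi
  # Record the reboot time *before* rebooting so the rate limit
  # survives the restart.
  date +%s > "${LAST_REBOOT_FILE:-/etc/kvmd/user/state/net-selfheal/last_reboot}"
  do_reboot
}
```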

Suggested overall flow (pseudo-code):

every 10 minutes:
  if eth0 healthy:
     clear failure counter; exit

  attempt interface reset; re-check
  if healthy: exit

  attempt network restart; re-check
  if healthy: exit

  if usb_storage_active: log + exit (no reboot)
  if ping veto succeeds: log + exit (no reboot)
  if reboot rate-limit allows: reboot
  else: log + exit

This design is straightforward to implement today via a cron job (BusyBox crond) and a shell script. I’d love to see this as an official (verified + supported) feature!

Setup I’m using for now

mkdir -p /etc/kvmd/user

cat >/etc/kvmd/user/net-selfheal.sh <<'EOF'
#!/bin/sh
set -eu

IFACE="eth0"
TAG="net-selfheal"

CHECK_V4=1
RECHECK_SLEEP=15
MAX_FAILS=2
REBOOT_COOLDOWN=3600

PING_VETO=1
PING_VETO_IP1="1.1.1.1"
PING_VETO_IP2="8.8.8.8"

USB_GUARD_REQUIRE_UDC=0

PERSIST_STATE_DIR="/etc/kvmd/user/state/net-selfheal"
VOL_STATE_DIR="/tmp/net-selfheal"

FAIL_FILE="$VOL_STATE_DIR/fails"
LAST_REBOOT_FILE="$PERSIST_STATE_DIR/last_reboot"

log() { logger -t "$TAG" "$*"; }

mkdir -p "$PERSIST_STATE_DIR" "$VOL_STATE_DIR"

carrier_up() {
  [ -r "/sys/class/net/$IFACE/carrier" ] || return 1
  [ "$(cat "/sys/class/net/$IFACE/carrier" 2>/dev/null || echo 0)" = "1" ]
}

has_ipv4() {
  ip -4 addr show dev "$IFACE" 2>/dev/null | grep -q ' inet '
}

link_ok() {
  carrier_up || return 1
  [ "$CHECK_V4" -eq 0 ] && return 0
  has_ipv4
}

usb_gadget_connected() {
  [ -r /sys/kernel/config/usb_gadget/rockchip/UDC ] || return 1
  udc="$(cat /sys/kernel/config/usb_gadget/rockchip/UDC 2>/dev/null | tr -d '\r\n')"
  [ -n "$udc" ] && [ "$udc" != "none" ]
}

usb_storage_presented() {
  for f in /sys/kernel/config/usb_gadget/rockchip/functions/mass_storage.*/lun.*/file; do
    [ -e "$f" ] || continue
    backing="$(cat "$f" 2>/dev/null | tr -d '\r\n')"
    case "$backing" in
      ""|"none") continue ;;
      *) return 0 ;;
    esac
  done
  return 1
}

usb_storage_active() {
  usb_storage_presented || return 1
  if [ "$USB_GUARD_REQUIRE_UDC" -eq 1 ]; then
    usb_gadget_connected
  else
    return 0
  fi
}

ping_veto_ok() {
  [ "$PING_VETO" -eq 1 ] || return 1
  ping -c 1 -W 2 "$PING_VETO_IP1" >/dev/null 2>&1 && return 0
  ping -c 1 -W 2 "$PING_VETO_IP2" >/dev/null 2>&1 && return 0
  return 1
}

read_int_file() {
  if [ -f "$1" ]; then
    v="$(cat "$1" 2>/dev/null | tr -dc '0-9')"
    [ -n "$v" ] && { echo "$v"; return; }
  fi
  echo "$2"
}

write_int_file() { echo "$2" >"$1"; }

iface_bounce() {
  ip link set "$IFACE" down >/dev/null 2>&1 || true
  sleep 3
  ip link set "$IFACE" up >/dev/null 2>&1 || true
}

restart_kvmd_network() {
  [ -x /etc/init.d/S46kvmd-network ] || return 1
  /etc/init.d/S46kvmd-network restart >/dev/null 2>&1 || return 1
  return 0
}

restart_connman_real() {
  # On this image connmand is respawned by inittab, and S45connman does not manage it.
  # So: kill connmand; init will restart it.
  if pidof connmand >/dev/null 2>&1; then
    killall connmand >/dev/null 2>&1 || true
  fi
  # Give init/respawn a moment
  sleep 2
  pidof connmand >/dev/null 2>&1
}

# Fast path
if link_ok; then
  oldfails="$(read_int_file "$FAIL_FILE" 0)"
  [ "$oldfails" -ne 0 ] && log "Recovered (fails was $oldfails)."
  write_int_file "$FAIL_FILE" 0
  exit 0
fi

fails="$(read_int_file "$FAIL_FILE" 0)"
fails=$((fails + 1))
write_int_file "$FAIL_FILE" "$fails"
log "Detected degraded link on $IFACE. Failure $fails/$MAX_FAILS. Starting recovery."

# Stage 1: bounce interface
iface_bounce
sleep "$RECHECK_SLEEP"
if link_ok; then
  log "Recovery succeeded after interface bounce."
  write_int_file "$FAIL_FILE" 0
  exit 0
fi

# Stage 2: re-apply network config via kvmd-network (connmanctl config)
restart_kvmd_network || true
sleep "$RECHECK_SLEEP"
if link_ok; then
  log "Recovery succeeded after re-applying kvmd-network configuration."
  write_int_file "$FAIL_FILE" 0
  exit 0
fi

# Stage 3: real connman restart (kill -> inittab respawn)
if restart_connman_real; then
  # After respawn, re-apply config again (DHCP/manual + DNS)
  restart_kvmd_network || true
  sleep "$RECHECK_SLEEP"
  if link_ok; then
    log "Recovery succeeded after restarting connman."
    write_int_file "$FAIL_FILE" 0
    exit 0
  fi
fi

if [ "$fails" -lt "$MAX_FAILS" ]; then
  log "Still degraded; will retry on next scheduled run (no reboot yet)."
  exit 0
fi

if usb_storage_active; then
  log "USB mass-storage is active (ISO/image presented); suppressing reboot."
  exit 0
fi

if ping_veto_ok; then
  log "Ping veto succeeded; suppressing reboot despite degraded local link state."
  exit 0
fi

now="$(date +%s)"
last="$(read_int_file "$LAST_REBOOT_FILE" 0)"
if [ $((now - last)) -lt "$REBOOT_COOLDOWN" ]; then
  log "Reboot suppressed by rate limit (last reboot $((now - last))s ago)."
  exit 0
fi

write_int_file "$LAST_REBOOT_FILE" "$now"
log "Escalating to reboot: link still degraded after recovery attempts."
reboot
EOF

chmod +x /etc/kvmd/user/net-selfheal.sh


mkdir -p /etc/kvmd/user/scripts

cat >/etc/kvmd/user/scripts/S20-net-selfheal-cron.sh <<'EOF'
#!/bin/sh
case "$1" in
  start)
    mkdir -p /var/spool/cron/crontabs
    echo '*/10 * * * * /etc/kvmd/user/net-selfheal.sh' > /var/spool/cron/crontabs/root
    chmod 600 /var/spool/cron/crontabs/root
    pidof crond >/dev/null 2>&1 || crond -S -l 8
    ;;
esac
EOF

chmod +x /etc/kvmd/user/scripts/S20-net-selfheal-cron.sh

# Apply immediately
/etc/kvmd/user/scripts/S20-net-selfheal-cron.sh start

I’d exercise the failure scenarios properly, but I already came this close to losing remote access to my RM-1 once today! :smiley:

Something that may be nice is to generalize this into a broader set of probe options. Being able to run an ICMP ping, a TCP ping, or curl against a set of hosts could open up a whole realm of possible actions beyond the various local reboots: sending a webhook/email notification, sending a WoL packet to a down host, running an arbitrary script stored in the user area, etc.

This may be more flexible considering there are a bunch of people who don't want the device to have Internet access at all and it would let them define "alive" however they want.
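As a sketch of that idea: the "alive" definition could simply be an ordered list of probe commands, where the first success wins. The probe helpers and addresses below are illustrative only, not a proposed API (192.168.178.1 is just a FritzBox-style default gateway):

```sh
#!/bin/sh
# Sketch: user-definable liveness probes. Each probe is a command string;
# the device counts as "alive" if any probe succeeds.
probe_icmp() { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }
probe_tcp()  { nc -z -w 2 "$1" "$2" 2>/dev/null; }            # TCP "ping"
probe_http() { curl -fsS -m 3 -o /dev/null "$1" 2>/dev/null; }

# Run the probes in order; succeed on the first one that succeeds.
any_alive() {
  for p in "$@"; do
    eval "$p" && return 0
  done
  return 1
}

# Example (LAN-only, no Internet access required):
#   any_alive "probe_icmp 192.168.178.1" "probe_tcp 192.168.178.1 443"
```

Users who keep the device off the Internet could then point every probe at local hosts only.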

You mean there's a possibility that your ISP had a fault, and after the fault occurred and the network returned to normal, you couldn't access our device through glkvm.com?

I haven't encountered this on the Comet systems, but I have on my NAS and smart power strip. Sometimes a process on the system fails to the point where the device is no longer remotely accessible. The fear here is that something on the Comet could fail and not be able to recover. What would address that fear is a policy like: "If we can't ping the router or an Internet IP address, we'll force a reboot and hope that restores things." Two examples:

  1. I have an ASUSTOR NAS that doesn't always upgrade well. If it fails to shut down its various services, it just hangs, with no NFS/SMB access to the device. However, while I could still reach the web UI, I was able to disable some DIVs via Chrome's dev tools and enable the SSH server, which allowed me to reboot it.
  2. I have a Digital Loggers smart power strip that is currently a dumb power strip, because it can't detect that its web interface has become inaccessible and reboot the control software on its own. I need to physically reboot the device to restore access before I can figure out what went wrong.

If both of these had better watchdog processes, they could have rebooted themselves and obviated the need for a hacky fix on the NAS or a truck roll for the power distribution unit. I haven't seen anything similar for the Comet, but a better understanding of the existing watchdogs, and enhancements to those procedures, would help in judging the reliability of the solution.


The problem was at the LAN level, not WAN. The RM-1 lost its wired Ethernet connection to my local router (FritzBox), not its internet access.

Here’s what I know:

  • The router stopped detecting the RM-1 on the LAN port entirely.
  • RM-1 was not pingable on the local network.
  • RM-1 had no IP assignment (not visible in the router’s client list or from other machines).
  • Device was unreachable via Tailscale (because it had no network connectivity at all).
  • My other machine on the same switch/wall socket continued working normally.
  • No power loss (both devices on same outlet, other machine isn’t configured to wake on reboot and remained up)
  • ISP/internet was fine (other devices had connectivity)

The fix:

  • I changed the network link speed setting on the fritzbox port from "Automatic (1Gbps)" to "100 Mbit/s" for the RM-1. The device immediately reconnected and got an IP.
  • I did not physically change anything before it broke, nor to fix it. I couldn’t remotely reboot the RM-1 either.

It appears to be a link negotiation issue between the RM-1's NIC and the FritzBox after some network event (I'm not sure it was an ISP fault). Changing the speed setting forced the link to renegotiate, which is probably what actually helped; I doubt the "link speed" value itself was the cause or the fix.


Your solution is very enlightening, and we will investigate it.