WireGuard working setup stops working, and changing the listen port on the disconnected peer restores it

I have a WireGuard "server" (s2s) and it works very well…

But sometimes a client peer gets disconnected (the internet connection is wireless and sometimes has very bad quality, so timeouts occur - it depends on the weather) and then the peer will not reconnect.

I've no idea why this occurs; restarting the WireGuard interface on the client peer makes it work immediately (/etc/init.d/gl_s2s restart).
Manually executing the wireguard_watchdog, or waiting (hours), does not help.

I think it works after the interface restart because the listen port of the client peer gets changed, and then a reconnection is possible(?)
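If that theory is right, it should be possible to force a fresh source port without a full interface restart. A minimal sketch, assuming the client interface is called wg0 (adjust the name to your setup); changing the listen port at runtime is standard wg(8) functionality:

```shell
# Pick a random high port; changing the listen port changes the UDP
# source port of the handshakes, so NAT/stateful firewalls on the path
# see a brand-new flow instead of the stale one.
newport=$((20000 + RANDOM % 20000))

# Only attempt the change if wg and the wg0 interface actually exist.
if command -v wg >/dev/null 2>&1 && wg show wg0 >/dev/null 2>&1; then
    wg set wg0 listen-port "$newport"
    wg show wg0 listen-port
else
    echo "wg0 not present; would have set listen-port $newport"
fi
```

If this alone restores the tunnel, it would confirm that the stale NAT/firewall session, not WireGuard itself, is the culprit.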

What can be the issue here?
Do I see logs anywhere on the WireGuard server / client side? I cannot see any information within ssh / logread…

I’ve seen that:

The Handshake is sent to the destination address:

Client-Side

09:28:54.857415 IP client.49542 > server.51820: UDP, length 148

Server-Side:

09:31:24.161261 IP client.49542 > server.51820: UDP, length 148
09:31:24.175533 IP server.51820 > client.49542: UDP, length 92 [this is missing on the client-side....]

In this situation I have the correct entry in the conntrack table of the firewall on the client side:

root@FIREWALL:~# conntrack -L | grep 49542
conntrack v0.9.14 (conntrack-tools): udp      17 26 src=192.168.2.122 dst=##server-ip## sport=49542 dport=51820 [UNREPLIED] src=##server-public-ip## dst=##client-public-ip## sport=51820 dport=49542 mark=1694498816 use=1
109 flow entries have been shown.

Also, when flushing the conntrack table on the client side, the entry gets created again:

conntrack v0.9.14 (conntrack-tools): 166 flow entries have been shown.
udp      17 29 src=192.168.2.122 dst=##server-public-ip## sport=49542 dport=51820 [UNREPLIED] src=##server-public-ip## dst=##client-public-ip## sport=51820 dport=49542 mark=1694498816 use=1
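Side note: instead of flushing the whole conntrack table, conntrack-tools can delete just the stale WireGuard flow with a filtered `-D`. A sketch using the ports from the output above (needs root):

```shell
sport=49542   # client source port, taken from the conntrack output above
dport=51820   # WireGuard server listen port
if command -v conntrack >/dev/null 2>&1; then
    # -D deletes matching entries; the filter keys off the original
    # direction's source/destination ports, so only this flow is hit.
    conntrack -D -p udp --orig-port-src "$sport" --orig-port-dst "$dport" \
        || echo "no matching flow to delete"
else
    echo "conntrack-tools not installed"
fi
```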

It’s really weird…

Unfortunately wireguard has no logs.
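That is not quite true: the kernel module can log, it is just silent by default. On kernels built with CONFIG_DYNAMIC_DEBUG you can enable WireGuard's debug messages, which then land in the kernel log (visible via dmesg, or logread on OpenWrt). A sketch, needs root:

```shell
ctl=/sys/kernel/debug/dynamic_debug/control
if [ -w "$ctl" ]; then
    # Enable all debug prints in the wireguard module
    echo "module wireguard +p" > "$ctl"
    # Handshake and keepalive events now appear in the kernel log
    dmesg | grep -i wireguard | tail -n 20
else
    echo "dynamic debug not available on this kernel"
fi
```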

Should we make wireguard change ports automatically?

In my case it would resolve the problem. The problem is on the remote side, where we have the VPN server installed; it sits behind an existing Barracuda firewall. When we clear the UDP session on the Barracuda firewall, the connection works immediately.
The problem is that the session never expires, because the client keeps sending handshakes, but the response never gets back to it (probably because of a session-id issue).

Currently we have worked around it with a temporary script that runs every X minutes via crontab:

#!/bin/bash
# Packets Loss Watch
# Simple SHELL script for Linux and UNIX system monitoring with
# ping command
#
# Copyright (c) 2006 nixCraft project <http://www.cyberciti.biz/fb/>
# Copyleft 2013 Stephen Larroque
# This script is licensed under GNU GPL version 2.0 or above
#
# This script was inspired by a nixCraft script http://www.cyberciti.biz/tips/simple-linux-and-unix-system-monitoring-with-ping-command-and-scripts.html
#
# For more complex needs, take a look at:
# - SmokePing: http://oss.oetiker.ch/smokeping/
# - DropWatch: http://humblec.com/dropwatch-to-see-where-the-packets-are-dropped-in-kernel-stack/
# - sjitter: http://www.nicolargo.com/dev/sjitter/
# - iperf: http://iperf.fr/
# -------------------------------------------------------------------------

#=== PARAMETERS change them here
# add ip / hostname separated by white space
HOSTS="192.168.50.70"
# number of ping requests
COUNT=8

#=== Local vars (do not change them)
# Cron-friendly: automatically change directory to the current one
cd "$(dirname "$0")"

# Current script filename
SCRIPTNAME=$(basename "$0")

# Current date and time
today=$(date '+%Y-%m-%d')
currtime=$(date '+%H:%M:%S')

#=== Help message
if [[ "$@" =~ "--help" ]]; then
  echo "Usage: bash $SCRIPTNAME
Check the rate of packets loss and output the result in a file named plwatch.txt in the same directory as this script.
Note: this script is cron-friendly, so you can add it to a cron job to regularly check your packets loss.
"
	exit
fi

#=== Main script
for myHost in $HOSTS
do
  # Ping once and reuse the output, instead of pinging twice
  pingout=$(ping -c "$COUNT" "$myHost")
  msg=$(echo "$pingout" | grep 'loss')
  echo "[$today $currtime] ($myHost $COUNT) $msg" >> /root/plwatch.txt
  count=$(echo "$pingout" | grep 'received' | awk -F',' '{ print $2 }' | awk '{ print $1 }')
  if [ "${count:-0}" -eq 0 ]; then
    # 100% failed
    echo "Host : $myHost is down (ping failed) at $(date)"
    echo "[$today $currtime] ($myHost $COUNT) Host is down, restart gl_s2s tunnel" >> /root/plwatch.txt
    /etc/init.d/gl_s2s restart
    echo "gl_s2s tunnel restarted"
    ubus call mqtt pub '{"api":"/user/data", "data":"gl_s2s tunnel restarted"}'
  else
    echo "tunnel is up - $count pings succeeded - so do not do anything"
  fi
done
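For completeness, the crontab entry for this workaround might look like the following (the path and the 5-minute interval are just examples):

```
# /etc/crontabs/root on OpenWrt - run the watchdog every 5 minutes
*/5 * * * * /root/plwatch.sh >/dev/null 2>&1
```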

@alzhao

What I have also detected now is that the server keeps sending keepalive requests when a client gets disconnected.

We see this on wg0 (the WireGuard server interface) and also on wg1 (the S2S WireGuard server):

interface: wg0
  public key: xxxxxxxxxxxxxxxxxxxxx=
  private key: (hidden)
  listening port: 51820
peer: xxxxxxx=
  endpoint: xxx.xxx.xxx.xxx:49543
  allowed ips: 10.0.0.3/32
  latest handshake: 6 days, 1 hour, 2 minutes, 43 seconds ago
  transfer: 27.05 KiB received, 11.69 MiB sent
  persistent keepalive: every 25 seconds

If we launch a tcpdump on the source port, we see that the server is still sending keepalive requests to this remote IP address:

root@FW-VPNGW:~# tcpdump port 49542
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-lan, link-type EN10MB (Ethernet), capture size 262144 bytes
19:38:28.033866 IP console.gl-inet.com.51820 > xxx.xxx.xxx.xxx.49542: UDP, length 148
19:38:33.068402 IP console.gl-inet.com.51820 > xxx.xxx.xxx.xxx.49542: UDP, length 148
19:38:38.353968 IP console.gl-inet.com.51820 > xxx.xxx.xxx.xxx.49542: UDP, length 148
19:38:44.108667 IP console.gl-inet.com.51820 > xxx.xxx.xxx.xxx.49542: UDP, length 148

In this case the session on the NAT firewall is kept open, because the firewall still sees this keepalive traffic. So even if we have been disconnected for 6 days, we cannot reuse this source port to reconnect (in an S2S scenario).

So the question is:

  • Why does the server keep sending keepalives even when the client does a "clean" disconnect?
  • Isn't it enough if only the client sends these keepalive requests? Does the server also need to send them?

I also found this thread on reddit

In your case I see that you always set PersistentKeepalive to a fixed value of 25 - regardless of whether it's a server or a client:

wireguard_server startup script:

        echo -e "PersistentKeepalive = 25\n" >>"$WFILE"

gl_s2s startup script (here it is a config variable, but it's not set on the s2s node):

        config_get keepalive     "${section}" "keepalive"
        [ -n "${keepalive}" ] && echo "PersistentKeepalive = ${keepalive}" >> "${wg_cfg}"

@alzhao can you check my latest questions please?
Is there any reason on enabling persistent keepalives on the server-side?

Sorry, I don't know either.

@Riho-shuu can you check?

I cannot answer for GL.iNet, but I am running 3 cloud VPSes as WireGuard VPNs for friends and family. They are not behind a NAT (they all have real IPv4 addresses), and I turned off PersistentKeepalive on all of them months ago without any issues, as I don't want any extra chatter giving away what they are doing. I do run PersistentKeepalive on the client side, as most of the time my clients are behind a NAT.
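To make that split concrete, here is roughly how it looks in wg-quick style configs (interface names, keys and addresses below are placeholders):

```
# Client-side wg0.conf - the client sits behind NAT, so it is the one
# that must keep the mapping in the NAT device alive:
[Peer]
PublicKey = <server-public-key>
Endpoint = vpn.example.com:51820
AllowedIPs = 10.0.0.0/24
PersistentKeepalive = 25

# Server-side wg0.conf - publicly reachable, no NAT in front of it,
# so PersistentKeepalive is simply omitted from its [Peer] sections.
```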

Sorry to hijack the thread, but how do you resolve DNS? I tried setting up Stubby on my VPS (Ubuntu 18.04) but just can't get it to work with WireGuard (the WireGuard part works fine).

PS. Curious as to why you need 3 VPSes and not just the one!

My VPSes were running Ubuntu 18.04, now upgraded to 20.04, but both OS versions work fine as WireGuard servers. I use a free DDNS service on each of the VPSes to give me a domain name, so I am not using the IP address directly, as the IP address may change over time. I use the DDNS domain names on my WireGuard clients, including GL.iNet routers running an assortment of firmware versions. It just works. I have also tested macOS, Windows 10, and Android WireGuard clients with my VPS servers, and they all work.

Why 3? When they are free, why not? Two are free Oracle VPSs, and the other is a free Google VPS. I prefer the Oracle free servers, as they give you two better equipped servers with up to 10 TB per month of data transfer. Google is stingier with resources on their free tier, but I always like having a backup system :grinning:

@Riho-shuu can you give me feedback about the keepalive on server-side and if it can be removed in future updates?

My apology for the late response.

I have checked the wireguard_server startup script you mentioned; that command line is used for peers, but not for the server.

From my testing, when the client loses internet I can see requests from the client:
[screenshot omitted]

But the server is not sending the requests:
[screenshot omitted]

@Riho-shuu

Executing wg on the Wireguard-Server i see this config:

root@VPNGW:~# wg
interface: wg0
  public key: ****
  private key: (hidden)
  listening port: 51820

peer: ****
  endpoint: ****:57539
  allowed ips: 10.0.0.3/32
  latest handshake: 41 seconds ago
  transfer: 2.68 MiB received, 61.21 MiB sent
  persistent keepalive: every 25 seconds

peer: ****
  allowed ips: 10.0.0.2/32
  persistent keepalive: every 25 seconds

So here we see that the persistent keepalive is also set server-side. We configured the server from the GL.iNet GUI.

On the s2s this is not set.

It doesn't seem like the server-side keepalive is helping. Once that keepalive times out, I suspect (and I'm not clear here) that the server needs to drop the connection. I've seen both a Windows client and a GL.iNet client time out to the same GL.iNet server in the same manner. Increasing the frequency of keepalive packets from the client side did not seem to help.

Note: this is a NAT client - the keepalive from the server is likely to keep pinging the client forever; where that NAT client is quite likely to loose access to the ListenPort if it sits idle. One would think the keep alive would keep that UDP alive - but that does not seem to be the case, at least well enough to use gl.inet without logging into the admin page every now and then to bump the listenport.