LAN DNS resolution works, until it doesn't

BLUF: 2-3 times a day, LAN host name resolution will stop working. I have to log into the server and run a command, after which everything works for a while. This always happens overnight; it usually happens at some point during the day, as well.

I have a GL-iNet AX1800 running in router mode. The router claims it's running OpenWrt 21.02-SNAPSHOT r16399+173-c67509efd7. I have it configured to connect to Mullvad via WireGuard, and have a couple of LAN machines set to static IPs but am otherwise letting DHCP do its thing. I'm also excluding a small group of IPs to bypass Mullvad, because they're work-related VPN endpoints (no need to VPN-over-VPN). Finally, I've poked around in the shell to add some LAN CNAMEs.

For a long while, I struggled to get named LAN hosts resolved consistently by the router; it was unreliable at best. I tried enough things that I can't remember all of them; one thing I think I did was replace dnsmasq with dnsmasq-full, and that's what's currently installed. For a while this seemed to work fine, and then I think I got a firmware update and my troubles started.

WAN DNS resolution happens over WireGuard via Mullvad's DNS servers -- as I want it to. This works consistently, all the time. However, every day, 1-3 times a day, LAN host resolution stops working. It happens every night, and sometimes during the day; I don't have a feel for the periodicity.

When LAN resolution starts failing, my work-around is to ssh into the server and run route_policy 3. I've narrowed it down to this via tracing:

  • /etc/init.d/vpnpolicy-apply restart fixed it, so I traced that and found that
  • both /usr/bin/vpn_domain_update.sh and /usr/bin/route_policy were being called, which meant my proxy_mode must be "3", which led me to trying
  • calling vpn_domain_update.sh, which didn't fix the issue, so trying
  • route_policy 3, which does fix it.

This is as far as I've traced it; I suspect that it's not the firewall changes that the script is doing, but rather the ipset changes and set_domain_policy() shell function that resolve the issue.

I still have no idea what's causing the LAN host resolution to consistently, periodically fail.
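One way to pin down when the failures start would be to log a timestamp each time a known LAN name is checked. This is only a minimal sketch: it assumes nslookup is available on the router and exits non-zero when a lookup fails, and myhost.lan, 192.168.8.1, LOG_FILE and LOOKUP_CMD are placeholder names I made up for illustration, not anything from the firmware.

```shell
# Sketch: record when a LAN name stops resolving, to help find the failure times.
# NAME, SERVER, LOG_FILE and LOOKUP_CMD are placeholders; adjust for your network.
NAME="${NAME:-myhost.lan}"
SERVER="${SERVER:-192.168.8.1}"
LOG_FILE="${LOG_FILE:-/tmp/lan-dns-check.log}"
LOOKUP_CMD="${LOOKUP_CMD:-nslookup}"

# Return 0 if the name resolves via the given server, non-zero otherwise.
# Assumes the lookup command exits non-zero on failure.
lan_resolves() {
    $LOOKUP_CMD "$1" "$2" >/dev/null 2>&1
}

# Append one timestamped OK/FAIL line to the log for the given name.
check_once() {
    if lan_resolves "$1" "$2"; then
        status=OK
    else
        status=FAIL
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $status $1" >> "$LOG_FILE"
}

# Usage (for example from cron, once a minute):
#   check_once "$NAME" "$SERVER"
```

The first FAIL line after a run of OK lines would give an approximate start time for each outage.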

I also wonder if it's odd that there are 4 dnsmasq processes running:

 5668 root      2704 S    /usr/sbin/dnsmasq -C /etc/dnsmasq.conf.vpn -x /var/run/dnsmasq/dnsmasq.vpn.pid --server=193.138.219.228 --no-resolv
 5669 root      2676 S    /usr/sbin/dnsmasq -C /etc/dnsmasq.conf.vpn -x /var/run/dnsmasq/dnsmasq.vpn.pid --server=193.138.219.228 --no-resolv
 6002 dnsmasq   2724 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
 6007 root      2692 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid

Each pair looks like intentional forking, but the fact that the two pairs use different configurations looks suspicious. I do know that DNS resolution still works if I call service dnsmasq stop, because that kills only the non-VPN instances; with only the VPN pair running, WAN and LAN resolution still work fine. However, even with service dnsmasq disable, vpnpolicy-apply and/or route_policy 3 start all four when they are run.
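To see at a glance how the processes group by configuration file, the ps output can be summarised. A minimal sketch, assuming the config path always follows a -C argument as in the listing above; the function name is just illustrative:

```shell
# Sketch: count dnsmasq processes per configuration file.
# Reads `ps` output on stdin and prints "<config path> <count>" per line.
count_dnsmasq_by_config() {
    awk '
        /dnsmasq/ {
            # The config file is the argument that follows -C.
            for (i = 1; i < NF; i++)
                if ($i == "-C") count[$(i + 1)]++
        }
        END { for (c in count) print c, count[c] }
    ' | sort
}

# Usage on the router:
#   ps w | count_dnsmasq_by_config
```

For the four processes listed above, this prints /etc/dnsmasq.conf.vpn and /var/etc/dnsmasq.conf.cfg01411c with a count of 2 each.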

  • I've tried stopping and disabling the dnsmasq service. Indeed, it kills the non-vpn-config pair, and both LAN and WAN DNS resolution continue to work without them. However, it doesn't prevent the issue occurring, and it just gets started back up by /usr/bin/route_policy when I run service vpnpolicy-apply restart.
  • I've renamed /etc/init.d/dnsmasq. This causes /usr/bin/route_policy to complain, does prevent the second set of dnsmasq instances from running, and it leaves LAN/WAN DNS resolution in a working state -- but it doesn't stop the issue from happening, and all it does is prevent 2 of the dnsmasq instances from running.

Thanks,

Hi, are you using stable firmware 4.6.8?
How's the setup? Is it easy to reproduce the issue? I'd like to analyze it locally.

Hi!

Yes, firmware 4.6.8, release 1.

What can I tell you about the setup? It's in router mode, and connected 24/7 to Mullvad via WireGuard.

I think the problem may happen only twice a day; I haven't been able to nail down an exact time. Calling /usr/bin/route_policy 3 fixes it, 100% of the time. uci -q get vpnpolicy.route_policy.proxy_mode is indeed "3".
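As a stopgap, the manual fix can be wrapped so it always uses whatever proxy_mode is configured. This is only a sketch built from the two commands mentioned here; UCI_CMD and ROUTE_POLICY_CMD are illustrative variables I added so the commands can be swapped out when trying it away from the router.

```shell
# Sketch: re-run route_policy with the currently configured proxy_mode.
# UCI_CMD and ROUTE_POLICY_CMD are illustrative overrides, not firmware settings.
UCI_CMD="${UCI_CMD:-uci}"
ROUTE_POLICY_CMD="${ROUTE_POLICY_CMD:-/usr/bin/route_policy}"

reapply_route_policy() {
    # Read the mode; give up if uci fails or returns nothing.
    mode="$($UCI_CMD -q get vpnpolicy.route_policy.proxy_mode)" || return 1
    [ -n "$mode" ] || return 1
    $ROUTE_POLICY_CMD "$mode"
}

# Usage on the router:
#   reapply_route_policy
```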

As I said, I don't know how it got into this situation. I messed around with the settings a lot at first because I couldn't get LAN addresses to resolve. As I said above, I'm pretty sure that I replaced dnsmasq with dnsmasq-full to try to fix it. And, indeed, now I can resolve both LAN and WAN addresses; except that twice a day, every day, LAN resolution stops working and I have to log in and manually call route_policy 3.

I have only one router; I don't know how I can describe what is needed to cause the issue since I don't know exactly when or what is causing it to stop working.

I found the root cause: the dual-dnsmasq approach has a fault. The VPN instance fails to sync the host names handed out by the router itself (/tmp/dhcp.leases).
Here is the workaround; run these commands:

# needed for 4.6.8; not needed from 4.7 onward
sed -i 's/local=\/lan/local=\/lan_chgd/' /etc/dnsmasq.conf.vpn
# needed for 4.6 and 4.7
echo -e "\nserver=/lan/127.0.0.1#53" >>/etc/dnsmasq.conf.vpn

# source route_policy into the current shell, then call handle_dns to apply the change
. /usr/bin/route_policy
handle_dns

The idea is to forward hostname.lan DNS queries to the non-VPN dnsmasq.

This workaround has a side effect: a host name will no longer resolve if the host is on the VPN server's side of the network.
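Note that the sed and echo commands are not safe to run twice: the substitution would match the already-renamed line again, and the echo would append a duplicate. A minimal sketch of a re-runnable variant follows, assuming the line in the file is exactly local=/lan/ at the start of a line; ensure_line and apply_lan_forward are names made up for this example.

```shell
# Sketch: apply the workaround so that re-running it does not stack changes.
# CONF defaults to the path used above; override it to try on a copy first.
CONF="${CONF:-/etc/dnsmasq.conf.vpn}"

# Append a line to a file only if that exact line is not already present.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null && return 0
    # Add a newline first if the file does not already end with one.
    [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ] && echo >> "$1"
    printf '%s\n' "$2" >> "$1"
}

apply_lan_forward() {
    # Rename the local=/lan/ domain (4.6.8 only). Assumes the line reads exactly
    # local=/lan/; the anchored pattern does not match the renamed line again.
    sed -i 's#^local=/lan/#local=/lan_chgd/#' "$1"
    # Forward .lan queries to the main dnsmasq on 127.0.0.1 port 53.
    ensure_line "$1" "server=/lan/127.0.0.1#53"
}

# Usage on the router, followed by the route_policy / handle_dns step:
#   apply_lan_forward "$CONF"
```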

Thank you! That makes sense, although I don't understand all of it.

Is the fact that I have 4 dnsmasqs running unusual? Why is that the case -- is it because I replaced dnsmasq with dnsmasq-full? Was there a different way to get both LAN and VPN DNS server working together with only a configuration change to dnsmasq?

I like these products -- I had an Opal that I liked so well I bought an Onyx, and I liked that so much I replaced my ASUS router with an AX1800. I suspect if I need more routers or choose to upgrade in the future, it'll be a GL-iNet router, so I'd like to understand how I should have addressed the LAN lease resolution issue -- could I have done it only through the UI?

That's expected. We implemented it to separate DNS traffic. In firmware 4.8 it will change to use dnsmasq's multiple-instance feature instead of starting a process manually.
dnsmasq-full should be the default one, so that's not an issue.

Further configuration is possible:

# also forward .lan DNS queries to the server side:
echo -e "\nserver=/lan/10.0.0.1#53" >>/etc/dnsmasq.conf.vpn

The full command is:

# needed for 4.6.8; not needed from 4.7 onward
sed -i 's/local=\/lan/local=\/lan_chgd/' /etc/dnsmasq.conf.vpn

# needed for 4.6 and 4.7: forward .lan DNS queries to the main dnsmasq
echo -e "\nserver=/lan/127.0.0.1#53" >>/etc/dnsmasq.conf.vpn
# also forward .lan DNS queries to the server side; change 10.0.0.1 to the server's tunnel IP
echo -e "\nserver=/lan/10.0.0.1#53" >>/etc/dnsmasq.conf.vpn

# source route_policy into the current shell, then call handle_dns to apply the change
. /usr/bin/route_policy
handle_dns
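Because each echo appends to the file, running the commands more than once leaves duplicate lines. A small sketch for checking which .lan forwarders ended up in the file; lan_forwarders is an illustrative name, not part of the firmware:

```shell
# Sketch: list where a dnsmasq config file forwards .lan queries.
# Prints one "address#port" target per server=/lan/... line.
lan_forwarders() {
    sed -n 's#^server=/lan/##p' "$1"
}

# Usage on the router:
#   lan_forwarders /etc/dnsmasq.conf.vpn
```

After the full set of commands has been run once, this should list 127.0.0.1#53 and the server tunnel address, one per line.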

Sorry, for now this can only be done by command. We'll fix it in future releases.

I'm back to confirm that since I made the change, the issue has completely disappeared. It works perfectly, so thanks again!