LAN DNS resolution works, until it doesn't

sxan · February 22, 2025, 6:43pm

BLUF: 2-3 times a day, LAN host name resolution will stop working. I have to log into the server and run a command, after which everything works for a while. This always happens overnight; it usually happens at some point during the day, as well.

I have a GL-iNet AX1800 running in router mode. The router claims it's running OpenWrt 21.02-SNAPSHOT r16399+173-c67509efd7. I have it configured to connect to Mullvad via Wireguard, and have a couple of LAN machines set to static IPs but am otherwise letting DHCP do its thing. I'm also excluding a small group of IPs to bypass Mullvad, because they're work-related VPN endpoints (no need to VPN-over-VPN). Finally, I've poked around in the shell to add some LAN CNAMEs.

For a long while, I was struggling to get named LAN hosts resolved by the router; it was inconsistent at best. I tried enough things that I can't remember all of them; one thing I think I did do was replace dnsmasq with dnsmasq-full, and that's what's currently installed. For a while this seemed to work fine, and then I think I got a firmware update and my troubles started.

WAN DNS resolution happens over Wireguard via Mullvad's DNS servers -- as I want it to. This works consistently, all the time. However, every day, 1-3 times a day, LAN host resolution stops working. It happens every night, and sometimes during the day; I don't have a feel for the periodicity.

When LAN resolution starts failing, my work-around is to ssh into the server and run route_policy 3. I've narrowed it down to this via tracing:

/etc/init.d/vpnpolicy-apply restart fixed it, so I traced that and found that both /usr/bin/vpn_domain_update.sh and /usr/bin/route_policy were being called, which meant my proxy_mode must be "3". That led me to try calling vpn_domain_update.sh on its own, which didn't fix the issue, and then route_policy 3, which does fix it.

This is as far as I've traced it; I suspect that it's not the firewall changes that the script is doing, but rather the ipset changes and set_domain_policy() shell function that resolve the issue.

I still have no idea what's causing the LAN host resolution to consistently, periodically fail.
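
To get a handle on the periodicity, a small probe run from cron could log a timestamp each time a LAN name fails to resolve. This is only a sketch: probe_once, LOOKUP_CMD and PROBE_LOG are made-up names, and it assumes the lookup command exits non-zero when resolution fails, which is worth checking against this firmware's nslookup.

```shell
#!/bin/sh
# Hypothetical probe: append a timestamped line whenever a LAN name
# fails to resolve. All names below are illustrative, not part of the firmware.

LOOKUP_CMD="${LOOKUP_CMD:-nslookup}"            # assumed to exit non-zero on failure
PROBE_LOG="${PROBE_LOG:-/tmp/lan-dns-probe.log}"

probe_once() {
    # $1 = host name to resolve, $2 = DNS server to ask
    if "$LOOKUP_CMD" "$1" "$2" >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') FAIL $1 via $2" >>"$PROBE_LOG"
    return 1
}
```

Called once a minute as probe_once somehost.lan 192.168.8.1, with both arguments replaced by a real LAN host name and the router's LAN address, the log would show when each outage begins.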

I also wonder if it's odd that there are 4 dnsmasq processes running:

 5668 root      2704 S    /usr/sbin/dnsmasq -C /etc/dnsmasq.conf.vpn -x /var/run/dnsmasq/dnsmasq.vpn.pid --server=193.138.219.228 --no-resolv
 5669 root      2676 S    /usr/sbin/dnsmasq -C /etc/dnsmasq.conf.vpn -x /var/run/dnsmasq/dnsmasq.vpn.pid --server=193.138.219.228 --no-resolv
 6002 dnsmasq   2724 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
 6007 root      2692 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid

The pairs look like intentional forking, but the fact that each pair is using a different configuration looks suspicious. I do know that DNS resolution still works if I call service dnsmasq stop, because that kills only the non-VPN instances; with only the VPN instances left running, WAN and LAN resolution still work fine. However, even with service dnsmasq disable, vpnpolicy-apply and/or route_policy 3 start all four when they are run.

I've tried stopping and disabling the dnsmasq service. Indeed, it kills the non-vpn-config pair, and both LAN and WAN DNS resolution continue to work without them. However, it doesn't prevent the issue occurring, and it just gets started back up by /usr/bin/route_policy when I run service vpnpolicy-apply restart.
I've renamed /etc/init.d/dnsmasq. This causes /usr/bin/route_policy to complain and does prevent the second pair of dnsmasq instances from running, and it leaves LAN/WAN DNS resolution in a working state -- but it doesn't stop the issue from happening; all it does is prevent 2 of the dnsmasq instances from running.
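
For the process listing above, the pairing is easier to see when the processes are grouped by the config file passed with -C. A minimal sketch; count_by_conf is a made-up helper name, not something shipped with the firmware:

```shell
# Hypothetical helper: read ps output on stdin and count dnsmasq
# processes per config file (the argument that follows -C).
count_by_conf() {
    awk '/dnsmasq/ {
        for (i = 1; i < NF; i++)
            if ($i == "-C") n[$(i + 1)]++
    }
    END { for (c in n) print n[c], c }' | sort -k2
}
```

Fed with something like ps | count_by_conf on the router, the four processes listed above would come out as a count of 2 next to each of the two config files.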

Thanks,

hansome · March 2, 2025, 4:50am

hi, are you using stable firmware 4.6.8?
How's the setup? Is it easy to reproduce the issue? I'd like to analyze it locally.

sxan · March 2, 2025, 6:44pm

Hi!

Yes, firmware 4.6.8, release 1.

What can I tell you about the setup? It's in router mode, and connected 24/7 to Mullvad via Wireguard.

I think the problem may happen only twice a day; I haven't been able to nail down an exact time. Calling /usr/bin/route_policy 3 fixes it, 100% of the time. uci -q get vpnpolicy.route_policy.proxy_mode is indeed "3".

As I said, I don't know how it got into this situation. I messed around with the settings a lot at first because I couldn't get LAN addresses to resolve. As I said above, I'm pretty sure that I replaced dnsmasq with dnsmasq-full to try to fix it. And, indeed, now I can resolve both LAN and WAN addresses; except that twice a day, every day, LAN resolution stops working and I have to log in and manually call route_policy 3.

I have only one router; I don't know how I can describe what is needed to cause the issue since I don't know exactly when or what is causing it to stop working.

hansome · March 4, 2025, 12:46pm

I found the root cause: the dual-dnsmasq setup has a fault. The VPN instance fails to sync the hostnames that the router itself hands out (/tmp/dhcp.leases).
Here is the workaround. Run these commands:

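# needed for 4.6.8, not needed from 4.7 on:
# rename the local=/lan entry so the VPN dnsmasq stops treating .lan as its own
sed -i 's/local=\/lan/local=\/lan_chgd/' /etc/dnsmasq.conf.vpn
# needed for 4.6 and 4.7: forward .lan DNS queries to the main dnsmasq on port 53
echo -e "\nserver=/lan/127.0.0.1#53" >>/etc/dnsmasq.conf.vpn

# source route_policy into the current shell, then call its handle_dns function
. /usr/bin/route_policy
handle_dns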

The idea is to forward hostname.lan DNS queries to the non-VPN dnsmasq.

This workaround has a side effect: a hostname will stop resolving if the host is on the VPN server's side of the network.
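
One caveat when re-running the commands above: the sed pattern also matches an already renamed entry, and each echo appends another copy of the server line. The following is a sketch of a repeatable variant, not the vendor's method. patch_vpn_conf is a made-up name, and it assumes the entry in the file reads exactly local=/lan/ with a trailing slash. The route_policy / handle_dns step is still needed afterwards.

```shell
# Hypothetical, repeatable variant of the two config edits.
# Assumes the existing entry is exactly "local=/lan/".
patch_vpn_conf() {
    conf="$1"                              # e.g. /etc/dnsmasq.conf.vpn
    fwd='server=/lan/127.0.0.1#53'

    # Rename only the exact local=/lan/ entry; an already renamed
    # local=/lan_chgd/ entry no longer matches, so a second run is a no-op.
    sed -i 's/local=\/lan\//local=\/lan_chgd\//' "$conf"

    # Append the forwarding line only if it is not already present.
    grep -qxF "$fwd" "$conf" || printf '\n%s\n' "$fwd" >>"$conf"
}
```

It would be called as patch_vpn_conf /etc/dnsmasq.conf.vpn, ideally after taking a copy of the file.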

sxan · March 4, 2025, 2:17pm

Thank you! That makes sense, although I don't understand all of it.

Is the fact that I have 4 dnsmasqs running unusual? Why is that the case -- is it because I replaced dnsmasq with dnsmasq-full? Was there a different way to get both LAN and VPN DNS servers working together with only a configuration change to dnsmasq?

I like these products -- I had an Opal that I liked so well I bought an Onyx, and I liked that so much I replaced my ASUS router with an AX1800. I suspect if I need more routers or choose to upgrade in the future, it'll be a GL-iNet router, so I'd like to understand how I should have addressed the LAN lease resolution issue -- could I have done it only through the UI?

hansome · March 5, 2025, 2:34am

That's expected. We implemented it for DNS traffic separation. In firmware 4.8 it will change to use dnsmasq's multiple-instance feature instead of bootstrapping a process manually.
dnsmasq-full should be the default one, so that's not an issue.

Further configuration is possible:

#also forward .lan DNS query to server side:
echo -e "\nserver=/lan/10.0.0.1#53" >>/etc/dnsmasq.conf.vpn

The full command is:

#needed for 4.6.8, not needed from 4.7 on
sed -i 's/local=\/lan/local=\/lan_chgd/' /etc/dnsmasq.conf.vpn

#needed for 4.6 and 4.7, forward .lan DNS query to main dnsmasq
echo -e "\nserver=/lan/127.0.0.1#53" >>/etc/dnsmasq.conf.vpn
#also forward .lan DNS query to server side, change 10.0.0.1 to server tunnel ip.
echo -e "\nserver=/lan/10.0.0.1#53" >>/etc/dnsmasq.conf.vpn

. /usr/bin/route_policy 
handle_dns

Sorry, for now this can only be done from the command line. We'll fix it in future releases.
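
After editing, a quick check of the file can confirm it ended up in the intended state. In dnsmasq, a local=/lan/ line keeps .lan queries inside that instance, while server=/lan/... lines forward them elsewhere. The sketch below only inspects the file; check_vpn_conf is a made-up name and it assumes the same exact local=/lan/ spelling as above.

```shell
# Hypothetical check: succeed only if the file no longer has a
# local=/lan/ entry and has at least one server=/lan/... line.
check_vpn_conf() {
    conf="$1"                              # e.g. /etc/dnsmasq.conf.vpn
    if grep -q '^local=/lan/' "$conf"; then
        echo "local=/lan/ still present: .lan queries are not forwarded"
        return 1
    fi
    if ! grep -q '^server=/lan/' "$conf"; then
        echo "no server=/lan/... line found"
        return 1
    fi
    # Print the forwarders that .lan queries will be sent to.
    grep '^server=/lan/' "$conf"
}
```

A real lookup of a LAN host name from a client is still the final test; this only catches a missed or duplicated edit.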

sxan · March 5, 2025, 4:35pm

I'm back to confirm that since I made the change, the issue has completely disappeared. It works perfectly, so thanks again!