GL-MT3600BE: intermittent TCP/TLS failures when routing LAN clients to another gateway on the same LAN; fixed by SNAT

Hi everyone,

I recently replaced my main router with a GL-MT3600BE, and I started seeing intermittent connection failures in a specific routing scenario.

The issue appears when the GL-MT3600BE routes LAN client traffic to another gateway that is also inside the same LAN subnet.

Network topology

My LAN subnet is:

192.168.11.0/24

Main router:

GL-MT3600BE
LAN IP: 192.168.11.1

Client:

Windows client
IP: 192.168.11.144
Default gateway: 192.168.11.1

There are two additional gateways on the same LAN:

192.168.11.16  -> Tailscale subnet router / gateway
192.168.11.21  -> fake-ip / proxy gateway

The GL-MT3600BE has static routes like:

7.0.0.0/10      via 192.168.11.21 dev br-lan
10.21.0.0/16    via 192.168.11.16 dev br-lan
172.18.20.0/24  via 192.168.11.16 dev br-lan
172.30.0.0/16   via 192.168.11.16 dev br-lan
192.168.12.0/24 via 192.168.11.16 dev br-lan

Example route output:

7.0.0.0/10 via 192.168.11.21 dev br-lan proto static
10.21.0.0/16 via 192.168.11.16 dev br-lan proto static
172.18.20.0/24 via 192.168.11.16 dev br-lan proto static
172.30.0.0/16 via 192.168.11.16 dev br-lan proto static
192.168.12.0/24 via 192.168.11.16 dev br-lan proto static
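To see which next hop the GL-MT3600BE selects for each destination, here is a minimal longest-prefix-match sketch in Python over the static routes listed above. It is a simplification. The connected 192.168.11.0/24 entry and the default route via a hypothetical "wan" next hop are assumptions added for completeness, and the real kernel lookup also considers metrics, policy routing rules, and multiple tables.

```python
import ipaddress

# Static routes from the GL-MT3600BE (as listed above), plus two assumed entries.
# Each entry: (prefix, next hop); None means "deliver directly on br-lan".
ROUTES = [
    ("7.0.0.0/10", "192.168.11.21"),
    ("10.21.0.0/16", "192.168.11.16"),
    ("172.18.20.0/24", "192.168.11.16"),
    ("172.30.0.0/16", "192.168.11.16"),
    ("192.168.12.0/24", "192.168.11.16"),
    ("192.168.11.0/24", None),   # assumed: connected LAN
    ("0.0.0.0/0", "wan"),        # assumed: default route via a hypothetical WAN hop
]

TABLE = [(ipaddress.ip_network(p), nh) for p, nh in ROUTES]


def lookup(dst: str):
    """Return (matched prefix, next hop) using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, nh) for net, nh in TABLE if addr in net]
    # The most specific prefix (largest prefix length) wins.
    net, nh = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), nh


for dst in ("10.21.6.74", "7.0.0.79", "192.168.11.144", "1.1.1.1"):
    print(dst, "->", lookup(dst))
```

Both failing destinations resolve to a next hop on br-lan itself, so the router forwards those packets back out of the interface they arrived on.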

Symptom

Some HTTPS connections fail on the first attempt, but succeed immediately on the second attempt.

Example from Windows curl:

curl.exe -v -o NUL -s -w "dns=%{time_namelookup} connect=%{time_connect} starttransfer=%{time_starttransfer} total=%{time_total}`n" https://xxxx.xxx.xxx.com

Failure:

* Host xxxx.xxx.xxx.com:443 was resolved.
* IPv4: 10.21.6.74
*   Trying 10.21.6.74:443...
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* Recv failure: Connection was reset
* schannel: failed to receive handshake, SSL/TLS connection failed
* closing connection #0
dns=0.025663 connect=0.043946 starttransfer=0.000000 total=19.253947

Second attempt succeeds:

* Host xxx.xxx.xxx.com:443 was resolved.
* IPv4: 10.21.6.74
*   Trying 10.21.6.74:443...
* ALPN: server accepted http/1.1
< HTTP/1.1 200 OK
dns=0.054640 connect=0.088751 starttransfer=0.308301 total=0.316747

The same type of failure also happened with fake-ip traffic, for example:

sqlmodel.tiangolo.com -> 7.0.0.79

Failure example:

* Host sqlmodel.tiangolo.com:443 was resolved.
* IPv4: 7.0.0.79
*   Trying 7.0.0.79:443...
* Recv failure: Connection was reset
* schannel: failed to receive handshake, SSL/TLS connection failed
dns=0.030430 connect=0.035898 starttransfer=0.000000 total=19.207708

Packet capture findings

On the fake-ip route, I saw TCP connection establishment, then TLS ClientHello retransmissions:

192.168.11.144 -> 7.0.0.79:443 [SYN]
7.0.0.79:443 -> 192.168.11.144 [SYN, ACK]
192.168.11.144 -> 7.0.0.79:443 [ACK]
192.168.11.144 -> 7.0.0.79:443 TLSv1.2 Client Hello
192.168.11.144 -> 7.0.0.79:443 TCP Retransmission [PSH, ACK]
...
192.168.11.144 -> 7.0.0.79:443 [RST, ACK]

For the Tailscale route, I captured on both the router and the Tailscale gateway.

On the Tailscale interface, I could see the remote host sending SYN/ACK back:

192.168.11.144 -> 10.21.6.74:443 [SYN]
10.21.6.74:443 -> 192.168.11.144 [SYN, ACK]
10.21.6.74:443 -> 192.168.11.144 [SYN, ACK retransmission]
10.21.6.74:443 -> 192.168.11.144 [SYN, ACK retransmission]
192.168.11.144 -> 10.21.6.74:443 [RST, ACK]

This made me suspect an asymmetric routing issue:

Before SNAT:

Client -> GL-MT3600BE -> LAN-side gateway -> remote target
Client <- LAN-side gateway <- remote target

The return path bypasses the GL-MT3600BE.
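The asymmetry can be made concrete with a small hop-by-hop model. This is a simplified sketch: each device is reduced to a hypothetical routing table containing only the prefixes relevant here, and a packet is handed to whichever device owns the next-hop address until it reaches its destination. The Tailscale gateway's tailnet address (100.64.0.16) and all of the tables are assumptions for illustration.

```python
import ipaddress

# Hypothetical devices: name -> addresses owned by that device.
OWNERS = {
    "client": ["192.168.11.144"],
    "gl": ["192.168.11.1"],
    "ts-gw": ["192.168.11.16", "100.64.0.16"],  # LAN IP + assumed tailnet IP
    "remote": ["10.21.6.74"],
}

# Simplified routing tables: (prefix, next-hop IP); None = deliver directly.
TABLES = {
    "client": [("192.168.11.0/24", None), ("0.0.0.0/0", "192.168.11.1")],
    "gl": [("192.168.11.0/24", None), ("10.21.0.0/16", "192.168.11.16")],
    "ts-gw": [("192.168.11.0/24", None), ("10.21.0.0/16", None)],
    "remote": [("0.0.0.0/0", "100.64.0.16")],
}

ADDR_TO_NODE = {ipaddress.ip_address(a): n for n, addrs in OWNERS.items() for a in addrs}


def next_ip(node: str, dst):
    """Longest-prefix match in `node`'s table; returns the IP to hand the packet to."""
    best = max(
        (r for r in TABLES[node] if dst in ipaddress.ip_network(r[0])),
        key=lambda r: ipaddress.ip_network(r[0]).prefixlen,
    )
    return dst if best[1] is None else ipaddress.ip_address(best[1])


def trace(src_node: str, dst: str, max_hops: int = 8):
    """Follow a packet hop by hop; returns the list of devices it visits."""
    dst_ip = ipaddress.ip_address(dst)
    path, node = [src_node], src_node
    while dst_ip not in [ipaddress.ip_address(a) for a in OWNERS[node]]:
        node = ADDR_TO_NODE[next_ip(node, dst_ip)]
        path.append(node)
        if len(path) > max_hops:
            raise RuntimeError("routing loop")
    return path


forward = trace("client", "10.21.6.74")
reply = trace("remote", "192.168.11.144")
print("forward:", " -> ".join(forward))
print("reply:  ", " -> ".join(reply))
```

In this model the reply never visits `gl`: the Tailscale gateway has 192.168.11.144 on-link and delivers it directly, so the GL-MT3600BE only ever sees one direction of the flow.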

Important test

If I add a route directly on the Windows client, bypassing the GL-MT3600BE as the intermediate router, the problem disappears.

For example:

route add 7.0.0.0 mask 255.192.0.0 192.168.11.21 metric 1

After this, fake-ip traffic is stable:

Windows client -> 192.168.11.21 -> proxy gateway

This suggests the proxy gateway itself is probably fine.
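For reference, a /10 prefix such as 7.0.0.0/10 corresponds to the dotted-decimal mask 255.192.0.0, while 192.0.0.0 is the mask of a /2. A quick check with Python's standard `ipaddress` module (nothing here is specific to the Windows `route` command):

```python
import ipaddress

net = ipaddress.ip_network("7.0.0.0/10")
print(net.netmask)        # dotted mask to pass after "mask"
print(net[0], net[-1])    # first and last address the route covers

# Interpreting 7.0.0.0 with mask 192.0.0.0 instead: the host bits are
# discarded, leaving 0.0.0.0/2 (a quarter of the IPv4 space).
wide = ipaddress.ip_network("7.0.0.0/192.0.0.0", strict=False)
print(wide)
```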

Workaround that fixed the issue

I added an SNAT rule on the GL-MT3600BE for traffic from LAN clients to non-LAN destinations, when the egress zone is still LAN:

Source:      192.168.11.0/24
Destination: !192.168.11.0/24
Zone:        lan -> lan
Action:      SNAT to 192.168.11.1

In LuCI it shows as:

SNAT all -- 192.168.11.0/24 !192.168.11.0/24 to:192.168.11.1

After adding this rule, the packet counter increases and the intermittent HTTPS failures appear to be fixed.

The resulting path becomes symmetric:

After SNAT:
Client -> GL-MT3600BE -> LAN-side gateway -> remote target
Client <- GL-MT3600BE <- LAN-side gateway <- remote target
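The rule's match conditions can be written out as a small predicate. This is a hypothetical Python model of the conditions listed above (source inside 192.168.11.0/24, destination outside it, lan in and lan out), not of how fw3/fw4 or nftables actually evaluate the rule; `snat_source()` is an illustrative name.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.11.0/24")
ROUTER_IP = "192.168.11.1"


def snat_source(src: str, dst: str, in_zone: str, out_zone: str) -> str:
    """Hypothetical model of the lan -> lan SNAT rule's match conditions.

    Returns the source address the forwarded packet would leave with.
    """
    matches = (
        in_zone == "lan"
        and out_zone == "lan"                       # hairpin back out of br-lan
        and ipaddress.ip_address(src) in LAN        # Source: 192.168.11.0/24
        and ipaddress.ip_address(dst) not in LAN    # Destination: !192.168.11.0/24
    )
    return ROUTER_IP if matches else src


print(snat_source("192.168.11.144", "10.21.6.74", "lan", "lan"))  # rewritten
print(snat_source("192.168.11.144", "7.0.0.79", "lan", "lan"))    # rewritten
print(snat_source("192.168.11.144", "1.1.1.1", "lan", "wan"))     # rule not involved
```

Traffic to the WAN is untouched by this rule, which is why it only affects the hairpin routes to the LAN-side gateways.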

Question

Is this expected behavior for this kind of topology?

I understand that routing from br-lan back out through br-lan to another gateway in the same subnet can create asymmetric routing, ICMP redirect, or flow offloading edge cases.

However, this worked more reliably on my previous router (OpenWrt 25.12 on x86). On the GL-MT3600BE, I consistently saw first-attempt TCP/TLS failures until I added SNAT.

Could this be related to:

flow offloading / hardware offloading
ICMP redirect
MTK acceleration
LAN-to-LAN routing via br-lan
asymmetric return path handling

Would GL.iNet recommend SNAT for this topology, or is there a better way to configure static routes to LAN-side gateways?

Thanks.

I think it's worth trying this with hardware acceleration disabled to see if you get the same behaviour.

Thanks. I tried disabling hardware acceleration, but it did not change the behavior.

The only thing that confuses me a bit is the 7.0.0.0/10 route.

Doesn't Tailscale use the 100.x range?

7.0.0.0/10 is a public IP range. The 100.64.0.0/10 range is reserved by the IETF as shared address space (RFC 6598), mainly for CGNAT, and Tailscale uses it for its addresses (IMHO also a bit questionable, but not breaking).

Also, 7.0.0.0/10 is not an RFC 1918 private range.

So what I think is happening here is that you are sinkhole-routing the entire 7.0.0.0 to 7.63.255.255 range.

And you likely also have a dnsmasq security setting enabled that discards or blocks bogons that do not follow RFC 1918.

Also worth noting: Tailscale is based on WireGuard. I personally don't know much about Tailscale, but I do know a lot about WireGuard, and I'm confident that in the default setup these 100.x IPs are just virtual IPs that are statically routed.

It can become dangerous when a virtual IP overlaps a real public IP. I think the bogon detection flags the route as a bogon, and rightfully so, because public IPs should not behave like local source traffic.
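To make the address-space argument concrete, this sketch checks a few ranges from this thread against the RFC 1918 private blocks, the RFC 6598 shared (CGNAT) block, and the RFC 2544 benchmarking block. The categories are written out explicitly instead of using `ipaddress`'s `is_private`, because that property follows the IANA special-purpose registry and so also returns True for blocks such as 198.18.0.0/15, which is not RFC 1918.

```python
import ipaddress

# Explicit reference blocks (deliberately not using ipaddress's is_private,
# which lumps several special-purpose ranges together).
BLOCKS = {
    "RFC 1918 private": ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"],
    "RFC 6598 shared/CGNAT": ["100.64.0.0/10"],
    "RFC 2544 benchmarking": ["198.18.0.0/15"],
}


def classify(prefix: str) -> str:
    """Name the reference block that fully contains `prefix`, if any."""
    net = ipaddress.ip_network(prefix)
    for label, blocks in BLOCKS.items():
        if any(net.subnet_of(ipaddress.ip_network(b)) for b in blocks):
            return label
    return "not in any of these blocks (public or other)"


for prefix in ("7.0.0.0/10", "100.64.0.0/10", "198.18.0.0/15", "10.21.0.0/16"):
    net = ipaddress.ip_network(prefix)
    print(prefix, str(net[0]), "-", str(net[-1]), "->", classify(prefix))
```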

Thanks for pointing this out.

Just to clarify: the 7.0.0.0/10 route is not for Tailscale. It is used by my fake-ip DNS/proxy gateway.

I agree that 7.0.0.0/10 is unusual and not RFC 1918, and I have now changed it to 192.18.0.16/15.


192.18 is also wrong, please see:

You can expand the 'Private IPv4 addresses' dropdown to see a nice table :slight_smile:

According to the engineers' analysis, the main cause of the problem is neither TLS nor the target gateway itself. Rather, when the GL-MT3600BE is used as the primary gateway, it forwards LAN client traffic to another gateway within the same LAN, so the return traffic may bypass the GL, forming an asymmetric path.

Your lan->lan SNAT rule works because it changes the source address to 192.168.11.1, forcing the return packets to pass through the GL again. This restores a symmetric path and a stable connection.

This asymmetric traffic shouldn't be an issue, though, unless it crosses a firewall boundary that requires a stateful connection?

Traffic passing in and out of the LAN interface in only one direction should not be a problem: it isn't crossing the firewall, so nothing should care that subsequent packets are not part of a stateful flow. There should be nothing there to filter them?

Thanks @xize11, you are right, I made a typo. I meant 198.18.0.0/15, not 192.18.0.0/15.

And yes, 198.18.0.0/15 is not an RFC 1918 private network.

The reason I avoided RFC 1918 ranges is that my Tailscale remote subnet routes already use a lot of private address space, including 10.0.0.0/8, 192.168.0.0/16 and 172.16.0.0/12. If I used another private range for fake-ip, it could easily conflict with real remote networks.

So I agree that 198.18.0.0/15 is not “private”, but for fake-ip it seems to be a better choice.
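The conflict concern can be checked mechanically: a fake-ip range is only safe if it overlaps none of the prefixes routed over Tailscale. A small sketch using `ipaddress`'s `overlaps()`; the remote routes are the three blocks named in the post above, and the 10.254.0.0/16 candidate is a hypothetical example of a bad choice.

```python
import ipaddress

# Remote subnet routes reachable over Tailscale (per the post above).
REMOTE_ROUTES = [ipaddress.ip_network(p) for p in
                 ("10.0.0.0/8", "192.168.0.0/16", "172.16.0.0/12")]


def conflicts(candidate: str):
    """Return the remote routes that a candidate fake-ip range overlaps."""
    net = ipaddress.ip_network(candidate)
    return [str(r) for r in REMOTE_ROUTES if net.overlaps(r)]


for candidate in ("198.18.0.0/15", "7.0.0.0/10", "10.254.0.0/16"):
    print(candidate, "->", conflicts(candidate) or "no conflict")
```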

Thanks @lucas2, this matches my test results.

I actually agree with @oorweeg here. Asymmetric routing by itself should not necessarily break TCP. If the packets are simply routed on the LAN side, and there is no strict stateful firewall boundary involved, then the return path not going through the GL router should not automatically cause TLS or TCP failures.

I would really appreciate it if the GL.iNet engineers could explain what is happening internally here, or confirm whether this is something that can be fixed.

For now, SNAT is a working workaround, but it hides the original LAN client IP.

Hmm, can't the Tailscale ranges be smaller? How many clients are we talking about?

Here is an example with 254 clients per network:

192.168.78.0/24 = 192.168.78.0 - 192.168.78.255 (including the network and broadcast addresses).

10.0.7.0/24 = 10.0.7.0 - 10.0.7.255
10.234.46.0/24 = 10.234.46.0 - 10.234.46.255

^ These are three separate networks, each with 254 usable client addresses.

You can really have a LOT of networks like this :slight_smile:

A /16 or /8 just moves the network boundary one or two octets to the left, giving a single network with far more than 254 hosts (65,534 for a /16, 16,777,214 for a /8). You usually only go that way when you are building something like a datacenter; those ranges are huge and likely unnecessary here. I've never used them in my homelab either, and I have quite a lot of networks.
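The arithmetic behind those sizes: an IPv4 /n prefix has 2^(32-n) addresses, two of which (network and broadcast) are not usable as host addresses on an ordinary subnet. A quick check, which deliberately ignores /31 and /32 where that rule differs:

```python
import ipaddress


def usable_hosts(prefix: str) -> int:
    """Addresses in the prefix minus the network and broadcast addresses."""
    return ipaddress.ip_network(prefix).num_addresses - 2


for prefix in ("192.168.78.0/24", "10.234.0.0/16", "10.0.0.0/8"):
    print(prefix, usable_hosts(prefix))
```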

Thank you for your update.

Here are some analyses and opinions from engineers regarding this issue:

Without SNAT, the path is roughly as follows:

Outbound:

Windows Client
192.168.11.144
-> GL-MT3600BE
192.168.11.1
-> LAN-side Gateway
192.168.11.16 / 192.168.11.21
-> Remote target

Return Route:

Remote target
-> LAN-side Gateway
192.168.11.16 / 192.168.11.21
-> Windows Client
192.168.11.144

The problem is that the GL-MT3600BE only sees outbound traffic and doesn't consistently see return traffic.

This affects several things:

  1. Incomplete firewall and conntrack state

OpenWrt/Linux firewalls rely on connection tracking (conntrack). A normal TCP connection should look like this:

SYN ->
<- SYN/ACK
ACK ->
TLS ClientHello ->
<- TLS ServerHello

If the GL-MT3600BE only sees:

Client -> Router -> Gateway -> Remote

but not:

Remote -> Gateway -> Router -> Client

then its view of the connection state is incomplete. In some cases, it may consider subsequent packets not to be part of a normal, established connection.
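A toy illustration of point 1: the sketch below is a heavily simplified TCP state tracker, loosely modelled on one rule in Linux's conntrack TCP state table (an ACK from the originator while the tracker is still waiting for a SYN/ACK is classed as invalid). It is not the kernel's implementation; it ignores sequence-number windows, timeouts, and sysctls such as `nf_conntrack_tcp_loose`, and the state names are abbreviated.

```python
# Allowed transitions: (state, direction, flags) -> new state.
# Deliberately tiny: real conntrack also checks sequence windows, timeouts, etc.
TRANSITIONS = {
    ("NONE", "orig", "SYN"): "SYN_SENT",
    ("SYN_SENT", "reply", "SYN/ACK"): "SYN_RECV",
    ("SYN_RECV", "orig", "ACK"): "ESTABLISHED",
    ("ESTABLISHED", "orig", "ACK"): "ESTABLISHED",
    ("ESTABLISHED", "reply", "ACK"): "ESTABLISHED",
}


def track(packets):
    """Return the tracker's view of each (direction, flags) packet in order."""
    state, verdicts = "NONE", []
    for direction, flags in packets:
        new_state = TRANSITIONS.get((state, direction, flags))
        if new_state is None:
            # e.g. an ACK from the client while still in SYN_SENT:
            # no SYN/ACK has been seen yet, so the packet is flagged.
            verdicts.append("INVALID")
        else:
            state = new_state
            verdicts.append(state)
    return verdicts


# The ClientHello is modelled as a data-carrying "ACK" from the client.
symmetric = [("orig", "SYN"), ("reply", "SYN/ACK"), ("orig", "ACK"), ("orig", "ACK")]
asymmetric = [("orig", "SYN"), ("orig", "ACK"), ("orig", "ACK")]  # SYN/ACK never seen

print(track(symmetric))   # router sees both directions
print(track(asymmetric))  # router sees only the client's side
```

In a firewall that drops packets flagged invalid, the second trace is where the ClientHello would be lost; whether the GL-MT3600BE firmware behaves this way is an assumption, not something established in this thread.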

  2. TCP and NAT states may not match

Without SNAT, the source address seen by the LAN-side gateway is:

192.168.11.144

Because 192.168.11.144 and 192.168.11.16/21 are on the same LAN, the gateway will send the return packet directly to the client instead of sending it back to 192.168.11.1.

In other words, the GL-MT3600BE took part in the outbound path but dropped out of the return path. This makes any functionality that relies on "both directions passing through me" unstable. For example:

conntrack
firewall state
flow offloading
hardware NAT / acceleration
MTK acceleration
QoS / SQM
TCP MSS/PMTU handling (partially)

  3. ICMP redirects may further alter client paths

In this topology, the main router might reason:

Client 192.168.11.144 needs to go to 10.21.6.74. The next hop is actually 192.168.11.16 on the same network segment.

Therefore, it might send an ICMP redirect to the client, meaning:

From now on, go directly to 192.168.11.16 and don't go through me anymore.

Some clients accept this, some don't; some security policies will discard it; some paths will change during the connection process. The result is an intermittent phenomenon of "failed the first time, succeeded the second time."
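The condition for the router to send such a redirect can be sketched as a simple predicate. This roughly follows Linux's behaviour: a redirect is considered when a forwarded packet leaves through the same interface it arrived on, redirects are enabled on that interface (the `net.ipv4.conf.<iface>.send_redirects` sysctl), and the sender and the chosen next hop are on the same subnet. The `should_redirect()` helper and the "wan" interface name are hypothetical, and the real kernel has additional checks and rate-limits redirects.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.11.0/24")


def should_redirect(in_if: str, out_if: str, src: str, next_hop: str,
                    send_redirects: bool = True) -> bool:
    """Hypothetical, simplified check: would the router send an ICMP redirect?"""
    same_interface = in_if == out_if          # hairpin: in and out on br-lan
    sender_and_gateway_on_link = (
        ipaddress.ip_address(src) in LAN
        and ipaddress.ip_address(next_hop) in LAN
    )
    return send_redirects and same_interface and sender_and_gateway_on_link


# Client -> GL -> Tailscale gateway, all on br-lan: redirect candidate.
print(should_redirect("br-lan", "br-lan", "192.168.11.144", "192.168.11.16"))
# Same packet with redirects disabled on br-lan.
print(should_redirect("br-lan", "br-lan", "192.168.11.144", "192.168.11.16",
                      send_redirects=False))
# Normal LAN -> WAN forwarding (203.0.113.1 is a documentation address).
print(should_redirect("br-lan", "wan", "192.168.11.144", "203.0.113.1"))
```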

  4. Layer 3 forwarding within the same Layer 2 LAN can easily trigger edge cases

Your path is:

br-lan in -> GL-MT3600BE routing -> br-lan out

That is, packets enter and leave through the same interface. This is not ordinary LAN-to-WAN forwarding, but LAN-to-LAN hairpin routing.

Many router acceleration paths, firewall rules, and NAT rules more commonly assume:

LAN -> WAN
WAN -> LAN
LAN -> VPN
VPN -> LAN

rather than:

LAN -> LAN, same subnet, different gateway

Therefore, this topology is more likely to run into issues with flow offload, hardware acceleration, bridge interface forwarding, ARP neighbor entries, and ICMP redirects.

  5. Why is SNAT stable?

After SNAT, the GL-MT3600BE changes the client's source address to its own:

Original packet:

192.168.11.144 -> 10.21.6.74

After SNAT:

192.168.11.1 -> 10.21.6.74

The LAN-side gateway sees the request originating from:

192.168.11.1

Therefore, its response packet can only be sent to the GL-MT3600BE:

10.21.6.74 -> 192.168.11.1

Then the GL-MT3600BE uses its conntrack/NAT state to rewrite the destination back to the client:

10.21.6.74 -> 192.168.11.144
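The rewrite-and-restore step can be pictured as a lookup table keyed by the connection. Below is a minimal sketch of that idea, not of netfilter's actual data structures: on the way out, the router records the original source and rewrites it to 192.168.11.1; a reply addressed to 192.168.11.1 is matched back to that entry and its destination restored. The port numbers are hypothetical, and real SNAT may also need to change the source port when two clients collide.

```python
from typing import Dict, NamedTuple, Tuple

ROUTER_IP = "192.168.11.1"


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int


# Expected reply tuple -> original client (ip, port).
nat_table: Dict[Tuple[str, int, str, int], Tuple[str, int]] = {}


def snat_out(pkt: Packet) -> Packet:
    """Rewrite the source to the router and remember how to undo it."""
    reply_key = (pkt.dst, pkt.dport, ROUTER_IP, pkt.sport)
    nat_table[reply_key] = (pkt.src, pkt.sport)
    return pkt._replace(src=ROUTER_IP)


def de_nat_in(pkt: Packet) -> Packet:
    """Restore the original client as destination for a reply packet."""
    client_ip, client_port = nat_table[(pkt.src, pkt.sport, pkt.dst, pkt.dport)]
    return pkt._replace(dst=client_ip, dport=client_port)


out = snat_out(Packet("192.168.11.144", 50000, "10.21.6.74", 443))
reply = de_nat_in(Packet("10.21.6.74", 443, ROUTER_IP, 50000))
print(out)
print(reply)
```

Because the LAN-side gateway only ever sees 192.168.11.1 as the source, its reply has to go to the GL-MT3600BE, which is the only device holding this table.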

Then the path becomes:

Outbound:
Client -> GL-MT3600BE -> LAN-side Gateway -> Remote

Return:
Remote -> LAN-side Gateway -> GL-MT3600BE -> Client

With a symmetric path, conntrack, the firewall, offloading, and QoS can all see the complete flow, resulting in a more stable connection.

In short: the problem with asymmetric paths isn't that IP routing theoretically can't work this way, but that modern home/embedded routers have many stateful functions. If a device only sees half of a connection, intermittent anomalies may occur in TCP, TLS, NAT, flow offload, or ICMP redirect handling. SNAT forces the return traffic to pass through the main router as well.

@msmmbl Do you have a managed/VLAN-capable switch? You may be better off creating VLAN interfaces for your alternate routes and treating them as proper routed paths, which avoids the LAN<>LAN asymmetry.

My Beryl 7 is kind of spare now that I have a Slate 7 Pro, so if I get more time I'll experiment with your current setup on it.