GL-XE3000 possible kernel/mt_wifi memory leak causing nginx dashboard OOM after ~6 days

I am seeing what looks like a kernel/driver memory leak or severe memory fragmentation issue on a GL-XE3000. The router itself continues routing, but the GL.iNet dashboard stops responding because nginx gets killed by the OOM killer.

Device details:

  • Model: GL.iNet GL-XE3000
  • Firmware: 4.8.3
  • Build: 902
  • Build date: 2025-11-06 20:12:02
  • Type: release3
  • Kernel: 5.4.211
  • Target: mediatek/mt7981
  • Wi-Fi driver package: kmod-mt_wifi - 5.4.211+TEST-16
  • Uptime when issue occurred: about 6 days 17 hours

Symptoms:

  • http://192.168.8.1/ and https://192.168.8.1/ stop responding.
  • Ports 80 and 443 are closed/refused because nginx is killed.
  • Router still works as gateway.
  • DNS still responds.
  • SSH still works, though sometimes slowly.
  • LuCI on 8080/8443 remains available.
  • Restarting nginx only helps briefly unless memory pressure is relieved.

Memory state during the issue:

MemTotal:         491376 kB
MemAvailable:      ~14-35 MB
Slab:             ~322 MB
SReclaimable:       ~7 MB
SUnreclaim:       ~315 MB
SwapTotal:          0 kB initially

After I temporarily added a 128 MB swap file and restarted nginx, the dashboard came back, but the underlying SUnreclaim remained very high.

Relevant OOM/kernel logs:

Out of memory: Killed process ... nginx
worker process ... exited on signal 9

slab_reclaimable:1741 slab_unreclaimable:79031
slab_reclaimable:1736 slab_unreclaimable:78899
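
As a cross-check, the slab_unreclaimable figures in the OOM report are page counts, not kB. The sketch below converts them, assuming 4 kB pages (not confirmed for this build; `getconf PAGESIZE` would tell, where available). The function name is illustrative only.

```shell
# Convert a page count from the OOM report to kB.
# Assumption: 4 kB pages on this kernel.
pages_to_kb() {
    echo $(( $1 * 4 ))
}

pages_to_kb 79031   # prints 316124
pages_to_kb 78899   # prints 315596
```

Under that assumption, both samples come out at roughly 316,000 kB, which agrees with the ~315 MB SUnreclaim reading from /proc/meminfo.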

The strongest clue is this kernel allocation failure stack from the MediaTek Wi-Fi driver path:

lua: page allocation failure: order:3, mode:0x40a20(GFP_ATOMIC|__GFP_COMP)

os_alloc_mem+0x1c/0x38 [mt_wifi]
RTMPIoctlGetMacTableStaInfo+0x28/0x430 [mt_wifi]
RTMP_AP_IoctlHandle+0x358/0x900 [mt_wifi]
rt28xx_ap_ioctl+0x97c/0x1188 [mt_wifi]
rt28xx_ioctl+0x50/0x88 [mt_wifi]
ap_iw_handler+0x3c/0x310 [mt_wifi]
ioctl_private_call
wireless_process_ioctl
wext_handle_ioctl

7981@C12L1,RTMPIoctlGetMacTableStaInfo() 7101: Allocate memory fail!!!
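
For context on the trace above: an order:3 request asks for 2^3 physically contiguous pages, and GFP_ATOMIC means the allocator cannot sleep to reclaim or compact memory first. So the failure may point at fragmentation as much as at total memory. A rough sketch for checking this on the device follows, assuming 4 kB pages and the usual /proc/buddyinfo layout (four label fields, then one free-block count per order starting at order 0). Both function names are illustrative.

```shell
# Size in kB of one allocation of the given order (assumes 4 kB pages).
order_to_kb() {
    echo $(( (1 << $1) * 4 ))
}

# Sum the free blocks of at least the given order across all zones.
# $1 = order, $2 = buddyinfo file (defaults to /proc/buddyinfo).
free_blocks_at_or_above() {
    awk -v order="$1" '
        { for (i = 5 + order; i <= NF; i++) sum += $i }
        END { print sum + 0 }
    ' "${2:-/proc/buddyinfo}"
}

order_to_kb 3   # prints 32
```

Running `free_blocks_at_or_above 3` while the problem is active would show how many free blocks of 32 kB or larger remain. A value at or near zero would be consistent with the allocation failure in the trace.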

The process involved was:

/usr/bin/lua /usr/bin/gl_clients_update

The modem stack was also active/noisy around the same time:

gl_modem invoked oom-killer
modem_AT: get_AT_device api Loop count:1
gl_modem: Start lock operator initialization...

My current interpretation:

This does not look like normal userspace memory usage. Process RSS was modest, conntrack count was normal, and socket counts were normal. The problem appears to be unreclaimable kernel slab growth, possibly triggered or exposed by repeated client scanning via gl_clients_update and the mt_wifi driver, with modem polling adding pressure.

Temporary mitigations I applied:

worker_processes auto -> worker_processes 1
added temporary 128 MB swap file
restarted nginx

This brought the dashboard back, but it does not solve the underlying kernel memory growth.

Questions:

  1. Is there a known memory leak or fragmentation issue in kmod-mt_wifi / RTMPIoctlGetMacTableStaInfo() for GL-XE3000 / MT7981?

  2. Is firmware 4.8.3 build 902 affected?

  3. Is there a newer firmware, U-Boot, or Wi-Fi driver build that specifically addresses unreclaimable slab growth or nginx/dashboard OOM?

  4. Is it safe/recommended to disable or reduce gl_clients_update polling as a workaround?

  5. Are there diagnostics you want me to collect before rebooting, since rebooting clears the slab memory?
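
Regarding question 5, a minimal sketch of what could be captured before a reboot is shown below. It is not an official GL.iNet procedure. It assumes /proc/slabinfo is readable as root on this kernel (the file is skipped if not), and the function name and the PROC_DIR override are hypothetical, the latter only there so the function can be tried off-device.

```shell
# Snapshot kernel memory state into a timestamped directory and print its path.
# $1 = parent directory for the snapshot (defaults to /tmp).
snapshot_mem() {
    proc="${PROC_DIR:-/proc}"
    out="${1:-/tmp}/memsnap-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out" || return 1
    for f in meminfo slabinfo buddyinfo; do
        # Skip files this kernel does not expose instead of aborting.
        if [ -r "$proc/$f" ]; then
            cat "$proc/$f" > "$out/$f"
        fi
    done
    # Kernel ring buffer, where OOM and allocation-failure traces are logged.
    dmesg > "$out/dmesg.txt" 2>/dev/null || true
    echo "$out"
}
```

On OpenWrt-based firmware /tmp is normally RAM-backed, so the snapshot should be copied off the router (for example with scp) and removed again rather than left there while memory is tight.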

Thanks.

Hi

Sorry for the delayed reply.

Please:

  1. Export the device logs and send them to us via private message so we can check the OOM logs.

  2. If the issue still persists, please use the following command to check the actual memory usage of gl_clients_update:

    cat /proc/$(pgrep -f gl_clients_update | head -n1)/status | grep -E 'VmSize|VmRSS|VmData|VmStk|VmExe|VmLib'
    

Thank you for your patience and cooperation.


How to export logs:

How to send private messages:

Sorry for the late reply. After 12 hours without a response I rebooted the router, so I had to wait for the problem to occur again. It came back today, as I am writing this. As requested, I ran the command, and here are the details:

VmRSS: 780 kB

pid=29345
VmSize: 12864 kB
VmRSS: 4908 kB
VmData: 4656 kB
VmStk: 132 kB
VmExe: 12 kB
VmLib: 7708 kB

The real problem is still kernel memory:

SUnreclaim: ~306 MB

That supports the same theory: gl_clients_update may be triggering the MediaTek Wi-Fi/kernel path, but the leaked memory ends up in unreclaimable kernel slab, not inside the Lua process itself.

I have attached the logs and sent you a private message.

Thank you for sharing the logs with us.

We will ask the development team to investigate further.

Hi

From the logs alone, we were unable to determine which specific module is causing the OOM issue.

However, in most cases, kernel memory leaks are related to drivers.
Could you please upgrade the XE3000 to the latest v4.9.0 beta and check whether the issue still occurs? This version includes fixes for several driver-related issues.

If the issue still persists, please:

  1. Try disabling gl_clients_update:

    /etc/init.d/gl_clients stop

  2. If that does not help, we may need to gradually disable features to narrow down the issue for further investigation.

Thanks for your cooperation and understanding.


Download link:

Upgrade guide:

The beta is up and performing better so far, but it is too early to declare the memory leak fixed.

Current state:

  • Firmware is now 4.9.0
  • Kernel build: 5.4.211, built Mon Jun 1 2026
  • Uptime: about 13 minutes
  • GL dashboard is working again on 80/443
  • LuCI is also working on 8080/8443
  • No OOM kills since the beta boot
  • WireGuard clients are up and handshaking
  • Internet and DNS from the router are working

The important memory comparison:

Before beta / before reboot:

SUnreclaim: ~306 MB
MemAvailable: ~23-36 MB
nginx: OOM-killed

Now:

SUnreclaim: ~67 MB
MemAvailable: ~189 MB
nginx: running

That is much healthier. Over a 1-minute sample, SUnreclaim stayed stable around 67 MB, so there is no immediate runaway leak. The real test is whether it climbs again over the next 2-6 days.
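
To track this over several days, SUnreclaim could be sampled periodically into a log. The following is a minimal sketch, not a GL.iNet tool. The function names, the log path, and the hourly interval are all assumptions.

```shell
# Print the SUnreclaim value in kB.
# $1 = meminfo file (defaults to /proc/meminfo).
sunreclaim_kb() {
    awk '/^SUnreclaim:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Append "<unix-timestamp> <SUnreclaim kB>" to a log file.
# $1 = log file, $2 = optional meminfo file.
log_sample() {
    echo "$(date +%s) $(sunreclaim_kb "$2")" >> "$1"
}

# Example (hypothetical path and interval), run in the background over SSH:
#   while true; do log_sample /tmp/sunreclaim.log; sleep 3600; done &
```

A steadily rising second column would suggest the leak is still present. A flat line around the current ~67 MB over several days would suggest the beta addressed it.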

One issue I found: GL DDNS is broken right now.

Current active internet route is through:

secondwan / pppoe-secondwan

But GL DDNS is configured to use:

interface='wan'

And wan is currently down:

wan: up=false, NO_DEVICE

That explains the repeated DDNS timeout errors and why this domain is currently unresolved:

lxxxxx.glddns.com -> NXDOMAIN

So the beta itself looks promising for the dashboard/OOM issue, but DDNS needs config cleanup: point GL DDNS to secondwan, and probably disable/fix the IPv6 DDNS entry tied to wan6.

Hi

Thank you for the update.

We're glad to hear that memory usage now appears much healthier than before. Please continue monitoring the device and let us know whether the issue reoccurs.


Regarding the GL DDNS issue, the public IP detection method is actually configured to obtain the IP address by querying a specified web service:

gl_ddns.glddns.ip_url='http://checkip.dyndns.com'
gl_ddns.glddns.ip_source='web'

rather than directly from the bound interface:

gl_ddns.glddns.interface='wan'

Therefore, this should not affect normal operation.

We tested locally using an XE3000 running v4.9.0 Beta 2, and GL DDNS worked correctly even when only the Secondary WAN was enabled.


Could you please check again?

If the issue still persists, please:

  1. Export the device logs and send them to us via private message so we can review the GL DDNS status.

  2. SSH into the router and run the following command to verify whether communication with the configured IP detection server is working properly:

    curl http://checkip.dyndns.com