@bruce Thank you for publishing rtl8366ub_mdio.c/.h — this is genuinely appreciated, and it turns out to be more useful than you may have expected: the wrapper you just published contains the root cause of the VLAN PVID bug I reported. The bug is not in Realtek's NDA'd SDK — it's in this GPL glue layer, so you can fix it today without waiting on Realtek.
The bug — rtl8366ub_sw_set_vlan_ports():
    port = &val->value.ports[0];
    for (i = 0; i < val->len; i++, port++) {
        RTK_PORTMASK_PORT_SET(vlan.mbr, port->id);
        if (!(port->flags & BIT(SWITCH_PORT_FLAG_TAGGED)))
            RTK_PORTMASK_PORT_SET(vlan.untag, port->id);
        if (EXT_PORT0 == port->id || EXT_PORT1 == port->id) {
            continue;
        }
        /*
         * To ensure that we have a valid MC entry for this VLAN,
         * initialize the port VLAN ID here.
         */
        err = rtk_vlan_portPvid_get(port->id, &pvid, &port_Pri);
        if (err < 0)
            return err;
        err = rtk_vlan_portPvid_set(port->id, val->port_vlan, port_Pri); // <-- BUG
        if (err < 0)
            return err;
    }
This sets the PVID of every non-EXT member port to the VLAN currently being applied — including tagged members. Since netifd applies switch_vlan sections in ascending VID order at boot and on every switch reapply, a trunk port that is a tagged member of VLANs 1/20/30/40 ends up with PVID=40 instead of its configured PVID. That's exactly what I captured in my report: dmesg shows the [rtl8366ub_sw_set_vlan_ports] vid=1 … vid=20 … vid=30 … vid=40 cascade (the pr_info a few lines below this loop), after which swconfig dev switch0 port 1 get pvid returns 40. Untagged clients then land on the wrong VLAN and get DHCPNAK'd.
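To make the cascade concrete, here is a small standalone simulation of the unguarded write. It is only a sketch: `port_pvid`, `EXT_PORT`, `set_vlan_ports_unguarded` and `trunk_pvid_after_boot` are made-up names for illustration, and a plain array stands in for the switch's per-port PVID registers. It does not call any Realtek SDK function.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_PORTS 6
#define EXT_PORT  5   /* hypothetical id standing in for EXT_PORT0/EXT_PORT1 */

/* Toy PVID table standing in for the switch's per-port PVID registers. */
static int port_pvid[NUM_PORTS];

/* Mimics the loop above: every non-EXT member gets PVID = vid, tagged or not. */
static void set_vlan_ports_unguarded(int vid, const int *members, int len)
{
    for (int i = 0; i < len; i++) {
        int id = members[i];

        assert(id >= 0 && id < NUM_PORTS);
        if (id == EXT_PORT)
            continue;
        port_pvid[id] = vid;   /* unconditional overwrite */
    }
}

/* Applies VLANs 1, 20, 30, 40 in ascending order; port 1 is a member of all. */
static int trunk_pvid_after_boot(void)
{
    static const int vids[] = { 1, 20, 30, 40 };
    static const int members[] = { 1, EXT_PORT };

    for (int p = 0; p < NUM_PORTS; p++)
        port_pvid[p] = 0;
    for (size_t v = 0; v < sizeof(vids) / sizeof(vids[0]); v++) {
        set_vlan_ports_unguarded(vids[v], members, 2);
        printf("vid=%d -> port 1 pvid=%d\n", vids[v], port_pvid[1]);
    }
    return port_pvid[1];
}
```

Calling `trunk_pvid_after_boot()` walks port 1 through PVID 1, 20, 30 and finally 40, which matches the `get pvid` result from my report. The last VLAN applied always wins.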
The fix is one line, and upstream already has it. This loop, including the "To ensure that we have a valid MC entry" comment verbatim, comes from OpenWrt's GPL rtl8366_smi.c. But upstream guards the write so it only initializes a port that doesn't have a PVID yet (rtl8366_sw_set_vlan_ports in target/linux/generic/files/drivers/net/phy/rtl8366_smi.c, openwrt-21.02 branch of openwrt/openwrt on GitHub):
    err = rtl8366_get_pvid(smi, port->id, &pvid);
    if (err < 0)
        return err;
    if (pvid == 0) { // <-- the guard your fork dropped
        err = rtl8366_set_pvid(smi, port->id, val->port_vlan);
        if (err < 0)
            return err;
    }
Your version calls rtk_vlan_portPvid_get() but then only uses the priority and overwrites the PVID unconditionally. Restoring the equivalent guard (if (pvid == 0) — or simply not touching PVID for tagged members) fixes the trunk-port behavior; set_port_pvid already exists for explicit PVID configuration.
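As a rough illustration of what the guard changes, the sketch below runs the same ascending VLAN sequence against a toy PVID table with the `pvid == 0` check in place. All names here (`port_pvid`, `EXT_PORT`, `set_vlan_ports_guarded`, `trunk_pvid_guarded`) are hypothetical stand-ins, not functions from your wrapper or from the Realtek SDK, and the preset value simply models a PVID that was already configured before the VLAN sections are applied.

```c
#include <assert.h>

#define NUM_PORTS 6
#define EXT_PORT  5   /* hypothetical id standing in for EXT_PORT0/EXT_PORT1 */

/* Toy PVID table standing in for the switch's per-port PVID registers. */
static int port_pvid[NUM_PORTS];

/* Same loop shape, but the PVID is only initialized when it is still 0. */
static void set_vlan_ports_guarded(int vid, const int *members, int len)
{
    for (int i = 0; i < len; i++) {
        int id = members[i];

        assert(id >= 0 && id < NUM_PORTS);
        if (id == EXT_PORT)
            continue;
        if (port_pvid[id] == 0)      /* the guard */
            port_pvid[id] = vid;
    }
}

/*
 * Applies VLANs 1, 20, 30, 40 in ascending order with port 1 as a member
 * of all four. preset_pvid models a PVID that was already configured on
 * port 1 (0 means "not configured yet").
 */
static int trunk_pvid_guarded(int preset_pvid)
{
    static const int vids[] = { 1, 20, 30, 40 };
    static const int members[] = { 1, EXT_PORT };

    for (int p = 0; p < NUM_PORTS; p++)
        port_pvid[p] = 0;
    port_pvid[1] = preset_pvid;

    for (unsigned v = 0; v < sizeof(vids) / sizeof(vids[0]); v++)
        set_vlan_ports_guarded(vids[v], members, 2);
    return port_pvid[1];
}
```

With the guard, `trunk_pvid_guarded(0)` leaves port 1 at the first VLAN applied (1), and `trunk_pvid_guarded(20)` keeps the already-configured 20. Later VLAN sections no longer clobber it.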
A second, smaller bug while you're in this file — the enable_vlan get/set handler bodies are swapped:
    static int rtl8366ub_sw_set_vlan_enable(...) {
        ...
        val->value.i = gsw->global_vlan_enable; // "set" handler only READS
        return 0;
    }
    static int rtl8366ub_sw_get_vlan_enable(...) {
        ...
        gsw->global_vlan_enable = val->value.i != 0; // "get" handler only WRITES
        return 0;
    }
This is why swconfig dev switch0 set enable_vlan 1 is a no-op and get enable_vlan always reads 0 on fw 4.8.6, despite uci network.@switch[0].enable_vlan='1' — something several of us independently hit while debugging this.
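For reference, this is roughly what the un-swapped pair would look like. It is a sketch with simplified, hypothetical types (`toy_switch`, `toy_val`) rather than the real swconfig handler signature, which I have left out because the arguments are elided above. Whether the real set handler also needs to push the flag to hardware is something only you can tell from the rest of the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver state and the swconfig value. */
struct toy_switch { bool global_vlan_enable; };
struct toy_val    { int i; };

/* "set" stores the requested value in the driver state. */
static int toy_set_vlan_enable(struct toy_switch *gsw, const struct toy_val *val)
{
    assert(gsw != NULL && val != NULL);
    gsw->global_vlan_enable = (val->i != 0);
    return 0;
}

/* "get" reports the stored value back to the caller. */
static int toy_get_vlan_enable(const struct toy_switch *gsw, struct toy_val *val)
{
    assert(gsw != NULL && val != NULL);
    val->i = gsw->global_vlan_enable ? 1 : 0;
    return 0;
}

/* Sets the flag, then reads it back; returns what "get" reports. */
static int roundtrip_vlan_enable(int requested)
{
    struct toy_switch sw = { false };
    struct toy_val in = { requested };
    struct toy_val out = { -1 };

    toy_set_vlan_enable(&sw, &in);
    toy_get_vlan_enable(&sw, &out);
    return out.i;
}
```

With the bodies the right way round, a set of 1 followed by a get reads back 1, which is what `swconfig dev switch0 set enable_vlan 1` followed by `get enable_vlan` should do.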