[Update:Solution]
It was my router which set STP on by default. Switching it off (in smaller networks) or using RSTP made the delays go away.
[/Update]
Hóla!
For a long time I’ve had this horribly annoying problem: on bootup, ANY domain machine on wired LAN (no problems with wireless) sits idle with “there’s no network!” for about 1-2 minutes until it discovers the network. BUT only Windows machines. Linux boxes get network instantly, also on LAN.
Setup: 2 domain controllers, Server 2019. Both are DNS, one is DHCP and NPS for Wi-Fi. All machines have fixed IPs, the DHCP is just for wireless clients.
I have tried everything I could think of, like NIC-Drivers, OpenDHCP, temporarily changed the switch from a managed one to a dumb one, changed the NIC in the server, let only one DC be alive at a time, rejoined the domain, the usual sfc/dism-approach and whatnot.
I asked once on Reddit, but everyone just told me “that’s DHCP!”, yet it’s (seemingly at least) not. All have fixed IPs, and using DHCP doesn’t change a thing.
So I’m clueless again, hoping for some nerd that’s nerdier than me to have an idea :)
Windows machines determine whether they have Internet by probing a Microsoft server (that’s NCSI). If there’s an issue doing that, it would explain why Linux boxes on the same network don’t have this problem. As for the root cause, there’s nothing in your post that gives me an idea.
Oh, it’s not INTERNET they don’t get, they get no net at all. It’s “unknown network” for a long time until they finally display “<domain>” and only then I can access the LAN. From there on, everything works fine.
I know this is stupid to ask, but can you test setting up servers fresh from an .iso? No template, no domain join, nothing that would create any predefined settings. If the issue doesn’t persist, maybe there is a legacy GPO or something that forces domain recognition before allowing other network traffic. Or something completely different, but we gotta corner the problem with troubleshooting.
And also maybe create a script that’s fired at bootup. The script could write a timestamp plus the output of “ipconfig /all” and “route print” into a text file every few milliseconds.
This would create large logfiles but might help. If you can’t even ping local IPv4 addresses, maybe the network stack simply doesn’t load fast enough.
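A rough sketch of what such a bootup logger could look like, in Python (my choice, the thread doesn’t name a language). The function names `snapshot` and `log_boot` are made up for illustration, and the half-second interval is an assumption: each “ipconfig /all” call takes longer than a few milliseconds anyway.

```python
import subprocess
import time
from datetime import datetime

# The two commands named in the suggestion above; both ship with Windows.
COMMANDS = [["ipconfig", "/all"], ["route", "print"]]


def run_command(cmd):
    """Run one command and return its text output, or the error as text."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>"


def snapshot(commands=COMMANDS, runner=run_command, now=datetime.now):
    """Build one timestamped log entry with the output of every command."""
    lines = [f"===== {now().isoformat(timespec='milliseconds')} ====="]
    for cmd in commands:
        lines.append(f"--- {' '.join(cmd)} ---")
        lines.append(runner(cmd))
    return "\n".join(lines) + "\n"


def log_boot(path, duration_s=180.0, interval_s=0.5, runner=run_command):
    """Append a snapshot to `path` every `interval_s` seconds for `duration_s` seconds.

    Returns the number of snapshots written.
    """
    deadline = time.monotonic() + duration_s
    count = 0
    with open(path, "a", encoding="utf-8") as fh:
        while time.monotonic() < deadline:
            fh.write(snapshot(runner=runner))
            fh.flush()  # push each entry out right away in case the box gets reset
            count += 1
            time.sleep(interval_s)
    return count
```

Hooked into a startup task, a call like `log_boot(r"C:\Temp\bootnet.log")` (hypothetical path) would cover the first three minutes after boot. Comparing the timestamps where the address and routes show up against the moment the network becomes usable should show whether the stack or something outside the machine is the slow part.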
Also, some additional info might help with cornering it, such as:
- Is it only occurring on virtualized machines?
- What hypervisor is being used?
- Is there more than one brand of hypervisor? (E.g. VMware and Hyper-V)
- Is the problem also occurring on bare-metal servers? (Windows Server OS installed directly on the server, without virtualisation)
- Is your domain forest an old one that you didn’t create initially? Or, another way of asking: could there be GPOs or templates with settings in them that you don’t know about?
- Did you already try connecting two servers directly to each other and sniffing the NIC output via Wireshark? Maybe you can use this to check, in parallel, the behaviour of the bootup script against the routing tables and IP settings. Maybe something sticks out weirdly enough to catch your attention?
NVM, I finally found the culprit by accident…my switch enabled STP (slow) by default. Switching it off or using RSTP fixed the delays. Thanks for helping anyway man!
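For anyone landing here from a search, the delay fits how classic STP (IEEE 802.1D) brings up a port: after link-up the port sits in listening and then learning before it forwards anything. A back-of-the-envelope sketch, assuming the standard default timers (forward delay 15 s, max age 20 s); the values on a given switch are configurable and may differ.

```python
# IEEE 802.1D default timers in seconds (assumed; switches let you change them).
FORWARD_DELAY = 15
MAX_AGE = 20


def port_up_delay(forward_delay=FORWARD_DELAY):
    """Seconds a freshly linked port spends in listening + learning before forwarding."""
    listening = forward_delay
    learning = forward_delay
    return listening + learning


def reconvergence_delay(forward_delay=FORWARD_DELAY, max_age=MAX_AGE):
    """Worst case after an indirect failure: wait out max age, then listen and learn."""
    return max_age + port_up_delay(forward_delay)


print(port_up_delay())        # 30
print(reconvergence_delay())  # 50
```

The 30 seconds alone don’t account for the full 1-2 minutes. Presumably Windows’ own retries on top of the blocked port make up the rest, but that part is a guess. RSTP moves edge ports to forwarding almost immediately, which matches the fix.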
Holy moly, networking class… I’m getting flashbacks to when we tried out STP in the simulated Cisco environment, and yes, you are right. It takes a short but nonetheless weird amount of time for it to time out.
Thanks for giving me the updates. If I or somebody else ever has similar symptoms maybe they will find this thread :D
I gotta say, I think I would never have targeted STP as the culprit. Though to be fair, I only use dumb switches in my homelab, and at the corp the networking department gatekeeps the nice stuff a bit :3
Anyway, I’m happy you found out and were able to fix it. <3
If I’d tell you that I was trying to fix that shit for over a year now and gave up 4 times already…
Yeah, totally. Would’ve never suspected the culprit there. But it started to make total sense. Only LAN. Only physical. Even when switching the NIC off and on again. But not in a VM. There was only one common denominator here. The effing switch.
Well, if you use pro stuff at home, better be a pro lol. Thanks anyway man. It nudged me in the right direction.
At this point I was willing to try sacrificing sheep or reading a manual.
You were ready for reading the manual. Darn good that you’ve made it without passing that line. Once you pass it you never come back to being sane again, you know?
:D
I knoooow. That’s what i feared most. Luckily i lacked the balls to cross the final frontier 😁
This comment needs to come up in a search.
Thought I edited my post for this reason. Gotta do it again 😁
- No. Also physical machines.
- Hypervisor is Proxmox. But it only runs Linux machines, which all have no problems.
- Yes, also bare-metal servers. They both are.
- The forest is old (2003 or so) and migrated a lot. I created it. I already tried disabling all GPOs and returning to default.
Will try the Wireshark approach. Good hint. Didn’t even think of it. The bootup log script is also a good idea. Will do that. Thanks man!
Check the following during this unknown network window:
- What does ipconfig /all show?
- Can you ping the gateway?
- What does arp -a show?
- Is there anything in the NCSI log?
Also are your wireless clients on a different VLAN than your wired clients? Does the firewall treat this traffic differently in any way? Does DHCP give out different DNS settings than wired?
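To put a number on the “unknown network” window, something like the following could time how long the gateway takes to answer after boot. It is only a sketch: `ping_once` and `seconds_until_reachable` are made-up names, and the “TTL=” check on Windows is a common workaround for ping exiting with 0 on “Destination host unreachable” replies, so treat it as an assumption.

```python
import platform
import subprocess
import time


def ping_once(host):
    """Send one ping; return True if the host answered."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", "1000", host]  # -w is in milliseconds
        result = subprocess.run(cmd, capture_output=True)
        # Assumption: a real echo reply contains "TTL=", an unreachable notice does not.
        return result.returncode == 0 and b"TTL=" in result.stdout
    cmd = ["ping", "-c", "1", "-W", "1", host]  # -W is in seconds
    return subprocess.run(cmd, capture_output=True).returncode == 0


def seconds_until_reachable(host, probe=ping_once, timeout_s=300.0, interval_s=1.0):
    """Poll `host` until it answers; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if probe(host):
            return time.monotonic() - start
        time.sleep(interval_s)
    return None
```

Running `seconds_until_reachable("192.168.1.1")` (hypothetical gateway address) right after login on a wired and on a wireless machine gives two figures to compare. A result of roughly the same size on every wired port would point away from the clients themselves.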
Will do the bootup-script! Good idea.
The wireless clients were on a different VLAN. I also changed that for troubleshooting. Now everything is on the same VLAN with the same firewall rules, which I also completely disabled. And no, DHCP is the same for all too.
Still sounds like an NCSI issue. You might have active probing disabled or it’s not working.
Will investigate. Thanks for the hint.
Sounds like NLA being stupid. Are you seeing any actual problems, or is it just the flavor text being delayed?
Silly question but have you verified they don’t have a connection? Maybe try pinging 1.1.1.1 to see if it is just a detection error.
Already solved, thanks anyway man!
Do you experience the issue when you boot into “Safe mode with networking” on one of the Windows machines?
See my update on the problem. Already solved. Thanks for trying anyway, mate!