netkvm with lots of parallel network activity hanging Win10 VM #1058

Open
dasoussan opened this issue Mar 9, 2024 · 22 comments

@dasoussan

dasoussan commented Mar 9, 2024

HangCallStacks-Full.txt
Scenario summary:

Ubuntu 22.04 running Oracle VirtualBox with a Win10 VM underneath; 4 cores & 16 GB RAM are allocated to the VM, out of 32 GB total in the machine. Based on crash dump analysis, we believe the NetKVM / VirtIO driver is hanging, spinning on a lock and waiting in virtqueue_get_buf_split.

Details (sorry, this is long - I'm summarizing as best I can):

The VM is running our own code, which previously ran on Win7 and runs fine on standalone Win10. We are migrating portions of it (eventually all of it) over to Linux, but for now we are in a hybrid environment. The hardware-emulated NICs weren't giving the performance we wanted, as this main system has 2 x 10G fiber interfaces; when we tried the NetKVM / VirtIO drivers things performed so much better, so that is where we went.

With ~30 devices on the network we would occasionally see some kind of VM hang. VirtualBox would say:

VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago.

We thought it was a VirtualBox or file system problem, as we saw the hang when 30-50 devices were all trying to pull files from the main server.

On a system with ~200+ devices on the network it hangs more often, and now not when the devices are pulling files but when the system is "idle". I put "idle" in quotes because even then the server is constantly querying every device every few seconds with a ping and/or a 'How are you doing today?' message, devices are constantly reporting their status via multicast messages, and most devices are listening to those multicasts.

Thus even an idle system, while not moving GB/s of traffic, generates a lot of messages per unit time; a quick sniff counted > 350 packets per second.

With the hang happening more often, I've instrumented the Win10 VM to generate a crash dump on NMI.
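
For reference, the NMI can be injected from the VirtualBox host with VBoxManage; a minimal sketch, assuming a VM named "Win10" (the name is a placeholder, and the guest must already be configured to crash-dump on NMI as described above):

# Host side (Ubuntu): inject an NMI into the running guest so it bug-checks and writes a dump
VBoxManage debugvm "Win10" injectnmi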

Hang: the VM appears to be running but is completely unresponsive to any keyboard or mouse input, the clock at the bottom right of the Win10 screen has stopped updating, the network stack does not respond to pings, and we see no data egressing from the VM when checked with Wireshark. The VM does respond to the NMI, so I've got that going for me.

A lot of hoops and spelunking later, I managed to find a symbol file for the version of netkvm we were running, get it recognized by WinDbg, and can finally show call stacks for the 4 vCPUs with actual function names and line numbers instead of the raw addresses I was seeing before.

Here is the call stack of the CPU that received the NMI - I filtered out the uninteresting hex columns - so the first 8 stack entries are all related to the NMI and generating the crash dump. Right under that it is in netkvm:

0: kd> !thread
THREAD fffff80369727a00  Cid 0000.0000  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap                 ffffb20b8ee36540
Owning Process            fffff80369724a00       Image:         Idle
Attached Process          ffffde034267d040       Image:         System
Wait Start TickCount      128224         Ticks: 334059 (0:01:26:59.671)
Context Switch Count      528260         IdealProcessor: 0             
UserTime                  00:00:00.000
KernelTime                00:29:47.765
Win32 Start Address nt!KiIdleLoop (0xfffff80368e01f10)
Stack Init fffff8036c874c90 Current fffff8036c874c20
Base fffff8036c875000 Limit fffff8036c86f000 Call 0000000000000000
Priority 0 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 5
: Call Site
: nt!KeBugCheckEx
: nt!HalBugCheckSystem+0x7a
: PSHED!PshedBugCheckSystem+0x10
: nt!WheaReportHwError+0x46e
: nt!HalHandleNMI+0x142
: nt!KiProcessNMI+0x134
: nt!KxNmiInterrupt+0x82
: nt!KiNmiInterrupt+0x212 (TrapFrame @ fffff803`6c897e50)
: netkvm!virtqueue_get_buf_split+0x9d [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\VirtIO\VirtIORing.c @ 325] 
: netkvm!virtqueue_get_buf+0x5f (Inline Function @ fffff803`6ea01c26) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\Virtio\virtio.h @ 87] 
: netkvm!CVirtQueue::GetBuf+0x63 (Inline Function @ fffff803`6ea01c26) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-VirtQueue.h @ 208] 
: netkvm!CTXVirtQueue::ReleaseTransmitBuffers+0xae [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_VirtQueue.cpp @ 294] 
: netkvm!CTXVirtQueue::ProcessTXCompletions+0x38 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_VirtQueue.cpp @ 327] 
: netkvm!CParaNdisTX::DoPendingTasks::__l2::<lambda_2bb8197fb2982050d71283ed06b8d199>::operator()+0x10 (Inline Function @ fffff803`6e9fee61) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 725] 
: netkvm!CParaNdisTX::DoWithTXLock+0x1a (Inline Function @ fffff803`6e9fee61) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-TX.h @ 237] 
: netkvm!CParaNdisTX::DoPendingTasks+0x75 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 723] 
: netkvm!CParaNdisTX::NBLMappingDone+0x3f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 432] 
: netkvm!CNBL::RegisterMappedNB+0x27 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 93] 
: nt!HalpAllocateAdapterCallbackV3+0xa9
: nt!HalpBuildScatterGatherList+0x1af
: nt!HalBuildScatterGatherListV3+0x73
: ndis!NdisMAllocateNetBufferSGList+0x299
: netkvm!CNB::ScheduleBuildSGListForTx+0x2f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 784] 
: netkvm!CNBL::StartMapping::__l2::<lambda_3029174f739c49f49e8fda0852d91fd3>::operator()+0x8 (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 285] 
: netkvm!CNdisList<CNB,CRawAccess,CNonCountingObject>::ForEachPrepareIf+0x1e (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 465] 
: netkvm!CNdisList<CNB,CRawAccess,CNonCountingObject>::ForEach+0x1e (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 446] 
: netkvm!CNBL::StartMapping+0x48 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 283] 
: netkvm!CParaNdisTX::Send+0x13c [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 411] 
: netkvm!ParaNdis6_SendNetBufferLists+0x83 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Driver.cpp @ 460] 
: ndis!ndisMSendNBLToMiniportInternal+0xff
: ndis!ndisMSendNBLToMiniport+0xe
: ndis!ndisCallSendHandler+0x59
: ndis!ndisInvokeNextSendHandler+0x191
: ndis!NdisSendNetBufferLists+0x17d
: tcpip!FlpSendPacketsHelper+0x2bd
: tcpip!Fl48pSendArpPacket+0xcb
: tcpip!Fl48SendNeighborSolicitation+0x22
: tcpip!Ipv4pSendNeighborSolicitation+0x60
: tcpip!IppSendNeighborSolicitation+0xf3
: tcpip!IppNeighborSetTimeout+0x3c4

The other 3 call stacks all look similar:

1: kd> !thread
THREAD ffffc901089cb140  Cid 0000.0000  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 1
Not impersonating
DeviceMap                 ffffb20b8ee36540
Owning Process            fffff80369724a00       Image:         Idle
Attached Process          ffffde034267d040       Image:         System
Wait Start TickCount      125774         Ticks: 336509 (0:01:27:37.953)
Context Switch Count      497827         IdealProcessor: 1             
UserTime                  00:00:00.000
KernelTime                00:29:53.109
Win32 Start Address nt!KiIdleLoop (0xfffff80368e01f10)
Stack Init ffffb00fd4e29c90 Current ffffb00fd4e29c20
Base ffffb00fd4e2a000 Limit ffffb00fd4e24000 Call 0000000000000000
Priority 0 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 0
Child-SP          RetAddr           : Args to Child                                                           : Call Site
: nt!KxWaitForSpinLockAndAcquire+0x2f
: nt!KeAcquireSpinLockAtDpcLevel+0x5c
: netkvm!CNdisSpinLock::LockDPR+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 137] 
: netkvm!CDPCSpinLockedContext::{ctor}+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 226] 
: netkvm!CParaNdisTX::DoWithTXLock+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-TX.h @ 236] 
: netkvm!CParaNdisTX::DoPendingTasks+0x65 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 723] 
: netkvm!ParaNdis_RXTXDPCWorkBody+0xd1 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_Common.cpp @ 1811] 
: netkvm!MiniportInterruptDPC+0x5f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Impl.cpp @ 380] 
: ndis!ndisInterruptDpc+0x3c58a
: nt!KiExecuteAllDpcs+0x30e
: nt!KiRetireDpcList+0x1f4
: nt!KiIdleLoop+0x9e

I'm attaching the full kernel-mode call stacks; the above is included as a teaser. I don't have a complete memory dump of a hang, as that requires increasing the pagefile size by a non-trivial amount and that could make the bug disappear, so I'm being as minimally invasive as possible for now.

This all happened with the Red Hat NDIS Miniport Driver version 100.92.104.22900 (as reported by the netkvm.sys file properties in Windows 10), which I believe corresponds to version 229 of the driver and the associated source / executable files I found here:

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/

Version 229 dates back to when I was looking for better performance, and I see there is a v248 driver out now. I took both code bases and did some side-by-side comparisons; while I didn't see any differences in VirtIORing.c, I did see some in ParaNdis_TX.cpp near where we hang, but I didn't drill into the code in detail. There were enough differences that I'm forking one thread of investigation: attempting to reproduce the problem with the v248 driver from the same site. It hasn't hung yet, but it also hasn't been cooking for very long, and this is quite an elusive, intermittent issue.

The other thread of investigation is to detail what I've seen here, on the off chance this hang condition has already been found and fixed; I couldn't find a matching scenario in the various bugs I've perused.

I'll post back if I get another hang.

Thank you!
David Soussan

@dasoussan
Author

Over the weekend, I created a stress test scenario and can now replicate the problem on multiple machines with both the older and newer drivers. I haven't done the crash dump analysis on the .dmp files yet, but wanted to get this info into your hands sooner rather than later.

The stress test consists of a batch file spawning 50-70 iperf3 instances on different ports in Windows 10, all in server mode.

From a different Linux box - normal Ubuntu 22.04 - I run 50-70 iperf3 clients, half in normal and half in -R (reverse) mode, each doing 1 Mb/s of UDP. The hang happens anywhere from 2-8 hours into this test.

Batch file on the client: KillMe.bat. Script on the server: KillYou.sh. I copy the appropriate column of the attached spreadsheet out to a script file.

KillMeKillYou.xlsx

@YanVugenfirer
Collaborator

@dasoussan I suggest testing the scenario on QEMU as well.

It might be a "familiar" case of TX packets getting stuck on the host side. We have encountered such issues before, and in the vast majority of cases they were due to the host network stack.
It is also worth sending the NMI a couple more times to check whether the driver is spending its time in the same place.

@YanVugenfirer
Collaborator

@dasoussan can you please post the scripts in the comment?

@dasoussan
Author

Batch file KillMe (runs on the Windows 10 side):

start f:\das\iperf321\iperf3.exe -s -p 5001
start f:\das\iperf321\iperf3.exe -s -p 5002
start f:\das\iperf321\iperf3.exe -s -p 5003
start f:\das\iperf321\iperf3.exe -s -p 5004
start f:\das\iperf321\iperf3.exe -s -p 5005
start f:\das\iperf321\iperf3.exe -s -p 5006
start f:\das\iperf321\iperf3.exe -s -p 5007
start f:\das\iperf321\iperf3.exe -s -p 5008
start f:\das\iperf321\iperf3.exe -s -p 5009
start f:\das\iperf321\iperf3.exe -s -p 5010
start f:\das\iperf321\iperf3.exe -s -p 5011
start f:\das\iperf321\iperf3.exe -s -p 5012
start f:\das\iperf321\iperf3.exe -s -p 5013
start f:\das\iperf321\iperf3.exe -s -p 5014
start f:\das\iperf321\iperf3.exe -s -p 5015
start f:\das\iperf321\iperf3.exe -s -p 5016
start f:\das\iperf321\iperf3.exe -s -p 5017
start f:\das\iperf321\iperf3.exe -s -p 5018
start f:\das\iperf321\iperf3.exe -s -p 5019
start f:\das\iperf321\iperf3.exe -s -p 5020
start f:\das\iperf321\iperf3.exe -s -p 5021
start f:\das\iperf321\iperf3.exe -s -p 5022
start f:\das\iperf321\iperf3.exe -s -p 5023
start f:\das\iperf321\iperf3.exe -s -p 5024
start f:\das\iperf321\iperf3.exe -s -p 5025
start f:\das\iperf321\iperf3.exe -s -p 5026
start f:\das\iperf321\iperf3.exe -s -p 5027
start f:\das\iperf321\iperf3.exe -s -p 5028
start f:\das\iperf321\iperf3.exe -s -p 5029
start f:\das\iperf321\iperf3.exe -s -p 5030
start f:\das\iperf321\iperf3.exe -s -p 5031
start f:\das\iperf321\iperf3.exe -s -p 5032
start f:\das\iperf321\iperf3.exe -s -p 5033
start f:\das\iperf321\iperf3.exe -s -p 5034
start f:\das\iperf321\iperf3.exe -s -p 5035
start f:\das\iperf321\iperf3.exe -s -p 5036
start f:\das\iperf321\iperf3.exe -s -p 5037
start f:\das\iperf321\iperf3.exe -s -p 5038
start f:\das\iperf321\iperf3.exe -s -p 5039
start f:\das\iperf321\iperf3.exe -s -p 5040
start f:\das\iperf321\iperf3.exe -s -p 5041
start f:\das\iperf321\iperf3.exe -s -p 5042
start f:\das\iperf321\iperf3.exe -s -p 5043
start f:\das\iperf321\iperf3.exe -s -p 5044
start f:\das\iperf321\iperf3.exe -s -p 5045
start f:\das\iperf321\iperf3.exe -s -p 5046
start f:\das\iperf321\iperf3.exe -s -p 5047
start f:\das\iperf321\iperf3.exe -s -p 5048
start f:\das\iperf321\iperf3.exe -s -p 5049
start f:\das\iperf321\iperf3.exe -s -p 5050
start f:\das\iperf321\iperf3.exe -s -p 5051
start f:\das\iperf321\iperf3.exe -s -p 5052
start f:\das\iperf321\iperf3.exe -s -p 5053
start f:\das\iperf321\iperf3.exe -s -p 5054
start f:\das\iperf321\iperf3.exe -s -p 5055
start f:\das\iperf321\iperf3.exe -s -p 5056
start f:\das\iperf321\iperf3.exe -s -p 5057
start f:\das\iperf321\iperf3.exe -s -p 5058
start f:\das\iperf321\iperf3.exe -s -p 5059
start f:\das\iperf321\iperf3.exe -s -p 5060
start f:\das\iperf321\iperf3.exe -s -p 5061
start f:\das\iperf321\iperf3.exe -s -p 5062
start f:\das\iperf321\iperf3.exe -s -p 5063
start f:\das\iperf321\iperf3.exe -s -p 5064
start f:\das\iperf321\iperf3.exe -s -p 5065
start f:\das\iperf321\iperf3.exe -s -p 5066
start f:\das\iperf321\iperf3.exe -s -p 5067
start f:\das\iperf321\iperf3.exe -s -p 5068
start f:\das\iperf321\iperf3.exe -s -p 5069
start f:\das\iperf321\iperf3.exe -s -p 5070
start f:\das\iperf321\iperf3.exe -s -p 5071

Shell script KillYou.sh (runs on the Linux 22.04 lab machine):
#!/bin/sh
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5001 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5002 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5003 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5004 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5005 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5006 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5007 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5008 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5009 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5010 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5011 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5012 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5013 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5014 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5015 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5016 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5017 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5018 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5019 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5020 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5021 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5022 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5023 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5024 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5025 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5026 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5027 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5028 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5029 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5030 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5031 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5032 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5033 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5034 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5035 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5036 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5037 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5038 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5039 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5040 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5041 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5042 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5043 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5044 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5045 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5046 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5047 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5048 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5049 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5050 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5051 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5052 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5053 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5054 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5055 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5056 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5057 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5058 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5059 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5060 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5061 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5062 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5063 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5064 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5065 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5066 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5067 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5068 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5069 &
iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p 5070 &
iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p 5071 &
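
For reference, the same spawning can be written as loops; a sketch equivalent to the listings above (the paths, server IP, and port range are taken from them):

rem KillMe.bat, loop form: start iperf3 servers on ports 5001-5071
for /L %%p in (5001,1,5071) do start f:\das\iperf321\iperf3.exe -s -p %%p

#!/bin/sh
# KillYou.sh, loop form: odd ports run normal clients, even ports run reverse (-R) clients
for p in $(seq 5001 5071); do
    if [ $((p % 2)) -eq 0 ]; then
        iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p "$p" &
    else
        iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p "$p" &
    fi
done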

@dasoussan
Author

Here is the traffic graph - it isn't a traffic volume issue:
(attached image: TrafficGraph)

The blue-brace area is where I was developing and tweaking the script; after that it ran a solid 3.0 MB/s up and down for a few hours and hung at the red lines. This was running the newer drivers, so I need to get the symbol file and convince WinDbg to digest it. More as I know it...

@dasoussan
Author

Attached are 3 more snips of crash dumps - all from ~50-70 iperf3 UDP runs on 2 different physical machines.
The first 2 are running the newer v248 version of netkvm, the latest I could find.

The file titled semi-hang was weird: the time still updated at the bottom right of the Windows VM taskbar, no windows were responding to any mouse or keyboard events, but network traffic was still flowing, so something was still running.

The other two were the solid lock-ups, time not updating, etc.

All three had the VBox log message

VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago.

Switching everything to QEMU isn't an option for many reasons.

I restarted the tests after switching both test machines to use the "Intel PRO/1000 MT Server (82545EM)" adapter, rebooted both virtual machines, and have been running the KillMe / KillYou scripts for ~7 hours each with no crashes so far. I'll keep them cooking at least overnight and report results in the morning.
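
For reference, the emulated NIC type can be switched from the host with VBoxManage while the VM is powered off; a minimal sketch, assuming a VM named "Win10" (the name is a placeholder):

# Intel PRO/1000 MT Server (82545EM) emulation on the first adapter
VBoxManage modifyvm "Win10" --nictype1 82545EM
# or back to the paravirtualized adapter handled by netkvm
VBoxManage modifyvm "Win10" --nictype1 virtio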

Semi-Hang20240311-1.txt
Hang20240311-2.txt
Hang20240311-3.txt

@YanVugenfirer
Collaborator

It looks like the driver is spending its time retrieving the descriptors of packets already transmitted by the host, and at the same time it is probably being flooded with RX packets.

How many NICs do you have in the guest? How many VMs are running simultaneously on the host?

Try to do the following (each is a separate experiment):

  1. Disable RSS - use the Device Manager advanced tab.
  2. In the registry (this parameter is not added to the INF by default, so you will have to add the value manually), under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\<subkey number depending on your NICs and network config; check that it has a reference to NetKVM>, add a "NumberOfHandledRXPacketsInDPC" DWORD value and set it to 100, then disable/enable the device in Device Manager. A sketch of adding the value follows this list.
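
For reference, a minimal sketch of adding the value from an elevated command prompt in the guest; the "0007" instance subkey is a placeholder - use whichever subkey references NetKVM on your system:

rem Add the per-adapter NetKVM parameter and cap RX packets handled per DPC at 100
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\0007" /v NumberOfHandledRXPacketsInDPC /t REG_DWORD /d 100
rem Then disable/enable the adapter in Device Manager so the driver re-reads its parameters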

@YanVugenfirer
Collaborator

BTW: while you're in Device Manager, please check on the Resources tab what kind of interrupts the device uses (legacy or MSI). You can take a screenshot. In general, MSI interrupts show up as "negative" numbers.

@dasoussan
Author

dasoussan commented Mar 12, 2024

I've now got > 24 hours running the same test on the same two machines, but with the "Intel PRO/1000 MT Server (82545EM)" NIC instead of the netkvm driver - no crashes, no hangs, just lots of iperf packets flying around.

All settings were the default.

Right now there is only one NIC in the guest Win10 VM. The host has 3 NICs: one 1G copper RJ45, plus either two 10G SFP+ fiber or one SFP+ 1G copper - it differs between the two testing units as they are in different projects. I say "right now" because originally we mapped all of the host's NICs into the guest, but months ago we went to a single-NIC-in-the-guest approach. Eventually the two SFP+ ports will be bonded on the host side, active-standby failover, and presented as one NIC to the guest machine. Not yet, but soon.

Only one VM running on the host.

Since I've got both machines in the Intel NIC configuration, I looked and didn't see an RSS setting in that NIC.

I converted one test machine back to VirtIO; RSS was enabled by default on the VirtIO adapter. I disabled it on that machine and am re-running the iperf3 stress tests. Cooking for an hour so far, nothing to report.

Looks like a legacy IRQ - 10; a screenshot is attached.
(attached screenshot: IRQ)

Thanks for looking into this!
David

@dasoussan
Author

A slight correction - I'm only seeing ~30 iperf3 sessions between the lab machine and the guest VM. I dug into it this morning and found the error "fatal error - console device allocation failure - too many consoles in use, max consoles is 32" when I tried to spawn one manually. The script probably did display the error, but so briefly that I never saw it. So I'm really doing 15 x 1 Mb/s iperf3 sessions to the VM and 15 from it.

@dasoussan
Author

Got a hang an hour ago, pulled a crash dump, analysis in the morning.

I applied the registry key (I had 3 entries in the registry from when we had 3 VirtIO NICs; not knowing which was current, I put the key/value in all of them).

Starting another stress test overnight. The last hang hit after ~15 hours. I only have one test machine for now; I gave the others back to the crew to use in development after changing them back to Intel's virtual NIC.

@dasoussan
Author

Both sets of thread stacks are here:

memory-hang20240312NoRSS.txt
memory-Hang20240312-2-RegistryMods.txt

These cover the no-RSS test case and the registry mods, as requested. I fresh-booted the VM before each test.

VM hangs look the same to me.

Thanks!

@dasoussan
Author

I've had 3 more hangs today. Not much point in posting the crash dumps - nothing new to see in them. Same fault in the same place.

Any ideas?

@YanVugenfirer
Collaborator

YanVugenfirer commented Mar 19, 2024

Can you test your setup on QEMU with MSI interrupts (they will be the default there)?
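
For reference, a minimal QEMU sketch with a virtio-net adapter (which uses MSI-X interrupts by default); the disk image, tap interface, and sizes are placeholders, not the exact setup discussed here:

# Host side: boot the Win10 image with KVM acceleration and a virtio-net NIC on a tap device
qemu-system-x86_64 -machine q35,accel=kvm -cpu host -smp 4 -m 16G \
    -drive file=win10.qcow2,if=virtio \
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
    -device virtio-net-pci,netdev=net0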

@dasoussan
Author

@YanVugenfirer, moving everything over to QEMU would only be doable if QEMU already supported all the same commands and machinery we use to manipulate VirtualBox - so it is not a practical path forward.

Curiously #1: I set everything up on completely different hardware - a Dell OptiPlex micro form factor PC with the same Linux image - and was able to replicate the hang condition with the NetKVM driver.

We have switched to the Intel driver and are running fine, so our immediate need is solved by not using these drivers.

Curiously #2: we weren't running fine initially, as it turns out the VBox Intel driver on the hardware we have does not support TCP large segment offload. It took a lot of debugging to discover this; things got better once we disabled that in the Windows 10 VM.

Curiously #3: I've tried running the iperf3 stress test with Driver Verifier watching the NetKVM driver and have yet to replicate a failure in that scenario. No doubt the quantum-physics observer effect.

@YanVugenfirer
Collaborator

Driver Verifier adds delays to the system, so timing bugs might disappear.

I am still puzzled as to why VirtualBox does not support MSI interrupts. This could well be a bug in the driver, triggered by a scenario we have not encountered on QEMU for the last ~8 years.

@dasoussan
Author

The VM was initially set up with the PIIX3 virtual system board instead of the ICH9 board. Some light reading indicates that might be why MSI interrupts weren't supported. It's something I can try on one of my test boxes soon.
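
For reference, the chipset can be switched from the host with VBoxManage while the VM is powered off; a minimal sketch, again assuming a VM named "Win10" as a placeholder:

VBoxManage modifyvm "Win10" --chipset ich9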

@dasoussanATRO

After switching the Win10 VM to the ICH9 chipset instead of PIIX3 and putting the VirtIO / netkvm NIC back in place, when I check the NIC's properties I don't even see a Resources tab anymore.
(screenshot attached)

I'm guessing that means it is using MSI interrupts?

In this configuration I started up a stress test to see what happens. Stay tuned, more to come...

@YanVugenfirer
Collaborator

Strange... you should still be able to see the Resources tab with MSI interrupts as well.

@dasoussan
Author

Changing to the ICH9 chipset does improve the VM's CPU usage when it is sitting mostly idle, but it does not resolve the NetKVM spinlock hang.

@ybendito
Collaborator

@dasoussan It would be good to look at the crash dump itself. We can get more information from it than from the stack dump.

@dasoussanATRO

I've asked whether I can publish that, but I don't think I'll be allowed to. I can run whatever you want against the dump file and publish the results, assuming nothing inside it is something I can't share.
