Mieze (Author) Posted August 7, 2015

"Just as I write this, I managed to reproduce the problem (still doing backups and testing simultaneously ... it must have found some unique data to send). This time it happened with my DSDT patched to get rid of the GBES declarations/tests as you described. So, as I predicted, the DSDT is not the problem. I'll keep looking, but given that it only happens in a relatively rare heavy traffic scenario, I'm not as worried about it."

The patch I developed was originally made for 9 series mainboards, which also had a code sequence writing to the PCI power management register (see below), so I'm not surprised that it didn't help you, as your DSDT has no write access to the register.

    Device (GLAN)
    {
        Name (_ADR, 0x00190000)
        OperationRegion (GLBA, PCI_Config, Zero, 0x0100)
        Field (GLBA, AnyAcc, NoLock, Preserve)
        {
            DVID,   16,
            Offset (0xCC),
            Offset (0xCD),
            PMEE,   1,
                ,   6,
            PMES,   1
        }
        Method (_PRW, 0, NotSerialized)
        {
            Return (GPRW (0x0D, 0x04))
        }
        Method (_DSW, 3, NotSerialized)
        {
            Store (Arg0, PMEE)
        }
        Method (GPEH, 0, NotSerialized)
        {
            If (LEqual (DVID, 0xFFFF))
            {
                Return (Zero)
            }
            If (LAnd (PMEE, PMES))
            {
                Store (One, PWST)
                Store (One, PMES)
                Notify (GLAN, 0x02)
            }
        }
    }

I've logged and analyzed dozens of these incidents and found nothing. I checked the NIC's config registers, the descriptors and the packets without finding any hint. There is no indication of a systematic error and no common ground for these transmitter deadlocks. Most of the logged packets were TSO operations, which is no wonder when you are moving large amounts of data with TCP, but I also found small ACKs and UDP datagrams. The packets as well as the descriptors were correct and matched each other. The fact that it only happens under load is just a consequence of the design: you can't have a transmitter deadlock without transmitter activity, and you need transmitter activity in order to detect it.

The reason why I'm quite sure that something is interfering is that users reported the same issue with the Realtek and the Atheros drivers too. For example, see http://www.insanelymac.com/forum/topic/300056-solution-for-qualcomm-atheros-ar816x-ar817x-and-killer-e220x/?p=2128204. In those cases where the issue was resolved, it turned out to be related to power management or a wrong BIOS setting, that is, external interference.

Mieze

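The remark that a deadlock can only be detected while the transmitter is active can be illustrated with a small sketch. This is a simplified model and not the IntelMausi code: `TxRingState`, `TxStallDetector` and the tick threshold are hypothetical names and values chosen for the example. The idea is that a watchdog only flags a stall when descriptors are pending and the oldest unfinished descriptor has not moved for several ticks; an idle ring never triggers it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of a transmit ring. These names are
// illustrative and are not taken from the IntelMausi source.
struct TxRingState {
    uint32_t dirtyIndex;  // oldest descriptor the NIC has not completed yet
    uint32_t nextIndex;   // next descriptor the driver will fill
};

class TxStallDetector {
public:
    explicit TxStallDetector(uint32_t threshold) : threshold_(threshold) {
        assert(threshold > 0);
    }

    // Call once per watchdog tick. Returns true once work has been pending
    // for `threshold` consecutive ticks without the dirty index advancing.
    bool tick(const TxRingState &ring) {
        const bool pending = ring.dirtyIndex != ring.nextIndex;
        const bool progressed = !primed_ || ring.dirtyIndex != lastDirty_;

        if (!pending || progressed) {
            // Idle ring or visible progress: no stall can be inferred.
            stalledTicks_ = 0;
        } else {
            ++stalledTicks_;
        }
        lastDirty_ = ring.dirtyIndex;
        primed_ = true;
        return stalledTicks_ >= threshold_;
    }

private:
    uint32_t threshold_;
    uint32_t lastDirty_ = 0;
    uint32_t stalledTicks_ = 0;
    bool primed_ = false;
};
```

With a threshold of 2, a ring that sits at the same dirty index with packets queued is reported on the third tick after the last progress, while a ring with nothing queued is never reported, no matter how long it stays unchanged.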
RehabMan Posted August 7, 2015

It would be nice if there was a way to fix the stalled transmitter without bringing down the link. Possible?

Mieze (Author) Posted August 7, 2015

In order to recover from this condition you need to reset the NIC, and that will be difficult to achieve without losing the link. Of course you don't need to tell the network stack about it, but I doubt that this is a good idea because of the side effects. Trying to find the cause is a more promising approach from my point of view.

Mieze

RehabMan Posted August 8, 2015

"In order to recover from this condition you need to reset the NIC and it will be difficult to achieve without losing the link. Of course you don't need to tell the network stack about it but I doubt that this is a good idea because of the side effects."

I thought of the same idea (delay reporting the link-down condition until it is clear it is not coming back up), but I didn't even bother because of the rarity of this problem and because doing so would require a relatively long delay (10 sec, maybe more). That said, it would be possible to restrict this delay to the case of a forced reset due to a deadlocked transmitter, which makes it a bit more acceptable. I think if the link didn't go down, the system would recover better from the problem. But I understand your reluctance to hack around this problem...

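The idea of holding back the link-down report only after a forced reset could look roughly like the sketch below. It is purely illustrative: `DeferredLinkDown`, `LinkEvent` and the way time is passed in are assumptions made for the example, not part of any driver, and the 10 second grace period is simply the figure mentioned above.

```cpp
#include <cassert>
#include <cstdint>

enum class LinkEvent { None, ReportDown };

// Hypothetical helper: decides when a lost link should be reported to the
// network stack. Only drops caused by a forced Tx-stall reset are deferred.
class DeferredLinkDown {
public:
    explicit DeferredLinkDown(uint64_t graceMs) : graceMs_(graceMs) {
        assert(graceMs > 0);
    }

    // The link dropped at time nowMs (milliseconds, any monotonic clock).
    LinkEvent linkLost(uint64_t nowMs, bool forcedReset) {
        if (!forcedReset) {
            pending_ = false;
            return LinkEvent::ReportDown;  // ordinary drop: report at once
        }
        pending_ = true;                   // forced reset: wait and see
        deadlineMs_ = nowMs + graceMs_;
        return LinkEvent::None;
    }

    // The link came back before the deadline: forget the pending report.
    void linkRestored() { pending_ = false; }

    // Call periodically; reports the drop once the grace period has expired.
    LinkEvent poll(uint64_t nowMs) {
        if (pending_ && nowMs >= deadlineMs_) {
            pending_ = false;
            return LinkEvent::ReportDown;
        }
        return LinkEvent::None;
    }

private:
    uint64_t graceMs_;
    uint64_t deadlineMs_ = 0;
    bool pending_ = false;
};
```

The trade-off raised in the thread stays visible here: while a report is pending, the stack believes the link is up even though nothing can be sent, which is the kind of side effect that makes this a workaround rather than a fix.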
Mieze (Author) Posted August 8, 2015

As long as you don't know exactly what went wrong, the only reliable method to restore full operation is a complete reset. Once you have located the cause of the problem, it's usually easier to eliminate it than to create a workaround.

Mieze

The Edge3000 Posted August 9, 2015

It appears I am having the same problem as RehabMan. I get the same error after extended high-speed transfers to a NAS storage device:

    kernel[0]: Ethernet [IntelMausi]: Tx stalled? Resetting chipset. txDirtyDescIndex=796, STATUS=0x40080083, TCTL=0x3103f0fa.

This is the device I am using:

    Intel 82579V PCI Express Gigabit Ethernet:
      Name: Intel Ethernet Controller
      Type: Ethernet Controller
      Bus: PCI
      Slot: Built In
      Vendor ID: 0x8086
      Device ID: 0x1503
      Subsystem Vendor ID: 0x1458
      Subsystem ID: 0xe000
      Revision ID: 0x0004
      BSD name: en0
      Kext name: IntelMausiEthernet.kext
      Location: /System/Library/Extensions/FakeSMC.kext/Contents/PlugIns/IntelMausiEthernet.kext
      Version: 2.0.0

As seen in the iStat screenshot, the transfers can go on for quite a while before failing, whereas at other times they fail after just a short period. I'm using rsync to copy over a lot of files, with an until loop that retries the command after 10 seconds until it exits cleanly. Directory-intensive transfers lower the overall throughput, so they rarely stall, whereas large movie transfers stall more frequently. Another thing I have noticed is that if I create more "noise" in the traffic, for example by browsing a lot of directories on the NAS share or opening Quick Look previews while the large transfer is going, it seems to crash more often.

Attached are my DSDT (which does not appear to contain any special power-management calls like the one in your previous post), a system log of all IntelMausi debug errors over a long period of time, and one log of all the system events surrounding one crash. There are no specific power management features that I can toggle in the BIOS, and I get crashes whether WOL is enabled or not. The Ethernet configuration is set to basic full-duplex as recommended. Even with EEE and flow control the crashes persist, although flow control does not appear to be enabled/supported on my hardware.

    en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500 index 4
        eflags=8c0<ACCEPT_RTADV,TXSTART,ARPLL>
        options=6b<RXCSUM,TXCSUM,VLAN_HWTAGGING,TSO4,TSO6>
        ether 90:2b:34:XX:XX:XX
        inet6 fe80::922b:34ff:XXXX:XXXX%en0 prefixlen 64 scopeid 0x4
        inet 192.168.1.2 netmask 0xffffff00 broadcast 192.168.1.255
        inet6 2002:4b6f:e5a4::922b:34ff:XXXX:XXXX prefixlen 64 autoconf
        inet6 2002:4b6f:e5a4::a85c:e70a:223a:2590 prefixlen 64 autoconf temporary
        nd6 options=1<PERFORMNUD>
        media: autoselect (1000baseT <full-duplex>)
        status: active
        type: Ethernet
        link quality: 100 (good)
        scheduler: QFQ
        link rate: 1.00 Gbps

I appreciate all the work you have done on this, and thank you for looking into this problem.

Attachments: Logs.zip, DSDT.aml.zip

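For reading the STATUS value in the "Tx stalled?" log line, a small decoder can help. The sketch below assumes the device status bit layout used by the register definitions in the Linux e1000e driver (full duplex 0x01, link up 0x02, transmission paused 0x10, speed field 0xC0); the constant and type names are made up for this example, and only the low byte is decoded.

```cpp
#include <cstdint>

// Bit positions assumed from the Linux e1000e register definitions.
// The k... names are hypothetical and exist only in this example.
constexpr uint32_t kStatusFullDuplex = 0x00000001;
constexpr uint32_t kStatusLinkUp     = 0x00000002;
constexpr uint32_t kStatusTxOff      = 0x00000010;
constexpr uint32_t kStatusSpeed100   = 0x00000040;
constexpr uint32_t kStatusSpeed1000  = 0x00000080;

struct LinkStatus {
    bool fullDuplex;
    bool linkUp;
    bool txPaused;   // transmission paused by flow control
    int  speedMbps;
};

// Decodes the low byte of the device status register.
LinkStatus decodeStatus(uint32_t status) {
    LinkStatus s{};
    s.fullDuplex = (status & kStatusFullDuplex) != 0;
    s.linkUp     = (status & kStatusLinkUp) != 0;
    s.txPaused   = (status & kStatusTxOff) != 0;
    if (status & kStatusSpeed1000) {
        s.speedMbps = 1000;
    } else if (status & kStatusSpeed100) {
        s.speedMbps = 100;
    } else {
        s.speedMbps = 10;
    }
    return s;
}
```

Under these assumptions, `decodeStatus(0x40080083)` yields link up, full duplex, 1000 Mb/s and transmission not paused, which agrees with the ifconfig output in the post: at the moment of the stall the link itself looked healthy.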
Mieze (Author) Posted August 9, 2015

@RehabMan: Intel's datasheets for the 82579 and the I217 contain advice with regard to the transmit descriptor handling policy (attached as a screenshot: Bildschirmfoto 2015-08-09 um 18.49.58.png).

The strange thing is that neither the Windows nor the Linux driver follows this advice, but on my I217, which was also affected by random tx deadlocks in early development versions of the driver, setting TXDCTL=0 eliminated the problem. That's why I added this workaround in version 2.0.0d2, but as I don't have an 82579 to test on, I haven't been able to verify it on this NIC.

    /**
     * intelConfigureTx - Configure Transmit Unit after Reset
     * @adapter: board private structure
     *
     * Configure the Tx unit of the MAC after a reset.
     **/
    void IntelMausi::intelConfigureTx(struct e1000_adapter *adapter)
    {
        struct e1000_hw *hw = &adapter->hw;
        UInt32 tctl, tarc;
        UInt32 txdctl;

        /* Setup the HW Tx Head and Tail descriptor pointers */
        intelInitTxRing();

        /* Set the Tx Interrupt Delay register */
        intelWriteMem32(E1000_TIDV, adapter->tx_int_delay);

        /* Tx irq moderation */
        intelWriteMem32(E1000_TADV, adapter->tx_abs_int_delay);

        txdctl = intelReadMem32(E1000_TXDCTL(0));

        if (chipType == board_pch_lpt) {
            txdctl = 0;
            intelWriteMem32(E1000_TXDCTL(0), txdctl);
        }
        /* erratum work around: set txdctl the same for both queues */
        intelWriteMem32(E1000_TXDCTL(1), txdctl);

        /* Program the Transmit Control Register */
        tctl = intelReadMem32(E1000_TCTL);
        tctl &= ~E1000_TCTL_CT;
        tctl |= E1000_TCTL_PSP | E1000_TCTL_RTLC |
                (E1000_COLLISION_THRESHOLD << E1000_CT_SHIFT);

        /* errata: program both queues to unweighted RR */
        if (adapter->flags & FLAG_TARC_SET_BIT_ZERO) {
            tarc = intelReadMem32(E1000_TARC(0));
            tarc |= 1;
            intelWriteMem32(E1000_TARC(0), tarc);
            tarc = intelReadMem32(E1000_TARC(1));
            tarc |= 1;
            intelWriteMem32(E1000_TARC(1), tarc);
        }
        intelWriteMem32(E1000_TCTL, tctl);

        hw->mac.ops.config_collision_dist(hw);
    }

EDIT: Checking the source code again, I discovered that the workaround I described above is only applied to the I217 and I218, while the 82579 still uses the default transmit descriptor handling policy. Please change

    if (chipType == board_pch_lpt) {

into

    if ((chipType == board_pch_lpt) || (chipType == board_pch2lan)) {

in order to apply it to the 82579 too. Please report back. In case of a positive result I will include the workaround in the next update.

Good luck!

Mieze

Edited August 9, 2015 by Mieze

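The effect of the suggested one-line change can be isolated in a tiny stand-alone function. In this sketch `board_pch_lpt` and `board_pch2lan` are the constants named in the post, while the `BoardType` enum, `board_other` and `selectTxdctl` are hypothetical stand-ins so that the logic compiles outside the driver.

```cpp
#include <cstdint>

// Stand-in for the driver's chip type constants. board_other is a
// hypothetical placeholder for every chip not covered by the workaround.
enum BoardType { board_pch2lan, board_pch_lpt, board_other };

// Returns the value that ends up in TXDCTL for both queues: zero for the
// chips covered by the workaround, the value read from the register otherwise.
uint32_t selectTxdctl(BoardType chipType, uint32_t currentTxdctl) {
    if ((chipType == board_pch_lpt) || (chipType == board_pch2lan)) {
        return 0;
    }
    return currentTxdctl;
}
```

With the original condition, only `board_pch_lpt` would take the zero branch, so an 82579 (`board_pch2lan`) would keep whatever value the register held after reset.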
RehabMan Posted August 10, 2015

Thanks... I'll give it a try. No time to test for a while (due to the intermittent nature of the problem, it is very time consuming), but I'll let you know when I do.

Mieze (Author) Posted August 10, 2015

"What do you mean by: Call 'Archive' from the menu 'Product' and save the built driver."

The first clause should be clear. In order to save the driver, select it in the Organizer, which is opened automatically, and click "Export". Now select "Save Built Products", click "Next", select a directory where the products should be saved, for example the desktop, and confirm with "Export". Finally, open the saved folder in Finder and you'll find a subdirectory hierarchy called "System/Library/Extensions" in it. There you will find your driver.

Mieze

Mieze (Author) Posted August 10, 2015

In older versions of Xcode the button is called "Distribute…".

RehabMan Posted August 10, 2015

Xcode has seriously brain-damaged defaults when it comes to build products... Not sure what they were thinking...

I set Xcode->Preferences->Locations->Advanced->Custom to "Relative to Workspace". This way you can find your build results in ./Build relative to the project instead of some far-off place with a random garbage name.

The Edge3000 Posted August 12, 2015

Woohoo! I compiled with the changes you suggested above and it doesn't seem to be crashing! I copied a large file from one network share to another (maximally stressing both upload and download) and generated random I/O on the network as well. Usually this guarantees a crash, but it's been holding up this time!

However, the console appears to be flooded with

    Ethernet [IntelMausi]: replaceOrCopyPacket() failed.

and

    Ethernet [IntelMausi]: Not enough descriptors. Stalling.
    Ethernet [IntelMausi]: Restart stalled queue!

Other than that it is, at least initially, working. I'll report back if I get any disconnects, but thanks for the fix! I attached the binary for 10.10 for anyone who can't compile it themselves.

Attachments: Console.log.zip, IntelMausiEthernet.kext.zip

Mieze (Author) Posted August 12, 2015

@The Edge3000: Please note that you've built version 1.0.x, not 2.0.x, which is recommended for 10.9+.

Mieze

The Edge3000 Posted August 12, 2015

Oops, you are right. I don't use Xcode much, but I poked around a bit and got it to recompile as 2.0.0. And now... nooooo! It still failed like before, although it looks like it put up with more of a beating than the original V2 version used to take. Also, I might have posted prematurely about the v1.0.0 build, but I tested it hard for a good 30 minutes without failure, whereas this one failed after about 10 minutes.

Attachments: Console.log.zip, IntelMausiEthernetV2.kext.zip

Mieze (Author) Posted August 12, 2015

@The Edge3000: This is a hardware issue which is independent of the driver version. It affects version 1.0.x as well as 2.0.x. Please send me a complete ACPI dump of your machine (DSDT and SSDTs).

Mieze

Edited August 12, 2015 by Mieze

The Edge3000 Posted August 12, 2015

Attached is what I get using patchmatic -extractall. I have Clover drop a lot of unnecessary tables. GLAN is the device name used for the Ethernet controller. Thank you!

Attachment: ACPI.zip

Mieze (Author) Posted August 12, 2015

@The Edge3000: Looks ok. I found nothing which might interfere. Let's see what results RehabMan will have.

EDIT: Are you sure you have disabled all of these items in the UEFI setup?

Network stack: Disables or enables booting from the network to install a GPT format OS, such as installing the OS from the Windows Deployment Services server. (Default: Disable Link)
    IPv6 PXE Boot Support: Enables or disables IPv6 PXE Support. This item is configurable only when Network stack is enabled.
    IPv4 PXE Boot Support: Enables or disables IPv4 PXE Support. This item is configurable only when Network stack is enabled.
LAN PXE Boot Option ROM: Allows you to decide whether to activate the boot ROM integrated with the onboard LAN chip. (Default: Disabled)

Besides that, are you overclocking?

Mieze

Edited August 12, 2015 by Mieze

The Edge3000 Posted August 14, 2015

Hey Mieze, sorry for the late response. Work got in the way, etc. To answer your question, both of those options are disabled. I am overclocking, but only by raising the CPU multiplier. I am not messing with any voltages or overclocking the BCLK.

I don't know if this would have any effect or not, but my BIOS mod (GA-Z77X-UD5H BIOS F16 mod11) from the TweakTown forums includes the change "Intel GigabitLanX64 6.0.24 to 6.3.27".

Mieze (Author) Posted August 14, 2015

"I don't know if this would have any effect or not, but my BIOS mod (GA-Z77X-UD5H BIOS F16 mod11) per TweakTown forums includes the change 'Intel GigabitLanX64 6.0.24 to 6.3.27'"

Frankly, I don't know. What does the change do?

Mieze

Mieze (Author) Posted August 14, 2015

"To answer your question, both of those options are disabled."

Both? There are 4 options you have to disable! Please see http://www.insanelymac.com/forum/topic/300056-solution-for-qualcomm-atheros-ar816x-ar817x-and-killer-e220x/?p=2128204

Mieze

Gigabyte GA-Z170X Posted August 14, 2015

Wow, amazing work! Thank you so much. Do you have any news concerning the new I219-V from the Z170 chipset / Skylake platform?

Mieze (Author) Posted August 14, 2015

Support for the I219 will be added when Apple comes out with Skylake-based products.

Mieze

The Edge3000 Posted August 14, 2015

"Frankly, I don't know. What does the change do?"

I don't know either. It's a ROM in the BIOS. It probably has more to do with the boot-over-network logic than anything else, and I really don't think it could affect a driver.

"Both? There are 4 options you have to disable!"

Since the other two are just sub-options of Network stack, there are two options set to disabled in my BIOS. The other two only show up as options when Network stack is enabled.

Mieze (Author) Posted August 14, 2015

"Since the other two are just subsets of network stack, there are two options set to disabled in my BIOS. They only even show up as options with network stack enabled."

No, this isn't enough. The post I linked suggests that you should also disable the sub-options.

Mieze

The Edge3000 Posted August 14, 2015

Okay, thank you, that was something I missed. However, it looks like there is no change; I still get a bit of intermittent crashing.