Storage Filer Ethernet WTF



  • As a network engineer, I've been working with a server team to get new, space-age, 10 gig-capable storage filers online, along with a new virtualized server enclosure configuration. It has been interesting and tough, but also somewhat fun.

    The new filers arrived today. Dual 10 gig ports with modern-day Ethernet capabilities (VLANs, 802.1Q trunking, LACP, etc.). They were installed and brought online for testing. During testing, the filers had difficulty getting management IP addresses. In the course of troubleshooting, the storage team engineer and the filer engineers wanted to test having the switch port hard-set to 100 Mb/full duplex instead of auto-negotiating. I changed the port settings, but then realized that the port had auto-negotiated properly prior to my changing the settings. Our standard is auto-negotiation because it makes administration easier (less to worry about when reusing the port or if the port configuration has to be moved).

    So I asked my server team engineer, who inquired of the filer engineers. Their response? While the filers do support auto-negotiation, they don't recommend it because the filers can intermittently lose an auto-negotiated management connection. Reportedly this is written in their manual. I didn't bother to let them show it to me because I was already shaking my head. (In answer to your questions, the real issue was that DHCP was not configured on the filer interfaces.)

    So they can engineer a filer that can serve up multiple 10 gig streams, but they can't get a modern-day standard like Ethernet auto-negotiation stable??? WTF???



  • Ah, DHCP not enabled - that old chestnut. It took a disappointingly long time for many of us to convince a stubborn higher power here that it was more likely to be problematic to have 192.168.0.10 as a static address than it would be to have DHCP enabled.

    Engineer : "What happens if we choose our own static address and it clashes with something else on the customer's network?"

    PHB : "Errr,,.... But what happens if we ship with DHCP enabled and the network's not got a DHCP server on it? That would be worse!"

    Engineer : "No, it wouldn't. Consult the network admin, get a static address allocated and set it. No harm done"

    Many variations of that conversation took place over a period of months. DHCP is now enabled by default.

     



  • At least it was auto-negotiating to full-duplex!  A few years ago I was adminning the Sun servers at a rather large company.  To this day I don't know why, but about half of the time we brought one up to be jumpstarted, its NICs would decide that half-duplex was the way to go. It was awful. If we were lucky, we caught it and fixed it before the multi-gig jumpstart process took the whole day.  Even the maintenance console shells would grind to a halt as the NIC struggled to push a couple dozen kB across.

    I can barely understand why half-duplex should be supported at all...



  • @Xyro said:

    At least it was auto-negotiating to full-duplex!  A few years ago I was adminning the Sun servers at a rather large company.  To this day I don't know why, but about half of the time we brought one up to be jumpstarted, its NICs would decide that half-duplex was the way to go. It was awful. If we were lucky, we caught it and fixed it before the multi-gig jumpstart process took the whole day.  Even the maintenance console shells would grind to a halt as the NIC struggled to push a couple dozen kB across.

    I can barely understand why half-duplex should be supported at all...

     

    I'm presuming you're talking about a gig connection.

    Was it strictly half-duplex or was the speed screwed up as well?  We had a situation a few years back with new PCs where, for reasons unknown, sometimes their gig NICs would miss the negotiation and drop down to 10/half instead of 100/full as they should have.  (We still use 10/100 switches in the access layer except for special cases.)  I don't hear about it any longer; dunno if the problem resolved itself or if our computer techs stopped calling us because they understood it to be a known issue.

    The only reason I still support keeping half-duplex around is because I've had to troubleshoot a few rare problems that were only resolved by doing a capture using a hub.  In particular, trying to troubleshoot a problem with a host that involves multicast traffic is a bitch without a hub.  The regular port mirroring functionality on the switch doesn't duplicate the multicast traffic to the mirror destination.  More like a steamed-over mirror.



  •  This normally happens when one side is set to auto and the other side is hard-coded to full.  Since one side is hard-coded, it will not negotiate and the negotiation process fails.  When the process fails, the auto-negotiating port falls back to the most conservative setting - half duplex.
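
    For illustration, here's a minimal sketch of how that mismatch usually gets created on a Cisco-style port (the interface name is made up for the example); the commands are just the standard speed/duplex settings:

        interface FastEthernet0/1
         ! hard-coding speed and duplex turns auto-negotiation off on this side
         speed 100
         duplex full

    With the switch side locked like that, an auto-negotiating device on the other end sees no negotiation partner, parallel-detects the speed, and falls back to half duplex.  Hard-code both ends, or leave both on auto, and the mismatch goes away.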



  • @BTrey said:

     This normally happens when one side is set to auto and the other side is hard-coded to full.  Since one side is hard-coded, it will not negotiate and the negotiation process fails.  When the process fails, the auto-negotiating port falls back to the most conservative setting - half duplex.

     

    I knew that . . . I discounted that possibility since it was stated that it sometimes negotiated to full and sometimes not.  Also, for a gig connection full duplex is the default per the 802.3 standard, whereas half is the default for 10/100.  That's why I wondered if there was a speed negotiation issue too.



  • @nonpartisan said:

    sometimes their gig NICs would miss the negotiation and drop down to 10/half instead of 100/full as they should have.
    I skipped over a lot of details in the summary of the issue, but that is exactly what was happening: dropping to 10/half.  It was very odd that it didn't happen reliably... I know that some of the devices were indeed hard-set to use 100/full instead of negotiating (due to this problem), but the management console switches shouldn't have been.  Although... the switch was fairly intelligent and very configurable. I wonder if some of the physical ports were set to negotiate and some were not... hmm... then depending on which one the server happened to be plugged into to run the jumpstart, it would either behave as expected or not.  I don't think we ever checked for that...



  • @nonpartisan said:

    @BTrey said:

     This normally happens when one side is set to auto and the other side is hard-coded to full.  Since one side is hard-coded, it will not negotiate and the negotiation process fails.  When the process fails, the auto-negotiating port falls back to the most conservative setting - half duplex.

     

    I knew that . . . I discounted that possibility since it was stated that it sometimes negotiated to full and sometimes not.  Also, for a gig connection full duplex is the default per the 802.3 standard, whereas half is the default for 10/100.  That's why I wondered if there was a speed negotiation issue too.

     I hadn't even seen your response when I posted mine.  First guess for "sometimes negotiated to full and sometimes not" is that some of the switchports were hard-coded and some weren't.  (It isn't clear whether "sometimes" means the same machine if you reboot it or different machines.  I took it to mean that some of the machines came up correctly and some did not.)  Another possibility is buggy drivers.  We had an IOS upgrade on Cisco 3550s a couple of years back that caused all sorts of issues.  After some time troubleshooting, we identified that one particular brand/model of NIC wouldn't properly auto-negotiate with the switch and caused a duplex mismatch.  Other brands and other models of the same brand worked fine.  Don't know if the bug was caused by Cisco or the NIC driver, but we ended up having to hard-code both sides of the connection for all PCs with that model of NIC.



  • @nonpartisan said:

    The regular port mirroring functionality on the switch doesn't duplicate the multicast traffic to the mirror destination.  More like a steamed-over mirror.
    There's no port mode that makes the one port behave as if the entire switch were a hub? There's a way to make non-managed switches do it too.



  • @Lingerance said:

    @nonpartisan said:
    The regular port mirroring functionality on the switch doesn't duplicate the multicast traffic to the mirror destination.  More like a steamed-over mirror.
    There's no port mode that makes the one port behave as if the entire switch were a hub? There's a way to make non-managed switches do it too.
     

    Not that I'm aware of.  Our switches are made by a major name brand (think about if a female sibling started her own business).  When I turn on the port mirroring, the destination port receives normal multicast traffic intended for it (routing protocol hellos, spanning tree BPDUs, etc.) but does not receive copies of the multicast traffic that was destined for the source port.  If anyone has any ideas, I'm all ears. Having to track down a hub can be a pain in the ass at times.



  • @nonpartisan said:

    (think about if a female sibling started her own business).

     

    Screw political correctness and not naming names.  This is on a Cisco 3750 using a regular SPAN session.  I want to say that the 3500XL series did use to redirect multicast traffic as well as unicast, but I embarrassingly found out that the capture I took off the 3750 one day was incomplete at best.
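
    For context, the SPAN session in question was nothing exotic - just the usual two lines, something along these lines (interface numbers made up for the example):

        monitor session 1 source interface GigabitEthernet1/0/5 both
        monitor session 1 destination interface GigabitEthernet1/0/24

    Nothing fancy: mirror everything in and out of the source port over to the port the capture box is plugged into.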



  • @nonpartisan said:

    Screw political correctness and not naming names.  This is on a Cisco 3750 using a regular SPAN session.  I want to say that the 3500XL series did use to redirect multicast traffic as well as unicast, but I embarrassingly found out that the capture I took off the 3750 one day was incomplete at best.

     

      Per Cisco, "Multicast traffic can be monitored. For egress and ingress port monitoring, only a single unedited packet is sent to the SPAN destination port. It does not reflect the number of times the multicast packet is sent."  Are you monitoring a source port or source VLAN?  Any filtering in place?



  • @BTrey said:

      Per Cisco, "Multicast traffic can be monitored. For egress and ingress port monitoring, only a single unedited packet is sent to the SPAN destination port. It does not reflect the number of times the multicast packet is sent."  Are you monitoring a source port or source VLAN?  Any filtering in place?

     

    In answer to your question, I was monitoring an individual port. Direct capture, no filtering.  I generally prefer to capture with whatever ACLs or filters are in place on the switch so I can see what's actually coming in and going out, but I turn off all filtering on the capture software (generally Wireshark) so I don't introduce any extra WTFs.  I figure I can always filter things down later.

    I hate to think I'm TRWTF in this situation, but it seems I am.  When I went back and looked at some of my first captures related to this situation, I found that the CDP information matched the source port of the capture and not the destination.  I don't know if my thoughts were colored by a colleague telling me I wouldn't get an accurate multicast capture just by using a SPAN session or if I misread the capture.  Somehow I was under the impression that the multicast data I did get was incomplete or inaccurate (I was just starting to troubleshoot the issue in question and was at that point trying to collect data -- didn't have a hypothesis as to the cause, which turned out to be an IGMP v3 issue with the 3750 -- so I may have jumped to an early, incorrect conclusion about the accuracy of my capture).  But when I review the CDP information now, it is from the port that I was mirroring.  So I guess, indeed, I am TRWTF.

    (The IGMP v3 issue with the 3750 is that it DOES support v3 out of the box, but you need to configure "no ip igmp filter" globally and "ip igmp version 3" on the SVI to actually get it to process v3.  These aren't needed on other models (at least the "no ip igmp filter" command -- need to check about specifying v3 on the SVI).  Without these commands, the 3750 ignored v3, which caused an internal application timeout, which in turn caused that application to request the video stream be sent unicast.  This caused the multicast stream to cease for all the other monitoring stations.  Once Windows recognized the 3750 was using v2 (after the 3750 sent an announcement in v2 format), Windows started sending joins in v2, the application requested the camera feed in multicast again, and the multicast stream would come back.  But in the meantime, other monitoring stations would lose their feed because the cameras had ceased multicast and switched to unicast at the request of the one monitoring station that wasn't receiving the video feed.  It was a huge tug-of-war that generally started any time a monitoring station was rebooted, because Windows would default to IGMP v3 again.)
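
    Spelled out as a config sketch - just the two commands described above, with a made-up VLAN number for the SVI:

        no ip igmp filter
        !
        interface Vlan100
         ip igmp version 3

    With those in place, the 3750 would actually process the v3 joins from the monitoring stations.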

     



  • @nonpartisan said:

    The IGMP v3 issue with the 3750 is that it DOES support v3 out of the box, but you need to configure "no ip igmp filter" globally and "ip igmp version 3" on the SVI to actually get it to process v3
     

    It's not that I don't understand that at all. I mean, I know some of the words, like "with", and "box", and even "actually".



  • @b_redeker said:

    @nonpartisan said:

    The IGMP v3 issue with the 3750 is that it DOES support v3 out of the box, but you need to configure "no ip igmp filter" globally and "ip igmp version 3" on the SVI to actually get it to process v3
     

    It's not that I don't understand that at all. I mean, I know some of the words, like "with", and "box", and even "actually".

     

    IGMP is the Internet Group Management Protocol.  It's the protocol a router participates in to determine whether it needs to subscribe to a multicast stream or not.  If there are five routers on your network but hosts downstream from only two of them want a multicast stream, it doesn't make much sense for the other three routers to receive it.  So the end host says that it wants to participate in the multicast stream, the router subscribes to the stream, then sends the feed to the downstream hosts that want it.  Periodically, the router inquires to see if anyone still wants the stream.  If no one wants it, the router sends a message to cut it off.  There are three versions of this protocol.  v1 isn't used much any longer; it's mostly v2 and v3.
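
    If it helps to see it concretely, on the router side it only takes a couple of lines of config before it will listen to those IGMP joins at all - a rough sketch with made-up interface names, using the usual Cisco commands:

        ip multicast-routing
        !
        interface GigabitEthernet0/1
         ! enabling PIM on the host-facing interface also enables IGMP there
         ip pim sparse-mode
         ip igmp version 2

    The hosts themselves just send IGMP joins for the group they want; the router takes care of pulling the stream toward them.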

    I included the information because I've seen enough posts here ask follow-up questions like "So what was the issue with the 3750?"  I was just trying to pre-emptively answer such a question.



  • Thanks, that helped a lot (with a bit of additional googling/wikiing ;-).

    I'm trying to understand why you'd want multicast in a corporate network though. From what I read, this is mostly about video streams, games, etc. - exactly the kind of thing you usually don't want (unless it's a webcast from the CEO). Or am I missing something again?



  • @b_redeker said:

    I'm trying to understand why you'd want multicast in a corporate network though. From what I read, this is mostly about video streams, games, etc. - exactly the kind of thing you usually don't want (unless it's a webcast from the CEO). Or am I missing something again?

    He mentioned monitoring stations and video feeds, so I'm guessing these are either security camera feeds, or some internal corporate application for warehouse/process management or R&D or somesuch... Multiple cameras, multiple monitoring stations displaying the feeds, and you don't want to clog up the network with multiple streams for identical feeds, so you may as well use multicast...



  • @random_garbage said:

    @b_redeker said:
    I'm trying to understand why you'd want multicast in a corporate network though. From what I read, this is mostly about video streams, games, etc. - exactly the kind of thing you usually don't want (unless it's a webcast from the CEO). Or am I missing something again?

    He mentioned monitoring stations and video feeds, so I'm guessing these are either security camera feeds, or some internal corporate application for warehouse/process management or R&D or somesuch... Multiple cameras, multiple monitoring stations displaying the feeds, and you don't want to clog up the network with multiple streams for identical feeds, so you may as well use multicast...

     

    Indeed.  Multicast reduces the amount of traffic on your network.  If you have 10 hosts that need to receive data from one device, the device can either send 10 different UDP streams or it can send one multicast stream (1/10th of the data that would otherwise be needed for individual streams) and let the network figure it out.  Hosts that want the stream can subscribe and get the data on demand.  The larger the demand for the stream, the more efficiency you achieve.  Multicast is also popular with routing protocols for establishing neighbor relationships and providing route updates.

    In my particular situation, this is a patient care environment in which the cameras are being monitored from two different nursing stations plus a physicians' office.  There are four cameras providing data streams that need to be available to the three different monitoring stations.  The cameras can either send a single unicast stream or a single multicast stream.  Since there are multiple stations that need the data feed, we have the cameras configured for multicast, which reduces the total data on the network.



  • Connect an old HP switch to an equally old Cisco switch and they will auto-negotiate to 10 Mbit/half duplex.

    Wonderful stuff happens when you run your infrastructure on that kind of gear and it occasionally decides that today is a bad day and drops its configuration. On the other hand, it provided job security and a reason to bitch at management at the weekly meetings. (I am quite happy I left that place, to be honest.)



  • @nonpartisan said:

    Since there are multiple stations that need the data feed, we have the cameras configured for multicast, which reduces the total data on the network.
     

    Not only does that make total sense, but this info might actually help me in an upcoming project (security cams). Yay :)

