Network issues


  • Notification Spam Recipient

    I'm posting on this forum because, ironically, I think it is the best place on the web to get IT related answers. We are a very small company (this month we expanded to 9 employees), and we don't have a dedicated network guy. My boss (the owner, CEO, founder, etc.) usually handles it, as it is a hobby for him. I usually handle the stuff which isn't that big, or for which he doesn't get the time. He is quite good at it, but this particular problem has him baffled as well. So the problem is this:

    On our network, we have an issue where some pages don't load, i.e. we can't connect to them. The affected sites are mainly Google-related (the main Google site, Google News, Google Maps, etc.), but it also affects sites like Trello. If we restart the router, everything works fine again for a while, but then, one by one (not simultaneously), websites fail to connect.

    At first we thought it was a DNS server issue on our network, but that is not the case. The most recent thing we did was run tracert on the affected IPs:

    • Pinging google on another network gave the ip as 216.58.223.3
    • Pinging that ip on our network gives a timeout
    • Running a tracert on that ip on our network gives the following:
    Tracing route to jnb01s07-in-f3.1e100.net [216.58.223.3]
    over a maximum of 30 hops:
    
      1     1 ms     1 ms    <1 ms  172.18.19.10
      2     *        *        *     Request timed out.
      3    97 ms     8 ms     8 ms  41.181.178.42
      4    23 ms    14 ms    11 ms  ipc-recieve-jh-4a.za.mtnbusiness.net [41.181.178.41]
      5    13 ms    13 ms    68 ms  qux-jh-dca-2.za-b.za.mtnbusiness.net [41.181.165.115]
      6     9 ms     8 ms     8 ms  jh-dca-2.za--qux-q.za.mtnbusiness.net [196.31.180.4]
      7    35 ms    33 ms    33 ms  jh-cr-2.za--jh-dca-2.za-a.mtnns.net [196.44.0.224]
      8    18 ms    13 ms    14 ms  rb-cr-1.za--jh-cr-2.za-b.mtnns.net [196.44.31.170]
      9    36 ms    37 ms    36 ms  41.181.139.109
     10     9 ms    12 ms    23 ms  72.14.194.74
     11    18 ms    27 ms    10 ms  72.14.237.239
     12     *        *        *     Request timed out.
     13     *        *        *     Request timed out.
     14     *        *        *     Request timed out.
     15     *        *        *     Request timed out.
     16     *        *        *     Request timed out.
     17     *        *        *     Request timed out.
     18     *        *        *     Request timed out.
     19     *        *        *     Request timed out.
     20     *        *        *     Request timed out.
     21     *        *        *     Request timed out.
     22     *        *        *     Request timed out.
     23     *        *        *     Request timed out.
     24     *        *        *     Request timed out.
     25     *        *        *     Request timed out.
     26     *        *        *     Request timed out.
     27     *        *        *     Request timed out.
     28     *        *        *     Request timed out.
     29     *        *        *     Request timed out.
     30     *        *        *     Request timed out.
    
    • Running it on another network gives the following (or you can test yourself):
    Tracing route to jnb01s07-in-f3.1e100.net [216.58.223.3]
    over a maximum of 30 hops:
    
      1     1 ms     1 ms     1 ms  192.168.43.1
      2     *        *        *     Request timed out.
      3    68 ms    61 ms    53 ms  41.48.22.34
      4    31 ms    40 ms   100 ms  10.228.233.193
      5    43 ms    78 ms    26 ms  41.48.16.1
      6    29 ms    25 ms    24 ms  41.48.0.3
      7    35 ms    30 ms   112 ms  41.48.1.5
      8    47 ms   111 ms    35 ms  41.48.253.37
      9    53 ms    26 ms    29 ms  72.14.197.146
     10    68 ms    45 ms    52 ms  72.14.237.239
     11    29 ms    38 ms    28 ms  jnb01s07-in-f3.1e100.net [216.58.223.3]
    
    Trace complete.
    
    • So it times out right before the Google server.
    • Notice that both traces go through the ip 72.14.237.239 (if it is in any way relevant)
    • Tracing another google ip (216.58.223.4) on our network is successful:
    Tracing route to jnb01s07-in-f4.1e100.net [216.58.223.4]
    over a maximum of 30 hops:
    
      1     1 ms     1 ms    <1 ms  172.18.19.10
      2     *        *        *     Request timed out.
      3    31 ms   145 ms    26 ms  41.181.221.245
      4    10 ms     8 ms     8 ms  41.181.221.246
      5    11 ms    12 ms    11 ms  qux-jh-dca-2.za-b.za.mtnbusiness.net [41.181.165.115]
      6    23 ms     8 ms    10 ms  jh-dca-2.za--qux-q.za.mtnbusiness.net [196.31.180.4]
      7    20 ms    30 ms    17 ms  41.181.180.10
      8    10 ms     9 ms    22 ms  rb-cr-2.za--jh-cr-1.za.mtnns.net [196.44.0.43]
      9    33 ms    34 ms    34 ms  41.181.139.99
     10     9 ms    15 ms    27 ms  72.14.194.74
     11    25 ms    25 ms    20 ms  72.14.237.239
     12    39 ms    34 ms    32 ms  jnb01s07-in-f4.1e100.net [216.58.223.4]
    
    Trace complete.
    
    • It also goes through the same ip (72.14.237.239), but this time it goes through.

    According to my boss (my own networking knowledge is very limited), it could be some kind of packet inspection, which makes one of the possible culprits the SSL connection on our router that gives us our static ip.

    What I have noticed is that it is (AFAIK) only https websites that are affected. Does anybody have any idea what the problem could be, or how I can diagnose the problem further?

    TL;DR
    Websites time out (one by one), but after router restart, it is fine again, but only for a while.
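
The traces pasted above can also be compared mechanically rather than by eye. The following is a minimal sketch, assuming the text is standard Windows `tracert` output like the samples in the post; the function names `parse_hops` and `last_responding_hop` are made up for illustration, and the two sample strings are just the tail ends of the failing and working traces quoted above.

```python
import re

# A hop line starts with the hop number; header and footer lines do not.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def parse_hops(tracert_output):
    """Return a list of (hop_number, ip_or_None) from tracert text."""
    hops = []
    for line in tracert_output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue
        hop_number, rest = int(match.group(1)), match.group(2)
        if "Request timed out" in rest:
            hops.append((hop_number, None))
            continue
        # With a resolved hostname the address is in brackets at the end,
        # so the last IPv4-looking token on the line is the hop address.
        addresses = IPV4.findall(rest)
        hops.append((hop_number, addresses[-1] if addresses else None))
    return hops


def last_responding_hop(hops):
    """Return the last (hop_number, ip) that answered, or None."""
    responding = [hop for hop in hops if hop[1] is not None]
    return responding[-1] if responding else None


FAILING_TAIL = """
 10     9 ms    12 ms    23 ms  72.14.194.74
 11    18 ms    27 ms    10 ms  72.14.237.239
 12     *        *        *     Request timed out.
"""

WORKING_TAIL = """
 11    25 ms    25 ms    20 ms  72.14.237.239
 12    39 ms    34 ms    32 ms  jnb01s07-in-f4.1e100.net [216.58.223.4]
"""

if __name__ == "__main__":
    print(last_responding_hop(parse_hops(FAILING_TAIL)))
    print(last_responding_hop(parse_hops(WORKING_TAIL)))
```

Keep in mind that a hop that does not answer a traceroute probe is not necessarily dropping real traffic, so the last responding hop is only a hint about where to look.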



  • Sounds similar to a black hole router issue.

    It may also be that a firewall is blocking ICMP traffic. Most SSL/TLS connections I've seen in the wild set the 'do not fragment' flag on their packets, and if ICMP is being stopped by the firewall, clients never get the message to send smaller packets and can time out.
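
One way to test the black hole theory above is to send pings with the "don't fragment" flag set and find the largest payload that still gets through. The sketch below is only an illustration under a few assumptions: it uses the Windows `ping -f -l` and Linux `ping -M do -s` options, it treats a zero exit status as "the ping got a reply" (which may not be reliable on every system), and the helper names are invented here. The target has to be a host that answers small pings in the first place, which the affected Google address in this thread does not.

```python
import platform
import subprocess


def ping_command(host, payload_size, system=None):
    """Build a single ping with Don't Fragment set and a given payload size."""
    system = system or platform.system()
    if system == "Windows":
        return ["ping", "-n", "1", "-f", "-l", str(payload_size), host]
    # Linux (iputils) syntax; other systems use different flags.
    return ["ping", "-c", "1", "-M", "do", "-s", str(payload_size), host]


def probe_with_ping(host, payload_size):
    """Assumption: exit status 0 means the ping was answered."""
    result = subprocess.run(ping_command(host, payload_size), capture_output=True)
    return result.returncode == 0


def largest_passing_payload(probe, low=0, high=1472):
    """Binary search for the largest size in [low, high] where probe(size) is True.

    Assumes that if one size passes, every smaller size passes too.
    1472 is 1500 bytes of Ethernet MTU minus 20 bytes of IPv4 header
    and 8 bytes of ICMP header. Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    host = "example.com"  # placeholder: pick a host that answers ping
    size = largest_passing_payload(lambda s: probe_with_ping(host, s))
    print("largest unfragmented ICMP payload:", size)
```

If the result is well below 1472 and large pings simply time out instead of producing a "needs to be fragmented" error, that would be consistent with ICMP being filtered somewhere on the path.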



  • My first thought after I read your first couple paragraphs was that your router is either running out of NAT table space or free ports on the WAN side. Rebooting clears the table/frees the ports and things work until it runs out of space again. What brand/model is it?

    The traceroute log / ping reports don't bear out that theory, but the fact things are OK for a while after rebooting still suggests that it's an issue with your router.
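
To make the NAT table theory above a bit more concrete, here is a toy model of a fixed-size table. It is purely illustrative: the class name, the capacity, and the idle timeout are made up, and it says nothing about how any particular router's firmware actually behaves. It only shows why existing connections could keep working while new ones fail one at a time, and why a reboot would appear to fix things.

```python
class ToyNatTable:
    """A fixed-capacity table of active flows, keyed by an arbitrary flow id."""

    def __init__(self, capacity, idle_timeout):
        self.capacity = capacity
        self.idle_timeout = idle_timeout
        self.entries = {}  # flow id -> time the flow was last seen

    def _expire(self, now):
        # Drop flows that have been idle for at least idle_timeout.
        self.entries = {
            flow: seen
            for flow, seen in self.entries.items()
            if now - seen < self.idle_timeout
        }

    def open_flow(self, flow, now):
        """Return True if the flow is (or can be) tracked, False if the table is full."""
        self._expire(now)
        if flow in self.entries or len(self.entries) < self.capacity:
            self.entries[flow] = now
            return True
        return False

    def reboot(self):
        """A reboot empties the table."""
        self.entries.clear()


if __name__ == "__main__":
    table = ToyNatTable(capacity=2, idle_timeout=300)
    print(table.open_flow("pc1 -> site-a", now=0))  # fits
    print(table.open_flow("pc2 -> site-b", now=1))  # fits
    print(table.open_flow("pc3 -> site-c", now=2))  # table full
    table.reboot()
    print(table.open_flow("pc3 -> site-c", now=3))  # fits again
```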



  • @reverendryan said:

    My first thought after I read your first couple paragraphs was that your router is either running out of NAT table space or free ports on the WAN side. Rebooting clears the table/frees the ports and things work until it runs out of space again. What brand/model is it?

    I would second the above. If rebooting the router clears it (especially if the router is ye olde "residential special" sub-$100 Walmart router), then I would be inclined to start troubleshooting as a failing router. Most of my business customers use either MikroTik RouterBOARD models (if we're doing the configuration and they don't want to ever touch it), or SonicWALL TZ series if they insist on being able to configure it themselves. The RouterBOARD routers aren't really that complicated to set up, but they don't have much in the way of "hold your hand" wizards to walk through common stuff (NAT firewall rules, changing static IPs, etc), so they're usually difficult for a non-techie to configure. In general, I would recommend either the RB2011 or a SonicWALL TZ 105 series router for most of my small business clients.



  • @izzion said:

    If rebooting the router clears it (especially if the router is ye olde "residential special" sub-$100 Walmart router), then I would be inclined to start troubleshooting as a failing router.

    Might also be worth turning off the boss's bittorrent client. Or at the very least, altering the maximum peers setting so it doesn't blow out the totally inadequate NAT table in your shitty consumer-grade router.

    I got so pissed off by needing to do that at my place that now my router is a Beaglebone Black running a proper Debian installation.



  • @Vault_Dweller said:

    the SSL connection on our router that gives us our static ip

    This makes no sense to me whatsoever, and I netadmin for a living.


  • Notification Spam Recipient

    @reverendryan said:

    What brand/model is it?

    Billion BiPAC 7402NX. It was the only one our ISP had documentation for setting up a static IP, and the one recommended by them, so that's why we chose it :P.

    @flabdablet said:

    Might also be worth turning off the boss's bittorrent client.

    He is not that kind of guy. He is actually very strict about that kind of stuff at work.

    @flabdablet said:

    This makes no sense to me whatsoever, and I netadmin for a living.

    Sorry, as I said, I am not an expert at networking, and only repeated what he said. I could paste the Skype message here, but it is in Afrikaans, so it probably wouldn't help a lot...

    When I get time, I will try the various things mentioned here, thanks a lot.



  • @Vault_Dweller said:

    Billion BiPAC 7402NX

    Definitely consumer-grade. Billion stuff is OK for what it is, but it will definitely do the wrong thing (up to and including becoming completely non-responsive, first via web, then via telnet) if asked to maintain more than a few hundred simultaneous NAT sessions.

    It would certainly be worth your while logging on to its web interface and monitoring the size of its NAT table (somewhere under Status, if I recall correctly) to see whether there's some correlation between lots of NAT and loss of connections. If you're seeing way more NAT sessions than you expect to, try bisecting your workstation fleet's access to the LAN switch until you find the one running the rogue torrent client or skype node or whatever it turns out to be.

    If you're using a managed switch, it will also have a web interface that you can use to check for switch ports carrying anomalously high traffic, and will probably also have a way to set one switch port as a monitoring interface so you can easily run a LAN-wide packet capture using Wireshark to find out what's really going on in your network.
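
Once the NAT sessions have been copied out of the router (or pulled from a packet capture), finding the machine responsible is mostly a counting exercise. This is a minimal sketch that assumes the sessions have already been turned into `(internal_ip, remote_endpoint)` pairs; that input format, the function names, the sample addresses, and the threshold are all hypothetical, since the layout of the router's status page is not shown in this thread.

```python
from collections import Counter


def sessions_per_host(sessions):
    """Count NAT sessions per internal IP address."""
    return Counter(internal_ip for internal_ip, _remote in sessions)


def noisy_hosts(sessions, threshold):
    """Return (internal_ip, count) pairs above the threshold, busiest first."""
    counts = sessions_per_host(sessions)
    return [(ip, n) for ip, n in counts.most_common() if n > threshold]


if __name__ == "__main__":
    # Made-up sample data: one workstation holds most of the sessions.
    sample = (
        [("172.18.19.21", f"203.0.113.{i}:6881") for i in range(1, 41)]
        + [("172.18.19.22", "198.51.100.7:443")] * 3
        + [("172.18.19.23", "198.51.100.9:443")] * 5
    )
    for ip, count in noisy_hosts(sample, threshold=10):
        print(ip, count)
```

A host that stands out here is the first candidate for the bisecting approach described above.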

