48X locking up--excessive traffic?

doom1701 · Post by **doom1701** » Tue Feb 02, 2016 3:21 pm

We're trying to track down the root cause of our core 48X locking up about once a week lately. (We have 4 throughout the country, all set up for Multisite; the "core" is just the biggest one, with about 180 users and 270 handsets.) Version 7.6.14.6. The first time we experienced it was when one of the other systems had a hardware failure and we redirected inbound traffic from the SIP trunk of the failed system to our core system. After restarting the core system, I was able to telnet into it (still hoping to redirect all of that Telnet output in some other way; I've got another thread about that) and saw a ton of messages like the ones below:

tSip: +++ IEC [7.6.14.6:sipInvite.c,3832]
tSip: +++ IEC [7.6.14.6:sipInvite.c,3832]
tSip: +++ IEC [7.6.14.6:sipInvite.c,3832]
tSip: +++ IEC [7.6.14.6:ccbCall.c,1927]
tSip: +++ IEC [7.6.14.6:sipInvite.c,1630]
tSip: +++ IEC [7.6.14.6:sipCancel.c,311]
tSip: +++ IEC [7.6.14.6:sipCancel.c,311]

Pages and pages of them. At some point the web interface of the phone system stopped responding, phone call processing stopped, Telnet disconnected, but we could still ping the system. Only resolution was to bounce the system, and ultimately we redirected the failed system's trunk to another system and the problem went away.

It has shown the same symptoms three times since (except that we weren't telnetted in, so I can't verify those messages). We had one other instance today where I was telnetted in: we saw very similar symptoms (web interface hung, outbound calls failing) and the same messages as above, page after page. I kept hitting enter in the telnet window and could see that it was still responding, and then after about a minute everything started responding again (and the messages stopped).

We've had a lot of growth lately, and while we aren't maxed out to the licensed and stated abilities of the system, we're wondering if either we have too much usage on the 48X, if there's something that our SIP provider (Windstream) might be doing that is leading to this, or if it might be a DoS attack. We've had DoS attacks in the past (prior to the software upgrade that fixed the vulnerability), but those caused ping results to be spotty with high latency; ping has been consistent with no latency for these most recent issues.

doom1701 · Post by **doom1701** » Wed Feb 03, 2016 11:39 am

We were able to get a little more information today (it happened again). The last messages generated before all phone and web traffic to the 48X stopped are below.

eCos> tSip: +++ IEC [7.6.14.6:sipInvite.c,3832]
tSip: +++ IEC [7.6.14.6:sipInvite.c,3832]
tSip: TokenPipe 37 blocked (7388652 ticks)(msg proxy.c:223)
tAlarm: TokenPipe 37 blocked (7388693 ticks)(msg proxy.c:223)
tRca: TokenPipe 37 blocked (7388749 ticks)(msg proxy.c:223)
tCaa: +++ IEC [7.6.14.6:ccbCall.c,2930]

We also unplugged the WAN cable during the message storm, and the messages kept coming, so whatever is causing it appears to be coming from inside the network.
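For anyone who does manage to capture the Telnet output to a file, here is a rough Python sketch for tallying the messages so the storm is easier to read. The regular expressions are only guesses based on the lines pasted above, and `tally` and `SAMPLE` are names made up for this example, not anything provided by the 48X.

```python
import re
from collections import Counter

# Patterns inferred from the pasted console lines only; other message
# shapes will simply be ignored.
IEC_RE = re.compile(
    r"(?P<task>t\w+): \+\+\+ IEC \[(?P<ver>[\d.]+):(?P<file>[\w.]+),(?P<line>\d+)\]"
)
BLOCKED_RE = re.compile(
    r"(?P<task>t\w+): TokenPipe (?P<pipe>\d+) blocked \((?P<ticks>\d+) ticks\)"
)


def tally(lines):
    """Count IEC messages per (task, file, line) and blocked pipes per (task, pipe)."""
    iec = Counter()
    blocked = Counter()
    for raw in lines:
        m = IEC_RE.search(raw)
        if m:
            iec[(m["task"], m["file"], int(m["line"]))] += 1
            continue
        m = BLOCKED_RE.search(raw)
        if m:
            blocked[(m["task"], int(m["pipe"]))] += 1
    return iec, blocked


# The six lines from the post above, used as sample input.
SAMPLE = """\
eCos> tSip: +++ IEC [7.6.14.6:sipInvite.c,3832]
tSip: +++ IEC [7.6.14.6:sipInvite.c,3832]
tSip: TokenPipe 37 blocked (7388652 ticks)(msg proxy.c:223)
tAlarm: TokenPipe 37 blocked (7388693 ticks)(msg proxy.c:223)
tRca: TokenPipe 37 blocked (7388749 ticks)(msg proxy.c:223)
tCaa: +++ IEC [7.6.14.6:ccbCall.c,2930]
""".splitlines()

if __name__ == "__main__":
    iec, blocked = tally(SAMPLE)
    for key, count in iec.most_common():
        print(count, key)
    for key, count in blocked.most_common():
        print(count, key)
```

Pointing `tally` at a saved capture (for example `tally(open("capture.txt"))`, where the file name is whatever you saved the session as) should show which source locations dominate the storm.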

doom1701 · Post by **doom1701** » Mon Feb 22, 2016 2:38 pm

After being stable for a week and a half, we think we know what happened. Windstream DoS'ed us.

We had Trunk Group Overflow set up on three of our SIP Trunks with Windstream. They were set up in a round robin fashion: if Site 1 becomes unavailable, Windstream routes the call to Site 2. Site 2 is configured for all of the DIDs of Site 1 (all the sites, actually), and knows the internal extension at Site 1 to route the call back to. It's saved our butts when we've had a circuit at a core site go down. Being round robin, if Site 2 is unavailable, its calls go to Site 3, and if Site 3 is unavailable, its calls go to Site 1.

Two weeks ago we were able to document two situations where a call would ring an extension (8 phones in the ring group) at Site 1. While that was happening, Windstream for some reason failed the call over to Site 2. Site 2 would ring the same extension by pushing the call back to the Site 1 48X. Windstream then failed the call over again to Site 3, which pushed the call back to Site 1 again, as designed. This looped rapidly until the Site 1 48X was attempting to process the same call so many times that the system failed.
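To make the loop easier to picture, here is a toy Python model of it. This is a sketch under assumptions, not how Windstream or the 48X actually work: it assumes every failover adds one more call leg at Site 1 and that earlier legs are never released. `simulate_loop`, `OVERFLOW_NEXT` and `RING_GROUP_SIZE` are names invented for this example.

```python
# Round robin overflow order as described above.
OVERFLOW_NEXT = {"Site 1": "Site 2", "Site 2": "Site 3", "Site 3": "Site 1"}
RING_GROUP_SIZE = 8  # phones in the ring group at Site 1


def simulate_loop(home="Site 1", failovers=6):
    """Return (trace, legs_at_home, ring_attempts) after `failovers` failovers.

    Assumption: each failover delivers one more leg of the same call to the
    home site, and none of the earlier legs has been torn down yet.
    """
    trace = [f"carrier -> {home}"]  # the original call, still ringing
    legs_at_home = 1
    target = home
    for _ in range(failovers):
        target = OVERFLOW_NEXT[target]
        if target == home:
            # Overflow has come full circle: delivered straight to home.
            trace.append(f"carrier -> {home}")
        else:
            # The other site knows the DID and pushes the call back home.
            trace.append(f"carrier -> {target} -> {home}")
        legs_at_home += 1
    return trace, legs_at_home, legs_at_home * RING_GROUP_SIZE


if __name__ == "__main__":
    trace, legs, rings = simulate_loop(failovers=6)
    for hop in trace:
        print(hop)
    print(f"{legs} legs at Site 1, {rings} phones being rung")
```

Under those assumptions the load on Site 1 grows with every failover, and nothing in the round robin ever stops it, which matches what we saw.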

We have turned off the automatic failover and we haven't had an issue since.
