The UTD campus lost internet connectivity for 12 hours last month, prompting an investigation from the Office of Information Technology.
OIT Associate Vice President and Chief Technology Officer Brian Dourty said the first indication of a network issue occurred Jan. 20 at 1:58 p.m., when the office received an automatic notification about a problem with connecting to Galaxy.
He said OIT engineers were on site to troubleshoot within 30 minutes of the first notification.
“The symptoms really manifested themselves as more of a firewall problem than a router problem, initially,” Dourty said.
Engineers traced the network issue to a border router, a piece of equipment serving as a central receiving point for internet access from LEARN and UT System, UTD’s two internet service providers.
As the situation became apparent, Cisco, UTD’s networking equipment manufacturer, sent engineers to help investigate.
Dourty said this outage was unusual compared to previous ones.
“The symptoms were very odd,” he said. “It wasn’t a complete failure. Some connectivity would go through and other connectivity wouldn’t.”
Following a recommendation from Cisco, OIT engineers rebooted the system at 7 p.m., but the reboot did not restore connectivity.
“In any sort of troubleshooting effort, the goal is to isolate as many of the variables as possible, so you can really zero in on what the problem is,” Dourty said. “On a piece of equipment like this, it’s really complex. That’s not an easy thing to do.”
According to OIT’s Twitter account, network access was restored at 12:03 a.m. on Jan. 21, but there were still intermittent service interruptions as engineers worked to replace the faulty equipment. The outage was fully resolved at 6:17 a.m. later that day.
The outage lasted as long as it did in part because it involved a single point of failure within the UTD network, the border router, Dourty said. The router has some redundancy built in, such as a backup power supply, but does not have a backup controller.
“That’s something we’ve been working over the course of the last year to rectify,” Dourty said. “Unfortunately, this outage occurred before we could complete those efforts.”
Dourty said Cisco engineers were unable to conclusively determine whether the outage was caused by a hardware or software issue within the border router. To prevent a similar problem from recurring, the entire router was replaced.
“We take every outage as an opportunity to learn and document how we can do it better next time, and what we can do to avoid having the same problem reoccur,” he said.