Situational Awareness for Network Migrations

by | Aug 14, 2022

At IP Architechs we perform a lot of network migrations, and it is no secret that network migrations and maintenance windows can be some of the most nerve-racking events for engineers, managers, and business leaders, for a variety of reasons.

For engineers, the uncertainty might come from fear of failure, not being able to predict the outcome due to complexity, preparation rushed to meet a deadline, or a litany of other reasons.

For managers and business leaders, it might be more along the lines of: what happens if this goes wrong, how will this affect my bottom line, are there going to be thousands of trouble tickets at 8 or 9 a.m. when everyone hits the office, and so on.

The Preparation

We’re going to look at this from the perspective of the engineer throughout. The prep work is probably one of the most important factors in success. This is where you do many things, including but not limited to:

  • building and testing the configuration to be implemented
  • making a rollback plan, which might be as simple as moving a cable and shutting an interface, or might be a multistep, multi-device plan
  • knowing the situation surrounding the window

Let’s explore understanding the situation surrounding the window some more. I’ll use a real example here to help.

We were getting ready to change the internet edge deployment at an enterprise. We did all the prep and rollback planning. However, the business gave us a few constraints on downtime. Additionally, all of the product teams had to join the call for verification because of the impact of the relatively small routing change. If we missed this window, the next opportunity was going to be a few months out due to change freezes and the coordination of resources necessary.

So what did we learn by engaging outside of the technical realm?

  • We had tight timeframes which placed an increased emphasis on planning
  • We needed to have plans for things that could go wrong and resolution paths based on downtime constraints
  • Although it was a low-impact routing change, it was a high-impact business change
  • We needed to have clearly defined decision points on what would be cause for a rollback
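
The decision points above can be written down ahead of the window so nobody has to improvise them under pressure. The following is only a minimal sketch: the field names, the thresholds, and the four outcomes are hypothetical and are not the plan we actually used. It assumes you agree on a downtime budget with the business and estimate how long the rollback itself takes.

```python
from dataclasses import dataclass


@dataclass
class WindowState:
    """Hypothetical snapshot of where the maintenance window stands."""
    minutes_elapsed: int        # time since the change started
    minutes_allowed: int        # downtime budget agreed with the business
    rollback_minutes: int       # estimated time to execute the rollback plan
    service_impacted: bool      # is production traffic currently affected?
    workaround_available: bool  # is there a known-safe way to restore service?


def decide(state: WindowState) -> str:
    """Return the next action based on pre-agreed decision points."""
    remaining = state.minutes_allowed - state.minutes_elapsed
    if not state.service_impacted:
        return "continue"
    # Always keep enough of the budget in reserve to roll back.
    if remaining <= state.rollback_minutes:
        return "rollback"
    if state.workaround_available:
        return "workaround-and-continue"
    return "troubleshoot"


if __name__ == "__main__":
    # 20 minutes into a 60-minute budget, impacted, workaround known.
    print(decide(WindowState(20, 60, 15, True, True)))
```

The point of the sketch is the ordering: the check that protects the rollback time comes before any troubleshooting option, so the time budget always wins the argument.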

The Execution

All the prep is done and it’s time to execute the change. We put in the first couple of lines of the script and everything is going well. We get to the point where we need to clean up the old configuration. Then every engineer’s nightmare happens: everything starts to go down.

Okay, what do we do now? Based on the situation, we know we don’t have a lot of time to work through the problem. We need to stay calm and start working through the decision trees we made during the planning process.

Some quick troubleshooting revealed that when we removed the no-longer-used virtual routing and forwarding (VRF) instance, it shut down the ports that were now in the global table. We put the VRF back, still unused, and everything began to work as expected again.
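
One way to spot a side effect like this quickly is to snapshot interface state before and after each step and compare the two. The sketch below is illustrative only: the interface names and states are made up, and it assumes you already have some way to collect a name-to-state mapping from the device, which is not shown here.

```python
from typing import Dict, Tuple


def changed_interfaces(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return interfaces whose state differs between two snapshots."""
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old = before.get(name, "missing")
        new = after.get(name, "missing")
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical snapshots taken before and after removing the unused VRF.
before = {"Ethernet1": "up", "Ethernet2": "up", "Ethernet3": "down"}
after = {"Ethernet1": "down", "Ethernet2": "down", "Ethernet3": "down"}

for name, (old, new) in changed_interfaces(before, after).items():
    print(f"{name}: {old} -> {new}")
```

Running a comparison like this after every step narrows the search to the last command entered, which matters when the downtime budget is tight.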

Next the debate began: should we get TAC on the line to assist? There were still a few items to knock out in the change window to avoid a complete rollback. A majority of people wanted to “chase the rabbit” and find out why deleting the VRF brought down the interfaces. However, this would not have been a good use of our time. If we had gotten TAC on the line and started down that rabbit hole, there is no telling where it would have gone or how long it would have taken. The facts were that leaving the unused VRF in place, although the extra config was annoying, didn’t affect performance as far as we could tell, and we needed to get through the rest of the migration.

After a short debate we all agreed that, based on the circumstances of the migration, the coordination efforts, the business drivers, and the work still left to do, we would continue down the migration path. We also took the necessary logs for an initial case with TAC and opened a ticket in the morning. Would we get the same level of information and troubleshooting on that problem? No, but we were able to complete the migration and follow up on the weird behavior at a safer time.

Conclusion

Sometimes, under different circumstances, the right decision would be to get TAC on the line and work through the issue. The owners might decide everything can stay down until it’s working as planned, or anything in between. Often, constraints like physical access or travel will justify a longer downtime or troubleshooting window.

It is important to know the situation around the migration, why it’s happening, and who’s involved, and to keep those in mind during the migration so you can make informed decisions with the owner and make everyone successful.

If you need help planning your migrations, reach out to us.
