Wednesday, September 02, 2009

Data Center Move Part 3: What went right and what went wrong

This is the third and final installment of my account of our data center move. This is also my favorite part to write, because I get to recount everything that went right and wrong during and after the move. And I love learning from my mistakes.

I had six people in total, including myself, coordinating this effort - five at the data centers, and one working remotely who could test connectivity from the outside and perform move-related tasks that didn't require being on-site. Our downtime was scheduled from 12:00 AM to 4:00 AM EST on a Saturday night / Sunday morning.

11:00 PM

The five of us met at the New Data Center at 11:00 PM. I wanted to make sure the New Data Center had the WAN cable drop ready, that our access card worked, and that the combinations to the cabinet doors worked, and to plan our route into and out of the building and determine where to park our cars while unloading equipment. The New Data Center came equipped with remote-access Power Distribution Units (PDUs) in both cabinets, one cabinet with 30 Amps and one with 20 Amps.

11:30 PM

Our inspection of the New Data Center was complete, and we drove the eight blocks to the Old Data Center to prepare for the take-down of the core servers: the web server, the database server, and the API server.

11:40 PM

We arrived at the Old Data Center. The primary check we needed to do prior to shutting down the servers was ensuring that the standby database server had up-to-date copies of files from the primary database server. The primary log-ships to the standby server, so we needed to ensure the last of the log files had been restored to the standby server.
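
The check amounts to something like the sketch below, assuming SQL Server log shipping and the pyodbc driver; the connection string is a placeholder, not our real configuration:

    # Sketch: confirm the standby has restored the most recently copied transaction log.
    # Assumes SQL Server log shipping and pyodbc; the connection string is a placeholder.
    import pyodbc

    CONN_STR = "DRIVER={SQL Server};SERVER=standby-db;DATABASE=msdb;Trusted_Connection=yes"

    QUERY = """
    SELECT secondary_database, last_copied_file, last_restored_file, last_restored_date
    FROM   msdb.dbo.log_shipping_monitor_secondary
    """

    with pyodbc.connect(CONN_STR) as conn:
        for row in conn.cursor().execute(QUERY):
            # The standby is current when the last copied log has also been restored.
            status = "OK" if row.last_copied_file == row.last_restored_file else "LAGGING"
            print(status, row.secondary_database, row.last_restored_date)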

12:00 AM

The equipment was transported in three waves:

1. At 12:00 AM, I took the firewall and the main switch to the New Data Center. I wanted to get the firewall appliance connected to the WAN connection from the New Data Center and plug my laptop in to verify that connectivity to the Internet was working. The team transporting the three core servers would wait for my call from the New Data Center signaling that connectivity was working.

2. After my call to verify connectivity, a team of two (the core two) would begin dismantling the three core servers (web server, database server, API server) and transport them to the New Data Center.

3. Then the remaining two (the back two) would begin dismantling and transporting the last set of servers, including corporate mail, incoming mail, standby DB server, corporate web server, the rendering servers, and a few other standby servers and appliances.

12:30 AM

Testing connectivity was a critical first step, because the type of connectivity was different from that of the Old Data Center. The New Data Center required a layer 3 routing device, while the Old Data Center did not. The firewall had to be reconfigured with a static IP, a range of DMZ IPs, and a gateway. All DMZ devices would then use the firewall as their gateway. At the Old Data Center, all devices, including the firewall, were assigned IPs from the same class C, and every device used the same gateway IP, one provided to us by the Old Data Center. While the new connectivity setup seemed complicated at first, it worked just as expected.
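
The connectivity test itself was nothing fancy - from my laptop, something along the lines of this sketch, which resolves a couple of names and opens outbound TCP connections through the new gateway (the hostnames are just illustrative):

    # Sketch of the laptop-side sanity check: resolve a name and open an outbound
    # TCP connection through the new firewall/gateway. Hostnames are illustrative.
    import socket

    def check(host, port=80, timeout=5):
        try:
            addr = socket.gethostbyname(host)          # DNS resolution
            with socket.create_connection((addr, port), timeout=timeout):
                return f"{host} ({addr}:{port}) reachable"
        except OSError as exc:
            return f"{host} FAILED: {exc}"

    for host in ("www.google.com", "jangomail.com"):
        print(check(host))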

After the firewall and switch were plugged in and connected, I had the sixth man, who was remote, connect to the firewall via web browser and begin changing the Network Address Objects' IP addresses. Simultaneously, he logged into our third-party DNS system and began making DNS changes for the main domain names, like jangomail.com, www.jangomail.com, mail.jangomail.com, and relay.jangosmtp.net.
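
A quick way to confirm the new records from the outside is a script like this one, using dnspython; the expected IPs shown are placeholders, not our real assignments:

    # Sketch: verify the updated A records from outside using dnspython (2.x API).
    # The expected IPs are placeholders, not our real assignments.
    import dns.resolver

    EXPECTED = {
        "jangomail.com":       "203.0.113.10",
        "www.jangomail.com":   "203.0.113.10",
        "mail.jangomail.com":  "203.0.113.11",
        "relay.jangosmtp.net": "203.0.113.12",
    }

    for name, expected_ip in EXPECTED.items():
        found = {rr.address for rr in dns.resolver.resolve(name, "A")}
        print(name, "OK" if expected_ip in found else f"still resolving to {found}")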

12:45 AM

After verifying connectivity, I called the core two and had them begin the take-down and transport of the three core servers.

1:15 AM

Within thirty minutes, the team arrived, and we began rack-mounting the three core servers. We plugged them in and connected them to the switch, but they did not come back online.

This is where we ran into problem #1: while the IPs for the New Data Center had been assigned to the second NIC on each of the three core servers, the second NIC on each of those servers was disabled. Each needed to be re-enabled for connectivity to the servers to be established. I needed to log in to each server and activate the NIC, but I had forgotten the Keyboard-Video-Mouse (KVM) unit at the Old Data Center and had no way of configuring anything on these three servers.
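
Once you do have console access, re-enabling a NIC on Windows is a one-line netsh command; here is a rough sketch wrapped in Python's subprocess (the interface name is a placeholder):

    # Sketch: enable the second (disabled) NIC on a Windows server via netsh.
    # Must be run locally with admin rights; the interface name is a placeholder.
    import subprocess

    NIC_NAME = "Local Area Connection 2"   # placeholder; yours will differ

    subprocess.run(
        ["netsh", "interface", "set", "interface", f"name={NIC_NAME}", "admin=enabled"],
        check=True,
    )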

1:45 AM

I took a vehicle back to the Old Data Center, retrieved the KVM switch, checked in on the back two, who were moving the ancillary servers, and drove back to the New Data Center.

2:15 AM

I mounted the KVM switch and, one by one, plugged it into each of the three now-mounted core servers and activated the NIC with the new IP. One problem - I could only activate the NIC on two of the three servers. On the third, the API server, the keyboard/mouse ports were USB, and our KVM unit didn't have USB connectors for the keyboard/mouse. I had no way of actually configuring the API server. The New Data Center had mentioned the availability of a KVM crash cart on the floor, so I searched for it. Ten minutes later, I found it, locked behind another customer's cage.

2:45 AM

The web server and database server were operational, 75 minutes prior to the end of our downtime window. However, the API is mission critical to 30% of our customers, so I had to find a way to get it up and running.

Unable to come up with an immediate solution to this problem, I drove back to the Old Data Center to check on the two men dismantling the ancillary servers.

3:00 AM

I pulled up, and they were just loading the car. All equipment had now been removed from the Old Data Center's racks. I was going to lead the car back to the New Data Center, but it wouldn't start. The car's battery had died while the flashers were on, parked in front of the building. I pulled up alongside the car, we moved the equipment from that vehicle to mine, had a quick chat with the building doorman about not towing the now-defunct car, piled back into my vehicle, and the three of us drove to the New Data Center.

3:20 AM

We loaded up the ancillary equipment on carts and rolled the carts into the New Data Center. Once the equipment was in the data center, I had my core two men rack-mount it. I let the back two use my laptop, still connected directly to the firewall, to Google for a store that was open at that hour and would carry jumper cables. They didn't find one, but left anyway in my vehicle in search of some.

The three of us continued to rack-mount the remaining ancillary equipment. One by one, we turned each device on and re-configured its NIC for its new IP address.

After all equipment was powered on and connected, I still needed to get the API up. I decided to use the standby database server as my new primary API server until I could configure the original API server with a USB keyboard and mouse.

Post Move - 5:30 AM

The three of us drove back home to Dayton, Ohio.

Post Move - 6:30 AM

I arrived at my home in Dayton and fired up my laptop to do some final testing from the outside. I noticed a few problems:

My BlackBerry wasn't receiving any email.

My BlackBerry receives all my work email by having our corporate mail server, mail.silicomm.com, forward my email to my T-Mobile BlackBerry email account. The email wasn't being forwarded.

I first remoted into mail.silicomm.com, did an MX lookup on tmo.blackberry.net, which is the T-Mobile BlackBerry domain, and attempted to connect to port 25 on the resulting IP. The connection was made and immediately dropped. I assumed it was due to the new IP for mail.silicomm.com never having sent email before. As a temporary work-around, I had mail.silicomm.com forward my email to my GMail account, and then had my GMail account forward email to my T-Mobile BlackBerry account.
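
The diagnosis steps are easy to script; roughly the equivalent of the sketch below, using dnspython and smtplib (run from the mail server itself so the source IP matches what the receiving side sees):

    # Sketch of the diagnosis: find the MX for the BlackBerry domain and try port 25.
    # Uses dnspython and smtplib; run from the mail server so the source IP matches.
    import smtplib
    import dns.resolver

    mx_records = sorted(dns.resolver.resolve("tmo.blackberry.net", "MX"),
                        key=lambda rr: rr.preference)
    mx_host = str(mx_records[0].exchange).rstrip(".")
    print("MX host:", mx_host)

    try:
        with smtplib.SMTP(mx_host, 25, timeout=10) as smtp:
            code, banner = smtp.ehlo()
            print("EHLO response:", code, banner.decode(errors="replace"))
    except smtplib.SMTPServerDisconnected as exc:
        # Roughly what we saw: the connection was accepted, then immediately dropped.
        print("Connection dropped:", exc)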

Email I was sending wasn't being received.

I sent a test email from mail.silicomm.com to my GMail account, and it wasn't received.

Our corporate email system, IceWarp, has its own DNS settings, separate from the DNS settings in Windows TCP/IP. I hadn't changed IceWarp's DNS servers to the New Data Center's DNS servers, and therefore outbound email wasn't being transmitted, since the MX lookups couldn't be performed.
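
You can reproduce that failure mode by forcing MX lookups through whatever DNS servers the application still has configured; here's a rough sketch with dnspython (the resolver IP is a placeholder for the old, now-unreachable DNS server):

    # Sketch: reproduce the failure by doing an MX lookup against the DNS server
    # still configured inside the mail application. The IP is a placeholder for
    # the old data center's resolver, which is unreachable from the new network.
    import dns.exception
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["198.51.100.53"]
    resolver.lifetime = 5

    try:
        resolver.resolve("gmail.com", "MX")
        print("MX lookup succeeded")
    except dns.exception.Timeout:
        print("MX lookup timed out -- outbound mail can't be routed")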

JangoMail notification emails weren't being received.

All JangoMail notifications, such as the "Sending Complete" and "Import Complete" notifications, are sent via the JangoMail SMTP relay service. You can authenticate into the SMTP relay either by IP address or by From Address. For our own system notifications, we authenticate by IP address. We had forgotten to change the authenticated IP address in our own JangoMail account - the one that handles all these customer notifications. Once that was changed, customers received their email notifications.
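
For context, an IP-authenticated relay submission looks something like the sketch below: there's no SMTP AUTH step at all, so the relay has to recognize the connecting IP (the addresses and port here are illustrative):

    # Sketch: what an IP-authenticated relay submission looks like -- no smtp.login()
    # call, so the relay must recognize the connecting IP. Addresses are illustrative.
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "notifications@jangomail.com"
    msg["To"] = "customer@example.com"
    msg["Subject"] = "Sending Complete"
    msg.set_content("Your email campaign has finished sending.")

    with smtplib.SMTP("relay.jangosmtp.net", 25, timeout=10) as smtp:
        # Authorization happens by source IP, which is why the new data center
        # IP had to be added to the account before notifications flowed again.
        smtp.send_message(msg)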

One to Three Days After Data Center Move

Clients were reporting that the API was still inaccessible.

The API was inaccessible to at least three clients, because they were connecting to the API by IP address rather than by the domain name api.jangomail.com. This was discovered in the 48 hours following the move.

Clients were reporting that the JangoMail web-database connectivity feature wasn't working.

Clients were unable to use the web-database connectivity feature of JangoMail because they had restricted access to their web and database servers by IP address, and their firewalls still listed our Old Data Center IP range rather than the New Data Center IPs.

Five Days After Data Center Move

While examining the headers of a forwarded email in my GMail account, I noticed that an email forwarded by mail.silicomm.com to my GMail account had the originating IP listed as 209.173.128.118, which is not the IP for mail.silicomm.com. The IP for mail.silicomm.com is 209.173.141.195, and the reverse lookup for this IP is, appropriately, mail.silicomm.com.

Upon further examination, I discovered that all servers in the DMZ were connecting to the outside as 209.173.128.118, which is the IP of the firewall. If I pointed a web browser on any device to www.whatismyip.com, the result was the same - 209.173.128.118. I was baffled, since all devices had been assigned public, routable IP addresses. I called SonicWall support, but was declined because I didn't have a support contract, and they refused to open a per-incident support ticket. I posted on the SonicWall forums and was told I had to put in a Network Address Translation (NAT) setting to stop the firewall from translating outbound connections to its own IP. Once I did that, the original BlackBerry forwarding issue was resolved. The BlackBerry email server had been dropping the connection immediately because the source IP of the connection, 209.173.128.118, had no reverse DNS entry; it's the firewall's IP, so it doesn't need an rDNS entry.
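
Both halves of that diagnosis are easy to script; here's a rough sketch (the "what is my IP" service is an illustrative choice):

    # Sketch: the two checks that exposed the NAT problem -- what IP the outside
    # world sees, and whether that IP has a reverse DNS (PTR) record.
    # The "what is my IP" service used here is an illustrative choice.
    import socket
    import urllib.request

    ASSIGNED_IP = "209.173.141.195"   # what mail.silicomm.com should present as

    observed_ip = (
        urllib.request.urlopen("https://api.ipify.org", timeout=10)
        .read().decode().strip()
    )
    print("Observed outbound IP:", observed_ip, "(expected", ASSIGNED_IP + ")")

    try:
        print("Reverse DNS:", socket.gethostbyaddr(observed_ip)[0])
    except socket.herror:
        print("No reverse DNS entry -- receiving mail servers may drop the connection")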

Lessons Learned

There were several areas where planning was weak, including planning what equipment would be moved in which phase and underestimating the impact of switching to new IP addresses. Given the experience I've outlined, I would have done the following differently:
  1. Activated the NIC with the New Data Center IP while still at the Old Data Center.
  2. Made sure to have a USB keyboard/mouse with me.
  3. A week prior to the move, informed all customers that our IP addresses were changing, including the new IP range.
  4. Informed all API customers that anyone connecting directly to our IP address needed to connect to the domain api.jangomail.com instead.
  5. Been more conscious of the JangoMail accounts that we ourselves use and dependencies on IP addresses.
  6. Been more aware that some applications don't inherit the TCP/IP settings of Windows, and have them set manually.
  7. And lastly, I would have made sure at least one vehicle carried jumper cables.