Monday, August 24, 2009

Data Center Move Part 2: The Planning

There were several components to my planning of the data center move:
  1. Connectivity / IP Addresses - We'd be getting a whole new set of IP addresses, so I had to map the old IPs to the new IPs. And so that the domains would resolve to the new IPs as soon as we went live at the new data center, I set the TTLs for important domains, like jangomail.com, to 60 seconds (a quick TTL check is sketched just after this list).
  2. Reverse DNS - Because we operate several email servers, I needed to ensure that rDNS would work properly on all of them post-move. So prior to the move, I contacted the support team at the New Data Center to put in rDNS entries for 3 of our IPs that would map to mail.silicomm.com, mx.jngo.net, and mail.jangomail.com.
  3. Human labor – I’d need to find enough people to move our equipment over in the desired amount of time.
  4. Rails – We were moving from non-Dell cabinets to Dell cabinets. We needed to make sure the rails purchased for our non-Dell cabinets would work in the New Data Center’s Dell cabinets.
  5. Firewall – All of our server access rules are controlled by our firewall, and we’d need to make sure the firewall would be ready, with its rules rewritten around the new IP addresses.
  6. Software/OS configuration changes – We would need to ensure that any web, email, or database server with services bound to a designated IP would have those bindings updated.
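
To verify the lowered TTLs had actually taken effect before the move, a short script could query each domain and report the TTL it gets back. Here's a minimal sketch in Python, assuming the third-party dnspython library; the domain list is illustrative:

    # Check that the advertised TTL on our A records has dropped to 60
    # seconds, so DNS caches will pick up the new IPs quickly at cutover.
    # Requires the third-party dnspython package (pip install dnspython).
    import dns.resolver

    DOMAINS = ["jangomail.com"]  # illustrative; the real list covered every important domain
    TARGET_TTL = 60

    for domain in DOMAINS:
        answer = dns.resolver.resolve(domain, "A")
        # A caching resolver reports a TTL counting down from the
        # authoritative value, so anything <= 60 is consistent.
        ttl = answer.rrset.ttl
        status = "OK" if ttl <= TARGET_TTL else "STILL HIGH"
        print(f"{domain}: TTL={ttl}s [{status}]")
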
IP Addresses

Prior to the day of the move, I took inventory of all our equipment and the IPs assigned to each device. I then made an Excel spreadsheet detailing each device, the old IPs assigned to it, and the new IPs that would be assigned to it.
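
Once live at the new facility, that same mapping makes it easy to script a sanity check: resolve each hostname and compare the answer against the spreadsheet. A rough sketch in Python; the hostnames and addresses below are placeholders, not our real assignments:

    import socket

    # Hypothetical stand-in for the spreadsheet: hostname -> (old IP, new IP).
    IP_MAP = {
        "www.jangomail.com": ("10.0.0.10", "10.1.0.10"),
        "api.jangomail.com": ("10.0.0.11", "10.1.0.11"),
    }

    for host, (old_ip, new_ip) in IP_MAP.items():
        try:
            resolved = socket.gethostbyname(host)
        except socket.gaierror as exc:
            print(f"{host}: lookup failed ({exc})")
            continue
        if resolved == new_ip:
            print(f"{host}: {resolved} - cut over")
        elif resolved == old_ip:
            print(f"{host}: {resolved} - still on the old IP")
        else:
            print(f"{host}: {resolved} - unexpected address")
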

Reverse DNS

This was relatively simple. Since we don't own our IP addresses, and since we only control forward DNS in our own DNS system, we would rely upon the New Data Center's tech team to set up rDNS for us on 3 of our IPs, each of which would be assigned to a different email server.
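
Once the entries were in, verifying them takes only a few lines of the Python standard library. A minimal sketch; the IPs below are placeholders standing in for our real ones:

    import socket

    # The 3 mail-server IPs (hypothetical placeholders here) and the rDNS
    # names the New Data Center's team was asked to publish for them.
    EXPECTED_PTR = {
        "10.1.0.25": "mail.silicomm.com",
        "10.1.0.26": "mx.jngo.net",
        "10.1.0.27": "mail.jangomail.com",
    }

    for ip, expected in EXPECTED_PTR.items():
        try:
            name, _, _ = socket.gethostbyaddr(ip)  # PTR lookup
        except socket.herror as exc:
            print(f"{ip}: no PTR record ({exc})")
            continue
        status = "OK" if name == expected else f"MISMATCH (got {name})"
        print(f"{ip} -> {expected}: {status}")
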

Human Labor

I enlisted a team of 6, including myself: 5 doing the actual move, and one man on the outside to test things and remotely make configuration changes to our firewall and DNS system. The goal was to be offline for no more than 4 hours. The plan was simply this:
  1. Within the first hour, I’d take the firewall from the Old Data Center to the New Data Center to test connectivity to the Internet. A team of 2 would begin dismantling the most important servers (the primary web server, the primary database server, and the API server) but not load them into a vehicle until I had called from the New Data Center to verify connectivity was working. If, for some reason, connectivity wasn’t working, we could easily fall back to the Old Data Center.
  2. Within the next two hours, the primary team of 2 would meet me at the New Data Center with the 3 most critical servers.
  3. During the last hour, the secondary team of 2 would arrive at the New Data Center with all other remaining equipment.
Rails

Several weeks prior to the move, we brought rails from the Old Data Center to the New Data Center to ensure they were compatible. They were not, but with a lug, washer, and screw, we made them work. We tested mounting a 1U server two weeks prior to the actual move.

Firewall

We had only a single firewall appliance running in the Old Data Center, and I contemplated purchasing a second appliance of the same model prior to the move. That would have let me configure the second firewall with the New Data Center’s IPs ahead of time and spare me from having to manually reconfigure the Old Data Center’s firewall mid-move. But the price tag of the appliance convinced me otherwise, and I determined it would take only 10 minutes to swap out the old IPs on the firewall with the new IPs once I was able to connect my laptop to its LAN port in the New Data Center.

A few days prior to the move, we became aware of a major difference between our connectivity setup at the Old Data Center and the setup at the New Data Center. At the Old Data Center, we were simply given a range of IPs, the gateway to use, and DNS servers. All servers and devices, including the firewall, used this configuration.

The connectivity setup at the New Data Center was slightly more complicated, requiring a Layer 3 routing device. An IP would have to be assigned to the WAN port on the firewall, and that WAN port would use the facility’s IP as its gateway. Then, all the devices connected to our DMZ would use the IP we assigned to our firewall as their own gateway. I was wary of this new setup, and therefore wanted to test connectivity during the move (see step 1 above), prior to reaching a point of no return. And because the firewall would now serve as the gateway for all of our servers on the DMZ, it became essential to our uptime. At the Old Data Center, the firewall could have been removed from the network flow and the servers would have remained online. That was no longer the case.
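
Step 1 of the plan hinged on a quick yes/no answer from my laptop behind the newly configured firewall. Something like the following Python sketch captures that test: resolve a name through the new resolvers, then open an outbound TCP connection through the new gateway. The hosts and ports are illustrative:

    import socket

    # Pass/fail connectivity test: can we resolve names and make outbound
    # TCP connections through the new WAN/gateway arrangement?
    CHECKS = [
        ("www.google.com", 80),  # generic outside host
        ("jangomail.com", 80),   # our own site, still on its old IPs, reachable from outside
    ]

    for host, port in CHECKS:
        try:
            ip = socket.gethostbyname(host)  # exercises DNS via the new resolvers
            with socket.create_connection((ip, port), timeout=5):
                print(f"{host}:{port} reachable via {ip}")
        except OSError as exc:
            print(f"{host}:{port} FAILED ({exc})")
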

Software Configuration Changes

The software services that run JangoMail consist of web servers, FTP servers, SMTP servers, SQL Servers, and some monitoring systems. We have many instances of IIS running, and prior to the move, I documented which applications would need configuration changes because of the new IPs. Many of our IIS web/FTP sites are bound not to a specific IP but to “All Unassigned”, which worked in our favor: those sites wouldn’t need their bindings changed manually. But some services, like our corporate email server (IceWarp), have to have DNS servers specifically set by IP address; IceWarp doesn’t inherit the DNS settings from the NIC’s configuration. It was a given that I’d be changing the IPs, gateways, and DNS servers on all NICs on all servers, but I wanted to minimize the number of additional IP changes that would have to be made on top of the NIC changes.
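
To build that document, it helps to enumerate which listeners on a server are bound to a specific IP versus all interfaces (the equivalent of IIS’s “All Unassigned”). Here’s a rough sketch in Python using the third-party psutil library; it’s an illustrative audit, not the exact process I used, and it typically needs to run with elevated privileges:

    import psutil  # third-party; pip install psutil

    # Flag TCP listeners bound to a specific IP (bindings that must be
    # edited after the move) versus those bound to all interfaces
    # (bindings that follow the NIC's new IP automatically).
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status != psutil.CONN_LISTEN:
            continue
        ip, port = conn.laddr
        name = psutil.Process(conn.pid).name() if conn.pid else "?"
        flag = "follows NIC" if ip in ("0.0.0.0", "::") else "NEEDS IP CHANGE"
        print(f"{name:<24} {ip}:{port:<6} {flag}")
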

With my planning finished a mere 3 hours before the move, I was ready to go. Stay tuned for Part III, where I’ll detail what went right and what went wrong during the move, and whether we were able to complete the move within our announced downtime window!