iovation finished making all real-time customer-facing services active-active in February 2013, with reporting services following in May 2013, and continues to improve the underlying availability model.
What is active-active?
Well, that depends on your definition. For iovation, in order to call a real-time service active-active (e.g. an API tier or a web interface tier), it means there is a "full stack" of hardware in two datacenters and that, under normal operating conditions, half the production traffic is sent to one datacenter and half to the other. Any data about a transaction that comes into Datacenter "A" is synchronized in near real time to Datacenter "B", so that a future transaction coming in can make use of the information about that previous transaction. We take this one step further and require that the failure of any given piece of hardware does not cause:
- A customer visible outage
- One of the datacenters to no longer be capable of handling our full peak load by itself, even in the degraded state
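As a concrete illustration, here is a minimal Python sketch of the model described above; the `Datacenter` class, the 50/50 routing, and the synchronous stand-in for near-real-time replication are all hypothetical simplifications, not iovation's actual implementation:

```python
import random

# Minimal model of the active-active definition above; all names here
# are hypothetical simplifications, not iovation's actual code.

class Datacenter:
    def __init__(self, name):
        self.name = name
        self.store = {}          # this datacenter's local copy of the data

    def record(self, txn_id, data):
        self.store[txn_id] = data

dc_a = Datacenter("A")
dc_b = Datacenter("B")

def handle_transaction(txn_id, data):
    # Under normal operations, traffic is split roughly 50/50.
    primary = random.choice([dc_a, dc_b])
    primary.record(txn_id, data)
    # Near-real-time replication: the other datacenter soon holds the
    # same data, so a later transaction landing there can use it.
    # (Modeled synchronously here; in reality it is asynchronous.)
    other = dc_b if primary is dc_a else dc_a
    other.record(txn_id, data)
    return primary.name

handle_transaction("txn-1", {"device": "abc"})
```

Whichever datacenter a transaction lands in, both stores end up with the record, which is what makes either facility able to serve the next request on its own.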
Why did we go active-active?
The discussion first started around availability. At the time we only had a single datacenter and we were concerned about what would happen to it in the event of a major earthquake or fire. While this was originally what drove us to build a second datacenter, looking back on it, our primary reason to do it today would be to increase the velocity with which we can improve existing services and deploy new services.
What are the benefits?
The obvious benefit of active-active is that you can lose a datacenter in a disaster and traffic will be picked up by the other datacenter seamlessly. Note: By running active-active at all times, with transactions going to both datacenters, you have high confidence that a switchover will succeed, versus an active-passive model (or, even worse, a disaster-recovery model) where you never really know what will happen.
Another significant benefit, that may not be as obvious, is the impact this change had on our culture over time.
Along with the change to active-active, we were also evolving the underlying software and data storage technologies so that we could move to a zero-scheduled-downtime model. Previously, we had to take the system down once a month for database updates and code deployments. These downtimes had a huge impact on the team: we performed them extremely early in the morning, we could only do them once a month, and they locked in the code deployment timelines. There were also long notice requirements for customers, which further constrained deployments.
Eliminating scheduled downtimes from the system allowed us to stop burning an entire workday every month where teams were frazzled by their work hours being adjusted.
Doing deployments during the day allows us to do them while staff are fresh and while support resources (e.g. Development, QA, Client Managers) are available, and staff can fit deployment work in between other, non-deployment-related activities.
You might be thinking, “But won’t customers be unhappy that you are doing things during their business hours?” Yes, that is a concern, but iovation has a global customer base. There is no time of the day that our customers, someplace in the world, aren’t working. By eliminating scheduled downtimes, and adding so much resilience to the system, we have greatly improved system availability even with deployments happening during U.S. business hours.
We’ve also found that doing deployments during business hours has led to a faster feedback loop. If a release introduced a bug that was not caught prior to deployment, we have a better chance of getting feedback from our customers while key team members are still at work.
Deployments can now be done much more frequently and without such rigid scheduling. We can do smaller spins on changes and get them into production faster with more immediate feedback. This allows us to tweak things to make them better, or change course if we’re going down the wrong path. Fraudsters move quickly to adapt and change their strategies so the speed at which we can adapt is a competitive advantage.
Another benefit is that less planning is required to do an update. Previously, if we were going to do an update during a maintenance window, we had to rehearse it many times on a non-production system. This was necessary to ensure we knew exactly how long it would take and to polish the procedure to the nth degree. When traffic can be shifted away from the system being upgraded with no impact to customers, less time can be spent planning the deploy step. As long as the deployed software is fully tested before being put back in service, you are not at risk of impacting customers.
Updates are no longer white knuckle occurrences. Before active-active doing an update during a maintenance window was stressful with a huge amount of attention on the event. If one team member ran long on their piece, they were squeezing a coworker’s ability to complete their required tasks within the window.
We are also able to further improve the availability of the system by setting the standard that we shift traffic even when performing tasks categorized as moderate risk (i.e. tasks that previously would not have required a downtime, but that still carry some level of risk). When flipping traffic 100 percent to the other datacenter takes only a couple of minutes of work, there is very little reason not to do it. We do this several times a week on average.
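A traffic flip of this kind can be sketched as follows; the `set_weights` and `drain` functions are hypothetical stand-ins for whatever API your load balancer or global traffic manager actually exposes:

```python
# Illustrative sketch of a 100-percent traffic flip. The names and
# weight API here are assumptions, not any particular vendor's product.

weights = {"dc_a": 50, "dc_b": 50}   # normal 50/50 split

def set_weights(dc_a_pct, dc_b_pct):
    assert dc_a_pct + dc_b_pct == 100
    weights["dc_a"] = dc_a_pct
    weights["dc_b"] = dc_b_pct

def drain(datacenter):
    """Shift all traffic away from `datacenter` before risky work."""
    if datacenter == "dc_a":
        set_weights(0, 100)
    else:
        set_weights(100, 0)

drain("dc_a")                        # maintenance window on datacenter A
assert weights == {"dc_a": 0, "dc_b": 100}
set_weights(50, 50)                  # restore the split when work is done
```

Because the flip is just a weight change, it is cheap enough to do routinely rather than only in emergencies.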
So how does this impact culture?
At iovation we no longer ask employees to work after hours or on weekends for normal infrastructure and software deployments. We still occasionally have to push out an urgent hotfix, but standard deployments are always during business hours now. We also have not needed to dispatch employees to a datacenter after hours or on weekends for several years. Individual pieces of hardware are expected units of failure; our internal service-level agreement is that we start troubleshooting on the next business day, since the system is still highly available after a single component failure.
This helps us with recruiting! iovation is known as a place where employees aren't unnecessarily burdened by expected failure conditions or by regular maintenance tasks. Engineering the system to be maintained during business hours puts less stress on employees’ personal lives and allows us to attract and retain better talent. Did I mention that iovation is hiring?
Additionally, employee projects can make it into production faster than at places with legacy designs. Less time is spent scheduling and planning deployments, leaving more time for work on meaningful features.
We are no longer a captive audience tied to one physical facility without the ability to easily leave. The active-active infrastructure allows us to move a datacenter (for improved quality or reduced costs) as needed. Physical equipment can be moved without customers being impacted and without massive cost/effort to rebuild another complete set of hardware elsewhere.
But isn't this expensive?
Yes, it can be, but there are also opportunities along the way to mitigate the cost. iovation actually did this while reducing our infrastructure spend; of course, your mileage may vary.
In our case, we made a number of cost-reducing changes all at once:
- We moved from Sun SPARC hardware to commodity x86 hardware from vendors such as SuperMicro and Dell.
- We moved storage off expensive Hitachi storage arrays and nearly exclusively onto local spinning disks and consumer-grade SSDs inside the servers (when you have so much redundancy at the application layer, it gives you a lot more flexibility at the storage layer).
- We moved off expensive commercial software solutions (i.e. Oracle Database Enterprise Edition) and on to open-source distributed-scale NoSQL solutions such as Cassandra.
- iovation has been moving toward less expensive colocation facilities as the market has evolved. Specifically, as we have grown we are moving into “wholesale” datacenters since we do not need “high-touch” 24x7 remote-hands services and our architecture allows us the flexibility to leverage cost-optimized dual corded power system designs.
- We purchased our telecommunications circuits on the wholesale market with endpoints in carrier hotels where there is a truly competitive market.
- We moved our network design to a commodity design, avoiding features that induce vendor lock-in (e.g. Cisco Nexus or Juniper QFabric).
While in many cases we doubled our overall server count and added the expense of network links between our facilities, we still reduced our net spend by cutting our appetite for expensive proprietary solutions. Also, in some cases, instead of going from two copies of a piece of data in one datacenter to two copies in each of two datacenters (four copies total), we went to one copy in each of three datacenters (hooked up in a triangle network so there were no single points of failure). This allowed us to increase server count by only 50 percent instead of doubling it for our largest and most expensive data application tiers.
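The copy-count arithmetic above can be checked with a quick back-of-the-envelope calculation; the tier size of ten servers per copy is a made-up number for illustration:

```python
# Back-of-the-envelope check of the copy-count trade-off described above.
servers_per_copy = 10                            # hypothetical tier size

single_dc_baseline = 2 * servers_per_copy        # original: 2 copies, 1 DC
two_copies_two_dcs = 2 * 2 * servers_per_copy    # 4 copies total
one_copy_three_dcs = 3 * 1 * servers_per_copy    # 3 copies in a triangle

growth_doubling = two_copies_two_dcs / single_dc_baseline   # 2.0x servers
growth_triangle = one_copy_three_dcs / single_dc_baseline   # 1.5x servers
```

The triangle layout keeps two surviving copies after any single-datacenter loss, while growing the tier only 50 percent instead of 100 percent.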
We also do not have to maintain as much “burst” capacity in the system within each datacenter as we did before. We have always kept a healthy margin of capacity within the system for unexpected customer volume, but now we can rely on the second datacenter to provide that extra capacity since it is available nearly 100 percent of the time. This means the extra capacity is not just “wasted” as it is in an active-passive system. Note that when we say "unexpected" we really mean unexpected, not just seasonal changes in volume around the holidays. A single datacenter is still sized to handle that peak volume.
Even if you have already taken all of the cost-saving strategies described above, there is still a strong likelihood that increased spending on infrastructure to go active-active will pay off by decreasing your time-to-market on new features and improving productivity in your engineering teams.
This sounds too good to be true, what are the gotchas?
First off, coupling between datacenters is to be avoided to the greatest extent possible. The worst possible outcome when trying to go active-active would be to spend all the time/energy/money on building out a second datacenter only to create a situation in which *either* datacenter going down takes you offline.
Extreme attention to design and implementation detail is necessary to ensure the least amount of coupling exists between datacenters. Your goal is to have two independent stacks of hardware/software that know nothing about each other. If one goes down it has no impact on the other. The “gotcha” here is that this is impossible for any service that involves state/data that must be shared across the wire. Keeping these to a minimum is your goal. Failure modes must be considered and tested.
You also must be very careful to do regular capacity testing, because during normal active-active operations the system as a whole has far more capacity than in its worst-case degraded state (as defined by your availability model). In our model, some application tiers have four servers in them—two in one datacenter and two in the other. Under this model, we work to ensure any one of the four servers has sufficient capacity to take on our peak load, as the worst-case scenario is for one server to fail on a weekend (an expected type of failure), and then for the opposite datacenter to have a complete outage.
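A minimal sketch of that worst-case capacity check, with hypothetical load and capacity numbers:

```python
# Worst-case capacity check for the four-server tier described above:
# one server fails (an expected failure), then the opposite datacenter
# goes dark, leaving a single server that must still carry peak load.
# The load and capacity figures are made-up illustrations.

peak_load = 1000                      # hypothetical peak requests/sec
server_capacity = 1200                # what one server can sustain

servers = {"dc_a": 2, "dc_b": 2}      # two servers in each datacenter

def worst_case_survivors():
    # Lose one server in dc_a, then lose all of dc_b.
    return servers["dc_a"] - 1

assert worst_case_survivors() * server_capacity >= peak_load
```

Capacity tests should exercise this degraded state directly, not just the happy path where all four servers are sharing the load.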
Another thing to be aware of is unanticipated coupling that may sneak into the system over time. In addition to switching traffic back and forth between datacenters regularly, it is wise to actually test “pulling the plug” on a non-active datacenter (i.e. severing its network links) to ensure the remaining facility continues to run as expected. This is where those inevitable systems that do have coupling between datacenters will reveal themselves.
I am onboard, but how do I do it?
You will need to take things one step at a time. It took iovation several years, during which time we happened to be rewriting our application, which allowed us to design it to support active-active.
Infrastructure comes first. The design patterns for core infrastructure are well established and tested. It is just a matter of finding the right facilities, procuring the hardware, and deploying it appropriately. In our case, we have redundant Internet routers, firewalls, core switches, load balancers, monitoring servers, etc., in each of our datacenters. Then we have redundant 10-gigabit wave circuits between the datacenters.
Application design patterns, on the other hand, are still evolving to support this, but huge progress has been made as more tools become available for this kind of design. We leverage tools such as Cassandra (NoSQL), message queueing, Elasticsearch, and Postgres. Particular focus should be placed on message queue design, as queues are a great way to reduce coupling in an asynchronous fashion.
Eventual consistency is your friend! Really, it is! Nearly all applications can tolerate a little bit of eventual consistency which is how you eliminate coupling.
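One common way to get eventual consistency is a write-behind replication queue. The sketch below is an illustrative single-process model, not iovation's implementation; in production the replicator would run asynchronously and the queue would be a durable message broker:

```python
from queue import Queue

# Sketch of eventually consistent cross-datacenter replication through
# a message queue. Both datacenters accept writes independently; a
# replicator drains the queue whenever the link is up. All names here
# are illustrative.

local_store = {}                      # datacenter A's data
remote_store = {}                     # datacenter B's data
replication_queue = Queue()

def write(key, value):
    # The local write succeeds immediately; no cross-DC coupling.
    local_store[key] = value
    replication_queue.put((key, value))

def run_replicator():
    # In reality this runs asynchronously; here we drain synchronously.
    while not replication_queue.empty():
        key, value = replication_queue.get()
        remote_store[key] = value

write("device:123", "seen")
assert "device:123" not in remote_store   # B is briefly behind (eventual)
run_replicator()
assert remote_store["device:123"] == "seen"
```

The key property is that a datacenter outage or a severed link only delays the queue drain; it never blocks the local write path.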
But what if my applications don’t support it?
Your options are a lot more limited, but first, push on your vendor to support active-active! It truly is the way of the future. However, there are some things you can do at lower levels with technologies such as VMware, though these are really targeted more at active/passive designs. If you do go with an active/passive design, the key is to fail back and forth regularly (i.e. weekly) so that your procedure is well rehearsed and tested.
While iovation launched active-active in order to enhance system availability, it turned out the overwhelming benefit was the culture change that it brought to the organization. The biggest takeaway—active-active architectures are now within reach of companies smaller than the hyper-giants Google/Amazon/Apple. Once you go active-active you will never go back!