Does Your Datacenter Have An SLA?
from the prove-it dept
I have serious concerns about whether mission-critical applications are having their SLAs met in datacenters, whether hosted in-house, supported by a third party, or hosted in any other datacenter-based arrangement. First, consider the alternative: the server sits in a room next to your expert developers. Sure, it's probably a SOX violation, but I can tell you this much: that server will not go down often, and if it does, you can be sure it will be restored as fast as humanly possible. That's the advantage of having an expert babysit your system. If you have two experts in different geographic locations, each babysitting a server in case the other goes down, then you have about the best support possible. For large systems, however, this may not be practical.
But how do you know that a datacenter-hosted app has this type of support? First, you need to know for sure what the SLA spells out in terms of support and monitoring. Look for this in your SLA:
"If your app encounters event W, person X will do Y about that specific event within Z amount of time"
I guarantee that anything less specific than that, or anything equally specific that is not written into the SLA, will not be honored. Vague responses equal no responses: why would the datacenter host open itself up to liability by initiating a response that wasn't specified in writing? Only specific, measurable responses with named responsible parties can be enforced; with those in the SLA, the datacenter host can be held accountable for any failure to respond as specified.
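One way to apply the W/X/Y/Z test is to write each SLA commitment down as a record and flag whichever parts are blank or unmeasurable. The sketch below is only an illustration: the `SlaClause` type, the `missing_specifics` helper, and the sample clauses are all hypothetical, not taken from any real contract or tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SlaClause:
    """One written commitment: on `event`, `responder` does `action` within `deadline`."""
    event: str            # W: the specific event
    responder: str        # X: the named responsible party
    action: str           # Y: what they will do about it
    deadline: timedelta   # Z: how long they have


def missing_specifics(clause: SlaClause) -> list[str]:
    """Return the names of the clause parts that are blank or not measurable."""
    missing = [name for name in ("event", "responder", "action")
               if not getattr(clause, name).strip()]
    if clause.deadline <= timedelta(0):
        missing.append("deadline")
    return missing


# Hypothetical example clauses, made up for illustration.
specific = SlaClause("primary database unreachable", "on-call DBA",
                     "fail over to the standby", timedelta(minutes=15))
vague = SlaClause("outage", "", "respond promptly", timedelta(0))

print(missing_specifics(specific))  # []
print(missing_specifics(vague))     # ['responder', 'deadline']
```

A check like this only catches empty fields. Whether "respond promptly" counts as a real action is still a judgment call for whoever reads the contract.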
So assume you have an acceptable SLA in place, and you know what they're supposed to do. How can you be sure they'll actually do what they say? You need to know before you count on the host for anything mission-critical, so while the mission-critical app is still running somewhere else (i.e. being babysat by an expert), set out to prove that the support can respond by staging various types of failures. You could tell the host about the staged failures, but then they'll know and will definitely staff and respond appropriately. I would stage failures and not tell the host that they are a test. After all, from the host's perspective, any failure is a failure. Measure the response closely and check whether the SLA was honored as written. Any failure to honor it, for any reason, is a strong indication that the host is not prepared to honor the SLA, which could cost you your mission-critical app.
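Measuring the response can be as simple as logging when each failure was staged and when the host responded, then comparing the gap against the deadline in the SLA. The following is a minimal sketch under that assumption; the `DrillResult` type, the events, the timestamps, and the deadlines are invented for illustration and are not real measurements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DrillResult:
    """Outcome of one staged failure (hypothetical record layout)."""
    event: str
    failure_staged_at: datetime
    host_responded_at: Optional[datetime]  # None if nobody ever responded
    sla_deadline: timedelta

    def honored(self) -> bool:
        """True only if the host responded, and did so within the SLA deadline."""
        if self.host_responded_at is None:
            return False
        elapsed = self.host_responded_at - self.failure_staged_at
        return elapsed <= self.sla_deadline


# Made-up drill log: one response on time, one late, one never.
start = datetime(2024, 1, 1, 2, 0)
drills = [
    DrillResult("web server process killed", start,
                start + timedelta(minutes=9), timedelta(minutes=15)),
    DrillResult("disk filled to 100%", start,
                start + timedelta(minutes=42), timedelta(minutes=30)),
    DrillResult("network link dropped", start,
                None, timedelta(minutes=15)),
]

missed = [d.event for d in drills if not d.honored()]
print(missed)  # ['disk filled to 100%', 'network link dropped']
```

Treating "no response" the same as "late response" matches the point above: any failure to honor the SLA, for any reason, counts against the host.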
Do not let a complicated roll-over setup or automated monitoring persuade you that the datacenter can respond to any event with seamless mission-critical coverage. An inexperienced datacenter admin hitting the wrong button can send any app to Davy Jones' locker in a big hurry. If you truly want mission-critical backup performance, ask yourself what would happen if the datacenter were completely unresponsive. For example, what if it were hit by a hurricane and completely wiped out? How soon could you be back up and running, and at what capacity? If you can't answer that, you had better find an answer before some unpredictable event knocks out the one server running everything.