How we maintain 99.5% guaranteed uptime
At the heart of the Software-as-a-Service (SaaS) business model is the understanding that the service will always be available when needed. In our 24×7 flattened world, that pretty much means all the time.
While we strive to ensure that our clients’ customers will enjoy 24×7×365 availability, in reality there are too many moving parts involved to commit to that Service Level Agreement (SLA) affordably. Besides, we’re not operating a heart-lung machine where lives are at stake 🙂 Nevertheless, we do aim to do our best, and at the very least, we promise our clients 99.5% monthly availability…which translates to no more than 3.60 unavailable hours in a 30-day month.
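For the curious, the arithmetic behind that number can be sketched in a few lines of Python. This is only an illustration: the function name `downtime_budget_hours` is made up for this post, and it assumes the SLA is measured against every hour of the calendar month.

```python
def downtime_budget_hours(sla_percent: float, days_in_month: int) -> float:
    """Maximum hours of unavailability a monthly SLA allows (illustrative helper)."""
    total_hours = days_in_month * 24
    return total_hours * (100.0 - sla_percent) / 100.0


# The budget shifts slightly with the length of the month.
for days in (28, 30, 31):
    print(f"{days}-day month: {downtime_budget_hours(99.5, days):.2f} hours")
# 28-day month: 3.36 hours
# 30-day month: 3.60 hours
# 31-day month: 3.72 hours
```

The 3.60-hour figure corresponds to a 30-day month; a 31-day month allows a few extra minutes.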
Almost a year and a half ago, I instituted daily operational stand-up meetings that (amongst other things) are dedicated to monitoring our data center and application performance. I did this because while much of our operations are automated, human beings are still needed to provide oversight and to design and institute process improvements as warranted. That seems to have worked out well for us, although because we weren’t keeping track of our cumulative available hours, we really didn’t know how well…until now!
Starting this month, we’re going to publish our Availability statistics at the beginning of every month. Along with the data, we’ll also offer any color commentary that appears appropriate for that month’s data. This month I’ll start by posting our first set of numbers (retroactive to the beginning of this year) and explaining some of the terms used. There are four things of note in this chart.
The first one is a horizontal blue line that denotes our SLA target. That’s the 99.5% line. The red and blue bars should never dip below this line…or I’m sure I’ll be getting a few phone calls.
As for the red and blue bars, they separately measure the availability of the infrastructure of our data center (System) and the appropriate behavior of our service itself (Application).
System Availability (red bar) covers things like our application servers, web servers, databases, firewalls, routers and virtual private networks. Application Availability (blue bar) covers the successful behavior of the application itself, including catastrophic errors caused by our application and SAP unavailability (not usually in b2b2dot0’s control).
While users of our website couldn’t care less about why the website was unavailable to them, we measure both types of availability because it guides the investments we make to meet our SLA. Regardless, whenever the service is down, whether unexpectedly or for maintenance, we post appropriate messages to the website’s visitors.
Incidentally, we interrogate our infrastructure every 10 seconds, around the clock, every day!
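To give a feel for how a stream of 10-second checks turns into a monthly percentage, here is a minimal Python sketch. It is not our monitoring code: it assumes each check simply passes or fails and that availability is the share of checks that passed. The names `availability_percent` and `probes_per_month`, and the sample data, are hypothetical.

```python
from typing import Sequence

PROBE_INTERVAL_SECONDS = 10  # from the post: one check every 10 seconds


def probes_per_month(days_in_month: int = 30) -> int:
    """Number of checks taken in a month at one check per interval."""
    return days_in_month * 24 * 60 * 60 // PROBE_INTERVAL_SECONDS


def availability_percent(probe_results: Sequence[bool]) -> float:
    """Share of checks that passed, as a percentage (assumed definition)."""
    if not probe_results:
        raise ValueError("no probe results to summarize")
    successes = sum(1 for ok in probe_results if ok)
    return 100.0 * successes / len(probe_results)


# Hypothetical month: every check passes except 26 in a row (about 4.3 minutes).
total = probes_per_month(30)
results = [True] * (total - 26) + [False] * 26

print(total)                                     # 259200
print(round(availability_percent(results), 2))   # 99.99
```

At this sampling rate a 30-day month yields 259,200 data points, so even an outage of a few minutes shows up in the numbers.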
The last data series, RFC Transactions (green line chart), is meant to be a measure of the overall “meaningful usage” of our service, and whether or not that load is varying from month to month. The green line will trend upward when either new users are registered to the service, those users are using the website more frequently, or both.
An RFC (remote function call) is the primary programmatic mechanism by which our service communicates with SAP. Maintaining a steady Availability SLA, in the face of a growing RFC load, is a testament to the scalability of our infrastructure.
Overall Analysis of our Performance for the First 4 Months of 2010
- System usage climbed steadily for the first 3 months due to an increase in registered usage.
- Lowest Availability (Application) was 99.62% in March, caused by a client VPN failure that rendered SAP unavailable.
- On average, Application Availability was 99.83%, which represents a little over 1 hour of downtime (roughly 73 minutes in a 30-day month) across all of our 9000 registered users for the month.
- System Availability was consistently above 99.9%, which represents less than 43 minutes of unavailability per month for end users. In practice, System Availability is close to 99.99%, which represents about 4.3 minutes of unavailability per month.
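As a quick cross-check of the figures above, the Python sketch below converts each reported percentage into minutes of downtime and into the share of the 99.5% SLA allowance it would consume. It assumes a 30-day month; the helper names `downtime_minutes` and `budget_used_percent` are invented for this illustration.

```python
SLA_PERCENT = 99.5
MINUTES_IN_MONTH = 30 * 24 * 60  # assumes a 30-day month: 43,200 minutes


def downtime_minutes(availability_percent: float) -> float:
    """Minutes of unavailability implied by a monthly availability figure."""
    return MINUTES_IN_MONTH * (100.0 - availability_percent) / 100.0


def budget_used_percent(availability_percent: float) -> float:
    """Share of the SLA's downtime allowance that the figure consumes."""
    return 100.0 * (100.0 - availability_percent) / (100.0 - SLA_PERCENT)


# Percentages quoted in the list above; the labels are just descriptive.
reported = [
    ("Application, March low", 99.62),
    ("Application, average", 99.83),
    ("System, floor", 99.9),
    ("System, typical", 99.99),
]

for label, pct in reported:
    print(
        f"{label}: {pct}% -> {downtime_minutes(pct):.1f} min down, "
        f"{budget_used_percent(pct):.0f}% of SLA allowance"
    )
```

Seen this way, even the March low used roughly three quarters of the monthly allowance, while typical System Availability uses only a small fraction of it.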
All in all, I’d say that our data center is performing wonderfully. While we are consistently delivering greater than expected availability to the end user, we’ve learned that our Achilles’ heel is ensuring connectivity to a high-performing SAP system. The good news is that we’ve instituted an early warning system that we use to tip off our clients when/if their SAP system is experiencing any connectivity or performance challenges. I guess you could say we’re the canary in their mine 🙂
I’m very proud of the investments that we’ve made in our data center’s infrastructure, tools and procedures. Thanks to them, our clients can all sleep well at night. Starting today, we can visualize just how well they can sleep!