Availability
AVAILABILITY
TTM | ROI | Sellability | Agility | Reputation |
A measure of a system's ability to be accessed by its consumers.
Availability is a measure of how accessible a system (or its data) is, in terms of its ability to be accessed rather than the diversity of its consumption patterns, compared with the total time it could have been accessible [1]. Put another way, it's the time that a system is actually able to service requests in relation to the time it could potentially have done so (the former always being the lower of the two).
Availability is a reflection of Stakeholder Confidence and Reliability. High availability helps to make a system more reliable (and thus increases confidence), whilst a system with poor availability is often riddled with reliability concerns.
TERMINOLOGY
You may hear Availability sometimes qualified in terms of high, medium, or low availability. In fact High Availability is so often quoted, it has its own acronym (“HA”).
As a generalisation, systems that demonstrate high availability are better received. However, this should also be qualified by the solution and its users' needs (since every availability enhancement has a cost). For instance, the systems in a nuclear submarine require very high levels of availability (failing very rarely) - a failure in this scenario is a definite faux pas. Whereas a blogging site may require far less rigour.
Of course, unless we can agree upon a common language and understanding, our interpretations of “high availability” could be very different. What does “high” actually mean? The good news is that we can also express availability quantitatively - as a percentage - providing a means both to measure it and to agree it with others.
SLAS
A shared, contractual, availability agreement is often found in a Service Level Agreement (SLA).
(Good) SLAs use clear (unambiguous) language to describe the expected levels of service between two parties, accompanied by clauses to ensure there are appropriate ramifications should one party fail in its duty.
The following formula determines availability:
Availability = ( MTBF / (MTBF + MTTR) )
- MTBF (Mean Time Between Failures) - Also known as uptime. It's the average duration that a system is available (and usable) between one failure and the next.
- MTTR (Mean Time To Recovery) - It's the duration it takes to restore a system to a working, and available state, including the time required to fix the system.
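The formula above can be expressed as a small helper. This is only a minimal sketch: the function name, the choice of hours as the unit, and the 999/1 figures are illustrative assumptions rather than anything prescribed by an SLA.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction between 0 and 1.

    Implements Availability = MTBF / (MTBF + MTTR).
    """
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("durations must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("at least one duration must be positive")
    return mtbf_hours / total


# Hypothetical figures: on average the system runs for 999 hours,
# then takes 1 hour to restore.
print(f"{availability(999, 1):.1%}")  # prints 99.9%
```

Note that the same availability figure can come from very different failure profiles; many short outages and one long outage may produce an identical percentage.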
We can further reinforce this with a few examples.
EXAMPLE A
Scott is building a software solution for client A. They have agreed to 95% availability. To calculate the permitted maximum amount of downtime, Scott does the following calculation:
365.2425 * 24 = 8765.82 hours [2]
8765.82 (hours) * 60 (mins) = 525949.2 mins
# 5% of total minutes in a year is…
525949.2 * 0.05 = 26297.46 mins
# revert back to hours for simplicity
26297.46 / 60 = 438.291 hours
438.291 / 24 = 18.262 days
Equating to: 18d 6h 17m 27s
Therefore to meet a 95% availability commitment, the system can be down for a maximum of 18.262 days (18d 6h 17m 27s) without incurring penalties. Of course, in reality I'd not recommend using all of your leeway in one go!
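Scott's arithmetic can also be wrapped in a couple of small functions. The sketch below assumes the same mean Gregorian year (365.2425 days) and truncates to whole seconds; the function names are illustrative and not taken from any library.

```python
DAYS_PER_YEAR = 365.2425  # mean Gregorian year, as used in the calculation above


def permitted_downtime_seconds(availability_pct: float) -> float:
    """Seconds of downtime per year allowed by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    seconds_per_year = DAYS_PER_YEAR * 24 * 60 * 60
    return seconds_per_year * (100 - availability_pct) / 100


def format_duration(seconds: float) -> str:
    """Format a duration as days, hours, minutes and whole (truncated) seconds."""
    whole = int(seconds)
    days, rest = divmod(whole, 86_400)
    hours, rest = divmod(rest, 3_600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


print(format_duration(permitted_downtime_seconds(95)))  # prints 18d 6h 17m 27s
```

Passing a different percentage gives the allowance for any other commitment, which makes it easy to compare the cost of each additional "nine".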
EXAMPLE B
Ok, now let's say that instead of 95%, the client wants 99% availability. This time Scott uses an online SLA calculator [3], and gets: 3d 15h 39m 29s
Therefore to meet a 99% availability commitment, the system can be down for a maximum of 3d 15h 39m 29s without incurring penalties.
FIVE NINES (99.999%) AVAILABILITY
You may have heard the term “five nines availability”. It's the pinnacle of the Availability quality, but is both extremely difficult and costly to achieve.
Fundamentally, as a system approaches 100% availability, it becomes increasingly expensive to build, maintain, and operate it. You've got to do a lot more analysis (“what makes up our entire system?”), thinking (“which areas are a threat, and how would I reduce their impact?”), and then doing it. Essentially, we're widening our scope to include many more aspects (components, power, geographical areas), all of which must guarantee their availability, typically by removing any potential Single Points of Failure. The success of the Cloud is in part down to its ability to encapsulate (and take responsibility for) some of this for you.
EXAMPLE AVAILABILITY
Here are some common SLA availability examples [3].
SLA | 95% | 99% | 99.9% | 99.99% | 99.999% |
Permitted Downtime (per year) | 18d 6h 17m 27s | 3d 15h 39m 29s | 8h 45m 56s | 52m 35s | 5m 15s |
Notice the increasingly rigid availability requirements as we get nearer to “five-nines”.
AVAILABILITY & SINGLE POINTS OF FAILURE
A chapter on Availability wouldn't be complete without a section on Single Points of Failure.
One of the most common causes of system failure is failing components with no redundancy (i.e. a single point of failure). The failure creates a lack of availability whilst the problem is identified and resolved (the Mean Time To Recovery aspect described above). Conversely, redundant components give us a backup plan, enabling us to redirect traffic away from failing components and onto functioning ones, thus retaining availability.
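The effect of redundancy can be estimated with some simple probability. The sketch below is not from the text above: it assumes component failures are independent (which real systems sharing power, network, or a deployment pipeline often violate), and the function names are illustrative.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up.

    Each component is a single point of failure, so the availabilities multiply.
    """
    return prod(components)


def redundant_availability(component: float, copies: int) -> float:
    """Availability when at least one of `copies` identical components must be up.

    Assumes the copies fail independently of one another.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component) ** copies


single = 0.99  # hypothetical availability of one instance
print(f"chain of two:   {serial_availability([single, single]):.4%}")
print(f"one instance:   {single:.4%}")
print(f"two redundant:  {redundant_availability(single, 2):.4%}")
```

Under these assumptions, chaining two 99% components lowers the overall figure to roughly 98%, whereas running two redundant 99% instances raises it to roughly 99.99%.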
AVAILABILITY & “ALWAYS ON”
The “always on” expectation some businesses take towards software construction is another aspect that may affect availability. With this approach two components make certain (temporal) assumptions (Assumptions) about one another - a Temporal Coupling - indicating a synchronicity between them.
This is fine when things are in a working state, but a single failure in a dependent is all that's needed to break an entire user journey (which may have other ramifications, such as replay needs, or concerns over data integrity). A common solution here is to decouple ourselves from tightly-coupled components and employ asynchronous bulkheads (Bulkheads) between components.
ISOLATION & EXPERIENCE
Isolating different components behind bulkheads can also help with the user experience, allowing much of the remainder of a system to function, even whilst other parts don't.
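To illustrate the asynchronous decoupling described above, here is a minimal, single-process sketch. The class names (OrderService, EmailService) and the in-memory outbox are hypothetical; a real system would more likely use a durable message queue, which is not shown here.

```python
from collections import deque


class EmailService:
    """Hypothetical downstream dependency that may be unavailable."""

    def __init__(self) -> None:
        self.available = True
        self.sent = []

    def send(self, message: str) -> None:
        if not self.available:
            raise ConnectionError("email service is down")
        self.sent.append(message)


class OrderService:
    """Accepts orders without waiting on the email service."""

    def __init__(self, email: EmailService) -> None:
        self.email = email
        self.outbox = deque()  # buffer acting as the bulkhead

    def place_order(self, order_id: str) -> str:
        # Queue the notification instead of calling the dependency inline,
        # so an email outage cannot break the ordering journey.
        self.outbox.append(f"order {order_id} confirmed")
        return "accepted"

    def drain_outbox(self) -> int:
        """Deliver queued messages; on failure, stop and keep the rest queued."""
        delivered = 0
        while self.outbox:
            try:
                self.email.send(self.outbox[0])
            except ConnectionError:
                break
            self.outbox.popleft()
            delivered += 1
        return delivered


email = EmailService()
orders = OrderService(email)
email.available = False
print(orders.place_order("A1"))  # order is still accepted while email is down
print(orders.drain_outbox())     # nothing can be delivered yet
email.available = True
print(orders.drain_outbox())     # the queued message is delivered after recovery
```

The trade-off is that the confirmation arrives later than it otherwise would, in exchange for the ordering journey remaining available whilst the dependency is not.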
AVAILABILITY & OTHER QUALITIES
Several other architectural qualities also link to Availability. For instance, a solution that can't scale to meet user demand can cause an availability outage, rendering the service useless. Alternatively, a solution with poor Performance (i.e. a sluggish response to user requests) may create its own availability concerns, as users turn their backs on it.
UNAVAILABLE & UNUSABLE
Poor performance doesn't necessarily make a system technically unavailable; however, it may make it unusable from a practical perspective, in which case it is deemed unavailable.
Possibly the most well-known link with Availability is the Security quality, in the form of the CIA Triad of Confidentiality, Integrity, and Availability [4]. Confidentiality being used to retain secrecy/privacy; Integrity, to ensure transactions remain accurate and there has been no tampering, and Availability, our focus here. The most obvious example of an availability risk in the security context is a Denial-of-Service (DoS) attack, where legitimate users are denied access to a service by attackers, thus creating availability concerns.
PILLARS AFFECTED
SELLABILITY
Availability is a key quality for many customers, is regularly stated in contractual obligations (e.g. SLAs), and is therefore an important part of the overall sales process.
Let's say Mass Synergy (Case Study A) wants to enter the video streaming industry, with the intent of charging customers for access to both an existing catalogue of content and live sports events. They successfully market the service, drawing in a large number of customers. However, during the live stream, customers find the system unable to service the demand (causing a Self-Inflicted Denial-of-Service), eventually forcing them all out mid-event and shutting down the service for its remainder.
This is poor availability, caused by an inability to scale. It's a terrifying prospect to businesses in this industry - affecting their reputation, profit margin, and even their potential chances of getting more content - thus, the focus on Availability.
REPUTATION
If Reliability (the parent to Resilience and Availability) correlates with Stakeholder Confidence, then it stands to reason that availability also has an effect on Reputation.
SUMMARY
Availability and Resilience make up the Reliability quality. Availability relates to the system's ability to remain available to service requests, whilst Resilience relates to how well a system copes with failure. Both affect Stakeholder Confidence and therefore Reputation.
System availability may be qualified (e.g. “High Availability”) or quantified (e.g. 99.9%). Five-Nines Availability (99.999%) is the apogee - indicating a system has extremely high availability (~5 minutes of downtime per year) - but is extremely difficult to achieve. In fact, it gets increasingly difficult and costly to achieve such a position as we near it.
Availability is also linked to (and affected by) other qualities, such as Scalability, Performance, and Security. A system with poor scalability, poor performance, or weak security controls can undermine its own availability.
Availability can be enhanced through the use of the following approaches:
- Redundancy. We duplicate parts of the system across multiple availability zones and/or regions to ensure there's no Single-Point-of-Failure.
- Support Horizontal Scalability and the Load Balancing of your services.
- Reduce the Assumptions made (and therefore the coupling) within and between system components. We might do this by distributing unrelated functional responsibilities (cohesive components make fewer assumptions than their counterparts), or reducing the assumptions we make about other components' availability (i.e. temporal assumptions). See the section on coupling (The Many Forms of Coupling) for more information.
- Segregate parts of the system and employ Bulkheads between key system components, enabling parts of it to be offline (unavailable) without significantly impacting other areas.
FURTHER CONSIDERATIONS
- [1] - Technically, it's not only about how accessible a system is, but also how usable it is. A system may be available, but if it takes several minutes to respond to a simple query, then it isn't usable - and thus is the equivalent of unavailable.
- [2] - There are 365.2425 days in a year on average, based on the Gregorian mean.
- [3] - https://uptime.is
- [4] - the CIA Triad of Confidentiality, Integrity, and Availability is an integral part of any security policy and control implementation.
- https://www.google.com/search?q=airline%2Bsystem%2Boutages
- https://www.atlassian.com/incident-management/kpis/common-metrics
- Assumptions
- Bulkheads
- Case Study A
- The Cloud
- Denial-of-Service (DoS)
- The Many Forms of Coupling
- Single-Points-of-Failure
- Service Level Agreements (SLA)
- Stakeholder Confidence