OBSERVABILITY
TTM | ROI | Sellability | Agility | Reputation
The ability to observe the present (or past) system state in order to contextualise, diagnose, or make decisions.
Our ability to gather data, contextualise it (i.e. turn it into information), and then leverage it allows us to enhance our customers' experience and engagement, and improve our system resilience. Without it we are “running blind”, lacking the value of rigorous (empirical) measurement, and therefore impairing our decision-making. Observability helps.
WHY MEASURE?
A software product is a technical realisation of an idea (What is Software?). As we know, ideas can be good or bad, early, late, or on time. Measuring their success allows us to determine their worth, identify potential issues, compare them, and subsequently improve them.
We can split Observability into three branches. The first is to observe our system, to ensure it remains healthy and accessible. The second is to observe how (well) we deliver value to our customers - our Value Stream. The third is to observe the performance (success) of our business to ensure we are making positive contributions towards our goals. The table below indicates the qualities they promote.
Area | Affects
System Metrics | Availability, Resilience, Scalability, Security, Usability. |
Value Stream Metrics | Productivity, Releasability. |
Customer Metrics | Business Goals, Customer Uptake, Effectiveness, Agility, Usability. |
SYSTEM METRICS
A common complaint about software realisations is that we sometimes neglect to gather sufficient information to contextualise their technical capabilities (strengths and weaknesses).
As engineers we want to capture: compute usage (e.g. CPU), memory usage, instance health, build and deployment times, and log and access files. We may use this quantitative empirical information to more accurately measure (or predict) Availability, Resilience, Usability, Scalability, and Security.
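By way of illustration, the sketch below captures a handful of these metrics. It assumes the third-party psutil library is installed; the metric selection and field names are illustrative, not prescriptive.

```python
# A minimal sketch of capturing a few of the system metrics mentioned above.
# Assumes the third-party psutil library is available (pip install psutil).
import psutil

def capture_system_metrics() -> dict:
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),       # compute usage
        "memory_percent": psutil.virtual_memory().percent,   # memory usage
        "disk_percent": psutil.disk_usage("/").percent,      # storage usage
    }

if __name__ == "__main__":
    print(capture_system_metrics())
```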
System observability is typically achieved using the following mechanisms:
- Monitoring & Alerting.
- Logging.
- Tracing.
ON NARRATIVE
I'll be discussing each of these next. To best articulate them, I've chosen to view each through the lens of how we construct modern software, namely:
- The Distributed Architecture.
- Ephemeral Infrastructure, and the Cloud.
- The Declarative Model.
MONITORING & ALERTING
Our software applications, and the infrastructure they run on, have a tendency to fail, or malfunction, over time. Foisted upon these are the additional challenges of Scalability (does our solution have sufficient resources to handle additional load?) and Security (is a miscreant trying to tamper with our solution for nefarious purposes?).
The act of verifying our system is in the correct (healthy) state is called monitoring. The act of notifying someone (or something) about a potential inconsistency in that state - so it can be analysed and (possibly) resolved - is known as alerting. We can monitor without alerting (e.g. Real-Time Dashboards of system health), but not alert without monitoring (what condition would you alert on?). From a practical perspective though, we really need both.
Monitoring tools provide a point-in-time view of a service's health and performance metrics, promoting better transparency and earlier stakeholder engagement, which can be helpful in resolving potential issues quickly, prior to a failure.
REAL-TIME DASHBOARDS
A Real-Time Dashboard shows monitoring information in near real-time, allowing us to view important “events” and incoming transactions.
To do so, we must first overcome a few key challenges. Firstly, there's the Distributed Architecture. Since a distributed architecture distributes our services (and load) across multiple instances, it typically equates to us running many, smaller instances (rather than the one large instance of the monolith). Secondly, we also know that our infrastructure is becoming increasingly dynamic compared to yesteryear (the Cloud). Ephemeral instances are increasingly the norm. We don't nurse things back to health, we simply replace them (Cattle not Pets). Consequently, our software applications (and infrastructure) may die or become unresponsive (potentially for reasons outside of our control), and thus be respawned elsewhere. There's no guarantee of a long lifetime.
LOCALISED STATE
We should not rely upon localised state persistence; instead, we should assume the worst case - that the service may terminate with little-to-no forewarning. To rephrase, we can't retain state in a single localised place, as there is no way to recover it if things go awry. It's a similar argument to why we don't like “sticky sessions” from a Scalability perspective.
Anything important enough to retain (e.g. metrics, alerts, access requests, log files) should be promulgated, and made easily discoverable and searchable.
And finally, we have the Declarative Model. The declarative model adds a further layer of abstraction to our software, for instance through Orchestrator technologies (like Kubernetes) and the Platform-as-a-Service (PaaS). These have greatly simplified things, but they also abstract away the Observability aspect.
The combination of automation, distribution of software services, and ephemeral instances makes manual Observability entirely impractical. We must employ services instead. The two approaches are: (1) push and (2) pull.
With the pull model, a centralised service accesses the running instances (via their APIs) at regular intervals (typically by polling a health check endpoint) and persists the results (often in a Time-Series Database). The push model typically expects an “agent”, or plugin, to be deployed alongside the running software, to regularly push system metrics out to a centralised service (again, this may be a time-series database).
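As a minimal sketch of the pull model, the service below exposes a health-check endpoint for a central monitor to poll. The /health path and the payload shape are illustrative assumptions, not a prescribed contract.

```python
# A minimal sketch of the pull model: the service exposes a health-check
# endpoint that a central monitoring service polls at regular intervals.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "UP"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # The central monitor would poll http://<host>:8080/health and persist
    # the results in (for example) a time-series database.
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```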
PAAS ALERTING
One advantage of the PaaS is that - assuming you follow its “contract” - it will hook all of this together for you, thereby reducing complexity and creating a degree of Uniformity.
In terms of alerting - the notifications we receive to indicate a potential problem - we may attach thresholds (e.g. 20 failed transactions in a row) to services, thereby alerting operational staff (or other systems) through various communication channels, including: email, MS Teams, and Slack, where they can be actioned.
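The sketch below illustrates one simple threshold - the “20 failed transactions in a row” example - with the notification channel left as a hypothetical stub.

```python
# A minimal sketch of threshold-based alerting: raise an alert once N
# consecutive transactions have failed. notify() is a hypothetical stub
# standing in for a real channel (email, MS Teams, Slack, etc.).
class ConsecutiveFailureAlert:
    def __init__(self, threshold: int = 20):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures == self.threshold:
            self.notify(f"{self.threshold} failed transactions in a row")

    def notify(self, message: str) -> None:
        # In practice this would post to an operational channel.
        print(f"ALERT: {message}")
```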
ALERTING IN DEPLOYMENT PIPELINES
Ensuring our code remains healthy (and continuously deliverable) also requires us to alert in our Deployment Pipelines. A problem introduced into our release branch (or trunk) should be immediately identified, alerted upon, and worked on by the team. It is our Andon Cord to achieve better quality and expedient delivery.
LONG-RUNNING OBSERVABILITY
I advise adding observability to any long-running processes, such as migrations or bulk updates, particularly if they require us to kick users out for the duration. It's uncomfortable to be in a position where you've prevented user access but can't predict when the service will be available again due to insufficient observability.
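A minimal sketch of what that might look like: a migration loop that periodically logs its progress and a rough completion estimate. The migrate_record() step is a hypothetical placeholder for the real per-record work.

```python
# A minimal sketch of observability for a long-running migration: log progress
# and a rough completion estimate so operators aren't blind while users are
# locked out. migrate_record() is a hypothetical placeholder.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("migration")

def migrate_record(record) -> None:
    time.sleep(0.001)  # stand-in for the real per-record work

def run_migration(records: list) -> None:
    start = time.monotonic()
    total = len(records)
    for index, record in enumerate(records, start=1):
        migrate_record(record)
        if index % 1000 == 0 or index == total:
            elapsed = time.monotonic() - start
            remaining = (elapsed / index) * (total - index)
            log.info("migrated %d/%d records, ~%.0fs remaining",
                     index, total, remaining)

if __name__ == "__main__":
    run_migration([{"id": i} for i in range(5000)])
```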
(DISTRIBUTED) LOGGING
System logs hold lots of important information that we can use to diagnose problems, including:
- Who did what, and when did they do it? Logs are used not only for debugging, but also for security.
- The paths taken by transactions through our codebase.
- To identify usage patterns, such as heavy use of a specific transaction.
- Metrics. We might also choose to gather internal metrics.
- The errors we may wish to keep hidden from users because they give away important (sensitive) system information.
The challenges with distributed logging are almost identical to those with distributed monitoring and alerting; i.e. automation, the distribution of software services (independent logging across distributed systems has little value unless we can centralise and aggregate it), and ephemeral instances. Unsurprisingly, the solution is very similar. We configure our distributed software services to push logs out to a centralised service (e.g. see the ELK or EFK stacks), creating a single view of them all, and thus enabling us to search, aggregate, and contextualise the information they contain.
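For illustration, the sketch below emits structured (JSON) log entries to stdout, on the assumption that a log shipper (such as Filebeat or Fluentd in an ELK/EFK stack) forwards them to a centralised store. The field names are illustrative.

```python
# A minimal sketch of structured (JSON) logging to stdout so a log shipper
# can forward the entries to a centralised store for search and aggregation.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders").info("order created")
```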
TRANSACTION CORRELATION
Unlike the Monolith (which typically has Immediate Consistency), a Distributed Architecture typically has Eventual Consistency (Immediate v Eventual Consistency). This is because the constituent (child) transactions of an encompassing business transaction are also distributed. They are committed independently, and may even end up in an entirely different data storage technology. As described above, logs are also distributed.
BUSINESS & TECHNOLOGY TRANSACTIONS
A business transaction works at a more coarse-grained level than the child (technology) transactions it's made up of. For instance, let's say we have a business transaction for Cart Checkout. The constituent (technology) transactions might be:
- Create Order.
- Apply Discount.
- …
- Reduce Stock.
- Fulfil Order.
To generate a single, joined-up view of a business transaction we must find a way to bind all of its child transactions back together. We can do so by sharing a unique traceId per business transaction across our system domain boundaries, and attaching it to any logs, which are then centralised to produce a single, comprehensive view. See Distributed Transaction Correlation.
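A minimal sketch of the idea follows, assuming the traceId travels in an “X-Trace-Id” HTTP header (an illustrative choice; many systems use the W3C Trace Context headers instead).

```python
# A minimal sketch of distributed transaction correlation: reuse an incoming
# traceId (assumed here to arrive in an "X-Trace-Id" header) or generate one,
# attach it to log records, and forward it on downstream calls.
import logging
import uuid

log = logging.getLogger("checkout")

def handle_request(headers: dict) -> dict:
    # Reuse the incoming traceId if one exists, otherwise start a new one.
    trace_id = headers.get("X-Trace-Id") or str(uuid.uuid4())

    # Attach the traceId to log records; a structured formatter can then
    # include it, allowing the centralised logs to be joined back together.
    log.info("processing Cart Checkout", extra={"trace_id": trace_id})

    # Propagate the same traceId on every downstream (child) transaction.
    return {"X-Trace-Id": trace_id}
```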
VALUE STREAM OBSERVABILITY
The Observability of our Value Stream is something else we shouldn't neglect. Everything we do as a business can be traced back to our Value Stream - from inception through to delivery and its use. It shows us (amongst other things) complexity, inefficiencies, bottlenecks, Inventory (queues), and Waste (e.g. “why are we doing task X?”). Obscuring it limits our ability to contextualise it (the what, why, and when), or to measure it, to enable improvements. We know from previous chapters that this impacts Productivity and Releasability.
Using the Agile methodology, work enters a value stream through User Stories (and epics), realised through work-item tickets (e.g. in JIRA). A ticket, therefore, is a way to observe the work, its flow, effort, wait time (queuing), and progress through our system. It allows us to track and measure our activities.
NO TICKET, NO WORK
“But it can't be work, there's no ticket...!” is a now infamous phrase uttered by Agile practitioners around the world. It's a counter to unseen (and sometimes unplanned) work, indicating its peril.
Unseen work comes with some baggage:
- We may be doing ourselves (and our team) a disservice, as our work isn't included (nor applauded) in the tally.
- It can mess up a team's Velocity.
- We may be working on the wrong thing (i.e. it's non-strategic, or ineffective).
- The change may not be governed (creating risk).
- There may not be sufficient control of, or a clear entry point to initiate new work.
- It may be the consequence of Expediting.
These are all valid concerns, but let's also consider it from the perspective of our Value Stream. Without a ticket, how can we understand what is happening, let alone measure it? We can't.
(CI/CD) Deployment Pipelines - typically used to build and deploy services in a Distributed Architecture - offer a treasure trove of information. Amongst other things, they tell us: how long builds take, the quality of the software being produced, test coverage, vulnerabilities, team velocity, build and deployment frequency, lead time, a “time machine” of change, and aspects of DORA Metrics. Thus, they need to be observable.
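As an illustration, the sketch below derives two DORA-style measures (deployment frequency and lead time for changes) from pipeline records. The record fields (committed_at, deployed_at) are assumptions about what your pipeline exposes, not a standard schema.

```python
# A minimal sketch of deriving two DORA-style measures from pipeline records.
from datetime import datetime, timedelta

def deployment_frequency(deploy_times: list[datetime], window_days: int = 7) -> float:
    """Average deployments per day across the most recent window."""
    cutoff = max(deploy_times) - timedelta(days=window_days)
    return len([t for t in deploy_times if t >= cutoff]) / window_days

def average_lead_time(changes: list[dict]) -> timedelta:
    """Mean time from commit to deployment (lead time for changes)."""
    deltas = [c["deployed_at"] - c["committed_at"] for c in changes]
    return sum(deltas, timedelta()) / len(deltas)
```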
ENGINEERING EXPERIENCE OBSERVABILITY
The observability of the overall Engineering Experience is vital to the success of our Value Stream. Please don't forget it.
The purpose of observing all of these various metrics and parts of our Value Stream is firstly, to measure them, and secondly, to leverage them to make improvements.
CUSTOMER & BUSINESS METRICS
Engineering and Value Stream metrics are all well-and-good, but they only articulate the internal state (or strength) of our system. They do little to describe the success (or otherwise) of our ideas.
At a grander scale, business (customer) metrics measure our business' performance (its Key Performance Indicators - KPIs): how we're doing. We care little about system memory, or Value Stream efficiency here; we care about qualities such as customer uptake and engagement, since these help to determine our profit. It's the sort of information we find in MI (Management Information) Dashboards.
TYPICAL KPIS
KPIs differ per business; although they typically involve increasing and retaining custom, sales (and projected sales) revenues, orders, and popular items. That requires (amongst other things) good marketing and observable customer engagement.
The information gathered here helps us to improve Usability and to make better judgements on what constitutes value, enabling us to drive our business' actions and direction, define the product road-map, and stop undertaking ineffective work.
Controlling the user interface allows us to manage how customers interact with our services, thereby making us responsible for the efficiency and effectiveness of those customer interactions. Of course, it's unlikely that we'll get it all right the first time (customer habits and expectations change over time), so we must be capable of evolution.
PREDICTION
The nirvana is being able to predict what each customer will do next, so that we can retain them or upsell more services to them.
Aside from the KPIs, some other useful metrics include: page load times (how much waiting must a customer do?), API response times (again, how much waiting must the customer do?), and drop-out rates (where do customers drop out of the business transaction, and why?). Social media is another useful tool. It won't always be impartial, but it does provide a sense of how we're doing.
PERFORMANCE AS A PREDICTOR
Performance (and wait time) is a decent predictor of how our users (or prospective customers) perceive our system, and therefore our business (reputation).
A number of user research studies have been conducted on the relationship between user wait time and user engagement [1]. For instance, this report suggests that: “The probability of bounce increases 32% as page load time goes from 1 second to 3 seconds. (Google, 2017)” [2]. A “bounce” may mean a lost sale (bad), thereby impacting our KPIs.
PILLARS AFFECTED
TTM
Value Stream Observability makes our Value Stream more transparent, enabling us to measure and improve our overall TTM.
ROI
Gathering (and using) customer and business metrics helps to promote effectiveness. Capturing (and using) Value Stream metrics ensures that we're able to improve our efficiency. Both can lead to lowered costs (we don't invest in pointless activities or unpopular features) and faster turnaround (faster return).
AGILITY
Observability allows us to adapt, for example to adjust our user interface flow to be more intuitive, or to change tack if we find we're not meeting our KPIs.
REPUTATION
System Observability helps to protect Resilience and Availability. It can also partly evidence a solution's Usability, enabling us to improve the user experience. Value Stream Observability makes our Value Stream more transparent, thus enabling us to measure and improve our Productivity (efficiency and effectiveness) and Releasability. Business & Customer Observability is used to gather metrics on (for instance) our KPIs, thereby supporting our business goals. All of these metrics help to protect our Reputation.
SUMMARY
Observability is another underrated quality. Admittedly, it's not as cool as building and delivering software, but our work isn't done (Definition of Done) unless we can measure our success. It's also an important insurance policy. If we can measure it, then we can build up defences (like an Andon Cord) to prevent further exacerbating a problem.
Observability is typically found in the following areas:
- Systems - such as the health of our software applications and infrastructure.
- Value Stream - the health of our release processes.
- Business - our ability to measure progress towards our business goals and aspirations.
Finally, we may also need to tailor our observability to Releasability mechanisms - like Canary Releases - to measure (and compare) both the new path or feature and the current ones.
FURTHER CONSIDERATIONS
- [1] - https://blog.hubspot.com/marketing/page-load-time-conversion-rates
- [2] - https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/page-load-time-statistics/
- Agile
- Andon Cord
- Canary Releases
- The Cloud
- Definition of Done
- Deployment Pipelines
- DORA Metrics
- Engineering Experience
- KPIs
- Value Streams
- What is Software?