PRINCIPLES & PRACTICES

<< Previous | Table of Contents | Next >>


Work-In-Progress...

SECTION CONTENTS

WHAT IS SOFTWARE?

Let’s begin with a more fundamental, philosophical question. What is software?

My interpretation? Software is but a specific realisation of an idea.

In his Theory of Forms, Plato concluded: ‘... "Ideas" or "Forms", are the non-physical essences of all things, of which objects and matter in the physical world are merely imitations.’ [1]

So, an idea may have many different forms (or interpretations), of which a specific software implementation is just one. There can be many other interpretations, such as different software interpretations, a more physical manifestation (e.g. hardware), or a business interpretation (one that doesn’t necessarily involve technology). Consider digital transformations for example. They typically replace one interpretation (existing business workflows) with another (a digital, software, interpretation), but it’s often the same idea realised differently.

The fixation on interpretation over idea is problematic as it creates an interpretation attachment, causing us to:

So what’s my point? We should show caution in how we view value. Value isn’t solely about a specific software interpretation, it’s about the idea and its practical interpretation. Of course, I’m not proposing that we neglect the interpretation, only that we adjust our views to better incorporate the idea, promoting it to first-class citizen status.

THE RISE OF THE FINTECH

The last decade has given rise to many new Fintech (financial technology) businesses. The news is full of their success stories, as they start to take market share from (and outmanoeuvre) the established players. Why is this?

They’re able to interpret the same idea differently, typically in a way that is more appealing to customers than what’s already on the market. Because there’s nothing already in place at the point of inception (either in terms of a solution, or in terms of the pollution of a specific realisation), they’re able to work top-down (from idea to interpretation to implementation), building out their realisation based upon modern thinking (social, technological, delivery practices, devices, data-driven decision-making, machine learning, consumer-driven design, integrability, accessibility), whilst many established businesses continue to use a bottom-up (realisation-focused) approach, munging new concepts into their existing estate.

All of this creates a serious problem for established businesses where innovation and progress are constrained by their existing realisation, and Loss Aversion is at play. Whilst they could certainly attempt to retrofit modern practices into aging solutions, it doesn’t resolve the key issue - the realisation is (now) substandard, and requires a sea change to re-realise the idea (from the top down) using modern thinking.

INFLUENCES

Market pressures, innovations and inventions, modern practices and thinking, trends, better collaboration, knowledge, and experience all affect how an idea is interpreted, and can cause us to reinterpret it in a new, distinct, or disruptive way.

Customers pay for the idea and the interpretation, but in the end it's the idea they want. They’re tied to the idea, not necessarily to the interpretation. If they find an alternative (better) interpretation elsewhere, then they’ll take their custom there. Alternatively, you may alter (evolve) your own interpretation, or change it entirely, and - so long as it still aligns with the customer’s interpretation - retain their custom.

FURTHER CONSIDERATIONS

THE ATOMIC DEPLOYMENT STRATEGY

I use the term “atomic deployment strategy” to describe a deployment approach commonly associated with the Monolith, where all components must be deployed together, regardless of need. It is an all-or-nothing deployment strategy.

The standard Point-to-Point (P2P) architecture typical of a Monolith suggests a tight coupling (Coupling) between all software components/systems in its ecosystem. This has several knock-on effects, including (a) making the extrication of specific services/domains from their dependencies difficult, and, as a side-effect, (b) forcing deployments to include all components, even when only one component is required (or has changed).
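To make the knock-on effect concrete, here’s a minimal sketch (in Python, using a hypothetical component graph; the component names are illustrative only) that computes which components must be redeployed when one changes. In a tightly point-to-point coupled ecosystem, the affected set ends up being everything, which is precisely the all-or-nothing deployment the atomic strategy forces.

```python
# A hypothetical, tightly point-to-point coupled component graph:
# each component calls the others directly, so a change anywhere
# ripples everywhere. Names are illustrative only.
DEPENDENTS = {
    "billing":   {"orders", "reporting"},
    "orders":    {"billing", "shipping"},
    "shipping":  {"orders", "reporting"},
    "reporting": {"billing"},
}

def redeploy_set(changed: str) -> set[str]:
    """Return every component that must be redeployed when `changed` changes."""
    affected, frontier = {changed}, [changed]
    while frontier:
        for dependent in DEPENDENTS.get(frontier.pop(), ()):  # who calls this component?
            if dependent not in affected:
                affected.add(dependent)
                frontier.append(dependent)
    return affected

# A one-line bug fix in "billing" still drags the whole estate out of the door.
print(sorted(redeploy_set("billing")))  # ['billing', 'orders', 'reporting', 'shipping']
```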

This approach suffers from the following challenges:

FURTHER CONSIDERATIONS

LENGTHY RELEASE CYCLES

Lengthy (e.g. quarterly or monthly) release cycles can be tremendously damaging to a business, its employees, and its customers.

Lengthy Release Cycles often promote excruciating, siloed steps, involving many days of waiting, regular rework, unnecessarily long and complex deployments, and the forced acceptance (testing) of a large number of changes in a relatively short time frame. They also mean that anything towards the end of the release cycle is particularly at risk of being expedited, for the sake of a promised release date.

CONWAY’S LAW AND RELEASES

Conway's Law states that: "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations." [wikipedia]

So, if a system architecture mimics how the organisation communicates, then why wouldn’t its delivery flow also mimic it? In a siloed organisation, where inter-department communication is limited and any decision requires lots of lead time, we’ll also find similar issues with the release cycle (siloed, with lots of waits).

Lengthy Release Cycles can cause the following issues:

Let’s view them.

SILOING OF TEAMS

The longer the flow, the easier it is to fit unnecessary steps into it, creating little pockets of resistance to fast change.

Siloing can cause the following issues:

Most of the following points are caused by Siloing.

COMMUNICATION ISSUES

Silos are the natural enemy of good communication. Work that is undertaken in a silo has (to my mind) a higher propensity to be flawed, simply because of numbers. As a generalisation, the more brains on a problem, and the sooner they’re on it, the higher the quality of the solution.

Consider the natural tendency for siloed teams to work on software in isolation, typically neglecting to involve any Operations or Security stakeholder input (that would slow things down, wouldn’t it?). I’ve often seen cases where the Operations team either won’t deploy the feature, or are pressured into deploying it (against their will and better judgement), causing the accrual of Technical Debt.

Often, it’s because the developer has assumed something (Assumptions) that competes with an operational expectation (e.g. until recently, that’s been stability, although now it might be nearer Resilience), and hasn't sufficiently communicated with them due to the organisation’s siloed nature.

This lack of good communication, alongside Lengthy Release Cycles, causes misunderstandings, such as confusion over which features are deployed in which environments (see Twelve Factor Apps - Minimise Divergence). E.g. the developer thinks that a bug fix is in Release X on environment Z, whilst in fact, it never got that far.

FEATURE HUNTS

I’ve listened to lengthy conversations around “Feature Hunts”, until someone finally provides evidence (typically by testing it) of which feature is where. They’re mainly due to Lengthy Release Cycles, and are a big time waster for everyone involved.
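One lightweight counter - a sketch only, using Python’s standard library, where the /version path and the fields shown are illustrative assumptions rather than any particular product’s API - is for every deployable to expose its own build metadata, so “what’s deployed where” becomes a request rather than a hunt.

```python
# Minimal sketch: expose build metadata from each deployed service so
# "what's deployed where" is answerable without a Feature Hunt.
# The /version path and the fields shown are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

BUILD_INFO = {            # typically injected at build time, not hard-coded
    "version": "1.4.2",
    "commit": "abc1234",
    "built_at": "2024-01-01T00:00:00Z",
}

class VersionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/version":
            body = json.dumps(BUILD_INFO).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("", 8080), VersionHandler).serve_forever()
```

A quick request against each environment then settles the “is the fix in Release X on environment Z?” question in seconds, rather than via a round of exploratory testing.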

Poor communications, caused by the siloed mindset, are one reason for the increasing popularity of Cross-Functional Teams.

LACK OF OWNERSHIP

“I can’t do that, only [Jeff/Lisa/insert name here...] knows how to do that. We’ll have to wait for them.” Ring any bells?

That response is pretty common in siloed organisations suffering from Lengthy Release Cycles. It stems from a lack of group ownership and cross-pollination of skills and knowledge that you’d typically get in a Cross-Functional Team.

Why? I think it’s a mix of skills availability and context. For instance, just because someone in Operations has the skills to solve a particular problem, doesn’t mean they have the context to make the right change.

As a technologist, I admit to finding context much harder to understand than technology. Whilst I can pick up most technologies in a relatively short time, applying those skills and techniques in the right domain context is far more challenging (and one reason why I’d argue that consultants should rarely have the final say in a solution, unless there is parity between parties). If you don’t have context, you can’t own it, and if you can’t own it, you can’t help. Alas, we must await Jeff.

Pair/Mob Programming and Cross-Functional Teams are some counters to this problem.

PING-PONG RELEASES

The scale of change embedded within each release often results in one (or more) major/critical bugs being identified, often late in the cycle (e.g. during system, acceptance, or exploratory testing). These bugs must be fixed, causing lengthy “ping-pong” releases between development, build and release, deployment, and several forms of (formal) testing, resulting in lots of unnecessary waiting and rework.

TRUST

Technology books often focus on mechanisms over people, but sometimes the solution needs something more people-focused. One of the cultural concerns with Lengthy Release Cycles is trust.

Lengthy Release Cycles promote silos, and individuals who are siloed may feel isolated, and neglect opportunities to interact with people within other silos. This is more than a shame, it’s a travesty. People who don’t know one another may have no means to build relationships, and thus trust. And without trust, we face an uphill battle to deliver meaningful change in a short time.

TRUST & CAMARADERIE

The camaraderie that stems from a close-knit, integrated team of like-minded individuals shouldn’t be overlooked. It’s an extremely powerful tool for building highly performant, collaborative, antifragile teams.

Much of a team’s strength comes through the accrual of trust. E.g. Bill trusts Dave and Lisa as they’ve worked well together for years. They understand one another’s strengths (and weaknesses), and often complement them to build a stronger foundation. Cross-Functional Teams also build a shared learning platform that reduces single-points-of-failure, thus building trust. Consider the mutual trust and respect that the astronauts built up on the lunar landings, including how each had to learn to perform the others’ jobs (a Redundancy Check).

Yet siloed teams don’t get the same opportunities. Whilst people may build trust within the silo, they can’t (necessarily) trust what comes before or after their own little island, creating an Us versus Them mentality.

Colleagues who don’t/can’t truly trust one another may live in a world of stop-start communications (e.g. “I don’t trust them, so I’m going to ensure I know exactly what they’ve done before I let it pass”; a form of micromanagement), and where duplication of work is predominant (e.g. “so what if they’ve tested it, I don’t trust they did it right, so I’ll do it again myself”). Mistrust hampers our willingness/ability to Shift-Left, and thus slows Productivity.

PRODUCTIVITY

Lengthy Release Cycles exist (in large part) due to difficult environment provisioning. Even in cases where provisioning is quick, we often still find the environments aren’t a carbon copy of the production environment. This has efficiency ramifications.

For instance, without an efficient provisioning mechanism, we invite systemic problems to creep into the work of development and testing staff, simply because they lack any means of fast feedback; thus, we limit people’s confidence in their own work because they can’t guarantee it will function as intended (i.e. no Safety Net).

Creating the right mechanisms to successfully and efficiently provision environments quickly takes time and money (which is why some organisations never get around to doing it; they’re always “too busy building functionality”). So, some organisations attempt to remedy this by providing a shared environment. But this approach is also flawed. For example, who owns the environment, keeps it functioning, and ensures its accuracy? If no one owns it, no one will support it. The result is stale software/data that doesn’t reflect the production setup. Again, this is hampered by Lengthy Release Cycles and poor provisioning.

RELEASE MANAGEMENT TROUBLES

I’ve seen some interesting approaches to release management through the years (including verifying readiness).

At one organisation, I witnessed a great deal of ping-pong releases, where the entire software stack was deployed, configured and tested, only to find that some feature or configuration had been missed. This forced the cleansing and reprovisioning of the environment, a rebuild of the code, the redeployment and reconfiguration of the application (which took a couple of hours each time), the recording of all changes (it was hoped) in an “instruction” manual, and then a full retest. This vicious cycle might occur multiple times.

Once the testers gave the thumbs up, the environment was dropped, all the steps in the newly formed “instruction” manual re-executed, then all the testing repeated to prove the instructions were indeed correct! What a massive waste of time and money.

LENGTHY RELEASE CYCLES & CHERRY PICKING

Lengthy Release Cycles may also promote poor practices such as Cherry Picking (a form of Expediting). Cherry Picking is (typically) where someone with little appreciation for the technical difficulties identifies the features/bug fixes they think the business needs, and defers the remainder to a later release (we hope). Cherry Picking raises the following concerns:
  • The picker(s) may regularly favour functionality over non-functional needs (i.e. Functional Myopicism).
  • It promotes messy branching and merging activities, where the engineer must identify what’s been picked, discard the remainder, then merge what’s left back for release. This is painful, error-prone, and pretty thankless.

REPEATABILITY

Lengthy Release Cycles make most forms of change difficult (Change Friction). Thus, when change is required (which happens often in dynamic organisations), it incentivises the circumvention of established processes (i.e. Expediting) in an effort to right a wrong; e.g. the effort to manually hack a change into a production environment seems quicker than following the established process (Short-Term Glory v Long-Term Success).

Whilst this approach seems quicker, it actually increases risk in the following areas:

IMPACT ON QUALITY

Lengthy Release Cycles can cause quality issues, simply because of timing. Lengthy Release Cycles tend to increase wait time, hamper collaboration, and don’t lend themselves well to fast feedback (or failing fast). Consequently, there is less time to rectify issues; i.e. less time to increase quality without incurring cost and delivery time penalties. The question then becomes more contentious: do we release now, because we’ve promised the market something, knowing there are quality issues (which might affect customers), or do we hold off for another month/quarter to resolve them, potentially suffering reputational damage and letting down our customers in other ways?

Performance (e.g. load) testing is a good example of where Lengthy Release Cycles can impact quality. Performance tests (e.g. load, soak) are typically undertaken towards the end of a release cycle, once everything is built, combined, coordinated, and deployed to a production-like environment. This takes time (and is a lot more involved than you might imagine), and such an environment may (see previous points) be difficult to provision (i.e. time and money).
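As a rough illustration of how little code a first-pass load check needs - a sketch only, using Python’s standard library against a hypothetical local endpoint; real load/soak tests need far more care around ramp-up, duration, data, and environment parity - consider the following:

```python
# Rough sketch of a first-pass load check: fire N concurrent requests at a
# hypothetical endpoint and report latency percentiles. Not a substitute for
# proper load/soak testing in a production-like environment.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://localhost:8080/health"   # illustrative target
REQUESTS, CONCURRENCY = 200, 20

def timed_request(_):
    start = time.perf_counter()
    with urlopen(URL, timeout=5) as response:
        response.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, range(REQUESTS)))

print(f"median: {statistics.median(latencies) * 1000:.1f} ms")
print(f"p95:    {latencies[int(len(latencies) * 0.95)] * 1000:.1f} ms")
print(f"max:    {latencies[-1] * 1000:.1f} ms")
```

The point isn’t that this replaces proper performance testing; it’s that even a crude check like this, run early and often, removes the excuse of leaving performance until the very end of the cycle.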

Tasks (such as performance tests) that are not undertaken until towards the end of the cycle (or are neglected entirely, or only performed half-heartedly) make reacting to any identified problems difficult. In the case of performance testing, you may be forced to release software with known scale/performance concerns, leading to poor quality and incurring Technical Debt. And Lengthy Release Cycles likely mean you’ll often find yourself down this same path, where performance testing is never (or rarely) done due to time pressures. Unfortunately, we’re only pushing the problem further down the line, where it will be worse.

Quality is again at risk. And bear in mind that quality can be subjective (Quality is Subjective).

FURTHER CONSIDERATIONS

TOOLS & TECHNOLOGIES DO NOT REPLACE THINKING

We live (and work) in a highly complex, interconnected world. Complex interrelations abound, and every decision we make has both positive results, and negative consequences.

Yet, humans love simplicity, having a natural tendency to focus on positive outcomes, whilst neglecting negative consequences. This oversimplification promotes the chimeric notion that every problem has a silver-bullet solution.

My point? Tools, frameworks, design patterns, technologies, and even books (like this one), are all tools to support thinking; they are not replacements for it.

REAL-WORLD EXAMPLE - THE IMPORTANCE OF THINKING

To support my point around thinking, consider the doctor/patient relationship, and specifically the procedures involved in the diagnosis and treatment of a patient.

Below is a quotation from a medical article. I’ve bolded the words of particular relevance.

“The diagnostic process is a complex transition process that begins with the patient's individual illness history and culminates in a result that can be categorized. A patient consulting the doctor about his symptoms starts an intricate process that may label him, classify his illness, indicate certain specific treatments in preference to others and put him in a prognostic category. The outcome of the process is regarded as important for effective treatment, by both patient and doctor.” [1]

Let’s pause a moment to consider it.

As a precondition to the medical assessment, the doctor familiarises themselves with the patient’s medical history. The assessment takes the form of a familiar protocol, where the doctor asks the patient about their ailment, listens to the patient’s description/assessment, then begins a dialogue, probing the patient (at appropriate times) to gather more detailed, or accurate, information about the symptoms (i.e. to identify a root cause). Understanding the history (i.e. contextual information) is important here (there’s often an association between past problems and the current ailment), as it may lead to a revelation that enables the doctor to classify the illness.

Note that at this stage, whilst the doctor has (hopefully) classified the illness, there’s not a single mention of a treatment plan. Only once the doctor has undertaken sufficient due diligence, and has a keen understanding of the cause (including potentially testing that hypothesis), will they diagnose and offer a treatment plan. The doctor now begins the next stage: to “indicate certain specific treatments in preference to others”.

At the treatment stage the doctor uses their knowledge, expertise, and judgement, to formulate the best treatment plan based upon their given constraints. Let’s pause a moment to reflect on this.

Now consider the following. Why do doctors consider your age, fitness level, cholesterol, known allergies, and genetic familial issues, in addition to accounting for your medical history, prior to identifying a treatment plan? Because they must work within the constraints of your Complex System; i.e. how your system reacts to a treatment plan may differ from how my system reacts.

However, doctors must also work within the realms of another Complex System; they may be constrained (or influenced) by external influences like time, treatment cost (like budgetary costs; e.g. expensive treatment plans may be disfavoured, even when they’re known to offer more promising results), or (in some cases) unorthodox new treatments (such as in the treatment of a potentially terminal disease; albeit this is a bit of a caveat emptor). We’re talking about the intersection of two complex systems: one for the patient, and one for external parties.

Let me reiterate my last point. No treatment plan is recommended until after the doctor has considered a number of key factors and constraints, and made a balanced decision based on all available evidence.

Returning to the technology domain, I’m sorry to say, I often see a very different approach to how technology treatment plans are undertaken. Whilst I’m not saying it’s true of everyone, my experience suggests that many technologists spend far too little time diagnosing the problem, and often have already formulated a treatment plan before understanding the context to apply it to, or whether it will really work (and this may be subjective too!). This is a form of Bias. To rephrase, I rarely see technologists consider all of the constraints (they often consider only the first-level ramifications, remaining ignorant of second- and third-level consequences), or formulate a treatment plan based on an accurate diagnosis of the problem.

The reason for this behaviour is harder to qualify. Is the cost of medical failure much greater (maybe there’s little opportunity to rectify a medical failing - the patient is already affected), or more obvious, than the failure of a technological decision, meaning less time is spent diagnosing technology? Possibly, but I suppose that depends on the context. Maybe a medical faux pas is recognised sooner (e.g. the impact on the patient is more immediate)? Are there more rigorous validations in the practice of medicine than in technology? I’m less sure about that one. Do we know more about the human body than technology? No, I’d label both as Complex Systems...

I suspect there’s a more obvious answer, known as the Affect Heuristic.

“The affect heuristic is a heuristic, a mental shortcut that allows people to make decisions and solve problems quickly and efficiently, in which current emotion—fear, pleasure, surprise, etc.—influences decisions. In other words, it is a type of heuristic in which emotional response, or "affect" in psychological terms, plays a lead role.” [2]

“Finucane, Alhakami, Slovic and Johnson theorized in 2000 that a good feeling towards a situation (i.e., positive affect) would lead to a lower risk perception and a higher benefit perception, even when this is logically not warranted for that situation. This implies that a strong emotional response to a word or other stimulus might alter a person's judgment. He or she might make different decisions based on the same set of facts and might thus make an illogical decision. Overall, the affect heuristic is of influence in nearly every decision-making arena.” [3]

We’re often driven by our emotions towards a specific technology or methodology, which may play a lead role in our decision making. I've seen this time-and-again; from an unhealthily averse response to certain vendors, regardless of the quality of their offering, to an inappropriately positive outlook towards specific cloud technologies incompatible with the business’ aspirations, or timelines. It sometimes leads to technologists attempting to fit a business around a technology constraint, rather than the converse (Tail Wagging the Dog).

BALANCED DECISION-MAKING OVER TOOLS & TECHNOLOGY

A good deal of this book attempts to demonstrate the complex interrelations within our industry. My advice is simple. Rather than attempting to solve a problem through the introduction of a new tool or technology, I (at least initially) emphasise the importance of better understanding these complex interrelations, helping you to make more informed and balanced decisions on technology/tooling choices.

BRING OUT YOUR DEAD

The technology landscape is littered with the broken remnants of once lauded tools and technologies that nowadays cause raised eyebrows and sheepish grins.

Some enterprises now face major challenges, through the heavy investment in a (now irrelevant) technology. The problem is twofold:

  1. Transitioning to a new technology is technically challenging.
  2. The existing technologies and practices are so ingrained in the organisation’s culture (through Inculcation Bias) that there’s little opportunity to reverse the decision.

Whilst hindsight is a wonderful thing, I believe that with the right mindset and knowledge, it’s possible to foresee many (but not all) obstacles on the road in enough time to avoid them, or pivot, simply by applying more balance to the decision making process.

There’s no such thing as a free lunch. “Good” outcomes and “Bad” consequences are so entwined that it’s impossible to separate one from the other, and extricate only the good. Consider the following cases:

MARKETING MAGPIES

Beware of Marketing Magpies; individuals who wax lyrical on new tools and technologies, based mainly around the (potentially biased) opinions of others. They may be missing the balanced judgement necessary for strategic decisions.

KNOWN QUANTITY V NEW TECH

Never hold onto something simply because it’s a Known Quantity, and never modernise just because it’s new and shiny. Change takes time and patience, and should always be based upon a business need or motivation.

Will we be demonising Microservices and Serverless in decades to come? Possibly. So let’s finish on a more uplifting note.

The best tool at your disposal is not some new tool, technology, platform, or methodology; it’s a diverse team with complementary skills and experience, with a precise understanding of the problem to solve, a good knowledge of foundations and principles, coupled with a deep understanding of the complex interrelations that exist between business and technology, and sound, balanced decision-making which remains unbiased (and sometimes undeterred) by the constant noise and buzz that encompasses our industry.

FURTHER CONSIDERATIONS

FURTHER READING

LOSS AVERSION

“In preparing for battle I have always found that plans are useless, but planning is indispensable.” - Dwight D. Eisenhower

Coupling should be, but often isn’t, considered alongside Loss Aversion (i.e. how averse your business is to the loss of a service or feature). Owners of systems with a tight coupling to integral services or features may suffer great financial hardship if those services become unavailable (whether temporarily, or something more final, such as partner bankruptcy). Astute businesses identify the key services or features they are averse to losing, and either plan for that failure, or deploy countermeasures (by building slack into the system).

Netflix provide the archetype from a systems perspective, practising several key aspects of fault tolerance (in their Microservices architecture) to counter Loss Aversion:

Consider the following scenario around Loss Aversion.

Let’s say I’m starting a new business venture to provide baking advice and recipes. To market my business, I need the following things:

The coupling might take this form.

Domain Name Coupling

Now whilst this represents a very simple case of Coupling, there are already a few potential failures here. In this case, I’ll focus on the domain name.

Let’s say I’m remiss, and fail to renew my domain name on its renewal date. Several scenarios can play out:

Let’s say option 3 occurs. I’m remiss, and lose my domain name to a competitor (it’s a popular domain name!). That competitor links their own website to the blithebaking.com domain, which either confuses my existing customers, or routes them all to my competitor. I’ve drawn up a table of potential outcomes below.

Scenario | Domain Name Cost (monthly)* | Business Card Cost (one-off) | Website Cost | Number of Customers (avg. sale $50) | Overall Potential Cost | (My) Level of Aversion
1 | $5 | $50 (100 cards) | $0 (built it myself) | ~10 | ~$550 ($50 + $500) | Low
2 | $5 | $50 (100 cards) | $0 (built it myself) | ~500 | ~$25K ($50 + $25,000) | Medium to High
3 | $5 | $10K (100,000 cards) | $0 (built it myself) | ~10 | ~$10K ($10,000 + $500) | Medium to High
4 | $5 | $10K (100,000 cards) | $0 (built it myself) | ~1,000 | ~$60K ($10,000 + $50,000) | High
5 | $5 | $10K (100,000 cards) | $2K per change (third-party managed) | ~1,000 | ~$62K ($10,000 + $2,000 + $50,000) | High
* The domain name costs aren’t included in the overall potential costs; they highlight the disparity in how a tiny outgoing may relate to the Aversion costs it affects.
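The arithmetic behind the table is trivial, which is rather the point; a few lines (a sketch, using the table’s own illustrative figures, not real data) reproduce the exposure for each scenario:

```python
# Reproduce the "Overall Potential Cost" column from the table above.
# Figures are the illustrative ones used in the scenarios, not real data.
AVG_SALE = 50

def potential_cost(card_cost, customers, website_cost_per_change=0):
    """Sunk card spend + lost custom + any third-party website change fees."""
    return card_cost + customers * AVG_SALE + website_cost_per_change

scenarios = {
    1: potential_cost(card_cost=50, customers=10),              # ~$550
    2: potential_cost(card_cost=50, customers=500),             # ~$25,050
    3: potential_cost(card_cost=10_000, customers=10),          # ~$10,500
    4: potential_cost(card_cost=10_000, customers=1_000),       # ~$60,000
    5: potential_cost(card_cost=10_000, customers=1_000,
                      website_cost_per_change=2_000),           # ~$62,000
}
for scenario, cost in scenarios.items():
    print(f"Scenario {scenario}: ~${cost:,}")
```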

You can see how quickly the combination of aversions can wreak havoc. The key concepts are:

Scenario 1 is the best in terms of low coupling/dependence. I’ve spent very little on business cards, and I have limited custom at this stage. I might grumble, but I can live with this outcome. Scenario 2 has cost me dearly due to the significant number of customers I had. Scenario 3 is an interesting one. Whilst I have limited custom, my somewhat unorthodox approach of purchasing 100,000 business cards as an up-front investment (a form of stock), prior to validating my business model, has done me a disservice, inflicting a form of self-imposed Entropy.

Scenarios 4 & 5 show the worst cases, where my Loss Aversion is at its highest. In Scenario 5, for the sake of only $5 a month, I’ve inflicted immediate costs in the region of $62,000. But there are also unseen, insidious costs to consider here; e.g. note that I haven’t considered the longer-term branding implications. Has it actually cost me hundreds of thousands of dollars?

How the number of customers affects our Loss Aversion falls into the scale category. Large-scale failings concern me more, because of the large disparity between (say) 10 customers, and 1000 customers.

LOSS AVERSION MAY BE SUBJECTIVE

One individual’s perception of Loss Aversion may differ from another’s, introducing an additional degree of complexity. For instance, whilst I might consider a $15K loss unacceptable, another - with stronger recovery capabilities - may not.

TIME CRITICALITY & LOSS AVERSION

Time criticality (the permanence of the failure) is another consideration.

Temporal failure may be sufficiently disruptive to put your business at risk, and you might want to consider:
  • What is an unacceptable duration of disruption to your business? Depending upon the situation, it might be seconds, or even days.
  • Timing. When did the failure occur? E.g. if my sales website fails at 3am one cold Sunday winter morning, I’m probably less concerned than if it fails at 6pm on Black Friday. Alternatively, if a large proportion of my revenue comes through events (e.g. a sports event), then the timing of a failure is crucial.

What’s interesting about my earlier example is that the outcome was - given enough time and foresight - utterly controllable. Most of the problems I found myself in were due to my inability to react, which I’d inflicted upon myself.

Whilst there are no golden rules around Loss Aversion, that doesn’t mean we shouldn’t plan, especially when each scenario is unique, complex, and responses may be subjective. The outcome of any decisions made here should feed into discussions around Coupling.

FURTHER CONSIDERATIONS

LOWEST COMMON DENOMINATOR

Lowest Common Denominator - if correctly used - can be a powerful tool. In a sense, it promotes Uniformity, which is a powerful Productivity enabler.

Consider the following example. For many years (and even today), one of the biggest problems facing integration protocols was the lack of widespread support for a single approach across the major vendors. Achieving a significant quorum was difficult as each vendor was either already heavily invested in an existing protocol, or was promoting their own. Standards existed, but there were many pockets of resistance.

For instance, for many years Microsoft supported COM and DCOM, whilst Sun/Oracle promoted RMI (and its CORBA-interoperable variant, RMI-IIOP) for much of the “enterprise edition” integrations. Whilst both protocols are highly regarded, they often influenced the direction of the implementation technology; e.g. there was a self-reinforcing cycle, where opting to implement one solution in Java promoted the RMI protocol, which in turn influenced all further solution implementation choices towards Java (regardless of whether it was the right tool for the job).

As new technologies, such as XML (and later JSON) emerged, we began to see a nascent form of (implicit) standardisation (through uptake, rather than necessarily vendor-driven). Web Services settled upon string-based data transfer structures that were highly flexible, hierarchically structured, human readable, and could easily represent most business concepts, all over HTTP. It allowed the implementation technology to be decoupled from the contract (or API interface), enabling us to separate how we communicate with software services, and how behaviours/rules are implemented within it (Separation of Concerns). As long as you could communicate in string-form over HTTP, you could integrate; i.e. it became the Lowest Common Denominator.
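As a concrete (if simplified) illustration, here is a sketch of that Lowest Common Denominator in action: a client that knows nothing about the service’s implementation language, only that it accepts and returns JSON over HTTP. The endpoint and payload shape are hypothetical.

```python
# Sketch of the Lowest Common Denominator in practice: JSON over HTTP.
# The client neither knows nor cares whether the service is Java, .NET, Go...
# The URL and payload shape are hypothetical.
import json
from urllib.request import Request, urlopen

order = {"customerId": "C-42", "items": [{"sku": "FLOUR-01", "quantity": 2}]}

request = Request(
    "http://localhost:8080/orders",
    data=json.dumps(order).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    method="POST",
)

with urlopen(request, timeout=5) as response:
    confirmation = json.loads(response.read())   # just text over HTTP, parsed back into a structure

print(confirmation)
```

The contract lives entirely in the text exchanged over the wire, which is precisely the Separation of Concerns described above: how we communicate with a service is decoupled from how that service is built.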

We’re now seeing this Lowest Common Denominator used across highly-distributed polyglot systems (e.g. Microservices) as the default communication mechanism.

MICROSERVICES & LOWEST COMMON DENOMINATOR

One of the key benefits of Microservices is the ability to support highly distributed, heterogeneous systems. This is achieved, in large part, through the use of the Lowest Common Denominator principle to communicate between software services.

LOWERING THE GAP

Interoperability is a key design feature of the .NET platform. It lowered the gap by providing a common runtime that allows managed code (running under the CLR) and unmanaged code (written as C++ components, COM, or ActiveX) to communicate. The benefits include:
  • Greater choice of implementation technology, inculcating a “Best Tool for the Job” mindset.
  • The ability to source a greater pool of talent - if you can’t get C# talent, you might source some VB.NET.
  • Extended sharing and reuse capabilities; e.g. reuse an existing investment, such as using a VB.NET library in a C# application.

Note that whilst Lower Representational Gap draws upon the many benefits of Uniformity, one drawback may be innovation; i.e. new ideas often come from unique sources; less readily from sources that share many common attributes within an established order.

FURTHER CONSIDERATIONS

TABLE STAKES

In gambling parlance, to play a hand at the table, you must first meet the minimal requirement; i.e. you must match/exceed the table stake.

In business, Table Stakes are the features, pricing, or capabilities a customer expects of every product in that class; i.e. they are the Lowest Common Denominator. In many cases, Table Stakes features are the core, generally uninteresting aspects of a product, so integral that they’re rarely discussed in detail during a sales negotiation (you shouldn’t be at the negotiation table without them).

Whilst Table Stakes normally relate directly to the product, they need not. For instance, a customer may demand regular distribution through a technique such as Continuous Delivery, or a cooperative and inclusive culture more akin to a partnership.

INTERNAL & EXTERNAL FLOW

Good Flow is an important characteristic of any successful production line, and thus your ability to deliver regularly, accurately, and efficiently. Yet it seems that many businesses fall foul of what I term Internal Flow Myopicism; i.e. they only consider their own internal flow when assessing their delivery pipeline - and this may not represent the whole picture.

The figure below shows an example of flow within a (software) delivery pipeline. Let’s assume in this case that it’s a software supplier providing a platform to customers to build products upon. In this case, the assembly line has only five sequential stages (S1 to S5).

Flow

The “constraint” (i.e. the slowest process in the flow, or bottleneck) is, in this case, stage 3 (tagged with an egg-timer symbol; it’s also the smallest) in our five-stage process. No matter how fast the rest of the system is, throughput is dictated by this constraint. Inventory sits on the Buffer (see Drum-Buffer-Rope), waiting to feed the constraint.

What isn’t always immediately obvious from the above diagram (and something that is easy to overlook) is that the entire flow (from inception until real use by an end user) is typically far more expansive than just the internal flow. For instance, let’s say a supplier provides you a service (such as a software platform), which their customers (i.e. you) build upon to create their own product, which they - in turn - sell to their own customers. The figure below shows who’s involved in that chain.

Customer - User

Technically, from a supplier’s perspective, the supplier’s customers are also intimately linked to the flow and should not (at least from the customer’s perspective) be considered in isolation; yet they were never represented on the original (supplier) diagram. If Value is indeed associated with both what YOU provide AND what your SUPPLIERS provide, then this is an important point.

If we were to consider Drum-Buffer-Rope across both the supplier and the external customers, we would likely find that the drum beats to the much slower rhythm of a specific customer (the slowest part in the chain); not at the velocity of the supplier, nor of the fastest customer, nor even of the second-slowest customer. Let’s see that now.

Entire Flow

The Supplier pushes its wares out to three customers: A, B, and C. Customer A moves quickly and can easily integrate those supplier changes whenever they arrive. However, Customers B and C move slower (B in this case being the slowest) and can't integrate those supplier changes so quickly. This, therefore, is the theoretical constraint (I say theoretical as it doesn’t happen like this in practice).

Of course, this picture is somewhat skewed by reality. None of the parties necessarily knows one another, or their respective velocities. And neither - in most cases - is the supplier aware of them. Customers are only cognisant of the supplier’s velocity, and their own, nothing more.

But humour me for a bit longer. It’s all academic after all. If - as I have inferred - we find that the constraint sits with a specific customer (B in our case), yet the drum actually beats at the supplier’s speed, then we find that all of the inventory (the Buffer in Drum-Buffer-Rope) builds up ahead of Customer B (much like in the tale of The Magic Porridge Pot [1], where the porridge keeps flowing until the whole town is filled with it), and to a lesser extent, in front of the other slow customers (C).

You might question the fairness of this situation; why can’t the supplier move at my speed? So allow me to present you with an analogy. Imagine yourself in ancient Greece; specifically Athens, during the time of Socrates (circa 470-399 bc, if you’re really interested). Before you stands the great man, surrounded by his avid students and followers, all deeply engaged in one of his famous discourses. Let’s also assume you understand ancient Greek. The dialogue moves at a furious pace, back-and-forth between teacher and students, and you quickly find yourself unable to follow the main thrust of the argument.

Rising from your seat, you interrupt Socrates mid-flow, explain your lack of clarity, and suggest the group adjust their verbal discourse to a tempo more suited to your mental faculties. Would the great man, or indeed his followers, appreciate your (regular) interruptions, and be willing to sacrifice everyone’s learning and enjoyment (it would probably frustrate a few), to slowly recount every minutia, solely for your comprehension? Or might you be shown the proverbial door? My money is on the second option. So why should a software vendor (Socrates in this analogy) behave differently?

Of course, there is an alternative tactic, which fits well into our analogy. Rather than interrupting the flow, and facing indignation and alienation, you try to hide your ignorance; returning each day to hear the great man speak, reclining in your seat, nodding in appreciation at appropriate intervals, but not once comprehending the argument. What’s occurring here is your own form of personal (learning) debt accrual; at some critical juncture you’ll suffer a personal catastrophe (one day Socrates turns to you and asks you to argue your point of view - of which you have none readily available), and may even be laughed out of Athens (maybe the Spartans will be more accommodating?).

Let’s now return to the software world, and see if we can fit it with our analogy. If we find that the software our business depends upon moves at an uncomfortably fast pace, we can:

FURTHER CONSIDERATIONS

DRUM, BUFFER, ROPE

Good Flow is an important characteristic of any successful production line. Drum-Buffer-Rope (DBR) - popularised in the Theory of Constraints (ToC) - is a heuristic to visualise flow and constraint management.

The figure below shows a basic example of flow within a (software) delivery pipeline.

Software Delivery Pipeline

In this case, the assembly line has only five sequential stages (S1 to S5). We find that stage 3 (tagged with an egg-timer symbol) is our system “constraint” (the slowest step in the process).

Now that we’ve identified this constraint we can represent it using Drum-Buffer-Rope. See the figure below.

Software Delivery Pipeline with Drum-Buffer-Rope

In this model the Drum represents the capacity of the “constraint” (i.e. the slowest process in the flow, or bottleneck); in this case, stage 3 in our five-stage process. No matter how fast the rest of the system is, throughput is dictated by this constraint.

WAR DRUMS

For centuries, drums were used by the military for battlefield communication, to signal an increase or decrease in tempo (such as during a march), or to signal coordinated manoeuvres.

Inventory sits on the Buffer, waiting to feed the constraint. The Rope can be pulled to increase flow to the constraint; i.e. ensuring the constraint is never starved (which would effectively cut the entire system’s throughput; a unit of time lost to the constraint is a unit lost for the overall system).
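A crude simulation (a sketch only, with made-up stage rates) shows why the drum must beat at the constraint’s pace: however fast the other stages run, throughput converges on stage 3’s rate, and unfinished inventory piles up in front of it.

```python
# Crude simulation of a five-stage pipeline with a constraint at stage 3.
# Rates (items processed per tick) are made up for illustration.
RATES = {"S1": 8, "S2": 6, "S3": 2, "S4": 7, "S5": 9}   # S3 is the constraint

def simulate(ticks=50):
    buffers = {stage: 0 for stage in RATES}   # work queued in front of each stage
    completed = 0
    for _ in range(ticks):
        buffers["S1"] += RATES["S1"]          # upstream keeps pushing work in
        stages = list(RATES)
        for i, stage in enumerate(stages):
            done = min(buffers[stage], RATES[stage])
            buffers[stage] -= done
            if i + 1 < len(stages):
                buffers[stages[i + 1]] += done
            else:
                completed += done
    return completed / ticks, buffers

throughput, buffers = simulate()
print(f"throughput ≈ {throughput:.1f} items/tick (the constraint's rate)")
print(f"inventory queued ahead of S3: {buffers['S3']}")
```

Note that the sketch deliberately omits the rope: work is released at S1’s full rate, so inventory in front of the constraint grows without bound. Tying the release of work to the drum’s beat is exactly what keeps that buffer finite.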

FURTHER CONSIDERATIONS

VALUE IDENTIFICATION

Value should be a measurement of the whole, not the part; not solely what you can offer, but an amalgam of what you and your supply-chain can offer your customers.

Perceived Customer Value

Whilst customers may not always explicitly state it (Functional Myopicism), they expect certain qualities in the software and systems that they use (and purchase); such as stability, Security, accessibility, Performance, and Scalability. Customers who directly foot the bill for your service probably also appreciate efficient (and cost-effective) software construction and delivery practices.

TABLE STAKES

Viewed this way, these system qualities are often seen as Table Stakes, and may be glossed over during sales discussions. However, that doesn’t make them irrelevant.

Most businesses rely heavily upon the software platforms and services of others; and as a business, we inherit traits and qualities from those suppliers (e.g. platform stability, or instability), yet we can’t necessarily Control any of these aspects ourselves. And if customers value the whole, not the part, then logical deduction suggests that these inherited traits also hold customer value. The figure below shows some examples of inherited traits (value) that suppliers may offer you and your customers.

Value Examples

Some might question the merit of these qualities, so let me present you with some examples based upon my experiences.

EXPERIENCES

EMBEDDED WEB SERVERS/CONTAINERS

Web servers/containers are used to host software and serve web requests. Historically, they have been treated as entirely independent entities, belonging to the deployment and runtime phases of software delivery rather than the construction phase; however, those lines are now being blurred.

Embedding web containers into my day-to-day engineering practices had profound benefits for my software development habits and productivity, compared with my original working practices. Bringing development, testing, and deployment activities closer together enabled me to do more of what I had typically invested less effort into (not through choice as much as through necessity), and to do so sooner.

For instance, prior to the switch, these were the steps I would typically follow:

Whilst I performed some form of incremental development involving deployment and runtime test phases, it was numbing and laborious, and rife with start/stop/navigate/wait activities. Embedded web containers changed all that, and also allowed me to embrace TDD practices.
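The original workflow was Java-centric, but the principle translates. Here is a minimal sketch (in Python, with an illustrative handler and assertion) of what “embedded” buys you: the web server is started inside the test itself, so a change can be exercised end-to-end in seconds, with no separate build/deploy/start cycle.

```python
# Sketch: an "embedded" web server started inside the test itself, so the
# edit -> test feedback loop needs no separate build/deploy/start cycle.
# The handler and assertion are illustrative only.
import threading
import unittest
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"pong")

class EmbeddedServerTest(unittest.TestCase):
    def setUp(self):
        self.server = HTTPServer(("localhost", 0), PingHandler)   # port 0: pick any free port
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    def tearDown(self):
        self.server.shutdown()

    def test_ping_returns_pong(self):
        port = self.server.server_address[1]
        with urlopen(f"http://localhost:{port}/ping", timeout=2) as response:
            self.assertEqual(response.read(), b"pong")

if __name__ == "__main__":
    unittest.main()
```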

I estimate these practices improved my delivery performance by around 25%, enabling me to deliver functionality (of greater robustness), sooner, through more rigorous testing.

MAVEN

Whilst Apache Ant was a significant step forward over its predecessors (e.g. Makefiles), it was - for me - Apache Maven that was the real trailblazer. Maven is a build automation and dependency management tool that uses an elegant, easy-to-follow syntax, sensible conventions (e.g. a standardised location for source code and unit tests), fantastic dependency management (key to minimising duplication and “versioning hell”), and strong plugin support (see my point about embedded web containers). The end result? Increased Productivity, Uniformity, and (release) Reliability.

MOB PROGRAMMING

Whilst initially sceptical of this approach (a group - or mob - work on the same work item together for an extended period, until complete), I soon found it to be a great way to align teams around a domain and/or a problem, gain new skills, collaborate, build trust and acceptance, grow in confidence, and increase business scalability and resilience (having a pool of people with sufficient expertise to solve similar problems increases flexibility and enables the more reliable sequencing of project management activities).

DISTRIBUTED ARCHITECTURE

The introduction of a distributed (Microservices) application architecture enabled me to innovate (use a range of different technologies to solve a problem), isolate change (increase Productivity) and therefore reduce risk, support evolution, and embed TDD practices into my day-to-day work.

THE CLOUD

The Cloud has had a significant impact on many technology-oriented businesses. Need I say more?

LINKS AND ENABLERS

Most of these technologies/techniques have close associations or interrelations, and one often becomes a direct enabler of the next. For instance, in a previous role, I couldn’t gain the benefits of embedded web containers (or Maven) until we broke the monolithic architecture into smaller "Microservices". Once that approach became available, I could more readily apply a TDD mindset to many problems, resulting in better quality code and swifter future change. That TDD-driven mindset supported a marked increase in automated test coverage, which subsequently promoted continuous practices, like CI/CD. Once that was in place, I could look at Canary Releases, and so on.

My point is that there are almost always second and third-order effects to any decision, and you can’t necessarily know what the downstream impact of introducing one idea/technology will be. As I described, the introduction of one innovation may lead to many others, leading to a flood of innovation, and cultural improvements across the business.

PERCEIVED VALUE

Surely some, if not all, of these innovations have value? So why are they given second-class status within so many organisations? I can think of several reasons:

WHAT’S THE MINIMUM?

Good ROI is mainly about doing the minimum to satisfy Table Stakes, whilst investing the remainder in creating diverse functionality that excites customers. You want prospective customers to leave with the perception of a high quality product (which it hopefully is), and to balance effort (and therefore ROI) by doing just enough Table Stakes to be successful. But how do you measure what's the minimum? It's rather subjective.

We can perceive value from two alternate angles:

  1. What external parties (customers) perceive.
  2. What the internal business - offering the service - perceives.

See the figure below.

Perceived Value

The external and internal parties’ perceptions of value are rarely identical, and can often be radically different. There are no hard-and-fast rules in how different stakeholders perceive value. For instance, whilst some customers may perceive that value lies with Functionality, Reliability, Usability, and (possibly) Security, internal stakeholders may perceive that value lies in Functionality, Reliability, Scalability, Security, and Productivity. Much of this comes down to our ability to Contextualise, yet perception may also shift over time, as people gain new learnings, or through the stimulus of some tumultuous event, causing us to reassess our previous beliefs.

We might visualise the problem as two distinct sets of perceived value, intersecting where the two parties are in agreement. See the figure below.

Two Value Sets
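For the set-minded, the model is easy to make explicit (a sketch only, using the illustrative qualities named above):

```python
# Sketch: the two perceived-value sets from the discussion above, and
# their intersection (the qualities both parties agree matter most).
customer_value = {"Functionality", "Reliability", "Usability", "Security"}
internal_value = {"Functionality", "Reliability", "Scalability", "Security", "Productivity"}

shared_value = customer_value & internal_value   # set intersection
print(sorted(shared_value))   # ['Functionality', 'Reliability', 'Security']
```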

For instance, if both parties viewed Security as being of prime importance, then that quality would lie within the intersection, and should therefore be accorded an appropriate amount of energy from both parties. Ideally, there would be a large intersection (a commonality) between the two, representing a close alignment in goals and virtues between the two parties (you and your customers); such as in the figure below.

Large intersection means greater alignment

To my mind, this scenario better represents a partnership between aligned parties, rather than the typical hierarchical customer-supplier model that’s been a mainstay of trade for centuries. In this partnership both parties are deeply invested in building the best product or service; not because it benefits one party only, but because it benefits everyone: (1) your business, which builds a world-class product to sell widely, and (2) the customer, who reaps the biggest benefits from that product.

As Adam Smith put it:

"It is not from the benevolence of the butcher, the brewer, or the baker that we expect our dinner, but from their regard to their own interest." [1]

"INTERNAL VALUE? - WHY SHOULD I CARE? I DON'T PAY FOR THAT"

External customers may be of the opinion that they don’t pay for internal value. They’re paying for functionality, not some seemingly vague notion called Maintainability, Scalability, or some other “ility”.

Whilst I understand that viewpoint, it seems rather myopic, and - to my mind - not entirely valid. In one way or another (whether in its entirety, or through some SaaS-based subscription model), external customers pay a share of the cost of the product or service that is delivered. And ALL software has production and delivery costs. And what about innovation?

If the supplier is slow (because they have inefficient construction or delivery practices), the external customer “pays” in the following ways:

  1. They’re paying for that business to spend time on things other than producing functionality, or (for instance) further stabilising the platform.
  2. They’re not getting innovation quickly enough. Consider this a bit longer. I’ll wait... Innovation is key to the existence of many businesses; without it, many would have shrunk into insignificance. And if your competitors (using another supplier) can out-innovate you, then surely that represents a problem?

There’s also something to be said around Brand Reputation. As a customer, you should be able to ask the tough questions about scalability, resilience, security, productivity etc. Misinterpret these and you’ll pay for them too; whether in fines, lost revenue, share price, or simply embarrassment. Don’t believe me? Do a quick search on some of the big organisations who’ve suffered a major security breach, or the airlines that have suffered system availability/resilience issues, and analyse the outcome.

SUMMARY

My point? Different parties perceive value differently. The greater the discrepancy, the greater the chance of that partnership (eventually) failing. Some modern businesses have dismissed the rather one-dimensional, deeply hierarchical customer-provider business model in favour of a collaborative partnership, aligning on what’s truly valuable (the qualities intersection) and learning from one another, to build long-term relationships of mutual benefit.

We exclude the value that (upstream or downstream) suppliers provide from our regular productisation practices at our peril; it offers benefit both internally and externally, and platform upgrades should be given first-class status alongside internal product improvements.

However, to appreciate value, we must also be able to contextualise it; the subject of the next section (Value Contextualisation).

FURTHER CONSIDERATIONS

VALUE CONTEXTUALISATION

The ability to contextualise value comes from several sources, including:

We can categorise this contextualisation as either proactive (enabling forethought - i.e. Prometheus), or reactive (hindsight, or afterthought - i.e. Epimetheus). See below.

Value Contextualisation

Value - therefore - is an amalgam of what customers can proactively contextualise, and what they must retrospectively contextualise (typically, right after a significant system failure).

The ability to proactively contextualise value can be very important. See the graphic below.

Failure to sufficiently contextualise leading to disaster

Failing to spot the unseen - or, in this case, to contextualise it (the crew spotted a problem; they just never afforded it sufficient credence) - and change course may result in disaster; i.e. if we build a product or business where little is visible (and known), and much remains invisible (the insidious unknown), we should proceed with caution, and be mindful of icebergs.

Most business customers I meet see what’s above the water (e.g. functionality), and thus can contextualise it. Yet they don’t necessarily see, ask about, or get access to, what lies below the surface. Thus, they can’t contextualise its importance or, indeed, its purpose. Business news is rife with examples of systems that weren’t sufficiently contextualised (or given credence) by their owners, forcing them to react (rather swiftly) after a tumultuous event [1]. But by then the horse has already bolted.

A CHRISTMAS CAROL

In Charles Dickens’ famous classic novel A Christmas Carol [2], the main character - Ebenezer Scrooge - is portrayed as a spiteful, grasping misanthropist, with no love for anything other (it seems) than money. You probably know the story.

Scrooge had lost all sense of his humanity, and become blind to problems of his own making. Prior to the main event, we are treated to various scenes of loathsome rapacity as he turns away men of charity with hurtful words; scathes and mocks his good-natured nephew; loads misery and poverty upon Bob Cratchit and his family; and even dismisses the chained spectre of his once lauded, now dead business partner, Jacob Marley, who warns Scrooge to repent before it is too late (Marley is dismissed as a piece of undigested food). None of these episodes is enough for Scrooge to contextualise what he truly values, so he is given a hard lesson.

Scrooge is visited by three ghosts on the eve of Christmas; the ghosts of Christmas past, present, and future. Through the course of the night, Scrooge is shown the error of his ways, and his own mortality is laid bare. It slowly dawns on him that he cares for more than money (e.g. his own mortality, how others view him, and his regained love of fellow man). He repents in time, and is able to change his destiny.

Where am I going with this? Well, it took the visitation of the ghosts for Scrooge to contextualise what he truly held valuable; i.e. it took a tumultuous event for him to reassess his values/beliefs, in order to make changes. Fortunately, time was on his side.

Whilst this novel has a fantastical theme, the underlying issue of reactive contextualisation still applies to how some businesses are run. These businesses are ill-prepared for the visitation of some “quality spectre” (whether it be Security, Resilience, Scalability, or regulatory non-compliance), and are forced to reactively contextualise. It takes some tumultuous event to wake them, and in the meantime they’ll probably suffer harm (e.g. reputational damage, financial loss, dampened innovation).

We can’t change an outcome after the event, yet by proactively contextualising we can influence both our current position and our future.

There’s another aspect to consider here too; whilst what’s below the surface may not sink you, it also may not be to your advantage. The metaphor of the graceful swan above water (external customers contextualise this), with duck legs paddling furiously below (the internal business contextualise this) fits well into this model.

You might be paying for a swan, but getting a duck! Aesthetic Condescension is a popular trap for the unwary; slap a new front end on a legacy product, and sell it as something new. The unwary see a flashy new UI and associate the entire product (and its practices) with modernity, even though it’s just a veneer. Again, there is a contextualisation problem; we’re blinded by beauty and can’t see the ugliness below (or, per the other idiom, “you can put lipstick on a pig, but it’s still a pig” [3]).

WILLINGNESS TO LEARN

Of course, much of this proactive contextualisation assumes a willingness to learn.

I knew a senior executive who (at least outwardly) seemed entirely unwilling to learn about the key technologies or practices used to build the business’ product suite. Now, I’m not suggesting that that executive should be coding software, but their lack of appreciation for it, and how teams worked, made it hard for them to contextualise, so they couldn’t proactively support the business needs - e.g. to identify, correctly prioritise, and solve key problems on the horizon, prior to them becoming serious impediments. My view is that if you’re in the technology business, you should make an effort to understand technology, at least at a high level.

“Only if we understand, can we care. Only if we care, will we help.” - Jane Goodall

SUMMARY

Contextualising value is not necessarily about resolution; foremost, it's about awareness, and then deciding what - if anything - to do about it. Once we can contextualise problems, we may then progress into risk management.

Value Contextualisation comes in two flavours: proactive and reactive.

One is about understanding your path and (potentially) changing your future; the other is about dealing with the after effects of an unknown and unexpected future (more of a fatalist mindset). The converse of proaction is reaction. Favour Proaction over Reaction.

Many have failed to sufficiently contextualise, or give credence to a problem, and suffered. Business news is rife with stories of failing systems leaving customers stranded, significant data losses causing eye-watering financial penalties (sometimes into the hundreds of millions of dollars), and key systems failing to scale at inauspicious times, angering customers and inducing financial recompense. Business Reputation is at stake.

FURTHER CONSIDERATIONS

THE PRINCIPLE OF STATUS QUO

Retaining the status quo - meaning “the existing state or condition” - is important to many businesses.

Whilst many modern books, practices and methodologies place a heavy emphasis on change and innovation at both the technology and cultural levels (i.e. break Cultural Stasis), they tend to neglect to mention that most businesses also depend upon a certain degree of status quo to survive. Innovation tends to be about future success, but stability is about the present situation.

BALANCING FORCE

Sometimes, we are driven so much by what we can achieve that we forget to ask if we should do it. The Principle of Status Quo suggests that we maintain a modicum of balance between change and stability (Stability v Change).

Most businesses, and customers, can’t manage extreme change; it requires deep and sustained cognitive load, and each change carries an inherent risk. Whilst innovation is desirable, it needs to be carefully judged and managed so it doesn’t impact the current perception of stability. No one I know has successfully transitioned from Waterfall to Agile, migrated from on-prem to the Cloud, or shifted from a Monolithic architecture to Microservices in one fell swoop. Change occurs incrementally, not as one big bang, and we maintain most of the status quo whilst undertaking that transition.

Consider - for instance - Agile, Blue/Green Deployments, Canary Releases & A/B Testing. Whilst these practices are certainly a vigorous nod towards progression and change, their approach is methodical and also protective of the status quo, with features like small, incremental change (Agile), fast rollback (Blue/Green), and smart routing to minimise impact on more conservative customers (Canary).
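As a minimal, illustrative sketch (the 5% weighting and the routing function are invented, not taken from any particular tool), a canary release protects the status quo by routing only a small, easily reversible fraction of traffic to the new version:

  import random
  from collections import Counter

  CANARY_WEIGHT = 0.05  # assumed: send ~5% of requests to the new version

  def route(_request_id: int) -> str:
      # Most traffic keeps to the stable (status quo) path; a small,
      # easily reversible fraction exercises the change.
      return "v2-canary" if random.random() < CANARY_WEIGHT else "v1-stable"

  # Roughly 95% of 10,000 simulated requests stay on the stable version.
  print(Counter(route(i) for i in range(10_000)))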

FURTHER CONSIDERATIONS

THE CIRCLE OF INFLUENCE

The Circle of Influence is a way to visualise who (and to what degree) a change influences. It can be a useful tool for influencing and negotiation. The figure below shows an example.

The Circle of Influence

It takes significant effort to convince others of the need to change (whether that change is to how we work, functional, or cultural). People have many different reasons to reject change; from a simple bias, to a lack of understanding, to a (rightly or wrongly held) belief that the change holds no value. Attempting to convince everyone, in a single big bang change, is doomed to failure. See the figure below. [1]

Adoption of Innovation

The graph shows when change is adopted by different groups. Note that adoption occurs at different times, and involves different influencers.

EXAMPLE

At a previous employer I saw an opportunity to make a big difference in the way we built software; yet I didn’t shout loudly for all to hear. It would have been pointless, and may even have hampered the change’s introduction.

I began by influencing my immediate circle (colleagues I worked with on a day-to-day basis), explaining the problem (don’t underestimate the time this takes) I aimed to solve, discussing my proposal with them, and listening to their concerns and improvements, before progressing onto the next stage; a proof of concept (PoC).

This PoC was a success, and gave me and my immediate circle greater confidence that we could expand into the next circle of influence - the wider technology department. Again, there were more discussions, we took improvements on board, extended the PoC, and then took it to the next set of stakeholders (another circle of influence). By the third or fourth concentric circle, we had sufficiently influenced all of the C-level execs to give us the nod to use it for all future work.

If I had approached this as a big bang, I wouldn’t have found sufficiently strong support to influence everyone. Additionally, the overall solution wouldn’t have gained from the improvements offered by my colleagues.

The solution is to build up concentric rings of influence until you’ve enough momentum that there’s no stopping it. But you need to get that stone rolling in the first place; and some stones are big.

BIDIRECTIONAL INFLUENCE

The Circle of Influence is bidirectional. You promote ideas for others to trial, and they agree, disagree, or offer improvements (influence flowing in the opposite direction).

Circle of Influence might be used from internal to external (e.g. customers) influencing, or it might remain internal to the business (e.g. hierarchical influencing). It need not be from least influential to most influential (HiPPO - highest paid person’s opinion); e.g. a cultural change may begin with HiPPOs, but be pushed down to all employees in concentric circles of influence.

ALIGN AROUND A PROBLEM, NOT A SOLUTION

Unless you can explain a problem in a way that the circle truly understands (and I mean truly), they’ll not be able (or willing) to support you to influence the next concentric circle. Forget about explaining the solution until you’re sure they understand the problem.

And even when you think those stakeholders do understand, don’t be surprised when they ask you questions that disprove that theory. You might repeat this four or five times before some stakeholders truly understand the problem, but once they get it, you’ll find the solution just clicks.

Don’t be disheartened. Once people are truly subscribed, they’ll fight your corner. Find enough of the right type of stakeholders, and you’ll have enough sway to influence everyone.

“Only if we understand, can we care. Only if we care, will we help.” - Jane Goodall

The Circle of Influence is used a lot (implicitly) in Canary Releases & A/B Testing; they are practical means of applying it.

FURTHER CONSIDERATIONS

UNIT-LEVEL PRODUCTIVITY V BUSINESS-LEVEL SCALE

There’s a lot of focus on unit-level (individual or team) productivity. It’s easy to see, relatively easy to measure (e.g. velocity), and it’s highly contextualised by those individuals in the team affected by it (and therefore championed). Yet, greater unit-level productivity does not necessarily equate to greater business-level productivity (or scale). Beware of focusing too heavily upon unit-level productivity if it hampers business-level scaling.

I’m skeptical of what some promote as wholly good practices (to me, context is key, and no single approach is wholly good, or wholly bad). Take the practice of Technology Choice per Microservice for instance. Whilst selecting the best tool for the job is a sensible practice, few seem to discuss the second- or third-order consequences caused by an avalanche of such decisions across the entire business. Could we be harming the overarching business with these unit-level decisions? And if so, where is the tipping point?

Let’s consider Technology Choice per Microservice - a great example of unit-level decision-making - in more depth. The premise is simple; each unit may decide which implementation technology(s) to use, per microservice.

At first glance this seems entirely harmless. It promotes a sense of ownership and accountability within that unit, giving it the stimulus to find the right tool for the problem. However, being ultimately flexible in technological choice also comes with the risk that the overall solution is so technologically diverse (i.e. a complex ecosystem) that (a) comprehension can be hard, (b) security concerns are spread over a wider range of technologies, and (c) moving technical staff across domains is difficult (e.g. Simon may be an extremely competent Java developer, but he has no skills in node.js).

Most systems consist of tens, hundreds, or even thousands of these Microservices. If every unit can select their preferred technologies, we’re promoting a policy where it’s acceptable to increase the complexity of an already Complex System, resulting in even fewer people who can contextualise it in its entirety. The consequence of that is further system-level (not unit) fragility; you’re increasing the number of moving parts (in the form of software platforms and libraries), and actually reducing your ability to Control change.

Where does one stop? Can that unit also select divergent technologies for logging, alerting, monitoring, or any other metric-gathering tools that could be used to understand aggregated system health? Personally, I’d prefer the ability to measure, analyse, and view in a consistent manner.

There’s also something to be said for Uniformity from a security perspective. Systems with greatly diverse technologies (technology sprawl) suggest an increase in Manageability and Security challenges. Patching many divergent technology stacks for vulnerabilities may be tough, as it implies an increased likelihood that we must wait upon more vendors to release a patch, as each learns what to change and how to distribute it.

DEPLOYMENT PIPELINE UNIFORMITY

Modern continuous practices often promote Uniformity. For instance, Deployment Pipelines all look and behave very similarly, regardless of the underlying implementation technology of the software unit, simply because many variations may lead to a decrease in that business’ ability to scale; even when unit-level productivity is greatly improved.
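As a minimal sketch (the stage names, services, and commands below are placeholders of my own, not a real pipeline definition), uniformity might mean every service - whatever its implementation technology - flows through the same pipeline shape:

  from dataclasses import dataclass

  @dataclass
  class Service:
      name: str
      build_cmd: str   # the only per-technology variation lives here
      test_cmd: str
      deploy_cmd: str

  # Hypothetical services built with different technologies...
  services = [
      Service("payments", "mvn package", "mvn test", "deploy payments"),
      Service("profiles", "npm run build", "npm test", "deploy profiles"),
  ]

  # ...but every one of them passes through the same uniform stages.
  STAGES = ("build", "test", "security-scan", "deploy")

  def run_pipeline(svc: Service) -> None:
      for stage in STAGES:
          # In a real pipeline each stage would execute the relevant command;
          # here we simply show that the pipeline shape never changes.
          print(f"[{svc.name}] {stage}")

  for svc in services:
      run_pipeline(svc)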

A predominant focus on unit-level productivity can create:

HOW DO YOU CHOOSE?

The benefits of unit-level productivity can be highly attractive to a business. And whilst there is often a case for a unit to have complete autonomy, without being too considerate of the overarching business, it really depends upon the context. And if that statement holds true, then conversely, there must also be cases where an improvement within that unit (individual/team) doesn’t equate to a more successful business.

Before we discuss how we might choose when to increase the efficiency of that unit, I suggest a quick refresher on the differences between efficiency and throughput (Efficiency & Throughput).

As discussed in that section, increasing the efficiency of the unit - by introducing a different technology or technique - fits more closely with a vertical system scaling model than a horizontal one; i.e. you’ll increase Productivity, and also gain some throughput. If your goal is to reduce waste, then this may be a sound investment. However, let’s say your goal is to significantly increase throughput (because it’s a major constraint in Flow). In this case, by focusing on efficiency, not only may you still reach a hard limit, but you may also have hampered the unit’s ability to scale out.

So, how do you choose when to increase the efficiency of that unit? Let's return to our discussion around Technology Choice per Microservice.

Firstly, let’s assume that the unit has chosen an appropriate technology (it isn’t always the case), which - if implemented - will increase team productivity. Below is a list of questions to help assess the situation:

UNIT-LEVEL CULTURAL POLLUTION

There's something else to consider at the unit level, related to interpersonal skills. In recent times there seems - particularly within some influential tech-oriented quarters - to be a backlash against the incredibly talented, but culturally corrosive, Culture Polluters.

You probably know the type. These so called “Brilliant Jerks” are talented and highly-productive at the unit level, and can find innovative solutions to difficult problems. But there’s always a downside... they are a complete nightmare to work with. In extreme cases, they cause such irreparable problems that other teams/individuals must be insulated from them, and those culture polluters are either forced out, or must be ring-fenced to work by themselves.

We’re beginning to realise the detrimental effect these people can have on a culture, and thus on a business’ success. “Brilliant Jerks” can affect a business’ ability to grow and scale. It’s another example where the promise of unit-level productivity (said jerk) is disfavoured to make way for a much broader cultural improvement [1].

If we only look at unit depth, we’ll see highly-productive people; but if we look more broadly and wholly, we’ll witness a dysfunctional business (e.g. people not able or willing to communicate, ideas and innovations from other quarters quashed before they have the opportunity to grow, and a lack of camaraderie, trust, and collaboration).

It’s easy (and seemingly smart) to install these individuals into key business domains and positions - they are after all very good at their job - to the point where you’ve committed the cardinal sin of embedding an irreplaceable Single Point of Human Failure into a key area of your business. At this juncture, you’ve little choice but to retain them, no matter how difficult or horrid they are to others, as you simply can’t replace them. The outcome of this is likely to be lost talent (as colleagues leave), and the hampering of new talent acquisition (as word gets around about your culture).

SUMMARY

To clarify, I don’t advocate a blind investment in unit-level productivity without regard to context, and thus to the wider business ramifications. Whilst the intent of most unit-level decision-making (e.g. Technology Choice per Microservice) is good, beware of “too much of a good thing” (in this instance creating an unmanageable technology sprawl).

Conversely, whilst Uniformity is extremely potent when dealing with similar, or conventional tasks, it may lack what’s needed for radical change and innovation (which is where unit-level productivity can shine). You - with a mind both on technology and the business (i.e. sufficient context) - are best placed to decide when to favour radical change over convention.

I reiterate the following philosophy throughout this book. To make better decisions requires two things:

Balance the need for units to grow and learn, against the cost (monetary, time, or cultural) for the business to support many diverging units. I’d recommend avoiding the extremities (you always do it one way, or you have too many methods to count) and use the Goldilocks Gauge (not too little, not too much, just the right amount).

Finally, beware of Marketing Magpies - individuals attracted to modernity and marketing propaganda, for the sake of modernity over necessity. These individuals are often influential, have strong backing in the form of some evidence (but it’s rarely contextualised since it originates from external sources), and may promote unit-level gains whilst forsaking overarching business needs.

FURTHER CONSIDERATIONS

EFFICIENCY & THROUGHPUT

For the sake of completeness, I’d like to discuss the differences between efficiency and throughput (or capacity).

Dictionary.com defines “efficiency” as:
“the state or quality of being efficient, or able to accomplish something with the least waste of time and effort; competency in performance,”
or alternatively,
“accomplishment of or ability to accomplish a job with a minimum expenditure of time and effort...”

Efficiency, therefore, is about expending as little time or money as possible on a task, by performing it so well that there are minimal wasteful activities. You can improve efficiency by reducing waste, thus reducing expenditure (i.e. better ROI).

Throughput, though, is different. Dictionary.com defines it as:
“the quantity or amount of raw material processed within a given time, especially the work done by an electronic computer in a given period of time.”

The focus here is not on reducing costs per se, but about increasing the system’s overall Flow, by increasing the capacity of a key area of that system.

The concept of increasing business unit (an individual, or team) efficiency or throughput (increasing its ability to scale) isn’t vastly different to how you might improve a software system’s efficiency or throughput (Performance v Scalability).

You can increase system capacity in two ways:

You can scale vertically on a single node by either (a) increasing hardware resources or (b) by improving the runtime efficiency of how the software on that node functions. The trouble with vertical scaling is that you may still reach the node’s maximum threshold and can go no further.

You can scale horizontally, not by improving efficiencies, but by adding further nodes (hardware instances), and delegating work to those nodes in a distributed manner. This form of scaling is more potent (but more complex) than vertical scaling because there’s no theoretical limit.
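As a small, illustrative sketch (all numbers invented), the difference between the two models might be expressed like this:

  base_rate = 100        # requests/second a single node handles today
  vertical_gain = 1.5    # assumed efficiency/hardware improvement on that node
  node_limit = 200       # assumed hard ceiling for any single node

  vertical = min(base_rate * vertical_gain, node_limit)

  nodes = 4
  horizontal = base_rate * nodes   # ignoring coordination overhead for simplicity

  print(f"Vertical scaling:   {vertical:.0f} req/s (capped at {node_limit})")
  print(f"Horizontal scaling: {horizontal:.0f} req/s across {nodes} nodes (add nodes to go further)")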

Increasing business unit-level efficiency - by introducing a different technology or technique to an established team or individual - fits closest to the runtime-efficiency improvements described in the vertical scaling model; i.e. you’ll likely increase that business unit’s productivity and, to some extent, its throughput (whilst you can increase throughput solely by improving the efficiency of an existing business unit, the gain is limited to whatever that unit can manage), but you might still reach the maximum capacity of that unit. This may be fine if your goal is to reduce waste, but may not be if your goal is to significantly increase Flow.

FURTHER CONSIDERATIONS

CHANGE V STABILITY

“And then He created rapid change, and all cheered, excepting the ops.” :)

The technology industry (in particular) has a problem - the continual friction between the two competing forces of change versus stability.


The industry moves incredibly fast. The future success of a business lies with the need for “change” in order to innovate and stay ahead of the competition. Innovation holds risks, but also has the potential for massive reward. However, the sustainability of most established businesses depends upon their ability to maintain a set of stable products and services to existing customers. Most established businesses need the best of both worlds; the “do cool new stuff, but don’t disturb the status quo” philosophy.

These conflicting forces can also create challenges internal to a business (Cultural Friction). The traditional centralised team and departmental structures of many established businesses may create groups that foster conflicting opinions, or even goals; i.e. whilst the overarching business is fostering/promoting change (acted out by adding new features), the traditional (centralised) Operations teams - responsible for maintaining stability in production systems - really desire stability (they’re the ones who get woken at the witching hour by alerts due to a failing system).

The centralising of teams tends to centralise mindsets too. Change - no matter how small - risks instability, and thus causes Change Friction. A group focused on delivering rapid change will attempt to drive it forward with increasing rapidity (sometimes too quickly), whilst a group focused on stability can mount a stalwart defence against (rapid) change, and may be disincentivised from accepting it. A business needs a balance of both.

This friction will remain, ad infinitum; so what can be done about it? To my mind, there are several methodologies, cultural fits, and practices that can bring balance, including:

FURTHER CONSIDERATIONS

FEASIBILITY V PRACTICALITY

“Many things are feasible, not everything is practical.”

The technical feasibility of something is neither a good indicator that it should be done, nor that there is the capability to do it.

Let me offer an example. A (service provider) client once gave me a job to investigate options to replace parts of their existing internal product. The options on the table included integrating with several external sources, or building out the functionality internally (i.e. the Build v Buy dilemma). After initial discussions, it became clear that key business stakeholders were already preparing for an external (Buy) solution, reasoning that it would be a straightforward integration, and the business would realise the benefits of fast TTM. My role was to offer some technical due diligence of the options, and “determine their feasibility” (that was the requirement).

The thing is, as I delved deeper, I found that all options were technically feasible (I apologise for my pedantry, but that was the requirement). The question was far too vague to sensibly answer. It was quite possible to successfully integrate the product with any of the Buy products - given sufficient time and money - but not one was practical.

There were several key concerns:

My point? Being asked what’s feasible does not necessarily correlate with what’s sensible. You can do almost anything; the first question is should you?

FURTHER CONSIDERATIONS

CIRCUMVENTION

Circumvention - meaning “to go around or bypass” [dictionary.com] - can create reliability and consistency problems across a business. It is often the result of Reaction, leading people to action change quickly (to remediate some impending doom), but never to follow it up with a more permanent, sustainable solution. Circumvention changes may be easily forgotten, insufficiently promulgated, or unable to be tackled (particularly if reaction is second nature to that organisation), leading to Technical Debt.

WHEN CIRCUMVENTION BECOMES THE NORM

Watch out for circumvention practices that become the established norm. Each embedded circumvention is like a tick slowly poisoning your bloodstream; debts accrue in the form of additional complexity, poorer TTM, and more Single Points of Human Failure.

Common examples of Circumvention include:

SUPPORTING CIRCUMVENTION

Whilst conformity typically leads to convergence, circumvention can lead to such divergence in practices that supporting (or offering support for) a highly circumvented process becomes severely hampered.

Circumvention might best be explained with a story... Andy has recently joined the company in an IT Operations role. Whilst an experienced IT systems operator, Andy has no business domain context, but has been asked to deploy a new software release into production.

For the sake of argument, let’s say the release is a monolithic software application, with very little in the way of deployment automation. The deployment consists of a number of “build instructions”, crafted over many years (of blood, sweat, and turmoil), and now running to over three pages in length. The excerpt below represents part of those instructions.

1. Copy .zip file to xyz directory.
2. Change directory to the xyz/web directory.
3. Unzip the archive (e.g. unzip ...).
4. Rename the directory to web-deployment (ensure it’s case sensitive).
......
97. Delete the temporary files in xyz directory.
98. Configure the system variable “bob” to be the value 10.
    (Explanation: it needs to be greater than 9 to allow the application to start up correctly.)
99. Tail the log file and check the web server started up correctly.
100. Configure the system variable “bob” to be the value 5.
    (This is the correct value, but can’t be set until AFTER the application is started up.)
......
200. Done!

The process is somewhat protracted, taking several hours to complete. And whilst most instructions are exoteric, Andy finds some highly contextual and esoteric. For instance, whilst steps 98 and 100 are imagined, I’ve seen many examples of these seemingly nonsensical instructions required for some key system to function as expected. These steps are examples where Circumvention has become an established practice.

These instructions are abstruse, and nothing has been done to remedy them. Andy should - quite rightly - question the validity of these instructions.

In this case Andy misses step 98, and completes the remaining steps unaware of any problem. It’s not till step 200 (some three hours later) - when the entire system is deployed and configured - that he finds it doesn’t work. It takes a further two hours (plus an additional Ops resource) to track the problem down to a single missing step. It’s certainly not Fail Fast. None of this is Andy’s fault; it’s a problem with the process (it shouldn’t be so manual, onerous, and filled with circumvention), rather than the person.

Finally, we may find that once embedded, Circumvention is rarely questioned. How do we know these steps are still relevant? They were required when first recorded, but they’ve been taken as gospel ever since, yet it’s quite possible they no longer affect the outcome (they just muddy the waters). As no one has time to check their relevance, they continue to add unnecessary complexity, worry, and slowness.
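As a minimal sketch of one remedy, the imagined “bob” workaround from the instructions above could at least be automated and documented in place, so it stays visible, repeatable, and easy to question later (set_system_variable and start_application are hypothetical stand-ins for however these steps are really performed):

  def set_system_variable(name: str, value: int) -> None:
      """Hypothetical stand-in for however 'system variables' are really set."""
      print(f"setting {name}={value}")

  def start_application() -> None:
      """Hypothetical stand-in for starting the web server and checking its log (steps 97-99)."""
      print("application started")

  def deploy() -> None:
      # Step 98: 'bob' must be greater than 9 for the application to start correctly.
      # WORKAROUND: the original reason is unknown; recording it here makes the
      # circumvention visible and gives us a single place to remove it later.
      set_system_variable("bob", 10)
      start_application()
      # Step 100: restore the steady-state value once the application is up.
      set_system_variable("bob", 5)

  if __name__ == "__main__":
      deploy()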

SUMMARY

As I mentioned at the start, Circumvention is often the result of Reaction, leading people to action change quickly (to remediate some impending doom), but never to follow it up with a more permanent, sustainable solution. This approach is then repeated as each reactive measure is required.

Circumvention could occur anywhere in the product engineering lifecycle. For instance, circumventing performance tests will eventually lead to the introduction of poorly performing software (which can hamper several business qualities) - yet problems can remain dormant for a long time. This mindset may even pollute a development culture; i.e. if writing performant software is perceived by key stakeholders to have little merit, would some developers rest on their laurels (and why would junior developers know any different)?

At some stage, we reach a tipping point, where more is invested (both in time and money) to work around this proliferation of circumventions, accumulated over time, than it would take to permanently solve the problem. This, inevitably, leads to poor TTM, waste (poor ROI), embarrassment, and even Branding issues.

FURTHER CONSIDERATIONS

LEARN FAST

“If fail we must, let us do so with haste.”

Peruse the business section of any decent bookstore and you’ll find many independent publications all singing the same tune - failure is not only an inevitability, but also an important ingredient of (eventual) success.

Late learning - in any form - is a problem in a world dominated by Ravenous Consumption, whether related to late success (i.e. you could have achieved greater success, sooner), or of late failure (i.e. you’ve burnt lots of cash on worthless features).

EMBRACE FAILURE, BUT BE INTOLERANT TO ITS LATE DISCOVERY

We must learn to embrace failure as part of the path to success, whilst being intolerant to its late discovery - i.e. if we must fail, let’s do so quickly.

We can ill afford late learning with Ravenous Consumption nipping at our heels. Its counterpart - Learn Fast (which Fail Fast is a subset of) - has several redeeming qualities:

With Learn Fast, I’m looking to test my assumptions, identify surprises (which might take me down an entirely new and untraveled path, or prove an approach unviable), and compare them against my preconceived notion (and possible Bias) - which may sit at odds with reality. I don’t wish to over-invest in any single untested idea until I have greater knowledge and (therefore) confidence in my approach. In software products this typically involves doing the minimum possible to (safely) distribute an idea (e.g. deploy it into a production environment), where it can be measured and studied.

Learn Fast mitigates some of the risk associated with building out an entire solution, only to find it doesn’t function in practice. Both the foundations of Agile, and of continuous practices, are based upon this principle. For example, we learned to associate Waterfall with risk, due to its Big Bang Delivery approach and protracted learning, so favoured small increments of value.

Embedding this principle within a culture can also have beneficial results. For example, Cross-Functional Teams (diverse units exhibiting a variety of skills, experiences, and thought processes) provide a form of Learn Fast, simply by aligning these diverse groups, sooner. This style of team is regularly cooperating and communicating, fleshing out prototypes, negotiating MVPs, building out coordinated value in small increments, and discussing areas of contention much sooner. Diversity speeds learning.

FURTHER CONSIDERATIONS

GOLD-PLATING

Good artists have two important characteristics:

The first point needs no further discussion, so let’s consider the second.

Good artists have a keen sense of when a work is complete. They're also acutely aware that - after some point - any further change may actually decrease their work’s value. This cessation of activity at a key juncture also has another important quality - it prevents them making any further (unnecessary) investments with no sensible return. In the technology world we term this unnecessary over-investment “gold-plating”.

Like artists (and a software developer is, by the way, a form of artist), technology staff can get carried away. It’s easy to lose sight of the wood (business value) whilst navigating around the trees (technical detail). To my mind there is a large proportion of technologists who either have an unhealthy Refinement Bias, or who struggle to identify the next right thing to do. However, it’s also a tricky subject, particularly if we consider that the idea of “doing it right” can vary greatly, and that Quality is Subjective.

Good technologists, however, are also acutely aware of spend. One eye rests on the technology, whilst the other continually reassesses the change in terms of business benefit and spend. Good technologists know when to stop refining and tackle other concerns.

SMALL STORIES COUNTER GOLD-PLATING

If you’ve ever worked in an Agile manner, you’ll be aware of the drive towards delivering small, cohesive stories.

There are many good reasons to keep stories small and focused. One benefit is their ability to hinder the hidden gold-plating activities that can pop up in larger projects. Note that I’m not suggesting you shouldn’t Refactor, only that you should not continually refine something well past the point of diminishing returns.

To conclude this section, I’d like to recount a story. Some time ago, the company I worked for hired a software engineer to bolster our small team on a software project. Soon after his arrival we started finding unexpected changes to our source code that occurred overnight; we tracked them down to our new joiner.

Now, the original code wasn’t that bad, so whilst we were a bit skeptical, we agreed to incorporate the minor improvements he made. He obviously enjoyed making these refactorings and had a lot of time on his hands (hence the out-of-hours commits).

Soon after though, we’d arrive in the morning to find significant code restructurings. What was worse, it was debatable whether his changes offered any improvement. What began as a minor infringement soon became a major source of contention within the team.

Remember, every change should go through the rigours of the engineering lifecycle before it is accepted, and these were the days before automated deployments and testing went truly mainstream. We found ourselves undertaking a raft of additional activities (e.g. deployments, regression testing) solely to verify the previous night’s “refinements”. This didn’t add business value, slowed our velocity, polluted our culture (the team found the approach uncollaborative), and distracted us from our goal. We concluded that his time was better spent on building new functionality than refining thoroughly adequate software.

This was the ultimate form of gold-plating; where a set of refinements actually slowed progression towards an important business goal, wasted cycles, and created unnecessary cultural friction in a close-knit team.

FURTHER CONSIDERATIONS

GOLDILOCKS GRANULARITY

“Not too coarse, not too fine, just right.”

Granularity is an important, and sometimes overlooked aspect of software design and engineering. It relates to how we define responsibilities, and how we propose that consumers interact with our software. Selecting the wrong granularity can place unnecessary burdens on both the solution and its consumers.

FACADE

The Facade Design Pattern hides complexity from consumers by acting as a higher-level intermediary. This intermediary exposes a coarser-grained public interface to consumers, whilst hiding the finer-grained interactions (the ones the consumers would have made themselves) internally. Not only does Facade hide complexity, it may also improve Performance and Evolvability too.
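As a minimal sketch (the services and the order-placement domain are invented purely for illustration), a Facade exposes one coarse-grained operation whilst orchestrating the finer-grained interactions - and their sequencing assumptions - internally:

  # Hypothetical fine-grained services the consumer would otherwise call directly.
  class InventoryService:
      def reserve(self, sku: str, qty: int) -> str:
          return f"reservation:{sku}x{qty}"

  class PaymentService:
      def charge(self, customer_id: str, amount: float) -> str:
          return f"payment:{customer_id}:{amount}"

  class ShippingService:
      def schedule(self, reservation_id: str, customer_id: str) -> str:
          return f"shipment:{reservation_id}:{customer_id}"

  # The Facade: one coarse-grained call hiding three fine-grained interactions.
  class OrderFacade:
      def __init__(self) -> None:
          self._inventory = InventoryService()
          self._payments = PaymentService()
          self._shipping = ShippingService()

      def place_order(self, customer_id: str, sku: str, qty: int, amount: float) -> str:
          reservation = self._inventory.reserve(sku, qty)
          self._payments.charge(customer_id, amount)
          return self._shipping.schedule(reservation, customer_id)

  # The consumer sees one call, not three (and inherits no ordering assumptions).
  print(OrderFacade().place_order("cust-42", "sku-1", 2, 19.99))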

API design offers a good example of the importance of granularity. Expose too fine a grain of API and you risk:

  • “Chattiness” that causes performance (latency) issues - a process making many remote interactions can suffer substantial performance degradation.
  • Embedding so many Assumptions in consumers that Agility suffers (such as tight coupling to the sequencing of a workflow).
  • Resilience concerns - more network interactions mean more opportunities to fail.
  • Data consistency concerns - e.g. a business transaction left in a partially complete state.
  • Scaling concerns - e.g. the network/server processes n individual requests when one would suffice.

On the flip side, make them too coarse and we risk placing too many Assumptions within the APIs themselves. This makes them incohesive; they may suffer from poor Flexibility/Reuse in different contexts (i.e. lowering our ROI), Maintainability problems, and even Integrability challenges (around comprehension and common sense).

FAULT-TOLERANCE OF COARSE-GRAINED REMOTE INTERACTIONS

Coarse-grained communications promote greater reliability as they are (generally) over sooner (fewer network interactions) and have fewer opportunities to fail due to network instabilities.
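To make the chattiness and reliability points concrete, here’s a tiny back-of-the-envelope sketch (the call counts, round-trip time, and failure rate are all assumed, not measured):

  calls_fine, calls_coarse = 25, 1   # remote calls per business operation
  rtt_ms = 40                        # assumed network round-trip per call
  p_fail = 0.001                     # assumed chance any single call fails

  for name, calls in (("fine-grained", calls_fine), ("coarse-grained", calls_coarse)):
      latency_ms = calls * rtt_ms
      p_any_failure = 1 - (1 - p_fail) ** calls
      print(f"{name:14s}: {latency_ms:4d} ms network time, "
            f"{p_any_failure:.2%} chance of at least one failed call")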

Let’s look at a dialogue now (between Lisa, the integrator; and Jeff, the API Service provider). Whilst fabricated, this conversation is based on several I’ve experienced.

Lisa (Integrator): “Hi Jeff, I have a question about API x that I’m trying to integrate with. The API contract says that field z is mandatory. Why must I pass this data in? It makes no sense and I don’t even use it anyway...”
Jeff (API Service Provider): “Sorry Lisa, it’s part of the API contract. I know it doesn’t make much sense, and you don’t need it, but we need it to make the flow work, so please can you add it.”
Lisa: “Could you not just change the API to make it optional?”
Jeff: “Sorry, but other consumers rely upon it being there and we can’t change it. Sorry.”
Lisa: “Ok, could you create me a new API just for what I need? Also, I’ve noticed a few other fields that don’t make sense either.”
...
Dialogue proceeds.
...
Jeff (soliloquy): Lisa is absolutely right, yet I can’t say that. I don’t even understand why we need to support these fields myself. I wasn’t around when this decision was made, and no one seems to remember the reasoning. All I know is that passing it in makes our APIs work. I hope I don’t get any more of these difficult questions...

This is not a good conversation to have with a customer (which the integrator may represent); nor is it customer-focused. It resembles a tennis rally where there’s no winner, just two deeply frustrated people (where one side asks reasonable questions but gets nowhere, and the other is focused on fending them off rather than offering helpful solutions). These sorts of conversations reflect poorly on the API service provider. Both the messenger and the overarching business look foolish, and customers lose respect and confidence in both you and your offering (Reputation). Whilst not the only cause, in this case (overly) coarse-grained responsibilities were a major factor.

TABLE-LEVEL VIEWS

I've worked with several solutions that coupled the UI forms (and navigation) directly to the underlying database tables. Whilst extremely fast to build (RAD), in the main they were:

  • Unintuitive. It was so easy to expose the underlying data structure that everything was presented, regardless of its relevance.
  • Unnecessarily complex, and therefore, unproductive for the user.
  • Coupled directly (in terms of user experience) to how those underlying table structures were navigated, which was both inflexible and created Evolvability challenges.

Unlike my previous example, this one was caused by the granularity being too fine. It’s worth noting that further iterations were much better received; in the main because they hid this complexity behind facades, and managed the flow for the user.

One final reflection. Please note that I’m not advocating one option over the other. It depends upon the context, which is why I advise the practical application of Goldilocks Granularity.

FURTHER CONSIDERATIONS

“SHIFT LEFT”

“Shift Left” is a popular phrase in modern software businesses. It simply means the practice of (sensibly) moving work items to earlier in the current flow. This practice allows problems to be identified sooner, gaining valuable feedback that can increase business Agility (allowing you to change tack, pivot, etc.).

Examples of Shift-Left include:

FURTHER CONSIDERATIONS

ECONOMIES OF SCALE

Economies of Scale - increasing the scale of operation whilst decreasing unit cost - is often used to gain competitive advantage by:

When one company acquires another, that business typically swallows up the other (including its employees, products, and customers). The acquiring business benefits in three ways:

One way to achieve Economies of Scale is to follow the Growth through Acquisition model.

FURTHER CONSIDERATIONS

GROWTH THROUGH ACQUISITION

While some businesses have succeeded in growing through this model (Economies of Scale), it’s a difficult one to sustain. I believe a key cause for this long-term failure relates to Technical Debt, and an unwillingness (or inability) to consolidate technologies.

Let me elaborate. With each new acquisition comes technical baggage (or debts if you prefer). It may take the form of a multitude of systems to incorporate (each built to mimic that business’ bespoke practices), or non-standardised data sets that capture similar, but slightly different, information to that captured by the acquiring business. It could also be the inclusion of divergent technology stacks, Circumvention, or the thought processes and cultural ethos distinct to the acquired organisation. It’s rare that any sizable acquisition involving technology doesn’t come with these idiosyncrasies.

As the rate of business acquisitions outpaces the business' ability to pay down the technical debt, we find some businesses forced to manage hundreds of discrete applications that often do the same thing, or are only slight variants on what others already do (i.e. they function in the same role, but capture different information, or use a slightly different implementation or integration model). They all make Assumptions about their operating environment. How many different ways of capturing customer information do you need?

THE RISE IN INTEGRATION PROJECTS

Many projects have been initiated solely to manage an integration of systems between two historically competing, but now merged, businesses.

A forceful growth through acquisition strategy can have consequences upon technology, leading to deceleration, cultural fatigue, and innovation stasis. One answer is the Acquisition / Consolidation Cycle.

FURTHER CONSIDERATIONS

GROWTH WITHOUT CONSOLIDATION

“It’s always easier to start something than to finish it.”

Over the years I’ve witnessed several failures in business strategy, caused by an unwillingness (or neglect) to consolidate technology, leading to poor business Agility, and subsequently, poor growth. They are:

LACK OF CONSOLIDATION

One of my biggest concerns within some businesses I’ve worked with is the lack of Technology Consolidation. Rather than solving this problem, there’s a keenness to make another new thing, supposedly to replace the legacy. Yet the retirement work seems to almost always be overlooked (Value Contextualisation), or shortened, to make way for the next big thing, and consolidation never occurs. The result is a vast (and often unmanageable) technology sprawl.

Let’s discuss them.

1. CONSOLIDATION AFTER PRODUCT REPLACEMENT

Consider the following scenario. You work for an established business, providing services to a wide range of customers. The business’ current product has sold well, but it’s now aging and something modern is required to appeal to both existing and new customers. So... you start building a new product.

The business intends to migrate all customers from the old system across to the new one. However, there are two problems:

  1. Building the new solution will take years.
  2. Existing customers must still be supported; this includes functional change and extension on the old product.

In the past, I’ve seen monolith after monolith created to satisfy this business desire. Each monolith uses (at the time) modern technologies and techniques, yet the net result is the same. The products diverge to a point where the migration effort is unacceptable, and multiple products must be maintained forevermore. In the worst case, we may see these monoliths combined to form an aggregated monolith, which may even become a Frankenstein's Monster System.

There’s a fair chance that the business never performs the consolidation phase, so their customer base becomes strewn across multiple applications, causing (operational) management and coordination issues, and with each successive feature added, exacerbating any prospective migration plan.

2. GROWTH THROUGH ACQUISITION

Consider the following example. CoreTech is a business wishing to expand their presence into other regions, and thus become more profitable. They currently offer software products A and B, mainly to a core US market. In the following months they successfully broker a deal to acquire a smaller, European-centric business, called WolfTech, to bring their product (product C) and specialties under CoreTech’s umbrella.

CoreTech now has gained three key benefits:

  1. They have a more diverse product portfolio to offer customers.
  2. They have more customers, thus (hopefully) more profit.
  3. They now have a presence in another region (Europe) to sell their products/services.

Sounds appealing, doesn’t it? But there are some disadvantages.

The products on sale are software applications. We now have three products (good); however, some aspects of those products perform the same function, such as capturing customer details (bad).

As new customers are on-boarded, we find most only need parts of each product. Let’s say a prospective customer (propellor.inc) needs these features to form Product D: Users, Customers, Products, Portfolios, Ledgers, Carts, Discounts. See the figure below.


Note that these features touch all of CoreTech’s products. From a business perspective, this seems OK, so we offer them this solution (product D); why wouldn’t we?

The problem, however, is insidious, and relates to (system) dependency management. To use specific features of each product, we must also manage every one of that feature’s dependencies. For instance, all three products depend upon the existence of user and customer representations, regardless of whether it’s functionally useful, so we must provide them. Some might also comment that we're into the realms of managing a Frankenstein’s Monster System.

We must:

Without consolidation, we find that this approach has forced a complex and risky coordination and synchronization problem onto the entire business (cue Atlas holding up the globe), and what’s worse, it’s probably one that most key stakeholders are unaware of, or cannot contextualise (Value Contextualisation).

THE TAIL WAGGING THE DOG

To clarify, this approach creates a role-reversal, and is a classic case of the Tail Wagging the Dog. There’s no useful business functionality being created. We’re satisfying the system, not the business here, and creating system-level dependencies that the business wouldn’t expect to have to manage.

With the increased complexity comes:

SUMMARY

Whilst it seems like I’m bashing the Growth through Acquisition model, I’m not. I’m attacking the repeated application of this model without a sensible consolidation strategy. Some of the problems caused by neglecting to consolidate are presented here.

When all is said and done, there’s also something to say about lost opportunities. If the business doesn’t acquire now, will it get another opportunity? It may be pragmatic to acquire immediately, and worry about consolidation a little later. Like much in life, it’s about finding the right balance.

The Acquisition / Consolidation Cycle offers one possible solution to sustained growth.

FURTHER CONSIDERATIONS

THE ACQUISITION / CONSOLIDATION CYCLE

The steps (as I see them) to long-term Growth through Acquisition are shown in the figure below.

The typical "growth through acquisition" cycle is:

  1. Acquire new business.
  2. Sell off (unnecessary) assets.
  3. Reduce (unnecessary) staff.
  4. Embed acquired technology and products into the parent product suite.

At this point, it’s very tempting to repeat this cycle, but I’d suggest - for long-term sustainability - you don’t (see my reasoning in the Growth without Consolidation section). You should now consider the consolidation phase, described next:

  5. Consolidate technology.
  6. Migrate customers to consolidated technology.
  7. Reduce unnecessary staff.
  8. Repeat from step 1.

Let's look at the steps.

1. ACQUIRE NEW BUSINESS

In this phase, agreements are made, due diligence is undertaken, contracts are signed, and the business is acquired/merged. We’ve now got a suite of new technology products to support.

2. SELL OFF UNNECESSARY ASSETS

Any products or services that aren’t deemed of value to the parent business are discontinued, or sold on to others. The remaining products are retained, and ingested into the parent company’s product suite.

The parent company can now offer a more diverse range of products (see later stages).

3. REDUCE STAFF OVERHEAD

An unpalatable one, but it’s pretty typical in acquisitions (and one I’ve been on the receiving end of).

The parent company identifies the necessary staff to support the retained products/services (they must retain some experts to manage that product), and discards the remainder. This is mainly about managing profits.

However, it’s worth mentioning that the parent company’s overall staff levels still increase to maintain these products. And - if you’re like many technology companies - most of your outgoings are staffing costs.

4. EMBED ACQUIRED TECHNOLOGY INTO PRODUCT PORTFOLIO

In this stage the new product(s) are ingested into the parent company’s product portfolio and can be sold to prospective customers. Additionally, the parent company has expanded their customer-base/revenue from the customers already using the ingested product(s).

The manner in which the parent company now manages its solutions is critical to their long-term success. If solutions are composed from many other large, tightly-coupled (monolithic) applications, then we may have a technology sprawl (a Frankenstein’s Monster System) that can hamper sustained business growth (see earlier).

5. CONSOLIDATE TECHNOLOGY

The consolidation phase is one of the most important. Without it, we may suffer mass complexity, technology sprawl, duplication of effort, and a general delivery and cultural stasis.

THE PARADOX OF GROWTH (WITHOUT CONSOLIDATION)

There’s a potential paradox here. By attempting growth through acquisition, a business may have hamstrung itself with poor TTM, ROI, and Agility, and may no longer support good business practices, like reacting quickly to the changing needs/tastes of customers, or building disruptive products. If a business isn’t lean, in terms of technology sprawl and staffing, how can it react to both growth demands, and the need to continually delight existing customers?

WAYS TO CONSOLIDATE

Consolidation isn’t easy or quick, and its value isn’t easily contextualised, which is why it is often overlooked. The key aim is a smaller, cohesive footprint (i.e. less to manage): identify features that perform similar functions, then find opportunities to use only a single one.
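As a small, illustrative sketch (the products and their feature inventories are invented, loosely echoing the earlier example), a first pass at consolidation can be as simple as inventorying who provides what, and flagging the overlaps as candidates:

  from collections import defaultdict

  # Hypothetical feature inventories per product.
  products = {
      "Product A": {"users", "customers", "products", "ledgers"},
      "Product B": {"users", "customers", "portfolios", "carts"},
      "Product C": {"users", "customers", "discounts", "ledgers"},
  }

  owners = defaultdict(list)
  for product, features in products.items():
      for feature in features:
          owners[feature].append(product)

  # Any feature owned by more than one product is a consolidation candidate.
  for feature, where in sorted(owners.items()):
      if len(where) > 1:
          print(f"'{feature}' is duplicated across: {', '.join(sorted(where))}")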

6. MIGRATE CUSTOMERS TO NEW PRODUCTS

We've now consolidated the technology/product(s) to create functionally equivalent solutions, so we can migrate the customers on one system to the desired one. Once that’s achieved, we can retire parts of the old system.

7. REDUCE STAFFING OVERHEAD

Sorry, but this is business, and businesses care about profit. By migrating those customers from the external system onto a consolidated product, we need not support the other products, and thus, we don’t need the staff maintaining them. We can either utilise those staff in other areas of the business, or (legalities aside) let them go; this can represent a significant cash-flow saving.

8. REPEAT

We repeat this cycle. Acquire another business. Sell some products, embed the others. Tidy up. Remove waste. If we’ve been cautious, we’ve minimised debts, and supported sustained growth.

SUMMARY

The cycle implies some pauses after each merger/acquisition in order to pave the way for more, or simply to keep the business in good shape. Without consolidation, you may find your systems getting increasingly complex (Complex System), suggesting a larger (than necessary) workforce, an increase in long-winded, committee-based decisions, and a general cultural stasis.

FURTHER CONSIDERATIONS

NEGLECTING TO CONSOLIDATE

Neglecting to consolidate may cause the following issues:

FRANKENSTEIN’S MONSTER SYSTEMS

"I had gazed on him while unfinished; he was ugly then, but when those muscles and joints were rendered capable of motion, it became a thing such as even Dante could not have conceived." - Frankenstein, Mary Shelley.

Mary Shelley's classic horror novel, Frankenstein, tells the tale of Victor Frankenstein, a scientist who tried to play God, and who consequently suffered great anguish and torment from his own designs. His creation is freakish and unnatural, stitched together haphazardly from spare body parts, and animated into being:

"I collected the instruments of life around me, that I might infuse a spark of being into the lifeless thing that lay at my feet."

Up until the creature's terrifying arrival, Victor's belief in his purpose is unshaken - he will peel back the very fabric of creation:

"...soon my mind was filled with one thought, one conception, one purpose. So much has been done, exclaimed the soul of Frankenstein—more, far more, will I achieve; treading in the steps already marked, I will pioneer a new way, explore unknown powers, and unfold to the world the deepest mysteries of creation."

Yet, as the creature awakens, Victor finally grows alarmed, as the magnitude of his error begins to dawn on him:

"...but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."

However, it’s not until later in the narrative that he truly appreciates the horror, as it systematically takes everything he’s ever cared for from him:

"Yet you, my creator, detest and spurn me, thy creature, to whom thou art bound by ties only dissoluble by the annihilation of one of us."

Ok, so whilst dramatic, I feel there are some similarities to what I say next.

Businesses that have pursued the goal of either Growth through Acquisition, or simply of modernisation, yet neglected to Consolidate, may run into Frankenstein’s Monster Systems.

Like the monster, Frankenstein’s Monster Systems may be freakish and unnatural, stitched together haphazardly into some form that whilst alive, may cause considerable distress, and (if not initially, will eventually) be detested by the overarching business.

WHAT FORM DO MONSTERS TAKE?

In the acquisition sense, Frankenstein’s Monster Systems may have grown up to meet the needs of the (acquired) organisation, may well be monolithic, highly bespoke, and never meant to function as a system within the acquiring business. They’re also likely to be written in a language, or on a platform, different to the acquirer’s stack.

The Aggregated Monolith (or Monolith of Monoliths) is a good representative candidate. It’s an approach I’ve seen on at least three occasions, and also a situation some larger organisations have found themselves with.

Let’s consider a case where the business has two existing software applications, and has acquired a further one. Let’s also suggest that all three of these products are established legacy (each serving its business for at least a decade). As such, they’re Antiquated Monoliths (they need not be, but this is the simplest scenario). Because each application models the overarching business, we find many Assumptions that tightly couple components together. This leads to functional duplication across those three applications (such as capturing customer details). See below.

Let’s say we’re merging the businesses into one (in a modernisation project we might also be merging customer bases rather than businesses). We have customer bases in all three applications that we want to offer a new service to (let’s say it’s a stocks-based solution), expecting that a large number of these customers will want it. We can’t simply migrate all customers onto a single platform due to a lack of Functional Parity (each application offers distinct functionality that the customers are using).

So, we form a new product - Product D - requiring the following features: Users, Customers, Products, Portfolios, Ledgers, Carts, Discounts, Transactions, and Stocks. See the figure below.

Note that these features touch all three existing products, so we must include them all.

Some large corporations have tens, or possibly even hundreds of these subsystems linked together to form one massive one. Which leads me into a discussion around Control.

CONTROLLING THE MONSTER

Frankenstein’s monster could not be trusted, or controlled. Something that can also be true of Frankenstein’s Monster Systems (Complex Systems). The wide variety of technologies and solutions in play can’t be entirely understood, so much so that any unorthodox activity may cause an accumulation of events (a snowballing) that cannot be easily stopped or controlled, and may lead to catastrophe.

Businesses that acquire new systems (through either route), yet fail to consolidate, may find themselves glueing existing systems together to form an increasingly complex form of Frankenstein’s Monster, to the point that it’s not understood, appreciated, or controlled, with inevitable long-term results. There’s so much choice that it's extremely enticing to reuse any and all systems, rather than applying the Sensible Reuse Principle.

Let’s look at some of the issues.

SATISFYING THE SYSTEM

When you build a giant aggregated solution from lots of other bits, and stitch them together, you’ve still got to satisfy the needs of each subsystem. That involves populating a tree of dependencies (which likely need to be satisfied in a specific order), first for each subsystem, then for the overall solution.

LEGACY MEANS ENTROPY

Many of these subsystems may be legacy, where Entropy set in long ago. Legacy systems come with their own baggage (constraints) that may bleed into the overarching system, polluting both delivery capabilities and the business’ culture (see next).

CULTURAL STASIS

If the subsystems pollute the overall solution, then we may find everything we do is brittle and takes forever. In these circumstances, we find people becoming disenfranchised (which may lead to high rates of attrition) and Stakeholder Confidence disintegrates.

Due to the brittle architecture, we may find a small change in a subsystem has a rippling effect on the entire solution (Complex System), and Brand Reputation may be tarnished.

So we stop making changes (it’s just too painful), and slip into a downward spiral, finding it increasingly difficult to compete with highly nimble “startups”.

INTEGRATION MASOCHISM

Integration is - in Frankenstein parlance - a stitching together of systems. Whilst integration certainly isn’t a bad activity, you should consider what you’re stitching together, and why (Sensible Reuse Principle).

Stitching dead or defunct legacy systems into the fabric of modern systems may create more problems than it solves, yet that might not be obvious initially. It’s a question of short-term (tactical) over long-term (strategic) thinking.

SECURING THE SYSTEM

Two commonly used security mantras are particularly relevant here:

SUMMARY

Frankenstein's Monster Systems can be symptomatic of Neglecting to Consolidate. These systems can be:

FURTHER CONSIDERATIONS

ANALYSIS PARALYSIS

Analysis Paralysis relates to an individual, team, or even a business, being paralysed into Continuous Deliberation, and unable to progress, with no sensible path to liberation.

It may be caused by:

FURTHER CONSIDERATIONS

CAPEX & OPEX

CapEx and OpEx are two different expenditure models. Capital Expenditure (CapEx) relates to the purchase of (significant) assets that are anticipated to be used in the future, and is seen as an investment by the business. In the UK, CapEx must be recognised on the business balance sheet. Operational Expenditure (OpEx) relates to ongoing operational costs for running a business.

Historically, software projects have utilised hardware, systems, and databases using (predominantly) the CapEx model. In this model, project inception requires the business to either find the necessary kit from an existing source, or to purchase it. Purchasing carries some risk though, because:

The CapEx model can be both restrictive and wasteful. The lengthy cycle times and setup costs can hamper Innovation. Many start-ups simply don't have the capital to make this type of up-front investment, particularly if that large investment is taken on a bet (any idea that hasn’t yet been proven is a form of betting). It’s a risky venture when you consider that the vast majority of start-ups ultimately fail.

One of the key tenets of the Cloud is to turn all this up-front CapEx investment on its head. Cloud vendors recognise the inhibitive nature of the CapEx model on (overall) Innovation, simply because many businesses can’t necessarily afford a significant one-off investment. However, most can certainly afford to rent hardware and services on a month-by-month basis. This is the OpEx model.

The business never owns the asset, but receives other benefits, including:

FURTHER CONSIDERATIONS

INNOVATION V STANDARDISATION

Innovation is about rapid change and working through unknowns. Standardisation is more about alignment and creating something shareable from a Known Quantity. They are two competing factors and may need to be treated differently.

NON-REPUDIATION

Non-Repudiation - a party's inability to repudiate an action or agreement - is an important aspect of many transactional systems, and prevalent within legal or financial settings.

Many actions we undertake in life have consequences. Taking out a mortgage, moving funds between bank accounts, signing a minimum-term contract for a gym, or TV subscription package. All of them form some kind of legal contract (or agreement) between one (receiving) party, and another (providing) party.

LOSS AVERSION

Loss Aversion plays a key part in these contracts: whilst one party is prepared to offer a service, at least one party (typically the provider) is averse to the potential loss the other party could inflict upon them.

Difficulties arise if one party breaks the contract, causing the other (financial) burdens. The victim may be unjustly affected, so legal proceedings begin. Yet, the contract cannot be binding if one party can successfully refute the agreement (the “I never signed that, where’s your proof?” argument) - i.e. repudiation. Good Non-Repudiation mechanisms make it impossible for one party to refute the authenticity of their action.

ANALOGY

A vehicle Dash Cam (Dashboard Camera) is one real-world example of non-repudiation. If you were involved in an accident with another vehicle, but neither party admits fault, then some other form of evidence can be useful. A Dash Cam makes it very difficult for a party to repudiate the evidence.

From a systems perspective, an API Gateway, Edge Service, or an Intercepting Filter are all good (system) mechanisms to capture non-repudiation proofs; being executed at the gateway to the underlying system or resource. Digital Signatures, Certificate Authorities (CAs), and Blockchain also offer approaches to verify authenticity for non-repudiation.
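To make this more concrete, here is a minimal sketch of an Intercepting Filter capturing a non-repudiation proof at the gateway to a system. It assumes a Jakarta Servlet stack, the class name is hypothetical, and the commented-out proofStore is an assumed append-only audit store; a production implementation would digitally sign the record (rather than just hash it) and write it to a tamper-evident store.

  // Minimal sketch (hypothetical names, Jakarta Servlet stack assumed).
  // Records a tamper-evident proof of each inbound request before it reaches the system.
  import jakarta.servlet.*;
  import jakarta.servlet.http.HttpServletRequest;
  import java.io.IOException;
  import java.security.MessageDigest;
  import java.time.Instant;
  import java.util.HexFormat;

  public class NonRepudiationFilter implements Filter {

      @Override
      public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
              throws IOException, ServletException {
          HttpServletRequest http = (HttpServletRequest) req;
          String principal = http.getUserPrincipal() != null
                  ? http.getUserPrincipal().getName() : "anonymous";
          String record = Instant.now() + "|" + principal + "|"
                  + http.getMethod() + "|" + http.getRequestURI();
          String proof = HexFormat.of().formatHex(sha256(record));
          // proofStore.append(record, proof); // hypothetical append-only audit store;
          // a real system would digitally sign the record rather than just hash it.
          chain.doFilter(req, res);
      }

      private static byte[] sha256(String value) {
          try {
              return MessageDigest.getInstance("SHA-256").digest(value.getBytes());
          } catch (Exception e) {
              throw new IllegalStateException(e);
          }
      }
  }

The point is the placement: because the filter sits at the edge, every interaction with the underlying resource is captured, whether the caller intends it or not.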

FURTHER CONSIDERATIONS

SPIKES & DOUBLE INVESTMENT

“Let’s do two spikes in parallel to decide which one is better.”

The premise here is that two spikes are better than one, and that by undertaking both in parallel, we’ll gain twice the understanding. But that’s not always the case, nor is it always appropriate.

Doubling down on spikes to support a decision around Innovation (“is this approach better than that?”) is sensible, but duplicating effort to find the most innovative solution on a conformity project, or an ephemeral solution (we often have to build tactical solutions with limited lifespans when dealing with legacy systems) is needlessly wasteful. In these circumstances, select the best one (based on your current understanding) and only run the other spike if the first produces a substandard result.

FURTHER CONSIDERATIONS

NEW (!=) IS NOT ALWAYS BETTER

A trap I see many technologists fall into is always equating new with better. Whilst it’s often true, it isn’t always true, and swallowing the propaganda of the “new” (e.g. Marketing Magpies) can store problems up for the future.

TERMINOLOGY

I’m describing a multitude of “things” here, including technologies, techniques, and methodologies. However, for brevity’s sake, I’m going to generalise all these concepts under an umbrella term of “tool”.

Firstly, let’s talk about “better”. What is it? How is it measured? And what are we comparing it against? Of course, we could associate "better" with many qualities (e.g. faster development, faster integration, reduced complexity, or standardisation), but the one I’m most comfortable with is how it performs against a set of common “fragility factors”.

If we compare the fragility of two tools, then the one exhibiting less fragility is probably better. To my mind, less fragile tools display these three characteristics:

  1. The entrenchment factor. It has seen high exposure and extensive use across a wide range of industries, domains, and diverse groups of people.
  2. It has stood the test of time. This also implies that the tool has vanquished many other competing tools that were less good (more fragile). It's resilient to competition.
  3. It continues to be championed.

Note also that any tool exhibiting only two of these qualities is not (yet) proof of its superiority. There are plenty of tools that gained early traction, but were extinct within a few years. There are also long-standing tools that never received high exposure (becoming the mainstay of only a single domain or industry), and can’t necessarily be considered superior outside of that domain (although that may be of little concern to those domains making good use of them). And something that is deeply entrenched and has seen lengthy use may no longer be championed.

A tool that satisfies all three criteria is one that still stands tall, having faced the worst that could have been thrown at it. That’s both sustained “normal” use, and the occasional outlier event that shakes the foundations of an industry (like the Internet, or the Cloud). A skyscraper that withstands inclement weather, sees sustained use by tenants, and is still highly-esteemed, is good, but certainly not exceptional. However, one exhibiting all these qualities and having survived (intact) a significant earthquake is a blueprint for success. A retail business that can scale up and sustain their distribution channels during an epidemic (such as COVID-19) is rather resilient, not so much to failure, but to extreme success.

Some examples:

Some of these technologies have seen great success, some may even have been viewed as “technically superior”, but that isn’t necessarily “better”.

Another danger of the new is when it’s not new at all. It’s relatively easy to make something old seem new, when in fact it’s the same thing repackaged with a new veneer (e.g. Aesthetic Condescension is one way to falsify its subject matter) and resold.

NEOPHILIA

Some of my favourite examples of this neophilia come in the form of frameworks and products that promise “high productivity at zero cost”. Sorry, but everything has positive and negative aspects (No such thing as No Consequences). Some promote (mainly poor) practices like direct interactions between the UI and a backend database, throwing all table fields up onto a form for display. It’s certainly rapid development, but is it right? How do you support a new mobile strategy with this solution? Duplicate the approach and create two divergent codebases?

APIS - NEW ISN'T ALWAYS BETTER

Many years ago I remember using Java Servlets to build web service APIs, long before SOAP, and then REST, became fashionable. They were (relatively) easy to build and to integrate with, and overall, were a good solution.

When SOAP started gaining industry traction, we reassessed our API strategy. Although the servlets approach was highly convenient and good (albeit lacking some consistency), we were pressured (mainly by external market forces) to shift our API standard to SOAP. There were many reasons for this, and most seemed reasonable at the time, so one should resist Hindsight Bias.

“Simple” it was not (the ‘S’ in SOAP was meant to stand for simple), and soon enough we saw a new contender (REST) rise, with the promise to right all the wrongs of SOAP. We almost found ourselves coming full circle, implementing something very similar to what we had a decade ago, but after considerable time and investment. New isn’t always better.

Ok, but if my idea of “better” is established and far-reaching tools, how can we innovate? That’s the tricky part, and partly why businesses often couple themselves to tools that don’t survive. Whilst being at the forefront of technology allows you to reap all of its benefits, that assumes the tool is successful; you suffer the consequences if it is not. It’s a balancing act - wait too long and lose some of the benefits, go too soon and you risk introducing failure into a business that is hard to extricate. It is also driven by your appetite for risk. Anything that is mimicked by others (particularly the backing of large technology corporations) is probably not a flash in the pan (e.g. the serverless vision has successfully embedded itself into all major Cloud providers), but it's by no means a guarantee. It’s the agony and ecstasy of working with technology.

SUMMARY

The quest for ever-faster TTM can drive poor (and unchallenged) decisions. Some people get so hung up on the speed of change that longer-term consequences are forgotten or ill-considered. It's hard to describe (or contextualise) what the potential consequences of a decision could be in five years' time, when it's very obvious what the short-term consequences are. We’ll naturally favour (bias) satisfying known short-term consequences over some longer-term, nebulous consequence. This bias can pave the way for the adoption of new tools, regardless of the implications. We attempt to treat the patient without diagnosing their malady, or the consequences of their treatment.

Our industry is littered with examples of tools that were once lauded but are now defunct or demonised. New tools can introduce just as many anti-patterns as the old tools they’re meant to replace, yet you’ll rarely hear that from their marketing teams (and the magpies who follow them), who play up their benefits whilst hiding, downplaying, or even being ignorant of their disadvantages. There are always consequences (No such thing as No Consequences).

Whilst new is often better, that isn’t the same as always.

FURTHER CONSIDERATIONS

“SOME” TECH DEBT IS HEALTHY

Now for something a bit contentious... I regularly hear technologists greet Technical Debt with such contempt that they expect all forms of it to be extirpated. Some technologists will take pains to remove even the most innocuous and unimportant technical debt in an effort to cleanse the system of all evil (a technical debt exorcism). I follow the motive, but not the rationale.

To put it bluntly, I’d be worried if a business didn’t have some technical debt. Surely it would indicate that its technology staff had stopped adding value for the customer, and shifted focus solely onto system health and hygiene (what those engineers felt was important)?

Caveat Emptor. I know of no valid excuse not to clean. You should still address Technical Debt to protect systems and the overarching business from Entropy. A healthy business, though, always has friction between building value for the customer and maintaining system hygiene (to keep the business nimble). If you spend all your time building nothing but customer value, then you’ll eventually fail as the technical debts accrue; yet if you invest all your time addressing technical debt, do you actually have a viable business?

“SOME”

Of course, the definition of “some” technical debt is highly subjective. I’ll leave that one with you.

FURTHER CONSIDERATIONS

CONSUMER-DRIVEN APIS

APIs (Application Programming Interfaces) are among the most fundamentally important business technologies, being a main interface for the flow of information and services between a business and its consumers (including other businesses). APIs therefore deserve respect, and should be designed and built with that respect in mind.

Poorly designed (or constructed) APIs can:

There's a common API-design anti-pattern, known as Bottom-Up API Design. It typically exhibits some of the following characteristics:

DON'T EXPOSE YOUR PRIVATES!

In my article on API design [1], I describe the importance of the “Don't Expose your Privates” practice. In this case, an API exposes unnecessary details to external consumers, who become tightly coupled to those details.

Some Rapid-Application-Development (RAD) tools (indirectly) advocate breaking fundamental encapsulation techniques for the sake of TTM and efficiency. Be wary of any tools or practices (I also see this approach used regularly where no tools are involved) that advocate directly exposing internal (data or navigation) models. It will, of course, be quicker (that's what makes it so appealing) in the short-term, but balance that against the longer-term costs of complexity and the inability to evolve.

To further elaborate, I regularly see this practice used within Microservices which manage data using an Object-Relational Mapping (ORM) framework. It's convenient and quick to reuse the same persistable entities for multiple purposes - to persist data and to represent API requests/responses (which is the realm of DTOs) - but it's a dangerous practice (see the sketch after this list) because:
  • You may tie the consumer to an internal model, reducing Evolvability.
  • You may expose more information than anticipated (or necessary), simply by forgetting to annotate a sensitive field.
  • Mindset. Possibly the most concerning aspect is that this practice suggests an inclination towards bottom-up API design, rather than top-down, consumer-driven design.
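As a minimal sketch of the alternative (all names hypothetical, a JPA-style ORM assumed), keep the persistable entity and the consumer-facing DTO separate, and map between them deliberately:

  import jakarta.persistence.*;

  @Entity
  class CustomerEntity {
      @Id @GeneratedValue Long id;
      String displayName;
      String taxReference;   // internal and sensitive - must never reach the API
      String riskScore;      // internal - used elsewhere in the domain
  }

  // The DTO is designed top-down, from what the consumer actually needs.
  record CustomerResponse(String displayName) { }

  class CustomerMapper {
      static CustomerResponse toResponse(CustomerEntity entity) {
          // Only deliberately chosen fields cross the API boundary.
          return new CustomerResponse(entity.displayName);
      }
  }

The extra mapping code is the point: exposure becomes an explicit decision per field, rather than an accident of the internal model.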

As I've implied, API malpractice can create very serious evolutionary challenges for the business, particularly when consumers are other businesses (i.e. the Business-to-Business, B2B, model). If you think it's difficult to make internal evolutionary changes, wait till you attempt it with external integrators!

Every business has a discrete, individualised evolutionary life-cycle, which may be at odds with the life-cycle of your own business (indicating we have differing levels of Control over the evolution of other businesses). There may be hundreds of these integrators, all evolving at different rates and with different expectations. Some may own a greater market share, and have more financial clout, than your own business, creating a situation where some consumers will dictate your evolutionary abilities. The moral of the story? If you get your APIs wrong, you may be stuck with that decision for a long time.

CONSUMER-DRIVEN MEANS BIDIRECTIONAL INFLUENCE

It's worth noting that building quality Consumer-Driven APIs isn't solely about what API owners think the consumer wants. It's also about empowering consumers to suggest how those APIs should work. I've witnessed situations where both the API owner can't (or won't) support consumer requests and where the consumer is unwilling to broach the subject of change with the API owner (I've even seen this with internal teams). Whilst it's not always possible for the right party to evolve an API, that doesn't mean we shouldn't try.

If owners won't countenance sensible changes within sensible timeframes, consumers may get frustrated (and shop around for alternatives), or be forced to implement unnecessary workarounds, adding complexity to the overall solution. And increased complexity typically indicates reduced reliability, affecting Reputation.

Whilst API owners may disclaim any responsibility for problems that surface due to this complexity (the complexity was all on the consumer side after all), it's rarely that simple. If your consumer suffers a branding faux pas, financial loss, or uncertainty due to an unreliable system, stemming from your APIs, or an exclusionary culture, why would there not be some repercussions within your own business (No Man is an Island)?

Sometimes the reluctance to change isn't cultural but a problem with the system as a whole. In these cases, the system may not support a mechanism for API owners to establish who an API's consumers are, and how they use it. This is a form of Change Friction - we simply cannot risk a change without knowing its impact.

SUMMARY

Some final thoughts for this section. Don't drive API design from the bottom-up, but from the top-down (i.e. consumer-driven). Where possible, let consumers drive flows, and external structures. Consider the need, name, and purpose of every data field before exposing it.

FURTHER CONSIDERATIONS

MVP IS MORE THAN FUNCTIONAL

One mistake I’ve seen some businesses make is assuming that an MVP (Minimum Viable Product) is solely about delivering functionality. It’s not.

The word “viable” offers a clue. The purpose of an MVP is to prove the viability of a solution (and to reduce business risk) - both its functional viability (does the customer want these features?), and its non-functional viability (is the solution technically viable?). You can’t do both if your only focus is on functionality (Functional Myopicism).

FURTHER CONSIDERATIONS

THERE'S NO SUCH THING AS NO CONSEQUENCES

“You can't have your cake and eat it.” - English Proverb.

Everything we do in life has consequences. Let no one deceive us of this. Every decision (or lack of decision) has some effect - either positive or negative - on a system.

The challenge of technology (or any Complex System) under these circumstances, is that the outcome (consequences) of a decision is not always noticeable, easily understood, or quickly seen.

CONSEQUENCES

The implications of a poor decision on a Complex System may not be seen for many years, potentially long after the people who made it have left (some studies suggest the average retention time for technologists is around two years). [1] Yet, your business may own and maintain these systems for decades.

I hope that what I've presented so far helps to illustrate my point. Each technical quality is tightly linked to many others (and to other business qualities). Applying a positive force on a quality in one area will have a detrimental effect on others. It's not possible to get the best of everything - something must be sacrificed to the gods.

Let me offer some examples:

FURTHER CONSIDERATIONS

WHY TEST?

Why do we test? Yes, I saw that quizzical look you gave me! Surely it’s obvious?

The stock answer is that testing verifies quality (e.g. product quality), and that protects other important characteristics, like customer satisfaction (and thus, Brand Reputation). Testing teams are - after all - often regarded as Quality Assurance (QA). But that’s not the whole picture, and it’s easy to lose sight of the rest.

Testing can also:

SUMMARY

The later we check for quality, the costlier a breakdown in quality becomes - in both money and time. Additionally, a lack of (sufficient) testing may also reduce business Agility, and impede our ability to Innovate.

Aside from the financial aspect - which may be significant - both of these failures can lead us away from producing other types of value. This wasted time could be the difference between a business built on mediocrity, and one that finds its fortune with a disruptive technology.

FURTHER CONSIDERATIONS

BULKHEADS

The idea of (system) bulkheads is not a new one. The Chinese have been building ships using bulkheads (compartmentalising) for many centuries (Chinese junks contained up to twelve bulkheads), for several reasons:

Consider the figure below, showing a schematic of the RMS Titanic.

Bulkheads in RMS Titanic [source: Wikipedia]

You might wonder why I chose to present the RMS Titanic; it is, after all, notorious for its tragic failure. However, it failed not because of the compartmentalisation technique, but (mainly) due to its poor implementation (and some pure misfortune). It's also a good example of how poor (or no) compartmentalisation can have disastrous outcomes. Once the water flowed over one bulkhead it would then flood the next, in a cascading fashion, until sinking was the only outcome.

Contamination is another interesting one, and plays into Domain Pollution. The Chinese didn't want a catch of fish contaminating other perishable goods, such as grain, so they kept them isolated and prevented any mixing.

These shipping analogies are also good parallels for how to build software. We too want to prevent the flooding of our systems (Flooding the System), and no part of a system should bleed responsibilities, data, or performance expectations into another domain, else we contaminate that domain (with productivity, evolutionary, and data integrity challenges). System Bulkheads - such as queues (Queues & Streams) - help to isolate distinct parts of a system(s), enabling each to move at a pace appropriate to it (and not be dictated to by others), or even to fail, recover later, and still achieve the system’s overall goal (Availability). Bulkheads have another redeeming feature - they create autonomic (self-healing) systems that need little to no manual intervention to recover.

BREATHING SPACE

By placing intentional barriers between distinct domains, we ensure one can’t flood another, giving us time to pump out the water (or messages) when convenient.
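As a minimal sketch of the principle (hypothetical names; in practice the bulkhead is usually an external queue or stream, such as a message broker, rather than an in-process one), a bounded queue stops a fast producer from flooding a slow or failed consumer, and lets each side move at its own pace:

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.TimeUnit;

  public class OrderBulkhead {
      // Bounded capacity: the compartment can only hold so much "water".
      private final BlockingQueue<String> orders = new ArrayBlockingQueue<>(1_000);

      // Producer side: if the compartment is full, apply back-pressure rather than sink.
      public boolean submit(String orderEvent) throws InterruptedException {
          return orders.offer(orderEvent, 250, TimeUnit.MILLISECONDS);
      }

      // Consumer side: drains at its own pace, and can recover later after a failure.
      public void drain() throws InterruptedException {
          while (true) {
              String event = orders.take();
              process(event);
          }
      }

      private void process(String event) { /* domain-specific work */ }
  }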

FURTHER CONSIDERATIONS

THE PARADOX OF CHOICE

Choice is a wonderful thing. Isn't it?

We hold a certain bias towards an abundance of choice, often painting it in a positive light; e.g. “Thank goodness we have many ways to solve this problem”, “Which pricing model should we offer our SAAS customers?”, “Which vendor should we deploy our new solution to?”, “Which of these ten languages should we use for this solution?”, “The CTO has promised that I can select my own team from our hundred employees, which ones do I choose?”, “In which sequence should I re-architect domains in our existing monolithic application to modernise it?” You've probably faced some of these choices yourself.

Choice overload, though, can actually stifle decision-making, causing procrastination, confusion, and paranoia (we become overly concerned about making a poor selection). In the technology world, this may lead to Analysis Paralysis, and protracted solution delivery.

Whilst it’s easier said than done, one simple solution to this problem is to reduce the amount of choice. Alternatively, if finding the optimal path on the first pass is impractical (and it’s not always necessary), just choose one to bet on, cognisant that it may not be optimal, but that it will break the paralysis and enable the team to either reduce their choices, or identify the optimal path. This approach can act as a stepping stone to the next best thing.

MAKING SENSE OF IT

You may also use sense-making frameworks (like Cynefin) to help guide your decisions. Once you get a sense of where you are, you can - for instance - clear the way by investing in further analysis, or alternatively spend that time trying something and then deciding what to do next.

FURTHER CONSIDERATIONS

TWELVE FACTOR APPLICATIONS

Twelve Factor Applications is a methodology for building software applications that helps to instill certain qualities and practices into software. Software that exhibits these qualities is often more evolvable (Evolvability), resilient (Resilience), testable (Testability), maintainable (Maintainability), releasable (Releasability), and manageable (Manageability).

I won’t name all twelve factors here (see this page [1]), but I will describe some of the factors that I find particularly enlightening:

I discuss them next.

SEPARATE BUILD & RUN STAGES

Separate build and run stages

A software change typically has three stages:

  1. Build. Check the stability, quality (e.g. gather source code quality metrics), and the functional accuracy of the source code being transformed into a software artefact. This should be done once per release. The output is typically a packaged software artefact (if successful), which is uniquely identifiable, and stored for future retrieval.
  2. Release. Combine a specific configuration with the software artefact (built in the previous stage) to enable it to run in the target environment (i.e. contextualise it for that environment). The software is now ready to be run.
  3. Run. Run the software in the target environment.

BUILD V DEPLOY & RUN STAGES

The build stage is responsible for packaging up a software application and storing it for access in subsequent stages. We may initiate the release and run stages many times, but we only require a single build stage.

By separating the stages and treating each as independent we:

IMMUTABLE, IDENTIFIABLE & AVAILABLE

Note that the created software artefact should be immutable, identifiable, and available. Any new change to the source code requires the creation of a new artefact version (this also protects against Circumvention).

ONE-TIME BUILDS ARE A MEASURE OF CONFIDENCE

It's quite possible to rebuild the same source code, yet get different results, and thus inconsistent artefacts. This is dangerous and why we should avoid repeated builds of the same code for the same purpose.

By executing the build stage once (and only once) per release, we gain confidence that the artefact - stored immutably for future deployments - is consistent across every deployment.

DEPENDENCIES

Explicitly declare and isolate dependencies

This factor has two key points:

  1. Each and every dependency - including their version - is explicitly declared.
  2. An application’s dependencies are isolated from another’s.

Software is not an island. It requires the support of many other software artefacts (libraries, frameworks) to form something useful. Yet software and its dependencies change. Classes are deprecated, packages move, implementations change, flaws are patched, etc. Any change in a dependency creates potential friction in your software. Some may be readily apparent (e.g. compile-time failures), some may not (e.g. runtime failures). Feedback, surety, and confidence are lost with implicit dependencies. It's quite possible to successfully develop and test a feature, only for it to fail in LIVE due to a different dependency configuration.

Secondly, not only must we be explicit in our dependency management, but we must also isolate those dependencies from the pollution of others. One application, one set of dependencies. There should be no sharing or centralising of dependencies as:

The downside to this approach is one of duplication, resulting in increased storage needs. Whilst this certainly used to be a problem, it isn’t really nowadays with storage costs so low.

REAL-WORLD EXAMPLE

I still remember one of my earliest forays into Java web development. At that business, several web apps were deployed into a single web container, using a shared library model. Storage was more expensive in those days, and it seemed sensible to share dependencies, but Control was a real problem.

Each application used a mixture of: distinct libraries, the same (shared) libraries, and different versions of the same (shared) libraries - all held within one directory and class-loader. It became impossible to associate the correct version of a shared library with the application that used it, it being left to the whims of the class-loader’s ordering strategy. This was unsustainable.

The use of packaged .war files - and later Maven - was a revelation. By explicitly declaring the library/version, and isolating them, there was no pollution and the correct dependency was always linked.

DISPOSABILITY

Maximize robustness with fast startup and graceful shutdown

There’s a well-known metaphor in modern software delivery - software applications should be treated as “Cattle, not Pets” [1].

Treating software as a pet places the wrong onus on it - one of attachment. In this model, considerations such as the time/effort invested, manual configuration, and even Circumvention create an unhealthy attachment (a Disposability Friction). What’s occurring is an Attachment Bias - our decisions are shaped by our attachment to the software, rather than (necessarily) by the overarching business’ needs (e.g. resilience, recovery time, scalability). When our software sickens, we rush to the rescue, further exacerbating our attachment as we invest more into the workarounds that keep it healthy.

DISPOSABILITY & MONOLITHS

I’ve seen this attachment model with Antiquated Monoliths and Lengthy Release Cycles where:

  • The configuration effort was highly specialised and significant.
  • Containers were slow to initialise (e.g. 10-15 mins) and destroy, making the thought of restarting them (let alone redeploying them) unappealing.
  • Deployments were atomic and contained many sequential activities.

Monoliths also contain (many) more Assumptions than a typical domain-driven component (e.g. microservice), again challenging disposability and increasing our Attachment Bias.

To my mind, this attachment bias actually creates a reliability paradox. By attempting to protect software against all comers we may build less resilient software (sacrificing Resilience in a vain attempt to protect Availability), such as:

Ok, so how does all this relate to disposability? There are two qualities I consider when viewing disposability:

  1. Resilience - the ability to recover from a failure, ideally autonomically. Software shouldn’t rely on someone being available to configure, deploy, and verify it. Declarative platforms with observable and autonomic qualities should be used where practicable.
  2. (Rapid) Scalability - the ability to rapidly scale out services to meet demand. The time element is important here.

Disposable software should:

REAL-WORLD EXAMPLE

Off the top of my head I can think of several instances of software failing the disposability factor. In one case I witnessed the spawned application being initialised by reading many GBs of data from a datastore and storing it in a local cache. Initialisation took around 15 minutes, during which time the service was unavailable.

It failed the disposability factor on at least two accounts (a corrective sketch follows this list):
  1. It took too long to initialise (seconds is acceptable, but not minutes). Anything that takes so long to initialise creates some attachment bias; it also limits (rapid) scalability.
  2. It depended upon embedded state (not a backing service).
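A minimal sketch of a more disposable alternative (hypothetical names): start accepting traffic quickly, warm the cache in the background from the backing service (falling back to that service for cache misses until warm-up completes), and shut down gracefully via a shutdown hook.

  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class DisposableService {
      private final ExecutorService workers = Executors.newFixedThreadPool(8);

      public void start() {
          // Register graceful shutdown up front: drain in-flight work, then exit.
          Runtime.getRuntime().addShutdownHook(new Thread(this::stop));

          // Don't block startup on a lengthy cache load; serve misses from the
          // backing service until the warm-up completes.
          CompletableFuture.runAsync(this::warmCache, workers);

          // Ready to accept traffic within seconds, not minutes.
      }

      private void warmCache() { /* stream hot keys from the backing service */ }

      private void stop() {
          workers.shutdown();
          try {
              workers.awaitTermination(30, TimeUnit.SECONDS);
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
          }
      }
  }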

DEV/PROD PARITY

Keep development, staging, and production aligned

Picture the scene. Your business is building an important software feature to be released to the market. The marketing team has already sent out press releases and there has been great interest. Everything seems to be going well - the feature has been successfully developed and tested and it's now ready for production release. The fateful day arrives - the software is released to production with some fanfare. Shortly after though, the customer care department begins to receive a flood of complaints from existing customers. Something is terribly wrong. The production environment is rolled back to the previous (stable) release, resolving the customer issues; however, the press release is a disaster and reputational harm is done.

What went wrong? After a deeper investigation it turns out that the production environment was configured differently to the other (development and test) environments, creating a disparity. It seems that Circumvention was at play. The previous product release was manually configured to get it working, but that change was never propagated back into the earlier environments - had it been, this problem would have been quickly caught and resolved, and no harm would have been done. We wish to avoid this disparity.

IS FEATURE X LIVE YET?

It's sometimes possible to infer a disparity between production and other environments, for example, with questions like: “is feature x in live?”

What this suggests is that there’s a substantial (probably temporal) gap between a feature being built, tested, and finally put live. The person asking this question can’t determine where that feature is, due to the unpredictable time gap between when a feature completes development and when it is released to production.

This disparity may originate from:

EXPEDITING & CIRCUMVENTION

Expediting is a common pitfall in (particularly reactive) businesses; it occurs when a feature is deemed more important than the others already in the system, so is given priority. Expediting typically involves Context Switching; work stops on the current work items, the system is flushed of any residual pollutants from those original work items, and then work begins on the expedited work item.

Expediting can also lead to Circumvention. Established practices are circumvented to increase the velocity of a change (expediency), or so we hope.

We expedite and circumvent mainly due to time pressures. We can reduce those pressures through efficient and reliable delivery practices.

FURTHER CONSIDERATIONS

AUTOMATION ALSO CREATES CHALLENGES

Imagine that you work in a factory creating caramel shortbread (one of my favourites). The factory sells to a number of retail customers, who then resell them on to their customers for another profit.

The factory can produce 10,000 bars a day, costing them around 15c per bar (in ingredients/electricity/employment). They sell each bar to their retail customers at 80c, who then sell it to their customers at $2. The factory runs ten hours a day, producing an average of 1000 bars an hour.

There are many stages from pot to shop, and many things can go wrong, such as:

Now, I’m not an expert on the productionisation of caramel shortbread, but I suspect that many other things could go wrong on a production line. The point is, not all problems are immediately visible, but they can all cause problems in either the quality of the end product, or create unnecessary waste.

Let’s say a batch gets polluted soon after the initial bake. To counter this, the factory puts a “taste test” control in place, verified every five minutes. However, the factory can process a decent number of bars in five minutes (of the order of 83 bars - 1000/60 * 5). That’s around $12.45 in waste plus $54 in lost profit for those five minutes. [1]

Not too bad, all things considered? Ok, so let’s up the stakes. Let’s reduce the “taste testing” control to once every hour and also increase factory output from one thousand to ten thousand bars per hour. (Conservative) Estimates now put the loss at $8000 ($1500 in waste and $6500 in lost profit), and that’s assuming no additional surrounding waste. We see a marked increase in the cost of a failure as (a) the number of units increases and (b) the time between verifications expands.
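For those who like to check the arithmetic, here is a quick sketch of the estimate (using the assumed 15c cost and 80c sale price per bar):

  public class WasteCalc {
      public static void main(String[] args) {
          double costPerBar = 0.15, salePerBar = 0.80;
          long barsPerHour = 10_000;          // increased output
          double detectionWindowHours = 1.0;  // "taste test" once per hour

          double barsAtRisk = barsPerHour * detectionWindowHours;
          double waste = barsAtRisk * costPerBar;                     // ~$1,500
          double lostProfit = barsAtRisk * (salePerBar - costPerBar); // ~$6,500
          System.out.printf("waste=$%.0f lostProfit=$%.0f total=$%.0f%n",
                  waste, lostProfit, waste + lostProfit);
      }
  }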

If we were to revisit this concept on a smaller, less autonomous, scale (e.g. a local bakery producing their shortbread by hand, and selling it on directly to their customers) we’d find the same contaminant failure risk had much lesser consequences (it might only cost a few hundred dollars).

What I’m trying to demonstrate with this analogy is that whilst automation has great merit (automation is often key to sustainable business Agility, scalability, growth, and Economies of Scale), failure can be costly. [2] In the software world, data is (often) “king”. Data about a person, transaction, event, or something that is used to make an important decision (e.g. who to offer insurance to, stock trading, or diagnosing a disease) must be accurate. However, the consequences of an automation failure may be much greater (in the number of affected records), and may even lead to severe financial penalty. Of course the same problems can occur due to manual intervention, but they tend to have a lesser impact as fewer records can be visited in the same period.

The impact of automation failures can be alleviated through several mechanisms. The first thing to consider is how to stop exacerbating an identified problem. This is achieved using a context-specific Andon Cord - i.e. a switch-off mechanism. Secondly, consider keeping a data history to provide a means of both fast comparison and rollback. Whilst a regular data backup (archiving) may provide a reliable history, it's not always immediately available, easily accessible, or fast. You may also wish to consider how to roll back to an earlier point in time (e.g. versioning) if an additional (correct) change is stacked upon an incorrect one. You probably want to keep the correct change but revert the one it's built upon.
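Here is a minimal sketch (hypothetical names) of a context-specific Andon Cord in an automated job: the job checks a kill switch before each record, and pulls the cord itself once anomalies pass a threshold, rather than snowballing an identified problem.

  import java.util.List;
  import java.util.concurrent.atomic.AtomicBoolean;

  public class AutomatedRepricingJob {
      // The cord can be pulled by operators, by monitoring, or by the job itself.
      private final AtomicBoolean andonCord = new AtomicBoolean(false);
      private static final int MAX_SUSPECT_RECORDS = 50;

      public void run(List<String> recordIds) {
          int suspect = 0;
          for (String id : recordIds) {
              if (andonCord.get()) {
                  return; // the cord has been pulled: stop the line
              }
              if (!applyChange(id)) {
                  suspect++;
              }
              if (suspect > MAX_SUSPECT_RECORDS) {
                  // Too many anomalies: pull the cord rather than make things worse.
                  andonCord.set(true);
              }
          }
      }

      private boolean applyChange(String recordId) {
          // Hypothetical: snapshot the record first (for rollback), then mutate it,
          // returning false if the result looks anomalous.
          return true;
      }
  }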

Other useful approaches include:

FURTHER CONSIDERATIONS

DECLARATIVE V IMPERATIVE

What are the main differences between the following approaches?

The first approach is all driven by one person telling others exactly what to do and the order to do it in. That person is deeply (inextricably) involved in both the what and the how of reaching the outcome. The second approach focuses on defining the end state, but letting others figure out how to reach it. That person is only involved in the what, and has little involvement in the how.

This model may be applied to container management platforms, automation frameworks, programming, and even leadership. Kubernetes - an open-source container orchestration platform - supports the declaration of objects in both imperative and declarative styles. Aspects of functional programming (e.g. map(), filter(), reduce()) are more closely associated with declarative than the historical imperative mode of many programming languages.

The imperative mode - being instructive - is quite detailed and offers fine-grained control. This is useful if a declarative model doesn’t (or can’t) provide sufficiently detailed control. Whilst individual steps in an imperative model may be easier to read, this benefit is countered by the need to write (often far) more instructions than its declarative counterpart, making comprehension of the whole harder. This also makes the imperative model harder to scale. Finally, being more detailed, and contextual, we may find there are more assumptions (Assumptions) embedded in imperative scripts, thus reducing reuse (Reuse).

With the declarative mode, we simply tell a framework/platform what our end state is and let it figure out how to get there (declarative uses, but doesn’t define the steps). It uses a layer of abstraction which often reduces complexity because (a) implementation details remain hidden, (b) there’s less code to understand, and (c) it can help to reduce duplication. Assuming the abstraction already exists, then the declarative mode can also reduce the time and effort required to solve a problem.
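A small illustrative example of the same outcome expressed both ways - totalling the value of qualifying orders - using Java's streams for the declarative form (the names are hypothetical):

  import java.util.List;

  public class OrderTotals {
      record Order(String region, double value) { }

      // Imperative: we spell out each step and control the "how" ourselves.
      static double imperativeTotal(List<Order> orders) {
          double total = 0;
          for (Order order : orders) {
              if (order.value() > 100) {
                  total += order.value();
              }
          }
          return total;
      }

      // Declarative: we describe the desired result and let the library do the stepping.
      static double declarativeTotal(List<Order> orders) {
          return orders.stream()
                  .filter(order -> order.value() > 100)
                  .mapToDouble(Order::value)
                  .sum();
      }
  }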

FURTHER CONSIDERATIONS

WORK ITEM DELIVERY FLOW

The “ticket” is a fundamental delivery management mechanism for software delivery. It typically represents a user story, which (usually) represents the partial delivery of a software feature. Accompanying the ticket on its travels is a work item (the software of value) - together they transition through a number of different stages until completion. Ticket flow may differ slightly per organisation, but in the main it follows this flow.


The stages are:

Make sense? Ok, so I fibbed a bit, sorry... Whilst the above flow represents the main stages, it doesn’t necessarily present a true practical picture. The true picture changes based upon the chosen form of engagement (Indy, Pairing & Mobbing); however a standard (Indy) flow looks like this.


A bit more rambling - and repetitive - than the first flow, isn’t it? The key difference is all of the waiting stages - which we often find sandwiched between the doing stages - a problem common to any form of manufacturing, not just software, and one that LEAN principles take into account (e.g. The Seven Wastes).

FURTHER CONSIDERATIONS

SHARED CONTEXT


Software is but one representation (or implementation) of the ideas that flow through a system (a system in this sense being an “idea system”). To me, there are two qualities that are important in the (software) implementation of an idea:

Whilst I shall assume the first, the second requires something more fundamental than software; it requires a Shared Context.

New software is (still) often written by an individual. And whilst that’s not necessarily an issue, it does raise questions about how we (a) maintain it if that individual is unavailable (they’re too busy, on holiday, or have left the company), or (b) increase support capacity to it when the business must scale up in that area. The sharing (and retention) of contextual information about the ideas we build (a shared context) is vital.

Consider how work items are managed (Work Item Delivery Flow). It's extremely rare for software to be released (or built for that matter) without also releasing contextual information (e.g. the why, who, what, how) alongside it. At each stage, there’s context applied or shared with others. A Shared Context may be used to:

THE WHATS, HOWS & THE WHYS

Whilst software can provide the whats and hows, it can’t provide the whys. Nor can it always provide the reasoning why one approach was selected whilst another was discarded.

WORKING SOFTWARE OVER COMPREHENSIVE DOCUMENTATION

Note that I’m not suggesting we ignore the Agile Manifesto’s principle of “Working software over comprehensive documentation”. Shared Context is less about documentation and more about communication and the sharing of ideas.

A loss in shared context may create:

“So, let’s just remember to share more.” The problem with this statement is that contextual information is delivered through many different communication channels (written, using numerous tools, verbal), and communication can be lossy. See below.


Ideas (in the idea cloud) are realised through software, and the work undertaken produces a Shared Context. Whilst all aspects of the software are retained, the shared context information splits into two paths:

  1. Information (context) that’s widely shared and retained and then fed back into our aggregation of ideas (the main ideas cloud). This is good.
  2. Information (context) that’s lost. This information was once known but is now nebulous. Assumptions are created around this lost information. This is bad.

WHAT ASSUMPTIONS DO

When we’re unsure of how something works we can:

  1. Ask someone to explain it. This isn’t always possible.
  2. Investigate it ourselves. This may be possible, but takes time.
  3. Assume how it works.

Decision-making founded on assumptions is a game of risk. Incorrect assumptions may lead to unnecessary work, late learning, and failed outcomes. Even an assumption that is correct is still an assumption, creating uncertainty (e.g. Analysis Paralysis).

Shared Context is shared through communication, but communication depends upon many aspects:

No wonder it’s so difficult to communicate effectively, yet the benefits of getting it right are vast, including increased (business) flexibility, (team/business) agility, (business) scalability, morale, and fewer Single Points of Failure.

REACTIVE OR PROACTIVE

Arguably software delivered without a Shared Context follows a more reactive model, whereas consideration of a shared context is a more proactive (and sustainable) model.

To my knowledge, there’s no recipe, nor any standard measure of either shared understanding or idea retention. In an ideal world every stakeholder gets the context they need, but reality is often different. Several approaches can help:

FURTHER CONSIDERATIONS

DIRECT TO DATABASE

Direct-to-Database is an appealing pattern that I see again and again, mainly but not exclusively, on legacy systems. Except for the odd case, I’d generally classify it as an anti-pattern.

The premise is simple. By interacting directly with a data store - and avoiding building any intermediaries - we can share information, and build things, much faster. The consumer may be the user of a user interface (UI), or indeed, another system needing access to the data to perform some useful function. We’ll briefly visit both scenarios, but let’s start by looking at the user (interface) scenario. See below.


In this case a server-side user interface (UI) fronts a backend database, offering users the ability to perform a function and then persist the result back to the database. Initially, it's really just a thin veneer onto the database, with nothing between.

Things go well. New forms can be swiftly created, exposing data with minimal fuss, users are happy, and the business sees fast turnaround (TTM) with minimal spend (ROI). What’s not to like? Unfortunately Aesthetic Condescension has us in its grasp and we have lost our impartiality - “I can see it right there,” says one executive, jabbing a finger at the screen, “there’s nothing more to do...”. It's a very powerful argument, and one that is hard to counter. So, despite the disquieting rumours that something isn’t quite right, we find this approach gaining more and more traction.

Of course, before long there’s a need to write business logic - and there being nowhere else to put it (ignoring stored procedures), it must go into the user interface code. We also find a need to create a second (and third) UI to support other clients as they learn about the new tool. See below.


BESPOKE UI’S

Another problem I regularly see relates to the client services business model, and bespoke user interfaces. It often seems easier to duplicate a UI and then modify it for bespoke needs, than to create a generic solution and add branding. However, this is a mistake - business logic must now be repeated in two places, as must any bug fix or vulnerability patching.

The UI becomes bloated and we begin to see a pollution of responsibilities (due to a poor Separation of Responsibilities), thus increasing complexity. There’s also a duplication of effort, as the team makes the same change across the (three) UIs. This creates Change Friction, leading to an increase in delivery times. The team begins to complain about their workload and lack of staffing, and the business (incorrectly) rushes to recruit more staff (they should assess the underlying cause, which is less about needing more staff in the extremities, and more about applying them in the right areas).
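As a minimal sketch of the remedy (hypothetical names), the rule lives once in a middle tier and every UI - branded or otherwise - simply calls it, rather than each form owning its own copy of the logic and its own database access:

  public class DiscountRules {
      // Middle tier: one place to change, test, and patch the rule.
      public double discountedPrice(double listPrice, boolean loyalCustomer) {
          double discount = loyalCustomer ? 0.10 : 0.0;
          return listPrice * (1 - discount);
      }
  }

  class CheckoutForm {
      private final DiscountRules rules = new DiscountRules();

      void onSubmit(double listPrice, boolean loyalCustomer) {
          // The form renders a result; it doesn't own the rule, nor talk to the
          // database directly.
          double price = rules.discountedPrice(listPrice, loyalCustomer);
          render(price);
      }

      void render(double price) { /* UI concern only */ }
  }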

FUNCTION SHAPES & SIZES

To my mind, this rhombus shape (below) nicely models the relative sizes of the three main functions comprising the three common application tiers:

  1. UI - a relatively small function that integrates with back-end applications.
  2. Business Tier (Applications & Data Engineering). Where most of the action is, and therefore most of the staffing need. The business tier is (almost always) the most complex, and therefore requires the most staff, which is why it typically dwarfs the other two functions.
  3. Data storage (I mean administration, not data engineering). A relatively small function responsible for ensuring the databases are healthy.

It's now quite common to find all three representatives embedded in a single team.


My point? If there’s too much focus on the extremities (UI, or DB), then it might be worth considering why that is. A sudden spike of effort in one extremity may suggest a problem with focus (or strategy), rather than staffing needs.

Unfortunately, our problems don’t end there. By modelling the UI almost exclusively on the database model, we're also finding that our user experience suffers. Forms become glorified representations of the underlying database table (Don’t Expose Your Privates), and the user flow (journey) is heavily influenced (modelled) by how the internal data model relationships are navigated. We’ve forgotten to drive a top-down user experience.

And finally for the coup de grâce... One of the most changeable aspects of any software product is the user interface. What is modern one day, is tired only a short time later. There are many good reasons to modernise a UI, including Aesthetic Condescension on the part of your customers (new means good to many), internal stakeholder Bike-Shedding (they can't necessarily offer their opinion on system internals but they sure can on its UI aesthetics!), shifts in UI technologies, and more focus on mobile devices. A business regularly needs to modernise its product UI to present modernity to the market, increase sales revenue etc. In our case though, that opportunity is long gone - there’s too great an investment in the existing user interface to simply recreate it. This Change Friction has created a very serious Agility issue.

So, how does this happen? Well, the middle tier (e.g. service) is traditionally the hardest (longest) to create. Given the choices described earlier, some business leaders might ask why bother? e.g. “I can get the same functionality in half the time, simply by linking the UI up to the database.” This is certainly true, but caveat emptor! It’s a form of corporate Circumvention which may cause us to fail to adequately represent some very important qualities:

SUMMARY

Direct-to-Database may well solve immediate TTM and ROI needs, but - to my mind - it’s often just a form of (corporate or technical) Circumvention, offering little in the way of Agility (e.g. difficulty in rebranding your product, or scaling up your business), nor a viable route to Sustainability.

There’s a reason why Three-Tiered Architectures and (more recently) Microservices have been so successful. They’re a conduit between the user (UI) and the data (database), adding an important (some might argue vital) ingredient. Circumventing an entire layer - for the sake of immediacy - is just a problem waiting to happen.

Whilst I see this approach used less with modern solutions, it still happens, generally as a response to TTM pressures, or due to some habitual use of the practice (the “they did it before me, so it must be ok for me to do it” argument - 21+ Days to Change a Habit). For instance, I commonly see legacy systems built to shift data to many other consumers using ETLs (ETL Frenzy), directly coupling the (E)xtract piece to another party’s dataset, data model, and responsibility. This is a form of Domain Pollution, often hampering Evolvability, and in more severe cases creating a large Blast Radius for even the simplest form of change (Change Friction and Stakeholder Confidence). Tread carefully...

FURTHER CONSIDERATIONS

THE SENSIBLE REUSE PRINCIPLE (OR GLASS SLIPPER REUSE)

The Sensible Reuse Principle is my response to something I see far too often in software development - the inappropriate shoehorning of a new function into an existing feature that isn’t designed to fit it.

GLASS SLIPPER REUSE

What better way to start a section than with a fairytale? In Cinderella [1], we find Prince Charming, enamoured by his chance meeting with a mystery girl at a ball, and utterly determined to find her again. His only clue to her identity is her one glass slipper, mislaid as she rushed away. Charming’s ingenious solution is to have every woman in the kingdom try on the glass slipper - the one it fits is his true love.

The slipper travels across the kingdom, and is tried on by every woman, even Cinderella's villainous stepsisters. Determined to be queen, they try every trick in the book, from squeezing and forcing their feet into it, to reshaping their feet to make it fit - all in vain. The glass slipper was made for one, and only fits one foot - that of Cinderella.

The idea of reuse based on fit isn’t always what happens with software. When we see a potential opportunity for reuse, some are very quick to attempt to force or squeeze that solution into an existing model. Merton labelled this the “imperious immediacy of interest” [2]. This works well when they are closely aligned (e.g. similar benefits, behaviour, data models - think of a close intersection), but may be more trouble than it's worth at the other end of the spectrum, polluting the original, creating maintenance and comprehension challenges, or simply creating a Frankenstein's Monster System. Reuse should not be employed solely for the sake of it, but to offer some (preferably sustainable) benefit.

Poor reuse reasoning tends to occur for one of the following reasons:

  1. There’s a driving force making the reuse seem extremely attractive. This may be a delivery schedule (TTM), a spending cap (ROI), capacity limitations, or even political gain (see later).
  2. There’s no deep appreciation for the existing state, whether it truly is a close match for the new function, or for the second and third-order consequences of the decision (Merton’s “imperious immediacy of interest” [2]). In the Cinderella analogy, it's turning a blind eye to the obvious misfit of the stepsisters and allowing one of them to marry the prince.
  3. Bias, face-saving, or political gain. We often see execs/departments vying with one another for the business’ affections in order to get the best (most interesting) projects, or to advance careers. An opportunity to reuse (seemingly) benefits both the provider (offering the service) and the purchaser (business). This is psychology of the kind “We’ve already got this solution (since I was the one who requisitioned it), so we’re going to use it, regardless of its appropriateness.”

SENSIBLE REUSE

Sensible reuse is the ultimate form of TTM, ROI, and Agility, whilst insensible reuse implies the opposite.

Reuse comes in many forms (some far better than others). For instance:

Let’s look at a few examples.

MONOLITH OF MONOLITHS

A Monolith of Monoliths offers a great example of inappropriate reuse. A monolith is atomic. You get all of it, regardless of whether you want it, need it, or you’ve already got it elsewhere. If we find that our monolith doesn’t contain the feature we want, we look elsewhere. We may find that feature within another monolith, so we must find a way to sew it in (integrate).

The result is a binding of several (possibly even hundreds in extreme cases) monoliths to create one giant monolith (a Frankenstein’s Monster System) with many disadvantages, including the Cycle of Discontent.

REUSE THROUGH ITERATION

It's quite normal (and appropriate) to reuse code through iteration, simply by wrapping a single execution of a section of code (algorithm) within a loop, and thus enabling it to execute many times. This approach is powerful and quick to develop - thus rather appealing - however, it still requires some caution.

Consider a user interface (UI) making use of a back-end API to display content to the user. Let’s say it’s used to visualise a software system (e.g. blocks interconnected). See image.


This works very well for a single remote interaction, but can create problems when (a) we approach this style iteratively and (b) that function is always assumed to be independent. The most notable concerns are:

Let’s briefly examine the second point. Say we now want to scale up our visualisation solution (mentioned earlier) to display a much larger system, or a system of systems. We’ll achieve this by wrapping the original remote call in a loop. See below.


What’s going on here? Well, we’ve wrapped a remote API call so it's now executed many times (not necessarily an issue); however, some of those calls have failed (e.g. HTTP 404s or 500s). The common logic we reuse iteratively assumes the user should be informed of any failure (something that makes sense for a single interaction), so it presents an error dialog and asks the user to intercede. That's fine for a single dialog, but not for hundreds - definitely not an intended consequence of this reuse.
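To make the second point concrete, here’s a minimal sketch (in Python, assuming the requests library; the endpoint and names like fetch_block and show_error_dialog are hypothetical) of how error handling designed for a single interaction misbehaves once it’s reused inside a loop, and how a batch-aware variant might report failures just once.

import requests  # assumed HTTP client; any equivalent would do

API_BASE = "https://example.internal/api"  # hypothetical endpoint

def show_error_dialog(message: str) -> None:
    """Placeholder for a modal UI dialog the user must dismiss."""
    print(f"[DIALOG] {message}")

def fetch_block(block_id: str):
    """Original single-interaction logic: fetch one block and, on failure,
    inform the user directly - sensible when called exactly once."""
    response = requests.get(f"{API_BASE}/blocks/{block_id}")
    if response.status_code != 200:
        show_error_dialog(f"Could not load block {block_id} ({response.status_code})")
        return None
    return response.json()

def fetch_system(block_ids: list) -> list:
    """Naive reuse-through-iteration: hundreds of failures mean hundreds of dialogs."""
    return [b for b in (fetch_block(bid) for bid in block_ids) if b is not None]

def fetch_system_batched(block_ids: list) -> list:
    """A kinder variant for the iterative case: collect failures, report once."""
    blocks, failures = [], []
    for bid in block_ids:
        response = requests.get(f"{API_BASE}/blocks/{bid}")
        if response.status_code == 200:
            blocks.append(response.json())
        else:
            failures.append(bid)
    if failures:
        show_error_dialog(f"{len(failures)} of {len(block_ids)} blocks failed to load")
    return blocks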

DATA MODEL REUSE

A data model may also be reused. Let’s say the data model is:

X —> Y —> Z

In this case X may have one or more Y’s, which may have one or more Z’s. This relationship is strongly represented (i.e. tightly coupled) in a relational database, using referential integrity. We can’t have a Y without an X, nor a Z without a Y. The user experience to populate this relationship is shown below.


We model our APIs on this same business flow/relationship - the user completes steps X, followed by Y, followed by Z. This approach works well when all of those entities exist, but what happens if we now stumble upon a secondary business flow that only reflects Y and Z? See below.


Note that X isn’t modelled here, because the user doesn’t do it. This leaves us with a predicament - how can we model this (divergent) relationship?:

Yep, there’s no easy answer...
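If it helps to see the predicament in code, here’s a minimal sketch (Python with an in-memory SQLite database; the abstract X/Y/Z names are kept from the description above) showing how referential integrity bakes the primary flow into the schema, and why the divergent Y-and-Z-only flow is rejected outright.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE x (id INTEGER PRIMARY KEY);
    CREATE TABLE y (
        id   INTEGER PRIMARY KEY,
        x_id INTEGER NOT NULL REFERENCES x(id)   -- no Y without an X
    );
    CREATE TABLE z (
        id   INTEGER PRIMARY KEY,
        y_id INTEGER NOT NULL REFERENCES y(id)   -- no Z without a Y
    );
""")

# Primary flow: X, then Y, then Z - fits the model perfectly.
conn.execute("INSERT INTO x (id) VALUES (1)")
conn.execute("INSERT INTO y (id, x_id) VALUES (10, 1)")
conn.execute("INSERT INTO z (id, y_id) VALUES (100, 10)")

# Secondary flow: only Y and Z exist. The schema refuses it outright.
try:
    conn.execute("INSERT INTO y (id, x_id) VALUES (11, NULL)")
except sqlite3.IntegrityError as err:
    print("Divergent flow rejected:", err)  # NOT NULL constraint failed: y.x_id

Relaxing the constraint (making x_id nullable), introducing a placeholder X, or splitting the model are all possible workarounds - each with its own pollution and comprehension costs.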

PLAYING WELL WITH OTHERS

Some time ago, I proposed what I thought to be a decent alternative to an established product pricing feature. It had many positive traits, and I thought it would be eagerly adopted. The problem though, was that it came after an established solution. My proposal gained little traction, partly due to Change Friction, a large Blast Radius, and Domain Pollution from the existing solution.

The selected solution was a hybrid of the existing and new (my proposed) solutions. That was unfortunate - my proposed feature lost some of its vigour as it became entangled in the current position - but it was entirely appropriate from the perspective of the product’s current state.

My point is that no matter how rosy the future looks, you must still consider the existing position (i.e. how the existing consumers will integrate with it). Never hold on to something that is excellent in isolation but doesn’t play well with others.

FIELD REUSE

Data fields are used throughout software (from user interfaces, to APIs, to database tables). They are containers of data, typically of a certain type (e.g. string, integer, date time), and are often validated to meet certain business expectations. Combining them in different ways allows us to create more coarse-grained entities of almost infinite possibility.

So here’s the question. Would reusing a field for an entirely different purpose (giving it two purposes) be more, or less, problematic than introducing a new field? Still unclear? Let’s look at an example.

Assume that our example retailer - Mass Synergy - wants to offer its retail platform to clients. Those clients will create their own branded websites, but use Mass Synergy’s platform and services as a SAAS. Magic Cinnamon - an online cake-baking subscription service - wants a slice of the action (pun intended). During the sign-up process, a Magic Cinnamon customer can explicitly agree to receive a free magazine of recipes posted directly to their door. Of course this function is not catered for in the product (it has never even been considered).

This is a very specific (highly bespoke) requirement - and a typical problem for businesses that want to build a generic product yet still support bespoke client requirements. Mass Synergy doesn't want bespoke tailorings polluting its product (and thus limiting its reach and ROI), but it still wants the custom. The business and technical leads from both sides get together to discuss their options, and eventually come up with the following solution.


Magic Cinnamon will continue to use the APIs as-is, whilst the Mass Synergy team will introduce a new (bespoke) nightly extract process that identifies new registrations and sends them on to fulfilment. But how will they indicate a fulfilment request? During project inception, the teams found that Magic Cinnamon didn’t require all of the existing registration API fields. For instance, the giftedBy field typically holds the name of the person who gifted this item, but the concept of a gift doesn’t make sense here. The initial plan was to ignore (not send) this field, but they now agree to reuse it for recipe fulfilment (by supplying a value of “MAGAZINE” to indicate that the recipe magazine should be posted). See below.

{
  "id": ...,
  "surname": "Jackson",
  "email": "supersuejackson889@zizzy.com",
  "activationDate": "16/09/2021 15:09:34",
  "giftedBy": "MAGAZINE",
  ...
}

The nightly extract job identifies any Magic Cinnamon customers that were registered on the previous day with a giftedBy value of “MAGAZINE”, and forwards them on to fulfilment.
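A minimal sketch of that extract job might look something like the following (Python; fetch_registrations and send_to_fulfilment are hypothetical stubs standing in for the registration store and the fulfilment provider).

from datetime import date, timedelta

def fetch_registrations(client: str, registered_on: date) -> list:
    """Stub standing in for a query against the product's registration store."""
    return [
        {"id": 1, "surname": "Jackson", "giftedBy": "MAGAZINE"},
        {"id": 2, "surname": "Okafor", "giftedBy": None},
    ]

def send_to_fulfilment(registration: dict) -> None:
    """Stub standing in for the call to the fulfilment provider."""
    print(f"Fulfilment requested for registration {registration['id']}")

def nightly_magazine_extract() -> None:
    """Identify yesterday's Magic Cinnamon registrations that smuggled the
    fulfilment request into the repurposed giftedBy field."""
    yesterday = date.today() - timedelta(days=1)
    for registration in fetch_registrations("magic-cinnamon", yesterday):
        # 'MAGAZINE' here means 'post the recipe magazine', not 'gifted by MAGAZINE' -
        # hence the comprehension cost discussed below.
        if registration.get("giftedBy") == "MAGAZINE":
            send_to_fulfilment(registration)

nightly_magazine_extract()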

Ok, so let’s discuss this approach. Firstly, would anyone really do this, and if so, why? Yes, it does happen. I’ve seen it used a handful of times [3]. The main cause is a lack of Control (typically over something others consume, like APIs), and a wish to avoid Blast Radius. You don’t want to (or can’t) change an API, model, or UI, but you still need to support a requirement never previously considered. Effort isn’t the only concern either - consideration should also be given to delivery timelines. Not everyone has Continuous Delivery Pipelines, automated Regression Testing, or regular releases. There are many causes of Change Friction.

API CHANGE FRICTION

Without careful consideration, sensible SLAs, and Evolvability in mind, APIs can be horrendously difficult to change - mainly due to consumer life cycles. Any new version can be fought against (Change Friction) and requires persistent consumer coordination. Ideally this type of bespoke requirement would have been anticipated (such as by attaching metadata), but it often isn’t (and wasn’t in our case), which leads us back to the initial reasons for avoiding change in the first place (Blast Radius).
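For illustration only, anticipating bespoke needs might have looked something like this: an open, namespaced attributes map on the registration payload (the attributes field and its naming are my assumption, not something the product described here actually offers), leaving giftedBy with its intended meaning.

# A hypothetical, evolvable registration payload sketched in Python. Client-specific
# needs live in an open 'attributes' map rather than being smuggled into an
# unrelated field.
registration = {
    "id": 42,
    "surname": "Jackson",
    "email": "supersuejackson889@zizzy.com",
    "activationDate": "16/09/2021 15:09:34",
    "giftedBy": None,  # retains its intended meaning
    "attributes": {
        "magicCinnamon.recipeMagazine": True,  # bespoke flag, clearly namespaced
    },
}

def wants_recipe_magazine(reg: dict) -> bool:
    return bool(reg.get("attributes", {}).get("magicCinnamon.recipeMagazine"))

print(wants_recipe_magazine(registration))  # True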

Let’s now consider our wins and losses. We’ve certainly saved ourselves the trouble of changing something that can be notoriously difficult to modify. But what’s been lost? Firstly, there’s increased confusion and lowered comprehension - for both parties. We’ve diverged from the intended use of the API, and created a couple of Knowledge Silos in the process. Secondly, where does it stop? Will this same approach be reused on other clients, for other bespoke needs? Thirdly, now that we’ve stuffed arbitrary data into a generic product location, we must keep the reporting team abreast of these specific requirements to ensure they don’t include this in generic reports. We’re building a product, but we’ve not (really) delineated between what’s a product, and what’s bespoke.

REUSE ABUSE

I’ve also seen a product identifier used both to identify a product and to identify a discount code, which hampered the client’s ability to modernise its database technology.

SUMMARY

This chapter was longer than I anticipated. I hope though that it demonstrates that not all forms of reuse are welcome, and some forms should definitely come with a caveat emptor. Sensible reuse is the ultimate form of TTM, ROI, and Agility, whilst insensible reuse implies the opposite.

FURTHER CONSIDERATIONS

“DEBT” ACCRUES BY NOT SHARING CONTEXT

Debt in software manufacturing doesn’t just accrue from poor or incomplete software solutions, but also from an inability to share context (Shared Context). Context that is shared reduces Single Points of Failure, and increases Flow and Agility.

DIVERGENCE V CONFORMANCE

I’ll keep this short. A variation - or a divergence - from the norm creates specialisms, misalignment, and silos of expertise (Knowledge Silos). Uniformity - or a conformance - brings alignment and unity, as all representatives work from the same manual.

Conformance suggests “knowns”, produces similar results, scales well (more staff), and is easier to estimate (you’ve already done it multiple times). Divergence suggests “unknowns”, may produce unexpected results, scales less well, and is harder to estimate.

Divergence promotes Flexibility (choose whichever technology/approach you like) and supports Innovation, whilst conformance promotes Sustainability (choose from a wide range of staff) and makes that innovation a standard.

You need both in a technology business.

FURTHER CONSIDERATIONS

WORK-IN-PROGRESS (WIP)

Imagine that you work as waiting staff in a restaurant. Few customers visit before the peak times, so there are few “covers” (low WIP), and it's easy for staff to be attentive and offer a great service.

However, as the day wears on we see a big increase in the number of covers - often surpassing staff capacity - and the system begins to break down. Service slows, mistakes are made, food is produced too soon, or left too long and gets cold, the drinks order is wrong, staff are forced to Context Switch, and both customers and staff get irritated. Often, the deterioration is noticeable, so it's likely you’ve experienced it yourself.

Too much WIP creates concerns around Circumvention, Expediting, wait times, and isolated thinking:

Whilst my analogies have mainly revolved around the hospitality sector, we see the same issues in software. Lots of WIP creates specialisms (e.g. Indy Dev), increases the likelihood of Expediting, increases wait times, thus reducing Flow, and facilitates Circumvention.

REVERT TO BATCHING

A system always has a bottleneck. In the hospitality analogy it might be the bar staff, the maître d', waiting staff, chefs, or the pot-wash. To be effective and offer a great service requires everyone across the whole system to work together - not just individually. This becomes increasingly difficult as WIP increases.

Batching is a common coping mechanism for pressure and isolation. For instance, if the kitchen receives three covers (totalling six Lobster Thermidor, two Chicken Chasseur, and one Fillet Steak), then there’s a danger of batching all the lobster - meeting an individual commitment, but missing the goal (mistiming the remaining dishes in each cover).

Batching may increase WIP woes, not lower them, as products are discarded (due to timing), are returned for rework, or simply take forever to release value to the customer.

SUMMARY

Too much WIP pleases no one. Not your customers - who get a substandard service or product - nor your staff - who take pride in their work and don’t want to be under constant pressure.

WIP in any system (including software) should be carefully managed. It should be enough to satisfy demand, but not so much as to flood the system to the detriment of all.
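In software, this management often takes the form of explicit WIP limits on a board or queue. Below is a minimal sketch (Python; the Column class and its naming are purely illustrative) of a column that refuses work beyond its limit, surfacing the bottleneck rather than silently flooding it.

class WipLimitExceeded(Exception):
    pass

class Column:
    """A minimal sketch of a Kanban-style column with a hard WIP limit."""
    def __init__(self, name: str, wip_limit: int):
        self.name = name
        self.wip_limit = wip_limit
        self.items = []

    def pull(self, item: str) -> None:
        if len(self.items) >= self.wip_limit:
            # Refusing new work is the point - it surfaces the bottleneck
            # instead of silently flooding the system.
            raise WipLimitExceeded(f"{self.name} is at its limit of {self.wip_limit}")
        self.items.append(item)

in_progress = Column("In Progress", wip_limit=3)
for ticket in ["T-1", "T-2", "T-3", "T-4"]:
    try:
        in_progress.pull(ticket)
    except WipLimitExceeded as err:
        print(err)  # T-4 waits upstream rather than overloading the team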

FURTHER CONSIDERATIONS

"CUSTOMER-CENTRIC"

It seems to be all the rage for modern businesses (or ones aspiring to modernity) to suddenly declare themselves “customer-centric”. It makes you wonder what they were doing before this... Visit many corporate websites (and their corporate strategy) and you’ll see a liberal sprinkling of such terminology.

Whilst placing the customer first seems rather obvious and unremarkable, these statements may obscure something less appetising. However, to understand this, we must first discuss and compare the customer consumption models of yesteryear (which many established businesses have great familiarity with) and today.

If we were to travel back twenty years, we’d find a very different social and technological picture from the one we see today. One not long after the arrival of the internet, where Web Services were nascent, mobile technology was still in its infancy (I could phone or text people but not use the internet), there were only the first stirrings of a social media revolution, and information infrastructure was patchy.

In the UK (at least), there was still a strong high-street presence. If you wanted to open a bank account, you visited your local high-street branch and arranged its opening. It might take weeks, but you expected it, because you were acclimated to it. If you wanted a new washing machine, you visited your local electronics store. They’d offer you a limited set of delivery slots and expect you to be in all day to receive the item. If you wanted to watch a film, you’d rent or purchase the physical DVD from your local store. Customers continued to conduct business with their current energy utility providers and banks because there was little alternative and change was too hard.

Global online retail was in its infancy. There were no streaming services, market comparison sites, next day delivery as standard, or prompt and accurate delivery status notifications (a great feature). You see where I’m going with this? Expectation, and thus consumption, was dictated by local market factors - driven less by what customers really wanted, and more by the constraints imposed upon them.

The business practices, systems, communication mediums, and even staff working locations and hours - all formed twenty (or more) years ago - aren’t a good fit for modern expectations. Yet they’ve been inculcated and embedded into the culture of established businesses over many years, to the extent that the scale of change required to modernise is extremely challenging (21+ Days to Change a Habit). Much has changed since then, placing a greater focus on customer-centrism and challenging the way businesses function, including:

I’ll describe each next.

RAVENOUS CONSUMPTION

I’ve already talked (Ravenous Consumption) about how consumerism has both increased our rate of consumption and increased our expectations. It's also made us less tolerant of slow, or late, deliveries.

Modern consumption habits were somewhat stoked by the technical innovations that came out of the Internet. No longer were we inhibited by local limitations, but we could now access a far wider, global, set of services brought to our door. A customer-centric position is one that delivers at a pace appropriate to customer expectations.

GREATER COMPETITION

Competition for custom, which was once more localised, is now global. If customers don’t find a product or service locally, they’ll look further afield. This makes every customer on-boarding onto your platform a small victory. But you must also keep them. That may involve regular improvements, building things they want, testing hypotheses, employing Fast Feedback, and Continuous practices.

DATA-DRIVEN DECISION MAKING

Remember the days before social media, data pipelines and warehouses, and Google Analytics? In those days, as a developer, it felt pretty rare to get the opportunity to talk directly to the customer. For me, requirements were fed through a series of intermediaries (who sometimes added their own expectations), before being put in front of us for implementation. It also felt quite common for executives to formulate their own interpretation of what the market needed, without necessarily consulting that market.

I would argue that there was (and sometimes still is) too much subjective decision making. Assumptions became commonplace due to the number of intermediaries between the customer and engineer, the lengthy release cycles typical of that time (e.g. quarterly), and fewer options to capture metrics. Simply put, there were fewer opportunities to get objective, swift, and regular feedback.

But times are changing. There are many ways to capture data about your customers, and modern practices provide the means to deliver value swiftly. Feedback loops and empirical evidence are driving out decisions that may once have been tinged with bias and subjectivity. The challenge is no longer how to capture this information but how to sift through the vast quantities of captured data.

CONTINUOUS PRACTICES

Practices like Continuous Integration (CI) and Continuous Delivery (CD) help to put the customer back where they should be - at the forefront of your decision-making. Whilst not necessarily true of all, a significant proportion of internet-facing businesses are embracing these principles.

SOCIAL MEDIA

Social media didn’t (really) exist twenty years ago, so it wasn’t a pressing concern to businesses. No longer. Social media provides a quick, cheap, and expansive medium for customers to air their views on your product, service, delivery efficiency (e.g. late or lengthy deliveries), or customer experience. Social media affects branding (Reputation) in both positive and negative ways.

THE CLOUD

One might say that the Cloud was also spawned from such an environment. The customer’s desire for greater pace, flexibility, and better pricing efficiency was key to its creation. It’s very customer-centric, offering a wide range of services to choose from that are quick to provision, (relatively) easy to integrate, and aren’t charged for if they remain unused (e.g. Serverless).

AGILE PRACTICES

Nowadays, most of the customers I speak with expect regular releases that show progress, and enable them to steer things in the right direction. This is key to Agile delivery, and something Waterfall doesn’t offer. Agile is all about customer-centrism. You can find more information here (Agile / Waterfall).

ILL-CONSIDERED GROWTH

A strategy of growth (e.g. Economies of Scale) - typically executed through a series of mergers and acquisitions - that fails may impede customer-centrism, as the business loses sight of its existing customers in its quest for growth, becoming overly outward-facing.

FAILURE TO GROW

Whilst this practice doesn’t fit directly into customer-centrism, the outcome of a poorly executed growth strategy most certainly does.

This problem will be familiar to those who’ve witnessed the epic rise and fall of businesses who either (a) grow at a rate faster than they can sensibly consolidate, or (b) fail to undertake sufficient due diligence on the technologies or practices of another merged business (a failing from that supplier often falls on the parent business’ shoulders). The harsh reality is that they’ve brought about a Self-Inflicted Denial of Customer Service (DoCS) - where the quality, delivery speed, or agility of the product/service being offered to the customer, in return for their loyalty, is impeded - potentially causing reputational damage (Reputation). A lack of Consolidation then leads to massive complexity, brittleness, a large Blast Radius, and containment challenges. It creates Change Friction at both the technical and cultural levels (stability usurps change), and eventually reduces customer satisfaction (something you can ill afford in a globally competitive market).

If this strategy fails, then that business may return to the inward-facing outlook that has worked for it in the past, by refocusing on existing customers. Yet this is difficult if it has accrued a lot of baggage during the growth stage.

SUMMARY

Being customer-centric in today’s climate is quite different to what it was twenty years ago, and requires businesses to be extremely agile, fast, and metrics driven. The benefits can be great (e.g. global reach), but so too are the challenges, especially for established businesses.

FURTHER CONSIDERATIONS

BRANCHING STRATEGIES

Suggested Reads: "Customer-Centric"

Aligning your code branching strategy around your business goals and aspirations is another important consideration.

“Hold on,” I hear you cry, “how can a code branching strategy relate to a discussion around strategic goals?” I’ll elaborate shortly, but first let’s discuss the qualities many customers now expect of modern businesses.

If we revisit modern Customer-Centrism, we find a synergy with many of these forces:

Whilst there are several branching strategies, in reality there are only two models:

We’ll visit both.

(FEATURE) BRANCHES

With feature branches, each feature is worked on in isolation (on a branch), and only merged back once work is complete. See below.


In this case Asha creates a branch for Feature A and begins working on it. Whilst Feature A is being worked on, Sally finds she also needs to make a change, so she creates another branch (Feature B). Timing is important here. Since Feature B was created before Feature A was complete, it doesn’t contain any of Feature A’s changes. Finally, Feature A is completed, (technical) code reviewed, and then merged back into the master branch, where it’s now accessible to everyone. Sally may now pull in those changes when convenient, and eventually merge Feature B back into the master branch.

Feature Branches have the following traits:

TRUNK DEVELOPMENT

The alternative to branch development is trunk development, where no branching is involved. The key distinction is that committed (pushed) code is immediately integrated into the current working baseline - called the “trunk”. See below.


Note that changes occur on an (almost) ad-hoc basis, and that there’s no intermediate management layer (a branch) where code may sit for a while with impunity (Manufacturing Purgatory). This might seem strange (and disturbing to some) at first glance, being quite different to branching; however, bear with me.

At the start of this section I described how modern businesses look to TTM, fast feedback, early alignment, automation, Continuous practices, and agility for customer-centrism. In contrast to a branch - which may be radically different from the current baseline, and therefore harder to contextualise - work on the trunk is regularly reintegrated back into the current baseline, requiring far lower effort to contextualise (due to its low divergence). You’ve also checked whether it plays well with others, and not waited for a big-bang integration. This has the added benefit of Fast Feedback, both within the team - who all get early sight of the change and can assess/refine it, and also from the customer - value is delivered (almost) immediately to the customer, who can then determine if it’s what they really want.

Trunk development has its own challenges - otherwise everyone would be doing it. To some teams I suspect it’ll never be more than an aspirational state. For instance, the Safety Net of a (technical) code review, typical of branch development, can’t be applied in the same fashion, due to the ad-hoc nature of commits. Different forms of safety net must be employed, typically through (a) a high degree of test automation (over 80% coverage is a popular - if arbitrary - measure), (b) TDD - which plays on the first point, (c) Pairing/Mobbing practices (which - to my mind - are a better substitute for a formal technical review), and (d) Quality Gates.

Finally, consideration should also be given to promoting unfinished software - code that’s syntactically correct and executable, but not quite ready for users to access. Feature Flags and/or Canary Releases are commonly used here.
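As a flavour of the first option, here’s a minimal feature-flag sketch (Python; reading the flag from an environment variable is simply an assumption - real systems often use a flag service or configuration store). The unfinished code is merged to trunk and deployed, but stays dark until the flag is switched on.

import os

def feature_enabled(name: str) -> bool:
    """Check a (hypothetical) environment-variable-backed feature flag."""
    return os.environ.get(f"FEATURE_{name.upper()}", "off") == "on"

def render_new_visualisation() -> str:
    return "new visualisation"       # merged and deployed, but dark by default

def render_classic_visualisation() -> str:
    return "classic visualisation"

def render_dashboard() -> str:
    if feature_enabled("new_visualisation"):
        return render_new_visualisation()
    return render_classic_visualisation()

print(render_dashboard())  # 'classic visualisation' until FEATURE_NEW_VISUALISATION=on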

SUMMARY

The decision as to whether to branch has implications that you might not expect, including how and when you deliver software (value) to your customer, and how successfully context is shared across a team (Shared Context).

To my mind, branching is an isolationist strategy, and as such it may create:

But it can also be a good way to experiment, and is a well-established practice.

Whilst trunk development is in many respects a better approach for achieving modern customer-centrism, it’s also much harder to adopt. Even with supporting practices - like TDD and Pairing/Mobbing - it's fair to say that it requires a greater amount of cultural reform.

FURTHER CONSIDERATIONS

LIFT-AND-SHIFT

Lift-and-Shift is the practice of lifting an existing solution, and shifting it to another environment in order to leverage some of the benefits of that new environment. It’s a term most commonly associated with migrations from on-premise to the Cloud.

Lift-and-Shift is a tantalising prospect - particularly to business and product owners, but possibly less so to the technology teams maintaining it. It suggests that the lifetime of an existing product can be extended, and the overall offering improved (at least in some way), simply by shifting it onto another platform. However, this is only a partial truth.

Many of the solutions earmarked for Lift-and-Shift are legacy (e.g. Antiquated Monoliths). As such, they come with legacy problems (e.g. inflexible technologies, poor practices, manual Circumvention, security concerns, slow delivery cycles, and costly scaling) which when unresolved are again reflected in the lifted solution. There’s no such thing as a free lunch.

FURTHER CONSIDERATIONS

TEST-DRIVEN DEVELOPMENT (TDD)

In this section (Why Test?) I described why we test software. Testing isn’t solely a means to ensure quality, it’s also an important practice to enable fast change and feedback, and to promote Innovation. Testing also helps to identify and resolve defects early on in the release cycle: “The later that we check for quality, the costlier a breakdown in quality is. Both monetary, and time. Additionally, a lack of (sufficient) tests/testing may also risk reducing business Agility, and impede our ability to Innovate.”

One of the most cited studies on the financial costs of defects (originally by IBM) indicated that fixing a defect in production could be four or five times more expensive than fixing the same defect in development. [1] Therefore, we can make the following deduction. If software has fewer defects (i.e. better quality), or if they’re caught earlier in the development lifecycle, then we can reduce costs. This is something TDD can support.

So what is Test-Driven Development (TDD)? Well, to understand TDD’s usefulness, we must first step back in time.

WHY TDD?

When I first started out in the industry, software was built using a very formal waterfall process (Waterfall). Business requirements were captured in an enormous requirements document that was subsequently translated into a more detailed design and specification. Those designs would then be implemented, before being packaged and deployed for formal system testing. In those days there were very formalised delineations of responsibility.

As developers, we would build each function to specification, undertake some manual tests (based upon another formal test specification document) to get a general feel for its quality, and then throw it over the fence to system test (often weeks or months later) to prove its correctness. I wrote very few tests, although I did manually undertake many. You won’t be surprised to know that this was a very long and convoluted process, often creating Ping-Pong Releases. Alongside these development practices were large monolithic deliveries (Lengthy Release Cycles). Feedback was hampered by (amongst other things) manual testing, long wait times, large batches per release, and an over-the-fence hand-off mentality. The take-away for this section though, is that significant testing was undertaken after the software was built.

Equally concerning was the (released) code quality across the business. The code tended to be overly procedural (even though we were employing object-oriented practices), resulting in Fat Controllers and God Classes that were very difficult to maintain. What I didn’t know then, but I do now, was that it was heavily influenced by the testing approach. In some respects, code that is only tested at a coarse-grained feature level is more likely to be messy (even with code reviews), because only the externally perceived quality is being verified.

Fast forward maybe five years and I began to learn about unit testing. These unit tests were great - they helped to ensure the software I wrote functioned - but we were still treating them as second-class citizens, written only after the software under test was implemented. System testing was still heavily involved in verifying the quality of the implemented features.

However, whilst I was hidden away in the depths of conditional statements and for loops, something else was happening in our industry. The Internet was driving new consumer habits and expectations (Ravenous Consumption). It caused us all (as it continues to) to reassess how (and how quickly) we deliver customer value (Customer-Centric), led us to reconsider our existing, established practices, and was highly influential in the industry’s Shift Left policy. See (a simplified view) below.


This policy caused a gradual blurring of what were once distinctive roles and delivery stages. Developers and testers got invited to requirements gathering sessions - once only the mainstay of customers and Business Analysts - to hear first-hand about customer needs and begin a dialogue. A major shift left. Some aspects of testing - once thought to be a practice solely for testers and of finished software - began to be embedded within the development lifecycle and undertaken by developers. Another shift left. Deployments and operations - historically the responsibility of a centralised operations/sysadmin team - began to be undertaken (with the help of automation) by developers and testers, following the “You build it, you run it” mantra. Another shift left. Even the premise of Cross-Functional Teams is to short-circuit established silos, improve communications, and (you guessed it) support Shift-Left. Each step was an evolution, meant to enable faster delivery and feedback, yet still retain (or improve) quality, and mainly to support modern customer expectations. TDD fits well with this idea of shift left (as does BDD, which I’ll discuss in a future chapter).

Test-Driven Development then is a practice for building better quality software, through the use of a lifecycle that promotes incremental change and a test-oriented delivery model. It’s another element of what I term FOD (Feedback Oriented Delivery). TDD’s benefits include:

THE TDD LIFECYCLE

The practice of TDD is founded on a simple (yet powerful) incremental change lifecycle. See below.


The stages are a repeatable pattern of:

  1. Write a failing test. This stage gets you thinking about the goal and a testable outcome. It’s more concerned with the contract over the implementation. The test represents the specification of what should occur (and can sometimes replace requirements documents). It tests the outcome, not how you get there. The test should fail, enabling it to be resolved in the next stage.
  2. Make the test pass. The test should be made to pass by implementing the code to produce the desired outcome. This should be done as quickly as possible (no procrastination); i.e. we find the simplest thing we can do to make the test pass, thus creating a Safety Net.
  3. Refactor the implementation (and test) code as required. We can do this because we created a Safety Net (something of great value) in the last stage, allowing us to refactor (if necessary) our original implementation into something more robust, performant, cohesive, maintainable etc. This stage is important and should not be circumvented, lest we end up with a major (big-bang) refactoring at the end - certainly not our intent. If pairing is being employed then this is also an opportunity for an informal (ad-hoc) code review.

It’s worth reiterating that TDD is less about the tests that are written and more about writing software that is highly testable. It's the repeated application of the TDD lifecycle that provides the improvement framework. TDD is not the same as a test-first strategy, where a host of tests are written up front, prior to any implementation code. All of the artefacts (code and tests) built with TDD are written in lock-step to ensure (a) Shared Context is maintained, (b) feedback is swift, and (c) that quality is built in incrementally.

SUMMARY

That all sounds great, and it mainly is. But we should also discuss why and when TDD becomes less suitable. For instance, I wouldn’t use TDD when embarking on highly innovative work containing many unknowns. If a man is lost in the woods, unsure of the direction to his salvation, then lugging around a heavy sack of gold (the tests) quickly becomes a hindrance. In these circumstances, TDD may be unwelcome.

There’s also a point around short- vs long-term delivery, and sustainability. Let’s say you’re creating a prototype for an imminent trade show. It’s not meant to be a finished product, just one that can demonstrate capabilities. There’s little time to waste, so (a) why spend time and money testing a throwaway item, and (b) should you really invest an additional 15%-35% of development effort when there are more pressing concerns? Additionally, are you really going to benefit from increased internal quality in this scenario (Value Identification)? Of course not.

Of course this is a very short-term view, one (rightly in this instance) inconsiderate of medium-to-long term productivity and sustainability. TDD comes into its own when you consider the repeated (and sustained) delivery of features - easing every future change, providing a Safety Net, and supporting Innovation. TDD makes change sustainable.

FURTHER CONSIDERATIONS

DECLARATIVE V IMPERATIVE LEADERSHIP

In this section (Declarative v Imperative) I described how the declarative v imperative model fits in relation to systems and development. But it need not stop there - this thinking also fits well with a leadership style.

Let's start by describing the two leadership approaches:

Fundamentally they’re different ways to achieve the same output (the completion of a project or task), yet they typically have different outcomes (such as on sustainability). Let’s look at them.

IMPERATIVE LEADERSHIP

This approach is mainly driven by one person telling others exactly what to do, how to do it, and the order in which to do it. The leader is (probably) deeply involved in the implementation work, and therefore involved in (and responsible for) how the outcome is reached. See below.


The red circle represents the amount of effort the leader is expected to expend supporting that team. Note how it almost entirely fills the “Leader” circle on the right. There’s little time to undertake other work as a significant chunk is taken up with team interactions. Let’s now look at what happens when that leader has other responsibilities (such as supporting multiple teams, or other strategic work). See below.


The red circle continues to represent the interaction between the red team and the leader. However, we also have two further interactions. The green circle represents another team (the green team), also requiring support from that leader, yet the leader isn't in a position to help. The purple circle is another stream of strategic work the leader is also expected to undertake, but again, is unable to tackle. What’s happened here is that the leader is so flooded with imperative requests from one (red) team that they are unable to fulfil their other obligations.

LEADERSHIP CONSTRAINTS

This harks back to the Theory of Constraints. In this case the constraint is the leader, so everything depends upon both their availability, and their capability (abilities and knowledge).

If I were to use an analogy, imperative direction would be the guardrails employed in tenpin bowling to prevent the ball from going down the gutter. These guardrails are failure prevention mechanisms that may be used to build up confidence, particularly for new players. What they’re not though is a sustainable way to learn and master the game. At some stage they must be lifted. In software terms, the team employing a guardrail (a leader in imperative mode) should be using it to familiarise and improve themselves, in preparation for that guardrail being lifted.

The imperative mode then suits activities where the team may be inexperienced in that area and need support, but it can also be extremely useful when timelines are very tight and the outcome is business critical, typically because it gets results sooner. So why would anyone consider the alternative?

Well, it falls back on the old argument of short-term benefit v long-term sustainability (such as Unit-Level Productivity v Business-Level Scale). Whilst the upside of imperative is indeed swift delivery, the downside is that - dependent upon the leader’s outlook - the team isn’t truly learning the skills needed to succeed, and they remain strongly reliant upon the leader’s guidance. A leader more invested in the short-term - or in creating a name for themselves - may create a situation where the guardrails can never be lifted. Additionally, the team may be learning bad habits - an expectation is raised that undertaking any critical activity requires the leader to be deeply involved in the minutiae. To me, this isn’t leadership. Move at pace, but don’t allow that pace to create a Single Point of Failure that hampers sustainability.

SUPPORTING & DOING

There’s a fine line between leaders offering support, and performing the entire job for another. A leader put in this position must know both their own role and the minutiae of everything necessary to support the team at that granularity. It's not impossible, it's just unsustainable.

There’s also another - albeit more cynical - view. Whilst I don’t consider this widespread, some individuals may relish the possibility of becoming a passenger, of having someone else do the heavy (cognitive) lifting, whilst they only do enough to be seen to participate, aren’t truly engaged, and are learning very little. These individuals may even come to expect it for every other important project.

IMPERATIVE THROUGH NECESSITY

Of course there will be situations that force a leader to behave imperatively, even when it’s undesirable to all. This might, for instance, be a looming deadline, a newly formed team not quite ready to work declaratively, or even an established team with little experience using modern technologies and practices. Even so, a good outcome is one that both delivers the feature and enhances overall sustainability.

IMPERIOUS IMMEDIACY OF INTEREST

I’ve worked with a range of different leaders, including ones who’ve seemed unable to progress past an imperative style. These leaders spent most of their days down in the weeds, directing and micromanaging the minutiae, even when their role was strategic. Worse, rather than being reminded of their other duties, they were often congratulated for their heroic endeavours to (once again) save the day and deliver a feature in record time.

What’s wrong with that? Well, we humans seem to have an inherent weakness to overlook the consequences of “work heroism” - as if there were none - and proffer only plaudits. But… any action has consequences.

I’ve witnessed some of the consequences of overusing this tactic, including:

  1. The strategic work wasn’t being done (in a timely manner), leaving the wider team rudderless, and progress reports rushed or incomplete.
  2. The other teams in the wider group were left short of work, as it couldn’t be proactively prepared for them. This (of course) led to more “heroism” from that individual, except this time it rested with the other team (you see the pattern?). As the fires sprang up, that individual was always on hand to put them out, but no one thought to ask where (and why) those fires had originated.
  3. Team learning was affected as it became a regular occurrence. The leader began to “steal the team’s” learning - always forcing them down their (imperative) route, allowing them no opportunity to identify their own best route.
  4. And most seriously, the team stopped thinking, or asking why, and just started doing everything they were told. Thinking people have passion and motivation. They will go above and beyond because they are engaged, challenged, and they care. People who aren’t challenged, are spoon-fed, and are unable to innovate, typically suffer from low morale. I remember vividly one developer saying that he’d stopped trying to think, and was simply following directions. He left soon after.
  5. The leader didn’t have time (or an inclination) to engage their listening mode, only their broadcast mode. They weren’t deeply engaged in another team’s problems so had no Shared Context. Consequently, they would dive right into the action based upon a murky account of the problem, making false assumptions, and reiterating solutions to problems already solved. This is like visiting a doctor with an ailment and the doctor prescribing a treatment before you’ve described your symptoms.

Put simply, this model doesn’t scale well. Leaders caught up in a repetitive imperative cycle are probably spending so much time supporting, nurturing, and nursing a team that they lose sight of the big picture. Atrophy occurs, and other parts of the body deteriorate. Those leaders begin to work in a short-term mindset. To borrow an old expression, working continuously in this style is like robbing Peter to pay Paul. You gain short-term benefits at a longer-term sustainability cost.

It may also result in a Single-Point-of-Failure, Context Switching, and Expediting. The leader becomes the constraint as a new priority is raised, causing them to switch cognitive effort onto another task and then expedite it, thus creating a vicious circle.

One person's leadership is another’s micromanagement. Micromanaging a capable team will likely create animosity (aimed at you), and you’ll find yourself in their way.

DECLARATIVE LEADERSHIP

The declarative approach is quite different to the imperative model. In this model the leader provides the team with a vision, a general direction, and a desirable outcome; and then lets them figure out the rest. They frame the problem and indicate what’s needed, not how it's achieved.

The challenge here is being explicit enough to ensure that the team builds in the expected qualities (so they are exhibited in the final product), but not so rigid as to hamper them from finding the right solution to the problem. If the leader is too standoffish, insufficiently descriptive, or the problem isn’t fully understood or appreciated, then we may find a poor solution being built.

COMMUNICATION STYLE

Using a declarative mode doesn’t mean no communication, nor a one-off, fire-and-forget communication. It should be an open dialogue, allowing the team to (re)confirm acceptance criteria, broach problems uncovered during the implementation, or discuss quicker ways to gain feedback.

Here’s an example of a leader working in declarative mode.

In this case the leader is much freer (than in the imperative mode) and therefore able to support a much wider range of initiatives. The teams are working more autonomously, and are able to make judgement calls and adjustments about how best to solve the problem.

Declarative, then, isn’t primarily about immediate pace - it’s about sustainability. It works best either when the team is capable (declarative in a highly-effective, self-organised team is nirvana), or when the business is comfortable building up that capability, and willing (initially) to accept longer cycle times. In this second case, progress is initially (note the distinction) slower because the team is learning, through trial and error. However, each subsequent delivery gets faster and faster, as the team benefits from experience, to the point where there is little need for the imperative mode.

SCALING THE MODES

Like with other sustainability-oriented practices (e.g. TDD), you won’t necessarily get immediate benefit from declarative, but you do get scale and sustainability.

SUMMARY

Imperative and Declarative are different ways of leading that solve (but may also create) different problems. There’s a time and a place for both. To my mind, good leaders can work in both styles, and more importantly, know when to use one over the other.

In a sense, these models are similar to conducting (in an orchestra) and choreographing (as in dancing). Imperative is the conductor, helping others produce music. The conductor can make something harmonious, enabling a group to deliver sooner, but the group only learns what the conductor is able (or willing) to teach. Consequently, they are not learning through experimentation (which also involves thinking and making mistakes) - one of the most effective ways to master a skill. Additionally, conducting doesn't scale. A conductor can only conduct one group at a time, so any other groups are demoted. Declarative is closer to the choreographer, who sets the piece and the desired outcome, and then lets the dancers perform and refine it for themselves.

I’ve seen this model time and again. “Myopic leaders” spend so long conducting small groups that there’s little room left for strategy or sustainability - only the same tactics applied cyclically from one group, to the next, to the next. It's good to move at pace, but don’t let it create a Single Point of Failure, and always leave a team in a better place than when you started.

FURTHER CONSIDERATIONS

THEORY OF CONSTRAINTS - CONSTRAINT EXPLOITATION

The Theory of Constraints has one key takeaway. An hour lost to the constraint is an hour lost to the entire flow. Or to put it differently, no matter how many productivity increases or waste reduction exercises are made in other areas, the constraint dictates all - an hour lost there cannot be regained through improvements elsewhere. So identifying, exploiting, and elevating the constraint (and nothing else) is vital - in fact it’s the only way to increase overall capacity and throughput.

I read this over and over and over; and whilst I understood what was being stated, I found it hard to internalise. For me, it felt like some sort of Cognitive Bias was working against me - I understood it, but I struggled to adopt it. Maybe I think differently to others, but I prefer visual descriptions (which is what you’ll find here) that I can return to regularly if I need reassurance, or if I’m demonstrating the principle to others. I hope you find it useful.

In this section I visually describe three examples:

  1. Exploit and elevate a workstation prior to the constraint; i.e. what happens if we improve a workstation prior to the constraint?
  2. Exploit and elevate a workstation after the constraint; i.e. what happens if we improve a workstation after the constraint?
  3. Exploit and elevate the workstation with the constraint; i.e. what happens if we improve the workstation with the constraint?

ELEVATE STATION PRIOR TO CONSTRAINT

If we elevate a workstation prior to the constraint (which is B), we get the following outcome:

Time Unit | A’s Capacity | B’s Inventory | B’s Capacity | C’s Inventory | C’s Capacity | Throughput | Overall Inventory | Done
0 | 5 | 0 | 1 | 0 | 5 | N/A | 0 | 0
1 | 5 | 4 | 1 | 0 | 5 | 1 | 4 | 1
2 | 5 | 8 | 1 | 0 | 5 | 1 | 8 | 2
3 | 10 | 17 | 1 | 0 | 5 | 1 | 17 | 3
4 | 10 | 26 | 1 | 0 | 5 | 1 | 26 | 4

Note that this may be any workstation before the constraint, not necessarily the one immediately prior.

In this case our constraint is workstation B (marked in red). At Time Unit 3, something is done to increase workstation A’s capacity (essentially doubling its output from 5 to 10 units). What’s interesting is the overall flow outcome (the Throughput and Done columns). We find:

  1. It made no difference to our throughput.
  2. We have increased the amount of inventory (waste) in front of our constraint. It’s now increasing by a rate of 9 units rather than 4.

If we were to pursue this approach further - and assuming no other variances (unlikely in reality) - we would find no increase in throughput, but an increasing amount of inventory in front of B.

ELEVATE STATION AFTER THE CONSTRAINT

If we elevate a workstation after the constraint (workstation B), we get the following outcome:

Time Unit | A’s Capacity | B’s Inventory | B’s Capacity | C’s Inventory | C’s Capacity | Throughput | Overall Inventory | Done
0 | 5 | 0 | 1 | 0 | 5 | N/A | 0 | 0
1 | 5 | 4 | 1 | 0 | 5 | 1 | 4 | 1
2 | 5 | 8 | 1 | 0 | 5 | 1 | 8 | 2
3 | 5 | 12 | 1 | 0 | 10 | 1 | 12 | 3
4 | 5 | 16 | 1 | 0 | 10 | 1 | 16 | 4

Note that this may be any workstation after the constraint, not necessarily the one immediately afterward.

In this case our constraint is workstation B (marked in red). At Time Unit 3, something is done to increase workstation C’s capacity (essentially doubling its output from 5 to 10 units). What’s interesting is the overall flow outcome (the Throughput and Done columns). We find:

  1. It made no difference to our throughput.
  2. Whilst we haven’t increased the amount of inventory (waste) in front of our constraint (it remains increasing at the same rate of 4 units), we've now (further) starved workstation C of work. That workstation/team is significantly under-utilised.

Were we to pursue this approach further - assuming no other change - we would find no increase in throughput, and very bored, under-utilised staff.

ELEVATE STATION WITH CONSTRAINT

Finally, let’s look at what happens if we elevate the workstation with the constraint (initially workstation B). Assume we can always feed A with sufficient work to fill it.

Time Unit | A’s Capacity | B’s Inventory | B’s Capacity | C’s Inventory | C’s Capacity | Throughput | Overall Inventory | Done
0 | 6 | 0 | 1 | 0 | 4 | N/A | 0 | 0
1 | 6 | 5 | 1 | 0 | 4 | 1 | 5 | 1
2 | 6 | 10 | 1 | 0 | 4 | 1 | 10 | 2
3 | 6 | 11 | 5 | 1 | 4 | 4 | 12 | 6
4 | 6 | 12 | 5 | 2 | 4 | 4 | 14 | 10
5 | 6 | 13 | 5 | 0 | 9 | 7 | 13 | 17
6 | 6 | 14 | 5 | 0 | 9 | 5 | 14 | 22
7 | 6 | 12 | 8 | 0 | 9 | 8 | 12 | 30
8 | 6 | 10 | 8 | 0 | 9 | 8 | 10 | 38
9 | 6 | 8 | 8 | 0 | 9 | 8 | 8 | 46
10 | 6 | 6 | 8 | 0 | 9 | 8 | 6 | 54
11 | 6 | 4 | 8 | 0 | 9 | 8 | 4 | 62
12 | 6 | 2 | 8 | 0 | 9 | 8 | 2 | 70
13 | 6 | 0 | 8 | 0 | 9 | 8 | 0 | 78
14 | 6 | 0 | 8 | 0 | 9 | 6 | 0 | 84

This is far more interesting, and illuminating. I’ve also extended the table (to cover 14 time units) to provide a more comprehensive view.

Firstly, you might see that we’re playing our very own version of Whac-a-Mole [1]. As the constraint pops up, we need to identify it and give it a whack (exploit it). Note also that there’s nothing to stop an exploited workstation from once again becoming the constraint, and thus needing further exploitation (see workstation B in time units 0 and 5). You may also have spotted something about how the constraint was identified. There are two characteristics useful for identifying the constraint:

  1. Workstation capacity. The workstation with the least capacity should be the constraint. Theoretically, this is easy to spot, however it’s less so in practice. For instance, how do you determine the constraint if multiple workstations have similar capacity? Small variances (which may be ephemeral) within a workstation can also hinder us (such as flex in team structure, shorter working days, or a productivity increase not promulgated outside of the team).
  2. The build up of inventory preceding each workstation. The workstation with a low capacity, which is also seeing increased inventory directly preceding it, is likely to be the culprit.

TEAM VELOCITY & QUEUES

When we talk of capacity and inventory in software, we are typically talking of team capacity (e.g. velocity in Scrum) and queues (backlogs).

Secondly, for each row, compare the throughput value to the constraint’s capacity. Except in special circumstances (which I’ll describe shortly), the overall throughput is always the same as the constraint’s capacity. It’s clear in time units 1 to 4 that there is a relationship. But then something strange happens. We see some variation between the constraint and throughput for a while (compare time units 5 to 6, and time units 7 through to 14), before it settles back down to the constraint’s capacity.

For a short while, we find the overall throughput is greater than the constraint capacity. So what’s happening here? Well, we haven’t considered the inventory that is currently in the system (our Manufacturing Purgatory). Notice at time unit 7 that B’s inventory was 14 units (taken from time unit 6, prior to processing)? Each workstation will process its inventory, in addition to any new items it is fed, until there is no more inventory around the non-constraint workstations - at which point the constraint dictates again. This occurs at time unit 14.

Also note that significantly increasing the capacity of the current constraint doesn’t necessarily mean that the overall flow will exhibit that throughput. To see this, look at time unit 5. We significantly increased workstation C’s capacity (from 4 to 9 units), but the overall (flow) throughput never reached it. Why? Because throughput is still determined by the constraint, and C is no longer the constraint - it moved to workstation B. The constraint started at B, moved to C, then back to B, and then on to workstation A.
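If you’d like to convince yourself of this behaviour, below is a small simulation (a Python sketch of my own, with arbitrary capacities - not a formal Theory of Constraints model) of a three-workstation flow. Elevating a non-constraint leaves throughput untouched, whilst elevating the constraint lifts it.

def simulate(capacities: dict, time_units: int, feed: int) -> int:
    """Each time unit, A is fed 'feed' new items; each station then processes up
    to its capacity from its input queue and passes the output downstream."""
    order = ["A", "B", "C"]
    queues = {name: 0 for name in order}
    done = 0
    for _ in range(time_units):
        queues["A"] += feed
        carry = 0
        for name in order:
            queues[name] += carry                        # receive upstream output
            carry = min(queues[name], capacities[name])  # process up to capacity
            queues[name] -= carry
        done += carry                                    # C's output is finished work
    return done

baseline  = simulate({"A": 5, "B": 1, "C": 5}, time_units=20, feed=5)    # B is the constraint
elevate_a = simulate({"A": 10, "B": 1, "C": 5}, time_units=20, feed=10)  # improve a non-constraint
elevate_b = simulate({"A": 5, "B": 3, "C": 5}, time_units=20, feed=5)    # improve the constraint

print(baseline, elevate_a, elevate_b)  # 20 20 60 - only elevating B moves the needle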

SUMMARY

I’ve already mentioned that some of the terminology used here originated from the manufacturing industry (workstations and inventory), and is not the typical parlance of the software industry - which tends to use terms like teams, queues, and tickets. This is just nomenclature; the concepts are the same whether you use workstation or team, inventory or queue.

To recap, what I’ve shown in this section is that any improvement on a non-constraint has no (sustained) effect on the overall throughput. None. Nothing. Nada. The only way to increase the overall throughput is to exploit the constraint. There are two challenges to this thinking:

  1. Convincing others. I’ve already mentioned how hard I had to fight my own cognitive bias to internalise it.
  2. Convincing inward-facing teams to consider the system-wide impact of their own internal improvements, and more fundamentally that their own waste rationalisation schemes may transform, rather than extirpate, waste. This is the return of our old friend (Unit-Level Efficiency Bias) - teams that focus on their own improvements may neither help the goal nor reduce waste; they simply shift it.

THE TRANSFER OF WASTE

I see individuals, teams, or departments regularly promote efficiency improvements internally, yet remain unaware (wittingly or unwittingly) of their effect on the overall “system”. For example, I commonly see teams resolve their own Waiting waste (in TIMWOOD) through the introduction of automation, enabling them to do more and thus reduce their own wait times. Yet they’re not looking at the whole, and may simply be shifting waste elsewhere (it becomes someone else’s problem, typically the constraint’s), from waiting waste into inventory waste.

See Waste Reduction for more information.

Ok, so now that we have a pattern for identifying a constraint, the next logical question might be to look strategically and ask whether we could preemptively identify the sequence of constraints - a form of constraint forecasting. This idea is very appealing. If we can forecast what our constraints will be, then we can formulate a sequence of improvement transitions, and bring a level of Control to a fluid situation.

The problem with this approach is threefold:

  1. It undermines one of the Theory of Constraint's most important qualities - its ability to focus everyone on a single goal. If we could forecast constraints, then we’d likely find the group splits to tackle them independently (divide and conquer), whereas we want to focus the whole group on exploiting and elevating the immediate constraint.
  2. It considers the current position to be immutable, which is rarely the case; i.e. no (or negligible) variation in each workstation. The “subordinate everything else to the constraint” step also suggests that changes may be needed to the non-constraints, for example to ensure the constraint is never starved and always working at its optimum capacity.
  3. Wouldn't it be better to base decisions about next steps upon real and current metrics, rather than something that was measured and then forecast months ago?

FURTHER CONSIDERATIONS

AGILE & WATERFALL METHODOLOGIES

In the heydays of early game development, it was common for a single engineer to do everything. They would come up with the game design, develop the code, test and fix bugs, create the artwork and sound, print off the user manual, and then package it all up for release (e.g. copy disks). This was possible because the games were (by modern standards) very small. However, as demand and expectation grew, we also found the software products growing (games today are many orders of magnitude larger than they were in the eighties). It became impractical for one person to do everything, and specialisms became the norm (Adam Smith).

Delivering a (software) product in today's world is a highly complex and intricate process, involving many tools, specialisms, and roles and responsibilities, and it therefore requires more structured, standardised, and formalised project and delivery management techniques. For instance, we should consider the following:

To do all this, we need a delivery methodology. A (delivery) methodology encompasses many aspects and is a way of working, managing change, and measuring progress that aligns to all parties' needs. Of those, there are two common approaches: Waterfall and Agile.

Discussed next.

WATERFALL

Until fairly recently the waterfall model has been the de facto delivery model. In fact, it’s so deeply embedded within some businesses that any attempt to dethrone it is met with difficulties. At the most fundamental level, it’s a series of delineated activities that flow downward, like a series of waterfalls. We do a big chunk of this type of activity, followed by a big chunk of that activity, followed by a big chunk of another activity. See below.


Notice how the trend is almost exclusively downwards. There's no attempt to revisit earlier stages, thereby creating an expectation that each previous stage is correct and of a good (quality) standard. Any slip-up can have major consequences for downstream teams (and therefore for the project), which depend upon the output being delivered in a timely fashion and being of sound quality.

Rightly or wrongly, our contemporary view of Waterfall is often one of derision (I'll explain why shortly), yet it can be a good fit in some circumstances. For example, if the project is relatively small, it's something you're deeply familiar with (possibly because you've done it before), it has little in the way of unknowns (e.g. you're using familiar technologies, tools, and patterns), neither your nor your customer's goal or product vision changes over that timeframe (i.e. you're building to an immutable specification), you can allocate staff at exactly the point in time they're required, and the output of each stage is of a good standard, then it could work well. Then again… can you think of many scenarios that meet these criteria?

SO, WHAT'S WRONG WITH WATERFALL?

To me, the main issues with Waterfall are: an immediacy bias (date blindness), a hand-off mentality, a low reorganisation capability that leads to batching, larger releases that can better hide quality issues, its large-scale and sequential nature, and greater overall risk.

Let’s go through them.

IMMEDIACY BIAS / DATE BLINDNESS

“But surely a focus on dates is good? Without this, how can we satisfy customer demand in a timely manner?” Absolutely. Having a clear view of important dates is certainly a good thing, and helps to focus attention, but an overemphasis on dates - above all else - can also lead to an Immediacy Bias.

MILESTONES

To be clear, a waterfall project is defined by milestones - e.g. you must move from phase A to phase B by date C for the project to succeed. It's seen as definitive, easy to comprehend, and creates a sense (note the emphasis) of Control (particularly in upper management). It's rarely this simple, but if each phase meets its date, the project is on track, and everyone can remain calm.

This bias comes in two forms (date and project blindness). Fundamentally, the need to deliver by a date, for a specific project, may be placed ahead of longer-term business agility or sustainability needs - quality is forsaken for the delivery date, which creates an increasing accrual of debts, thus making every subsequent delivery harder.

MANAGING WATERFALL PROJECTS

I see this problem again and again. Project managers and executives push so hard to meet the project delivery dates that they lose sight of the problems that can stem from such an approach. The problem is that they are accountable for (and judged by) the success of a project, not necessarily for the sustainable success of the business.

And who can blame them? Waterfall projects aren't terribly nimble affairs. The impact of returning to an earlier phase can reflect badly on department heads (both the party responsible for the poor quality and the party requesting the rework), or staff may already have been reallocated onto other projects… so it becomes a game of politics and face-saving, rather than doing right by the business.

The second argument for immediacy bias is an easier sell, but not necessarily sensible. We’re trying to compare something definitive and relatively near-term (a project delivery date), against something that’s less tangible and longer-term (debts, and the sustainability of the business). Of course it’s going to lead to an Immediacy Bias.

HAND-OFF MENTALITY

Another concern with waterfall (or any coarse-grained model) is the hand-off mentality. Because work is segregated into large, distinct phases (or specialisms), we find that those specialists don’t gain access to the project until a large proportion of the work is already complete.

Often, there’s no significant (or regular) collaboration between the specialisms, except some minor interactions at inception or during a handover. This is where vital information is (meant to be) shared - thus creating a Shared Context.

Remember that a software product isn't just the software (see What is Software?) - it's the realisation of an idea, which has both a software aspect and an informational (or contextual) aspect. However, context has already been lost between the time the feature was built and the time it is handed over (Debt Accrues By Not Sharing Context).

So, instead of regular, timely collaboration and alignment sessions, context is shared through a raft of impersonal documentation, and potentially some high-level discussions (which typically involve pointing the recipients to the documentation), and it’s then left for that team to figure out the rest.

LOW REORGANISATION CAPABILITY LEADS TO BATCHING

The way a business organises itself, its culture, and even its politics all play a part in how it works and functions. In businesses that follow a more centralised (and specialised) form, we sometimes find (knowledge) silos - teams that are experts in their own domain but lose sight of other areas in terms of knowledge, experience, and (more importantly) empathy. Inevitably, this leads to difficult communication, and the result is that we tend to communicate less (humans tend to avoid difficult things, which exacerbates the problem at the point that communication must happen). Those teams aren't necessarily invested in what they're asked to do because they've never really been involved in it.

Waterfall suits the siloed nature of teams in (some) organisations, as each lifecycle phase fits snugly into a centralised team's responsibility (e.g. Architecture-and-Design hands off to Engineering, who hands off to Quality Acceptance). But it's tough to organise and coordinate large groups of people, particularly when those people are of the same specialism and other projects are also vying for their attention. There may also be political considerations internal to that group. Collectively, these difficulties lead to an inability (or unwillingness) to work at a finer grain, and thus a tendency to work with batches of change (in large releases), generating greater business risk.

SUCCESS IS MEASURED BY THE WHOLE

You only need one siloed team - awkwardly placed in the value stream - to force the batching of change.

I've seen teams working with modern technology and release practices unable to reap their rewards, simply because a downstream team was siloed, constrained by manual practices, and excluded from regular and meaningful interactions. Any strategy to increase the speed of technical delivery (for example, to gain Fast Feedback) was hamstrung, both by the lower capacity of the downstream team and by the manner in which it completed its work; this drove the technology teams back towards a batch delivery model, creating business risk and competitive concerns.

LARGER RELEASES CAN BETTER HIDE QUALITY ISSUES

We already know that large releases are unwieldy and that waterfall leans towards the batching of changes, and thus larger-grained (big bang) releases (Lengthy Releases). When change is batched, problems are not encountered quickly enough to always do something about them, thus reducing business Agility.

INNOVATION + WAITING CARRIES RISK

Technology innovation typically carries risk due to the unknowns involved. This is exacerbated by lengthy waterfall practices. The longer the wait, the greater those unknowns become, affecting feedback and ultimately ROI.

But how does this relate to hiding quality issues? Well, it's more of a "you can't find a specific tree amongst all of the surrounding trees" scenario. A large batch requires us to process far more (physically and cognitively), to the extent that we may be unable to comprehensively process it.

Another concern we see with waterfall is the cultural and emotional investment in each phase's output. The bigger the investment (which may be effort rather than money), the more people are invested, and (due to vested interests) the less likely they are to stop investing, even if they know the project will fail (Loss Aversion). To rephrase - we may know of quality issues, but choose to ignore them.

LARGE-SCALE & SEQUENTIAL

As previously described, Waterfall is very sequential and forward-oriented. This in itself isn’t necessarily a problem, but when combined with coarse-grained (batched) deliveries, it can create significant disruption, should there be a need to return to a previous stage. To analogise, it's a bit like a snowball rolling down a hill. It gets larger and faster the further it travels, making it increasingly difficult to stop. The effort to stop it and the ramifications of doing so (e.g. the knock-on effect) become so serious, that bias (including political) and Loss Aversion can usurp other important considerations (like the impact on customers).

SEQUENTIAL NATURE + GRANULARITY

It’s not really the sequential nature of Waterfall that makes it difficult to return to an earlier phase - it’s the sequential nature combined with its coarse granularity.

The Agile methodology is also sequential (we undertake UX, technical design, implementation, testing, and deployment in a specific order). However, it works at a much finer granularity of change, which enhances flexibility (such as the ability to return to an earlier phase) and reduces risk.

GREATER RISK

My previous arguments point mainly to an increased (and in many cases, unnecessary) risk. Whilst the risk isn’t all-encompassing - waterfall can be very successful, assuming that all of the stars align (there are few surprises) - I see its use as a broader risk to how businesses successfully deliver value and sustainability.

From my experience, waterfall tends to focus groups on projects - finishing the project becomes the goal, rather than finishing it in a way that is sustainable to the business.

AGILE

The Agile methodology grew out of a general discontent with the results of Waterfall, and has already been discussed. Like Waterfall, Agile is a project management and delivery methodology.

Primarily, Agile attempts to address the risk associated with software delivery (and with using waterfall). As we know, Waterfall projects tend to be more risky affairs that don't always align well (or quickly enough) with customer needs.

The Agile Manifesto states: “We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more.” [1]

Agile differs from Waterfall in the following areas: its focus on interactions and collaboration, the scale of change it favours, and the ease with which phases can be rerun.

Let’s talk through them.

FOCUS ON INTERACTIONS & COLLABORATION

One of Waterfall’s biggest failings is that it tends to produce (or support the extension of existing) silos. Agile confronts this by emphasising relationships, through more interactions and greater collaboration.

This may sound irrelevant, but the way in which we interact with others (and its regularity) can strongly influence the outcome. I’ve already talked about how techniques like Pairing and Mobbing (Indy, Pairing & Mobbing) can create better quality software and build up an important Shared Context. But building strong relationships also allows us to ask deeper questions and obtain better answers from a more diverse group.

Greater diversity has other benefits too - it counters bias, and less bias equates to fewer assumptions, which leads to reduced risk. It also reduces risk in other ways. A wider range of diverse stakeholders can typically identify problems much earlier, react sooner, make better decisions, and thus enable alternative approaches to be tried. You can’t do this when time is working against you - the decision is made for you.

A greater focus on interactions has also led to an increase in Shift Left activities - activities that not only increase collaboration, but often support improvements to TTM, ROI, and Agility. We may never have had activities like Story Mapping, BDD, TDD, or DevOps without the mantra behind Agile.

THE SCALE OF CHANGE

I’ve already described how the waterfall model, and the team structures it promotes, tends to advance the idea of irregular batching of change; thus creating unnecessary risk. Conversely, Agile prefers the regular release of small amounts of change that:

RERUNNING PHASES

The two key reasons why waterfall struggles to rerun phases are (from a pragmatic perspective):

  1. The scale of each change (bulk), and therefore its impact on teams/departments (Lengthy Releases).
  2. The implicit promotion of siloing within organisations.

Agile is in stark contrast on these two points. I’ve already mentioned that Agile prefers small deliveries of incremental value. Also, from an organisational perspective, it tends to favour dedicated Cross-Functional Teams - responsible for building, delivering, and operating features. This creates a nimbleness (or agility, if you prefer) at both the team level and in terms of the scale of change, which consequently makes it much easier to return to an earlier phase. The second-order consequence is the ability to build in quality at a much earlier stage (Shift Left).

BUT IT’S NOT PERFECT

Agile isn’t perfect, but what is? Voltaire would say that “the perfect is the enemy of the good”.

Earlier, I lamented that waterfall sometimes focuses groups on the success of projects, rather than the sustainable success of a business. Agile, though, can focus groups at too low a level - they solve the immediate concern, but lose sight of the big picture, resulting in the need for a lot of team alignment sessions.

ALIGNMENT TECHNIQUES

The good news is that there are plenty of techniques (such as Story Mapping, and BDD) to help teams to align to the big picture.

The other aspect I hear some agile purists vocalise relates to any sequential (and potentially lengthy) activity that doesn’t necessarily fit well into an iterative model. I’m thinking of activities like solutions architecture or alignment sessions - they may not “fit” directly into such an incremental (agile) delivery, but that doesn’t discount their value.

AGILE & LESS ITERATIVE ACTIVITIES

Beware of others who would prefer to discount important project (or product) activities, simply because they don’t fit well into an incremental delivery (e.g. Agile & Less Iterative Activities).

To me, agility is also about having the flexibility to choose the right approach for your project/organisation. Therefore, we should employ any aspect of project (or product) delivery that is sensible, and not neglect them simply because they don’t fit well with an incremental bias.

WAGILE

Before finishing this section, I’d like to summarise a third way - which I (somewhat mischievously) call Wagile (Waterfall-Agile).

Let me first defend myself by stating that Wagile isn’t a formally defined methodology; neither is it something that most businesses would actively set out to achieve. I record it here for the sake of completeness.

Wagile is a term I use to describe businesses who are stuck in perdition, somewhere between the Waterfall and Agile methodologies. It's typically found in established organisations that are already familiar with waterfall methods and now wish to adopt Agile, but are unable to extirpate all of waterfall's lingering hallmarks - like a firm and unwavering "date orientation", incremental changes but batched releases (e.g. teams provide change fortnightly, but these are simply grouped into the large quarterly release the business and customers are already attuned to), or skill-delineated (siloed) phases (e.g. "we do a full regression test of our release in this phase").

EXAMPLE

“We don't do waterfall, we do Agile,” claimed the CTO.

“Really?” I thought, “so what are all these quality issues - causing you to spend an additional month (of our money) fixing at the end of the project? And it’s still not been released to customers…”

The moral of the story? If you're not embedding quality into each and every iteration, you're still doing a form of waterfall.

Wagile can be very difficult to extract yourself from. In many ways it's an improvement on Waterfall, but it’s still lacking some of Agile’s key qualities to mitigate risk and improve decision-making.

SUMMARY

Waterfall and Agile (and even Wagile) are delivery methodologies - i.e. they are different ways of delivering projects.

Waterfall is a series of delineated activities that flow downward, like a series of waterfalls. It can be effective if the project is relatively small, is a known quantity (possibly because you've done it before), and neither your nor your customer's goal or product vision changes over that timeframe (no scope creep). However, it tends to favour bulk change and organisational silos, reducing collaboration and feedback, and therefore increasing risk.

Conversely, Agile is a lighter-weight iterative methodology, well suited to managing (constant) change and unknowns. It focuses more on interactions, feedback, and iterative change, and is a useful tool to manage risk.

To analogise, consider how TV and film studios (typically) manage risk. Although it's not unheard of, you rarely hear of studios filming/producing a film trilogy contiguously (equivalent of Waterfall), simply because of the high risk associated with it. Studio executives won’t waste money creating content that the public won’t consume. They need to find alternative ways to gauge the likely success of a film or show with its audience, prior to making a significant investment, and typically counter financial risk by releasing trailers, pilot episodes, ad-hoc episodes, or only initially committing to a single season. This most closely resonates with Agile - we test our assumptions, by quickly releasing something to our customers, and then assess what to do from there.

FURTHER CONSIDERATIONS

AGILE & LESS ITERATIVE ACTIVITIES

The main thrust of Agile - as an iterative delivery methodology - is that by reducing the scope of each cycle of work down to a small chunk (typically ranging from a few days to a few weeks), you benefit from lowered risk, increased feedback, and increased agility (see Agile & Waterfall).

This is excellent in theory, but in practice a typical software project consists of many different activities - including UI/UX, product management, vision and inception (alignment) sessions, and architecture - not all of which can be (sensibly) achieved using iterative practices.

I’ve heard some suggest that these (non-iterative) practices are un-agile, and therefore take the astonishing leap that they therefore have no value, nor place, in the lifecycle of a modern software product.

The most common example I’ve heard is that: “Now that we’re doing Agile, we don’t need architecture, or an architect.” i.e. “We can do quite well without you, thank you very much.” And you know, quite often, they do. At least in the beginning…

SOLUTIONS ARCHITECTURE IN WATERFALL

This view isn’t necessarily helped by past experiences. In the wrong hands - and embedded within a unidirectional, waterfall-oriented delivery - the outcome of a solution’s architecture can be sketchy at best. I’m thinking of those theoretical architectures that are passed down to others to implement something that can never be realised.

It happens, but it’s not exclusively true, and certainly doesn’t make the activity valueless, whether in Waterfall or Agile.

Just because an activity's value can't easily be seen, contextualised, or measured doesn't mean the activity is devoid of value, nor that it should be precluded. This is true of many things, including architecture. Of course I would say that, but I've seen too many projects, products, and applications go astray - not in the immediate present, but half a decade later - when all of the decisions that were made independently of any joined-up thinking or vision return to haunt the business.

It might help if we consider the purpose of solutions architecture. Would we create a new city, a film, or a product, without first considering its composition, how we link its constituent parts together, or the ramifications of one set of techniques or materials over another? Of course not. A city builder, film director, music composer, product owner, and architect provide a vision that binds distinct components into something cohesive and (hopefully) desirable.

Solutions architecture for instance allows us to comprehend as much of the whole (as is sensible), ensuring that we’re not just building in isolated silos; e.g. isolated solutions that don’t “fit”, integrate, add unnecessary complexity, or create longer-term sustainability challenges (e.g. no uniformity). Yet there are still aspects that can be embedded into an iterative model. We may define a holistic view, verify it with spikes, but be prepared to change that view and realign as more information arrives.

PREFER, BUT DON'T NEGLECT

The Agile Manifesto [1] states that we should prefer certain activities over others (e.g. individuals and interactions over processes and tools). It doesn’t suggest that we neglect, or ignore, these activities if they don’t fit easily into an iterative model.

This isn't the only example. Agile teams already use up-front inception techniques, such as Story Mapping, where we try to tease out a narrative, features, interactions, and ideas to get a more holistic view, create a shared context, and build alignment. Likewise, you can't define a UI/UX without understanding something about the whole. And what about product management? Wouldn't you be sceptical of any product manager who only considered the next step, and didn't also take the big picture into account? I certainly would.

(Solutions) Architecture, Story Mapping, UI/UX, and Product Management all influence the vision - they don’t fit directly into an iterative model (i.e. why some may view them as un-agile), but they’re all important.

FURTHER CONSIDERATIONS

DUPLICATION

Software Engineering is a complex business. Many aspects - if not carefully considered or controlled - can cause significant business-scale problems over time. One of which is duplication.

Duplication can take many forms; the most common being functional duplication and data duplication.

But this doesn’t explain the potential dangers of duplication. Let's start with functional duplication, shall we?

It’s common for established businesses to manage multiple software products offering the same (or very similar) functionality. In some cases, the problem is so acute that there may be tens of applications offering the same thing. This can occur for many reasons, including: lack of awareness (“we didn’t know it already existed so we built our own one”), political (“we don’t want their department dictating what we do”), necessity (“the existing solution doesn’t meet our modern needs”), or mergers and acquisitions. In this context it matters not why it occurs, simply that it does.

GROWTH THROUGH ACQUISITION

Businesses that follow the Economies of Scale and Growth Through Acquisition model (mergers and acquisitions) can find themselves with this problem unless they actively pursue a consolidation model.

Doesn’t sound like a problem (Why Consolidate)? The business is now managing the same function, in different technologies, on different platforms, requiring different skill sets, for essentially the same behaviour. This is massive (unnecessary) complexity.

POOR ROI CAUSED BY DUPLICATION

Now consider that we must implement a governmental directive (e.g. GDPR) - affecting each duplicated function - in order to prevent significant financial penalties.

In the zero-duplication model our task would be relatively small, with minimal waste. However, in a business with functional duplication, we suffer massive waste - the effort is a multiple, based upon the number of duplicates you have across the entire portfolio (e.g. ten duplicated implementations mean, roughly, ten times the analysis, build, test, and release effort). Talk about poor ROI.

BRANCHING TO CREATE BESPOKE PRODUCTS

Some businesses use a branching strategy to create bespoke client solutions (each branch began as a generic solution but was branched to deliver bespoke client behaviours). This is essentially functional duplication. It may be the same product, but the branch creates a distinction, and you’re therefore at the whims of functional duplication.

Data duplication can also create problems - as anyone familiar with the Reliability through Data Duplication model can likely attest. Data may be duplicated into other systems, teams, departments, and businesses for various reasons, typically using an ETL (Extract Transform Load) mechanism. Whilst certainly a well-trodden path, the quality of the output (data integrity) depends upon many factors, including platform reliability, temporal factors, and the accuracy of the implementation (which may vary after each release). Consequently, it's common for the duplicated datasets to be poorer equivalents of the original, with missing, stale (it was once accurate, but is no longer so), or inconsistent data.

The impact of such an inaccuracy depends upon: the scale of the inaccuracy, reliance upon those datasets (is it heavily used or only by a small group of users?), how swiftly it can be identified and then remedied, and the importance of the data (e.g. is the data used for critical healthcare decisions?). Such an inaccuracy is likely to affect at least one of:

RELEASE (VERSIONING) INFLUENCE

As I previously stated, the quality of the output (data) is affected by the quality of the solution responsible for moving and transforming that data. See below.

Note how versions V1 and V2 both output data of the same shape. This suggests that the new release (V2) leaves the output in a consistent, expected state (assuming that the V1 state was already accurate). Version V3, though, creates a differently shaped output, which (in our scenario) is both unexpected, and unwanted.

The point of this exercise is to show that not only can a system change the output, but also each version of that system, and why regression testing (Regression Testing) is critical. Any solution that either does the wrong thing (such as by misinterpreting a business requirement, or creating a bug), or behaves in an unexpected manner (such as an availability problem), can affect the quality of the data.
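
As a sketch of what that regression safety net might look like, the test below (using JUnit 5, with a hypothetical transform method standing in for the real ETL step) pins the shape of the output, so a release which changes it fails the build rather than silently corrupting every downstream copy of the data.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

class TransformRegressionTest {

    // Hypothetical transform under test - in reality this would be the real ETL step.
    private List<String> transform(String rawRecord) {
        return List.of(rawRecord.split(","));
    }

    @Test
    void outputShapeIsUnchangedBetweenReleases() {
        List<String> output = transform("34567,david,90210");

        // Pin the expected shape (three fields), so a version that alters it fails fast.
        assertEquals(3, output.size());
    }
}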

AVAILABILITY INFLUENCE

In the previous section I described how a functional behaviour (whether intentional or not) can affect the outcome. In this section I briefly describe how a non-functional characteristic (availability) can also affect it.

Consider a system consisting of three steps that takes data (records A, B, and C) from one system, transforms and filters it (the three steps), and then stores the output in another location for consumers to access. See below.

Let's assume that records A and B are successfully processed and the output is stored. Record C now enters the system. Whilst step 1 is successful, an unexpected system failure occurs at step 2. Record C is discarded, never reaching the end system (which is therefore only aware of records A and B). We now have a discrepancy between the number of records in the master dataset and that in the downstream secondary.
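
Here's a minimal sketch of that failure mode (the three steps and the failing record are assumptions for illustration): if a step throws and there's no retry or dead-letter handling, the record is simply lost and the secondary copy silently falls behind the master.

import java.util.ArrayList;
import java.util.List;

public class ReplicationSketch {

    static String step1(String record) { return record.trim(); }

    static String step2(String record) {
        if (record.equals("C")) {                      // simulate an unexpected failure at step 2
            throw new IllegalStateException("failure while processing record " + record);
        }
        return record.toLowerCase();
    }

    static String step3(String record) { return record + "-replicated"; }

    public static void main(String[] args) {
        List<String> master = List.of("A", "B", "C");
        List<String> secondary = new ArrayList<>();

        for (String record : master) {
            try {
                secondary.add(step3(step2(step1(record))));
            } catch (RuntimeException e) {
                // no retry or dead-letter queue - the record is simply discarded
                System.out.println("Dropped record " + record + ": " + e.getMessage());
            }
        }

        System.out.println("Master count:    " + master.size());     // 3
        System.out.println("Secondary count: " + secondary.size());  // 2 - a discrepancy
    }
}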

SYSTEM RELIABILITY

Unless the system moving the data is as accurate and reliable as the system that initially captured that data, then there will always be some loss - either in quantity, or in quality. This affects the consumers of those datasets, and thus overall Stakeholder Confidence.

SUMMARY

I first entitled this section: “The Curse of Duplication”, but the more I thought about it, the more I realised that duplication in software engineering isn’t always a curse. In fact, it’s sometimes advantageous.

Microservices are a case in point. True microservices are fiercely independent. They make few assumptions (Assumptions). To achieve this, they must limit any external influence that could affect their independence. As such, a common microservice pattern is to duplicate behaviour across multiple microservices, rather than make use of a shared external library that may impact their lifecycles. Admittedly, whilst I understand the reasoning, I'm not quite sold on the practice.

Additionally, were we to consider Streaming technologies (such as Apache Kafka [1]), we'd find the use (nay, the advocacy) of data duplication (Event Sourcing). It's advantageous as it allows teams to work more independently and even add new consumers to existing data points. And whilst it can't fully resolve every difficulty around the data duplication model, it's exciting because these technologies offer a highly reliable and scalable platform, with near real-time synchronicity.

FURTHER CONSIDERATIONS

RELIABILITY THROUGH DATA DUPLICATION

I've encountered this approach a number of times. In an attempt to gain system reliability (resilience and availability), data is duplicated into another system, or area, for consumption; either to protect, or to circumvent problems, in the original system. See below.


WHY DUPLICATE DATA?

You may use this approach to create read-only data sets, data subsets for specific user demographics, or to filter it in some other way appropriate to the consumer.

This approach suggests that by creating an independent system to cater to a specific user group, and copying/transforming datasets into it from the original, it should promote greater reliability (mainly by limiting access to a subset of users, or by having more rigid SLAs in the new system).

Whilst this argument holds water, it has a very narrow focus and carries other ramifications. Firstly, it is heavily influenced by the quality of the replication mechanism. One that contains bugs, only partially meets the business needs, or isn't available when needed, is unreliable, and may therefore create more problems than it solves.

DATA QUALITY FACTORS

The quality (integrity) of the data is dependent upon many factors, including platform reliability, temporal factors, and the accuracy of the implementation.

Secondly, and paradoxically, by neglecting to consider the whole (system thinking), we’re divorced from a potential joined-up response to the root cause problem (“I solve my problem, you solve yours - we don’t help one another”), meaning that the original system (and its dataset) remains less reliable than it probably should be.

CONTROL

There’s also a question of Control. Some teams take issue if they’re not in complete control of their own solution, so will find ways to gain that control for their own sake, and not necessarily for the benefit of the entire enterprise.

THE ARGUMENT

Ava is the original system owner, Sam the secondary system owner.

“Hi Sam, it’s Ava. Hope you had a nice break. I was speaking with some of your team members whilst you were off. They showed me some of the technical diagrams on your new system and I’m trying to understand why you chose to copy data out from our system, rather than using our master record?”

“Hi Ava. Well, when we looked at it, we were concerned that it wasn’t sufficiently reliable for our needs. We decided it was safer to duplicate the data to meet our customer’s needs, and thus create a better user experience.”

“Oh, ok,” says Ava, rather put out, “so you think our system isn’t reliable?”

“Err, sorry, but yes. I’ve heard of a few incidents…”

“You mean the thing that happened last year? Yes, that was rather unfortunate.” A brief pause. “But you solved it by doing a nightly replica. And, from what you’re saying, it leaves our solution in no better position. It's still - as you point out - not as reliable as we, nor our business, needs it to be. It’s circumventing the real problem. I feel if we’d talked this through sooner, we could have pooled our resources, and then achieved both. We could have offered you what you need and improved our own system. That would have been a win for all parties.”

DATA CHAINING

I’ve seen the following type of approach used as a data chaining model to create many distributed systems for different needs.


Data is copied from A and B into C, a subset of which is then copied from C and moved into D, before another subset of data is copied from D into E.

Remember my point about the quality of the systems responsible for transforming and moving the data? If it's inferior to the original capturing system, then data degradation occurs after each hop, creating data entropy. That's an issue with a single hop, so consider this approach with a chain of such systems, as shown above. The data can't get any better (the input isn't really changing), and it's unrealistic to think the transfer mechanism is, in every way and at every point in time, the equal of the original capture mechanism, so quality can only really go in one direction.

SUMMARY

This approach can (and does) work. However, it can also create unnecessary complexity, and may not yield a good, holistic outcome - the resolution of an underlying problem for the entire business. Before using it, I'd first recommend that you understand why such an option is being considered. Is it an attempt to bring reliability, or scalability, to a solution lacking those qualities? And if so, why doesn't the existing system already exhibit those desired traits, and is there a sensible way to support them without duplicating datasets?

FURTHER CONSIDERATIONS

WASTE MANAGEMENT & TRANSFERRAL

To my mind, the best way to reduce waste is simple - don't undertake tasks of little or no value. Of course, the problem with this is that reality sometimes gets in the way.

I've repeatedly stated throughout this book that success in feature development is often a form of betting. In any form of product development (including software), it’s difficult to determine the features which will be valuable and succeed [1], and those which won’t, without building something and showing it to customers.

So, if we can't (necessarily) prevent unsuccessful features from being released (an idealistic state), surely the next best thing is to swiftly release change to customers, doing no more than necessary (no Gold Plating), gain instantaneous feedback, and then pivot work based on this? If a feature is a flop, we rip it out (an important step in managing long-term complexity), and then reassign the team onto the next idea. Waste reduction in this case is achieved by increasing flow to our customers, and making better choices based upon Fast Feedback.

THE TRANSFERAL OF WASTE

Something I see regularly within engineering teams is a haste to remove waste. This is admirable, but it’s also important to understand that waste - like energy - may be transferred from one form into another. Consequently, and paradoxically, teams undertaking waste reduction activities may not be supporting the overarching business goal.

Most commonly, I see individuals and teams promoting internal efficiency improvements whilst remaining ignorant of their effect on the overall "system". The displacement of the Waiting waste by the Inventory waste - where we speed up our own area, only to flood the constraint (Theory of Constraints) in another area - is a particular concern.

I’ve also seen engineers attempt to counter the Waiting waste with the Overprocessing waste - i.e. whilst they wait for others to complete a dependent task, they keep themselves busy by enhancing (Gold Plating) the current solution with unnecessary refactoring, testing, performance improvements, and documentation activities. The solution may be significantly enhanced, but it was unnecessary, the customer isn’t paying for it, and we’ve now stacked an additional waste (Overprocessing) upon the existing (Waiting) waste.

FURTHER CONSIDERATIONS

BEHAVIOUR-DRIVEN DEVELOPMENT (BDD)

The primary intent of this book is to bring business and technology closer together. Behaviour-Driven Development (BDD) offers one way to do this.

Throughout this book, I’ve expounded the challenges of software engineering. Those challenges - however - don’t always lie with the implementation, but in bringing together all of the various stakeholders (e.g. customers, developers, product owners, testers) to understand, comprehend, align, and then build the right solution.

In the past, we’ve attempted to deliver software using a mix of highly specialised and delineated responsibilities, phased deliveries, and stages interspersed with work queues (i.e. Waterfall). Customers engaged with Business Analysts (BAs), who wrote lengthy tomes of business requirements, that were then articulated into solutions by architects and designers, before being broken down into implementation phases that finally saw developers and testers involved. This approach led to waiting, the loss of Shared Context, the risk of building the wrong thing, and rework (The Seven Wastes).

Since then, there's been a steadfast countering of such disjointed practices through Shift Left activities, such as in automated testing, TDD, Pairing & Mobbing, and DevSecOps. The one common theme they share is that they all promote greater and earlier collaboration.

BDD is another means of supporting the shift left. Rather than the customer (or their proxy) engaging solely with a business analyst, we actively encourage technical (designers, developers, and testers) and business stakeholders to engage and collaborate with customers, both to create a shared context, and to use diversity to identify better solutions, risks, or unknowns. Additionally, rather than constructing lengthy business requirements documents (that take weeks or months to write and are quickly outdated), we take a more Agile (iterative) view, having lots of short, focused, just-in-time (JIT Communication) discussions instead.

So what is BDD? BDD is often misperceived as a form of testing, or as an alternative to TDD. Neither is accurate. It's a way to support collaboration and alignment (e.g. Shared Context) in the building of software, using scenarios to understand context, the output of which has secondary benefits, both in the form of a common language and in the support of automated testing (Acceptance Testing).

EXAMPLE MAPPING

BDD is commonly practised using a technique called Example Mapping. In this approach, a (cross-functional) group collaboratively discusses the next user story (ideally just-in-time), using coloured cards to capture the story, its rules, outstanding questions, and (importantly) examples. This discussion helps us to discover different scenarios and considerations, and is critical to understanding - and thus solving - the problem. See below.


PURPOSE

The idea is to collaboratively map out the feature in sufficient detail that all of the stakeholders understand it, and have sufficient detail to complete it (implement, test, and deliver it). It need not be exhaustive, but it should be a good indicator of how ready you are to begin work.

Example Mapping shouldn’t be viewed as a lazy way to determine requirements; or a vision for that matter. There should still be some groundwork done up-front - typically by the product owner - to understand the vision, scope, and give a general sense of what’s expected. This ensures the team isn’t asking fundamental questions during the shaping session, which is about gaining sufficient clarity to build a solution.

FURTHER CONSIDERATIONS

BDD BENEFITS - COMMON LANGUAGE

As previously mentioned, BDD is not a form of testing. It's a way of aligning different stakeholders around a common problem in order to better understand, and thus solve, it. However, BDD also produces useful outputs, in the form of a common (business-readable) language and a foundation for automated (acceptance) testing.

The common language can take any form, but a common one is Gherkin's [1] Given, When, Then syntax: Given (some precondition), When (some action occurs), Then (some expected outcome).

Let's look at an example, shall we? Assume that Mass Synergy is building a new discounting feature for their customers. Our first feature might be to offer domestic customers (let's say US for the sake of argument) a percentage discount if their purchase exceeds $50 and contains at least 5 items (yes, it's slightly arbitrary). Using the Gherkin syntax, we identify the following scenarios.

Feature: Derive Discount
  Derives a discount applicable to the supplied cart and customer combination.
  
  Scenario: David lives in country and meets purchase criteria to receive percent discount
    Given David is a Mass Synergy member
    And resides in USA
    When he places the following order:
    | Item                  | Quantity  |
    | lordOfSea             | 1         |
    | danceMania            | 1         |
    | lostInTheMountains    | 1         |
    | balancingAct          | 1         |
    | mrMrMr                | 1         |
    Then he receives a 10% discount

  Scenario: David lives in country but fails to meet purchase criteria, he receives no discount
    Given David is a Mass Synergy member
    And resides in USA
    When he places the following order:
    | Item                  | Quantity  |
    | lordOfSea             | 1         |
    | danceMania            | 1         |
    Then he receives no discount

This feature file encapsulates our requirements. It needs a bit of work, but so far we've defined two scenarios - one where the customer receives the discount, and one where they don't. It's readable by all (Shared Context), unambiguous, uses personas (e.g. “David”) to help contextualize (and empathize), and (critically) can be used as a foundation for automated testing.

OTHER SCENARIOS

I've intentionally shown you a very simplistic and incomplete set of discounting scenarios. We've shown one happy-path scenario, and one where the customer fails to meet the criteria, but there are lots more scenarios to consider. For example, can our customer (David) get a discount if he purchases five of the same item? What about if David is travelling to the UK - should he receive that discount whilst in London? What about customer churn? Should we encourage customers who've recently left with an incentive to rejoin? What about offering an alternative form of discount? Should a customer who spends $100 get a larger discount?

In a BDD session I'd expect the team to identify many other scenarios, and capture them in cards.

The above feature file example is not an automated test suite. We've still got work to do. To do this, we must map the common language to a technical implementation (the tests), using a “gluing framework” to link them together. Here's one such example (implemented in Java and using Cucumber).

public class StepDefinition {
    private DiscountEntitlement discountEntitlement;
    private Customer david;

    // assume this has already been set up with items
    private Map<String, Item> catalogueItems = new HashMap<>();

    @Given("David is a Mass Synergy member")
    public void david_is_a_mass_synergy_member() {
        david = new Customer("34567", "EH9 5QT");
    }

    @Given("resides in USA")
    public void resides_in_usa() {
        david.setPostCode("90210");
    }

    @When("he places the following order:")
    public void he_places_the_following_order(DataTable dataTable) throws IOException {
        Cart cart = new Cart(UUID.randomUUID().toString());

        // add all of the items to the cart, using the id to look up the catalogue entry
        dataTable.cells().stream().skip(1)
            .map(fields -> {
                Item item = catalogueItems.get(fields.get(0));
                return item.setQuantity(Integer.valueOf(fields.get(1)));
            })
            .forEach(cart::addItem);

        Entitlement entitlement = new Entitlement(david, cart);
        HttpRequest request = …;  // set up the request

        // ... now call the discounts service to get the entitlement
        HttpResponse response = httpClient.execute(request);

        discountEntitlement = mapper.readValue(convertResponseToString(response), ...);
    }

    @Then("he receives a {int}% discount")
    public void he_receives_a_discount_and_cart_total_is_reduced_by(Integer discountAmount) {
        assertTrue(discountAmount == discountEntitlement.getAmount());
        assertEquals("PERCENTAGE", discountEntitlement.getType());
    }

    @Then("he receives no discount")
    public void he_receives_no_discount() {
        assertTrue(0 == discountEntitlement.getAmount());
        assertEquals("NONE", discountEntitlement.getType());
    }
}

A word of warning. This code won't compile - I've simplified it for brevity's sake. Note the method annotations starting with @ (e.g. @Given, @When, and @Then)? This mechanism allows the framework (Cucumber [2], in this case) to map the feature definition to specific implementations (methods), thus allowing us to link test automation to the common language Gherkin DSL.

This concept of taking the requirements (in the feature file) and running them through a test automation suite - without any extraneous (unnecessary) documentation - is powerful indeed.
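
For completeness, here's a minimal sketch of how the feature file above might be wired into a build. This assumes the JUnit 4 Cucumber runner, and the feature path and glue package names are purely illustrative.

import org.junit.runner.RunWith;

import io.cucumber.junit.Cucumber;
import io.cucumber.junit.CucumberOptions;

// Points Cucumber at the .feature files and the package containing our step
// definitions (the "glue"), so the requirements are executed as automated tests.
@RunWith(Cucumber.class)
@CucumberOptions(
        features = "src/test/resources/features",      // hypothetical location of the feature files
        glue = "com.masssynergy.discounts.steps"        // hypothetical step-definition package
)
public class RunCucumberTest {
}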

FURTHER CONSIDERATIONS

SINGLE POINT OF FAILURE

As the name suggests, a Single Point of Failure is a point within a system (in a purely abstract sense, not a technological one) which is singular, and therefore vulnerable to failure (or throttling/slowdown). Understanding Single Points of Failure is important because they represent weaknesses in a system, or practices, and can quickly lead to an avalanche of failures elsewhere.

I commonly encounter two forms of single-point-of-failure: system single-points-of-failure and organisational single-points-of-failure.

Let's look at them now.

SYSTEM SINGLE-POINTS-OF-FAILURE

In a software system, a single-point-of-failure can directly, and significantly, affect Availability, and therefore your Reputation. It typically occurs when false assumptions (Assumptions) are made about a system, and remain unresolved, eventually causing the system to fail.

A FAILURE? WHERE?

System failures occur for many reasons - including hardware, networking, operating system, application, power, or geographic failures.

Therefore, highly available and resilient systems typically require duplicates of every system component to negate the possibility of a single-point-of-failure. This sounds straightforward in theory, but less so in practice. Any component - given enough time or neglect - will fail, and the complexity of modern systems hampers the discovery of single-points-of-failure.

The Cloud vendors have done an excellent job here. Sure, there's the occasional wobble, but overall they've built the physical sites (e.g. regions and availability zones), infrastructure (e.g. direct network connections, load balancers, and IaC), services (e.g. distributed NoSQL databases), and platforms (e.g. multiple instances deployed across CaaS and Serverless platforms) to significantly reduce the likelihood of a single-point-of-failure. Why would we think we could do a better job?

Examples of single-points-of-failure include:

You can see that a lot of things can go wrong, making the Cloud a very appealing prospect.

ORGANISATIONAL SINGLE-POINTS-OF-FAILURE

MY TAKE

Organisational single-points-of-failure represent the habitual use of experts (Knowledge Silos) working within a silo to complete work and promote efficiency.

People, or teams, can also be single-points-of-failure - sometimes known as Knowledge Silos. Someone with vast system or domain experience and knowledge - commonly referred to as a Domain Expert - is an extremely valuable (company) asset, and a potentially disastrous loss if they leave, or are unable to work. Rightly or wrongly, intentionally or otherwise, these entities (individual or team) have a level of Control over your business, creating a dangerous situation where delivery is impeded, the tempo is set, or decisions are dictated, by that single-point-of-failure.

EMPIRE-BUILDING & CULTURAL IMPACT

The cultural impact of a single-point-of-failure is not something to be quickly dismissed. I've seen it enough times to be wary of it; its effects may include a loss of professionalism on an individual's part, and wider cultural challenges.

It can also be treated as a form of empire building. Unfortunately, as some people become aware of their superior knowledge over others (on a particular subject), they realise they have a certain degree of power over them, and - if so inclined - can bend decisions to their desire. A minority use this power to do what's best for them, or their team, rather than what's in the best interests of the business. Given sufficient time and exposure, this forms an almost impenetrable culture - something very difficult to shake - in any established business.

Quite simply, no one entity (be it an individual or a team) should know everything about a system; knowledge should be shared.

To me, the "domain expert" concept can quickly transition from asset to anti-pattern. Of course some level of expertise is unavoidable, and certainly not a bad thing. However, if you find yourself in a situation where only a single person knows how a system or process functions, I'd recommend asking why this is. Is it because the system is too complex for outsiders to understand, and therefore only the person who created it can understand it? This could be an architectural "smell". Alternatively, could it be an over-emphasis by the business on efficiency over all else, suggesting a cultural and sustainability concern?

The other key problem with an organisational single-point-of-failure is that it creates a bottleneck, and thus a lack of flexibility, or Agility, in the business. It's also another case of the tail wagging the dog (Tail Wagging the Dog), with the bottleneck forcing the decision-making and pace of the business. This is disadvantageous to the business, yet it's a practice businesses (and managers) repeat again and again with an overemphasis on productivity, putting the efficiency needs of an individual unit (be it person or team) over the agility needs of the business. I describe this thinking here (Unit-Level Efficiency Bias, Unit-Level Productivity v Business-Level Scale).

And finally, the very nature of a single-point-of-failure in this context indicates a lack of collaboration (willing or not) and isolation, creating two issues:

  1. The quality of the work within that domain may deteriorate and no one knows. Diversity isn't just a buzzword; it's critical to promoting quality in the overall solution, and its Evolvability. Without it, we are increasing our Technical Debt (in a sense doubling down on the problem) without knowing it.
  2. The individual feels increasingly isolated and overworked, and eventually quits - something particularly disruptive when we're talking about a single-point-of-failure. Some (bad) managers might place the blame on the individual, but more often than not, it's a problem of our own making. By isolating someone, to the point where no-one could support them, we have - in fact - driven them away, making it a fault on the part of management, not the employee.

FURTHER CONSIDERATIONS

ENDLESS EXPEDITING

Endless Expediting typically happens when a business is so large or unwieldy, its communication paths so convoluted, and the work it undertakes so protracted, that nothing is ever finished. Such businesses are mired in a massive Manufacturing Purgatory.

These businesses are constantly in the throes of Expediting as new ideas are generated, the business changes priorities, or modernity (e.g. technology) makes an earlier decision substandard. Work-in-Progress (WIP) is paused or cancelled in favour of the new approach, which - due to the aforementioned challenges - then follows exactly the same trajectory as the usurped idea. Ad infinitum.

Endless Expediting also creates a lack of Agility (one of the worst possible outcomes). Everything moves at a glacial pace, which in turn causes the business to move even more slowly.

FURTHER CONSIDERATIONS

PROJECT BIAS

The "cult of the project" often promotes short-term, project-oriented thinking and delivery, over longer-term Sustainability needs; i.e. there's a bias heavily in favour of finishing the project, but forsaking some Sustainability.

CAVEAT

Of course this isn't true for every case, but it's certainly something I see too much of.

In this model, project leads are incentivised (and applauded) to deliver a project within the original constraints (regardless of the problems encountered during it that weren't considered at inception), and (implicitly) disregard the longer-term consequences of their decisions and actions. Once the project is deemed complete, project staff are quickly shifted onto the next, to repeat it all over again, never to resolve the issues created previously.

AGONY & ECSTASY

Broadly speaking, there's a penalty for staff who fail to deliver a project (e.g. the notorious annual review), but rarely one for missing the sustainability needs of a business. In part, this is due to the lack of a common Sustainability definition, making it impossible to accurately measure. But there's also an immediacy bias (Immediacy Bias). The consequences of missing a project delivery are felt almost immediately, whereas the consequences of missing Sustainability needs may not surface for years.

FURTHER CONSIDERATIONS

EFFECTIVE OVER EFFICIENT

The statement to "do the right thing, over the thing right" means that we should prioritise working on the correct things ahead of prioritising the quality aspect of a feature with little customer value.

The words “efficient” and “effective” are not synonymous. Efficiency relates to Productivity - e.g. how efficiently you can make a change. Effectiveness relates to how effective we are at influencing and engaging our customers and business.

For instance, let's say that I can build and release a new API feature all in a day. That's pretty efficient. However, let's now say that the functionality I built within that API is irrelevant to my customers. That makes me (and my API) ineffective. I've been working on the wrong thing, and therefore haven't engaged my customers' interests, generating Waste (in terms of ROI and TTM, but also by further complexifying the estate). So what if I'm highly efficient, if I can't make something that benefits my customers and solves real business problems? In that case, I should immediately stop and identify effective ways to make a difference.

GOLD PLATING EFFICIENCY BIAS

Gold Plating - the unnecessary refinement of software or processes for no effective benefit - is an easy trap to fall into, and often the sign of an Efficiency Bias.

FURTHER CONSIDERATIONS

MULTI-FACTOR AUTHENTICATION (MFA) & TWO-FACTOR AUTHENTICATION (2FA)

Authentication (and authorization) is a broad subject, and one I will only touch upon. Authentication relates to how someone (or something) proves they are who they claim to be. Authorization relates to what they are permitted to do.

The most common form of authentication is the single-factor (“password”) model, in which the user supplies a username and password for the system to verify. Whilst it's been used for many years, it has certain flaws.

Because it only uses a single factor, it only requires that one factor to be captured (i.e. stolen) for the account to be compromised. To counter this, the industry introduced a series of hardening approaches - from increasing password lengths, mixing case, and mandating certain special characters, to hashing, "salts", and the notorious (monthly) password reset, which added a temporal aspect. These increased Security, but sacrificed Usability, in the form of increased complexity and cognitive load, leading to more forgotten passwords, written-down passwords, and password reuse. Fundamentally, the single-factor model is limited.
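
As a sketch of what that hardening might look like in practice, the following derives a salted, deliberately slow hash using PBKDF2 via the standard javax.crypto API (the iteration count and key length here are illustrative assumptions, not recommendations for your system).

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.security.SecureRandom;
import java.util.Base64;

public class PasswordHashingSketch {

    // Derives a salted, slow hash of the password, so a stolen credential store
    // can't simply be reversed or attacked with precomputed tables.
    public static String hash(char[] password) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);                    // unique salt per user

        PBEKeySpec spec = new PBEKeySpec(password, salt, 210_000, 256);
        byte[] derived = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec)
                .getEncoded();

        // Store salt and hash together; both are needed to verify a later login attempt.
        return Base64.getEncoder().encodeToString(salt) + ":"
                + Base64.getEncoder().encodeToString(derived);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hash("correct horse battery staple".toCharArray()));
    }
}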

BALANCING SECURITY & USABILITY

There's always a balance to be struck between Security and Usability. Lean too far toward security and we affect how usable the solution is, lean the other way and we've probably got a highly usable solution but little in the way of security.

The single factor (password) model is a prime example. New controls were introduced when the model was deemed to be insufficiently secure (e.g. regular password resets), each one reducing the usability of the solution, and in some cases, having the opposite effect to what was planned.

So what is MFA (and 2FA)? As the names suggest, multi- and two-factor authentication use more than one authentication factor. It's rare for us to discard our existing (password) factor; rather, we enhance its security potential with another factor.

So how does it work? Well, a factor is typically one of the following:

  • Something you know (e.g. a password or PIN).
  • Something you have (e.g. a phone receiving an SMS message, or a hard/soft token).
  • Something you are (e.g. a fingerprint or other biometric).

Note that the "something you have" factor often has a temporal aspect to it too. For instance, an SMS link may be valid for 5 minutes, whilst a soft token value may change every minute. The point of this isn't to frustrate the user, but to enhance security. Should an attacker capture the SMS message, they must use it within the allotted window for it to be useful.
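
To illustrate the temporal aspect, here's a minimal sketch (in Python, standard library only) of a time-based one-time code in the spirit of TOTP. The interval and digit count are illustrative assumptions.

  import hashlib
  import hmac
  import struct
  import time

  def one_time_code(secret: bytes, interval: int = 60, digits: int = 6) -> str:
      # Derive a counter from the current time window - a new code every `interval` seconds.
      counter = int(time.time()) // interval
      digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
      # Dynamic truncation: pick 4 bytes at an offset taken from the last nibble.
      offset = digest[-1] & 0x0F
      value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
      return str(value % (10 ** digits)).zfill(digits)

  print(one_time_code(b"shared-secret"))  # both parties derive the same code for this window

A captured code is only useful within its window, which is precisely the property described above.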

LOCALISED ATTACKS

It's significantly harder and more taxing for an attacker to capture both the first factor (assume username/password) and the second (e.g. device). Many attacks originate from afar (even internationally), and happen digitally. Having to undertake a localised and physical hack isn't - I suspect - particularly appealing, and consequently we can view multiple factors as a decent form of protection.

Returning to our options (something you know, something you have, something you are), the recommended practice is to employ two (or more) distinct factors, such as: a password and an SMS text message, a password and a soft token, or a password and a fingerprint.

VARYING THE FACTORS

A two-factor authentication that incorporates a password and a PIN is preferable to a single-factor password model, but it still represents two discoverable items an attacker can uncover or brute-force. Attacks are much tougher when the factors are of distinct types (e.g. something you know paired with something you have).

And finally, there seems to be some haziness around the distinction between 2FA and MFA. Put simply, 2FA requires two forms of identification, whilst MFA requires at least two forms of identification. That's it.

DORA METRICS

In 2016 the DevOps Research and Assessment (DORA) team [1] published a set of measurable qualities considered to be a good foundation for success. They are:

  • Deployment Frequency
  • Lead Time
  • Change Failure Rate
  • Time to Restoration

DEPLOYMENT FREQUENCY

The frequency with which we deploy changes (to production) tells us something about Flow, risk appetite, and working practices. I've described elsewhere how a batch mindset leads to low (and slow) rates of change, high risk, and (most importantly) a culture of Release Fear. This isn't healthy.

There's something to be said for the regularity and normalisation of software releases. Some businesses still view them as a dark art that few are willing to brave, rather than a straightforward and commonplace activity. This generates fear, and consequently, creates an unseen (and undesirable) barrier.

Normalising releases though, by enabling anyone (or anything) to perform them, making them a (relatively) simple and low risk activity (through automation), and through cultural promotion (e.g. Small Batches, Definition of Done), promotes excellent TTM and Agility. Customers receive something sooner, and (business) risk is reduced.

NEW & EXISTING CODE CHANGES

Deployment frequency should relate to all changes (new software, and the maintenance of existing software) to ensure Sustainable practices are employed. It's easy to deploy new software once, quickly, but miss the (vital) repeatability aspect associated with regular change.

PREDICTABILITY

Predictability is another important consideration. It's difficult to accurately predict a release date with large batches, due to their many influences, complexity, and inconsistent sizing. That's less true of regular, small releases, which attract opposing qualities - limited (comparable) complexity, fewer influences, and fewer Surprises.

LEAD TIME

Lead Time has interesting implications for TTM, ROI, Entropy, Flow, and Manufacturing Purgatory. It may also imply Waste (e.g. Waiting), issues within a Value Stream, and cultural concerns (e.g. too much Work-in-Progress (WIP)).

Customers want quick delivery, as do businesses. Long lead times reduce returns (ROI) - with work stuck in Manufacturing Purgatory - but they also slow feedback (Fast Feedback), making it difficult for businesses to react and adapt to changing circumstances.

Again, Batch Size plays a significant role here. The larger the batch, the more change (and consequently risk) it contains, the greater the effort (time) to verify and release it, and the greater the coordination effort.

CHANGE FAILURE RATE

The rate of change failure - the percentage of changes causing a failure or other unintended consequence - can be revealing. It may reflect: our true beliefs on quality (i.e. not the ones we might publicise), how well we've prepared and planned for work activities, the strength (or otherwise) of our testing and automation approach, and our operational readiness.

For instance, a failure due to poor preparation (e.g. a missed requirement) indicates something about our Definition of Ready (have we sufficiently captured, understood, contextualized, and collaborated on the requirements and acceptance criteria?), or our Definition of Done (have we done everything required to make this successful?). A defect intimates an inadequacy in our testing and automation strategy (and engineering practices), whilst a runtime failure identified by our customers (rather than us) implies something untoward about our operational readiness.

PRODUCTION ONLY

We're only discussing failures in the production environment, not those which are caught earlier.

There's no point in being fast if you can't also be accurate and consistent. It's the equivalent of being efficient but ineffective (Efficiency v Effectiveness), which is ultimately pointless. Production failures are one of the worst forms of waste (The Seven Wastes), partly due to their impact, but also due to their rework cost [2].

TIME TO RESTORATION

Everyone makes mistakes. The key is to not repeat them, and to resolve them quickly. The longer a problem exists, the greater the potential Blast Radius (as it increasingly impacts a greater number of users and systems). And the greater the Blast Radius - in this case a function of time - the more damage is potentially done to the business (Reputation).

MTTR

Mean-Time To Recovery (MTTR) was discussed in the Availability section. It's the duration it takes to restore a system to a working and available state, including the time required to fix it.

Lead Time also plays a factor in recovery time. The longer the lead time, the longer the likely recovery time (without resorting to Circumvention and Hotfixes). WIP is another concern. The more work a business juggles, the harder its coordination, and the greater the need to expedite (Expedite) - the ability to do so being a desirable quality in such a scenario.

SUMMARY

DORA metrics are a great way to measure progress, and I believe, success. They link back to every Business Pillar, certain underlying qualities, and therefore influence KPIs.

RELEASABILITY

Releasability is: “the ability to efficiently, repeatedly, and reliably deliver value to the customer”. This fits nicely with some of the DORA metrics.

One benefit of these metrics is that they can be used comparatively, either to view trends, or to address more immediate concerns. For instance, we can identify deterioration, indicating downturns in production (output), lengthening lead times, or increases in faults. A steady decline suggests a lack of sustainable practices (Sustainability) being employed, and that Technical Debt is rising and requires addressing.
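
To show how simple the raw measurements can be, here's a minimal sketch (in Python) that derives the four DORA metrics from a set of deployment records. The Deployment record and its field names are hypothetical - substitute whatever your pipeline and incident tooling actually capture - and it assumes at least one deployment in the period.

  from dataclasses import dataclass
  from datetime import datetime
  from statistics import mean
  from typing import Optional

  # A hypothetical deployment record - field names are illustrative, not from any tool.
  @dataclass
  class Deployment:
      committed_at: datetime                  # when the change was first committed
      deployed_at: datetime                   # when it reached production
      failed: bool = False                    # did it cause a production failure?
      restored_at: Optional[datetime] = None  # when service was restored, if it failed

  def dora_metrics(deployments: list, period_days: int) -> dict:
      # Assumes at least one deployment in the period.
      failures = [d for d in deployments if d.failed]
      lead_times = [(d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments]
      restore_times = [(d.restored_at - d.deployed_at).total_seconds() / 3600
                       for d in failures if d.restored_at]
      return {
          "deployment_frequency_per_day": len(deployments) / period_days,
          "lead_time_hours": mean(lead_times),
          "change_failure_rate": len(failures) / len(deployments),
          "time_to_restore_hours": mean(restore_times) if restore_times else None,
      }

Even a rough version of this, run over comparable windows (say, month by month), is enough to spot the kind of deterioration described above.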

FURTHER CONSIDERATIONS

KPIS

A quick philosophical question for you. How do you measure success?

There are, of course, many ways to answer this. You might answer it from a purely materialistic perspective (money, fame, or reputation), or as something more personal. You might start by establishing a (joint) definition of success, since it's not necessarily objective, is it?

Initially, we may disagree - my idea of success might be quite different from yours. But one thing we should agree upon is that success is rarely quick. It's typically the result of many months and years of toil and hard labour (and even then it's not guaranteed).

Success then - in whichever form - is influenced by two things: your ability to define your goal, and your ability to reach it. But - as the weeks and months slip by - how do you know whether you're getting nearer to it, or further away from it? That's what Key Performance Indicators (KPIs) offer - they record and measure a business' performance and improvements, and consequently its success. They give us a means to compare (against historical datasets), and adjust accordingly.

KPIS AND REACTIVE BUSINESSES

The concept of defining the goal and then measuring progress towards it seems a rather obvious premise, and surely a widespread practice, but it's sometimes lost in the busy day-to-day activities, particularly of highly reactive businesses. Staff in those environments become conditioned to work and think reactively, something that's surely not sustainable. KPIs can help here.

KPIs could be anything. They might be the number of orders, client sales, annual subscriptions, on-boarded customers, "active" customers (whatever that means in your context), the successful rollout of staff training, staff retention, publications, page reads etc. It's probably a combination of things. It's your business, you decide upon the indicators.

SENTIMENT

Not everything is readily measurable. One alternative is sentiment. For instance, we might gauge cultural improvements through a mix of metrics, like staff retention, and sentiment (“are people happier or sadder than three months ago?”).

Many of the challenges with KPIs lie not with the measurements, but with the gathering of the information. It's not that you don't have it, but it's either difficult to retrieve or aggregate it, or it's strewn over too wide (and diverse) a range of sources to be extracted from them all.

DATA ACCESS

This is often the case in large organisations with a vast and diverse estate (essentially multiple systems doing the same job), or widely distributed data sources (e.g. Datasource Daisy Chaining). The data exists, but gathering it is - to all intents and purposes - unrealistic and (overly) burdensome.

A data centralisation strategy (e.g. Data Warehouse) is commonly employed here in an attempt to simplify data consumption. However, it rarely accounts for the effort of making the data available for centralisation in the first place.

IMMEDIATE V EVENTUAL CONSISTENCY

Consistency relates to a system's or dataset's ability to remain in a consistent (complete) state. Fundamentally, it's about Reliability and Confidence. A dataset is consistent when it matches your expectations, such as how it's represented in relation to the real world. All parts are complete and accurate (integrity).

Consider a legal document, such as a will. It's composed of multiple parts, including testator, executors, guardianships, assets, and signatures. Should we capture everything except signatures, it wouldn't be credible. Being incomplete, it's inconsistent with our expectations of what a will should be.

DEFINITION

Consistent is described as - "always happening or behaving in a similar way" [1]

Unless carefully managed, inconsistencies create unexpected behaviour and a loss of confidence. It's easy to see why. It's frustrating for customers to see incomplete information on an important transaction. But it's also frustrating for the engineers managing the systems, who must spend hours investigating why there's an incomplete picture.

Before moving on, it's important to discuss atomicity, time (i.e. the temporal factor), and transaction distribution.

ATOMICITY

Atomicity - the act of being atomic - is a very useful system quality, particularly around transactions. Fundamentally, it's an all-or-nothing model - all operations occur, or none do. When atomicity is guaranteed, a system can't get into an inconsistent state.

Wikipedia describes it as: “An atomic transaction is an indivisible and irreducible series of database operations such that either all occurs, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright. As a consequence, the transaction cannot be observed to be in progress by another database client. At one moment in time, it has not yet happened, and at the next it has already occurred in whole (or nothing happened if the transaction was cancelled in progress).” [2]

If we were to apply atomicity to our will transaction, and view it, we'd find it could only be in one of two states: nothing (time unit 0), or everything (time unit 1). See below.

  Time Unit  | 1        | 1         | 1         | 1      | 1
  Operation  | Testator | Executors | Guardians | Assets | Signature

The centralised (Monolithic) model, with more localised interactions, can better leverage database transaction scope, and thus Atomicity. One application, one database, one transaction. It doesn't exist, and then it does - there's no in-between. Or to reiterate, it's immediately consistent. The transaction is complete and accurate and can be immediately worked on. This is in marked contrast to the temporal and distributed factors I'll describe next.
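
Here's a minimal sketch of that all-or-nothing behaviour, using Python's built-in sqlite3 module. The will_parts table and its contents are purely illustrative.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE will_parts (part TEXT NOT NULL)")

  parts = ["testator", "executors", "guardians", "assets", "signature"]

  try:
      with conn:  # the connection context manager commits on success, rolls back on error
          for part in parts:
              conn.execute("INSERT INTO will_parts (part) VALUES (?)", (part,))
          # raise RuntimeError("signature capture failed")  # uncomment to see the rollback
  except Exception:
      pass

  # Either all five parts exist, or none do - there's no partially written will.
  print(conn.execute("SELECT COUNT(*) FROM will_parts").fetchone()[0])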

THE TEMPORAL FACTOR

Time is a useful tactic to introduce some slack between business processes, people, and systems. For example, when I initiate a business transaction with some parties, I'm commonly told by them that: “the systems aren't updated till close of play tonight. It should be there tomorrow.” I'm so used to this behaviour that I don't even bat an eyelid. Under the hood, it's clear there's some form of queueing (not necessarily in the technology sense), or distribution of work, going on.

This approach allows us to decouple the requestor (the entity requesting an action) from the processor (the entity, or entities, responsible for actioning the request). Whether it's a system, or people, is beside the point. We've broken a complex, synchronous, and time-dependent task down into multiple discrete stages, thereby promoting Availability (the processor doesn't need to be available at exactly the same time as the requestor), Scalability (we may introduce more entities to support increased loads, or reduce a queue size), Performance (the requestor needn't wait for everything to finish; they can do other things), and Agility (we can make adjustments to better meet our objectives).
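
As a minimal sketch of that requestor/processor decoupling (in Python, using an in-memory queue and a worker thread - a stand-in for whatever durable broker a real system would use):

  import queue
  import threading
  import time

  requests = queue.Queue()

  def processor():
      while True:
          item = requests.get()
          if item is None:      # sentinel to stop the worker
              break
          time.sleep(0.1)       # simulate slow downstream work
          print(f"processed {item}")
          requests.task_done()

  worker = threading.Thread(target=processor, daemon=True)
  worker.start()

  # The requestor enqueues work and immediately moves on - it doesn't wait for the processor.
  for order_id in ("order-1", "order-2", "order-3"):
      requests.put(order_id)
  print("requestor is free to do other things")

  requests.join()     # in this sketch we wait so the demo output is visible
  requests.put(None)  # signal the worker to finish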

QUEUES & TIME

We employ Queues to decouple, and therefore protect, systems of differing performance, scalability, and availability characteristics from one another.

Regrettably, the act of decoupling ourselves from time can create consistency issues, particularly if ordering isn't adhered to (consider the effect of creating an order prior to the customer record), with some parts of a transaction complete, and others still unprocessed. In this case, the transaction is inconsistent with our expectation of what it should be. The most obvious consequence of using time as a bulkhead is Eventual Consistency. We can't guarantee exactly when each part of a transaction will complete, and must accept some likelihood of inconsistency.

TIME != ASYNCHRONICITY

Asynchronicity isn't directly about time; it relates to dependencies upon previous steps. It's quite possible (and sensible) to employ a temporal bulkhead in a sequential flow. We do this all the time in the real world, for example, when we queue for coffee in our local coffee shop.

Failure in this mode can leave us in a bit of a mess, with a partially complete (inconsistent) transaction to be identified, diagnosed and remedied.

Finally, it's worth mentioning that the temporal factor implies distribution, something we'll discuss next.

DISTRIBUTION

The third aspect to consider here is distribution; e.g. a Distributed Architecture. In a similar vein to the temporal factor (time), we employ distribution to promote (for instance) Agility, Scalability, Releasability, Flexibility, and rapid change. Unlike the centralised model, it requires us to distribute a (business) transaction across multiple system boundaries, implying a loss of Atomicity. See below.

In the first scenario (left), typical of a monolith, one transaction (Tx A) manages all five database interactions, often into the same (monolithic) database schema. The second case, more familiar in distribution, is quite different. In this case, a transaction is managed per action (assuming each database interaction is encapsulated by a single distributed interaction).

Distribution need not be asynchronous (independent), yet we can still fall foul of inconsistencies. Let's return our attention to the Will writing scenario. In the distributed model, we might build distinct microservices for each step; i.e.: Testator, Executors, Assets, Guardians, and Signature services. See below.

  Time Unit  | 1        | 2         | 3         | 4      | 5
  Operation  | Testator | Executors | Guardians | Assets | Signature

It doesn't look vastly different to the earlier, centralised example. Note though, how we get different views of the overall business transaction at different points in time. For instance, we only see the testator, executors, and guardianship records at time unit 3. It's definitely not atomic, which makes sense considering that each service is discrete and typically has its own (independent) data store.

Failures are another problem in this model. What happens if our transaction fails part-way through? We don't have the luxury of all-or-nothing atomicity, so we're left with a partial data set - a partial digital representation - that isn't representative of the whole, nor the real world.
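
Here's a minimal sketch of that failure mode (in Python). The service calls are simulated, and the compensation step is a simplified stand-in for the detective and remedial work (or saga-style undo) a real distributed system would need.

  completed = []

  def call_service(name: str, fail: bool = False) -> None:
      if fail:
          raise RuntimeError(f"{name} service unavailable")
      completed.append(name)
      print(f"{name} recorded")

  def compensate() -> None:
      # Undo in reverse order; in a real system each service would expose its own undo.
      for name in reversed(completed):
          print(f"compensating: removing {name}")

  try:
      call_service("testator")
      call_service("executors")
      call_service("guardians")
      call_service("assets", fail=True)   # the fourth step fails...
      call_service("signature")
  except RuntimeError as err:
      print(err)
      compensate()  # ...leaving three completed steps to detect and remedy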

DATA DUPLICATION CONSISTENCY

Data Duplication (Data Duplication) is just another form of distribution - in this case of data. We commonly find ETLs sandwiched between the (two) data stores, shifting data from one to the other.

ETLs though have a tendency to be (written as) a bit heavyweight, and “batchy”. It's quite common to execute them only once or twice a day, meaning data is stale, or inconsistent during that period.

CAP THEOREM

CAP Theorem (Consistency, Availability, Partition Tolerance) states: "In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that any distributed data store can provide only two of the following three guarantees… Consistency, Availability, Partition Tolerance.” [3]

You might be curious about Partition Tolerance. With this in place: “The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.” [3]. Therefore, when a failure occurs, you may either continue (knowing that it might lead to inconsistencies), or stop the action (knowing you've reduced availability).

To reiterate, if a partition occurs - something you must assume will happen in a distributed system - then you can have either Availability or Consistency, but not both. The data is either available, but not necessarily consistent, or it's consistent, but not necessarily (immediately) available. If you need high availability and partitioned services, then you must reduce your consistency expectations. If you want consistency and partitioning, then you must reduce your availability expectations.

SUMMARY

Consistency is a measure of a system's or dataset's ability to remain in a consistent (complete) state. A transaction is consistent with your expectations if its component parts reflect how it should be represented in the real world.

There are two common modes:

  1. Immediate Consistency. The transaction is immediately consistent with your expectations.
  2. Eventual Consistency. The transaction becomes consistent over a period of time, but not immediately.

Centralised (Monolithic) applications tend to have more localised interactions, so can better leverage Atomicity; i.e. an all-or-nothing success or rollback model. That makes them immediately consistent.

Conversely, distributed solutions (e.g. Microservices) exhibit greater isolation, and often use independent (and different) database technologies. Atomicity isn't really an option here, creating consistency (eventual), and rollback challenges. This is also true of temporal practices (e.g. bulkheads, such as Queues) - a business transaction remains incomplete (and may produce inconsistent results) until all temporal parts successfully complete.

That leads us nicely into failures. A failure in the atomic model results in the complete reversal (rollback) of the transaction. Admittedly, this isn't always necessary, but it's a powerful protective tool to retain consistency. A failure in the distributed, or temporal, model, though, requires a degree of detective and remedial work to return the transaction to a consistent state.

FURTHER CONSIDERATIONS

(BUSINESS) CONTINUITY

"Humpty Dumpty sat on a wall.
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again."
- Nursery Rhyme

Google's dictionary defines continuity as: "the unbroken and consistent existence or operation of something over time."

Continuity, in the context of an organisation, is its continued ability to function, to operate, and to retain an appropriate level of operational service, even after a significant disruptive event (or disaster).

It might surprise the reader to learn that not everyone thinks in these terms. From my experience, I'd say a Panglossian (overly optimistic) outlook is quite common. And not everyone has a plan.

Whether your business is in the private, public, non-profit, or governmental sector, your business (probably) offers its customers a service. What happens though, when its ability to offer that service is impaired, or eliminated, by an unexpected factor? Does it continue to function? In what capacity? As Humpty Dumpty found out (probably to his dismay), how do you put the pieces back together when your world is so fragmented, obscure, nebulous, unautomated, and untested (in terms of proven resilience), or when you are no longer in Control?

ASSUMPTIONS

Like Resilience, Continuity is heavily influenced by Assumptions - e.g. an over-reliance on a partner, system, physical location, or geographic zone. What sort of assumptions have you made that are false or unreliable?

Of course, the counter-argument is money and time; i.e.: "Why should we spend all that money and time on something that'll never happen?" Sure, you could do nothing, but isn't that a form of betting? Consider our modern world. It's more complex than it's ever been, and becoming increasingly so. Rightly or wrongly, humanity is more dependent (i.e. coupled) on technology, our consumer expectations and needs are greater (Ravenous Consumption), and our reach (e.g. social and commercial) extends much further. Only a short time ago many businesses still relied upon their staff to be physically present at a specific location to undertake operational activities - they weren't prepared when the pandemic struck, and it ended up being a poor Assumption. In this case, continuity was risked by a pandemic, but it could just as easily be a cyberattack, an employee leaving (if they are a Knowledge Silo), a failing business partner, a Single Point of Failure, or some other “disaster” causing a loss of service.

RECOVERY TIME

Recovery is important, but so too is recovery time. For instance, if your system is lost, can you take it elsewhere, and do so within a reasonable timeframe? How would you do this? Has it been tested?

The key to continuity is being proactive. Probably the most important starting point is creating a Disaster Recovery (DR) Plan. From there it should be regularly tested. SLAs also have an important part to play here. They define acceptable service levels (typically under normal conditions), making them a useful metric to build resilient systems and operations. Red Teaming is another useful technique for identifying weaknesses. Finally, you should consider Complexity. If your estate is complex, that implies something that's difficult to fathom and recreate. Continuity here is protected by reducing that complexity.

FURTHER CONSIDERATIONS

UPGRADE PROCRASTINATION

Upgrade Procrastination relates to a tardiness in upgrading systems (or indeed practices) in line with customer (possibly indirectly), business, or industry expectations. It has a notable effect on both Security and Evolvability and is often influenced by Functional Myopia.

Procrastination may occur in any layer, including hardware, Operating Systems, runtime platforms, libraries, and applications. Security and Evolvability risks are best mitigated through a coordinated, multi-layered strategy. For instance, upgrading a software application only to leave it running on old (insecure) hardware, an outdated operating system, or an aging runtime platform merely alleviates the risk; it doesn't remove it.

WHY PROCRASTINATE?

Why procrastinate? Well, like many other non-functional aspects, upgrades are generally deemed less important than features. Also, regression costs are often an impediment.

The second point I have on Upgrade Procrastination is time. The longer it's left, the harder it is to do. Tardy upgrades involve more changes, thereby increasing risk, causing a large Blast Radius and Change Friction, and consequently a lack of impetus. An upgrade "event horizon" is reached - a point of no return - where the upgrade, even when highly desirable, becomes (pragmatically) impossible, and must be abandoned. Such a position typically leads to a shortened product lifespan, slowness (i.e. poor TTM), reduced competitiveness, and reduced staff retention. Don't leave it too late. It's not important until it's important, and by then it might be too late.

BLAST RADIUS V CONTAGION

Whilst Blast Radius and Contagion are related, they're not the same. Blast Radius relates to our (in)ability to act, whereas Contagion relates to our (in)ability to contain something (bad) that's currently occurring.

To reiterate, Contagion is the consequence of an event that further exacerbates it. It's typically more fast-acting than Blast Radius, stemming from more fluid events, like an ongoing cyberattack or Scalability issues. Blast Radius is slower. It creeps into systems over months and years through bad practices, and poor architectural and design choices.

CONTAGION

The word contagion is well-used in the parlance of our modern world, particularly since the pandemic. Google's dictionary describes it as "the communication of disease from one person or organism to another by close contact." Fundamentally, it's the spreading, or exacerbation, of a problem or situation.

Software systems (indeed any system) face these same challenges. The more interconnected, the greater the potential of contagion, and thus disaster.

BLAST RADIUS V CONTAGION

I've already discussed Blast Radius in this text, which is loosely related to Contagion. Whereas Blast Radius relates to our ability to act, contagion relates to our ability to contain something (bad) that's occurring. See Blast Radius v Contagion.

Within a software system, Contagion can be managed through the employment of a number of techniques, including:

  • An Andon Cord - stop all further processing until a root cause and satisfactory resolution is found.
  • Circuit Breakers - the contagion here being the flooding of another system or component (see the sketch below).
  • Canary Releases - to limit how widely a new feature is promoted, until we have conclusive evidence of its suitability.
  • Microservices - one contagion being large, high-risk releases that are irrelevant to a given service.
  • Loose Coupling - possibly the contagion of large, high-risk releases, or the impact of a security vulnerability.
  • Time and Bulkheads (such as Queues) - to throttle interactions between systems of varying performance characteristics and prevent flooding.
  • Isolated backup networks - to retain Business Continuity.
  • Data Centers and Availability Zones - e.g. to reduce potential contagion from Acts of God.
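
As an example of one of these, here's a minimal sketch of a Circuit Breaker in Python. The thresholds, timings, and single-call "half-open" behaviour are illustrative assumptions rather than a production design.

  import time

  class CircuitBreaker:
      def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
          self.failure_threshold = failure_threshold
          self.reset_after_s = reset_after_s
          self.failures = 0
          self.opened_at = None  # None means the circuit is closed (calls flow normally)

      def call(self, fn, *args, **kwargs):
          if self.opened_at is not None:
              if time.time() - self.opened_at < self.reset_after_s:
                  raise RuntimeError("circuit open - failing fast to protect the downstream system")
              self.opened_at = None  # trial period ("half-open"): allow one call through
          try:
              result = fn(*args, **kwargs)
          except Exception:
              self.failures += 1
              if self.failures >= self.failure_threshold:
                  self.opened_at = time.time()  # open the circuit: stop flooding the dependency
              raise
          self.failures = 0
          return result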

FURTHER CONSIDERATIONS

PRINCIPLE OF LEAST PRIVILEGE

Systems contain a plethora of damage limitation controls. Indeed Contagion - how widely a problem is exacerbated - is about damage limitation. The Principle of Least Privilege is another of these.

Most systems expect the user (person or system) to be authenticated and authorised before further access is granted. However, it's dangerous to give people more access than they're entitled to (you've seen those spy movies where the protagonist steals a keycard from the security guard, thus allowing them full access to the facility?). Excess privileges may be (intentionally or unintentionally) abused, either by that user, or indeed by attackers able to steal them. Users should only get access to what's necessary, no more. This is the Principle of Least Privilege.

The underlying idea of Least Privilege is quite fundamental, and common sense suggests it should be best practice, so why do we need a principle? Because, even with the best intentions, it isn't always followed. For example, during an application development cycle, it's quite natural to use broad (extended) access to get things done, yet never retract those privileges in production, leaving the application open to abuse. Convenience is another reason. Some businesses make it inordinately difficult to broaden access once it's set. I've seen it take weeks (or months) in some cases - if it happens at all. The somewhat obvious knock-on effect is circumvention: teams start with higher privileges, even if they're not immediately required, nor desirable from a security perspective.
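
Here's a minimal sketch of the idea (in Python) - explicit, narrow grants that fail fast when anything asks for more than it was given. The role and permission names are purely illustrative.

  # Each role is granted only the permissions it actually needs.
  ROLE_PERMISSIONS = {
      "reporting-service": {"orders:read"},
      "order-service": {"orders:read", "orders:write"},
  }

  def require(role: str, permission: str) -> None:
      granted = ROLE_PERMISSIONS.get(role, set())
      if permission not in granted:
          raise PermissionError(f"{role} lacks {permission}")

  # The reporting service can read orders...
  require("reporting-service", "orders:read")
  # ...but any attempt to write fails fast, limiting what a stolen credential can do.
  try:
      require("reporting-service", "orders:write")
  except PermissionError as err:
      print(err)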

MONOLITH & MICROSERVICE

A pure Microservices model, with its own independent datastore, may lessen Contagion over its Monolithic cousin (e.g. access to all tables across all domains), but Least Privilege still applies. Only give users (and other systems) what they need, no more.

INFRASTRUCTURE AS CODE (IAC)

Surprise is rarely a good thing in software. It's often caused by a lack of Consistency, either in environment, or in process. Notably, it also impedes two key delivery qualities - Repeatability and Predictability.

Environmental Drift (the undesired divergence of runtime environments - e.g. dev, uat, production) is a common problem in software businesses. Even a seemingly small and insignificant variance between two environments can have a significant impact - I've seen production deployments fail over exactly such variances. Consistency here (i.e. Uniformity) is critical, yet it's at a major disadvantage where: (a) changes occur manually, (b) changes can be circumvented (Circumvention) to only occur in one environment (typically production), (c) there's no adhered-to standard for software promotion (from least stable to most stable), and (d) change is commonplace.

TWELVE-FACTOR & IaC

12-Factor Apps makes similar points; e.g.:
  • "Use declarative formats for setup automation, to minimize time and cost for new developers joining the project;
  • Minimize divergence between development and production, enabling continuous deployment for maximum agility;" [1]

As intimated, keeping an environment accurate and aligned can be a costly affair. If it's overly encumbering, it'll inevitably get circumvented, and inconsistencies creep in. Over time, it drifts until all credibility is lost, and ultimately, it's discarded. We should avoid this situation.
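
Detecting drift needn't be sophisticated. Here's a minimal sketch (in Python) that diffs two environment configurations; the keys and values are hypothetical.

  dev = {"java": "17", "feature_flag_x": "on", "heap": "2g"}
  prod = {"java": "11", "feature_flag_x": "on", "heap": "4g", "hotfix_patch": "applied"}

  def drift(a: dict, b: dict) -> dict:
      # Report every key whose value differs (or exists in only one environment).
      keys = a.keys() | b.keys()
      return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

  print(drift(dev, prod))  # each drifted key with its (dev, prod) values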

Another concern lies in the overextension of environmental responsibilities; i.e. an environment is used for too many unrelated responsibilities. Possibly the most notorious example of this is running sales demos from an unstable, unbounded (development or test) environment. It's fraught with danger and has a whiff of unprofessionalism best left unadvertised to prospective customers. The common factor here is typically cost (financial or time) - either the cost of provisioning another environment, or a Tight-Coupling dependence on another system (e.g. IDs must match).

Nothing I've said so far breeds confidence (Stakeholder Confidence), Repeatability, Predictability, Continuity, nor speed - all of which should be our driving factors.

Ok, now I've set the scene, let's discuss Infrastructure as Code (IaC). Fundamentally, software needs a place to run. There are some exceptions, but most typically that means infrastructure and runtime platforms. We deploy our software to a (runtime) platform that we've provisioned and configured for that very purpose, thus allowing us to use that software. But how do we get our infrastructure / platform into such a position?

Historically, this activity was almost exclusively manual, yet the increased complexity of modern software has made such intervention entirely impractical [2]. The only real alternative is automation, but to do that, we need a way to program it. This is Infrastructure-as-Code (IaC). We write code to provision (and configure) the infrastructure necessary to run our software applications. We declare how it should look and allow the machine to find the most reliable and efficient way to achieve it.
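
To illustrate the declarative idea, here's a toy sketch (in Python). It isn't any real tool's API - the resources, names, and plan format are invented for illustration - but it captures the shape: declare the desired state, and let a reconciler compute what to create, update, or destroy.

  # Desired state: what we declare the environment should look like.
  desired = {
      "vm-web-1": {"size": "small", "image": "ubuntu-22.04"},
      "vm-web-2": {"size": "small", "image": "ubuntu-22.04"},
      "db-main":  {"size": "large", "image": "postgres-15"},
  }

  # Actual state: what currently exists (note the drift and the orphaned resource).
  actual = {
      "vm-web-1": {"size": "small", "image": "ubuntu-20.04"},  # drifted
      "vm-old":   {"size": "small", "image": "ubuntu-18.04"},  # no longer declared
  }

  def plan(desired: dict, actual: dict) -> list:
      actions = []
      for name, spec in desired.items():
          if name not in actual:
              actions.append(f"create {name} {spec}")
          elif actual[name] != spec:
              actions.append(f"update {name} {actual[name]} -> {spec}")
      for name in actual:
          if name not in desired:
              actions.append(f"destroy {name}")
      return actions

  for action in plan(desired, actual):
      print(action)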

Let's turn now to our driving factors and see what IaC means for them.

REPEATABILITY

The main problem with manual intervention is its (relative) lack of Repeatability. If you ask two people to do the same job, there's a reasonable chance they'll do it differently, even with explicit instructions. Yet we need consistency. Two environments can't be almost the same; they must be precisely the same [4].

IaC solves the repeatability problem through Automation, repeatedly producing the same results time after time. Confidence flows from this.

PREDICTABILITY

If you can create stability, consistency (through repeatability), and limit Surprise (e.g. by using machines rather than people), then you can - with reasonable accuracy - predict the outcome. And when the outcome is predictable, everyone gains.

SPEED

One of the drawbacks of manual provisioning is its (relative) slowness. The more steps, the longer it takes, and the more likely an error will creep in.

If environments can be quickly and easily provisioned, then we can do it Just-in-Time (JIT). Consider the tester who wants to test an older software release that's been superseded by a later one in the UAT environment. They can easily spin up a new environment, test it, and destroy it in quick succession. Consider a Sales team who can quickly spin up a demo environment just-in-time for an important sale, and then easily destroy it afterwards. Consider a prospective client who can quickly spin up their own environment to try out your product. Speed is of the essence. You can do all this with IaC.

CONTINUITY

A less obvious - but important - benefit of IaC is Continuity and Disaster Recovery.

For instance, should a successful Ransomware attack be executed on your systems, it needn't be catastrophic. Yes, you may be locked out of your current transactional systems, but IaC makes it recoverable. You can take that code and (rapidly) recreate it elsewhere [3].

CONFIDENCE

All of the previous points create confidence (Stakeholder Confidence). Confidence we'll get a like-for-like environment; confidence we can predict the outcome; confidence we can deliver it quickly; confidence we can continue to function as a business in the most unpredictable of circumstances; and confidence that a new environment can be "self-served" by a non-techie. What more do you need?

FURTHER CONSIDERATIONS

BLAST RADIUS

Blast Radius relates to the scale of change, and therefore the effort necessary to make a single change. Or: “if I change this, what's the knock-on effect?” See below.


It shows the difference in impact between a small blast radius (on the left) and a large blast radius (on the right).

Assume in both cases that the central component (Z) acts as the initial stimulus of change. In the example on the left, our component's radius is relatively small. It impacts two others (C and E). The example on the right though impacts many (eight by my reckoning). Now, it depends upon the type of change, but given the choice, I'd opt for the first scenario [1]. It's more change than I'd like, but it's eminently more manageable than the second.
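
Blast radius is easy to estimate if you know your dependencies. Here's a minimal sketch (in Python) that walks a dependency graph to find everything transitively impacted by a change; the graph below loosely mirrors the left-hand example.

  from collections import deque

  # Edges point from a component to the components that depend upon it.
  dependents = {
      "Z": ["C", "E"],
      "C": [],
      "E": [],
  }

  def blast_radius(start: str, graph: dict) -> set:
      # Breadth-first walk over everything transitively impacted by a change to `start`.
      impacted, pending = set(), deque([start])
      while pending:
          node = pending.popleft()
          for dep in graph.get(node, []):
              if dep not in impacted:
                  impacted.add(dep)
                  pending.append(dep)
      return impacted

  print(blast_radius("Z", dependents))  # two impacted components: C and E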

Of course, this mindset is quite natural. Do the smallest thing, of the highest (immediate) value. Don't undertake work where one (anticipated) change generates eight pieces of reactionary, unrelated, but necessary activities, when you can do something smaller.

Clearly, such work is unappealing, and therefore never gets done. Yet avoiding it doesn't resolve the root cause; it simply kicks the problem further down the road for someone else to deal with.

SMALL V LARGE RADIUS

To be clear, a small Blast Radius isn't necessarily an issue, since its impact is likely to be low. Conversely, a large Blast Radius implies lots of change, regression effort, high organisation and coordination costs, and risk.

A large blast radius reduces optionality, your ability to make change (Agility), and thus stymies Innovation and advancement. It's typically associated with a large amount of Tight-Coupling, and generates Change Friction. As the saying goes - there be dragons! Any attempts to resolve it are hampered, leading to longer-term Evolvability, Sustainability, and Agility concerns.

BLAST RADIUS & CONTAGION

Blast Radius relates to Contagion, however I view contagion as the consequence of an event that typically exacerbates it. It's more fast-acting than Blast Radius, such as from a cyberattack or poor Scalability. In contrast, Blast Radius is slower. It creeps into systems over months and years through bad practices, and poor architectural and design choices. See Blast Radius v Contagion.

Blast Radius can occur anywhere where: (a) many dependencies rely upon something capable of evolution, and (b) you can't sensibly break the change into smaller, independently releasable units of work. Most often I see it in systems that allow “database dipping” - where many (unrelated) consumers dip into a database to access data for their own needs, and circumvent any existing interface (e.g. API), such as with Domain Pollution. This is why Encapsulation is best practice! The other common scenario is a runtime platform upgrade on a Monolith.

FURTHER CONSIDERATIONS

RUNTIME & CHANGE RESILIENCE

I can think of two distinct forms of Resilience. In the more traditional sense, we have resilience (or the lack of it) in the face of unexpected events (e.g. environmental factors). In the other sense, we have resilience (or the lack of it) in the face of intended changes that don't go to plan, such as a software release. Both forms can cause service outages.

Why distinguish between them? Because some systems and approaches are weaker in one form of resilience (or both) than others.

For instance, let's consider the Monolith. In terms of runtime resilience, we have something of a mixed bag. It's resilient to a point. However, in terms of change resilience, its broad scope, Tight-Coupling, Batching, and Lengthy Releases actually generate change risk, making its change resilience relatively poor.

Alternatively, by easily adding further routes to High Availability, Infrastructure-as-Code (IaC) can magnify runtime resilience, and also offer decent change resilience, due to its repeatable (and therefore predictable) attributes.

THE PRINCIPLE OF LOCALITY

In Alexandre Dumas' novel The Count of Monte Cristo, Baron Danglars amasses a small fortune through insider trading, by receiving important information before others. The Count then employs the telegraph (and some social engineering) to publish false information and therefore impoverish Danglars [1]. This approach makes use of the Principle of Locality - the idea that some sort of benefit is received by being physically close to something else.

High performance is typically achieved through a combination of low latency (the time taken to travel from component A to B), and fast processing time. Low latency is strongly influenced by the locality between sender and receiver. The closer they are (both physically and logically), the less traveling time, the lower the latency, and the greater the performance. The Monolithic application is a case-in-point. The fact that it's centralised means communications between sender and receiver are local (no network), making them extremely fast.
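
A rough, back-of-the-envelope sketch (in Python) shows why. The figures are approximate assumptions - light in optical fibre travels at roughly 200,000 km/s - and they ignore routing, switching, and processing overheads, so real latencies are higher still.

  FIBRE_SPEED_KM_PER_S = 200_000  # approximate speed of light in optical fibre

  def min_round_trip_ms(distance_km: float) -> float:
      # Best-case round-trip time imposed by distance alone.
      return (2 * distance_km / FIBRE_SPEED_KM_PER_S) * 1000

  print(min_round_trip_ms(1))      # within a data centre: ~0.01 ms
  print(min_round_trip_ms(5_600))  # roughly London to New York: ~56 ms
  print(min_round_trip_ms(9_600))  # roughly London to Tokyo: ~96 ms

No amount of clever engineering removes that lower bound; only moving the communicating parties closer together does.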

THE PRINCIPLE IN TRADING

There's a reason why traders pay top dollar to locate their software as close to the stock market as possible. It's based on the Principle of Locality. A seemingly small delay (e.g. fractions of a second) in high-speed trading can give your competitors an unfair advantage over you, resulting in significant lost revenue.

We also find this principle relevant across global solutions. For instance, it's not always expedient to centralise services in one location, and expect customer requests to travel long distances. See below.


For instance, customers from India or Japan probably aren't going to have a good experience if their digital transactions must travel halfway across the world to access services in the U.S. Rather, by using the Principle of Locality, we place those services nearer to their customers (Cloud vendors use Edge Services in a similar way), tailoring that experience to them.

REACTION

Some businesses find themselves stuck in an endless spiral of reaction, so much so that their decision-making is based almost exclusively on it and they lose sight of their overarching goal and business strategy. It's like being stuck in a foxhole, under fire, with no clear idea of how to get out.

Reaction is fine in small quantities (e.g. it shows a willingness to meet customer needs), but it's a trap to continually work in this manner. See The Cycle of Discontent.

EVOLUTION'S PECKING ORDER

In terms of software, evolution has a pecking order, determining how regularly it changes, morphs, or evolves. See the table below.

                 | User Interface (UI) | Business / Service Tier | Data Tier
  Rate of change | Fast                | Medium                  | Slow

The user interface (UI) is seen by the largest number (and variant) of stakeholders, and therefore receives the most opinion, Bias, and Bike Shedding. It's what you put in front of your prospective customers, or onto your corporate website. It's therefore one of the first signs of modernity, or lack thereof. That puts it high on the evolutionary pecking order.

Next comes the business / service tier, containing all of the features and business logic required to offer customers a “business service”. This tier certainly changes, but holistic change tends to be harder and slower (than UI changes), the functions are more valuable (than a UI), and we expect a decent level of reuse from it. This makes evolution slower. It's common, either to route new (service layer) components into existing data sets, or to use them to wrap legacy monoliths (e.g. SOA).

Data is probably your most valuable asset, bringing Loss Aversion into play. It's seen by the fewest (at least directly [1]), and probably requires the deepest level of understanding and contextualisation. It's also a time machine - a historic record of the past - which most of us consider immutable. That makes change hard.

TRANSFORMATION AFFECTS ALL TIERS

Digital transformation is hard, in part because it typically affects all three areas (user interfaces, business layer, and data). You're fundamentally changing the way you operate the business and technologies to meet modern challenges.

FURTHER CONSIDERATIONS

LATE INTEGRATION

Integration is typically one of the final stages of a project. It's also one of the more complicated and time consuming. Waiting to integrate therefore adds risk and creates unpredictability. It follows that by integrating early and often, you create Fast Feedback, generating the confidence to proceed, but also reducing risk and increasing Predictability.

Try to understand why these comments are being made (techniques like The Five Whys can help here). If possible, avoid building too deep, too soon, and force integration issues to be solved earlier.

FURTHER CONSIDERATIONS

EXPEDITING

Expediting is the act of promoting a previously lower priority work item ahead of current priorities. Fundamentally, something has changed, you're adopting a new priority, to the detriment of your current work items, with the expectation of some gain (e.g. financial).

Expediting can be a valuable tool, or a nefarious enemy, determined to stop advancement. When correctly employed, it allows you to meet an objective that: (a) wasn't previously known or foreseen, or (b) that has since been influenced by internal or external pressures, and is now causing you to adapt [1].

When used incorrectly, or too often, it becomes a hindrance that: limits Agility and the release of value, creates lethargy and stasis, and promotes Manufacturing Purgatory, Environmental Drift [2], and Circumvention. It affects quality, and can demotivate the workforce (how would you feel after doing work, only to be told it's no longer important?).

Expediting can feel like the Hokey-Cokey [3]. You put a change in, do a bit of work on it, then take it back out. You then put in another change, do some work on it, and take it back out again. See below.


It's messy and difficult to follow. For instance, what's the current feature in the environment? It's also easy to see how drift occurs.

ENDLESS EXPEDITING

The most egregious form of this is Endless Expediting - a business remains in the throes of expediting everything, whilst delivering nothing.

We've talked about what Expediting is, its benefits and its dangers, but we've not yet discussed what causes it. Expediting occurs due to a change in business priorities. Of course, this isn't a cause [4]; it's more likely to be the direct result of (some of) the following stimuli:

  • Poor culture or cultural inequality (e.g. unruly execs making unrealistic promises without informing others).
  • The lack of a clear strategy (a lack of direction, causing subjective reasoning).
  • Too much Work-in-Progress (WIP) (which is both a cause and an effect).
  • Large batches of change (i.e. high risk, slow change).
  • Slow release cycles (causing self-inflicted damage - it becomes the victim of expediting).
  • Imbalanced evolutionary cycles (such as between you and your dependencies).
  • External pressures (e.g. critical security vulnerabilities demanding immediate attention, or new technologies eclipsing previously selected choices).

More fundamentally, it may signal that idea generation is moving much faster than idea realisation (the creation and delivery) can be achieved, leading to half-realised solutions that are only partly, but never wholly, introduced [5].

FURTHER CONSIDERATIONS

THE THREE S's - SURVIVAL, SUSTAINABILITY & STRATEGY

There are three traits I look for when judging the worthiness (the overall value to the business) of a work item, based on what I call The Three S's: Survival, Sustainability, and Strategy.

SURVIVAL

The first to consider is survival. Or, put another way: is the work item critical to your continued existence?

Survival may be of immediate concern, since it may feed immediate cash flow needs, but that doesn't make the work item either sustainable, or strategic. As the word suggests, being in a continued state of sustenance is undesirable; indeed it's unsustainable, and suggests that things aren't improving. Businesses that constantly react (e.g. the Cycle of Discontent), or ones with little control of an outcome, are probably stuck in the sustenance space.

SUSTAINABILITY

Sustainability is the second trait, and is about behaving sustainably. As described earlier (Sustainability), it relates to: “Behaving in a manner that is favourable to the continued viability of your business or the environment.”

This trait isn't specifically about the environment (although that's important too); it's about how you sustain any practice, process, maintenance, delivery, technology, application, or tool, to deliver a work item repeatedly, within an appropriate time frame and budget, and executed with appropriate regularity to meet customer and business needs. If you're doing this, you're in a pretty good place, but that doesn't necessarily make you strategic.

STRATEGY

Strategy then, is the final piece of the puzzle. It describes what you want and how you intend to get there. It's a guiding star, and directional (thinking and planning), but still requires realisation.

A work item can offer sustenance (e.g. immediate cash flow), and be sustainable, yet be non-strategic. Consider those activities in sectors that the business has previously supported, but no longer wishes to. Sustainability, like survival, can promote efficiency, but be ultimately ineffective (see Efficiency v Effectiveness). You can build sustainability into a process, but the question is should you if it's not strategic?

SUMMARY

Sometimes, there's a good reason to undertake a work item solely to satisfy one trait. The ideal position, however, is one that meets all three. See below.

In such cases, we get survival (e.g. it has an immediate return to support cash flow), sustainability (e.g. it can be repeated in a timely, efficient, and cost-effective manner), and it is strategic (e.g. it meets the business' directional expectations). I've included a simple matrix below indicating some of these scenarios.

  Survival | Sustainable | Strategic | Comment
  Y        | N           | N         | It offers survival only. It's bringing in money (e.g. to satisfy immediate cash flow concerns), but it lacks any repeatable or sustainable structure, and any strategic direction.
  Y        | Y           | N         | It offers survival, and is sustainable, but is not strategic. You may be making money and the practices you apply may be efficient, but are they taking you in the right strategic direction? Are they effective?
  N        | Y           | N         | It is sustainable, but lacks survivability and any strategic direction. You might build in efficient and repeatable practices, but they may not be immediately transferable into revenue, nor meet the business' strategic direction. Are they effective?
  Y        | Y           | Y         | It offers survival, is sustainable, and is strategic. This is the ideal case. You're able to subsist, by making some income from the work item; it's sustainable, meaning it's low cost and repeatable; and it's strategic, indicating it will meet the business' longer-term aspirations.

FURTHER CONSIDERATIONS

INAPPROPRIATE FOUNDATIONS

The (Leaning) Tower of Pisa [1] might be the best-known example of the impact of building upon Inappropriate Foundations. First constructed in the twelfth century, it quickly showed signs of subsidence when a second floor was added. Investigations quickly discovered the problem lay not with the tower, but with its foundations (which were only three metres deep, in an infirm subsoil). During its lifetime, multiple costly restoration and stabilisation attempts have been made. And whilst tourists continue to flock to it as an attraction, as a building it's a very costly and ineffective failure - the result entirely of its poor foundations.

The same principle holds true in software construction. When the foundations are poor - the foundations here being some important quality, like Resilience - you'll likely get substandard results. It may not be immediately obvious, but, like the Pisa tower, it will become increasingly clear.

It's much harder to resolve a failing in a foundation, because many things have been built upon it (dependencies), thus making (sometimes incorrect) Assumptions about it. When you build on the equivalent of digital quicksand, you introduce risk, and place the longer-term outcome in question.

FURTHER CONSIDERATIONS

SILOS

Fundamentally, Silos equate to working, and thinking, in isolation. In terms of discoveries and innovations, they can be very successful, yet in terms of (business) sustainability, they often create the following concerns:

  • A loss of collaboration, diversity (of thought), and Shared Context.
  • Quality issues, from missed (or late) opportunities to share information and ideas with a wider group.
  • Lower morale, and consequently poorer talent retention.
  • A tendency toward Batching, and away from Systems Level Thinking.
  • Waste (The Seven Wastes).

Let's start from the collaboration and diversity angle. Perhaps the most obvious example of a silo lies with the individual. In the Indy, Pairing, and Mobbing section I describe how working independently is often deemed (by most) to be the most efficient way of delivering change, yet, it often leads to (unnecessary) specialisms, a loss of Shared Context, and the risk of Single Points of Individual Failure. Of course teams, departments, and even companies, can also fall foul of this. Simply merging two entities doesn't mean they're working collaboratively, nor collectively. Again, the silo often permits the individual entity to move faster, typically because there's a lower cost (or fewer short-term ramifications) to decision-making, but I regularly find this to be a false economy [1].

The lack of (early) diversity is also interesting. As I've stated elsewhere: "Greater diversity […] counters bias, and less bias equates to fewer assumptions, which leads to reduced risk. It also reduces risk in other ways. A wider range of diverse stakeholders can typically identify problems much earlier, react sooner, make better decisions, and thus enable alternative approaches to be tried. You can't do this when time is working against you - the decision is made for you". Silos, then, limit options, and that often leads to quality issues due to missing opportunities to share information and ideas with a wider, more diverse group. Late access to work and ideas that someone thinks they should be involved in, but isn't, is also likely to have a negative effect on morale, and relates to my point on talent retention.

When you work in isolation, it's easy to lose sight of your surroundings. Unfortunately, in terms of work management, that demotes Systems Level Thinking and promotes Batching. For example, even with the advent of Continuous Integration (CI), I still find individual engineers sitting on code branches for weeks, or even months. When it's finally merged back, you're then dealing with a large, risky batch of change that few have seen. No wonder things don't always go to plan. The same principle holds true at all levels of a business. A business that is siloed tends to work with large batches of change (and the Waterfall Methodology [2]), which is risky, reduces feedback (Fast Feedback), and lowers potential ROI. Another consequence is that since large batches are better at hiding quality issues than smaller ones, Silos, by extension, have more opportunities to lower quality.

SYSTEM-LEVEL THINKING

Business processes require holistic, system-level thinking, which considers the effect of batching and silos. There's little point in building efficiency everywhere else, only to leave a key team siloed and working with batches [3].

For the aforementioned reasons, Silos also generate Waste (The Seven Wastes). Batching, for instance, tends to cause Waiting [4] and Overprocessing [5] (which is just another term for Gold Plating), resulting in poor ROI. Only the activities necessary to achieve the goal should be undertaken, no more, anything else is waste.

FURTHER CONSIDERATIONS

MINIMAL DIVERGENCE

“Any customer can have a car painted any color that he wants, so long as it is black.” - Henry Ford.

In his biography, Grinding It Out, Ray Kroc, of McDonald's fame, describes the importance of a minimal menu, with minimal divergences in price points. Henry Ford used similar principles in his production lines. In short, this approach allows for consistency, quality (at every step), and growth.

Of course, intentional product variances can make you stand out against your competitors. The theory being that the greater the variance, the greater the flexibility and choice a customer has, and thus the more customer-oriented the offering. The bad news is that unnecessary, or unsustainable, variances can quickly become a drain on the business offering them. They create a divergence (a lack of Uniformity), increasing complexity, creating (unnecessary) specialisms and Knowledge Silos, and hampering growth (it's harder to scale the business).

Clearly this is undesirable, so why do it? The two most common forms of software-related divergence I can think of are (a) bespoke features and (b) technology variance (e.g. Technology Sprawl). Let's discuss them now.

Bespoke (or custom) features are, of course, a natural outcome of customer-orientation, and shouldn't necessarily be feared. They should, though, be carefully managed to avoid an unnecessary overabundance.

Consider first that every custom feature you opt to include is an agreement by you to embed the complexities of another party within your own business. Let's pause briefly to reflect on that. For instance, if your business serves ten clients, and each comes with its own uniqueness, then you've adopted ten sets of other businesses' complexities within your own. Consequently, you've created client specialisms, and are (arguably) employing staff on behalf of another business [1]. Was that your intent?

Adopting too much complexity (and that includes complexity from others) can be detrimental to your rate of change (TTM), your ability to adapt or be nimble (i.e. Agility), and your Evolvability (you're handing over some Control to other parties), mainly because higher complexity typically goes hand in hand with more Assumptions, and that leads to Change Friction.

The second concept, technology variance, has already been covered elsewhere (see Technology Sprawl), so I won't dwell long here. Again, it's fine in relatively small amounts, but it's easy to overdo, creating divergences in skills, experience, processes and techniques, and Value Streams that hamper scale and growth.

FURTHER CONSIDERATIONS

THE CYCLE OF DISCONTENT

Reaction can be one of the most challenging aspects of running a business. Whilst some level of reaction shows your willingness to adapt and satisfy customer needs, too much has a negative effect on deliveries, and can restrict growth and Innovation. Constant reaction leads to what I term the Cycle of Discontent, shown below.


The cycle has four phases:

  1. Constant Reaction.
  2. Constraining Force.
  3. Debt Accrual.
  4. Slower & Slower Deliveries.

Let's look at them in more depth.

PHASE 1 - CONSTANT REACTION

The cycle begins when constant reaction becomes the norm. Work begins, typically with the belief that everything has been sufficiently planned and is accurate, but in this (reactionary) reality, it probably isn't.

REACTION IS THE PAST

Reaction looks to the past; it suggests that you're probably not looking at what's to come, nor innovating.

PHASE 2 - CONSTRAINING FORCE

Constraining Force may occur anywhere in a project lifecycle. It's often caused by Assumptions made earlier in the process that are later discovered to be false. It may manifest as the inability to think, to assign sufficient resources or time, or to satisfy the desired outcomes of all stakeholders.

When reaction is constant, there's rarely sufficient time or resource to fully analyse, understand, or contextualise the problem, and thus to identify the correct solution (which should be based upon costs, time, and quality).

Let's say your project doesn't go to plan. Your options are:

This last point often seems to be the one most travelled.

PHASE 3 - DEBT ACCRUAL

If Constraining Force has impacted quality (i.e. corners are cut to deliver the feature), then over time, Technical Debt accrues. This debt accrual starts small, but quickly mounts up as the cycle repeats with no remedial action.

These debts are poor decisions, made for both right and wrong reasons, that are never resolved (“paid back”). They force us into substandard practices, where unnecessary and overly-complex work becomes the norm. This is made possible by Creeping Normalcy - we become acclimatised to incremental change, but lose sight of the longer-term, holistic effect.

PHASE 4 - SLOWER & SLOWER DELIVERIES

Debt Accrual leads to the final stage: Slower Deliveries. Each change carries greater risk, is more complex, or impacts many more (unnecessary) areas (it has a high Blast Radius). Tasks become harder, take longer, Change Friction is encountered, and in the most severe cases, change of any form is prevented.
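
Here's a minimal sketch of how phases 3 and 4 compound (the five-percent friction figure is an assumption, purely for illustration): each reactive cycle that cuts corners adds a little friction to every subsequent delivery, so delivery times grow multiplicatively, not linearly.

    # Toy model, illustrative only: each corner-cutting cycle adds ~5% friction
    # (an assumed figure) to every delivery that follows.
    base_delivery_days = 10
    friction_per_cycle = 0.05

    delivery_days = base_delivery_days
    for cycle in range(1, 13):
        delivery_days *= (1 + friction_per_cycle)
        if cycle % 4 == 0:
            print(f"After {cycle:2d} cycles: ~{delivery_days:.1f} days per delivery")
    # A dozen 'acceptable' compromises later, the same work takes roughly 80% longer.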

Slower Deliveries has some important business implications:

SUMMARY

Reaction can be a good way to emphasise your commitment to clients, yet, to paraphrase the saying, anything taken to the extreme is likely to have negative consequences. Without careful consideration and planning, it's quite easy to allow near-sighted views to dictate terms and obscure long-term goals (e.g. Sustainability), affecting quality and speed, and thereby diminishing your business.

Paradoxically, Constant Reaction has the power to undo the very qualities you may be aspiring to; e.g. your desire to support short-term TTM and ROI needs pollutes your longer-term TTM and ROI aspirations. Watch how often you sacrifice quality for delivery - it's almost always the first to face the chop, but often has the greatest long-term impact.

FURTHER CONSIDERATIONS

CONTEXT SWITCHING

Context Switching is the concept of switching focus (and thereby, context) from one idea (and its realisation) to another. Its occasional use isn't an issue, indeed it may be healthy, but the constant need to do so is highly wasteful, both from an efficiency and effectiveness (efficacy) perspective.

CONTEXT SWITCHING & WASTE

Context Switching generates waste (The Seven Wastes) in the forms of Transportation (we digitally transport our current activity out of one environment - and out of our heads - and replace it with another) and Waiting (activities are paused, meaning slower delivery and less ROI).

From the individual's perspective, it's deeply frustrating and morale-zapping to be asked to regularly pause (or stop) an activity of significant cognitive load, in favour of another equally demanding activity [1]. It's disruptive, intrusive, and often requires high re-acclimatisation effort.
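
As a back-of-the-envelope sketch (the twenty-minute re-acclimatisation cost is an assumption; real costs vary by person and task), it doesn't take many switches to erode a meaningful chunk of the working day:

    # Back-of-the-envelope, illustrative only.
    working_minutes = 8 * 60          # one working day
    reacclimatisation_minutes = 20    # assumed cost to regain deep focus after a switch

    for switches in (0, 2, 5, 10):
        lost = switches * reacclimatisation_minutes
        focused = working_minutes - lost
        print(f"{switches:2d} switches/day -> {focused} focused minutes "
              f"({lost / working_minutes:.0%} of the day lost to re-acclimatisation)")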

There are a few potential causes here. From a project perspective, we have poor planning. When work is incorrectly scoped or prioritised, or incorrect Assumptions are made, it's likely to force the expediting of other (more important) project activities that weren't previously considered.

Some technology teams are regularly interrupted by system failures (incidents), causing them to switch focus from improvements back to nursing systems. Modern autonomic systems mitigate this, at least in part, through self-healing mechanisms that cause limited disruption (Context Switching) to humans.
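
As a minimal sketch of that self-healing idea (check_health and restart_service are hypothetical stand-ins, not a real API), an autonomic loop remediates the common failures itself and only interrupts a human when it runs out of options:

    import time

    def check_health(service):
        # Hypothetical stand-in: probe a health endpoint and return True/False.
        return True

    def restart_service(service):
        # Hypothetical stand-in: ask the orchestrator/supervisor to restart it.
        print(f"Restarting {service}...")

    def notify_humans(service, attempts):
        print(f"{service}: still unhealthy after {attempts} restarts - paging a human")

    def watchdog(service, max_attempts=3, interval_seconds=30):
        attempts = 0
        while True:
            if check_health(service):
                attempts = 0                      # healthy: no human involvement needed
            elif attempts < max_attempts:
                attempts += 1
                restart_service(service)          # self-heal quietly
            else:
                notify_humans(service, attempts)  # only now force a Context Switch
                return
            time.sleep(interval_seconds)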

Over-communication is another possibility. Outward communication should be appropriate, and timely to the audience, otherwise it's diverting them from more pressing activities [2].

Highly reactive businesses (Reaction) do a lot of Context Switching, and thus so too do their staff. However, it's the businesses that are both highly reactive and have Lengthy Release Cycles that suffer the worst. These businesses struggle to retain focus on a specific work item for long enough to finish it before Endless Expediting occurs. The end result is that very little is delivered, Innovation and evolution are lowered, and the business' overall effectiveness is questioned.

FURTHER CONSIDERATIONS

CREEPING NORMALCY

Since we've already discussed the causes of Technical Debt and the Cycle of Discontent elsewhere, it should be clear why they exist. What that doesn't explain is how they're able to remain in existence. To reiterate: why aren't we better at tackling them?

Let's start with an analogy. Imagine yourself in a lush green forest, lazing by tranquil azure waters that trickle down off the mountains, basking in the morning sun. It's idyllic. Suddenly, without warning, you're transported to an arid desert. The heat is so intense that, without swift action, you will surely perish.

My point relates not to the environment, but to your need to handle extreme change in very short order. It's clear that the second circumstance (arid desert) is far worse than the first (lush and temperate), and that, without swift action, it will mean the loss of something important. See below.

[Figure: two states - 1: Temperate, 2: Hot and Arid]

The comparison here is stark, but it's rarely how change actually occurs in a business. It's more like this.

[Figure: the same transition shown as fourteen gradual steps, from temperate (1) to hot and arid (14)]

Note how the transition - from temperate to blisteringly hot - occurs over many steps (time). It doesn't occur overnight; if it did, it would be identified and resolved. This is the concept of Creeping Normalcy. Creeping Normalcy survives, unwanted [1], because - by only comparing slight variations between our current and previous positions - we miss the trend. View the trend, though (say over fourteen steps), and the problem is clear.

In real terms, Creeping Normalcy sees the normalisation of substandard practices and processes within a business (in waves) over many months and years, each wave becoming a deeper impediment. No one expects it, wants it, or plans for it, yet it's still allowed to happen.

So, what can be done? Creeping Normalcy breeds on inconclusive evidence. There may be sentiment, but it can't be converted into credible data to "sell" to the key stakeholders holding the purse, and so it's allowed to persist. Metrics - such as KPIs and DORA Metrics - can help here. For instance, if the data indicates your failure rate has jumped by thirty percent in the last quarter, that's evidence that an intervention is necessary.
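
Here's a minimal sketch of turning sentiment into evidence (the quarterly figures are made up for illustration): track a metric such as change failure rate over time, and judge the latest value against the longer trend, not just the previous data point, so that slow drift can't pass as "normal".

    # Illustrative only: the quarterly change-failure rates below are made-up figures.
    quarterly_failure_rate = [0.08, 0.09, 0.10, 0.11, 0.13, 0.15]  # oldest -> newest

    latest, previous, baseline = (quarterly_failure_rate[-1],
                                  quarterly_failure_rate[-2],
                                  quarterly_failure_rate[0])

    step_change = (latest - previous) / previous    # what we tend to judge by
    trend_change = (latest - baseline) / baseline   # what actually matters

    print(f"Quarter-on-quarter: +{step_change:.0%} (feels tolerable)")
    print(f"Against the trend:  +{trend_change:.0%} (a clear case for intervention)")
    if max(step_change, trend_change) > 0.30:
        print("Evidence threshold crossed - take it to the stakeholders holding the purse")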

FURTHER CONSIDERATIONS

ECONOMIES OF SCALE

Economies of Scale - increasing the scale of operation whilst decreasing unit cost - is often used to gain competitive advantage by:

UNSUSTAINABLE GROWTH

I've witnessed three failures in business strategy caused by an unwillingness (or neglect) to consolidate, leading to poor agility, reputational issues, and, subsequently, constrained growth. They are:

  1. Failure to consolidate products after a merger or acquisition.
  2. Failure to consolidate after undertaking a product replacement strategy.
  3. Failure to standardise on (unnecessary) variants.

Growth through mergers and acquisitions creates its own set of challenges. It's quite common for those businesses to be in the same industry, so they're bound to have commonalities embedded within them, in terms of capabilities and processes. Yet it's also typical for those capabilities and processes to be sufficiently different to hamper any sort of simple lift-and-shift. One business' "ways" can't easily be slotted into another's [1].

Probably the simplest example of this divergence is a CRM solution. Each business is likely to bring its own CRM with it, and the two are probably either different products, or the same product configured quite differently for that individual business' needs. Consequently, the merged entity now has two ways to manage and store customer information: a divergence in the same capability, and therefore a complexity, introduced through the merging of those businesses.

Some choose to ignore this problem. Others try to solve it through an integration project (sometimes years in length) [2]. Both approaches, though, could be viewed as hiding the problem away for another day. Some choose to consolidate.

SUSTAINABLE GROWTH

Of course, businesses have successfully used this model to grow; some, however, have found it difficult to sustain. A key reason is insufficient consideration of the longer-term implications (e.g. Technical Debt) of such a strategy, and a consequent failure to plan for them.

Each new acquisition comes with a host of new technical debts you must adopt. Without intervention, it's easy to see how an estate grows into a complex behemoth that doesn't respond well to any sort of significant change.

The product replacement (modernisation) approach - in which the business believes more custom is available through the modernisation of an existing product - also needs careful management. It's a replacement only if you remove the original solution, there is functional parity [3], and all traffic has been migrated onto the modern solution - otherwise you're left managing multiple products, and thus significant additional complexity.
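
As a minimal sketch of keeping that honest (the field names are purely illustrative), a replacement is only "done" when all three conditions hold; anything less means you're now managing multiple products:

    from dataclasses import dataclass

    # Illustrative only: field names are hypothetical.
    @dataclass
    class ReplacementStatus:
        functional_parity: bool      # new product covers what the old one did
        traffic_migrated_pct: float  # share of customers/traffic on the new product
        legacy_decommissioned: bool  # original solution actually switched off

    def replacement_complete(s: ReplacementStatus) -> bool:
        return s.functional_parity and s.traffic_migrated_pct >= 100 and s.legacy_decommissioned

    status = ReplacementStatus(functional_parity=True, traffic_migrated_pct=85.0,
                               legacy_decommissioned=False)
    print("Replacement complete" if replacement_complete(status)
          else "Still a modernisation in progress - you are managing multiple products")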

However, building the new solution will take years, and your current product will still be evolving during that time. The goalposts continue to move. There's a fair chance, then, that the business never performs the consolidation phase, leaving its customer base strewn across multiple applications.

Finally, we have the variants embedded within a business, caused by an overemphasis on bespoke client needs (typical in the client services model). As described in Minimal Divergence, when you choose to work on behalf of another business, you embed that business' variants (i.e. complexities) into your own, creating specialisms and potentially hampering scale.

SUMMARY

As businesses grow, some lose sight of their more humble beginnings. A single customer holds less importance than a large group of them, or indeed than the need to keep growing.

But surely there's always a need to balance business growth with customer retention? Lean too far towards growth, and you risk alienating existing customers by diverting focus elsewhere, eventually creating churn [4]. Lean too far in the opposite direction, and you never grow, and risk competitors eating up your market share.

Unsustainable Growth is almost exclusively the result of allowing complexity to take root, of which the following diagram shows the potential consequences.


The best way to manage complexity isn't to pretend it doesn't exist, nor is it to abstract it (e.g. behind some vast integration project, albeit I can follow that logic); it's to consolidate it (see Consolidation).

FURTHER CONSIDERATIONS

CONSOLIDATION

Most businesses desire growth; however - as described in Unsustainable Growth - growth often requires the business to embed a significant amount of additional complexity. If left unchallenged, it eventually leads to evolutionary, agility, innovation, resilience (both release and runtime), staffing, cultural (e.g. new initiatives), and speed burdens.

In many cases, this complexity stems from a lack of consolidation, either in business processes, in products, tools and technologies, or in the lack of a standardised offering.

EXTREME HOARDERS

From a products, tools, and technologies perspective, it can feel a bit like those "extreme hoarders" TV shows, where the homeowner has collected many items over many years. Their house is now so packed with sundry items that it's barely liveable, and certainly not healthy. Businesses that have collected many other businesses, products, or tools, and never tidied up, may find themselves in a similar situation.

One answer is Consolidation; i.e. by identifying and removing duplication - in technology, tooling, capabilities, and business processes - complexity is reduced, leaving a leaner estate that's less confusing and easier to manage.

Consolidation, of course, takes time, and rarely adds new (sellable) functionality. Consequently, it rarely receives plaudits - i.e. why spend time and money on something that provides no obvious benefit to the customer? But that's a somewhat short-term, ill-informed, and parochial view. A competitive business isn't just one that offers (at least) comparable products and services; it's also one that is effective and efficient, and that requires the constant streamlining of bloat. So if consolidation doesn't get first-class status, maybe it should?

FURTHER CONSIDERATIONS

ROADMAPS

A roadmap is a living artifact that presents a high-level view of (mainly) future important activities to internal stakeholders and customers, and shows how the business intends to meet its future aspirations. If done right, it's a simple yet powerful concept that quickly and succinctly articulates a high-level plan. Yet I suspect many readers, like myself, have found themselves mouthing the infamous words “it'll never work” once or twice in their career.

Of course, there are many reasons why a roadmap fails, but here's a breakdown of the most common problems I encounter:

INACCURACIES

Possibly the most common complaint about a roadmap is its lack of accuracy. Indeed, some of the later points are specialisations of inaccuracy.

The most obvious problems are fictions - in timelines, capacity, or capability. There's a good reason roadmaps don't always present timelines, at least not granular ones, yet the first question asked of a roadmap is usually "when?", so there should be something to back it up. Amazingly, that isn't always the case; roadmaps are frequently built on (educated) guesswork and conjecture.

Capacity is another aspect. Even if you've correctly estimated the effort, have you considered your capacity to implement such a change? If you're only working on one item, you can be confident of your capacity; if not, you're vying with every other activity for the same capacity (see the section on parallel activities) and may have no realistic way to achieve it, so why make it a roadmap item?

Finally, there's capability. Have you considered the current capabilities (abilities) of the people aligned to those activities? How long will it take to acquire those skills, and where will they come from?

INSUFFICIENT ENGAGEMENT

Anyone can build a roadmap, but building a viable one requires you to engage the right stakeholders at the right time. Never present a roadmap without giving the affected people sufficient time to consider and influence it. After all, they have the insight to validate its achievability.

TREATED AS A PLAN

A roadmap presents a high-level, typically horizontal, view of the major activities required to achieve a goal. It's not a plan, nor the realisation of a strategy, nor should it be.

Think of it like this. Your strategy influences your plan, which influences your roadmap, but they're all independent processes and/or artefacts. You can't achieve a roadmap without an accurate, detailed, and thereby achievable, plan. Don't equate a roadmap with a plan. Conversely, if your roadmap presents every gory detail, then it's a plan, not a roadmap, and shouldn't be presented as such.

UNIDIRECTIONAL

In Top-Down v Bottom-Up Thinking, I describe two different, and often competing, approaches to change. You may approach it bottom-up, from the current position, attempting to change that position to align with a future state; or you may approach it top-down, somewhat ignoring the current position and focusing on the significant strategic changes required to reach your goal.

A roadmap that only focuses on strategic change, ignoring the tactical (bottom-up) activities, isn't a complete view of the problem space. A roadmap that presents tactical (bottom-up) activities may not be sufficiently aspirational, strategic, and future-proof. A roadmap (and plan) that considers both is a more truthful position. Unless you're delivering a purely greenfield change, you'll need both approaches.

A BIG LIST OF PARALLEL ACTIVITIES

A roadmap with lots of vertical activities probably has a smell about it [1]. You're either presenting it at the wrong granularity (e.g. are you presenting the plan rather than an overview of it?), or you're trying to do too much, and risk falling foul of WIP, coordination challenges, Expediting, and long delivery timeframes.
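
Here's a minimal sketch of why (the item sizes are assumptions): with fixed capacity, running everything in parallel delays every first delivery, whereas finishing items in sequence gets value out far sooner.

    # Illustrative only: five roadmap items, each ~3 months of effort for the whole team.
    items = 5
    months_per_item = 3

    # Sequential: the first item lands after 3 months, the last after 15.
    sequential_finishes = [months_per_item * (i + 1) for i in range(items)]

    # Everything in parallel with the same capacity: every item lands together at ~15 months.
    parallel_finish = months_per_item * items

    print(f"Sequential finish times (months): {sequential_finishes}")
    print(f"All-in-parallel finish time: ~{parallel_finish} months for every item")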

CONSTANT FLUX

A roadmap that is in constant flux (i.e. fortnightly or monthly changes) suggests one (or more) of the following about the business:

FURTHER CONSIDERATIONS

STRATEGY

A business strategy is a detailed plan of how that business will realise its future goals. It's not just a bunch of hazy goals or fuzzy statements written down on a page for others to implement. To reiterate: a good and useful business strategy doesn't just describe the what, it also describes the how.

Many of the businesses I encounter that fail to make meaningful, transformational change don't have a strategy. They may think they do, but they don't. They have a set of written “goal statements” and/or themes, but no “bite”. Yet the plan is the most important piece of the puzzle. Without it, how do you know what's required, or indeed what's noise? How can you ensure everyone is aligned and focused without a detailed plan of what's required? And how do you know it's achievable?

Also note that this is not a roadmap (Roadmaps). A roadmap is a high-level (externally publishable) artifact, not a plan. A strategy is a plan that allows you to understand and sequence all activities in the most efficient and profitable manner for the business to achieve its goals. It likely requires a mix of top-down thinking and activities, and bottom-up thinking and activities (as described in Top-Down v Bottom-Up Thinking), and perhaps paradoxically, doesn't need to be fully strategic.

FURTHER CONSIDERATIONS