The downsides of experimentation

Stas Sajin · Published in Dev Genius · Sep 12, 2023

Experimentation is the tech world’s equivalent of a gym membership. Everyone wants to sign up, but mastering how to run experiments demands a lot of effort. On the surface, experimenting seems like an obvious and simple formula: make small changes, measure, iterate. However, this post aims to delve into the less discussed, yet critical, aspects of experimentation and outline elements that, if neglected, can make you look as though you’re merely rearranging deck chairs on the Titanic.

I want to be perfectly clear: this post isn’t an argument against experimentation; quite the opposite. I’m confident that, on average, companies that rely on experimentation are more successful than those that use other forms of decision-making. Instead, consider this post an invitation to enrich your understanding and practice of experimentation, incorporating it into a broader strategy for success.

Paper cuts

I would like you to consider the following thought experiment. Imagine you have three groups:

  • Group 1: Experiences a refined version of the product, improved based on previous successful experiments.
  • Group 2: Subjected to constant changes due to ongoing experiments, from new features to design alterations. Group 2’s user experience is akin to living in an architectural fever dream — nothing stays the same. Instead of signing these users up for a reliable service, we’re taking them on a roller coaster.
  • Group 3: Interacts with a stable version of the product, updated only for bug fixes.

Which group do you think is likely to have the best experience?

While Group 1 enjoys a premium experience and Group 3 gets a consistent, albeit static, user experience, Group 2 may actually find itself in the least enviable position. Its users might be constantly thrashed by paper cuts: ever-changing features, inconsistent interfaces, terrible ideas, and general product unpredictability. Each individual change might be minor, but the sum total can lead to a frustrating and disjointed experience. Moreover, a high frequency of changes not only affects user experience but also increases the risk of introducing technical issues, in the form of significant regressions or even major system incidents, potentially leaving Group 2 with the worst service-level agreement (SLA) performance of the three groups.

Note that the lesson here is not to completely halt product evolution by “freezing” it; rather, the challenge is to continually improve the product while mitigating the downsides that constant changes can bring, something that is very, very difficult to do.
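
Group 2’s pain is partly an exposure problem, and one blunt mitigation is to cap how many concurrent experiments a single user can be enrolled in. The sketch below is a minimal illustration of that idea, not a description of any real assignment service: the hash-based bucketing, the MAX_CONCURRENT cap, and the experiment names are all assumptions.

```python
import hashlib

# Hypothetical cap on how many experiments a single user can see at once.
MAX_CONCURRENT = 2

def in_experiment(user_id: str, experiment: str) -> bool:
    """Deterministic 50/50 bucketing via hashing (a common pattern)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 2 == 0

def assign(user_id: str, active_experiments: list[str]) -> list[str]:
    """Enroll the user in at most MAX_CONCURRENT experiments.

    Experiments are considered in a fixed priority order so the same user
    gets a stable set of treatments between sessions.
    """
    enrolled = []
    for experiment in active_experiments:
        if len(enrolled) >= MAX_CONCURRENT:
            break  # spare the user the paper cuts from everything else
        if in_experiment(user_id, experiment):
            enrolled.append(experiment)
    return enrolled

if __name__ == "__main__":
    live = ["new_onboarding", "checkout_redesign", "ranking_v7", "dark_mode_default"]
    print(assign("user_42", live))
```

The trade-off is throughput: with a cap in place, each user contributes data to fewer experiments, so tests take longer to reach significance.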

Snacking

One of the paradoxical downsides of a culture of experimentation is that it can sometimes encourage an overly cautious approach to innovation, known colloquially as “snacking.” In this process, individuals or organizations make small, incremental changes to an existing system or product, optimizing for short-term gains or local maxima. While these adjustments might yield positive results in the immediate context, they often discourage people from taking bolder, more substantial risks that could lead to significantly greater rewards.

What’s particularly challenging is that this culture of experimentation often goes hand in hand with a culture of “scorecard chasing,” where everything is about generating measurable impact. On average, this is a good thing, but taken to an extreme it manifests as a lack of risk-taking at the IC level: “Why opt for a riskier project that might not show results by my next review?”
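
To make that incentive concrete, here is a toy expected-value comparison; the success probabilities and lift sizes are invented for illustration and are not drawn from any real data.

```python
# Toy numbers, purely illustrative.
snack = {"p_success": 0.8, "lift": 0.005}       # 0.5% gain, lands before the next review
moonshot = {"p_success": 0.1, "lift": 0.10}     # 10% gain, may take several cycles

ev_snack = snack["p_success"] * snack["lift"]           # 0.004
ev_moonshot = moonshot["p_success"] * moonshot["lift"]  # 0.010

print(f"Expected lift, snack:    {ev_snack:.3%}")
print(f"Expected lift, moonshot: {ev_moonshot:.3%}")
```

Even in this toy setup, where the moonshot has the higher expected lift, the snack is the rational pick for someone graded per review cycle, which is exactly the distortion the archipelago model below tries to correct.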

So, how can we reap the benefits of experimentation while also leaving room for more drastic forms of innovation? I don’t have a good answer, but part of me is very fond of the idea of having autonomous units described in the archipelago model below.

Archipelago Model

Let’s lean into this idea of autonomous units. Think of it like a corporate archipelago: you’ve got a bunch of different islands (or teams), each with its own microclimate and species of flora and fauna. In one corner, you’ve got your well-established, “big and consolidated product” island. Here, the name of the game is ‘incremental maximalism’ — a fancy way of saying we’re optimizing the hell out of what we already have. Now, hop on a metaphorical boat and sail over to the “new offshoots” island. This is your R&D lab, your skunkworks, your “what if we tried this crazy thing” territory. Here, you don’t get a pat on the back for a 0.5% gain; you get it for demonstrating that a radical new idea has legs. The risk is higher, sure, but so is the potential payoff. Failure isn’t just accepted; it’s expected.

So why does this archipelago model make sense? Because it allows a single organization to wear different hats simultaneously. One hat for “We need to improve what we’re already doing” and another for “Let’s shoot for the moon and try something that could be game-changing.” And because these units are autonomous, it becomes easier to specify separate incentive structures and cultures. In this model, you can’t have the R&D department reporting to the CFO who wants to optimize for 2% more revenue in the next quarterly call.

This is obviously a simplification, given that collaboration and cross-pollination between the islands is essential. You can’t just have a team doing deep tech for decades without having a strategy for how that would consolidate and grow your moat.

Replacing dashboards with experiments

When running an experiment, it’s common to define narrowly focused metrics that measure the specific impact of an initiative. These metrics are useful for understanding how well the experiment is performing in relation to its predefined objectives. However, a problem arises when teams perform their impact reporting purely in the context of experiments.

Let’s go through a few scenarios to understand why it is important to look at real metrics rather than relying only on reported cumulative experiment wins; a short sketch after the list makes the comparison concrete.

Different scenarios for how your real metric might look in contrast to the reported experiment “wins”:
  • Scenario 1: The ideal case is when growth in the real metric matches the growth reported from experiment changes. If you report an absolute 10% lift in conversion from your experiments, most people in your company will expect the company-level metric to move by roughly the same amount.
  • Scenario 2: In this situation, you report experimental gains while overall business performance metrics remain unchanged. This disparity can raise legitimate concerns about the accuracy of your experiment’s claimed success.
  • Scenario 3: Here, you report experimental successes even as the company’s overall metric is on the decline. Such a discrepancy often suggests that there are fundamental issues requiring immediate attention; relying solely on experimental metrics in this case can be misleading.
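
As a rough illustration of scenario 2, the sketch below compounds the lift claimed by a series of shipped experiments and compares it with the conversion rate actually observed in production. All numbers are invented for the example; a real check would pull claimed lifts from your experiment archive and the observed metric from production data.

```python
# Hypothetical per-experiment lifts reported at ship time (relative changes).
claimed_lifts = [0.02, 0.015, 0.03, 0.01, 0.025]

# Compounded conversion you would expect if every claim held up.
baseline_conversion = 0.040
expected = baseline_conversion
for lift in claimed_lifts:
    expected *= 1 + lift

# What the company dashboard actually shows a quarter later (made up).
observed = 0.041

print(f"Baseline conversion:        {baseline_conversion:.2%}")
print(f"Implied by experiment wins: {expected:.2%}")
print(f"Observed in production:     {observed:.2%}")
print(f"Unexplained gap:            {expected - observed:.2%}")
```

When the implied and observed numbers diverge like this, the usual suspects include novelty effects, interactions between experiments, winner’s-curse inflation of measured lifts, or the kind of budget reshuffling described next.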

The discrepancy I’m describing above is not as rare as some might assume. For example, in many companies, the marketing and sales budgets are predefined ahead of time. If you find a way to save 20% of those costs through more efficient spending, you might find that the folks controlling those budgets immediately spend the surplus on other, similarly cost-inefficient areas, netting you zero impact.

In summary, while narrowly focused experiments offer critical insights, it’s crucial to corroborate these findings by examining them in the context of more comprehensive, company-wide performance metrics. This holistic approach provides a more accurate representation of how well both the experiment and the organization are truly performing.

Dead Code

Consider this scenario: your organization runs 5,000 experiments a year, and even if just 20% of them don’t make it to production, that’s 1,000 dead ends, each potentially leaving behind traces of code that serve no purpose. This can include unused functions, irrelevant variables, and deprecated feature flags. Now, multiply this by several years of operation, and you have a sizable labyrinth of dead code paths and branches cluttering your codebase. The time and resources needed to clean up this ‘digital detritus’ can be considerable and distract from value-adding activities.

The problem is that experimentation significantly amplifies the rate at which this dead code accumulates. Each experiment that doesn’t pan out adds to the growing pile, making the codebase more difficult to navigate, understand, and maintain. Note that dead code is not dissimilar to the software cruft that Martin Fowler describes.
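
One way to keep the pile from growing is to routinely hunt for references to flags whose experiments have already concluded. Below is a minimal sketch under made-up assumptions: the is_enabled("flag_name") call pattern, the flag names, and the FLAG_REGISTRY are hypothetical stand-ins for whatever your feature-flagging system actually exposes, and most real flag platforms ship their own stale-flag tooling.

```python
import pathlib
import re

# Hypothetical registry: flag name -> whether the experiment has concluded.
FLAG_REGISTRY = {
    "new_onboarding_flow": True,    # shipped or killed months ago
    "checkout_redesign_v2": False,  # still running
}

FLAG_PATTERN = re.compile(r"is_enabled\(\s*['\"](\w+)['\"]\s*\)")

def find_stale_flag_references(repo_root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, flag) for every reference to a concluded flag."""
    stale = []
    for path in pathlib.Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for flag in FLAG_PATTERN.findall(line):
                if FLAG_REGISTRY.get(flag):  # concluded -> this branch is dead code
                    stale.append((str(path), lineno, flag))
    return stale

if __name__ == "__main__":
    for file, lineno, flag in find_stale_flag_references("."):
        print(f"{file}:{lineno}: stale flag '{flag}'")
```

Running something like this in CI, and treating stale references as lint failures, keeps the ‘digital detritus’ from compounding year over year.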

Conclusion

In wrapping this up: experimentation isn’t the silver bullet some make it out to be. Don’t get me wrong — I’m not waving a flag for us to retreat from the front lines of A/B tests and iterative changes. All I’m pointing out is that experimentation has its downsides when it’s taken to an extreme. Some are subtle, like the constant paper cuts of never-ending tweaks that might erode the user experience. Some are more profound, like the risk of “snacking” your way into a local maximum while ignoring game-changing opportunities.

I’m all for leaning into the scientific method to make products better. But we owe it to ourselves — and our users — to remember that experimentation isn’t the be-all and end-all. It’s a tool in a broader toolkit, and like any tool, its utility is defined by the skill and wisdom with which it’s used.
