Whack the Mole

I once worked on a very large software project. Over time, our bug list grew larger and larger, until everyone realized that something had to be done. A team was dispatched to clean up the bug list; to either fix the bugs or remove them from the list. A couple of months later, management announced with great fanfare that the bug list had been reduced from around a thousand bugs to some six hundred, thus reducing the number of bugs from astronomical to merely unmanageable.

... the Development Team is in the throes of the Production Episode, adding new functionality to the code. You have, perhaps, committed to Good Housekeeping so you can sleep at night. For a Cross-Functional Team, development feels like five steps forward and one step back: problems and emergent requirements interrupt progress and cause the team to lose focus. The team has a Definition of Done that filters out product issues at some level.

✥       ✥       ✥ 

There is always a tension between advancing product functionality and raising product quality.

Business pressures tend to make us view engineering problems, software bugs, and manufacturing line irregularities as necessary evils. We see them as distractions that lie outside the Sprint. And because developers really like to do new stuff, current product problems are often smoothed over, or the team postpones resolving them until the tomorrow that never comes. In software, such “small” issues often live under the radar of the tests that enforce Good Housekeeping. And even if one has the discipline to do Good Housekeeping comprehensively, one can’t afford to defer resolving issues until the end of the day: it’s hard to know how much time to set aside.

Fixing issues takes time, so we often defer such work. We believe that the market benefit is not worth the effort to fix them, or that the “more important”, revenue-generating work will be displaced. However, McConnell [1] has shown that bugs in software slow down the Development Team because they cause “stumbling” and workarounds that drag on development. These impediments actually slow down development that isn’t directly related to fixing bugs.

We could administratively define issue repair as “real work” to incentivize the Development Team to turn their energies to ongoing repair instead of focusing only on new things. However, developers in a healthy development context find intrinsic motivation to repair issues. And we want to avoid the administrative overhead of tracking issues: the tracking sometimes costs more than the repair itself.

Collecting all issues into a “cleanup Sprint” leaves the product in a broken state until that Sprint arrives. And even completing such a Sprint doesn’t move the team along its Product Roadmap. It only makes it visible that the team should feel guilty about where they believe they stand on the Roadmap: previously delivered PBIs were in fact not properly delivered.

There is some cost of switching context from PBI development to issue mitigation. However, when developers see an issue, they are motivated to fix it now, while it is fresh in their minds and they are still in touch with its context. The longer the team postpones the change (until later in the day, or until a subsequent Sprint), the more expensive it becomes. Subsequent changes in the environment or in the product may make it challenging to reproduce the issue later.

Issues that aren’t fixed now tend to accumulate and become forgotten or lost in a defect tracking system. Some issues may become legacy components of the product as technical debt grows, maintainability suffers, and quality drops. Keeping an inventory is bad enough: keeping an inventory of defect descriptions is very un-lean. And keeping an issue work backlog separate from the backlog of value-generating work items makes it impossible for the team to know the total ordering of all work (issue-repairing and new value-generating). This situation can lead to one of two extremes: either issue resolution becomes a second-class citizen or the team becomes a fire-fighting team.

Bob Martin relates a story of once fixing a spelling error in an application. However, many customers had built screen capture scripts that depended on the misspelled word. He was ordered to “unfix” the misspelled word.

One of our clients found that they awoke one morning to 2000 Category 1 (highest priority) bugs in their bug tracking system and decided to launch a quality Sprint to reduce that number by 60%. Incentivized by the corporate reward structure, the teams met their goal — by reclassifying about 60% of those bugs as Category 2 bugs.


Immediately resolve product problems, big and small, as they arise. Fixing a product that is broken or does the wrong thing should be considered a higher priority than enhancing delivered functionality. After all, the presence of any issue means that some aspect of product value has not been or cannot completely be delivered. If it’s not clear whether the problem should be fixed, then the lack of clarity is itself a problem worth redressing immediately.

It doesn’t matter whether the issue was introduced in this Sprint or in a previous Sprint: to the market, an issue is an issue. And it doesn’t matter whether the issue was found in development or in the field: an issue is still an issue. Ensure that you have a good reporting path for issues up and down the Value Stream to give immediate visibility to all issues.

Developers should drop what they are doing and address product issues when they come to the attention of the Development Team. They should spend an agreed maximum number of hours (e.g., four hours) on the issue. Work should start on the issue without any significant engagement with the Product Owner, though the involved developers should make the effort visible to the rest of the Team through information radiators and the Daily Scrum. If the Development Team cannot remove the cause of the issue in the agreed time box, the team escalates to the Product Owner. The Product Owner quickly decides whether to continue work on the issue at the potential expense of failing to complete all remaining PBIs in the Sprint, or alternatively creates a PBI for the issue and puts it visibly on the Product Backlog. If the Sprint Goal is at stake, the team may go into Emergency Procedure.
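The time-boxed escalation described above can be sketched as a small decision function. This is only an illustration of the flow, not a prescribed tool: the names `Action`, `IssueStatus`, `next_action`, and `po_continues` are hypothetical, the four-hour default is just the example figure from the text, and treating the Sprint Goal check as taking precedence over the time box is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    KEEP_FIXING = auto()          # still inside the agreed time box
    DONE = auto()                 # cause of the issue has been removed
    ESCALATE_TO_PO = auto()       # time box exhausted, Product Owner must decide
    CONTINUE_IN_SPRINT = auto()   # PO accepts the risk to the remaining PBIs
    ADD_PBI_TO_BACKLOG = auto()   # PO puts the issue visibly on the Product Backlog
    EMERGENCY_PROCEDURE = auto()  # Sprint Goal is at stake


@dataclass
class IssueStatus:
    hours_spent: float
    cause_removed: bool
    sprint_goal_at_stake: bool = False


def next_action(status: IssueStatus,
                time_box_hours: float = 4.0,
                po_continues: Optional[bool] = None) -> Action:
    """Return the next step for an issue under the time-boxed flow.

    po_continues is None until the Product Owner has decided.
    """
    if status.cause_removed:
        return Action.DONE
    # Assumption: a threatened Sprint Goal is checked before the time box.
    # The text says only that the team *may* go into Emergency Procedure.
    if status.sprint_goal_at_stake:
        return Action.EMERGENCY_PROCEDURE
    if status.hours_spent < time_box_hours:
        return Action.KEEP_FIXING
    if po_continues is None:
        return Action.ESCALATE_TO_PO
    return Action.CONTINUE_IN_SPRINT if po_continues else Action.ADD_PBI_TO_BACKLOG
```

For example, `next_action(IssueStatus(hours_spent=4.0, cause_removed=False))` yields `Action.ESCALATE_TO_PO`. The Product Owner's answer then selects between continuing in the Sprint and creating a PBI.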

✥       ✥       ✥ 

One or two team members start work on the issue, so that the rest of the team can continue making progress on a product increment that will generate new value; this helps better manage overall risk. See Team per Task.

Good discipline in use of this pattern will enable Good Housekeeping at the end of the day, and will demonstrate how serious the team is about Good Housekeeping. The end result is a heightened focus on the integrity of the Product Increment. Also, this pattern has a close relationship to Definition of Done. A good Definition of Done will limit discussions about whether an issue is really an issue or not. The definition should continuously be extended to improve the immediacy of product repair during production. And it’s important that the fix itself meet the Definition of Done. Just as importantly, it should not be a temporary patch that contributes to long-term technical debt.

This approach means that you can do away with much issue administration, including the management of dependencies related to issues or issue prioritization.

This pattern can’t retire existing technical debt. The team can slowly retire technical debt by refactoring, or work on it can be directed to specific product areas as Product Backlog Items. Software refactoring is a form of this pattern at the micro level.

If this pattern causes the Development Team to do nothing but fix issues during the Sprint — which means unendingly deferring PBI development — it points to a serious quality problem that should probably cause the Product Owner to escalate Emergency Procedure all the way to Abnormal Sprint Termination. It should also result in serious head-holding during the Sprint Retrospective. See Illegitimus Non Interruptus as a high-discipline interim measure. Illegitimus Non Interruptus also helps moderate the effort put into emergent issue repair during the Production Episode. It does this by giving the Product Owner the option to consciously and visibly defer the resolution of a given issue until a later Sprint.

There are two distinct steps: fixing the issue, and fixing the process. The Scrum Team can address process changes after a calming period that follows the flurries of Sprint activity. Issues themselves can be identified and repaired directly in terms of cause and effect; the corresponding process changes must consider broader context and subtle issues of organization, history, culture, and so forth. Process changes usually require more deliberation and focus than the team can muster in the middle of a Sprint, and they often require Product Owner intervention, something we want to minimize during the Sprint. Except in obvious cases, leave process changes to the Sprint Retrospective with consideration for implementation in subsequent Sprints.

There are many supporting patterns that deal with competing priorities along the Value Stream. Daily Scrum orders the work items on this particular area of the Value Stream to give priority to plugging holes in the dyke. Good Housekeeping ensures that the workspace itself doesn’t become flooded. Illegitimus Non Interruptus prevents the dyke from ending up with so many holes that it collapses. Emergency Procedure flushes work in progress and redirects the flow.

See also Interrupts Unjam Blocking and Sacrifice One Person.

[1] Steve McConnell. “Software Quality at Top Speed.” In Software Development 4(8), August 1996, pp. 38-42.

Picture from: Sam Howzit, https://www.flickr.com/photos/aloha75/14677050240 (under CC BY 2.0 license).