Fixing the false positive problem

With all the quality automation that is your responsibility, a check run has failed. It is your job to look into it.

After 30 minutes or so of investigation, you find that the failure happened because an external dependency took too long to respond. There is nothing the product team can do about this, so you either ignore the failure and move on (until the next time this happens), add code to try to tighten up the reporting in the automation code, or increase the timeout for the request (which can just make the check take longer to fail).

You’ve been hit with a “false positive”: a positive signal from the automation, but the signal is false because there’s nothing to do here except waste your time and the team’s money on the obligatory follow-up. False positives are common with “flaky” automation.

The team loses trust in quality automation that cries “Wolf!”

Imagine, though: what if false positive signals never interrupted the team’s work?

MetaAutomation shows how to fix the false positive problem.

The post here shows how to preserve all the information, following a pattern that we all use already.

This information shows the root cause of any check failure precisely, from the perspective of, and with the detail given by, the system driving the SUT. With the quality automation system correctly configured to decide whether to retry (using the Smart Retry pattern, shown in the pattern map here), transient failures due to dependencies outside the team’s ownership or control are prevented from interrupting anybody’s workflow.
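As a sketch of what preserving that information might look like in code, the fragment below records each step of a check hierarchically, the way a call stack does, so the artifact shows exactly where a failure happened and why. The names (CheckStep, to_json, and so on) are illustrative assumptions, not part of any MetaAutomation library.

```python
import json
import time
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CheckStep:
    """One step of a check, recorded hierarchically like a call stack."""
    name: str
    children: List["CheckStep"] = field(default_factory=list)
    status: str = "not run"          # "pass", "fail", or "not run"
    elapsed_ms: Optional[float] = None
    detail: Optional[str] = None     # e.g., exception text on failure

    def child(self, name: str) -> "CheckStep":
        """Create and attach a nested step."""
        step = CheckStep(name)
        self.children.append(step)
        return step

    def run(self, action) -> None:
        """Run the step, timing it and recording pass/fail plus detail."""
        start = time.monotonic()
        try:
            action(self)
            self.status = "pass"
        except Exception as exc:
            self.status = "fail"
            self.detail = repr(exc)
            raise
        finally:
            self.elapsed_ms = (time.monotonic() - start) * 1000.0

    def to_json(self) -> str:
        """Serialize the whole step tree as an artifact for later queries."""
        def as_dict(s: "CheckStep") -> dict:
            return {
                "name": s.name,
                "status": s.status,
                "elapsed_ms": s.elapsed_ms,
                "detail": s.detail,
                "children": [as_dict(c) for c in s.children],
            }
        return json.dumps(as_dict(self), indent=2)
```

With an artifact like this, a failure while waiting on an external dependency is distinguishable from a failure in code the team owns, which is exactly the distinction the retry decision needs.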

For example, if there is a failure and an exception is thrown from a unit owned by the team, then, depending on how the quality team chooses to configure the retry, the check might not be retried at all but rather reported directly with the results of the check run.

If there is a failure due to a race condition in a GUI that is beyond the team’s control, and the failure is not reproduced on retry, all of the artifact data from the check runs is saved anyway (for later analysis, through the Queryable Quality pattern), but the check is not reported as a failure.

If there is a failure due to a race condition in the GUI that is retried and reproduced, then that failure is reported (see the Automated Triage pattern), and the action item might be either to increase the timeout for finding the GUI object or, if the GUI object does not appear due to a product failure, to file a bug and link it to the check that failed. In either case, the potentially flaky check does not notify people of failures that are not actionable.

Given a failure from deep in the SUT that can’t be reproduced, there would appear to be an undiscovered race condition; the action item in that case would be to file a bug that is detailed enough to be queried later as needed.
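Pulling those cases together, here is a minimal sketch of how a Smart Retry decision might be wired up. The failure categories, the function names (run_check, classify, save_artifact, and so on), and the specific policy are assumptions for illustration; the point is only that the decision to retry, report, or quietly archive is made by configured code rather than by a person.

```python
from enum import Enum, auto

class FailureSource(Enum):
    TEAM_OWNED_UNIT = auto()      # exception from code the team owns
    EXTERNAL_DEPENDENCY = auto()  # e.g., a slow or unavailable external service
    GUI_RACE = auto()             # timing issue in a GUI beyond the team's control
    DEEP_IN_SUT = auto()          # failure from deep in the SUT, cause unknown

def smart_retry(run_check, classify, save_artifact, report_failure, file_bug):
    """Run a check once, then retry and report according to the failure source.

    run_check()       -> structured artifact data from one check run
    classify(a)       -> a FailureSource, or None if the check passed
    save_artifact(a)  -> archive the artifact for later queries (Queryable Quality)
    report_failure(a) -> surface an actionable failure (Automated Triage)
    file_bug(a)       -> record a detailed, queryable bug
    """
    artifact = run_check()
    save_artifact(artifact)               # keep the data from every run
    source = classify(artifact)
    if source is None:
        return "pass"

    if source is FailureSource.TEAM_OWNED_UNIT:
        report_failure(artifact)          # actionable as-is; no retry needed
        return "reported"

    # Transient-looking failure: retry once before interrupting anybody.
    retry_artifact = run_check()
    save_artifact(retry_artifact)
    if classify(retry_artifact) is None:
        if source is FailureSource.DEEP_IN_SUT:
            file_bug(artifact)            # likely an undiscovered race condition
        return "pass-on-retry"            # not reported as a failure

    report_failure(retry_artifact)        # reproduced, therefore actionable
    return "reported"
```

The exact policy (which failure sources get retried, how many times, and which get a bug filed automatically) is a configuration choice for the quality team, as described above.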

If Smart Retry is configured correctly, the false positive problem goes away and the checks will generally run faster, too.

The next post shows how to fix the false negative problem.
