Pattern: Scenario Testing

Cem Kaner, Florida Institute of Technology

The Problem

Develop a validation test that is credible, persuasive, powerful, and manageable.

Scope: Faults of Interest

This testing technique is best used to detect errors in the requirements analysis (as reflected in the running program), in the implementation of something that was intended (a failure to deliver an intended benefit or intended restriction), or in the interaction of two or more features.

Simpler faults (e.g., a single feature doesn't work, a simple benefit is simply missing or erroneously implemented) are best discovered with other techniques, such as domain-based testing, specification-based testing, or function testing.

Broader Context

Early in testing, relatively simple tests will fail a program. For example, the program might fail in response to an input that is too large, a delay that is too long, a typist who is too fast, etc. Random input tests (such as dumb monkeys) might fail the program by repeatedly triggering memory-leaking code or wild pointers. Mechanically derived combinations of inputs or configurations (such as combinations derived from all-pair test design) might be challenging for the program.
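
A dumb monkey of the kind mentioned above can be sketched in a few lines. The `parse_quantity` function here is a hypothetical stand-in for the program under test; the monkey simply hammers it with random strings and records any failure that is not a documented, expected one:

```python
import random
import string

def parse_quantity(text):
    """Toy function under test (a hypothetical stand-in for a real program)."""
    value = int(text.strip())
    if not (0 <= value <= 10_000):
        raise ValueError("quantity out of range")
    return value

def dumb_monkey(fn, iterations=1000, seed=42):
    """Throw random printable input at fn; collect only unexpected crashes."""
    rng = random.Random(seed)
    unexpected = []
    for _ in range(iterations):
        text = "".join(rng.choice(string.printable)
                       for _ in range(rng.randint(0, 40)))
        try:
            fn(text)
        except ValueError:
            pass  # documented, expected failure mode
        except Exception as exc:
            unexpected.append((text, exc))  # anything else is a bug report
    return unexpected

crashes = dumb_monkey(parse_quantity)
```

A monkey like this knows nothing about the program's purpose, which is exactly why it stops being productive once the easy crashes are fixed.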

Eventually, the program can withstand the tests that are easy to imagine, implement, and run. At this point, we can start asking whether the program is any good (has value) rather than whether it is obviously bad.

The objective of this type of testing is to prove that the program will fail when asked to do real work (significant tasks) by an experienced user. A failure at this level is a validation failure (a failure to meet the stated or implicit program requirements).

Immediate Context

"All" of the features have been tested in isolation. (More precisely, all of the features that will be called within this scenario have been tested on their own and as far as we can tell, none of them has an error that will block this scenario test.)

The tester must have sufficient knowledge of the domain (e.g., accounting, if this is an accounting program) and of many of the ways in which skilled users will use the program.

Forces / Challenges

The fundamental challenge of all software testing is the time tradeoff. There is never enough time to do all of the testing, test planning, test documentation, test result reporting, and other test-related work that you rationally want to do. Any minute you spend on one task is a minute that cannot be spent on the other tasks. Once a program has become reasonably stable, you have the potential to put it through complex, challenging tests. It can take a lot of time to learn enough about the customers, the environment, the risks, the subject matter of the program, etc. in order to write truly challenging and informative tests.

Solution: The Scenario Test

The ideal scenario test has four attributes:

  1. The test is realistic (and therefore credible). You know that this is something that a real user would attempt. You might know this from use case analysis, focus groups, monitoring of actual use of the program over time, discussions with experienced customers, or from other models or sources of empirical evidence.
  2. The test is complex. It combines two or more features (or inputs or attributes—two or more things that we could test separately) and uses them in a way that seems as though it should be challenging for the program.
  3. It is easy to tell whether the program passed or failed the test. If a person has to spend significant time or effort to determine whether the program passed or failed a series of tests, she will take shortcuts and find ways to guess less expensively whether the program is OK or not. These shortcuts will typically be imperfectly accurate (that is, they may miss obvious bugs or they may flag correct code as erroneous).
  4. At least one stakeholder who has power will consider it a serious failure if the program cannot pass a given scenario.

In practice, many scenarios will be weak in at least one of these attributes, but people will still call them scenarios. The key message of this pattern is that you should keep these four attributes in mind when you design a scenario test and try hard to achieve them.
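
As a minimal illustration of attributes 2 and 3 working together, the sketch below combines several features of a hypothetical `Ledger` class and checks the outcome with a cheap invariant (transfers only move money, never create it), so pass or fail is obvious at a glance. All names here are invented for illustration:

```python
class Ledger:
    """Hypothetical system under test: accounts holding integer cents."""
    def __init__(self):
        self.accounts = {}

    def open(self, name, cents=0):
        self.accounts[name] = cents

    def transfer(self, src, dst, cents):
        if cents < 0 or self.accounts[src] < cents:
            raise ValueError("invalid transfer")
        self.accounts[src] -= cents
        self.accounts[dst] += cents

def scenario_month_end_close():
    """A multi-feature scenario with an invariant oracle for easy checking."""
    ledger = Ledger()
    ledger.open("checking", 100_000)
    ledger.open("savings", 250_000)
    ledger.open("payroll")
    total_before = sum(ledger.accounts.values())

    ledger.transfer("checking", "payroll", 40_000)   # run payroll
    ledger.transfer("savings", "checking", 75_000)   # cover operating costs
    ledger.transfer("payroll", "savings", 5_000)     # sweep the remainder

    # invariant oracle: the grand total must be conserved
    assert sum(ledger.accounts.values()) == total_before
    return ledger.accounts

result = scenario_month_end_close()
```

The invariant does not prove every balance is right, but it catches a broad class of transfer bugs without the tester doing any arithmetic by hand.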

Resulting Context

I'll leave this for now, in favor of "Risks"

Risks

The key remaining problem is coverage. There is nothing inherent in scenario testing that assures good coverage (by any measure) and it is common to hear that a testing effort driven by scenarios achieved only 30% line coverage.

It is also often the case that scenario tests don't look carefully enough at common user errors or failure scenarios or the situations caused by disfavored users (see example 1 below).

Rationale / History

In The Art of Software Testing, Glenford Myers described a series of ineffective tests: 35% of the bugs reported from the field had been exposed by a test, but the tester didn't notice or didn't appreciate the failure, and so the bug escaped into the field. These (and many others with the same problems) appear to be scenario-like tests (such as transaction-flow tests that use customer data) in which the expectation is that testers will do their own calculations to determine whether the printouts or other output are correct. The testers check some results, a small sample of the tests, but miss defects in the many tests whose results were not checked. The complexity of the tests makes it much harder to work out expected results and check the program against them. The push toward ease of checking results stems from this. You might make it easy to check results by providing an oracle, a set of worked results, internal consistency checks, or various other means. The point of this pattern (and of discussions like Myers') is that you must pay attention to this issue or the tests will expose defects that no human recognizes.
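
One way to provide such an oracle is a second, independent calculation that would be too slow or too simple for production but is obviously correct. Both routines below are hypothetical illustrations: a fast integer-arithmetic interest calculation checked against an exact rational-arithmetic reference:

```python
from fractions import Fraction
import itertools

def interest_under_test(principal_cents, rate_bps, days):
    # imagined production routine: fast integer arithmetic
    return principal_cents * rate_bps * days // (10_000 * 365)

def interest_oracle(principal_cents, rate_bps, days):
    # independent reference: exact rational arithmetic, truncated at the end
    return int(Fraction(principal_cents)
               * Fraction(rate_bps, 10_000)
               * Fraction(days, 365))

# compare across a grid of boundary-ish inputs;
# any mismatch is a failure a human must investigate
mismatches = [
    (p, r, d)
    for p, r, d in itertools.product((0, 1, 99_999, 1_000_000),
                                     (0, 1, 525, 10_000),
                                     (1, 30, 365, 366))
    if interest_under_test(p, r, d) != interest_oracle(p, r, d)
]
```

Because the oracle is written independently, the two routines are unlikely to share the same bug, and checking thousands of results costs the tester nothing.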

Another historical problem is the creation of complex tests that appear artificial. It is extremely demoralizing to spend up to a week building and running a complex test, find a bug, and discover that no one thinks it is important. The stress on writing/designing with a stakeholder in mind comes from this concern.

When I teach this in classes (or to clients I've consulted for), the most common objection is that testers don't necessarily know which failures will catch the attention of which people. Sometimes (for example, when you work for an independent test lab that has little contact with the client), it is very hard to learn anything about the stakeholders. But even when this is difficult, it is worth asking the questions: What would interest a reader of a test result report? How can I design this test to yield results that would be more compelling? What would marketing care about? (For example, can you set up a test that looks like something that might be done by the company's single largest customer? Or by a not-so-friendly journalist?) What has been driving our tech support manager crazy? Asking these questions might help you change aspects of the test design in ways that do not go to the integrity of the test but that slightly or significantly change the persuasive value of the result. In my experience, as you show that you are paying attention to the interests of others, you get feedback that makes you more and more aware of those interests.

The problem of real-life testing is the problem of credibility. Any complex test is open to dismissal ("no one would do that," "corner case," "artificial," etc.). Designing tests based on use cases, customer support data, examples of actual things done with competitors' products or with your product previously, etc., makes these tests much more credible. Another issue is that the population of possible scenario tests is virtually infinite. Some of the tests should be designed to reflect real uses, because otherwise even an over-engineered product may well fail when customers try to do things that seem entirely reasonable to them and to reasonable third parties. Hans Buwalda's Soap Operas are superb examples of real-life focus. (These write the description of the test case into a plausible story.)

These issues run throughout the test design literature and conference talks. All that this definition of scenario testing does is gather the concepts together in a way that reflects practices that I have seen as strong and effective in several companies (in domains including telephony, consumer game software, business-critical financial application software, and office productivity software).

There is another somewhat related use of the term "scenario." A scenario under this definition is an instance of a use case, or an instance of a concatenation of use cases. We don't have a 1:1 mapping of this type of scenario to the scenario defined here (and in my practice for at least a decade) but the relationship is worth noting. A use case is, by definition, customer realistic. The instance may or may not be complex—many of the examples that I've seen are very simple tests, but others are fully complex scenarios.

Examples of Scenario Tests

Scenario 1. A Security Test

Imagine testing the security features of a browser. You do an analysis of the user community, including favored and disfavored users (see Gause & Weinberg's book, Exploring Requirements). Disfavored users include the population of hackers and crackers. Your product's design objectives include making it more difficult for disfavored users to perform the tasks that they want to do.

You study the types of attacks that have been successful against browsers before and learn from CERT that 65% of the successful attacks against systems involve buffer overruns. You also learn that load testing often has the result of degrading performance unevenly. A part of the system might crash under load, or might be run at a lower priority than the rest. You also note that denial-of-service attacks (in which target systems are put under heavy load) are increasing in frequency and publicity in the mainstream press.

Therefore, you focus your testing on combinations of buffer-overrun and load attacks. Can a skilled hacker disable part of your security system by flooding you with a carefully selected pattern of inputs? (When I say "carefully selected pattern," I mean that your system can be sent many different commands. The system might respond very differently to millions of requests to process forms than to millions of requests to display the home page. Great load testing tries to generate patterns of use that reflect real life; great security-related load testing tries to generate patterns that might be attempted by crackers.) It takes a lot of knowledge to design the security-oriented load tests, and you might interview several people, read many log files, and run experiments on the impact of many different patterns on the performance of various features and of the system as a whole, on reliability, on the logs, etc. You might find several bugs in these simpler tests, but this work (though productive) is in preparation for your primary scenarios. You do similar work on buffer overruns, sending many types of input, trying to trigger failures caused by excessive input or by calculations that result in excessive intermediate or final results. Again, you might find errors as you go, but your primary goal is to identify ways in which the system protects itself from extremes and then to target them by overworking them or by overworking routines that would otherwise have called them.
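
The "carefully selected pattern" can be expressed as a generated schedule before any traffic is sent. In this sketch, the command names, payload sizes, and mix ratios are all invented for illustration; real values would come from studying the target system and its logs:

```python
import random

# boundary-length strings that probe for buffer overruns
# (the sizes are guesses at interesting limits)
OVERRUN_PAYLOADS = ["A" * n for n in (255, 256, 257, 4_096, 65_536)]
HEAVY_COMMANDS = ["process_form", "search", "upload"]  # hypothetical expensive requests
LIGHT_COMMANDS = ["home_page", "ping"]                 # hypothetical cheap requests

def attack_schedule(total=10_000, heavy_fraction=0.9, overrun_every=100, seed=7):
    """Mostly-heavy load with an occasional overrun probe mixed in."""
    rng = random.Random(seed)
    schedule = []
    for i in range(total):
        pool = HEAVY_COMMANDS if rng.random() < heavy_fraction else LIGHT_COMMANDS
        payload = rng.choice(OVERRUN_PAYLOADS) if i % overrun_every == 0 else "x"
        schedule.append((rng.choice(pool), payload))
    return schedule

schedule = attack_schedule()
```

Separating pattern generation from delivery lets you review, vary, and replay the same attack shape while you tune the mix of load against overrun probes.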

Your final series of tests generates massive loads in carefully designed ways and includes files, requests, or inputs that could trigger overflows. The tests also include probes—if a buffer has been overrun, you should be able to see the results: either the system crashes and is no longer available, or you can gain control and do something that you couldn't do before. These probes and diagnostic messages in the server logs are your primary means of detecting failures. When you detect a failure, you might report it directly, or you might subject it to further analysis, troubleshooting, and replication under increasingly simpler conditions.

Your final failure report lays out the security risk, the details of the attack needed, and the consequences if someone does this. If necessary, you include in the initial report (or more likely, at a bug review meeting) newspaper reports of attacks that weren't hugely dissimilar to yours, plus CERT data, discussions of strategy in magazines like 2600, examples gleaned from the RISKS forum and various security-related books and mailing lists, plus other evidence that your tests aren't absurd.

These failures may stand on their own (security failures are of great concern to a lot of companies), but if people respond to your reports with "Who cares?" and "No one would do that," then keep in mind that someone in your company cares about failures like this. Maybe it's your head of marketing, maybe it's the company president, the lead PR staffer, or a senior engineer. You might cc your report to that person, or seek out that person's advice ("How do I report this more effectively?" or "This looks important to me, but what do you think?"), or appeal to that person if the bug is deferred ("Can you review this report for me? Is this an important problem? What more would we have to do to make other readers understand its importance?"). Eventually, if the problem is serious, this person will help you make management confront it, understand it, and make a rational business decision about it.

Scenario 2. Configuration Testing

You are testing a product that will run on a network. It is supposed to work on the usual browsers, operating systems (MS, Linux, UNIX, Apple), with the usual devices (video cards, printers, rodents) and communications devices (modems, ethernet connectors, etc.). The cross-product of all possible valid configurations yields 752 million possible tests, which would take about 40 minutes each in setup and teardown time, plus 50 minutes each in compatibility test time if you run the tests by hand.
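
The arithmetic above, and the all-pairs reduction mentioned earlier, can be sketched on a toy slice of the configuration matrix. The dimension names and values here are illustrative, and the greedy cover below is a simple (not optimal) all-pairs construction:

```python
import itertools

dimensions = {  # an illustrative slice of the real configuration matrix
    "browser": ["IE", "Netscape", "Opera"],
    "os": ["Win2000", "Linux", "MacOS"],
    "video": ["cardA", "cardB"],
    "net": ["modem", "ethernet"],
}

# the full cross-product: every combination tested once
full = 1
for values in dimensions.values():
    full *= len(values)

names = list(dimensions)

# every (dimension, value) pair-of-pairs that all-pairs testing must cover
uncovered = {(a, va, b, vb)
             for a, b in itertools.combinations(names, 2)
             for va in dimensions[a]
             for vb in dimensions[b]}

# greedy cover: repeatedly pick the configuration covering the most new pairs
tests = []
while uncovered:
    best, best_gain = None, -1
    for combo in itertools.product(*dimensions.values()):
        cfg = dict(zip(names, combo))
        gain = sum(1 for a, va, b, vb in uncovered
                   if cfg[a] == va and cfg[b] == vb)
        if gain > best_gain:
            best, best_gain = cfg, gain
    tests.append(best)
    uncovered -= {(a, va, b, vb) for a, va, b, vb in uncovered
                  if best[a] == va and best[b] == vb}
```

Even on this tiny slice, the pairwise suite is a fraction of the cross-product; on the real matrix the reduction is what makes the problem tractable at all, though (as noted above) the mechanically derived suite still needs realistic scenarios layered on top.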

You have already done single-issue compatibility testing, such as printer compatibility when running under a very standard set of other configuration variables (latest IE version, Win 2000, best-selling video card, Dell computer with original equipment, lots of memory, lots of free disk space, the Dell-recommended best-selling ethernet card, etc.). You can't test every instance of every device, but you pick the key printers, the key video cards, etc.

You have also done some fairly mechanical combinations, such as using the most memory-intensive video and printer devices and drivers together. You found some problems this way. Some were fixed. Others were blamed on the manufacturer of a peripheral ("This is their bug, let them fix it!") or dismissed ("No one would set up their system that way.")

Your challenge now is to set up a set of tests that will exercise the system software and the peripherals in a way that is both harsh (likely to expose problems) and realistic. If you pick strange-looking combinations of devices and system software, you search through customer records to show that there exist real humans who have configurations like these, or through manufacturers' sell-you-a-system websites to show that they do sell or will sell systems configured this way. The tests you use include actual customer files (go to tech support) and files based on them (but made harsher), and sequences of tasks that are arguably plausible.

Your next challenge is to determine whether the program has passed or failed. It's not enough to boot the program and set it up with the new configuration. You have to try things, to do tasks that depend in some way on the configuration settings, and then you have to interpret the results. Your strategy might involve test scripts with checks against expected results, test oracles, printouts that show exactly what the tester should expect to see on the screen, or other methods for detecting a failure. There are so many tests, and they are so broken up by setup and teardown, that you show respect in your test design for the fact that testers will become inattentive (perhaps because they are tired or very bored) after several hours or days of this type of testing. Comparisons must therefore be obvious, quick, or automated.
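
One of those methods, comparison against worked ("golden") results, is easy to automate so the check is both quick and obvious. The report structure below is invented for illustration; the point is that the tester reads a short diff, not two full printouts:

```python
def compare_to_golden(actual, golden):
    """Return a list of (key, expected, got) differences; empty means pass."""
    diffs = []
    for key in sorted(set(actual) | set(golden)):
        if actual.get(key) != golden.get(key):
            diffs.append((key, golden.get(key), actual.get(key)))
    return diffs

# worked results, prepared once by hand or captured from a trusted earlier run
golden = {"pages_printed": 3, "resolution": "600dpi", "duplex": True}

# output captured from the configuration under test
actual = {"pages_printed": 3, "resolution": "300dpi", "duplex": True}

diffs = compare_to_golden(actual, golden)
```

A bored tester at hour six of configuration testing can act on "resolution: expected 600dpi, got 300dpi" far more reliably than on a side-by-side visual comparison.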