Software as Hammer, MBT/PBT

Is Software Like a Hammer?

Let's start with a short discussion.

Some have said that software is like a hammer: it's a morally neutral tool that can be used for either good or evil. You can build a house with a hammer, or you can commit murder.

Discussion: What do you think about this?

It's an appealing argument! In some ways, software might indeed be viewed as morally neutral. But, personally, I think that the "hammer" analogy is too glib.

My father was a carpenter, and we had a table saw in our garage. A table saw is a terrifying piece of machinery; it's got a large rotary blade that slices through wood as you push it along the table. It's also a complex piece of machinery: there are various levers and dials to control the height of the blade and other factors, but also numerous safety features. Dad's saw didn't have as many as modern saws do, but there was still an emergency stop button and a blade guard.

My point is: software isn't a hammer. Software is more like a table saw. If you don't think about safety when you design a table saw, people are going to get badly hurt. Yeah, you can still hurt yourself with a hammer (I have, multiple times). But the greater the degree of potential harm, the more careful thought and added safety-features are called for.

So don't get lured in by "software is like a hammer" statements. Something can be morally neutral in itself and still warrant careful design for safety (or a decision not to make it at all). There's a quote from Robert Oppenheimer I like, from the "personnel hearing" in which he lost his clearance:

"When you see something that is technically sweet, you go ahead and do it and argue about what to do about it only after you’ve had your technical success. That is the way it was with the atomic bomb."

Beyond Fuzz Testing

We previously saw fuzz testing, which sent randomly generated inputs to a program under test, and checked whether the program crashed. We can do a lot more with this idea of random inputs. But to do so, we need a broader view of what "testing" means.

In fuzz testing, we didn't pay much attention to the output of the program. We just checked that it didn't raise an exception. The problem with doing more is that, if we've got randomly-generated inputs, we can't pair them with manually-generated outputs. So now the question is: where can we get "correct" outputs from?

What if we had a known-good baseline implementation of the thing we were trying to optimize? Consider building one of those clever, super-optimized implementations of a basic arithmetic function. These matter in a lot of domains (e.g., graphics). Suppose you're testing:

float fastFloor(float value) {
    // ...
}

How long do you think it would take to loop through every possible primitive 32-bit float value, and check whether or not fastFloor returns the same value as Math.floor?

Think, then click!

About 90 seconds, and that's on a computer from several years ago. Several years ago.

Here, we don't even need random inputs! Of course, if the function took 64-bit double values, the story would be different, because there are too many of those to loop through exhaustively in a reasonable timeframe. So, in that case, we'd fall back to checking random inputs.
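
For the curious, here's a minimal sketch of that exhaustive check. It treats fastFloor as the stub above; everything else is standard Java.

// Every 32-bit pattern is the bit representation of some float, so looping
// over all int values visits every possible float input exactly once.
long mismatches = 0;
for (long bits = Integer.MIN_VALUE; bits <= Integer.MAX_VALUE; bits++) {
    float value = Float.intBitsToFloat((int) bits);
    // Compare bit patterns so that NaN and -0.0f are handled consistently.
    if (Float.floatToIntBits(fastFloor(value))
            != Float.floatToIntBits((float) Math.floor(value))) {
        mismatches++;
    }
}
System.out.println("mismatches: " + mismatches);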

Model-Based Testing

Model-based testing uses a separate artifact that models the desired behavior of the system, either as an oracle for correctness or as a means of guiding test-input generation. Today we'll look at the former: see the livecode for today, where we'll test my implementation of bubble sort. Concretely:

  • My bubble-sort method is the program under test. We want to evaluate whether or not it is correct.
  • Java's standard Collections.sort method is the known-good artifact. It will serve as our oracle of correctness.
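
Here's a minimal sketch of what that oracle-based test might look like, assuming (as in the livecode) that bubbleSort sorts a list in place; the method name, list sizes, and value ranges here are illustrative. It uses java.util plus JUnit's @Test and assertEquals.

// Model-based testing: compare the implementation under test against a
// trusted oracle on many randomly generated inputs.
@Test
public void mbtSortIntegers() {
    Random r = new Random();
    for (int trial = 0; trial < 10000; trial++) {
        // generate a small random input list
        List<Integer> input = new ArrayList<>();
        int length = r.nextInt(10);
        for (int i = 0; i < length; i++) {
            input.add(r.nextInt(21) - 10);
        }

        // run the program under test
        List<Integer> actual = new ArrayList<>(input);
        ExampleSort.bubbleSort(actual);

        // run the known-good oracle
        List<Integer> expected = new ArrayList<>(input);
        Collections.sort(expected);

        // the oracle defines what "correct" looks like for this input
        assertEquals(expected, actual);
    }
}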

Something very important has just happened. Pay attention!

What just happened?

We have stopped talking about testing against specific outputs, and have started testing more abstract properties of outputs. In fuzz testing, the property was "doesn't crash". For (this type of) model-based testing, we want to check whether the output lists are exactly the same. But we don't need to encode a specific output, which lets us use randomness to generate inputs and mine for bugs while we sleep.

Other kinds of model-based testing include using modeling languages to sketch desired behavior, or using something like a state machine model to generate traces of user interactions to test (e.g.) data structures or user interfaces.

Be Careful!

How you choose to use the oracle matters. Using equality to compare a result from the oracle and a result from the implementation under test is great sometimes, but not always. To see why, try randomly generating more than just integers—in particular, try records. The code is in the repository. What happens, and why?

Think, then click!

Our comparison fails! The Java-library sort and my bubble sort implementation disagree on something. But this disagreement can only happen for more complicated data than integers. Here's an example:

org.opentest4j.AssertionFailedError: 
Expected :[Student[id=-4, credits=0], Student[id=5, credits=5], Student[id=5, credits=-4]]
Actual   :[Student[id=-4, credits=0], Student[id=5, credits=-4], Student[id=5, credits=5]]

We have two students who, by the definition of our compare method, ought to be considered equal: they have the same student ID! (Note that this is regardless of whether we implemented .equals for the record type.) But the two sorts produced different sub-orderings for those students. One of these sorts is a stable sort: it preserves the original order of equal elements. The other is not, and may re-order equal elements.

And yet, both sorts are correct (assuming we didn't require the sort to be stable).
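
For the curious, the situation looks roughly like this (the record and comparator below are a sketch in the spirit of the livecode, not copied from it):

// Two students that tie on id but differ in credits:
record Student(int id, int credits) {}

Comparator<Student> byId = Comparator.comparingInt(Student::id);

// byId.compare(new Student(5, 5), new Student(5, -4)) returns 0, so both
// orderings of these two students are legal outputs of a correct sort.
// A stable sort preserves their original relative order; an unstable sort
// might not.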

This problem comes up frequently. Whenever you have a problem that has more than one correct answer, direct comparison of results is perilous. Consider basically every graph or optimization problem: shortest path, change-making, graph coloring, optimal scheduling, ... They all often have multiple correct answers. We'll call these relational problems, because one input can correspond to more than one correct output.

So how do we deal with this issue? By broadening the way we think about correctness.

Property-Based Testing

What makes a sorting implementation correct? Let's set aside oracles like alternative implementations and models. What makes us look at an output and say "yes, that's right" given an input?

Think, then click!

Here are some properties:

  • the output is sorted;
  • the output has the same elements as the input.

We touched on this briefly before, but let's be more concrete about what we mean by "the same elements". Really, we mean that the input is a permutation of the output. We need to be precise, since we're about to write a Java method that checks for these properties!

If we can write a set of properties as a method, we can just use that as our oracle! Try filling this in, reflecting the properties we want from a correct sort.

// uncomment the test annotation to enable the test
// @Test
public void pbtSortStudentRecords() {
    long numIterations = 10000;

    for (int trial = 0; trial < numIterations; trial++) {
        List<Student> input = randomStudentList(3, -5, 5);
        List<Student> output = new ArrayList<>(input);
        ExampleSort.bubbleSort(output);

        // What makes a sort _correct_?
        // EXERCISE: Write some code that compares `input` and `output` in a
        // property-focused way, not via direct comparison.
    }
}

This powerful technique is called property-based testing.
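
If you want to check your thinking, here's one possible pair of property checks. This is a sketch: the helper names (isSorted, isPermutationOf) are assumptions rather than part of the livecode, and the Comparator is assumed to be the same one the sorts use.

// Property 1: every adjacent pair of the output is in non-decreasing order.
static boolean isSorted(List<Student> output, Comparator<Student> cmp) {
    for (int i = 0; i + 1 < output.size(); i++) {
        if (cmp.compare(output.get(i), output.get(i + 1)) > 0) return false;
    }
    return true;
}

// Property 2: the output is a permutation of the input. Counting elements
// (a multiset view) ignores order, which is exactly what we want here.
static boolean isPermutationOf(List<Student> input, List<Student> output) {
    Map<Student, Integer> counts = new HashMap<>();
    for (Student s : input)  counts.merge(s, 1, Integer::sum);
    for (Student s : output) counts.merge(s, -1, Integer::sum);
    return counts.values().stream().allMatch(c -> c == 0);
}

Inside the loop, the exercise then becomes asserting both properties, rather than comparing against any single expected list.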

Further Reading

Tim has a paper about teaching PBT, which you can check out here if you're interested.

There are also many, many PBT libraries that help with random input generation. For example, if you use Python, Hypothesis is great. For other languages, look for the phrase "QuickCheck", the name of the original PBT library, which was written for Haskell. For better or worse, Java has multiple such libraries.

What's the takeaway?

Two things:

First, randomly generating test inputs is practical and, if done well, powerful. You can use this to mine for bugs while doing other things.

Second, in situations where there are multiple correct answers a program might give, the traditional actual_output == expected_output mode of testing is risky at best. You'll be over-fitting your test suite to whatever your implementation produces today. Then someone comes along and makes a small change, and the program remains correct, but a bunch of test cases suddenly break. We can use model-based and property-based testing to avoid that problem.

Broaden your perspective on testing, and you become a better developer.