2014-09-02

Go Testing Patterns: Benchmarking and Expensive Fixture Setup, Memory Allocations, and Automation

Dave Cheney put together a nice posting that outlines how to use Go's pkg/testing facility for microbenchmarking. This post should be considered an addendum, for it extends his post with a few other useful tidbits.


Expensive Fixture Setup

In Dave's example, he instruments the Fibonacci sequence, a pure benchmark that requires almost zero setup. Sometimes the real world is not as forgiving in terms of purity. Suppose we want to benchmark something that requires setup. How do we achieve that in the context of pkg/testing? For the sake of argument, we will come up with a contrived example: traversing a binary tree.


I have created an example project on GitHub. I encourage you to start by reading its generated GoDoc to understand what it is and how it works before we proceed.


Let's look at a naive example of what a benchmark could be:

func benchmarkJ(b *testing.B, j int, t Traverser) {
        // Create our system under test (SUT):
        tree := New(j)
        // We create a bunch of recipient channels in advance for our traverser to
        // write into.
        chs := make(chan chan int, b.N)
        for i := 0; i < b.N; i++ {
                chs <- make(chan int, j)
        }
        close(chs)

        // Run our benchmark.
        for ch := range chs { // len(chs) == b.N
                t(tree, ch)
        }
}
(Source)
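For context, benchmarkJ is not itself a benchmark that go test will run; it is a parameterized helper invoked by thin, concretely named wrapper functions, one per tree size and traversal strategy. A minimal sketch of that wiring, assuming two package-level Traverser values whose names I have made up here (the benchmark names themselves come from the readouts below):

func BenchmarkRecursive1(b *testing.B)  { benchmarkJ(b, 1, recursiveTraverser) }
func BenchmarkRecursive10(b *testing.B) { benchmarkJ(b, 10, recursiveTraverser) }
func BenchmarkIterative1(b *testing.B)  { benchmarkJ(b, 1, iterativeTraverser) }
func BenchmarkIterative10(b *testing.B) { benchmarkJ(b, 10, iterativeTraverser) }

With that wiring in mind, back to the helper itself.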

How does that look to you? That all gravy? Sadly it isn't. What's wrong with it? To start, the benchmark is instrumenting things that are not meant to be instrumented, like the setup cost:

  • tree := New(j) produces allocations and takes time.
  • the for i := 0; i < b.N; i++ loop that fills chs with make(chan int, j) produces allocations and takes time.

You see, pkg/testing's benchmark framework works out of the box somewhat naively: the moment the benchmarking function is called, it begins its instrumentation. What happens if you find yourself in a boat like this, where we need setup, and that setup has to be isolated to promote test determinism (as well as to keep it from counting toward the instrumented run)? Luckily pkg/testing provides the B.StopTimer and B.StartTimer facilities. These let you tell the benchmark framework that whatever happens between a B.StopTimer call and the subsequent B.StartTimer call should not be instrumented (more precisely: timing resumes only at the final B.StartTimer invocation)! Let's see what this looks like in action!

func benchmarkJCorrect(b *testing.B, j int, t Traverser) {
        // IMPORTANT: Instruct the benchmarker to not track time and memory allocations!
        b.StopTimer()

        // Create our system under test (SUT):
        tree := New(j)
        // We create a bunch of recipient channels in advance for our traverser to
        // write into.
        chs := make(chan chan int, b.N)
        for i := 0; i < b.N; i++ {
                chs <- make(chan int, j)
        }
        close(chs)

        // IMPORTANT: Instruct the benchmarker to begin tracking time and memory
        // allocations.  Everything after this mark is instrumented!
        b.StartTimer()

        // Run our benchmark.
        for ch := range chs { // len(chs) == b.N
                t(tree, ch)
        }
}
(Source)

"Great," you say. Just large of a difference did this make? Here's a raw reading of it:

// Before
PASS
BenchmarkRecursive1     10000000               267 ns/op
BenchmarkRecursive10     5000000               590 ns/op
BenchmarkRecursive100     500000              4315 ns/op
BenchmarkRecursive1000     50000             40746 ns/op
BenchmarkIterative1      5000000               308 ns/op
BenchmarkIterative10     2000000               850 ns/op
BenchmarkIterative100     500000              5275 ns/op
BenchmarkIterative1000     50000             47494 ns/op

// After
PASS
BenchmarkRecursive1     20000000               104 ns/op
BenchmarkRecursive10     5000000               414 ns/op
BenchmarkRecursive100     500000              3828 ns/op
BenchmarkRecursive1000     50000             38120 ns/op
BenchmarkIterative1     10000000               182 ns/op
BenchmarkIterative10     5000000               688 ns/op
BenchmarkIterative100     500000              4888 ns/op
BenchmarkIterative1000     50000             42569 ns/op

"Wait, didn't you say something about memory allocations and one being more efficient than the other?" Why, yes! You'll need to stick with us to the next section to find out more!

Benchmarking Memory Allocations

If I recall correctly, Go 1.1 or 1.2 publicly exposed a new feature in pkg/testing's benchmarking framework: the ability to instrument memory allocations. There is quite a bit of debate around the topic of memory management in Go. While the Go team is working hard at improving the garbage collector (that is another topic altogether), there is the question of what happens under the hood, which pertains to the compiler as much as to the memory manager itself. Once my writeup of the memory manager and garbage collector is published, I will return to this. Anyway, the Go team would prefer that you not worry about the question of stack versus heap. (My professional opinion: it makes a huge difference for writing performance-critical and scalable software, but keep in mind that it gets into the realm of implementation detail!)

How do we begin? In the go test documentation, under the flag description area, we find a flag called -benchmem. Once turned on, the benchmarker behaves almost identically, except that we also get statistics about memory.
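Turning it on is just a matter of adding the flag to the go test invocation; for example (the -bench pattern is whatever selects your benchmarks):

$ go test -bench=. -benchmem <pathspec>  # Other arguments elided.

With the flag in place, here are the readouts for the naive and corrected benchmarks: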

// Before
PASS
BenchmarkRecursive1     10000000               284 ns/op             105 B/op          1 allocs/op
BenchmarkRecursive10     5000000               598 ns/op             186 B/op          1 allocs/op
BenchmarkRecursive100     500000              4603 ns/op             912 B/op          1 allocs/op
BenchmarkRecursive1000     50000             43971 ns/op            8281 B/op          2 allocs/op
BenchmarkIterative1      5000000               613 ns/op             115 B/op          2 allocs/op
BenchmarkIterative10     1000000              1730 ns/op             245 B/op          4 allocs/op
BenchmarkIterative100      50000             28248 ns/op            1426 B/op          7 allocs/op
BenchmarkIterative1000      5000            211706 ns/op           12418 B/op         11 allocs/op

// After
PASS
BenchmarkRecursive1     20000000                98.8 ns/op             0 B/op          0 allocs/op
BenchmarkRecursive10     5000000               438 ns/op               0 B/op          0 allocs/op
BenchmarkRecursive100     500000              3819 ns/op               0 B/op          0 allocs/op
BenchmarkRecursive1000     50000             37911 ns/op               0 B/op          0 allocs/op
BenchmarkIterative1     10000000               187 ns/op               9 B/op          1 allocs/op
BenchmarkIterative10     5000000               805 ns/op              59 B/op          3 allocs/op
BenchmarkIterative100     500000              5206 ns/op             513 B/op          6 allocs/op
BenchmarkIterative1000     50000             53608 ns/op            4130 B/op          9 allocs/op

Cool, what do I make of this? That depends on how meaningful the data is to you, the type of component you are instrumenting, whether it is in the serving path, exported library code, etc. The crux of it is that you can rely on trickery in how your types and API are designed, together with escape analysis, to affect the efficiency of allocations. The gc compiler can show you its internal escape analysis decisions; to enable this, adjust your flags as follows: -gcflags='-m'.
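Concretely, mirroring the -N invocation shown later in this post, the flag can be passed through go test like so:

$ go test -gcflags='-m' <pathspec>  # Other arguments elided.

The compiler then prints its decisions, e.g. which values escape to the heap and which calls it inlines.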

The hope behind disabling these optimizations (see the Bypassing the Optimizer section below) is that you can craft your benchmarks more carefully and not fight against escape analysis, whose whole point is to optimize away common usage patterns. For instance, with optimizations disabled you should not need to force an object to escape just to ensure that it is, in fact, heap allocated:

// Rest elided.

// This is completely contrived to illustrate how silly manually defeating the
// escape analysis/inliner is!

type BigBeefyMiracle struct {
        You, Can, Never, Have, Too, Many, Arrays [16][16][16]byte
        Or, Beefy, Miracles                      [32][32][32]byte
}

var globalBeefiness *BigBeefyMiracle // Global scope.

func BenchmarkHisBeefiness(b *testing.B) {
        var localBeefiness BigBeefyMiracle
        globalBeefiness = &localBeefiness

        // The API also offers a concise B.ResetTimer(), which zeroes the
        // elapsed time and allocation counters so that the setup above is
        // not counted against the benchmark.
        b.ResetTimer()

        for i := 0; i < b.N; i++ {
                // MakeMeatSlurry is the system under test.
                MakeMeatSlurry(globalBeefiness)
        }
}
(Struct Name Etymology)

OK, OK. Now, can I automate any of this? Sure. Read the next section to find out more!

Automating the A/B Testing Readouts

Suppose you have run my suite once with the naive benchmark mechanism and once more after swapping in the corrected one. How do you automate the comparison of the data?

It turns out that Go includes an awesome unadvertised utility called benchcmp. It can be installed like this:

$ go get code.google.com/p/go.tools/cmd/benchcmp

It assumes that you've redirected the benchmark output to two files, which you provide as positional arguments:

$ benchcmp <before> <after>
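For instance, the two input files might be captured like this (a sketch; the file names are arbitrary):

$ go test -bench=. -benchmem <pathspec> > before
$ # ... swap in the corrected benchmark mechanism ...
$ go test -bench=. -benchmem <pathspec> > after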

Let's give it a whirl with our memory readout (it supports non-memory readouts, too!):

$ benchcmp before after
benchmark                  old ns/op     new ns/op     delta
BenchmarkRecursive1        284           98.8          -65.21%
BenchmarkRecursive10       598           438           -26.76%
BenchmarkRecursive100      4603          3819          -17.03%
BenchmarkRecursive1000     43971         37911         -13.78%
BenchmarkIterative1        613           187           -69.49%
BenchmarkIterative10       1730          805           -53.47%
BenchmarkIterative100      28248         5206          -81.57%
BenchmarkIterative1000     211706        53608         -74.68%

benchmark                  old allocs     new allocs     delta
BenchmarkRecursive1        1              0              -100.00%
BenchmarkRecursive10       1              0              -100.00%
BenchmarkRecursive100      1              0              -100.00%
BenchmarkRecursive1000     2              0              -100.00%
BenchmarkIterative1        2              1              -50.00%
BenchmarkIterative10       4              3              -25.00%
BenchmarkIterative100      7              6              -14.29%
BenchmarkIterative1000     11             9              -18.18%

benchmark                  old bytes     new bytes     delta
BenchmarkRecursive1        105           0             -100.00%
BenchmarkRecursive10       186           0             -100.00%
BenchmarkRecursive100      912           0             -100.00%
BenchmarkRecursive1000     8281          0             -100.00%
BenchmarkIterative1        115           9             -92.17%
BenchmarkIterative10       245           59            -75.92%
BenchmarkIterative100      1426          513           -64.03%
BenchmarkIterative1000     12418         4130          -66.74%

Nifty, eh? Doesn't rearchitecting the benchmark properly look so much more compelling now?

Bypassing the Optimizer

You may not have been comfortable with my remarks above about the implicit optimizations in the compiler and the runtime and what that means for the code you write. I understand. One of the things you can do to help keep things ceteris paribus is to disable compiler optimizations. It isn't foolproof for comparison, since your production code will likely be built with optimization in mind. Nevertheless, it is still helpful for finding a baseline. Again, this falls into the -gcflags realm:

$ go test -gcflags='-N' <pathspec>  # Other arguments elided.
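If you also want to take the inliner out of the picture (recall the escape analysis/inliner remark in the struct example above), the gc compiler's -l flag disables inlining and can be combined with -N:

$ go test -gcflags='-N -l' <pathspec>  # Other arguments elided.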

In Closing …

In closing, these are some useful tidbits to keep in your toolbelt. Don't forget that there's usually a clean way of doing everything in Go, so don't reinvent the wheel for yourself if you can avoid it.

Finally, do keep in mind that microbenchmarking (Link is regarding Java, but you're mature enough to get over this and love mature runtimes) is not an end-all-be-all. It is useful for guidelines and advisement, nothing more. Focus on correct design (internal and external) and code comprehension/readability first.

In our next installments, we will focus on the following:

  • correctly benchmarking throughput in parallelized operations,
  • choosing an optimal concurrency level from benchmarking, and
  • many other useful testing patterns.

Stay tuned!
