Monday, June 14, 2021

Introducing Spring.Benchmark - a port of Google benchmark

Some while ago I expressed the wish for someone to create a wrapper for the famous Google benchmark library. It is a library to write so-called microbenchmarks. Microbenchmarks can help to tune small parts of code such as functions or classes. While you can certainly measure code by simply using QueryPerformanceCounter or TStopwatch there is more to a microbenchmark than simply running your code a couple times and stopping the time.

Usually, such libraries make sure the code you want to measure has a warm-up phase before it actually gets measured. And also apart from simply printing some duration that it took a good library can provide more information such as running the code multiple times with different parameters and calculate meaningful statistics for example an estimated efficiency of the measured code.

If you are a regular visitor of Delphi-PRAXiS you might have seen several threads where the performance of different functions and algorithms have been discussed. Most of them have the more simple character of measuring the code of a couple of runs with a stopwatch and often don't do any statement about their asymptotic behavior.

Before we look into what Spring.Benchmark is and offers a little disclaimer: micro benchmarking is something you don't do constantly and you need to be aware that it is one of the many tools at our disposal. Also, be aware of the traps that microbenchmarks hold and take its results with a grain of salt. For more information, I really suggest you watch this interesting and entertaining talk from CppCon 2020.

At the bottom of this post, I will put some more video links if you want to know more about this subject.

Spring.Benchmark is a standalone project which does not require the Spring library but is one of multiple projects planned under the Spring4D name. It is a very accurate port of the C++ code from Google Benchmark and thus will be familiar to those that might have used the C++ library before.

But now let's dive into benchmarking some code. We will measure sorting and compare TArray.Sort from the RTL which uses plain QuickSort internally to the implementation in Spring which uses a hybrid algorithm called Introsort which uses a combination of different sorting algorithms. While it's certainly interesting to see the difference between those two I want to show you the features of Spring.Benchmark with this example.

I will run all tests with Delphi 10.4.2 and use the Win32 platform with Release config. The tests will run on an i5-3570K at 3.40GHz.

Here is the complete listing of our little benchmark program:

program QuickSortBenchmark;



  TSortProc = procedure (var Values: array of Integer);

procedure Sort(const state: TState; const sort: TSortProc);
  n, i: Integer;
  numbers: TArray<Integer>;
  n := state[0];
  SetLength(numbers, n);

  for var _ in state do
    for i := 0 to High(numbers) do
      numbers[i] := Random(MaxInt);

  state.SetItemsProcessed(n * state.Iterations);

procedure QuickSort(const state: TState);
  Sort(state, Generics.Collections.TArray.Sort<Integer>);

procedure IntroSort(const state: TState);
  Sort(state, Spring.TArray.Sort<Integer>);

  Benchmark(QuickSort, 'quicksort')
    .Range(10, 100000);

  Benchmark(IntroSort, 'introsort')
    .Range(10, 100000);


The library consists of just one unit: Spring.Benchmark.pas. Put that into your uses and you are ready to go.

Every benchmark procedure has to have the signature like our procedures QuickSort and IntroSort have, they take a single parameter of type TState. This variable carries the state of the current benchmark run. Since we have two cases with exactly the same code we put that into our Sort procedure which takes a callback that gets either passed the RTL or the Spring TArray.Sort method.

Now let's take a closer look at our actual benchmark. First, we want to do some setup - in this benchmark, we will do the setup every time it is being run. It depends on the nature of the benchmarked code if that needs to be done every time or if you can do this once and then run the benchmark multiple times. Our state variable carries amongst many other things the parameters for the test in our case we have one for the number of elements we want to test with. We can access those parameters via the square brackets and their index.

Now for the actual benchmark loop - this is done with a for-in loop over our state variable. I am using the inline variable syntax here, the _ is a common notation for a variable whose value is actually of no interest. In fact the type of this variable - its type is TState.TValue in case you cannot use the inline variable syntax - is an empty record with no fields whatsoever. This interestingly allows for quite some optimized code being emitted for this loop - especially in 10.4, where due to using the new Finalize operator the end of the loop is not in the way of the loop body code.

Inside the loop, we do two things: we initialize the array with random numbers every time and then we call Sort. Doing the for-in loop over the state variable automatically starts the internal clock to measure the time. However, since we don't want to include the randomization of our array we wrap this loop inside PauseTiming and ResumeTiming calls.

The library itself decides how many iterations it will do - it will perform some startup code until its duration stabilize and then starts the iterations that it measures.

Let's now look at the last part of the code which is the setup. It always starts with a call to Benchmark where you pass a procedure that takes a TState parameter and a name for this benchmark to identify it when displaying or when filtering. There are several methods to be called to set up every benchmark. We are using the Range method to provide several values in a convenient way. In combination with RangeMultiplier, it determines which parameter values to call the benchmark with - in our case the benchmark will be called with 10, 100, 1000, 10000, and 100000. You can find more about all the possibilities in the documentation.

Let's run the project and see what it shows:

You can see a nicely printed summary of our multiple runs with their timings and the actual number of iterations the library ran the code for. In our code we added the value for a so-called counter which also helps to evaluate the code we just benchmarked - we counted the items that we processed by our sorting by multiplying the numbers in the array by the number of iterations we ran the code. The library code does all the necessary calculations to show that we processed several million items per second with our sorting.

There are many more features such as writing the statistics to csv or json format, running your benchmark code in multiple threads to also measure its performance when running in parallel.

The library requires at least Delphi XE7 or FPC 3.2.0 (at least that is what I tested it with) and runs on Windows (Delphi and FPC) and Linux (only tested with Delphi 10.4). Delphi support for macOS, iOS and Android is being worked on. 

Earlier I promised some more interesting videos to watch and here are a few:

An interesting and very entertaining talk from Andrei Alexandrescu from CppCon 2019.

Two talks from Chandler Carruth about efficiency with algorithms and about benchmarks.


  1. Replies
    1. Actually read the article - it might be explained there ;)

  2. A rightly done library: congrats. The StackOverflow link about Micro-Benchmarking is worth looking at for the users. Benchmarking sort algorithms is fine, because it is some realistic long-standing process. But micro-benchmarking of a very small process (like a CompareText or a Pos) is sometimes pointless because the CPU does itself its own optimization, which makes the results biased. This is what the StackOverflow implies, quoting Knuth about premature optimization. Premature Micro-Optimization to be more precise. ;) I have seen so many micro-benchmarks running process in the CPU L1 cache, reusing hot CPU pipeline tracking, which would give diverse results when working from main memory, as most performance-sensitive process would.