Tuesday, June 15, 2021

Spring4D 2.0 sneak peek - the evolution of performance

To be honest - most of the collections in Spring 1.2 are dead slow - especially on Win32. But to be fair, almost nobody has complained about that so far. Now that we have a shiny tool to print some numbers, here are some simple benchmarks - they show a similarly sad picture for almost every method, as most of them are virtual in 1.2.

The benchmark simply measures how long it takes to add n integers to a list with pre-allocated capacity - so this is just the duration of the call that stuffs the integer into the internal array (and whatever else is going on) - compiled with 10.4.2 - Win32 release. You can see how hard the return stack buffer thrashing hits the performance here.


The numbers for Win64 are a little less terrible but still far from good:



Now fast forward to the latest development state - Win32:


and Win64:



Before you ask - well - simply adding Spring4D to the binary makes the RTL faster or slower! No, seriously, that was a joke. You are seeing the effect that Professor Berger mentioned in the talk I linked in my last blog post. So be careful with these numbers - simply flipping some bit in your binary can affect completely unrelated code.

However, the important point here - and one that is consistent regardless of the relative numbers - is the improvement from 1.2 to 2.0, and it applies to all collections across the board regarding three key factors:

1. speed - there are no virtual methods behind the interfaces anymore, and the lists are optimized to perform as fast as possible, especially when no OnChanged event handlers are attached to them - which is the common case and where you want pure speed.

2. binary size - as you know, generics can produce quite a huge chunk of binary code, because for every specialization the code is basically duplicated - even for binary-identical types - including their RTTI. In 2.0, all RTTI for the implementing classes (which you will never touch anyway) is turned off, and with some trickery many generic specializations are folded and forced into Spring.Collections.dcu so they do not pollute each and every dcu you compile a list or dictionary into. This does not only shrink your binary but also speeds up your compilation, as the compiler simply has less work to do - no generating code (RTTI and executable code) into every dcu that the linker later has to go through to eliminate duplicates.

3. instance size - while it certainly does not matter whether a list instance itself is 58 or 80 bytes when you store thousands of integers or objects in it, it can make a difference if you have many such lists - for example one in every data object, each holding only a handful of items. Field reordering and implementing the interfaces on one class instead of several saves a few bytes per instance: interfaces that inherit from one another - IList&lt;T&gt; inherits from IEnumerable&lt;T&gt;, for example - share the same IMT slot in every object instead of each occupying their own. On 64-bit this saving is even bigger due to the doubled pointer size, as you can imagine.
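The effect behind the first point can be sketched roughly like this (hypothetical names and a simplified class - not the actual Spring4D 2.0 source): the hot path checks once whether any OnChanged handler is attached and skips the notification machinery entirely in the common case.

```pascal
// Hypothetical sketch of the "no event handler attached" fast path -
// names and structure are illustrative, not the real Spring4D code.
procedure TIntList.Add(const item: Integer);
begin
  if fCount = fCapacity then
    Grow;
  fItems[fCount] := item;
  Inc(fCount);
  // Only pay for notifications when someone actually listens -
  // with no handler attached this is a single predictable branch.
  if Assigned(fOnChanged) then
    fOnChanged(Self, item, caAdded);
end;
```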
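Regarding the second point, Delphi lets a unit suppress extended RTTI for the classes it declares via the explicit RTTI directive; a minimal sketch of the idea (the class shown is illustrative, not Spring4D's):

```pascal
// Turn off extended RTTI for everything declared below this directive -
// implementation classes that are only ever used through interfaces
// don't need reflectable methods, properties or fields.
{$RTTI EXPLICIT METHODS([]) PROPERTIES([]) FIELDS([])}

type
  // Illustrative implementation class you never reference directly;
  // no method/property/field RTTI is emitted for it.
  TIntegerList = class(TInterfacedObject)
  private
    fItems: TArray<Integer>;
    fCount: Integer;
  end;
```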


2.0 will provide the perfect balance between the first two points - delivering the best speed while keeping the binary bloat low - and provide the rich and powerful API you already know and love, including some valuable additions.


While the development of 2.0 is still going on, it is close to completion - currently I am working on finalizing the refactoring of our dictionary, which, as you might know, was the last collection type that was simply a wrapper around the RTL in 1.2. But more about that in the next article.

For now, just a benchmark of different dictionary lookups - the match rate is 100% in these benchmarks, which means all lookups are successful. Yup, that's roughly a 4-7 times improvement over the RTL!





Monday, June 14, 2021

Introducing Spring.Benchmark - a port of Google benchmark

A while ago I expressed the wish for someone to create a wrapper for the famous Google Benchmark library. It is a library for writing so-called microbenchmarks, which help to tune small parts of code such as functions or classes. While you can certainly measure code by simply using QueryPerformanceCounter or TStopwatch, there is more to a microbenchmark than running your code a couple of times and stopping the time.

Usually such libraries make sure the code you want to measure has a warm-up phase before it actually gets measured. And apart from simply printing some duration, a good library can provide more information - such as running the code multiple times with different parameters and calculating meaningful statistics, for example an estimate of the measured code's efficiency.
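For contrast, here is the naive TStopwatch approach such libraries improve upon - one cold run, no warm-up, no repetitions, no statistics:

```pascal
uses
  System.Diagnostics;

procedure NaiveMeasure;
var
  sw: TStopwatch;
  i, x: Integer;
begin
  sw := TStopwatch.StartNew;
  x := 0;
  for i := 1 to 1000000 do
    Inc(x, i);
  sw.Stop;
  // A single cold measurement: no warm-up, no variance estimate, and
  // the compiler might even simplify the loop - exactly the pitfalls
  // a microbenchmark library is designed to handle for you.
  Writeln('Elapsed: ', sw.ElapsedMilliseconds, ' ms (x=', x, ')');
end;
```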

If you are a regular visitor of Delphi-PRAXiS, you might have seen several threads where the performance of different functions and algorithms has been discussed. Most of them take the simpler approach of measuring a couple of runs with a stopwatch and often make no statement about asymptotic behavior.

Before we look into what Spring.Benchmark is and offers, a little disclaimer: microbenchmarking is not something you do constantly, and you need to be aware that it is just one of the many tools at our disposal. Also be aware of the traps that microbenchmarks hold, and take their results with a grain of salt. For more information, I really suggest you watch this interesting and entertaining talk from CppCon 2020.

At the bottom of this post, I will put some more video links if you want to know more about this subject.


Spring.Benchmark is a standalone project that does not require the Spring library but is one of multiple projects planned under the Spring4D name. It is a very accurate port of the C++ code from Google Benchmark and thus will be familiar to those who have used the C++ library before.

But now let's dive into benchmarking some code. We will measure sorting and compare TArray.Sort from the RTL, which uses plain QuickSort internally, to the implementation in Spring, which uses Introsort - a hybrid of several sorting algorithms. While it's certainly interesting to see the difference between the two, my main goal is to show you the features of Spring.Benchmark with this example.

I will run all tests with Delphi 10.4.2 and use the Win32 platform with Release config. The tests will run on an i5-3570K at 3.40GHz.

Here is the complete listing of our little benchmark program:

program QuickSortBenchmark;

{$APPTYPE CONSOLE}

uses
  Generics.Collections,
  Generics.Defaults,
  Spring,
  Spring.Benchmark;

type
  TSortProc = procedure (var Values: array of Integer);

procedure Sort(const state: TState; const sort: TSortProc);
var
  n, i: Integer;
  numbers: TArray<Integer>;
begin
  n := state[0];
  SetLength(numbers, n);

  for var _ in state do
  begin
    state.PauseTiming;
    for i := 0 to High(numbers) do
      numbers[i] := Random(MaxInt);
    state.ResumeTiming;
    sort(numbers);
  end;

  state.SetItemsProcessed(n * state.Iterations);
end;

procedure QuickSort(const state: TState);
begin
  Sort(state, Generics.Collections.TArray.Sort<Integer>);
end;

procedure IntroSort(const state: TState);
begin
  Sort(state, Spring.TArray.Sort<Integer>);
end;

begin
  Benchmark(QuickSort, 'quicksort')
    .RangeMultiplier(10)
    .Range(10, 100000);

  Benchmark(IntroSort, 'introsort')
    .RangeMultiplier(10)
    .Range(10, 100000);

  Benchmark_Main();
end.

The library consists of just one unit: Spring.Benchmark.pas. Put that into your uses and you are ready to go.

Every benchmark procedure has to have the same signature as our QuickSort and IntroSort procedures: it takes a single parameter of type TState. This variable carries the state of the current benchmark run. Since we have two cases with exactly the same code, we put that code into our Sort procedure, which takes a callback that gets passed either the RTL or the Spring TArray.Sort method.

Now let's take a closer look at our actual benchmark. First we want to do some setup - in this benchmark we do the setup every time it runs. It depends on the nature of the benchmarked code whether that needs to happen every time or whether you can do it once and then run the benchmark multiple times. Our state variable carries, amongst many other things, the parameters for the test - in our case one for the number of elements we want to test with. We can access those parameters via the square brackets and their index.

Now for the actual benchmark loop - this is done with a for-in loop over our state variable. I am using the inline variable syntax here; the _ is a common notation for a variable whose value is of no interest. In fact its type - TState.TValue, in case you cannot use the inline variable syntax - is an empty record with no fields whatsoever. Interestingly, this allows quite optimized code to be emitted for this loop - especially in 10.4, where, thanks to the new Finalize operator, the end of the loop is not in the way of the loop body code.

Inside the loop we do two things: we initialize the array with random numbers every time, and then we call Sort. Doing the for-in loop over the state variable automatically starts the internal clock that measures the time. However, since we don't want to include the randomization of our array, we wrap that loop in PauseTiming and ResumeTiming calls.

The library itself decides how many iterations it will run - it performs some startup code until the measured duration stabilizes and then starts the iterations that it actually measures.

Let's now look at the last part of the code, which is the setup. It always starts with a call to Benchmark, where you pass a procedure that takes a TState parameter and a name to identify the benchmark when displaying or filtering. There are several methods you can call to set up each benchmark. We are using the Range method to provide several values in a convenient way. In combination with RangeMultiplier, it determines the parameter values the benchmark is called with - in our case 10, 100, 1000, 10000, and 100000. You can find more about all the possibilities in the documentation.
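Since the post describes the library as a very accurate port of the C++ code, I am assuming the other argument-setup methods from Google Benchmark - Args and DenseRange are their C++ names - carried over as well; a sketch under that assumption:

```pascal
// Assuming Args and DenseRange exist in the port as in C++ Google Benchmark:
Benchmark(QuickSort, 'quicksort-explicit')
  .Args([10])        // one run with n = 10
  .Args([500])       // one run with n = 500
  .Args([100000]);   // one run with n = 100000

// DenseRange steps linearly instead of multiplying:
Benchmark(IntroSort, 'introsort-dense')
  .DenseRange(1000, 5000, 1000); // n = 1000, 2000, 3000, 4000, 5000
```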

Let's run the project and see what it shows:


You can see a nicely printed summary of our multiple runs with their timings and the actual number of iterations the library ran the code for. In our code we also set the value for a so-called counter, which helps to evaluate the code we just benchmarked - we counted the items processed by our sorting by multiplying the number of items in the array by the number of iterations we ran the code. The library does all the necessary calculations to show that we processed several million items per second with our sorting.

There are many more features, such as writing the statistics in csv or json format, or running your benchmark code in multiple threads to measure its performance under parallel execution.
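If the port also keeps Google Benchmark's command-line flags - an assumption on my part, though a close port makes it likely - the output format could be chosen without recompiling, for example:

```shell
# Assuming the C++ Google Benchmark flags carried over to the port:
QuickSortBenchmark.exe --benchmark_format=json
QuickSortBenchmark.exe --benchmark_out=results.csv --benchmark_out_format=csv
QuickSortBenchmark.exe --benchmark_filter=introsort   # run only matching benchmarks
```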

The library requires at least Delphi XE7 or FPC 3.2.0 (at least that is what I tested it with) and runs on Windows (Delphi and FPC) and Linux (only tested with Delphi 10.4). Delphi support for macOS, iOS and Android is being worked on. 


Earlier I promised some more interesting videos to watch and here are a few:

An interesting and very entertaining talk from Andrei Alexandrescu from CppCon 2019.

Two talks from Chandler Carruth about efficiency with algorithms and about benchmarks.