Tuesday, July 17, 2018

Loop wars - measuring performance

Today on Google+ there was some off topic discussion going on and the claim was being made that for-in loops are slower than for-to loops and some benchmark was being posted to back that argument.

Because I like to measure and investigate things I looked into that. I was running the following benchmark. Yes, it is a micro benchmark - let's not argue about the usefulness of this in isolation.

Keep in mind that testing code in the dpr main influences results because all variables are then global instead of local variables - that is why I put such code into some routine.

procedure Main;
  xStrings: TStrings;
  i: Integer;
  s: string;
  sw: TStopwatch;
  xStrings := TStringList.Create;
  for i := 1 to 10000000 do

  sw := TStopwatch.StartNew;
  for i := 0 to xStrings.Count - 1 do
    s := xStrings[i];

  sw := TStopwatch.StartNew;
  for s in xStrings do;


Running this on my machine here gave me following average results across multiple runs (release config):


During the for-in loop I am not assigning s to some other variable which was done in the originally posted code because I already have the value for each iteration, doing that again would result in one more string assignment in the for-in loop - be fair when comparing both.

Looks like one loop is more than twice as fast as the other - let's investigate why that is.

The TStringsEnumerator being used here is a class which comes with some overhead for its object allocation but only once for the GetEnumerator call. It might be of significance when measuring many loop iterations but in this case we are not. Anyway, we are going change it, copy the code from Classes.pas and change it to a record and use a helper for TStrings to easily get our enumerator being used. This is the added code:

  TStringsEnumerator = record
    FIndex: Integer;
    FStrings: TStrings;
    function GetCurrent: string; inline;
    function MoveNext: Boolean;
    property Current: string read GetCurrent;

  TStringsHelper = class helper for TStrings
    function GetEnumerator: TStringsEnumerator;

{ TStringsHelper }

function TStringsHelper.GetEnumerator: TStringsEnumerator;
  Result.FIndex := -1;
  Result.FStrings := Self;

{ TStringsEnumerator }

function TStringsEnumerator.GetCurrent: string;
  Result := FStrings[FIndex];

function TStringsEnumerator.MoveNext: Boolean;
  Result := FIndex < FStrings.Count - 1;
  if Result then

Looking at the results again - nothing significant changed but that was expected.

The generated assembly of the for-to loop:

Loops.dpr.60: for i := 0 to xStrings.Count - 1 do
004D5EB8 8B45F0           mov eax,[ebp-$10]
004D5EBB 8B10             mov edx,[eax]
004D5EBD FF5214           call dword ptr [edx+$14]
004D5EC0 8BD8             mov ebx,eax
004D5EC2 4B               dec ebx
004D5EC3 85DB             test ebx,ebx
004D5EC5 7C1C             jl $004d5ee3
004D5EC7 43               inc ebx
004D5EC8 C745EC00000000   mov [ebp-$14],$00000000
Loops.dpr.61: s := xStrings[i];
004D5ECF 8D4DFC           lea ecx,[ebp-$04]
004D5ED2 8B55EC           mov edx,[ebp-$14]
004D5ED5 8B45F0           mov eax,[ebp-$10]
004D5ED8 8B30             mov esi,[eax]
004D5EDA FF560C           call dword ptr [esi+$0c]
004D5EDD FF45EC           inc dword ptr [ebp-$14]
Loops.dpr.60: for i := 0 to xStrings.Count - 1 do
004D5EE0 4B               dec ebx
004D5EE1 75EC             jnz $004d5ecf

You can see that there are in fact 2 loop variables being used, one counting down to zero from xStrings.Count (stored in ebx), one counting up being used for accessing the items (stored on the stack at [ebp-$14]). The Count property is only read once. Anyhow changing the code in the enumerator to only read Count once and store that internally will not change things drastically.

The assembly of the for-in loop - oh wow:

Loops.dpr.67: for s in xStrings do;
004D5F11 8D55C4           lea edx,[ebp-$3c]
004D5F14 8B45F0           mov eax,[ebp-$10]
004D5F17 E8E8FEFFFF       call TStringsHelper.GetEnumerator
004D5F1C EB44             jmp $004d5f62
004D5F1E 33C0             xor eax,eax
004D5F20 55               push ebp
004D5F21 685B5F4D00       push $004d5f5b
004D5F26 64FF30           push dword ptr fs:[eax]
004D5F29 648920           mov fs:[eax],esp
004D5F2C 8D4DF4           lea ecx,[ebp-$0c]
004D5F2F 8B55C4           mov edx,[ebp-$3c]
004D5F32 8B45CC           mov eax,[ebp-$34]
004D5F35 8B18             mov ebx,[eax]
004D5F37 FF530C           call dword ptr [ebx+$0c]
004D5F3A 8D45FC           lea eax,[ebp-$04]
004D5F3D 8B55F4           mov edx,[ebp-$0c]
004D5F40 E89F4EF3FF       call @UStrLAsg
004D5F45 33C0             xor eax,eax
004D5F47 5A               pop edx
004D5F48 59               pop ecx
004D5F49 59               pop ecx
004D5F4A 648910           mov fs:[eax],edx
004D5F4D 68625F4D00       push $004d5f62
004D5F52 8D45F4           lea eax,[ebp-$0c]
004D5F55 E8624AF3FF       call @UStrClr
004D5F5A C3               ret 
004D5F5B E97840F3FF       jmp @HandleFinally
004D5F60 EBF0             jmp $004d5f52
004D5F62 8D45C4           lea eax,[ebp-$3c]
004D5F65 E8B6FEFFFF       call TStringsEnumerator.MoveNext
004D5F6A 84C0             test al,al
004D5F6C 75B0             jnz $004d5f1e

What the hell... well, I smell what is going on here... I know that the compiler generates terrible code when you are using inline with managed function results and I saw GetCurrent being marked as inline - we will change that. The generated assembly - nice and short!

Loops.dpr.66: for s in xStrings do;
004D5F1E 8D55E8           lea edx,[ebp-$18]
004D5F21 8B45F4           mov eax,[ebp-$0c]
004D5F24 E8DBFEFFFF       call TStringsHelper.GetEnumerator
004D5F29 EB0B             jmp $004d5f36
004D5F2B 8D55FC           lea edx,[ebp-$04]
004D5F2E 8D45E8           lea eax,[ebp-$18]
004D5F31 E8DAFEFFFF       call TStringsEnumerator.GetCurrent
004D5F36 8D45E8           lea eax,[ebp-$18]
004D5F39 E8EAFEFFFF       call TStringsEnumerator.MoveNext
004D5F3E 84C0             test al,al
004D5F40 75E9             jnz $004d5f2b

It would be a bit more if the enumerator was an object because the the compiler has to insert the cleanup code for the enumerator instance but it would not affect performance on this benchmark because we are only profiling the iteration and not the GetEnumerator and finalize code - that would be a different story.

Now look at the results:


BOOM! There you have it. The difference is close to 10% and it even changes depending on what loop executes first giving a bit of room for error of measurement. Applying the optimization to only read Count once also gains another 10ms. We can also remove the check on Result and always increase FIndex because when MoveNext returns false accessing Current is not permitted which gives us another few ms. Now imagine the compiler would properly inline GetCurrent we would certainly see equal or even better performance of the for-in loop in this benchmark.

Anyhow - the benefit of for-in loops over classic index based for-to loops is not only to avoid dreaded off by one errors (ever forget to -1?) but them being composable - a very important feature when developing code. You can easily add filters and transformations into the chain and then can iterate through all the items without using intermediate storage or writing nested loops - if you are using Spring4D collections you know what I am talking about, right?

So, what did we learn from this?

1. Measure and investigate code
2. Some code in the RTL is far from being ideal
3. The 32-bit compiler has much room for optimization

P.S. Hallvard did similar investigations more than a decade ago already - makes me a bit sad that the compiler is still not generating more performant code on for-in loops.

Update (16.06.2019): It looks like that in recent versions  the inline got removed from the GetCurrent function bringing the TStringsEnumerator in range of 10% to the for-to loop.