Because I like to measure and investigate things I looked into that. I was running the following benchmark. Yes, it is a micro benchmark - let's not argue about the usefulness of this in isolation.
Keep in mind that testing code in the dpr main influences results because all variables are then global instead of local variables - that is why I put such code into some routine.
procedure Main;
var
xStrings: TStrings;
i: Integer;
s: string;
sw: TStopwatch;
begin
xStrings := TStringList.Create;
for i := 1 to 10000000 do
xStrings.Add(i.ToString);
sw := TStopwatch.StartNew;
for i := 0 to xStrings.Count - 1 do
s := xStrings[i];
Writeln(sw.ElapsedMilliseconds);
sw := TStopwatch.StartNew;
for s in xStrings do;
Writeln(sw.ElapsedMilliseconds);
Readln;
end;
Running this on my machine here gave me following average results across multiple runs (release config):
210
470
During the for-in loop I am not assigning s to some other variable which was done in the originally posted code because I already have the value for each iteration, doing that again would result in one more string assignment in the for-in loop - be fair when comparing both.
Looks like one loop is more than twice as fast as the other - let's investigate why that is.
The TStringsEnumerator being used here is a class which comes with some overhead for its object allocation but only once for the GetEnumerator call. It might be of significance when measuring many loop iterations but in this case we are not. Anyway, we are going change it, copy the code from Classes.pas and change it to a record and use a helper for TStrings to easily get our enumerator being used. This is the added code:
type
TStringsEnumerator = record
private
FIndex: Integer;
FStrings: TStrings;
public
function GetCurrent: string; inline;
function MoveNext: Boolean;
property Current: string read GetCurrent;
end;
TStringsHelper = class helper for TStrings
function GetEnumerator: TStringsEnumerator;
end;
{ TStringsHelper }
function TStringsHelper.GetEnumerator: TStringsEnumerator;
begin
Result.FIndex := -1;
Result.FStrings := Self;
end;
{ TStringsEnumerator }
function TStringsEnumerator.GetCurrent: string;
begin
Result := FStrings[FIndex];
end;
function TStringsEnumerator.MoveNext: Boolean;
begin
Result := FIndex < FStrings.Count - 1;
if Result then
Inc(FIndex);
end;
Looking at the results again - nothing significant changed but that was expected.
The generated assembly of the for-to loop:
Loops.dpr.60: for i := 0 to xStrings.Count - 1 do
004D5EB8 8B45F0 mov eax,[ebp-$10]
004D5EBB 8B10 mov edx,[eax]
004D5EBD FF5214 call dword ptr [edx+$14]
004D5EC0 8BD8 mov ebx,eax
004D5EC2 4B dec ebx
004D5EC3 85DB test ebx,ebx
004D5EC5 7C1C jl $004d5ee3
004D5EC7 43 inc ebx
004D5EC8 C745EC00000000 mov [ebp-$14],$00000000
Loops.dpr.61: s := xStrings[i];
004D5ECF 8D4DFC lea ecx,[ebp-$04]
004D5ED2 8B55EC mov edx,[ebp-$14]
004D5ED5 8B45F0 mov eax,[ebp-$10]
004D5ED8 8B30 mov esi,[eax]
004D5EDA FF560C call dword ptr [esi+$0c]
004D5EDD FF45EC inc dword ptr [ebp-$14]
Loops.dpr.60: for i := 0 to xStrings.Count - 1 do
004D5EE0 4B dec ebx
004D5EE1 75EC jnz $004d5ecf
You can see that there are in fact 2 loop variables being used, one counting down to zero from xStrings.Count (stored in ebx), one counting up being used for accessing the items (stored on the stack at [ebp-$14]). The Count property is only read once. Anyhow changing the code in the enumerator to only read Count once and store that internally will not change things drastically.
The assembly of the for-in loop - oh wow:
Loops.dpr.67: for s in xStrings do;
004D5F11 8D55C4 lea edx,[ebp-$3c]
004D5F14 8B45F0 mov eax,[ebp-$10]
004D5F17 E8E8FEFFFF call TStringsHelper.GetEnumerator
004D5F1C EB44 jmp $004d5f62
004D5F1E 33C0 xor eax,eax
004D5F20 55 push ebp
004D5F21 685B5F4D00 push $004d5f5b
004D5F26 64FF30 push dword ptr fs:[eax]
004D5F29 648920 mov fs:[eax],esp
004D5F2C 8D4DF4 lea ecx,[ebp-$0c]
004D5F2F 8B55C4 mov edx,[ebp-$3c]
004D5F32 8B45CC mov eax,[ebp-$34]
004D5F35 8B18 mov ebx,[eax]
004D5F37 FF530C call dword ptr [ebx+$0c]
004D5F3A 8D45FC lea eax,[ebp-$04]
004D5F3D 8B55F4 mov edx,[ebp-$0c]
004D5F40 E89F4EF3FF call @UStrLAsg
004D5F45 33C0 xor eax,eax
004D5F47 5A pop edx
004D5F48 59 pop ecx
004D5F49 59 pop ecx
004D5F4A 648910 mov fs:[eax],edx
004D5F4D 68625F4D00 push $004d5f62
004D5F52 8D45F4 lea eax,[ebp-$0c]
004D5F55 E8624AF3FF call @UStrClr
004D5F5A C3 ret
004D5F5B E97840F3FF jmp @HandleFinally
004D5F60 EBF0 jmp $004d5f52
004D5F62 8D45C4 lea eax,[ebp-$3c]
004D5F65 E8B6FEFFFF call TStringsEnumerator.MoveNext
004D5F6A 84C0 test al,al
004D5F6C 75B0 jnz $004d5f1e
What the hell... well, I smell what is going on here... I know that the compiler generates terrible code when you are using inline with managed function results and I saw GetCurrent being marked as inline - we will change that. The generated assembly - nice and short!
Loops.dpr.66: for s in xStrings do;
004D5F1E 8D55E8 lea edx,[ebp-$18]
004D5F21 8B45F4 mov eax,[ebp-$0c]
004D5F24 E8DBFEFFFF call TStringsHelper.GetEnumerator
004D5F29 EB0B jmp $004d5f36
004D5F2B 8D55FC lea edx,[ebp-$04]
004D5F2E 8D45E8 lea eax,[ebp-$18]
004D5F31 E8DAFEFFFF call TStringsEnumerator.GetCurrent
004D5F36 8D45E8 lea eax,[ebp-$18]
004D5F39 E8EAFEFFFF call TStringsEnumerator.MoveNext
004D5F3E 84C0 test al,al
004D5F40 75E9 jnz $004d5f2b
It would be a bit more if the enumerator was an object because the the compiler has to insert the cleanup code for the enumerator instance but it would not affect performance on this benchmark because we are only profiling the iteration and not the GetEnumerator and finalize code - that would be a different story.
Now look at the results:
210
240
BOOM! There you have it. The difference is close to 10% and it even changes depending on what loop executes first giving a bit of room for error of measurement. Applying the optimization to only read Count once also gains another 10ms. We can also remove the check on Result and always increase FIndex because when MoveNext returns false accessing Current is not permitted which gives us another few ms. Now imagine the compiler would properly inline GetCurrent we would certainly see equal or even better performance of the for-in loop in this benchmark.
Anyhow - the benefit of for-in loops over classic index based for-to loops is not only to avoid dreaded off by one errors (ever forget to -1?) but them being composable - a very important feature when developing code. You can easily add filters and transformations into the chain and then can iterate through all the items without using intermediate storage or writing nested loops - if you are using Spring4D collections you know what I am talking about, right?
So, what did we learn from this?
1. Measure and investigate code
2. Some code in the RTL is far from being ideal
3. The 32-bit compiler has much room for optimization
P.S. Hallvard did similar investigations more than a decade ago already - makes me a bit sad that the compiler is still not generating more performant code on for-in loops.
Update (16.06.2019): It looks like that in recent versions the inline got removed from the GetCurrent function bringing the TStringsEnumerator in range of 10% to the for-to loop.
No comments:
Post a Comment