Tommi Prami 131 Posted December 7, 2023 (edited)
https://towardsdatascience.com/34-faster-integer-to-string-conversion-algorithm-c72453d25352
To my understanding Delphi has a somewhat different algorithm, but whether you look at the "default implementation" mentioned in the article or the one Delphi has, it is a pretty complex beast in either case, and somewhat computationally heavy. Delphi has lookup tables and DivBy100(), which I would guess make it faster than the "default implementation". Not sure though.
lr_printer GitHub repo: https://github.com/tigranh/lr_printer
-Tee-
Edited December 7, 2023 by Tommi Prami Share this post Link to post
David Heffernan 2353 Posted December 7, 2023
The biggest issue with the Delphi code is that it forces heap allocation on you, which is the real bottleneck.
3 Share this post Link to post
Stefan Glienke 2019 Posted December 7, 2023 (edited)
I don't know what to think after reading that article. Here are my comments on it:
- The classic way of truncating the last two digits with div and mod 10 (or 100) does not involve a costly div or mod instruction on modern compilers (*cough* even Delphi 12 now does it - apart from the bugs that came with it).
- I think C++ compilers detect a div and a mod by the same constant and optimize the emitted code further, so they do not require the "workaround" the Delphi RTL uses of calculating the modulo by subtracting the div result times 100 from the original value.
- The pseudo-code he shows for detecting the number of digits is correct, but it is never what actually gets executed - you either rewrite it into a few branches (as you can see in the RTL), or a C++ compiler might unroll the loop or apply some other trickery.
The DivBy100 function was introduced by me in RSP-36119, and I have already notified them that DivBy100 can be removed in 12 because the compiler now properly optimizes a div by 100 - however, that affects performance only by around 0.5%.
As David correctly pointed out, the real bottleneck is the heap allocation - and not only a single one when you just turn an integer into a string and display it, but when you concat strings and numbers the "classic" way, because then it produces a ton of small temporary strings. That issue even exists when using TStringBuilder, which one might think was built exactly for this kind of optimization. If you look into some overloads of Append, you will see that it naively calls IntToStr and passes the result down to the overload that takes a string. This is completely insane: the conversion should be done directly in place into the internal buffer that TStringBuilder already uses, instead of creating a temporary string, converting the integer into it, and passing that to Append just to copy its content into the buffer.
This will likely be my next contribution as part of my "Better RTL" series of JIRA entries.
Edited December 7, 2023 by Stefan Glienke 8 Share this post Link to post
David Heffernan 2353 Posted December 7, 2023 (edited)
From my codebase:

// disable range checks and overflow checks so that Abs() functions in case Value = Low(Value)
{$R-}
{$Q-}

function CopyIntegerToAnsiBuffer(const Value: Integer; var Buffer: array of AnsiChar): Integer;
var
  i, j: Integer;
  val, remainder: Cardinal;
  negative: Boolean;
  tmp: array [0..15] of AnsiChar;
begin
  negative := Value < 0;
  val := Abs(Value);
  Result := 0;
  repeat
    DivMod(val, 10, val, remainder);
    tmp[Result] := AnsiChar(remainder + Ord('0'));
    Inc(Result);
  until val = 0;
  if negative then begin
    tmp[Result] := '-';
    Inc(Result);
  end;
  Assert(Result <= Length(Buffer));
  i := 0;
  j := Result - 1;
  while i < Result do begin
    Buffer[i] := tmp[j];
    Inc(i);
    Dec(j);
  end;
end;

function CopyIntegerToWideBuffer(const Value: Integer; var Buffer: array of WideChar): Integer;
var
  i, j: Integer;
  val, remainder: Cardinal;
  negative: Boolean;
  tmp: array [0..15] of WideChar;
begin
  negative := Value < 0;
  val := Abs(Value);
  Result := 0;
  repeat
    DivMod(val, 10, val, remainder);
    tmp[Result] := WideChar(remainder + Ord('0'));
    Inc(Result);
  until val = 0;
  if negative then begin
    tmp[Result] := '-';
    Inc(Result);
  end;
  Assert(Result <= Length(Buffer));
  i := 0;
  j := Result - 1;
  while i < Result do begin
    Buffer[i] := tmp[j];
    Inc(i);
    Dec(j);
  end;
end;

function CopyInt64ToAnsiBuffer(const Value: Int64; var Buffer: array of AnsiChar): Integer;
var
  i, j: Integer;
  val, remainder: UInt64;
  negative: Boolean;
  tmp: array [0..23] of AnsiChar;
begin
  negative := Value < 0;
  val := Abs(Value);
  Result := 0;
  repeat
    DivMod(val, 10, val, remainder);
    tmp[Result] := AnsiChar(remainder + Ord('0'));
    Inc(Result);
  until val = 0;
  if negative then begin
    tmp[Result] := '-';
    Inc(Result);
  end;
  Assert(Result <= Length(Buffer));
  i := 0;
  j := Result - 1;
  while i < Result do begin
    Buffer[i] := tmp[j];
    Inc(i);
    Dec(j);
  end;
end;

function CopyInt64ToWideBuffer(const Value: Int64; var Buffer: array of WideChar): Integer;
var
  i, j: Integer;
  val, remainder: UInt64;
  negative: Boolean;
  tmp: array [0..23] of WideChar;
begin
  negative := Value < 0;
  val := Abs(Value);
  Result := 0;
  repeat
    DivMod(val, 10, val, remainder);
    tmp[Result] := WideChar(remainder + Ord('0'));
    Inc(Result);
  until val = 0;
  if negative then begin
    tmp[Result] := '-';
    Inc(Result);
  end;
  Assert(Result <= Length(Buffer));
  i := 0;
  j := Result - 1;
  while i < Result do begin
    Buffer[i] := tmp[j];
    Inc(i);
    Dec(j);
  end;
end;

{$IFDEF RANGECHECKSON}{$R+}{$ENDIF}
{$IFDEF OVERFLOWCHECKSON}{$Q+}{$ENDIF}

No heap allocation. I guess I could avoid that DivMod now and do something better. The main point is that a good runtime library should have building-block functions like this that allow us to perform such conversions without forcing heap allocation. The RTL should have functionality like this, on top of which all conversion functions can be built. Then the RTL function can be optimised and everyone benefits.
Edited December 8, 2023 by David Heffernan 5 1 Share this post Link to post
Dave Novo 51 Posted December 12, 2023
I am trying to understand when this stack-based allocation can practically help, because it seems to me that, for the most part, you are usually going to have to do something with this array of char and inevitably transform it into a string, which will do some heap allocation and copy the stack data. The one thing that pops to my head is saving a large number of integers to a file (as strings of course), when you can write to the file directly from the char buffer allocated on the stack. Then you avoid any heap allocation across a large number of conversions. Are there many other circumstances where you can avoid the eventual conversion to a heap-allocated string and reap the benefits of the stack-based memory allocation? It seems limited to cases where you are doing large numbers of IntToStr calls and only need the string value locally, inside the loop.
1 Share this post Link to post
David Heffernan 2353 Posted December 13, 2023 (edited) 13 hours ago, Dave Novo said: The one thing that pops to my head is if saving a large number of integers to a file (as strings of course), when you can write to the file directly from the char buffer allocated on the stack. Yes, this is it. It's a very important use case though. Imagine writing large YAML or JSON files containing a lot of data. And then imagine trying to do the same with multiple threads with extra contention on a memory manager. The point is that you build IntToStr using a function like mine above. And publish both. This allows the consumer of the library to have both options. Even if avoiding heap allocation is something that is only done in a narrow set of circumstances, that doesn't render it unnecessary. Edited December 13, 2023 by David Heffernan 1 Share this post Link to post
Dave Novo 51 Posted December 13, 2023
I never meant to imply that it was unnecessary. In fact, I already applied your excellent code to my own code that was writing out CSV files from large 2D matrices of integers. I was just wondering whether there is a wider variety of circumstances where I could also leverage this technique that I was not thinking of. I also have to write out large matrices of floating point values, so I will try to modify the Delphi float conversion routines to work on an array of char, similar to what you proposed below. TBH though, I have not done much timing; I wonder if the overhead of writing the file to disk dwarfs the time it takes for the conversions.
Share this post Link to post
Kas Ob. 124 Posted December 14, 2023
On 12/12/2023 at 8:51 PM, Dave Novo said: The one thing that pops to my head is saving a large number of integers to a file (as strings of course), when you can write to the file directly from the char buffer allocated on the stack.
9 hours ago, Dave Novo said: I already applied your excellent code to my own code that was writing out CSV files from large 2D matrices of integers.
I would suggest ditching the stack and returning to the heap, but without per-call allocation. I have an implementation very similar to David's (his could easily be refactored to do the same), with the difference that I refactored the conversion code into a function that takes a pointer to a 24-char buffer (two functions in fact, one for AnsiChar and one for WideChar). The function takes three parameters:
1) a pointer to the buffer (so it can live on the stack or on the heap), guaranteed by the caller to have space for 24 chars, or 24 bytes, depending on the version;
2) the value to convert;
3) a var parameter that receives the length in chars (the same as the result, but counted in chars), for trimming in chars when needed.
The result is the actual length used, in bytes, for trimming in bytes when needed.
With that you can convert your CSV numbers at a really great rate with zero allocations per operation, and you are not limited by the stack space. Example:

SetLength(T, 24);
LenByte := IntToStrMem(T[1], Value, CharsLen);
SetLength(T, CharsLen);

// or
// loop over this, for a buffer (as in array of bytes)
LenByte := IntToStrMem(@Buffer[P], Value[i], CharsLen);
Inc(P, LenByte);
Buffer[P] := ',';
Inc(P);
...

// use
PBufferWC: PWideChar;
..
LenByte := IntToStrMem(PBufferWC, Value[i], CharsLen);
Inc(OverallLength, LenByte);
Inc(PBufferWC, CharsLen);
PBufferWC^ := ',';
Inc(PBufferWC);

Share this post Link to post
David Heffernan 2353 Posted December 14, 2023 3 hours ago, Kas Ob. said: I would suggest to ditch the stack and return to heap Kind of irrelevant whether it's heap or stack. Just that you avoid the heap allocation. Share this post Link to post
Kas Ob. 124 Posted December 14, 2023
Just now, David Heffernan said: Kind of irrelevant whether it's heap or stack. Just that you avoid the heap allocation.
Sure, and agreed - it just offers more portability, and in Dave's case it may be faster.
Share this post Link to post
David Heffernan 2353 Posted December 14, 2023 13 hours ago, Dave Novo said: In fact, I also have to write out large matrices of floating point values, so will try to modify the Delphi float conversion routines to work on array of char similar to what you proposed below. I have routines to do that. Because the Delphi code to convert between text and floats is irredeemably broken. I'm using dragon4 for float to text, but these days there are better algorithms around. I use dtoa in the other direction. 13 hours ago, Dave Novo said: I wonder if the overhead of writing the file to disk dwarfs the time it takes for the conversions. My code base is spewing out a lot of floats to text files and yes it makes a significant difference to performance. Especially for files on a local drive. Share this post Link to post
Gord P 14 Posted December 15, 2023 On 12/14/2023 at 4:04 AM, David Heffernan said: Because the Delphi code to convert between text and floats is irredeemably broken. Could you expand on that a little bit please? And does that apply to C++Builder as well? Share this post Link to post
dennis12 0 Posted April 2 (edited)
On 12/13/2023 at 10:33 AM, David Heffernan said: Yes, this is it. It's a very important use case though. Imagine writing large YAML or JSON files containing a lot of data. And then imagine trying to do the same with multiple threads with extra contention on a memory manager. The point is that you build IntToStr using a function like mine above. And publish both. This allows the consumer of the library to have both options. Even if avoiding heap allocation is something that is only done in a narrow set of circumstances, that doesn't render it unnecessary.
The Delphi memory manager is very fast, and for small block allocation it has some parallelism built in. I measured IntToStr(-1583749570) at 17 ns, CopyIntegerToWideBuffer(-1583749570, Buffer) at 13 ns, and the version in System.SysUtils without string allocation at 7 ns. All numbers are in nanoseconds. Example code below:

const
  CTwoDigitLookup: packed array[0..99] of array[0..1] of Char =
    ('00','01','02','03','04','05','06','07','08','09',
     '10','11','12','13','14','15','16','17','18','19',
     '20','21','22','23','24','25','26','27','28','29',
     '30','31','32','33','34','35','36','37','38','39',
     '40','41','42','43','44','45','46','47','48','49',
     '50','51','52','53','54','55','56','57','58','59',
     '60','61','62','63','64','65','66','67','68','69',
     '70','71','72','73','74','75','76','77','78','79',
     '80','81','82','83','84','85','86','87','88','89',
     '90','91','92','93','94','95','96','97','98','99');

function DivBy100(i: Cardinal): Cardinal;
{$IF Defined(WIN32)}
asm
  MOV ECX, 1374389535 // 1/100 * 2^(32+5)
  MUL ECX
  MOV EAX, EDX
  SHR EAX, 5
end;
{$ELSEIF Defined(WIN64)}
asm
  MOV EAX, ECX
  IMUL RAX, RAX, 1374389535
  SHR RAX, 37
end;
{$ELSE}
inline;
begin
  Result := i div 100;
end;
{$ENDIF}

{$R-} {$Q-}
{$POINTERMATH ON}

function Int32ToWideBuffer(const AValue: Int32; ABuffer: PChar): Integer;
var
  LVal, LRemainder: UInt32;
  LDigits: Integer;
  LNegative: Boolean;
begin
  LVal := Abs(AValue);
  if LVal >= 10000 then
    if LVal >= 1000000 then
      if LVal >= 100000000 then
        LDigits := 9 + Byte(Ord(LVal >= 1000000000))
      else
        LDigits := 7 + Byte(Ord(LVal >= 10000000))
    else
      LDigits := 5 + Byte(Ord(LVal >= 100000))
  else if LVal >= 100 then
    LDigits := 3 + Byte(Ord(LVal >= 1000))
  else
    LDigits := 1 + Byte(Ord(LVal >= 10));
  LNegative := AValue < 0;
  ABuffer^ := '-';
  Inc(ABuffer, Ord(LNegative));
  Result := LDigits;
  Inc(Result, Ord(LNegative));
  while LDigits > 1 do
  begin
    LRemainder := LVal;
    LVal := DivBy100(LVal);
    Dec(LRemainder, LVal * 100);
    Dec(LDigits, 2);
    PCardinal(@ABuffer[LDigits])^ := Cardinal(CTwoDigitLookup[LRemainder]);
  end;
  if LDigits <> 0 then
    ABuffer^ := Char(LVal or Ord('0'));
end;

Edited April 2 by dennis12 Share this post Link to post
David Heffernan 2353 Posted April 3 13 hours ago, dennis12 said: Delphi memory manager is very fast and for small block allocation it has some parallelism built in. Automatic stack allocation is faster than all heap allocators, and has no thread contention. And Delphi's current heap allocator is known not to be scalable. I've measured this in a real world setting. Avoiding heap allocations when converting numbers (integer and real) to text makes a huge performance difference when creating files, e.g. XML, JSON, YAML. Even when single threaded. When multi-threading the impact is even greater, if using the default heap allocator. I personally use isolated per thread heaps to reduce contention. This doesn't eliminate it because every allocation/deallocation that requires a call to VirtualAlloc and friends has contention. Share this post Link to post
Stefan Glienke 2019 Posted April 5
Unless you show the benchmark code, any results you mention are pointless, because there can be so many things in a benchmark that lead to bias. You could, for example, run IntToStr in a loop, which would reuse the same already allocated string over and over.
Share this post Link to post