Vincent Parrett 767 Posted August 5, 2024 Hi All I created a Delphi implementation of ULID - A Universally Unique Lexicographically Sortable Identifier. ULID's are a better option than GUIDs where sorting is needed 128-bit compatibility with UUID 1.21e+24 unique ULIDs per millisecond Lexicographically sortable! Canonically encoded as a 26 character string, as opposed to the 36 character UUID Uses Crockford's base32 for better efficiency and readability (5 bits per character) Case insensitive No special characters (URL safe) Monotonic sort order (correctly detects and handles the same millisecond) https://github.com/VSoftTechnologies/VSoft.Ulid 5 2 Share this post Link to post
Vincent Parrett 767 Posted August 5, 2024 Out of curiosity I decided to benchmark TUlid - using the excellent Spring4D Benchmark - some interesting results 32bit 64 bit Not being a performance guru - I am guessing the difference in the TUlid.Create performance is down to the Random number generator - @Stefan Glienke Share this post Link to post
Kas Ob. 124 Posted August 6, 2024 @Vincent Parrett I would suggest to benchmark the following class function TUlid.InternalNewUlid(timestamp: UInt64): TUlid; var random : UInt64; ts : Int64Rec absolute timestamp; begin result := default(TUlId); //reverse order! result.FTimeStamp0 := ts.Bytes[5]; result.FTimeStamp1 := ts.Bytes[4]; result.FTimeStamp2 := ts.Bytes[3]; result.FTimeStamp3 := ts.Bytes[2]; result.FTimeStamp4 := ts.Bytes[1]; result.FTimeStamp5 := ts.Bytes[0]; random := FXorShift64.Next; //Move(random,result.FRandomness0, 2); // randomness 0-1 //PWORD(@result.FRandomness0)^ := UInt16(random); // the exact size PNativeUInt(@result.FRandomness0)^ := NativeUInt(random); // it is safe to overflow here, might yeild better and simpler asm instruction random := FXorShift64.Next; //Move(random,result.FRandomness2, 8); // randomness 2-9 PUInt64(@result.FRandomness0)^ := random; end; Removing the overhead of Move should count for something, i guess !! Share this post Link to post
Vincent Parrett 767 Posted August 6, 2024 1 minute ago, Kas Ob. said: @Vincent Parrett I would suggest to benchmark the following random := FXorShift64.Next; //Move(random,result.FRandomness0, 2); // randomness 0-1 //PWORD(@result.FRandomness0)^ := UInt16(random); // the exact size PNativeUInt(@result.FRandomness0)^ := NativeUInt(random); // it is safe to overflow here, might yeild better and simpler asm instruction random := FXorShift64.Next; //Move(random,result.FRandomness2, 8); // randomness 2-9 PUInt64(@result.FRandomness0)^ := random; Removing the overhead of Move should count for something, i guess !! @Stefan Glienke is way ahead of you already sent me a few suggestions. Using a variant record case Boolean of True: ( FRandomness0 : byte; FRandomness1 : byte; FRandomness2 : byte; FRandomness3 : byte; FRandomness4 : byte; FRandomness5 : byte; FRandomness6 : byte; FRandomness7 : byte; FRandomness8 : byte; FRandomness9 : byte); False: ( FRandomness0_1 : Word; FRandomness2_9 : UInt64); Did squeeze a out few ns better performance. I'll test your idea though - because micro optimising is more fun the boring stuff I was working on... 😉 1 1 Share this post Link to post
Kas Ob. 124 Posted August 6, 2024 Here another idea. The implementation require 2 consequence random then calculate them in one hit, instead of calling FXorShift64.Next in the middle, in other words add/use Random1 and Random2, use only one Next. Share this post Link to post
Vincent Parrett 767 Posted August 6, 2024 (edited) 2 hours ago, Kas Ob. said: in other words add/use Random1 and Random2, use only one Next. This made no discernable difference. Your version of avoiding the move was slightly faster than the variant array idea. Thanks to a bunch of suggestions from Stefan & Kas it is now a lot faster than the c# version I was comparing it to (64bit create - 63ns) - I did also add some validation to the parse method which slowed it down again! The big speedup in the Parse method was avoiding TEncoding.GetBytes - which uses a dynamic array (slow). D 11.3 - 32bit ------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------- TUlid.Create 52.1 ns 51.6 ns 10000000 TUlid.Parse - AnsiString 28.4 ns 28.5 ns 23578947 TUlid.Parse 33.7 ns 33.0 ns 20363636 TUlid.ToString 32.0 ns 32.5 ns 23578947 D 11.3 64bit ------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------- TUlid.Create 10.9 ns 11.0 ns 64000000 TUlid.Parse - AnsiString 28.1 ns 28.5 ns 23578947 TUlid.Parse 29.2 ns 29.5 ns 24888889 TUlid.ToString 29.4 ns 29.3 ns 22400000 D12.1 32bit ------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------- TUlid.Create 51.2 ns 51.6 ns 10000000 TUlid.Parse - AnsiString 28.5 ns 28.3 ns 24888889 TUlid.Parse 30.2 ns 30.5 ns 23578947 TUlid.ToString 30.2 ns 29.3 ns 22400000 D 12.1 64bit ------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------- TUlid.Create 8.94 ns 8.79 ns 74666667 TUlid.Parse - AnsiString 28.1 ns 28.5 ns 26352941 TUlid.Parse 32.8 ns 32.2 ns 21333333 TUlid.ToString 33.0 ns 33.0 ns 20363636 Anyway, that's more than fast enough - given that this is more than likely used to generate id's for use with databases or in messaging - it's unlikely you would ever hit a performance issue with this! https://github.com/VSoftTechnologies/VSoft.Ulid/tree/f-optimise I'll merge into main shortly. Edit : forgot to mention, the reason 32bit is still slower than 64bit is due to way the compiler does Int64 division (calls a helper function). Edited August 6, 2024 by Vincent Parrett addition 1 Share this post Link to post
Kas Ob. 124 Posted August 6, 2024 18 minutes ago, Vincent Parrett said: 3 hours ago, Kas Ob. said: in other words add/use Random1 and Random2, use only one Next. This made no discernable difference. It is expected with your modern CPU, but the difference will be noticeable when the CPU under high usage, this is just for information and you don't need to do it, i liked to share. Anyway, on my XE8 i found one of these loch ness monsters, the one that i saw in real life and in dreams, the ones that very hard to catch red handed, when the compiler throw the towel and silently produce unoptimized ugly assembly instructions. This is reproducible on my XE8 and would love to know if this case is still in newer versions: 1) create new and empty console applicaiton. 2) add VSoft.Ulid 3) put this line in the main procedure, uses System.SysUtils, VSoft.Ulid; begin TUlid.Create; // we don't care now for memory leak or whatevery end. 4) add 64-bit platform and enable optimize, keep stack frame enabled. 5) use this create, the same i used earlier class function TUlid.InternalNewUlid(timestamp: UInt64): TUlid; var random : UInt64; ts : Int64Rec absolute timestamp; begin result := default(TUlId); //reverse order! result.FTimeStamp0 := ts.Bytes[5]; result.FTimeStamp1 := ts.Bytes[4]; result.FTimeStamp2 := ts.Bytes[3]; result.FTimeStamp3 := ts.Bytes[2]; result.FTimeStamp4 := ts.Bytes[1]; result.FTimeStamp5 := ts.Bytes[0]; random := FXorShift64.Next; PNativeUInt(@result.FRandomness0)^ := NativeUInt(random); random := FXorShift64.Next; Move(random,result.FRandomness2, 8); // comment and uncomment this line and see the difference in assembly code PUInt64(@result.FRandomness0)^ := random; end; The result is astonishing ! Here the difference from my XE8 The compiler with move behaved differently and stupidly. One side note this is also really annoying, though i don't know if this still the same for the newer and enhanced versions it could be handled way better for these limited size like up to 128bit. Share this post Link to post
Vincent Parrett 767 Posted August 6, 2024 You have a typo PUInt64(@result.FRandomness0)^ := random; << should be FRandomness2 I did actually implement that and it did make a big difference - it's in the f-optimise branch - I haven't merged the changes in to main just yet. I also removed the call to default(TUlid) since we're setting all the fields anyway (as pointed out by Stefan!). I would say that codegen on later versions of delphi is only slightly better than XE8 - mostly it's improved rtl methods. I'm no assembly expert though so take that comment with a pinch of salt. 12.1 is slightly faster due to some div by const optimizations (again thanks to Stefan). Share this post Link to post
Kas Ob. 124 Posted August 6, 2024 You still can (and may be should) use the record variant as Stefan suggested, only extended it a little from case boolean to case integer with 0,1 and 2 where you can align the needed types, in the third (additional) case just put case Integer of 0: ( FRandomness0 : byte; FRandomness1 : byte; FRandomness2 : byte; FRandomness3 : byte; FRandomness4 : byte; FRandomness5 : byte; FRandomness6 : byte; FRandomness7 : byte; FRandomness8 : byte; FRandomness9 : byte); 1: ( FRandomness0_1 : Word; FRandomness2_9 : UInt64) 2: ( FRandomness0_N : NativeUInt); There is no need for padding, the compiler should take care of that. Share this post Link to post