Jump to content
Vincent Parrett

VSoft.Ulid - A Delphi Implementation of ULID for Delphi XE2 or later.

Recommended Posts

Hi All

 

I created a Delphi implementation of ULID - A Universally Unique Lexicographically Sortable Identifier. ULID's are a better option than GUIDs where sorting is needed 

 

  • 128-bit compatibility with UUID
  • 1.21e+24 unique ULIDs per millisecond
  • Lexicographically sortable!
  • Canonically encoded as a 26 character string, as opposed to the 36 character UUID
  • Uses Crockford's base32 for better efficiency and readability (5 bits per character)
  • Case insensitive
  • No special characters (URL safe)
  • Monotonic sort order (correctly detects and handles the same millisecond)

 

https://github.com/VSoftTechnologies/VSoft.Ulid

 

  • Like 5
  • Thanks 2

Share this post


Link to post

Out of curiosity I decided to benchmark TUlid - using the excellent Spring4D Benchmark - some interesting results

 

32bit

 

tulid-bm32.thumb.png.a6df841cb0aa3c9db9ab1dd27c64c459.png

 

64 bit

 

tulid-bm64.thumb.png.c8626afa6d0e891c4ee966c37e9d82f0.png

 

Not being a performance guru - I am guessing the difference in the TUlid.Create performance is down to the Random number generator  - @Stefan Glienke

 

Share this post


Link to post

@Vincent Parrett I would suggest to benchmark the following

class function TUlid.InternalNewUlid(timestamp: UInt64): TUlid;
var
  random : UInt64;
  ts : Int64Rec absolute timestamp;
begin
  result := default(TUlId);
  //reverse order!
  result.FTimeStamp0 := ts.Bytes[5];
  result.FTimeStamp1 := ts.Bytes[4];
  result.FTimeStamp2 := ts.Bytes[3];
  result.FTimeStamp3 := ts.Bytes[2];
  result.FTimeStamp4 := ts.Bytes[1];
  result.FTimeStamp5 := ts.Bytes[0];
  random := FXorShift64.Next;
  //Move(random,result.FRandomness0, 2); // randomness 0-1
  //PWORD(@result.FRandomness0)^ := UInt16(random);      // the exact size
  PNativeUInt(@result.FRandomness0)^ := NativeUInt(random);  // it is safe to overflow here, might yeild better and simpler asm instruction
  random := FXorShift64.Next;
  //Move(random,result.FRandomness2, 8); // randomness 2-9
  PUInt64(@result.FRandomness0)^ := random;
end;

Removing the overhead of Move should count for something, i guess !!

 

Share this post


Link to post
1 minute ago, Kas Ob. said:

@Vincent Parrett I would suggest to benchmark the following


  random := FXorShift64.Next;
  //Move(random,result.FRandomness0, 2); // randomness 0-1
  //PWORD(@result.FRandomness0)^ := UInt16(random);      // the exact size
  PNativeUInt(@result.FRandomness0)^ := NativeUInt(random);  // it is safe to overflow here, might yeild better and simpler asm instruction
  random := FXorShift64.Next;
  //Move(random,result.FRandomness2, 8); // randomness 2-9
  PUInt64(@result.FRandomness0)^ := random;

Removing the overhead of Move should count for something, i guess !!

 

 

@Stefan Glienke is way ahead of you  already sent me a few suggestions. Using a variant record

 

    case Boolean of
        True: (
          FRandomness0 : byte;
          FRandomness1 : byte;
          FRandomness2 : byte;
          FRandomness3 : byte;
          FRandomness4 : byte;
          FRandomness5 : byte;
          FRandomness6 : byte;
          FRandomness7 : byte;
          FRandomness8 : byte;
          FRandomness9 : byte);
        False: (
          FRandomness0_1 : Word;
          FRandomness2_9 : UInt64);

Did squeeze a out few ns better performance.

 

I'll test your idea though - because micro optimising is more fun the boring stuff I was working on... 😉

  • Like 1
  • Haha 1

Share this post


Link to post

Here another idea.

 

The implementation require 2 consequence random then calculate them in one hit, instead of calling FXorShift64.Next in the middle, in other words add/use Random1 and Random2, use only one Next.

Share this post


Link to post
2 hours ago, Kas Ob. said:

in other words add/use Random1 and Random2, use only one Next.

This made no discernable difference. Your version of avoiding the move was slightly faster than the variant array idea. 

Thanks to a bunch of suggestions from Stefan & Kas it is now a lot faster than the c# version I was comparing it to (64bit create - 63ns) - I did also add some validation to the parse method which slowed it down again!
 

The big speedup in the Parse method was avoiding TEncoding.GetBytes - which uses a dynamic array (slow). 

D 11.3 - 32bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   52.1 ns         51.6 ns     10000000
TUlid.Parse - AnsiString       28.4 ns         28.5 ns     23578947
TUlid.Parse                    33.7 ns         33.0 ns     20363636
TUlid.ToString                 32.0 ns         32.5 ns     23578947
D 11.3 64bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   10.9 ns         11.0 ns     64000000
TUlid.Parse - AnsiString       28.1 ns         28.5 ns     23578947
TUlid.Parse                    29.2 ns         29.5 ns     24888889
TUlid.ToString                 29.4 ns         29.3 ns     22400000
D12.1 32bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   51.2 ns         51.6 ns     10000000
TUlid.Parse - AnsiString       28.5 ns         28.3 ns     24888889
TUlid.Parse                    30.2 ns         30.5 ns     23578947
TUlid.ToString                 30.2 ns         29.3 ns     22400000
D 12.1 64bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   8.94 ns         8.79 ns     74666667
TUlid.Parse - AnsiString       28.1 ns         28.5 ns     26352941
TUlid.Parse                    32.8 ns         32.2 ns     21333333
TUlid.ToString                 33.0 ns         33.0 ns     20363636

 

Anyway, that's more than fast enough - given that this is more than likely used to generate id's for use with databases or in messaging - it's unlikely you would ever hit a performance issue with this! 

 

https://github.com/VSoftTechnologies/VSoft.Ulid/tree/f-optimise

 

I'll merge into main shortly. 

Edit : forgot to mention, the reason 32bit is still slower than 64bit is due to way the compiler does Int64 division (calls a helper function). 

 

 

Edited by Vincent Parrett
addition
  • Like 1

Share this post


Link to post
18 minutes ago, Vincent Parrett said:
3 hours ago, Kas Ob. said:

in other words add/use Random1 and Random2, use only one Next.

This made no discernable difference.

It is expected with your modern CPU, but the difference will be noticeable when the CPU under high usage, this is just for information and you don't need to do it, i liked to share.

 

Anyway, on my XE8 i found one of these loch ness monsters, the one that i saw in real life and in dreams, the ones that very hard to catch red handed, when the compiler throw the towel and silently produce unoptimized ugly assembly instructions.

 

This is reproducible on my XE8 and would love to know if this case is still in newer versions:

1) create new and empty console applicaiton.

2) add VSoft.Ulid

3) put this line in the main procedure, 

uses
  System.SysUtils,
  VSoft.Ulid;

begin
  TUlid.Create;  // we don't care now for memory leak or whatevery
end.

4) add 64-bit platform and enable optimize, keep stack frame enabled.

5) use this create, the same i used earlier

class function TUlid.InternalNewUlid(timestamp: UInt64): TUlid;
var
  random : UInt64;
  ts : Int64Rec absolute timestamp;
begin
  result := default(TUlId);
  //reverse order!
  result.FTimeStamp0 := ts.Bytes[5];
  result.FTimeStamp1 := ts.Bytes[4];
  result.FTimeStamp2 := ts.Bytes[3];
  result.FTimeStamp3 := ts.Bytes[2];
  result.FTimeStamp4 := ts.Bytes[1];
  result.FTimeStamp5 := ts.Bytes[0];
  random := FXorShift64.Next;

  PNativeUInt(@result.FRandomness0)^ := NativeUInt(random);
  random := FXorShift64.Next;
  Move(random,result.FRandomness2, 8); //  comment and uncomment this line and see the difference in assembly code
  PUInt64(@result.FRandomness0)^ := random;
end;

The result is astonishing !

 

Here the difference from my XE8

image.thumb.png.993a0aa871eb5265dd21dd9372b311ca.pngimage.thumb.png.340d57f719102fb3dff2990736cc3973.png

 

The compiler with move behaved differently and stupidly.

 

One side note this is also really annoying, though i don't know if this still the same for the newer and enhanced versions

image.png.f9f26e21e43c5d6f7f8d496156720064.png 

it could be handled way better for these limited size like up to 128bit.

Share this post


Link to post

You have a typo

  PUInt64(@result.FRandomness0)^ := random; << should be FRandomness2

I did actually implement that and it did make a big difference - it's in the f-optimise branch - I haven't merged the changes in to main just yet.

 

I also removed the call to default(TUlid) since we're setting all the fields anyway (as pointed out by Stefan!).

 

I would say that codegen on later versions of delphi is only slightly better than XE8 - mostly it's improved rtl methods. I'm no assembly expert though so take that comment with a pinch of salt. 12.1 is slightly faster due to some div by const optimizations (again thanks to Stefan). 

 

Share this post


Link to post

You still can (and may be should) use the record variant as Stefan suggested, only extended it a little from case boolean to case integer with 0,1 and 2 where you can align the needed types, in the third (additional) case just put 

    case Integer of
        0: (
          FRandomness0 : byte;
          FRandomness1 : byte;
          FRandomness2 : byte;
          FRandomness3 : byte;
          FRandomness4 : byte;
          FRandomness5 : byte;
          FRandomness6 : byte;
          FRandomness7 : byte;
          FRandomness8 : byte;
          FRandomness9 : byte);
        1: (
          FRandomness0_1 : Word;
          FRandomness2_9 : UInt64)
        2: (
          FRandomness0_N : NativeUInt);

There is no need for padding, the compiler should take care of that.

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×