Jump to content
Vincent Parrett

VSoft.Ulid - A Delphi Implementation of ULID for Delphi XE2 or later.

Recommended Posts

Hi All

 

I created a Delphi implementation of ULID - A Universally Unique Lexicographically Sortable Identifier. ULID's are a better option than GUIDs where sorting is needed 

 

  • 128-bit compatibility with UUID
  • 1.21e+24 unique ULIDs per millisecond
  • Lexicographically sortable!
  • Canonically encoded as a 26 character string, as opposed to the 36 character UUID
  • Uses Crockford's base32 for better efficiency and readability (5 bits per character)
  • Case insensitive
  • No special characters (URL safe)
  • Monotonic sort order (correctly detects and handles the same millisecond)

 

https://github.com/VSoftTechnologies/VSoft.Ulid

 

  • Like 6
  • Thanks 2

Share this post


Link to post

Out of curiosity I decided to benchmark TUlid - using the excellent Spring4D Benchmark - some interesting results

 

32bit

 

tulid-bm32.thumb.png.a6df841cb0aa3c9db9ab1dd27c64c459.png

 

64 bit

 

tulid-bm64.thumb.png.c8626afa6d0e891c4ee966c37e9d82f0.png

 

Not being a performance guru - I am guessing the difference in the TUlid.Create performance is down to the Random number generator  - @Stefan Glienke

 

Share this post


Link to post

@Vincent Parrett I would suggest to benchmark the following

class function TUlid.InternalNewUlid(timestamp: UInt64): TUlid;
var
  random : UInt64;
  ts : Int64Rec absolute timestamp;
begin
  result := default(TUlId);
  //reverse order!
  result.FTimeStamp0 := ts.Bytes[5];
  result.FTimeStamp1 := ts.Bytes[4];
  result.FTimeStamp2 := ts.Bytes[3];
  result.FTimeStamp3 := ts.Bytes[2];
  result.FTimeStamp4 := ts.Bytes[1];
  result.FTimeStamp5 := ts.Bytes[0];
  random := FXorShift64.Next;
  //Move(random,result.FRandomness0, 2); // randomness 0-1
  //PWORD(@result.FRandomness0)^ := UInt16(random);      // the exact size
  PNativeUInt(@result.FRandomness0)^ := NativeUInt(random);  // it is safe to overflow here, might yeild better and simpler asm instruction
  random := FXorShift64.Next;
  //Move(random,result.FRandomness2, 8); // randomness 2-9
  PUInt64(@result.FRandomness0)^ := random;
end;

Removing the overhead of Move should count for something, i guess !!

 

Share this post


Link to post
1 minute ago, Kas Ob. said:

@Vincent Parrett I would suggest to benchmark the following


  random := FXorShift64.Next;
  //Move(random,result.FRandomness0, 2); // randomness 0-1
  //PWORD(@result.FRandomness0)^ := UInt16(random);      // the exact size
  PNativeUInt(@result.FRandomness0)^ := NativeUInt(random);  // it is safe to overflow here, might yeild better and simpler asm instruction
  random := FXorShift64.Next;
  //Move(random,result.FRandomness2, 8); // randomness 2-9
  PUInt64(@result.FRandomness0)^ := random;

Removing the overhead of Move should count for something, i guess !!

 

 

@Stefan Glienke is way ahead of you  already sent me a few suggestions. Using a variant record

 

    case Boolean of
        True: (
          FRandomness0 : byte;
          FRandomness1 : byte;
          FRandomness2 : byte;
          FRandomness3 : byte;
          FRandomness4 : byte;
          FRandomness5 : byte;
          FRandomness6 : byte;
          FRandomness7 : byte;
          FRandomness8 : byte;
          FRandomness9 : byte);
        False: (
          FRandomness0_1 : Word;
          FRandomness2_9 : UInt64);

Did squeeze a out few ns better performance.

 

I'll test your idea though - because micro optimising is more fun the boring stuff I was working on... 😉

  • Like 1
  • Haha 1

Share this post


Link to post

Here another idea.

 

The implementation require 2 consequence random then calculate them in one hit, instead of calling FXorShift64.Next in the middle, in other words add/use Random1 and Random2, use only one Next.

Share this post


Link to post
2 hours ago, Kas Ob. said:

in other words add/use Random1 and Random2, use only one Next.

This made no discernable difference. Your version of avoiding the move was slightly faster than the variant array idea. 

Thanks to a bunch of suggestions from Stefan & Kas it is now a lot faster than the c# version I was comparing it to (64bit create - 63ns) - I did also add some validation to the parse method which slowed it down again!
 

The big speedup in the Parse method was avoiding TEncoding.GetBytes - which uses a dynamic array (slow). 

D 11.3 - 32bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   52.1 ns         51.6 ns     10000000
TUlid.Parse - AnsiString       28.4 ns         28.5 ns     23578947
TUlid.Parse                    33.7 ns         33.0 ns     20363636
TUlid.ToString                 32.0 ns         32.5 ns     23578947
D 11.3 64bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   10.9 ns         11.0 ns     64000000
TUlid.Parse - AnsiString       28.1 ns         28.5 ns     23578947
TUlid.Parse                    29.2 ns         29.5 ns     24888889
TUlid.ToString                 29.4 ns         29.3 ns     22400000
D12.1 32bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   51.2 ns         51.6 ns     10000000
TUlid.Parse - AnsiString       28.5 ns         28.3 ns     24888889
TUlid.Parse                    30.2 ns         30.5 ns     23578947
TUlid.ToString                 30.2 ns         29.3 ns     22400000
D 12.1 64bit
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TUlid.Create                   8.94 ns         8.79 ns     74666667
TUlid.Parse - AnsiString       28.1 ns         28.5 ns     26352941
TUlid.Parse                    32.8 ns         32.2 ns     21333333
TUlid.ToString                 33.0 ns         33.0 ns     20363636

 

Anyway, that's more than fast enough - given that this is more than likely used to generate id's for use with databases or in messaging - it's unlikely you would ever hit a performance issue with this! 

 

https://github.com/VSoftTechnologies/VSoft.Ulid/tree/f-optimise

 

I'll merge into main shortly. 

Edit : forgot to mention, the reason 32bit is still slower than 64bit is due to way the compiler does Int64 division (calls a helper function). 

 

 

Edited by Vincent Parrett
addition
  • Like 1

Share this post


Link to post
18 minutes ago, Vincent Parrett said:
3 hours ago, Kas Ob. said:

in other words add/use Random1 and Random2, use only one Next.

This made no discernable difference.

It is expected with your modern CPU, but the difference will be noticeable when the CPU under high usage, this is just for information and you don't need to do it, i liked to share.

 

Anyway, on my XE8 i found one of these loch ness monsters, the one that i saw in real life and in dreams, the ones that very hard to catch red handed, when the compiler throw the towel and silently produce unoptimized ugly assembly instructions.

 

This is reproducible on my XE8 and would love to know if this case is still in newer versions:

1) create new and empty console applicaiton.

2) add VSoft.Ulid

3) put this line in the main procedure, 

uses
  System.SysUtils,
  VSoft.Ulid;

begin
  TUlid.Create;  // we don't care now for memory leak or whatevery
end.

4) add 64-bit platform and enable optimize, keep stack frame enabled.

5) use this create, the same i used earlier

class function TUlid.InternalNewUlid(timestamp: UInt64): TUlid;
var
  random : UInt64;
  ts : Int64Rec absolute timestamp;
begin
  result := default(TUlId);
  //reverse order!
  result.FTimeStamp0 := ts.Bytes[5];
  result.FTimeStamp1 := ts.Bytes[4];
  result.FTimeStamp2 := ts.Bytes[3];
  result.FTimeStamp3 := ts.Bytes[2];
  result.FTimeStamp4 := ts.Bytes[1];
  result.FTimeStamp5 := ts.Bytes[0];
  random := FXorShift64.Next;

  PNativeUInt(@result.FRandomness0)^ := NativeUInt(random);
  random := FXorShift64.Next;
  Move(random,result.FRandomness2, 8); //  comment and uncomment this line and see the difference in assembly code
  PUInt64(@result.FRandomness0)^ := random;
end;

The result is astonishing !

 

Here the difference from my XE8

image.thumb.png.993a0aa871eb5265dd21dd9372b311ca.pngimage.thumb.png.340d57f719102fb3dff2990736cc3973.png

 

The compiler with move behaved differently and stupidly.

 

One side note this is also really annoying, though i don't know if this still the same for the newer and enhanced versions

image.png.f9f26e21e43c5d6f7f8d496156720064.png 

it could be handled way better for these limited size like up to 128bit.

Share this post


Link to post

You have a typo

  PUInt64(@result.FRandomness0)^ := random; << should be FRandomness2

I did actually implement that and it did make a big difference - it's in the f-optimise branch - I haven't merged the changes in to main just yet.

 

I also removed the call to default(TUlid) since we're setting all the fields anyway (as pointed out by Stefan!).

 

I would say that codegen on later versions of delphi is only slightly better than XE8 - mostly it's improved rtl methods. I'm no assembly expert though so take that comment with a pinch of salt. 12.1 is slightly faster due to some div by const optimizations (again thanks to Stefan). 

 

Share this post


Link to post

You still can (and may be should) use the record variant as Stefan suggested, only extended it a little from case boolean to case integer with 0,1 and 2 where you can align the needed types, in the third (additional) case just put 

    case Integer of
        0: (
          FRandomness0 : byte;
          FRandomness1 : byte;
          FRandomness2 : byte;
          FRandomness3 : byte;
          FRandomness4 : byte;
          FRandomness5 : byte;
          FRandomness6 : byte;
          FRandomness7 : byte;
          FRandomness8 : byte;
          FRandomness9 : byte);
        1: (
          FRandomness0_1 : Word;
          FRandomness2_9 : UInt64)
        2: (
          FRandomness0_N : NativeUInt);

There is no need for padding, the compiler should take care of that.

Share this post


Link to post

Good stuff! And a couple of additional points:

1) You could use hardware-generated random seed from either a) CPU or b) the Trusted Platform Module, although as it can be supplied in the constructor, it's immaterial.

2) It would really add a lot of value to have a Timestamp property or function and maybe also UTCTimestamp property as well, possibly also 2 of each: returning TDateTime and integer Unix timestamps. Ideally as class functions, doing TryParse internally on a supplied ULID. - this way, the data with ULID primary keys can be directly used for plotting things on a real time scale.

 

BTW, have you tested it under high-concurrency loads? - i.e.: for the uniqueness of the results, or anything unexpected, like '', or '0' and some such results?

Share this post


Link to post
2 hours ago, ap2021 said:

1) You could use hardware-generated random seed from either a) CPU or b) the Trusted Platform Module, although as it can be supplied in the constructor, it's immaterial.

I have no idea how to do that.

 

2 hours ago, ap2021 said:

2) It would really add a lot of value to have a Timestamp property or function and maybe also UTCTimestamp property as well, possibly also 2 of each: returning TDateTime and integer Unix timestamps. Ideally as class functions, doing TryParse internally on a supplied ULID. - this way, the data with ULID primary keys can be directly used for plotting things on a real time scale.

Happy to take a pull request to implement this. FWIW My UUID v7 library does expose the UTCTimestamp. It uses a v54 UUID under the hood to create the random part - which on windows uses the windows cng crypto library.  

 

2 hours ago, ap2021 said:

BTW, have you tested it under high-concurrency loads? - i.e.: for the uniqueness of the results, or anything unexpected, like '', or '0' and some such results?

Other than a quick n dirty test application, no. 

Share this post


Link to post

Vincent, thanks, I saw you have created an issue yourself (and I still have no account for GitHub). Here's a public converter, which may be helpful: https://ugai.github.io/ulid-timestamp-converter/

 

As for the random numbers, CPU is supposed to have RDRAND and RDSEED instructions, although they are flagged in Delphi asm as invalid - they do not keep up, but I assume that that can be plugged with the bytes in a DB statement - it's at the edge of my experience, so I'm not pushing. It's probably different for x32 vs. x64 and could malfunction cross-platform because of parameter passing conventions vs. the register(s) it uses. Apparently it has its own limitations and could return 0, if not ready, so a fallback would be required anyway. All hardware random numbers are also slow, so this only makes sense for initial seeding. And they all have adverse reports, so some may prefer one over the other, so it's probably best left to the caller do do what they want.

Share this post


Link to post
On 10/3/2025 at 10:14 AM, Vincent Parrett said:

I have no idea how to do that.

If you want to use RDRAND and RDSEEK, I attach the sources so you can see how they works.

These is another topic where we talked about: https://en.delphipraxis.net/topic/10271-getting-rdseed-with-delphi/?do=findComment&comment=81868

 

These two instructions works in Intel and Amd CPU only (x86 or x64). In Arm they were introduced recently and the behave in slightly different way.

In the source there are some lines commented that you can delete.

 

These have been in use for at least 4 years in all my production programs without ever having given me any problems. They have also been in use and constantly monitored for a year on a test web server (to check the workload and performance in a multithreaded environment).

Unit11.zip

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×