Array compression in InternalGetUnicodeCategory() (System.Character)

mael · February 13, 2019

Hi,

In the unit System.Character there is a function InternalGetUnicodeCategory(). It uses a complex indexing scheme to get the category of a Unicode Codepoint (determining if it is a control character, a letter, a number, etc.) from a table.

Result := CategoryTable[CatIndexSecondary[CatIndexPrimary[C shr 8] + ((C shr 4) and $F)] + C and $F];

The indexing is used to save memory, probably using a sort of trie that is implemented using arrays, or a kind of hashmap principle. What I don't get is how the range of Codepoints that goes from 0..$10FFFF is exactly reduced so that the table is only about 36664 bytes (in Delphi XE3) in size.

Can somebody shed some light on how this indexing scheme was determined from the initial array that had a form similar to this:

CodepointProperty: array[0..$10FFFF] of Byte;

System.Character_const.5.2.0.inc gives a little more detail, because there the arrays are more explicit. Seems to be some kind of bit compression, but I am still looking for more insight into how it works.

Edited February 13, 2019 by mael

Rudy Velthuis · February 13, 2019

6 hours ago, mael said:
Hi,

In the unit System.Character there is a function InternalGetUnicodeCategory(). It uses a complex indexing scheme to get the category of a Unicode Codepoint (determining if it is a control character, a letter, a number, etc.) from a table.
Result := CategoryTable[CatIndexSecondary[CatIndexPrimary[C shr 8] + ((C shr 4) and $F)] + C and $F];
The indexing is used to save memory, probably using a sort of trie that is implemented using arrays, or a kind of hashmap principle. What I don't get is how the range of Codepoints that goes from 0..$10FFFF is exactly reduced so that the table is only about 36664 bytes (in Delphi XE3) in size.

Can somebody shed some light on how this indexing scheme was determined from the initial array that had a form similar to this:
CodepointProperty: array[0..$10FFFF] of Byte;
System.Character_const.5.2.0.inc gives a little more detail, because there the arrays are more explicit. Seems to be some kind of bit compression, but I am still looking for more insight into how it works.

I assume that the structure of these $110000 values is thus, that there are many repeating structure, blocks and ranges, so I guess that for certain ranges, you only have to know parts of the values to know which types they are. I didn't look into the detail but I bet this is even described somewhere, in the Unicode documentation. Embarcadero didn't probably devise these lookups on their own.

mael · February 13, 2019

Thanks. There is definitely a structure and ranges that are assigned the same value. But there is no special documentation in the Unicode standard that would go beyond what you can directly deduce from the mapping available in Delphi. I actually started with the standard, then looked for efficient encodings. The standard vaguely suggests using a data structure like a trie.

The Unicode documentation itself only lists every character and gives it a matching category.

The Delphi implementation apparently uses some kind of Hashmap. But I haven't been able to figure out the "inverse function" yet, to create the table.

Edit: I have looked into writing my own hashing function, assuming the division of the original key into three parts (one 13 bit key, and two 4 bit keys) as the original RTL code does. I could reproduce the values after a while, eventhough it seems the RTL wastes a bit of value range. I will update this post when I found out the final solution.

Edited February 13, 2019 by mael

mael · February 14, 2019

After more analysis I found out the tables implement a 3 level hashmap, or actually three hashmaps that are used consecutively to implement the mapping from codepoints to categories.

I have been able to reverse engineer part of the hash functions, but besides the first table, I don't get identical results for the table values or table sizes.

The overall mapping from codepoints to categories works however.

The second table CatIndexSecondary increases its values in steps of 16 every time a bucket with a collision is found. If there is a bucket that has a single value (i.e., no collisions), and that single value appears again in another bucket, they both get assigned the same index. Sometimes though it gets strange and suddenly values get large, apparently to provide more room for collision avoidance, but it's not obvious how they are computed.

It is also strange that a value between 0..15 is added to the result of a mapping with CatIndexSecondary. It would cause collisions if the values in CatIndexSecondary are not carefully chosen to avoid that. The increment in steps of 16 would ensure this happens, but not all values are computed in such a straight forward way. Maybe the increments in steps of 16 are a first attempt, then a check for collisions occurs, and remaining gaps in the index numbers are filled to reduce the size of the hashmap incrementally.

I know this remains vague, but it's at least a quick progress update.

Edited February 14, 2019 by mael

Mahdi Safsafi · February 14, 2019

Quote

What I don't get is how the range of Codepoints that goes from 0..$10FFFF is exactly reduced so that the table is only about 36664 bytes

That's because Unicode code-points are organised in ranges. Meaning, for a given code-point, if you can know its range, you can get its category without implementing a full 0xFFFF table.

I don't know what algorithm Delphi uses (I'm not in a mood to debug the System.Character unit). The good news, I wrote a simple demonstration that do what the InternalGetUnicodeCategory does. The principle is to regroup all similar category and then do a binary filtering.

I assumed that my Unicode code-points lies from 0x0000 to 0xFFFF. First I grouped them by ranges and then I did a binary indexing and finally I setup my table. My final table size is 19200 bytes rather than ‭65535 bytes!

program Console2;

{$APPTYPE CONSOLE}

{$R *.res}


uses
  System.SysUtils,
  System.TypInfo,
  System.Classes,
  System.Character;

type
  TRange = record
    Min: Integer;
    Max: Integer;
    Category: Integer;
    Mask: Integer;
  public
    procedure Increase(Value: Integer);
    procedure Compute(); inline;
    class function Create(AMin, AMax: Integer): TRange; static;
  end;

  PRange = ^TRange;

var
  CatIndexPrimary: array [0 .. $FF] of SmallInt;
  CatIndexSecondary: array [0 .. $FF] of SmallInt;
  Table: array of ShortInt;

const
  MAX_CODEPOINT = $FFFF;

function MyGetUnicodeCategory(C: UCS4Char): TUnicodeCategory;
begin
  // similar to InternalGetUnicodeCategory.
  Result := TUnicodeCategory(
    Table[CatIndexPrimary[C shr 8] + CatIndexSecondary[C and $FF]]);
end;

function GetPrimaryKey(Value: Integer): Integer; inline;
begin
  Result := Value shr $08;
end;

function GetSecondaryKey(Value: Integer): Integer; inline;
begin
  Result := Value and $FF;
end;

function GetUnicodeCategoryName(Value: TUnicodeCategory): string;
begin
  Result := GetEnumName(TypeInfo(TUnicodeCategory), Integer(Value));
end;

procedure BuildTable();
var
  i, k, j, p, s, index: Integer;
  Category: TUnicodeCategory;
  List: TList;
  Range: PRange;
  SeenList: TList;
  size: Integer;
  LTable: array [0 .. MAX_CODEPOINT] of ShortInt;
  LCatIndexPrimary: array [0 .. $FF] of Integer;
  LCatIndexSecondary: array [0 .. $FF] of Integer;
  CategoryValue, PreviousCategoryValue: Integer;
begin
  { initialization }
  List := TList.Create();
  SeenList := TList.Create();
  PreviousCategoryValue := -1;
  Range := nil;
  size := 0;
  for i := 0 to $FF do
  begin
    LCatIndexPrimary[i] := 0;
    LCatIndexSecondary[i] := 0;
  end;

  { build codepoint ranges }
  for i := 0 to MAX_CODEPOINT do
  begin
    LTable[i] := -1;
    Category := GetUnicodeCategory(i);
    CategoryValue := Integer(Category);
    if (PreviousCategoryValue <> CategoryValue) then
    begin
      PreviousCategoryValue := CategoryValue;
      Range := PRange(GetMemory(sizeof(TRange)));
      Range^ := TRange.Create(i, i);
      Range^.Category := CategoryValue;
      List.Add(Range);
    end
    else
    begin
      // continued range.
      Range^.Increase(1);
    end;
  end;
  {
    // debug ranges
    for i := 0 to List.Count - 1 do
    begin
    Range := List[i];
    Writeln(Format('category=%s range[%d..%d] mask=0x%x', [GetUnicodeCategoryName(TUnicodeCategory(Range^.Category)),
    Range^.Min, Range^.Max, Range^.Max]));
    end;
  }

  { setup a uniq mask for LCatIndexPrimary and LCatIndexSecondary }
  for i := 0 to List.Count - 1 do
  begin
    Range := List[i];
    for j := Range^.Min to Range^.Max do
    begin
      p := GetPrimaryKey(j);
      s := GetSecondaryKey(j);
      LCatIndexPrimary[p] := LCatIndexPrimary[p] or Range^.Mask or Range^.Category;
      LCatIndexSecondary[s] := s;
    end;
  end;

  { normalize masks to indexes }
  for i := 0 to $FF do
  begin
    k := SeenList.IndexOf(Pointer(LCatIndexPrimary[i]));
    if (k = -1) then
    begin
      k := SeenList.Count;
      SeenList.Add(Pointer(LCatIndexPrimary[i]));
    end;
    LCatIndexPrimary[i] := k shl 8;
  end;

  { fill table }
  for i := 0 to List.Count - 1 do
  begin
    Range := List[i];
    for j := Range^.Min to Range^.Max do
    begin
      p := GetPrimaryKey(j);
      s := GetSecondaryKey(j);
      index := LCatIndexPrimary[p] + LCatIndexSecondary[s];
      Assert((LTable[index] = -1) or (LTable[index] = Range^.Category));
      LTable[index] := Range^.Category;
    end;
  end;

  { clone all L* tables }
  for i := 0 to MAX_CODEPOINT do
  begin
    if (LTable[i] <> -1) then
      Inc(size);
  end;
  SetLength(Table, size);
  for i := 0 to size - 1 do
    Table[i] := LTable[i];

  for i := 0 to $FF do
  begin
    Assert((LCatIndexPrimary[i] >= Low(SmallInt)) and (LCatIndexPrimary[i] <= High(SmallInt)));
    Assert((LCatIndexSecondary[i] >= Low(SmallInt)) and (LCatIndexSecondary[i] <= High(SmallInt)));
    CatIndexPrimary[i] := LCatIndexPrimary[i];
    CatIndexSecondary[i] := LCatIndexSecondary[i];
  end;

  { output results }
  Writeln('---------------------------------');
  Writeln(Format('size of table             = %d bytes', [Length(Table) * sizeof(ShortInt)]));
  Writeln(Format('size of CatIndexPrimary   = %d bytes', [sizeof(CatIndexPrimary)]));
  Writeln(Format('size of CatIndexSecondary = %d bytes', [sizeof(CatIndexSecondary)]));
  size := (Length(Table) * sizeof(ShortInt)) +
    sizeof(CatIndexPrimary) + sizeof(CatIndexSecondary);
  Writeln(Format('total size = %d bytes (rather than %d) => we won %d bytes !', [size, $FFFF, $FFFF - size]));

  { clean-up }
  for i := 0 to List.Count - 1 do
  begin
    Range := List[i];
    FreeMem(Range);
  end;
  List.Free();
  SeenList.Free();
end;

{ TRange }

procedure TRange.Compute();
var
  i: Integer;
begin
  for i := Min to Max do
    Mask := Mask and i;
end;

class function TRange.Create(AMin, AMax: Integer): TRange;
begin
  Result.Mask := -1;
  Result.Min := AMin;
  Result.Max := AMax;
  Result.Compute();
end;

procedure TRange.Increase(Value: Integer);
begin
  Inc(Max, Value);
  Compute();
end;

var
  i: Integer;
  Category, Category2: TUnicodeCategory;

begin
  ReportMemoryLeaksOnShutdown := True;
  BuildTable();
  { tests }
  for i := 0 to MAX_CODEPOINT do
  begin
    Category := MyGetUnicodeCategory(i);
    Category2 := GetUnicodeCategory(i);
    Assert(Category = Category2);
  end;
  Readln;

end.

NB: I ignored all optimisation/memory good practices. You might need to optimise the code yourself.

mael · February 21, 2019

Thanks a lot for your input, Mahdi.

I wanted to post working code once it's finished and polished, but that will take a while, as I have to solve other parts of the software first.

Your solution is somewhat similar to what I tried, however this is limited to values <= UInt16 (as you noted), whereas Unicode code points range from 0 to $10FFFF. The other issue is that while TUnicodeCategory was in the question, I wanted to implement a general solution, where ranges aren't necessarily as regular (lookup tables for other properties of codepoints where they don't build such nice ranges) or working with the specifically chosen "hash keys" that will not work when your ranges are not as good.

Ideally I was looking for an algorithm that automatically searches for the proper hash functions that would yield a reasonably small hash table. Or at least a principle algorithm for it.

I'll have a look at your solution again, though, if I need to optimize my approach further.

mael · June 14, 2019

I finally wrote a class to iterate over a range of parameters for hash functions/tables chained one after another.

First I thought about using some optimization algorithm, like gradient descent, but it proved hard to write a reasonable error function which has gradients that are smooth and wide enough for optimization to work well. So I simply iterated over a reasonable range of the parameters until a satisfyingly small size was found.

I could reduce the size of 1088 KiB (1 Byte for each of the 1114112 code points), to 16 KiB while keeping random access (three chained hash table lookups and three hash functions to compute the index/key for each hash table), which I think is good enough.

Edited June 14, 2019 by mael

Sign In

Array compression in InternalGetUnicodeCategory() (System.Character)

Recommended Posts

mael 29

Share this post

Link to post

Rudy Velthuis 91

Share this post

Link to post

mael 29

Share this post

Link to post

mael 29

Share this post

Link to post

Mahdi Safsafi 225

Share this post

Link to post

mael 29

Share this post

Link to post

mael 29

Share this post

Link to post

Create an account or sign in to comment

Create an account

Sign in

Browse

Activity