vfbb

Unicode string - how to iterate its elements?


The Delphi string is an implementation of UTF-16, in which the "normal" graphic characters correspond to 1 Char (2 bytes), but other graphic characters are represented by 2 Chars (4 bytes), the so-called surrogate pair (like the emoji 🙏). In that case we need to check Char.IsSurrogate to know that it is a graphic character of 2 Chars (4 bytes). But the problem is that UTF-16 does not limit a graphic character to a maximum of 4 bytes; a single graphic character can be much longer, and today most emojis already occupy at least 8 bytes (like the emoji 🙏🏻 <- light version). This is a problem, because Delphi only handles graphic characters of 2 bytes or 4 bytes, often converting the graphic character or string to UCS4Char or UCS4String respectively, to treat everything as being 4 bytes.


So, how do you know the actual size of each graphic character in a string?
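To make the problem concrete, here is a minimal sketch (my own illustration, assuming the Char helpers from System.Character) showing that counting Chars undercounts nothing but counting code points still isn't enough, since a grapheme like 🙏🏻 is itself two code points:

```pascal
program CountCodePoints;
{$APPTYPE CONSOLE}
uses
  System.Character;

var
  S: string;
  I, CodePoints: Integer;
begin
  S := #$D83D#$DE4F;  // U+1F64F (🙏) encoded as a UTF-16 surrogate pair
  CodePoints := 0;
  I := Low(S);
  while I <= High(S) do
  begin
    if S[I].IsHighSurrogate then
      Inc(I, 2)   // skip both halves of the surrogate pair
    else
      Inc(I);
    Inc(CodePoints);
  end;
  Writeln(Length(S), ' Chars, ', CodePoints, ' code point');  // 2 Chars, 1 code point
end.
```

This handles surrogate pairs, but it still treats 🙏🏻 as two separate elements, which is exactly the question being asked.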

Edited by vfbb


It's 4 bytes only, and the modifier (to make it lighter) is also 4 bytes.

 

like 🧑🏿‍🦽 is

 

D8 3E DD D1 = 🧑 

D8 3C DF FF = skin tone modifier

20 0D = zero width joiner

D8 3E DD BD = 🦽

 

And how do you find out that it's one emoji? I'm afraid you have to parse the codes and follow the rules defined in the Unicode specs (grapheme clusters in UAX #29, emoji sequences in UTS #51).
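The surrogate pairs above can be decoded back to code points with the standard UTF-16 formula; a quick sketch (my own, not from the thread):

```pascal
// Decode a UTF-16 surrogate pair back into a Unicode code point:
// code point = ((Hi - $D800) shl 10) + (Lo - $DC00) + $10000
function DecodeSurrogatePair(Hi, Lo: Char): UCS4Char;
begin
  Result := ((UCS4Char(Hi) - $D800) shl 10) + (UCS4Char(Lo) - $DC00) + $10000;
end;
```

For example, DecodeSurrogatePair(#$D83E, #$DDD1) yields $1F9D1, the 🧑 code point from the breakdown above.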

 

Edited by Attila Kovacs


Embarcadero should create a class based on C#'s TextElementEnumerator, which does exactly this.


TextElementEnumerator looks really lightweight, and GetUnicodeCategory is already in System.Character. Not sure what Unicode version is supported, though.


Here is an implementation of a TextElementEnumerator (not tested):

Note that on Windows, CharNext is the best, but still not perfect, way to get text elements.  See What('s) a character! | The Old New Thing (microsoft.com)

 

You should then be able to write code such as:

 

for var Element in TextElements(MyString) do

 

interface
type
  TTextElementEnumerator = record
  private
    FStart: PChar;
    FCurrent: string;
  public
    constructor Create(const AValue: string);
    function MoveNext: Boolean; inline;
    function GetCurrent: string; inline;
    property Current: string read GetCurrent;
  end;

  TTextElementEnumeratorHelper = record
  private
    FString: string;
  public
    constructor Create(const AValue: string);
    function  GetEnumerator: TTextElementEnumerator;
  end;

function TextElements(const AValue: string): TTextElementEnumeratorHelper;

implementation

{$REGION 'Text Element Enumerator'}
{ TTextElementEnumerator }

constructor TTextElementEnumerator.Create(const AValue: string);
begin
   FStart := PWideChar(AValue);
end;

function TTextElementEnumerator.GetCurrent: string;
begin
  Result := FCurrent;
end;

function TTextElementEnumerator.MoveNext: Boolean;
var
  FEnd: PWideChar;
begin
  if FStart^ = #0 then Exit(False);

  FEnd := Winapi.Windows.CharNext(FStart);
  SetString(FCurrent, FStart, FEnd - FStart);
  FStart := FEnd;
  Result := True;
end;

{ TTextElementEnumeratorHelper }
constructor TTextElementEnumeratorHelper.Create(const AValue: string);
begin
  FString := AValue;
end;

function TTextElementEnumeratorHelper.GetEnumerator: TTextElementEnumerator;
begin
   Result.Create(FString);
end;

function TextElements(const AValue: string): TTextElementEnumeratorHelper;
begin
  Result.Create(AValue);
end;
{$ENDREGION}
Edited by pyscripter

7 hours ago, vfbb said:

The delphi string is an implementation of UTF-16 in which the "normal" graphic characters corresponding to 1 Char (2 bytes), but having others graphic characters being represented by 2 Chars (4 bytes), the so-called surrogate pair

There is no such thing as a "normal graphic character" in Unicode.  What you are thinking of as a "character" is officially referred to as a "grapheme", which consists of 1 or more Unicode codepoints linked together to make up 1 human-readable glyph.  Individual Unicode codepoints are encoded as 1 or 2 codeunits in UTF-16, which is what each 2-byte Char represents.  When a codepoint is encoded into 2 UTF-16 codeunits, that is also known as a "surrogate pair".

Quote

(like the emoji 🙏)

That emoji is 1 Unicode codepoint:

U+1F64F (Folded Hands)

 

Which is encoded as 2 codeunits in UTF-16:

D83D

DE4F

Quote

And in this case we need to check if the char IsSurrogate, to know that it is a graphic character of 2 Chars (4 bytes).

That will only allow you to determine the value of the 1st Unicode codepoint in the grapheme.  But then you need to look at and decode subsequent codepoints to determine if they "combine" in sequence with that 1st codepoint.

Quote

But the problem is that utf-16 is not limited to a maximum of 4 bytes for a graphic character, in fact it already predicts that a graphic character can have up to 14 bytes for future implementations, and today most of the emojis already occupy at least 8 bytes (like the emoji 🙏🏻 <- light version).

That emoji takes up 8 bytes, consisting of 4 UTF-16 codeunits:

D83D

DE4F

D83C

DFFB

 

Which decode as 2 Unicode codepoints:

U+1F64F (Folded Hands)

U+1F3FB (Emoji Modifier Type-1-2)

 

And it gets even more complicated especially for Emoji, because 1) modern Emoji support skin tones and genders, which are handled using 1+ modifier codepoints, and 2) multiple unrelated Emoji can be grouped together with codepoint U+200D to create even new Emoji.

 

For example:

 

U+1F469 (Woman)

U+1F3FD (Emoji Modifier Type-4)

U+200D (Zero Width Joiner)

U+1F4BB (Personal Computer)

 

Which is treated as 1 single Emoji of a medium-skin-toned woman sitting behind a PC.

 

Or, how about:

 

U+1F468 (Man)

U+200D (ZWJ)

U+1F469 (Woman)

U+200D (ZWJ)

U+1F467 (Girl)

U+200D (ZWJ)

U+1F466 (Boy)

 

Which is treated as 1 single Emoji of a Family with a dad, mom, and 2 children.

 

See https://eng.getwisdom.io/emoji-modifiers-and-sequence-combinations/ for more details.
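The ZWJ sequence described above can be built up literally in Delphi source; a sketch of my own (assuming a console or control that renders emoji):

```pascal
// Build the "family" emoji from its code points. Each 4-byte code
// point becomes a UTF-16 surrogate pair, joined by U+200D (ZWJ).
const
  ZWJ = #$200D;
var
  Family: string;
begin
  Family := #$D83D#$DC68 + ZWJ +  // U+1F468 Man
            #$D83D#$DC69 + ZWJ +  // U+1F469 Woman
            #$D83D#$DC67 + ZWJ +  // U+1F467 Girl
            #$D83D#$DC66;         // U+1F466 Boy
  Writeln(Length(Family));  // 11 Chars for 1 visible glyph
end;
```

Four surrogate pairs plus three joiners: 11 Chars, 22 bytes, one grapheme.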

Quote

This is a problem, because delphi only handles the graphic character with the possibility of being 2 bytes or 4 bytes, and is often converting the graphic character or string to UCS4Char or UCS4String respectively, to treat everyone as being 4 bytes.

Delphi itself only concerns itself with the encoding/decoding of UTF-16 (especially when converting that data to other encodings, like ANSI, UTF-8, etc).  Delphi does not care what the UTF-16 data represents.  Graphemes are handled only by application code that needs to do text processing, glyph rendering, etc.  Those things are outside of Delphi's scope as a general programming language.  Most of the time, you should just let the OS deal with them, unless you are writing your own engines that need to be grapheme-aware.

Quote

So, how do you know the actual size of each graphic character in a string?

By using a Unicode library that understands the rules of Graphemes, Emojis, etc.  Delphi has no such library built-in, but there are several 3rd party Unicode libraries that do understand those things.

Edited by Remy Lebeau

22 minutes ago, Stefan Glienke said:

That does not work at all - CharNext does not work well. And that article is from 2007 - did we even have emojis back then? 😄

Not really, but we did have multi-byte and multi-codepoint character sequences, which MSDN claims CharNextA() and CharNextW() do handle.  But not well enough, in this case.

Quote

As that article describes, you need .NET 5, which was just released 4 1/2 months ago, to handle Grapheme clusters correctly in things like enumeration, etc.

Edited by Remy Lebeau


CharNext does handle surrogate pairs, diacritics, and other multi-codepoint sequences reasonably well.  Clearly it does not handle emojis well.  Nor does any other Windows function that I know of.  But I was mostly trying to show that the text enumeration itself is the easy part.

 

Quote

CharNext works with default "user" expectations of characters when dealing with diacritics. For example: A string that contains U+0061 U+030a "LATIN SMALL LETTER A" + COMBINING RING ABOVE" — which looks like "å", will advance two code points, not one. A string that contains U+0061 U+0301 U+0302 U+0303 U+0304 — which looks like "a´^~¯", will advance five code points, not one, and so on.

 

The .NET 5 Unicode support is based on the ICU library.  See Globalization and ICU | Microsoft Docs for details.  There is an ICU Delphi wrapper, but it has not been updated for years.

 

ICU has been included in Windows since the Windows 10 Creators Update.  The DLL names and the header files are listed here.  It would be nice to have translations of the headers to Pascal, and even better, some higher-level wrappers.
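A rough, untested sketch of what such a translation might look like: enumerating graphemes through ICU's C break-iterator API (ubrk_open / ubrk_next), hand-declared against the icu.dll bundled with recent Windows 10 releases. The declarations are my own translation of the C headers, so treat them as an assumption to verify:

```pascal
const
  UBRK_CHARACTER = 0;   // grapheme-cluster break iterator
  UBRK_DONE = -1;

type
  UBreakIterator = Pointer;
  UErrorCode = Integer;

function ubrk_open(iterType: Integer; locale: PAnsiChar; text: PWideChar;
  textLength: Integer; var status: UErrorCode): UBreakIterator;
  cdecl; external 'icu.dll';
function ubrk_next(bi: UBreakIterator): Integer; cdecl; external 'icu.dll';
procedure ubrk_close(bi: UBreakIterator); cdecl; external 'icu.dll';

// Print each grapheme cluster of S on its own line.
procedure EnumGraphemes(const S: string);
var
  Status: UErrorCode;
  BI: UBreakIterator;
  Start, Stop: Integer;
begin
  Status := 0;
  BI := ubrk_open(UBRK_CHARACTER, nil, PWideChar(S), Length(S), Status);
  try
    Start := 0;
    Stop := ubrk_next(BI);
    while Stop <> UBRK_DONE do
    begin
      Writeln(Copy(S, Start + 1, Stop - Start));  // one grapheme
      Start := Stop;
      Stop := ubrk_next(BI);
    end;
  finally
    ubrk_close(BI);
  end;
end;
```

Unlike CharNext, ICU's UBRK_CHARACTER iterator follows the UAX #29 grapheme-cluster rules, so ZWJ emoji sequences come out as single elements.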

Edited by pyscripter

1 hour ago, Remy Lebeau said:

Not really, but we did have multi-byte and multi-codepoint character sequences, which MSDN claims CharNextA() and CharNextW() do handle.  But not well enough, in this case.

As that article describes, you need .NET 5, which was just released 4 1/2 months ago, to handle Grapheme clusters correctly in things like enumeration, etc.

But before, it did not run in an endless loop like the code from @pyscripter does 🤷‍♂️

37 minutes ago, Stefan Glienke said:

But before it did not run in an endless loop

I did say I did not test :classic_biggrin:.  One missing line has now been added to the code (FStart := FEnd;).

 

Works as expected in:

    var TestString := 'å'+#$0061#$0301#$0302#$0303#$0304;
    for var S in TextElements(TestString) do
      Writeln(S);

And yes, it does not work as it should with complex emojis.

 

 

Interestingly, even VS Code does not handle 🤷🏽‍♀️ correctly.  Paste the symbol and then try to delete it with Backspace.  It takes pressing Backspace multiple times to actually delete this emoji.

Edited by pyscripter


And attached here is a text element enumerator that handles 🤷🏽‍♀️ correctly using ICU.

[screenshot of the enumerator's output]

 

Note though that the Edit Box and the List Box do not display 🤷🏽‍♀️ correctly (it is shown as two characters).  The content of the EditBox is åá̂̃̄🤷🏽‍♀️ (copied and pasted here).

EnumTextElements.pas

 

Edited by pyscripter

14 hours ago, David Heffernan said:

Why would you need to know this, unless you are rendering the text 

All manipulation of strings received from user input should consider each text element, not each individual Char. This is basically why the Delphi TEdit does not work properly with these kinds of emojis (just try adding one and pressing Backspace and you will see part of the problem). In short, when you don't consider this, you risk splitting a string in the middle of a grapheme, creating a malformed string.

 

As I said, this iteration is essential; Embarcadero should implement it natively, cross-platform, just like C# did.

 

The best thing to do at this point is to port the C# code.

1 hour ago, vfbb said:

The best thing to do at this point is to port the C# code.

Wrong! Actually, porting the C# code would be laborious and would require maintenance as Unicode changes.

 

The best solution at this point is to use the platform APIs:

 

Windows - IDWriteTextAnalyzer (DirectWrite)
Android - GraphemeCharsLength := JString.codePointAt(Index);
iOS - CFStringGetRangeOfComposedCharactersAtIndex

1 hour ago, vfbb said:

Windows - DWriteTextAnalyze
Android - GraphemeCharsLength := JString.codePointAt(Index);
iOS - CFStringGetRangeOfComposedCharactersAtIndex

It appears that ICU is the way to go.  ICU is now bundled in Windows, .NET 5 is based on ICU, and it is also used on Android (Unicode and internationalization support | Android Developers) and iOS (ICU usage in Swift - Development / Standard Library - Swift Forums).

3 hours ago, vfbb said:

All manipulations of strings received by user input should consider each element and not each individual character. This is basically why delphi TEdit does not work properly with these kind of emojis (just try adding one and pressing backspace that you will see part of problem). In short, when you don't consider this, you risk breaking a string into a position in the middle of a Grapheme, creating a malformed string.

 

As I said, this iteration is essential, Embarcadero should implement this natively to work cross-platform just like C # did.

 

The best thing to do at this point is to port the C # code.

I don't think this is true. TEdit is the Win32 EDIT control. What are you doing with strings that need to know what you are asking about? What's your usage scenario? 

26 minutes ago, David Heffernan said:

I don't think this is true. TEdit is the Win32 EDIT control. What are you doing with strings that need to know what you are asking about? What's your usage scenario? 

Mostly agree. The problem with TEdit is a Windows one, not a Delphi one.  Notepad and VS Code have similar issues.

 

If you are not rendering text, it probably does not matter.  I got interested in this because SynEdit does not handle complex Unicode characters well.  Other scenarios for using libraries such as ICU include proper sorting of multilingual text, proper change of capitalization, string normalization, etc.  (I mean better in corner cases than what Windows provides, and more compatible with the Unicode standards.)

 

 

Edited by pyscripter

Share this post


Link to post
8 hours ago, David Heffernan said:

What's your usage scenario? 

 

8 hours ago, pyscripter said:

Notepad and VSCode have similar issues.

There are really few apps on Windows, except for chat apps, that fully support Unicode (including MS apps), but in my case I really need to deal with this, because mine will be a cross-platform chat (Windows, Android, iOS).

 

Today the problem is TEdit, but it is not just that; it is about a safe way to manipulate any string received from the user. It is like using AnsiString/AnsiChar when the input is a Unicode string: you risk generating a malformed string.

 

 

To clarify the problem, here is an example with TEdit:

1) Select a TEdit and press Windows key + . to open the emoji window
2) Select, for example, the "man raising hand: light skin tone" emoji, which is represented by 7 Chars (14 bytes)

3) Proceed:

[screenshot of the TEdit problem]

 

 

Example string manipulation:

In string manipulation it is very common, for example, to take the first N characters, so suppose I want to take the first 8 Chars:

S := S.Substring(0, 8);

But if S is "🙋🏻‍♂️🙋🏻‍♂️", which surprisingly has 14 Chars, and which is the same as:

S := #55357 + #56907 + #55356 + #57339 + #8205 + #9794 + #65039 + #55357 + #56907 + #55356 + #57339 + #8205 + #9794 + #65039;

When I call Substring(0, 8), the result will be:

S := #55357 + #56907 + #55356 + #57339 + #8205 + #9794 + #65039 + #55357;

Which is represented by:

🙋🏻‍♂️

But Substring(1, 7) would be even worse in this case:

[screenshot of the Substring(1, 7) result]

 

 

The problem is very clear, and a modern application, especially a chat app, has to know how to deal with it.
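A grapheme-safe "take the first N elements" could be sketched like this (my own hypothetical FirstElements helper, assuming a grapheme-aware TextElements enumerator such as the ICU-based one posted earlier in this thread):

```pascal
// Truncate to the first Count text elements instead of the first
// Count Chars, so no grapheme is split in half.
function FirstElements(const S: string; Count: Integer): string;
var
  Element: string;
begin
  Result := '';
  for Element in TextElements(S) do
  begin
    if Count <= 0 then
      Break;
    Result := Result + Element;  // append one whole grapheme
    Dec(Count);
  end;
end;
```

With this, FirstElements('🙋🏻‍♂️🙋🏻‍♂️', 1) would return the first whole emoji (7 Chars) rather than a dangling surrogate.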

 


The official API in Windows for dealing with Unicode is Uniscribe. It allows you to decode Unicode-related information from a UTF-16 string.

 

Including the sequence of glyphs for rendering. But what will you do with them? Every glyph is denoted by a number without any meaningful value, and you can only pass this sequence to ExtTextOut.

 

Also, Uniscribe provides supporting attributes for Unicode codepoints, such as:

  • whether it is a word start (for Ctrl+Left/Right navigation)
  • whether it is a word end
  • whether it is a valid caret position (some codepoints are not)
  • whether the codepoint should be deleted as a group with neighboring codepoints
  • etc.

All these rules are complex and differ from each other. There is no simple way to move the editor caret and select text in a Unicode-enabled text editor.

 

I suggest the following old but great tutorial (an attempt to create an editor with Unicode support):

http://www.catch22.net/tuts/neatpad/introduction-uniscribe#

 

 

Edited by balabuev



 

Relying on OS facilities is simpler, but they depend on updates (the Unicode committee constantly adds more and more weird combos), so shipping your own ICU seems the most reliable way.


I don't fully agree. You still have to rely on OS text-rendering features, which themselves depend on ICU. So your own ICU may not be consistent with the OS's copy, which is used for rendering.

3 hours ago, balabuev said:

I don't fully agree. You still have to rely on OS text-rendering features, which themselves depend on ICU. So your own ICU may not be consistent with the OS's copy, which is used for rendering.

Not all applications need rendering; don't forget non-visual text-processing tasks. I guess these are more critical for supporting bleeding-edge standards (imagine someone urgently needing full-text search over many posts written in Quenya :))

Edited by Fr0sT.Brutal

