vfbb, posted April 2, 2021 (edited)

A Delphi string is an implementation of UTF-16 in which the "normal" graphic characters correspond to 1 Char (2 bytes), while other graphic characters are represented by 2 Chars (4 bytes), the so-called surrogate pair (like the emoji 🙏). In that case we need to check whether the Char IsSurrogate to know that it is a graphic character of 2 Chars (4 bytes).

But the problem is that UTF-16 is not limited to a maximum of 4 bytes for a graphic character. In fact, it already anticipates that a graphic character can take up to 14 bytes in future implementations, and today most emojis already occupy at least 8 bytes (like the emoji 🙏🏻, the light-skin version).

This is a problem, because Delphi only handles a graphic character as being either 2 bytes or 4 bytes, and often converts the graphic character or string to UCS4Char or UCS4String respectively, treating every character as 4 bytes.

So, how do you know the actual size of each graphic character in a string?

Edited April 2, 2021 by vfbb

Attila Kovacs, posted April 2, 2021 (edited)

It's 4 bytes only, and the modifier (to make it lighter) is also 4 bytes. For example, 🧑🏿🦽 is:

D8 3E DD D1 = 🧑
D8 3C DF FF = skin tone modifier
20 0D = zero width joiner
D8 3E DD BD = 🦽

And how do you find out that it's one emoji? I'm afraid you have to parse the codes and follow the rules defined in the RFC.

Edited April 2, 2021 by Attila Kovacs

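The byte breakdown in the post above can be checked with a little arithmetic. The following is only a minimal sketch, not something taken from the thread: `DecodeSurrogatePair` is a hypothetical helper name (not an RTL function) that applies the standard UTF-16 formula to turn a high/low surrogate pair into the codepoint it encodes.

```pascal
program SurrogateDecode;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of the RTL): combines a UTF-16 high/low
// surrogate pair into the Unicode codepoint it encodes.
function DecodeSurrogatePair(Hi, Lo: Word): Integer;
begin
  Result := $10000 + ((Hi - $D800) shl 10) + (Lo - $DC00);
end;

begin
  // The three surrogate pairs listed in the post above.
  Writeln(IntToHex(DecodeSurrogatePair($D83E, $DDD1), 5)); // prints 1F9D1
  Writeln(IntToHex(DecodeSurrogatePair($D83C, $DFFF), 5)); // prints 1F3FF
  Writeln(IntToHex(DecodeSurrogatePair($D83E, $DDBD), 5)); // prints 1F9BD
end.
```

The zero width joiner (20 0D) is a single code unit, U+200D, so it needs no decoding. The whole emoji is therefore 4 codepoints stored in 7 Chars (14 bytes).
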
vfbb, posted April 2, 2021

Embarcadero should create a class based on C#'s TextElementEnumerator, which does exactly this.

Attila Kovacs, posted April 2, 2021

TextElementEnumerator looks really lightweight, and GetUnicodeCategory is already in System.Character. Not sure which Unicode version is supported, though.

David Heffernan, posted April 2, 2021

Why would you need to know this, unless you are rendering the text?

pyscripter, posted April 3, 2021 (edited)

Here is an implementation of a TextElementEnumerator (not tested).

Note that in Windows CharNext is the best, but still not perfect, way to get text elements. See "What('s) a character!" | The Old New Thing (microsoft.com).

You should then be able to write code such as:

    for var Element in TextElements(MyString) do

The implementation:

    interface

    type
      TTextElementEnumerator = record
      private
        FStart: PWideChar;   // start of the next text element
        FCurrent: string;    // text element returned by Current
      public
        constructor Create(const AValue: string);
        function MoveNext: Boolean; inline;
        function GetCurrent: string; inline;
        property Current: string read GetCurrent;
      end;

      TTextElementEnumeratorHelper = record
      private
        FString: string;
      public
        constructor Create(const AValue: string);
        function GetEnumerator: TTextElementEnumerator;
      end;

    function TextElements(const AValue: string): TTextElementEnumeratorHelper;

    implementation

    {$REGION 'Text Element Enumerator'}

    { TTextElementEnumerator }

    constructor TTextElementEnumerator.Create(const AValue: string);
    begin
      FStart := PWideChar(AValue);
    end;

    function TTextElementEnumerator.GetCurrent: string;
    begin
      Result := FCurrent;
    end;

    function TTextElementEnumerator.MoveNext: Boolean;
    var
      FEnd: PWideChar;
    begin
      if FStart^ = #0 then
        Exit(False);
      // CharNext returns a pointer to the start of the next text element
      FEnd := Windows.CharNext(FStart);
      SetString(FCurrent, FStart, FEnd - FStart);
      FStart := FEnd;
      Result := True;
    end;

    { TTextElementEnumeratorHelper }

    constructor TTextElementEnumeratorHelper.Create(const AValue: string);
    begin
      FString := AValue;
    end;

    function TTextElementEnumeratorHelper.GetEnumerator: TTextElementEnumerator;
    begin
      Result.Create(FString);
    end;

    function TextElements(const AValue: string): TTextElementEnumeratorHelper;
    begin
      Result.Create(AValue);
    end;

    {$ENDREGION}

Edited April 3, 2021 by pyscripter

Stefan Glienke, posted April 3, 2021 (edited)

That does not work at all - CharNext does not work well. And that article is from 2007 - did we even have emojis back then? 😄

Try it yourself with the test case here: https://docs.microsoft.com/en-us/dotnet/core/compatibility/globalization/5.0/uax29-compliant-grapheme-enumeration

Edited April 3, 2021 by Stefan Glienke

Remy Lebeau, posted April 3, 2021 (edited)

7 hours ago, vfbb said:
> A Delphi string is an implementation of UTF-16 in which the "normal" graphic characters correspond to 1 Char (2 bytes), while other graphic characters are represented by 2 Chars (4 bytes), the so-called surrogate pair

There is no such thing as a "normal graphic character" in Unicode. What you are thinking of as a "character" is officially referred to as a "grapheme", which consists of 1 or more Unicode codepoints linked together to make up 1 human-readable glyph. Individual Unicode codepoints are encoded as 1 or 2 codeunits in UTF-16, which is what each 2-byte Char represents. When a codepoint is encoded into 2 UTF-16 codeunits, that is also known as a "surrogate pair".

> (like the emoji 🙏)

That emoji is 1 Unicode codepoint:

U+1F64F (Folded Hands)

Which is encoded as 2 codeunits in UTF-16:

D83D DE4F

> In that case we need to check whether the Char IsSurrogate to know that it is a graphic character of 2 Chars (4 bytes).

That will only allow you to determine the value of the 1st Unicode codepoint in the grapheme. But then you need to look at and decode subsequent codepoints to determine if they "combine" in sequence with that 1st codepoint.

> But the problem is that UTF-16 is not limited to a maximum of 4 bytes for a graphic character. In fact, it already anticipates that a graphic character can take up to 14 bytes in future implementations, and today most emojis already occupy at least 8 bytes (like the emoji 🙏🏻, the light-skin version).

That emoji takes up 8 bytes, consisting of 4 UTF-16 codeunits:

D83D DE4F D83C DFFB

Which decode as 2 Unicode codepoints:

U+1F64F (Folded Hands)
U+1F3FB (Emoji Modifier Type-1-2)

And it gets even more complicated, especially for Emoji, because 1) modern Emoji support skin tones and genders, which are handled using 1+ modifier codepoints, and 2) multiple unrelated Emoji can be grouped together with codepoint U+200D to create even more new Emoji.
For example:

U+1F469 (Woman)
U+1F3FD (Emoji Modifier Type-4)
U+200D (Zero Width Joiner)
U+1F4BB (Personal Computer)

Which is treated as 1 single Emoji of a medium-skinned woman sitting behind a PC.

Or, how about:

U+1F468 (Man)
U+200D (ZWJ)
U+1F469 (Woman)
U+200D (ZWJ)
U+1F467 (Girl)
U+200D (ZWJ)
U+1F466 (Boy)

Which is treated as 1 single Emoji of a Family with a dad, mom, and 2 children.

See https://eng.getwisdom.io/emoji-modifiers-and-sequence-combinations/ for more details.

> This is a problem, because Delphi only handles a graphic character as being either 2 bytes or 4 bytes, and often converts the graphic character or string to UCS4Char or UCS4String respectively, treating every character as 4 bytes.

Delphi only concerns itself with the encoding/decoding of UTF-16 itself (especially when converting that data to other encodings, like ANSI, UTF-8, etc). Delphi does not care what the UTF-16 data represents. Graphemes are handled only by application code that needs to do text processing, glyph rendering, etc. Those things are outside of Delphi's scope as a general programming language. Most of the time, you should just let the OS deal with them, unless you are writing your own engines that need to be grapheme-aware.

> So, how do you know the actual size of each graphic character in a string?

By using a Unicode library that understands the rules of graphemes, Emojis, etc. Delphi has no such library built-in, but there are several 3rd party Unicode libraries that do understand those things.

Edited April 3, 2021 by Remy Lebeau

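To illustrate the codeunit-to-codepoint step described in the post above, here is a minimal sketch (not from the thread). `DumpCodepoints` is a hypothetical helper name. It relies only on the `IsHighSurrogate` / `IsLowSurrogate` helpers from System.Character and does the pair arithmetic by hand. It stops at codepoints and knows nothing about graphemes.

```pascal
program CodepointDump;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

// Hypothetical helper: lists the codepoints in S as 'U+XXXX' tokens,
// joining each valid surrogate pair into a single codepoint.
function DumpCodepoints(const S: string): string;
var
  I, CP: Integer;
begin
  Result := '';
  I := Low(S);
  while I <= High(S) do
  begin
    if S[I].IsHighSurrogate and (I < High(S)) and S[I + 1].IsLowSurrogate then
    begin
      CP := $10000 + ((Ord(S[I]) - $D800) shl 10) + (Ord(S[I + 1]) - $DC00);
      Inc(I, 2);
    end
    else
    begin
      CP := Ord(S[I]); // BMP codepoint (an unpaired surrogate is passed through)
      Inc(I);
    end;
    if Result <> '' then
      Result := Result + ' ';
    Result := Result + 'U+' + IntToHex(CP, 4);
  end;
end;

begin
  // Folded hands + light skin tone: 4 codeunits, 2 codepoints, 1 grapheme.
  Writeln(DumpCodepoints(#$D83D#$DE4F#$D83C#$DFFB)); // prints U+1F64F U+1F3FB
end.
```

Deciding that those two codepoints form one grapheme is the part that needs the Unicode segmentation rules, which is why a library is still required for that step.
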
Remy Lebeau, posted April 3, 2021 (edited)

22 minutes ago, Stefan Glienke said:
> That does not work at all - CharNext does not work well. And that article is from 2007 - did we even have emojis back then? 😄

Not really, but we did have multi-byte and multi-codepoint character sequences, which MSDN claims CharNextA() and CharNextW() do handle. But not well enough, in this case.

> Try it yourself with the test case here: https://docs.microsoft.com/en-us/dotnet/core/compatibility/globalization/5.0/uax29-compliant-grapheme-enumeration

As that article describes, you need .NET 5, which was just released 4 1/2 months ago, to handle grapheme clusters correctly in things like enumeration, etc.

Edited April 3, 2021 by Remy Lebeau

pyscripter, posted April 3, 2021 (edited)

CharNext does handle surrogate pairs, diacritics and other multi-codepoint sequences reasonably well. Clearly it does not handle emojis well. Nor does any other Windows function that I know of. But I was mostly trying to show that doing the text enumeration is the easy part.

> CharNext works with default "user" expectations of characters when dealing with diacritics. For example: A string that contains U+0061 U+030a "LATIN SMALL LETTER A" + "COMBINING RING ABOVE", which looks like "å", will advance two code points, not one. A string that contains U+0061 U+0301 U+0302 U+0303 U+0304, which looks like "a´^~¯", will advance five code points, not one, and so on.

The .NET 5 Unicode support is based on the ICU library. See Globalization and ICU | Microsoft Docs for details.

There is an ICU Delphi wrapper, but it has not been updated for years.

ICU has been included in Windows since the Creators Update. The DLL names and the header files are listed here. It would be nice to have translations of the headers to Pascal and, even better, some higher-level wrappers.

Edited April 3, 2021 by pyscripter

Stefan Glienke, posted April 3, 2021

1 hour ago, Remy Lebeau said:
> Not really, but we did have multi-byte and multi-codepoint character sequences, which MSDN claims CharNextA() and CharNextW() do handle. But not well enough, in this case. As that article describes, you need .NET 5, which was just released 4 1/2 months ago, to handle grapheme clusters correctly in things like enumeration, etc.

But before, it did not run in an endless loop like the code from @pyscripter does 🤷♂️

pyscripter, posted April 3, 2021 (edited)

37 minutes ago, Stefan Glienke said:
> But before, it did not run in an endless loop

I did say I did not test it. The one missing line (FStart := FEnd;) has now been added to the code.

It works as expected in:

    var TestString := 'å' + #$0061#$0301#$0302#$0303#$0304;
    for var S in TextElements(TestString) do
      Writeln(S);

And yes, it does not work as it should with complex emojis.

Interestingly, even VS Code does not handle 🤷🏽♀️ correctly. Paste the symbol and then try to delete it with Backspace. It takes several presses of Backspace to actually delete this emoji.

Edited April 3, 2021 by pyscripter

pyscripter, posted April 3, 2021 (edited)

And attached here is a text element enumerator that handles 🤷🏽♀️ correctly using ICU. Note though that the Edit Box and the List Box do not display 🤷🏽♀️ correctly (it is shown as two characters). The content of the Edit Box is åá̂̃̄🤷🏽♀️ (copied and pasted here).

Attachment: EnumTextElements.pas

Edited April 3, 2021 by pyscripter

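The attached unit is not reproduced in the thread, so the following is only a rough sketch of how an ICU-based splitter might look, not pyscripter's actual code. It assumes the combined `icu.dll` that ships with recent Windows 10 builds, that the DLL exports the ICU C API (`ubrk_open`, `ubrk_first`, `ubrk_next`, `ubrk_close` from ubrk.h) under unversioned names, and that those functions use the cdecl convention. The imports are hand-written here and `SplitTextElements` is a hypothetical name.

```pascal
program IcuGraphemes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  IcuDll = 'icu.dll';  // assumption: combined ICU DLL of recent Windows 10 builds
  UBRK_CHARACTER = 0;  // UBreakIteratorType: grapheme cluster boundaries
  UBRK_DONE = -1;      // returned by ubrk_next when no boundary is left

// Hand-written imports of the ICU C break iterator API (ubrk.h).
// Assumption: unversioned export names and cdecl calling convention.
function ubrk_open(AType: Integer; Locale: PAnsiChar; Text: PWideChar;
  TextLength: Integer; var Status: Integer): Pointer; cdecl;
  external IcuDll delayed;
function ubrk_first(BI: Pointer): Integer; cdecl; external IcuDll delayed;
function ubrk_next(BI: Pointer): Integer; cdecl; external IcuDll delayed;
procedure ubrk_close(BI: Pointer); cdecl; external IcuDll delayed;

// Hypothetical helper: splits S into text elements (grapheme clusters).
function SplitTextElements(const S: string): TArray<string>;
var
  BI: Pointer;
  Status, Start, Stop: Integer;
begin
  Result := nil;
  Status := 0; // U_ZERO_ERROR
  BI := ubrk_open(UBRK_CHARACTER, nil, PWideChar(S), Length(S), Status);
  if Status > 0 then // positive UErrorCode values indicate failure
    raise Exception.CreateFmt('ubrk_open failed with UErrorCode %d', [Status]);
  try
    Start := ubrk_first(BI); // boundaries are 0-based codeunit offsets
    Stop := ubrk_next(BI);
    while Stop <> UBRK_DONE do
    begin
      Result := Result + [Copy(S, Start + 1, Stop - Start)];
      Start := Stop;
      Stop := ubrk_next(BI);
    end;
  finally
    ubrk_close(BI);
  end;
end;

var
  Element: string;
begin
  // 'a' + COMBINING RING ABOVE should come back as one 2-Char element,
  // followed by 'b' and 'c' as 1-Char elements.
  for Element in SplitTextElements('a'#$030A'bc') do
    Writeln(Length(Element));
end.
```

The `delayed` directive is used so that the program still starts on a system without `icu.dll`; the first call into ICU would then fail instead. How well emoji sequences are segmented depends on the ICU version bundled with the OS.
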
Vandrovnik, posted April 3, 2021

DirectWrite has some methods that could be useful: https://docs.microsoft.com/en-us/windows/win32/api/dwrite/nn-dwrite-idwritetextanalyzer

It is able to display, for example, 🧑🏿🦽 correctly.

vfbb, posted April 3, 2021

14 hours ago, David Heffernan said:
> Why would you need to know this, unless you are rendering the text?

All manipulations of strings received from user input should consider each text element and not each individual Char. This is basically why Delphi's TEdit does not work properly with this kind of emoji (just try adding one and pressing Backspace and you will see part of the problem). In short, when you don't take this into account, you risk breaking a string at a position in the middle of a grapheme, creating a malformed string.

As I said, this iteration is essential. Embarcadero should implement it natively so that it works cross-platform, just like C# did. The best thing to do at this point is to port the C# code.

vfbb, posted April 3, 2021

1 hour ago, vfbb said:
> The best thing to do at this point is to port the C# code.

Wrong! Actually, porting the C# code would be laborious and would require maintenance as Unicode changes. The best solution at this point is to use the platform APIs:

Windows - DWriteTextAnalyze
Android - GraphemeCharsLength := JString.codePointAt(Index);
iOS - CFStringGetRangeOfComposedCharactersAtIndex

pyscripter, posted April 3, 2021

1 hour ago, vfbb said:
> Windows - DWriteTextAnalyze
> Android - GraphemeCharsLength := JString.codePointAt(Index);
> iOS - CFStringGetRangeOfComposedCharactersAtInd

It appears that ICU is the way to go. ICU is now bundled in Windows, .NET 5 is based on ICU, and so are Android (Unicode and internationalization support | Android Developers) and iOS (ICU usage in Swift - Development / Standard Library - Swift Forums).

David Heffernan, posted April 3, 2021

3 hours ago, vfbb said:
> All manipulations of strings received from user input should consider each text element and not each individual Char. This is basically why Delphi's TEdit does not work properly with this kind of emoji (just try adding one and pressing Backspace and you will see part of the problem). In short, when you don't take this into account, you risk breaking a string at a position in the middle of a grapheme, creating a malformed string. As I said, this iteration is essential. Embarcadero should implement it natively so that it works cross-platform, just like C# did. The best thing to do at this point is to port the C# code.

I don't think this is true. TEdit is the Win32 EDIT control.

What are you doing with strings that requires you to know what you are asking about? What's your usage scenario?

pyscripter, posted April 3, 2021 (edited)

26 minutes ago, David Heffernan said:
> I don't think this is true. TEdit is the Win32 EDIT control. What are you doing with strings that requires you to know what you are asking about? What's your usage scenario?

Mostly agree. The problem with TEdit is a Windows one, not a Delphi one. Notepad and VS Code have similar issues. If you are not rendering text, it probably does not matter.

I got interested in this because SynEdit does not handle complex Unicode characters well.

Other scenarios for using libraries such as ICU include proper sorting of multilingual text, proper change of capitalization, string normalization, etc. (I mean handling corner cases better than Windows does, and in a way that is more compatible with the Unicode standards.)

Edited April 3, 2021 by pyscripter

vfbb, posted April 4, 2021

8 hours ago, David Heffernan said:
> What's your usage scenario?

8 hours ago, pyscripter said:
> Notepad and VS Code have similar issues.

There are really few apps on Windows, apart from chat apps, that fully support Unicode (and that includes Microsoft's apps), but in my case I really need to deal with this because the app is precisely a cross-platform chat (Win, Android, iOS).

Today's problem is fixing TEdit, but it is not just that: it is about a safe way to manipulate any string received from the user. It is the same as using AnsiString/AnsiChar when the input is string (Unicode): you risk generating an unexpected string.

To clarify the problem, here is an example with TEdit:

1) Select a TEdit and press Windows key + . to open the emoji window
2) Select, for example, the "White man raising his hand" (which is represented by 7 Chars, that is, 14 bytes)
3) Proceed:

Example string manipulation:

In string manipulation it is very common, for example, to take the first x characters. Suppose I want to take the first 8 Chars:

    S := S.Substring(0, 8);

But if S is "🙋🏻♂️🙋🏻♂️", which incredibly has 14 Chars, and which is the same as:

    S := #55357 + #56907 + #55356 + #57339 + #8205 + #9794 + #65039 + #55357 + #56907 + #55356 + #57339 + #8205 + #9794 + #65039;

then Substring(0, 8) returns:

    S := #55357 + #56907 + #55356 + #57339 + #8205 + #9794 + #65039 + #55357;

Which is displayed as: 🙋🏻♂️�

The trailing #55357 is a high surrogate left without its low surrogate, hence the replacement character. But a Substring(1, 7) would be even worse in this case:

The problem is very clear, and a modern application, especially a chat application, has to know how to deal with it.

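As a partial guard against the malformed result shown in the post above, one option is to back off by one Char when a cut would leave a dangling high surrogate. This is only a sketch with a hypothetical helper name (`SafeLeft`), not something proposed in the thread, and it protects surrogate pairs only: a cut can still land in the middle of a grapheme, for example between the emoji and its skin tone modifier.

```pascal
program SafeCut;

{$APPTYPE CONSOLE}

uses
  System.Character;

// Hypothetical helper: returns the first Count Chars of S, shortened by one
// Char if the cut would otherwise end on a dangling high surrogate.
function SafeLeft(const S: string; Count: Integer): string;
begin
  Result := Copy(S, 1, Count);
  // Only back off when text was actually cut off and the last Char kept
  // is the first half of a surrogate pair.
  if (Result <> '') and (Length(Result) < Length(S)) and
     Result[High(Result)].IsHighSurrogate then
    SetLength(Result, Length(Result) - 1);
end;

const
  // The 7-Char emoji from the post above.
  Emoji = #55357#56907#55356#57339#8205#9794#65039;

begin
  Writeln(Length(Copy(Emoji + Emoji, 1, 8)));  // prints 8 (ends on a lone #55357)
  Writeln(Length(SafeLeft(Emoji + Emoji, 8))); // prints 7 (first emoji only)
end.
```

Here the result happens to end on a grapheme boundary, but `SafeLeft(Emoji + Emoji, 4)` would still return a well-formed but visually different emoji, so cutting on real text element boundaries still needs a grapheme-aware API such as ICU.
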
David Heffernan, posted April 4, 2021

Selecting text in an edit control is for sure a use case where this is needed.

balabuev, posted April 4, 2021 (edited)

The official Windows API for dealing with Unicode is Uniscribe. It lets you decode Unicode-related information from a UTF-16 string, including the sequence of glyphs for rendering. But what will you do with them? Every glyph is denoted by a number without any meaningful value, and you can only pass this sequence to ExtTextOut.

Uniscribe also provides supporting attributes for Unicode codepoints, such as:

- whether it's a word start (for Ctrl+Left/Right navigation)
- whether it's a word end
- whether it's a valid caret position (some codepoints are not valid caret positions)
- whether the codepoint should be deleted as a group with neighboring codepoints
- etc.

All the rules are complex and differ from each other. There is no simple way to move the editor caret and select text in a Unicode-enabled text editor. I suggest the following old and great tutorial (an attempt to create an editor with Unicode support):

http://www.catch22.net/tuts/neatpad/introduction-uniscribe#

Edited April 4, 2021 by balabuev

Fr0sT.Brutal, posted April 5, 2021

Relying on OS facilities is simpler, but they depend on OS updates (the Unicode committee constantly adds more and more weird combos), so shipping your own ICU seems the most reliable way.

balabuev, posted April 5, 2021

I don't fully agree. You still have to rely on the OS text rendering features, which themselves depend on ICU. So your own ICU may not be consistent with the OS's one, which is used for rendering.

Fr0sT.Brutal, posted April 5, 2021 (edited)

3 hours ago, balabuev said:
> I don't fully agree. You still have to rely on the OS text rendering features, which themselves depend on ICU. So your own ICU may not be consistent with the OS's one, which is used for rendering.

Not all applications need rendering; don't forget non-visual text processing tasks. I guess for these it is more critical to support bleeding-edge standards (imagine someone urgently needing to run a full-text search over many posts written in Quenya :))

Edited April 5, 2021 by Fr0sT.Brutal