A.M. Hoornweg 144 Posted September 24, 2020

Hello all, could it be that function UCS4StringToWideString misses the final character during the conversion?

    VAR
      s: UCS4String;
      W: String;
    begin
      s := [220,98,101,114,109,228,223,105,103];
      w := UCS4StringToWideString(s);
      showmessage(w);
    end;
Guest Posted September 24, 2020

UCS4 is UTF32, this standard is fixed at 4 bytes per character. Now you can see the problem, your data length is 9.

From https://unicode.org/faq/utf_bom.html

    Name                        UTF-8    UTF-16   UTF-16BE     UTF-16LE        UTF-32   UTF-32BE     UTF-32LE
    Smallest code point         0000     0000     0000         0000            0000     0000         0000
    Largest code point          10FFFF   10FFFF   10FFFF       10FFFF          10FFFF   10FFFF       10FFFF
    Code unit size              8 bits   16 bits  16 bits      16 bits         32 bits  32 bits      32 bits
    Byte order                  N/A      <BOM>    big-endian   little-endian   <BOM>    big-endian   little-endian
    Fewest bytes per character  1        2        2            2               4        4            4
    Most bytes per character    4        4        4            4               4        4            4
Anders Melander 1782 Posted September 24, 2020

1 hour ago, A.M. Hoornweg said:
    could it be that function UCS4StringToWideString misses the final character during the conversion?

Yes it does:

    function UCS4StringToWideString(const S: UCS4String): _WideStr;
    var
      I: Integer;
      CharCount: Integer;
    begin
      SetLength(Result, Length(S) * 2 - 1); //Maximum possible number of characters
      CharCount := 0;
      I := 0;
      while I < Length(S) - 1 do
      begin
        if S[I] >= $10000 then
        begin
          Inc(CharCount);
          Result[CharCount] := WideChar((((S[I] - $00010000) shr 10) and $000003FF) or $D800);
          Inc(CharCount);
          Result[CharCount] := WideChar(((S[I] - $00010000) and $000003FF) or $DC00);
        end
        else
        begin
          Inc(CharCount);
          Result[CharCount] := WideChar(S[I]);
        end;
        Inc(I);
      end;
      SetLength(Result, CharCount);
    end;

The bug is the Length(S) - 1 in the while condition (or the < instead of <=). The first SetLength also seems strange, and I can't see why it should do a -1 there.
Anders Melander 1782 Posted September 24, 2020

12 minutes ago, Kas Ob. said:
    your data length is 9

Yes, but each element is a UCS4Char (which is 4 bytes), so the size of the string (it's an array, actually) is 9 * SizeOf(UCS4Char).
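The arithmetic in this reply can be checked with a small console sketch. The program name is made up for illustration, and the array literal assumes a compiler that accepts dynamic array literals, as the code in the original question does.

```pascal
program UCS4Size;
{$APPTYPE CONSOLE}

var
  S: UCS4String;
begin
  // Same nine elements as in the original question.
  S := [220, 98, 101, 114, 109, 228, 223, 105, 103];

  Writeln(Length(S));                    // number of array elements: 9
  Writeln(SizeOf(UCS4Char));             // bytes per element: 4
  Writeln(Length(S) * SizeOf(UCS4Char)); // payload size in bytes: 36
end.
```

Length counts elements, not bytes, so an element count of 9 says nothing about whether the data is well-formed UTF-32.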
A.M. Hoornweg 144 Posted September 24, 2020 (edited)

OK, I'll post a QC then.

Edit: https://quality.embarcadero.com/browse/RSP-31114
Lars Fosdal 1792 Posted September 24, 2020

@A.M. Hoornweg Being a bit pedantic now, but you would save QA and the developers quite a bit of work if your example were a complete console program.
Marco Cantu 78 Posted September 24, 2020

Looking into it. I have the impression the original design assumes a UCS4String to have a null terminator.

    s := [220,98,101,114,109,228,223,105,103,0]; // "übermäßig"

Not only does everything work if you add 0 to the array, but if you call UnicodeStringToUCS4String this is exactly what you get (an array plus the #0):

    s := UnicodeStringToUCS4String('übermäßig'); // [220,98,101,114,109,228,223,105,103,0]
    ShowMessage(Length(s).ToString);

Honestly, I don't see us putting a lot of effort in UCS4String
Darian Miller 361 Posted September 24, 2020

28 minutes ago, Marco Cantu said:
    Honestly, I don't see us putting a lot of effort in UCS4String

While it may be an honest statement, it comes across horribly... (maybe I'm just having a bad online day with all the garbage I've seen this morning?) But at least put in the effort to document the expected behavior.
Remy Lebeau 1393 Posted September 24, 2020

4 hours ago, Anders Melander said:
    The bug is the Length(s)-1 in the while condition (or the < instead of <=).

That is not a bug, actually. That is intentional behavior. UCS4String is not a native RTL string type, like (Ansi|Unicode|UTF8|Wide)String are. It is just a plain ordinary dynamic array of UCS4Char:

    type
      UCS4Char = type LongWord;
      UCS4String = array of UCS4Char;

So, for compatibility with APIs that take C-style strings, a UCS4String has an explicit #0 element at the end, so that PUCS4Char() typecasts are null-terminated the same way that P(Ansi|Wide)Char() typecasts on other string types are. As such, functions that output a UCS4String are required to allocate +1 for the null terminator, and functions that take a UCS4String as input are required to ignore the last element, expecting it to be the null terminator.

4 hours ago, Anders Melander said:
    The first SetLength also seems strange. And can't see why it should do a -1 there.

The code is pre-allocating the output WideString to the maximum number of UTF-16 code units that MAY be produced, which is the total number of UTF-32 code points minus the null terminator, multiplied by 2 (in case EVERY code point needs a UTF-16 surrogate pair). Then the code fills the WideString with the actual UTF-16 code units, and finally resizes the memory to the actual number of WideChars produced, not including a null terminator.
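The contract described here can be tried in a complete console program, along the lines Lars Fosdal asked for. This is only a sketch: the program and variable names are invented, and the expected values in the comments follow from the RTL source quoted earlier in the thread and from Marco Cantu's observation about UnicodeStringToUCS4String. The test word is spelled with four-digit character escapes so the result does not depend on the source file's encoding.

```pascal
program UCS4Terminator;
{$APPTYPE CONSOLE}

var
  NoTerm, WithTerm, FromRtl: UCS4String;
begin
  // Nine code points and no trailing 0: the conversion treats the last
  // element as the terminator and skips it.
  NoTerm := [220, 98, 101, 114, 109, 228, 223, 105, 103];
  Writeln(Length(UCS4StringToWideString(NoTerm)));   // 8

  // Same data with an explicit #0 element at the end.
  WithTerm := [220, 98, 101, 114, 109, 228, 223, 105, 103, 0];
  Writeln(Length(UCS4StringToWideString(WithTerm))); // 9

  // The RTL appends the terminator itself when it produces a UCS4String.
  FromRtl := UnicodeStringToUCS4String(
    #$00FC + 'berm' + #$00E4 + #$00DF + 'ig');
  Writeln(Length(FromRtl));                          // 10
end.
```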
Remy Lebeau 1393 Posted September 24, 2020

45 minutes ago, Marco Cantu said:
    Looking into it. I have the impression the original design assumes a UCS4String to have a null terminator.

Yes, exactly. UCS4String predates Delphi 2009; it was first introduced in Delphi 6 alongside UTF8String, which at the time was just an alias for AnsiString. In Delphi 2009, when AnsiString became codepage-aware and UTF8String became a true UTF-8 string type, UCS4String did not become a true UCS-4 string type. It remained a plain dynamic array. Since the D2009+ version of the StrRec record supports characters of varying byte sizes, I've asked for UCS4String to be changed into a true string type using 4-byte chars, backed by the same RTL logic that handles AnsiString(N) and UnicodeString, but that never happened.
Anders Melander 1782 Posted September 24, 2020

2 hours ago, Remy Lebeau said:
    UCS4String is not a native RTL string type, like (Ansi|Unicode|UTF8|Wide)String are. It is just a plain ordinary dynamic array of UCS4Char

There's no need to state the obvious.

While the function may behave as designed, it's at the very least poorly documented. I looked at both the documentation and the implementation before I came to the conclusion that there was a bug.

I'm fine with the implementation staying the way it is (to avoid breaking existing code), but it should be documented that the array is assumed to be zero terminated. Actually, since the array doesn't need to be zero terminated (we know the length of the "string" already from the array), I can't see why the zero termination couldn't be made optional.

Either way: fix the documentation.
Remy Lebeau 1393 Posted September 24, 2020

1 hour ago, Anders Melander said:
    There's no need to state the obvious.

Not obvious to other people who are not familiar with UCS4String and how it works. It is not a common type most people work with.

1 hour ago, Anders Melander said:
    While the function may behave as designed, it's at the very least poorly documented.

Agreed. Just like a lot of things in the documentation (or lack of).

1 hour ago, Anders Melander said:
    Actually since the array doesn't need to be zero terminated (we know the length of the "string" already from the array)

Sure, for purposes of just iterating and accessing characters, such as during conversions.

1 hour ago, Anders Melander said:
    I can't see why the zero termination couldn't be made optional.

I stated earlier why it exists at all: mainly so PUCS4Char() typecasts will be null-terminated. That allows the array to act more like a C-style string, just like native string types do.

1 hour ago, Anders Melander said:
    Either way: fix the documentation.

Of course, obviously.
pyscripter 689 Posted September 24, 2020 (edited)

This was all covered in
pyscripter 689 Posted September 24, 2020 (edited)

3 hours ago, Anders Melander said:
    Actually since the array doesn't need to be zero terminated

It needs to be when you interoperate with libraries in which the default character encoding is UCS4 (e.g. the Python API on Linux, https://www.python.org/dev/peps/pep-0513/).
pyscripter 689 Posted September 24, 2020 (edited)

7 hours ago, Marco Cantu said:
    Honestly, I don't see us putting a lot of effort in UCS4String

I suppose this is understandable. The Windows world is on UTF-16 and the rest on UTF-8. UCS4 (UTF-32) is very rarely used. With the benefit of hindsight, UTF-8 would probably have been a better choice for Windows. Now Microsoft is trying to bring UTF-8 back into Windows via the A routines and by offering to make UTF-8 the default character encoding.
A.M. Hoornweg 144 Posted September 25, 2020

6 hours ago, pyscripter said:
    I suppose this is understandable. The Windows world is on UTF-16 and the rest on UTF-8. UCS4 (UTF-32) is very rarely used. With the benefit of hindsight, UTF-8 would probably have been a better choice for Windows. Now Microsoft is trying to bring UTF-8 back into Windows via the A routines and by offering to make UTF-8 the default character encoding.

Most programmers ignore that UTF-16 is a variable-length encoding and treat it like UCS-2, falsely assuming that one WideChar corresponds to one code point. While most of us probably don't handle exotic languages, this is the 21st century and sooner or later you'll stumble upon strings that contain emoticons which just won't fit in one WideChar (https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)).

Delphi could use some better support for that, like iterators etc. Something like this:

    Function ReverseString(S: String): String;
    VAR
      c: UCS4Char;
    Begin
      Result := '';
      FOR c in s do
        result := c + result;
    End;

And if you think this is far-fetched, just look at how elegantly Free Pascal solves this: https://wiki.freepascal.org/for-in_loop#Traversing_UTF-8_strings
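Until such an iterator exists, something close to the wished-for ReverseString can be approximated with the existing RTL conversions. The following is a minimal sketch, not an RTL feature: ReverseByCodePoint is a hypothetical helper name, and it relies on UnicodeStringToUCS4String appending a trailing #0 element, as described earlier in the thread.

```pascal
program ReverseCodePoints;
{$APPTYPE CONSOLE}

// Hypothetical helper: reverses a string code point by code point,
// so surrogate pairs stay intact.
function ReverseByCodePoint(const S: string): string;
var
  U: UCS4String;
  I, J: Integer;
  Tmp: UCS4Char;
begin
  U := UnicodeStringToUCS4String(S); // code points plus a trailing #0
  I := 0;
  J := Length(U) - 2;                // last real code point, terminator skipped
  while I < J do
  begin
    Tmp := U[I];
    U[I] := U[J];
    U[J] := Tmp;
    Inc(I);
    Dec(J);
  end;
  Result := UCS4StringToUnicodeString(U);
end;

var
  R: string;
begin
  // 'ab' followed by U+1F4BC, which needs a surrogate pair in UTF-16.
  R := ReverseByCodePoint('ab' + #$D83D#$DCBC);
  Writeln(Length(R));               // 4 UTF-16 code units
  Writeln(R = #$D83D#$DCBC + 'ba'); // TRUE
end.
```

This reverses code points only. A base letter followed by combining marks would still be pulled apart, so it is not a reversal of user-perceived characters.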
Anders Melander 1782 Posted September 25, 2020

8 hours ago, pyscripter said:
    It needs to be when you interoperate with libraries in which the default character encoding is UCS4

I meant the function doesn't need it to be zero terminated (it has the length already).
pyscripter 689 Posted September 25, 2020

1 hour ago, Anders Melander said:
    I meant the function doesn't need it to be zero terminated (it has the length already).

Why should any Delphi program use UCS4? The main use case is interoperability. You would need to pass the result to some external function. Having to add the null terminator manually would be inconvenient.
pyscripter 689 Posted September 25, 2020 (edited)

3 hours ago, A.M. Hoornweg said:
    Most programmers ignore that UTF-16 is a variable-length encoding and treat it like UCS-2, falsely assuming that one widechar corresponds to one codepoint. While most of us probably don't handle exotic languages, this is the 21st century and sooner or later you'll stumble upon strings that contain Emoticons which just won't fit in one widechar ( https://en.wikipedia.org/wiki/Emoticons_(Unicode_block) .

You don't have to use UCS4 for that, and using UCS4 would not solve this problem. There are mainly two issues:

1) Surrogate pairs (two WideChars correspond to one glyph). UCS4 would help with this one.
   UTF-16 encoding: 0xD83D 0xDCBC "💼"

2) Combining characters (more than one WideChar shown as one glyph). UCS4 would not help with this one.
   Åström ḱṷṓn
   Precomposed vs. decomposed:
   ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
   ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

Windows provides CharNext/CharPrev, which deal with both issues, but not perfectly. You have to use Uniscribe or DirectWrite for greater accuracy.

In SynEdit there is this function:

    function SynCharNext(P: PWideChar; out Element: String): PWideChar; overload;
    var
      Start: PWideChar;
    begin
      Start := P;
      Result := Windows.CharNext(P);
      SetString(Element, Start, Result - Start);
    end;

It is very easy to write an enumerator that works with CharNext.
Anders Melander 1782 Posted September 25, 2020

10 minutes ago, pyscripter said:
    Why should any Delphi program use UCS4? The main use case is for inter-operability. You would need to pass the result to some external function. Having to manually add the null-terminator, would be inconvenient.

Remember that we're talking about UCS4StringToWideString and not WideStringToUCS4String, but I'll try again:

Yes, it's a given that the resulting string, which is a 2-byte WideString, will be zero terminated, as all Delphi long strings are.

If the input was a PUCS4Char (a pointer to a zero-terminated 4-byte string) then the input would have to be zero terminated. But it isn't. The input is an array in which the length is implicit. This means that the zero termination requirement is superfluous.

The function could be implemented so that it handled both arrays with and arrays without a zero in the final entry. You know: defensive coding. Just in case someone mistakenly passed an array that wasn't zero terminated because the documentation didn't state that you had to...
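A defensive wrapper of the kind suggested here might look like the sketch below. UCS4ToStringLenient is a hypothetical name, not an RTL function, and the approach (append a terminator when the last element is not 0, then call the RTL routine) is just one possible choice.

```pascal
program LenientUCS4;
{$APPTYPE CONSOLE}

// Hypothetical wrapper: accepts arrays with or without a trailing #0.
function UCS4ToStringLenient(const S: UCS4String): string;
var
  Tmp: UCS4String;
begin
  if (Length(S) > 0) and (S[High(S)] = 0) then
    // Already terminated: the RTL routine skips the final element.
    Result := UCS4StringToWideString(S)
  else
  begin
    // Not terminated: work on a copy and add the element the RTL expects.
    Tmp := Copy(S);
    SetLength(Tmp, Length(Tmp) + 1); // new dynamic array elements are zero
    Result := UCS4StringToWideString(Tmp);
  end;
end;

var
  A, B: UCS4String;
begin
  A := [220, 98, 101, 114, 109, 228, 223, 105, 103];    // no terminator
  B := [220, 98, 101, 114, 109, 228, 223, 105, 103, 0]; // with terminator
  Writeln(Length(UCS4ToStringLenient(A))); // 9
  Writeln(Length(UCS4ToStringLenient(B))); // 9
end.
```

The cost of this leniency is an ambiguity: an unterminated array whose last real code point is U+0000 cannot be told apart from a terminated one, so that final element would be dropped.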
pyscripter 689 Posted September 25, 2020 (edited)

36 minutes ago, Anders Melander said:
    Remember that we're talking about UCS4StringToWideString and not WideStringToUCS4String

The issue is the in-memory representation of UCS4. Should it have the redundant (obviously, no need to make the same argument multiple times) #0 or not? And I argued that for interoperability (passing a pointer to external functions expecting null-terminated UCS4) it is better that it always includes it, despite the redundancy. And this is what the RTL assumes and does.