UCS4StringToWideString broken?

A.M. Hoornweg · September 24, 2020

Hello all,

could it be that function UCS4StringToWideString misses the final character during the conversion?

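var
  S: UCS4String;
  W: string;
begin
  // Code points for "Übermäßig", without a trailing 0 element
  S := [220, 98, 101, 114, 109, 228, 223, 105, 103];
  W := UCS4StringToWideString(S);
  ShowMessage(W); // displays "Übermäßi"; the final "g" is missing
end;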

Kas Ob. · September 24, 2020

UCS4 is UTF-32; this standard is fixed at 4 bytes per character.

Now you can see the problem, your data length is 9.

from https://unicode.org/faq/utf_bom.html

Name	UTF-8	UTF-16	UTF-16BE	UTF-16LE	UTF-32	UTF-32BE	UTF-32LE
Smallest code point	0000	0000	0000	0000	0000	0000	0000
Largest code point	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF
Code unit size	8 bits	16 bits	16 bits	16 bits	32 bits	32 bits	32 bits
Byte order	N/A	<BOM>	big-endian	little-endian	<BOM>	big-endian	little-endian
Fewest bytes per character	1	2	2	2	4	4	4
Most bytes per character	4	4	4	4	4	4	4

Anders Melander · September 24, 2020

1 hour ago, A.M. Hoornweg said:

could it be that function UCS4StringToWideString misses the final character during the conversion?

Yes it does:

function UCS4StringToWideString(const S: UCS4String): _WideStr;
var
  I: Integer;
  CharCount: Integer;
begin
  SetLength(Result, Length(S) * 2 - 1); //Maximum possible number of characters
  CharCount := 0;

  I := 0;
  while I < Length(S) - 1 do
  begin
    if S[I] >= $10000 then
    begin
      Inc(CharCount);
      Result[CharCount] := WideChar((((S[I] - $00010000) shr 10) and $000003FF) or $D800);
      Inc(CharCount);
      Result[CharCount] := WideChar(((S[I] - $00010000) and $000003FF)or $DC00);
    end
    else
    begin
      Inc(CharCount);
      Result[CharCount] := WideChar(S[I]);
    end;

    Inc(I);
  end;

  SetLength(Result, CharCount);
end;

The bug is the Length(s)-1 in the while condition (or the < instead of <=).

The first SetLength also seems strange, and I can't see why it should do a -1 there.

Anders Melander · September 24, 2020

12 minutes ago, Kas Ob. said:

your data length is 9

Yes, but each element is a UCS4Char (which is 4 bytes), so the size of the string (it's actually an array) is 9 * SizeOf(UCS4Char).

A.M. Hoornweg · September 24, 2020

OK, I'll post a QC then.

Edit: https://quality.embarcadero.com/browse/RSP-31114

Edited September 24, 2020 by A.M. Hoornweg

Lars Fosdal · September 24, 2020

@A.M. Hoornweg Being a bit pedantic now, but you would save the QA and developers quite a bit of work if your example was a complete console program.

Marco Cantu · September 24, 2020

Looking into it. I have the impression the original design assumes a UCS4String to have a null terminator.

s := [252,98,101,114,109,228,223,105,103,0]; // "übermäßig"

Not only does everything work if you add a 0 to the array, but this is also exactly what you get if you call UnicodeStringToUCS4String (an array plus the #0):

s := UnicodeStringToUCS4String('übermäßig'); // [252,98,101,114,109,228,223,105,103,0]
ShowMessage(Length(s).ToString); // 10

Honestly, I don't see us putting a lot of effort in UCS4String.

Darian Miller · September 24, 2020

28 minutes ago, Marco Cantu said:

Honestly, I don't see us putting a lot of effort in UCS4String

While it may be an honest statement, it comes across horribly... (maybe I'm just having a bad online day with all the garbage I've seen this morning?) But at least put in the effort to document the expected behavior.

Remy Lebeau · September 24, 2020

4 hours ago, Anders Melander said:

The bug is the Length(s)-1 in the while condition (or the < instead of <=).

That is not a bug, actually. That is intentional behavior. UCS4String is not a native RTL string type, like (Ansi|Unicode|UTF8|Wide)String are. It is just a plain ordinary dynamic array of UCS4Char:

type
  UCS4Char = type LongWord;
  UCS4String = array of UCS4Char;

So, for compatibility with APIs that take C-style strings, a UCS4String has an explicit #0 element at the end, so that PUCS4Char() typecasts are null-terminated the same way that P(Ansi|Wide)Char() typecasts on other string types are. As such, functions that output a UCS4String are required to allocate +1 for the null terminator, and functions that take a UCS4String as input are required to ignore the last element expecting it to be the null terminator.

4 hours ago, Anders Melander said:

The first SetLength also seems strange. And can't see why it should do a -1 there.

The code is pre-allocating the output WideString to the maximum number of UTF-16 codeunits that MAY be produced, which is the total number of UTF-32 codepoints minus the null terminator, multiplied by 2 (in case EVERY codepoint needs a UTF-16 surrogate pair). Then the code fills the WideString with actual UTF-16 codeunits, and then finally resizes the memory to the actual number of WideChars produced, not including a null terminator.
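The convention described above can be checked with a small console program. This is a minimal sketch, assuming a Unicode-enabled Delphi (2009 or later); the program name and variable names are made up for illustration, and the expected output in the comments follows from the RTL behavior described in this thread.

```pascal
program UCS4TerminatorDemo;

{$APPTYPE CONSOLE}

var
  Terminated, Unterminated: UCS4String;
begin
  // Arrays produced by the RTL carry a trailing 0 element.
  Terminated := UnicodeStringToUCS4String('abc');
  Writeln(Length(Terminated));                   // 4: 'a', 'b', 'c', 0
  Writeln(Terminated[High(Terminated)]);         // 0
  Writeln(UCS4StringToWideString(Terminated));   // abc

  // A hand-built array without the terminator: the RTL treats the
  // last element as the terminator and skips it.
  SetLength(Unterminated, 3);
  Unterminated[0] := Ord('a');
  Unterminated[1] := Ord('b');
  Unterminated[2] := Ord('c');
  Writeln(UCS4StringToWideString(Unterminated)); // ab
end.
```

A plain ASCII string is used so that the console code page does not affect what is printed.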

Remy Lebeau · September 24, 2020

45 minutes ago, Marco Cantu said:

Looking into it. I have the impression the original design assumes a UCS4String to have a null terminator.

Yes, exactly. UCS4String predates Delphi 2009, it was first introduced in Delphi 6 alongside UTF8String, which at the time was just an alias for AnsiString. In Delphi 2009, when AnsiString became codepage-aware and UTF8String became a true UTF-8 string type, UCS4String did not become a true UCS-4 string type, it remained a plain dynamic array. Since the D2009+ version of the StrRec record supports characters of varying byte sizes, I've asked for UCS4String to be changed into a true string type using 4-byte chars, backed by the same RTL logic that handles AnsiString(N) and UnicodeString, but that never happened.

Anders Melander · September 24, 2020

2 hours ago, Remy Lebeau said:

UCS4String is not a native RTL string type, like (Ansi|Unicode|UTF8|Wide)String are. It is just a plain ordinary dynamic array of UCS4Char

There's no need to state the obvious.

While the function may behave as designed, it's at the very least poorly documented. I looked at both the documentation and the implementation before I came to the conclusion that there was a bug.

I'm fine with the implementation staying the way it is (to avoid breaking existing code) but it should be documented that the array is assumed to be zero terminated.

Actually since the array doesn't need to be zero terminated (we know the length of the "string" already from the array) I can't see why the zero termination couldn't be made optional. Either way: fix the documentation.

Remy Lebeau · September 24, 2020

1 hour ago, Anders Melander said:

There's no need to state the obvious.

Not obvious to other people who are not familiar with UCS4String and how it works. It is not a common type most people work with.

1 hour ago, Anders Melander said:

While the function may behave as designed, it's at the very least poorly documented.

Agreed. Just like a lot of things in the documentation (or lack of).

1 hour ago, Anders Melander said:

Actually since the array doesn't need to be zero terminated (we know the length of the "string" already from the array)

Sure, for purposes of just iterating and accessing characters, such as during conversions.

1 hour ago, Anders Melander said:

I can't see why the zero termination couldn't be made optional.

I stated earlier why it exists at all - mainly so PUCS4Char() typecasts will be null-terminated. That allows the array to act more like a C-style string, just like native string types do.

1 hour ago, Anders Melander said:

Either way: fix the documentation.

Of course, obviously.

pyscripter · September 24, 2020

This was all covered in

Edited September 24, 2020 by pyscripter

pyscripter · September 24, 2020

3 hours ago, Anders Melander said:

Actually since the array doesn't need to be zero terminated

It needs to be when you interoperate with libraries in which the default character encoding is UCS4 (e.g. the Python API on Linux, https://www.python.org/dev/peps/pep-0513/).

Edited September 24, 2020 by pyscripter

pyscripter · September 24, 2020

7 hours ago, Marco Cantu said:

Honestly, I don't see us putting a lot of effort in UCS4String

I suppose this is understandable. The Windows world is on UTF-16 and the rest on UTF-8. UCS4 (UTF-32) is very rarely used.

With the benefit of hindsight, UTF-8 would probably have been a better choice for Windows. Now Microsoft is trying to bring UTF-8 back into Windows via the A routines and by offering to make UTF-8 the default character encoding.

Edited September 24, 2020 by pyscripter

A.M. Hoornweg · September 25, 2020

6 hours ago, pyscripter said:

I suppose this is understandable. The Windows world is on UTF-16 and the rest on UTF-8. UCS4 (UTF-32) is very rarely used.

With the benefit of hindsight, UTF-8 would probably have been a better choice for Windows. Now Microsoft is trying to bring UTF-8 back into Windows via the A routines and by offering to make UTF-8 the default character encoding.

Most programmers ignore that UTF-16 is a variable-length encoding and treat it like UCS-2, falsely assuming that one WideChar corresponds to one code point. While most of us probably don't handle exotic languages, this is the 21st century, and sooner or later you'll stumble upon strings that contain emoticons, which just won't fit in one WideChar (https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)).

Delphi could use some better support for that, like iterators etc. Something like this:

function ReverseString(S: string): string;
var
  c: UCS4Char;
begin
  Result := '';
  for c in S do
    Result := c + Result;
end;

And if you think this is far-fetched, just look at how elegantly Free Pascal solves this: https://wiki.freepascal.org/for-in_loop#Traversing_UTF-8_strings .
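Until such an iterator exists, a code-point-aware reversal can be written by hand. The following is a minimal sketch, not an RTL facility: the function names ReverseByCodePoint, IsHighSurrogateUnit and IsLowSurrogateUnit are hypothetical, and the code assumes 1-based string indexing. It keeps surrogate pairs together but makes no attempt to handle combining characters.

```pascal
program ReverseCodePoints;

{$APPTYPE CONSOLE}

// Hypothetical helpers: classify a UTF-16 code unit by its range.
function IsHighSurrogateUnit(C: Char): Boolean;
begin
  Result := (C >= #$D800) and (C <= #$DBFF);
end;

function IsLowSurrogateUnit(C: Char): Boolean;
begin
  Result := (C >= #$DC00) and (C <= #$DFFF);
end;

// Reverses S one code point at a time, so that a surrogate pair
// stays in high/low order instead of being flipped.
function ReverseByCodePoint(const S: string): string;
var
  I, Len: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    if (I < Length(S)) and IsHighSurrogateUnit(S[I]) and
       IsLowSurrogateUnit(S[I + 1]) then
      Len := 2  // well-formed surrogate pair
    else
      Len := 1; // BMP character or unpaired surrogate
    Result := Copy(S, I, Len) + Result;
    Inc(I, Len);
  end;
end;

begin
  Writeln(ReverseByCodePoint('abc') = 'cba');        // TRUE
  // U+1F4BC is encoded in UTF-16 as D83D DCBC.
  Writeln(ReverseByCodePoint('a' + #$D83D#$DCBC + 'b') =
          'b' + #$D83D#$DCBC + 'a');                 // TRUE
end.
```

Prepending to Result on every step is quadratic in the worst case; it is kept here only to mirror the shape of the wished-for loop.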

Anders Melander · September 25, 2020

8 hours ago, pyscripter said:

It needs to be when you interoperate with libraries in which the default character encoding is UCS4

I meant the function doesn't need it to be zero terminated (it has the length already).

pyscripter · September 25, 2020

1 hour ago, Anders Melander said:

I meant the function doesn't need it to be zero terminated (it has the length already).

Why should any Delphi program use UCS4? The main use case is interoperability. You would need to pass the result to some external function. Having to add the null terminator manually would be inconvenient.

pyscripter · September 25, 2020

3 hours ago, A.M. Hoornweg said:

Most programmers ignore that UTF-16 is a variable-length encoding and treat it like UCS-2, falsely assuming that one WideChar corresponds to one code point. While most of us probably don't handle exotic languages, this is the 21st century, and sooner or later you'll stumble upon strings that contain emoticons, which just won't fit in one WideChar (https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)).

You don't have to use UCS4 for that, and using UCS4 would not solve this problem. There are mainly two issues:

1) Surrogate pairs (two WideChars correspond to one glyph). UCS4 would help with this one.
UTF-16 Encoding: 0xD83D 0xDCBC

“💼”

2) Combining characters (more than one WideChar shown as one glyph). UCS4 would not help with this one.
Åström ḱṷṓn
Precomposed vs. decomposed:
ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

Windows provides CharNext/CharPrev, which deal with both issues, but not perfectly. You have to use Uniscribe or DirectWrite for greater accuracy.

In SynEdit there is this function:

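function SynCharNext(P: PWideChar; out Element: string): PWideChar; overload;
var
  Start: PWideChar;
begin
  Start := P;
  // CharNext returns a pointer to the next text element
  Result := Windows.CharNext(P);
  // Copy the code units between the two pointers into Element
  SetString(Element, Start, Result - Start);
end;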

It is very easy to write an enumerator that works with CharNext.
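As a rough illustration of the building block for such an enumerator, here is a plain loop over CharNext. It is a minimal sketch, assuming Windows and the Winapi.Windows unit; the procedure name ListTextElements is hypothetical. It stops at the first #0, so strings with embedded nulls are not covered, and how CharNext groups the code units depends on Windows, so no particular output is claimed.

```pascal
program CharNextDemo;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

// Hypothetical helper: prints the length, in UTF-16 code units,
// of each element that CharNext steps over.
procedure ListTextElements(const S: string);
var
  P, Next: PChar;
  Element: string;
begin
  P := PChar(S);
  while P^ <> #0 do
  begin
    Next := CharNext(P);               // pointer to the next element
    SetString(Element, P, Next - P);   // copy the code units in between
    Writeln(Length(Element), ' code unit(s)');
    P := Next;
  end;
end;

begin
  // 'a' followed by U+0301 (combining acute accent), then 'b'
  ListTextElements('a' + #$0301 + 'b');
end.
```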

Edited September 25, 2020 by pyscripter

Anders Melander · September 25, 2020

10 minutes ago, pyscripter said:

Why should any Delphi program use UCS4? The main use case is for inter-operability. You would need to pass the result to some external function. Having to manually add the null-terminator, would be inconvenient.

Remember that we're talking about UCS4StringToWideString and not WideStringToUCS4String but I'll try again: Yes, it's a given that the resulting string, which is a 2 byte widestring, will be zero terminated as all Delphi long strings are.

If the input was a PUCS4Char (a pointer to a zero terminated 4 byte string) then the input would have to be zero terminated. But it isn't. The input is an array in which the length is implicit.

This means that the zero termination requirement is superfluous. The function could be implemented so that it handled both arrays with and arrays without a zero in the final entry. You know; Defensive coding. Just in case someone mistakenly passed an array that wasn't zero terminated because the documentation didn't state that you had to...
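The defensive variant suggested above could look roughly like this. It is a minimal sketch and not the RTL implementation: the function name UCS4ToStringTolerant is hypothetical, and it assumes 1-based string indexing. One trade-off is visible in the code: an array whose last real character is U+0000 cannot be told apart from a terminated one, so that character is dropped.

```pascal
program TolerantUCS4Demo;

{$APPTYPE CONSOLE}

// Hypothetical helper: converts a UCS4String to a UTF-16 string and
// accepts arrays with or without a trailing 0 element.
function UCS4ToStringTolerant(const S: UCS4String): string;
var
  I, Count, N: Integer;
  C: UCS4Char;
begin
  Count := Length(S);
  // Skip the last element only if it really is a terminator.
  if (Count > 0) and (S[Count - 1] = 0) then
    Dec(Count);

  SetLength(Result, Count * 2); // worst case: every code point needs a pair
  N := 0;
  for I := 0 to Count - 1 do
  begin
    C := S[I];
    if C >= $10000 then
    begin
      C := C - $10000;
      Inc(N);
      Result[N] := WideChar($D800 or ((C shr 10) and $3FF)); // high surrogate
      Inc(N);
      Result[N] := WideChar($DC00 or (C and $3FF));          // low surrogate
    end
    else
    begin
      Inc(N);
      Result[N] := WideChar(C);
    end;
  end;
  SetLength(Result, N); // trim to the number of code units written
end;

var
  A: UCS4String;
begin
  SetLength(A, 3);
  A[0] := Ord('a');
  A[1] := Ord('b');
  A[2] := Ord('c');
  Writeln(UCS4ToStringTolerant(A)); // abc (no terminator)

  SetLength(A, 4);
  A[3] := 0;
  Writeln(UCS4ToStringTolerant(A)); // abc (with terminator)
end.
```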

pyscripter · September 25, 2020

36 minutes ago, Anders Melander said:

Remember that we're talking about UCS4StringToWideString and not WideStringToUCS4String

The issue is the in-memory representation of UCS4. Should it have the redundant (obviously, no need to make the same argument multiple times) #0 or not? I argued that for interoperability (passing a pointer to external functions expecting null-terminated UCS4) it is better that it always includes it, despite the redundancy. And this is what the RTL assumes and does.

Edited September 25, 2020 by pyscripter
