A.M. Hoornweg

UCS4StringToWideString broken?


Hello all,

 

could it be that function UCS4StringToWideString misses the final character during the conversion?

 

VAR   s:UCS4String; W:String;
begin
    s:=[220,98,101,114,109,228,223,105,103];
    w:=UCS4StringToWideString(s);
    showmessage(w);
end;

 


UCS4 is UTF-32; this encoding is fixed at 4 bytes per character.

Now you can see the problem: your data length is 9.

From https://unicode.org/faq/utf_bom.html:

 

Name                        UTF-8   UTF-16  UTF-16BE    UTF-16LE       UTF-32  UTF-32BE    UTF-32LE
Smallest code point         0000    0000    0000        0000           0000    0000        0000
Largest code point          10FFFF  10FFFF  10FFFF      10FFFF         10FFFF  10FFFF      10FFFF
Code unit size              8 bits  16 bits 16 bits     16 bits        32 bits 32 bits     32 bits
Byte order                  N/A     <BOM>   big-endian  little-endian  <BOM>   big-endian  little-endian
Fewest bytes per character  1       2       2           2              4       4           4
Most bytes per character    4       4       4           4              4       4           4

1 hour ago, A.M. Hoornweg said:

could it be that function UCS4StringToWideString misses the final character during the conversion?

Yes it does:

function UCS4StringToWideString(const S: UCS4String): _WideStr;
var
  I: Integer;
  CharCount: Integer;
begin
  SetLength(Result, Length(S) * 2 - 1); //Maximum possible number of characters
  CharCount := 0;

  I := 0;
  while I < Length(S) - 1 do
  begin
    if S[I] >= $10000 then
    begin
      Inc(CharCount);
      Result[CharCount] := WideChar((((S[I] - $00010000) shr 10) and $000003FF) or $D800);
      Inc(CharCount);
      Result[CharCount] := WideChar(((S[I] - $00010000) and $000003FF)or $DC00);
    end
    else
    begin
      Inc(CharCount);
      Result[CharCount] := WideChar(S[I]);
    end;

    Inc(I);
  end;

  SetLength(Result, CharCount);
end;

 

The bug is the Length(s)-1 in the while condition (or the < instead of <=).

 

The first SetLength also seems strange. I can't see why it should do a -1 there.

12 minutes ago, Kas Ob. said:

your data length is 9

Yes but each element is an UCS4Char (which is 4 bytes), so the string (it's an array actually) size is 9 * SizeOf(UCS4Char).


Looking into it. I have the impression the original design assumes a UCS4String to have a null terminator.

s:= [220,98,101,114,109,228,223,105,103,0]; // "übermäßig"

 

Not only does everything work if you add 0 to the array, but if you call UnicodeStringToUCS4String this is exactly what you get (an array plus the #0):

 

s:= UnicodeStringToUCS4String('übermäßig'); //[220,98,101,114,109,228,223,105,103,0]; // "übermäßig"
ShowMessage (length(s).ToString);

 

Honestly, I don't see us putting a lot of effort in UCS4String 

28 minutes ago, Marco Cantu said:

Honestly, I don't see us putting a lot of effort in UCS4String 

 

While it may be an honest statement, it comes across horribly... (maybe I'm just having a bad online day with all the garbage I've seen this morning?)  But at least put in the effort to document the expected behavior.

4 hours ago, Anders Melander said:

The bug is the Length(s)-1 in the while condition (or the < instead of <=).

That is not a bug, actually.  That is intentional behavior.  UCS4String is not a native RTL string type, like (Ansi|Unicode|UTF8|Wide)String are.  It is just a plain ordinary dynamic array of UCS4Char:

type
  UCS4Char = type LongWord;
  UCS4String = array of UCS4Char;

So, for compatibility with APIs that take C-style strings, a UCS4String has an explicit #0 element at the end, so that PUCS4Char() typecasts are null-terminated the same way that P(Ansi|Wide)Char() typecasts on other string types are.  As such, functions that output a UCS4String are required to allocate +1 for the null terminator, and functions that take a UCS4String as input are required to ignore the last element expecting it to be the null terminator.
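To make the convention above concrete, here is a minimal sketch (not taken from the RTL). The program and variable names are made up for the example, and it assumes a Delphi version that supports dynamic array literals, as the code in the first post does.

```pascal
program Ucs4TerminatorDemo; // hypothetical demo program

{$APPTYPE CONSOLE}

var
  Src, RoundTrip: UCS4String;
  W: UnicodeString;
begin
  // Hand-built array: the trailing 0 has to be added explicitly,
  // because the conversion routine ignores the last element.
  Src := [220, 98, 101, 114, 109, 228, 223, 105, 103, 0];
  W := UCS4StringToUnicodeString(Src);
  Writeln('UTF-16 length: ', Length(W));

  // Going the other way, the RTL appends the terminator itself,
  // so the array is one element longer than the string.
  RoundTrip := UnicodeStringToUCS4String(W);
  Writeln('UCS4 array length: ', Length(RoundTrip));
end.
```

If the convention holds as described in this thread, the first line should report 9 and the second 10.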

4 hours ago, Anders Melander said:

The first SetLength also seems strange. I can't see why it should do a -1 there.

The code is pre-allocating the output WideString to the maximum number of UTF-16 codeunits that MAY be produced, which is the total number of UTF-32 codepoints minus the null terminator, multiplied by 2 (in case EVERY codepoint needs a UTF-16 surrogate pair).  Then the code fills the WideString with actual UTF-16 codeunits, and then finally resizes the memory to the actual number of WideChars produced, not including a null terminator.
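As a worked example of the surrogate-pair step, the sketch below applies the same two expressions as the quoted function to a single code point. U+1F4BC is just an arbitrary code point above $FFFF picked for illustration, and the program and variable names are hypothetical.

```pascal
program SurrogatePairDemo; // hypothetical demo program

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  CodePoint: UCS4Char = $1F4BC; // arbitrary example above $FFFF

var
  HighSur, LowSur: Word;
begin
  // Subtract $10000, then split the remaining 20 bits into two 10-bit halves.
  HighSur := Word((((CodePoint - $10000) shr 10) and $3FF) or $D800);
  LowSur  := Word(((CodePoint - $10000) and $3FF) or $DC00);

  // $1F4BC - $10000 = $F4BC; the top 10 bits are $3D, the low 10 bits are $BC.
  Writeln(IntToHex(Integer(HighSur), 4), ' ', IntToHex(Integer(LowSur), 4)); // D83D DCBC
end.
```

One code point can therefore turn into two WideChars, which is why the output buffer is sized for the worst case before being trimmed.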

45 minutes ago, Marco Cantu said:

Looking into it. I have the impression the original design assumes a UCS4String to have a null terminator.

Yes, exactly.  UCS4String predates Delphi 2009, it was first introduced in Delphi 6 alongside UTF8String, which at the time was just an alias for AnsiString.  In Delphi 2009, when AnsiString became codepage-aware and UTF8String became a true UTF-8 string type, UCS4String did not become a true UCS-4 string type, it remained a plain dynamic array.  Since the D2009+ version of the StrRec record supports characters of varying byte sizes, I've asked for UCS4String to be changed into a true string type using 4-byte chars, backed by the same RTL logic that handles AnsiString(N) and UnicodeString, but that never happened.

2 hours ago, Remy Lebeau said:

UCS4String is not a native RTL string type, like (Ansi|Unicode|UTF8|Wide)String are.  It is just a plain ordinary dynamic array of UCS4Char

There's no need to state the obvious.

While the function may behave as designed, it's at the very least poorly documented. I looked at both the documentation and the implementation before I came to the conclusion that there was a bug.

 

I'm fine with the implementation staying the way it is (to avoid breaking existing code) but it should be documented that the array is assumed to be zero terminated.

 

Actually since the array doesn't need to be zero terminated (we know the length of the "string" already from the array) I can't see why the zero termination couldn't be made optional. Either way: fix the documentation.

1 hour ago, Anders Melander said:

There's no need to state the obvious.

Not obvious to other people who are not familiar with UCS4String and how it works.  It is not a common type most people work with.

1 hour ago, Anders Melander said:

While the function may behave as designed, it's at the very least poorly documented.

Agreed.  Just like a lot of things in the documentation (or lack of).

1 hour ago, Anders Melander said:

Actually since the array doesn't need to be zero terminated (we know the length of the "string" already from the array)

Sure, for purposes of just iterating and accessing characters, such as during conversions.

1 hour ago, Anders Melander said:

I can't see why the zero termination couldn't be made optional.

I stated earlier why it exists at all - mainly so PUCS4Char() typecasts will be null-terminated.  That allows the array to act more like a C-style string, just like native string types do.

1 hour ago, Anders Melander said:

Either way: fix the documentation.

Of course, obviously.

3 hours ago, Anders Melander said:

Actually since the array doesn't need to be zero terminated

It needs to be when you interoperate with libraries in which the default character encoding is UCS4 (e.g. the Python API on Linux, https://www.python.org/dev/peps/pep-0513/).

7 hours ago, Marco Cantu said:

Honestly, I don't see us putting a lot of effort in UCS4String

I suppose this is understandable.   The Windows world is on UTF-16 and the rest on UTF-8.  UCS4 (UTF-32) is very rarely used.

 

With the benefit of hindsight, UTF-8 would probably have been a better choice for Windows.  Now Microsoft is trying to bring UTF-8 back into Windows via the A routines and by offering to make UTF-8 the default character encoding.

6 hours ago, pyscripter said:

I suppose this is understandable.   The Windows world is on UTF-16 and the rest on UTF-8.  UCS4 (UTF-32) is very rarely used.

 

With the benefit of hindsight, UTF-8 would probably have been a better choice for Windows.  Now Microsoft is trying to bring UTF-8 back into Windows via the A routines and by offering to make UTF-8 the default character encoding.

 

Most programmers ignore that UTF-16 is a variable-length encoding and treat it like UCS-2, falsely assuming that one WideChar corresponds to one code point.  While most of us probably don't handle exotic languages, this is the 21st century and sooner or later you'll stumble upon strings that contain emoticons, which just won't fit in one WideChar (https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)).
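To illustrate the difference between WideChars and code points, here is a small sketch. `CodePointCount` is a hypothetical helper, not an RTL function, and it assumes 1-based string indexing (the desktop default).

```pascal
program CodePointCountDemo; // hypothetical demo program

{$APPTYPE CONSOLE}

// Counts code points by treating a valid high+low surrogate pair as one unit.
// Assumes 1-based strings; unpaired surrogates are counted individually.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] >= #$D800) and (S[I] <= #$DBFF) and (I < Length(S)) and
       (S[I + 1] >= #$DC00) and (S[I + 1] <= #$DFFF) then
      Inc(I, 2)  // surrogate pair: two WideChars, one code point
    else
      Inc(I);    // BMP character (or a stray surrogate)
    Inc(Result);
  end;
end;

var
  Sample: string;
begin
  Sample := 'A' + #$D83D#$DCBC;      // 'A' followed by U+1F4BC
  Writeln(Length(Sample));           // 3 WideChars
  Writeln(CodePointCount(Sample));   // 2 code points
end.
```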

 

Delphi could use some better support for that, like iterators etc. Something like this:

Function ReverseString (S:String):String;
VAR c:UCS4Char;
Begin
  Result:='';
  FOR c in s do 
   result:=c + result;
End;

 

And if you think this is far fetched, just look how elegantly Freepascal solves this, https://wiki.freepascal.org/for-in_loop#Traversing_UTF-8_strings . 

 

 

 

 

 

 

 

8 hours ago, pyscripter said:

It needs to be when you interoperate with libraries in which the default character encoding is UCS4

I meant the function doesn't need it to be zero terminated (it has the length already).

 

 

1 hour ago, Anders Melander said:

I meant the function doesn't need it to be zero terminated (it has the length already).

Why should any Delphi program use UCS4?  The main use case is for inter-operability.   You would need to pass the result to some external function.   Having to manually add the null-terminator, would be inconvenient.

3 hours ago, A.M. Hoornweg said:

Most programmers ignore that UTF-16 is a variable-length encoding and treat it like UCS-2, falsely assuming that one WideChar corresponds to one code point.  While most of us probably don't handle exotic languages, this is the 21st century and sooner or later you'll stumble upon strings that contain emoticons, which just won't fit in one WideChar (https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)).

You don't have to use UCS4 for that, and using UCS4 would not fully solve this problem.  There are mainly two issues:

 

1) Surrogate pairs (two WideChars correspond to one glyph).  UCS4 would help with this one.
UTF-16 Encoding: 0xD83D 0xDCBC 

 

2) Combining characters (more than one WideChar shown as one glyph).  UCS4 would not help with this one.
Åström ḱṷṓn
Precomposed vs. decomposed:
ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

 

Windows provides CharNext/Prev that deals with both issues,  but not perfectly.  You have to use Uniscribe or DirectWrite for greater accuracy.

 

In SynEdit there is this function:

function SynCharNext(P: PWideChar; out Element: String): PWideChar; overload;
Var
  Start : PWideChar;
begin
  Start := P;
  Result := Windows.CharNext(P);
  SetString(Element, Start, Result - Start);
end;

 

 

It is very easy to write an enumerator that works with CharNext.
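As a rough sketch of what such an enumerator could look like: `TTextElements` below is a hypothetical record, not part of SynEdit or the RTL. It is Windows-only because it calls CharNext, it stops at the first #0, and how the text is split depends entirely on what CharNext returns.

```pascal
program CharNextEnumDemo; // hypothetical demo program

{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

type
  // Hypothetical enumerator that steps through a string with CharNext.
  // The source string must stay alive while the enumerator is in use.
  TTextElements = record
  private
    FNext: PWideChar;
    FCurrent: string;
  public
    constructor Create(const S: string);
    function GetEnumerator: TTextElements;
    function MoveNext: Boolean;
    property Current: string read FCurrent;
  end;

constructor TTextElements.Create(const S: string);
begin
  FNext := PWideChar(S); // points at #0 for an empty string
  FCurrent := '';
end;

function TTextElements.GetEnumerator: TTextElements;
begin
  Result := Self; // the record acts as its own enumerator
end;

function TTextElements.MoveNext: Boolean;
var
  Start: PWideChar;
begin
  Result := FNext^ <> #0;
  if Result then
  begin
    Start := FNext;
    FNext := CharNext(Start);                   // let Windows pick the boundary
    SetString(FCurrent, Start, FNext - Start);  // copy the WideChars in between
  end;
end;

var
  Sample, Element: string;
begin
  Sample := 'A' + #$D83D#$DCBC + 'B';
  for Element in TTextElements.Create(Sample) do
    Writeln(Length(Element)); // number of WideChars in each element
end.
```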

 

10 minutes ago, pyscripter said:

Why should any Delphi program use UCS4?  The main use case is for inter-operability.   You would need to pass the result to some external function.   Having to manually add the null-terminator, would be inconvenient.

Remember that we're talking about UCS4StringToWideString and not WideStringToUCS4String but I'll try again: Yes, it's a given that the resulting string, which is a 2 byte widestring, will be zero terminated as all Delphi long strings are.

If the input was a PUCS4Char (a pointer to a zero terminated 4 byte string) then the input would have to be zero terminated. But it isn't. The input is an array in which the length is implicit.

This means that the zero termination requirement is superfluous. The function could be implemented so that it handled both arrays with and arrays without a zero in the final entry. You know; Defensive coding. Just in case someone mistakenly passed an array that wasn't zero terminated because the documentation didn't state that you had to...
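A defensive wrapper along those lines could look roughly like this. `UCS4ToUnicodeLenient` is a hypothetical name, not an RTL function; it leaves the RTL routine untouched and appends the terminator only when it is missing. One limitation: a trailing 0 is always treated as the terminator.

```pascal
program LenientUcs4Demo; // hypothetical demo program

{$APPTYPE CONSOLE}

// Accepts arrays with or without a trailing 0 and delegates to the RTL.
function UCS4ToUnicodeLenient(const S: UCS4String): UnicodeString;
var
  Tmp: UCS4String;
  N: Integer;
begin
  N := Length(S);
  if (N > 0) and (S[N - 1] = 0) then
    Exit(UCS4StringToUnicodeString(S)); // already terminated

  // No terminator: work on a copy so the caller's array is not modified.
  Tmp := Copy(S, 0, N);
  SetLength(Tmp, N + 1);
  Tmp[N] := 0;
  Result := UCS4StringToUnicodeString(Tmp);
end;

var
  Terminated, Bare: UCS4String;
begin
  Terminated := [220, 98, 101, 114, 0];
  Bare := [220, 98, 101, 114];
  // Both inputs should convert to the same four-character string.
  Writeln(UCS4ToUnicodeLenient(Terminated) = UCS4ToUnicodeLenient(Bare));
end.
```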

36 minutes ago, Anders Melander said:

Remember that we're talking about UCS4StringToWideString and not WideStringToUCS4String

The issue is the in-memory representation of UCS4.  Should it have the redundant (obviously, no need to make the same argument multiple times :)) #0 or not?  I argued that for interoperability (passing a pointer to external functions expecting null-terminated UCS4) it is better that it always includes it, despite the redundancy.  And this is what the RTL assumes and does.
