Delphi 12 : Encoding Unicode strange behaviour

gioma · March 4, 2024

Hi,

I have this code:

procedure ProcessData(data: PByte; asize: Integer);
var
  data_pack: TBytes;
  t: TStreamWriter;
  DataType: Char;
  nflog: string;
begin
  nflog := 'C:\TEST\Log.txt';
  // Append when the log already exists, otherwise create it.
  if FileExists(nflog) then
    t := TStreamWriter.Create(nflog, True, TEncoding.UTF8)
  else
    t := TStreamWriter.Create(nflog, False, TEncoding.UTF8);

  SetLength(data_pack, asize);
  Move(data^, data_pack[0], asize);
  // here the coding is perfect
  t.WriteLine(TEncoding.Unicode.GetString(data_pack));
  t.Flush;

  DataType := TEncoding.Unicode.GetChars(data_pack)[0];

  Delete(data_pack, 0, 1);
  // here the content of data_pack seems to have become UTF-8 and therefore is not decoded correctly
  t.WriteLine(TEncoding.Unicode.GetString(data_pack));
  t.Flush;
  t.Free;
end;

The "data" variable holds a Unicode string, but when Delphi executes the "delete" operation it seems to convert the contents of the data_pack variable into UTF-8, so the second value is written incorrectly.

Does anyone know why?

Edited March 4, 2024 by gioma

Alexander Sviridenkov · March 4, 2024

A Unicode (UTF-16) character takes at least 2 bytes, so after deleting 1 byte every remaining code unit is misaligned and the string decodes incorrectly.
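
To illustrate the point above, here is a minimal sketch that removes one whole UTF-16 code unit (2 bytes) from the front of a byte buffer instead of a single byte. The helper name DropFirstCodeUnit is hypothetical, and the sketch assumes the buffer holds little-endian UTF-16 text whose first character lies in the BMP.

```pascal
program DropFirstCodeUnitDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns a copy of Data without its first
// UTF-16 code unit (2 bytes). Assumes UTF-16LE content.
function DropFirstCodeUnit(const Data: TBytes): TBytes;
const
  CodeUnitSize = SizeOf(WideChar); // 2 bytes
begin
  if Length(Data) <= CodeUnitSize then
    Exit(nil);
  Result := Copy(Data, CodeUnitSize, Length(Data) - CodeUnitSize);
end;

var
  Bytes: TBytes;
begin
  // 'ABC' encoded as UTF-16LE is 41 00 42 00 43 00.
  Bytes := TEncoding.Unicode.GetBytes('ABC');
  // Dropping 2 bytes keeps the remaining code units aligned.
  Writeln(TEncoding.Unicode.GetString(DropFirstCodeUnit(Bytes))); // prints BC
end.
```

Deleting only 1 byte would leave 00 42 00 43 00, which pairs each high byte with the wrong low byte and produces unrelated characters.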

gioma · March 4, 2024

then if I have a unicode string like this: 'ABCDEFGHI' and I put it inside a byte array if I want to remove the first character from the byte array do I have to remove two bytes and not one?

Der schöne Günther · March 4, 2024

Your life will be much easier if you rely on Strings and Chars for text manipulation, not bytes. Convert to bytes when your text manipulation is done, not before that.
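
A rough sketch of that advice: decode the bytes once, then split the marker from the body with ordinary string operations. The record TTextPacket and the function ParseTextPacket are hypothetical names, and the sketch assumes the whole packet is UTF-16LE text whose first character is the type marker.

```pascal
program DecodeThenSplitDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical container for the decoded parts of a text packet.
  TTextPacket = record
    DataType: Char;
    Body: string;
  end;

// Decodes the packet to a string first, then separates the first
// character from the rest. No byte arithmetic is needed.
function ParseTextPacket(const Packet: TBytes): TTextPacket;
var
  Text: string;
begin
  Result.DataType := #0;
  Result.Body := '';
  Text := TEncoding.Unicode.GetString(Packet);
  if Text = '' then
    Exit;
  Result.DataType := Text[Low(Text)];
  Result.Body := Copy(Text, 2, MaxInt);
end;

var
  Parsed: TTextPacket;
begin
  Parsed := ParseTextPacket(TEncoding.Unicode.GetBytes('TABCDEFGHI'));
  Writeln(Parsed.DataType, ' ', Parsed.Body); // prints T ABCDEFGHI
end.
```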

Alexander Sviridenkov · March 4, 2024

38 minutes ago, gioma said:

then if I have a unicode string like this: 'ABCDEFGHI' and I put it inside a byte array if I want to remove the first character from the byte array do I have to remove two bytes and not one?

Yes, two or four (for symbols outside the BMP, which are encoded as a surrogate pair).
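
As a sketch of how the two-or-four rule could be checked on raw bytes: read the first code unit and test whether it is a high surrogate (in the range $D800..$DBFF). The function name FirstCharByteCount is hypothetical, and little-endian UTF-16 is assumed.

```pascal
program FirstCharSizeDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: number of bytes occupied by the first
// character of a UTF-16LE buffer. Returns 0 if the buffer is too
// short to hold a complete character.
function FirstCharByteCount(const Data: TBytes): Integer;
var
  FirstUnit: Word;
begin
  Result := 0;
  if Length(Data) < 2 then
    Exit;
  // Little-endian: low byte first, high byte second.
  FirstUnit := Word(Data[0]) or (Word(Data[1]) shl 8);
  if (FirstUnit >= $D800) and (FirstUnit <= $DBFF) then
  begin
    // High surrogate: the character continues in a second code unit.
    if Length(Data) >= 4 then
      Result := 4;
  end
  else
    Result := 2;
end;

begin
  // 'A' is U+0041, a BMP character: one code unit.
  Writeln(FirstCharByteCount(TEncoding.Unicode.GetBytes('A')));
  // U+1F600 is encoded as the surrogate pair D83D DE00.
  Writeln(FirstCharByteCount(TBytes.Create($3D, $D8, $00, $DE)));
end.
```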

gioma · March 4, 2024

6 minutes ago, Der schöne Günther said:

Your life will be much easier if you rely on Strings and Chars for text manipulation, not bytes. Convert to bytes when your text manipulation is done, not before that.

I know, but it's a packet received via datachannel that only transmits byte arrays.

gioma · March 4, 2024

7 minutes ago, Alexander Sviridenkov said:

Yes, two or four (for symbols outside the BMP, which are encoded as a surrogate pair).

ok, thanks, then I think this is exactly the problem.

aehimself · March 4, 2024

3 hours ago, gioma said:

I know, but it's a packet received via datachannel that only transmits byte arrays.

The medium and the format where your data arrives is irrelevant. Just assemble the text in your receiver routine and go on from there.

gioma · March 4, 2024

1 hour ago, aehimself said:

The medium and the format where your data arrives is irrelevant. Just assemble the text in your receiver routine and go on from there.

in reality that data can be a UTF8 string or a UNICODE string or a file.
The first character of the data tells me what type of data I have to work with.

For this reason I have to separate the header from the rest of the message.

Edited March 4, 2024 by gioma

Cristian Peța · March 5, 2024

16 hours ago, gioma said:

in reality that data can be a UTF8 string or a UNICODE string or a file.
The first character of the data tells me what type of data I have to work with.

Then you don't know if that first char is one or two bytes.

You test first byte (suppose UTF-8) and if it is not what you expected then you test first two bytes?

David Heffernan · March 5, 2024

17 hours ago, gioma said:

The first character of the data tells me what type of data I have to work with.

For this reason I have to separate the header from the rest of the message.

This sounds completely broken. What if the file starts with the sentinel value that indicates that you have a UTF-8 payload?

Why aren't you passing a separate header, and then the payload? And why are you passing UTF-16 at all? Isn't that just expensive? And if you must pass around UTF-16, please tell me that you are handling byte order correctly.
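
On the byte-order point, a small sketch: Delphi's TEncoding.Unicode decodes little-endian UTF-16, while TEncoding.BigEndianUnicode decodes big-endian UTF-16, so the receiver has to know which one the sender used. The function DecodeUtf16 and its BigEndian flag are hypothetical; how the flag would reach the receiver is not specified in this thread.

```pascal
program Utf16ByteOrderDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: decodes UTF-16 bytes with an explicit byte order.
function DecodeUtf16(const Data: TBytes; BigEndian: Boolean): string;
begin
  if BigEndian then
    Result := TEncoding.BigEndianUnicode.GetString(Data)
  else
    Result := TEncoding.Unicode.GetString(Data);
end;

begin
  // The same letter 'A' (U+0041) in both byte orders.
  Writeln(DecodeUtf16(TBytes.Create($41, $00), False)); // prints A
  Writeln(DecodeUtf16(TBytes.Create($00, $41), True));  // prints A
end.
```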

Remy Lebeau · March 5, 2024

On 3/4/2024 at 7:55 AM, gioma said:

in reality that data can be a UTF8 string or a UNICODE string or a file.
The first character of the data tells me what type of data I have to work with.

That is not a good idea. Such an indicator should not be in the data payload itself, it should precede the payload, i.e. in a separate message header. Also, to differentiate between UTF-8 and UTF-16, you could use standard Unicode BOMs. So, for instance, have a message header that indicates whether the payload is text bytes or file bytes, and the total byte size. Then in the payload itself, if it is text, have it start with a BOM before the actual text bytes. In reality, though, it is generally not a good idea to use UTF-16 in data transmissions. Better to stick with just UTF-8. Convert to/from UTF-16 in memory only, if needed.
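
A minimal sketch of the header-plus-BOM idea, under stated assumptions: the wire layout used here (1 byte for the payload kind, then a 4-byte little-endian payload size, then the payload) is invented for illustration and is not taken from the original poster's protocol. The names TPayloadKind, TParsedMessage, ParseMessage and DecodeTextPayload are likewise hypothetical. BOM detection uses TEncoding.GetBufferEncoding, falling back to UTF-8 when no BOM is present.

```pascal
program HeaderAndBomDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  TPayloadKind = (pkText, pkFile); // hypothetical kinds
  TParsedMessage = record
    Valid: Boolean;
    Kind: TPayloadKind;
    Payload: TBytes;
  end;

// Assumed layout: [kind: 1 byte][size: 4 bytes, little-endian][payload].
function ParseMessage(const Msg: TBytes): TParsedMessage;
const
  HeaderSize = 5;
var
  Size: Cardinal;
begin
  Result.Valid := False;
  Result.Kind := pkText;
  Result.Payload := nil;
  if Length(Msg) < HeaderSize then
    Exit;
  if Msg[0] > Ord(High(TPayloadKind)) then
    Exit; // unknown payload kind
  Size := Cardinal(Msg[1]) or (Cardinal(Msg[2]) shl 8) or
    (Cardinal(Msg[3]) shl 16) or (Cardinal(Msg[4]) shl 24);
  if Size > Cardinal(Length(Msg) - HeaderSize) then
    Exit; // declared size exceeds the bytes actually received
  Result.Kind := TPayloadKind(Msg[0]);
  Result.Payload := Copy(Msg, HeaderSize, Integer(Size));
  Result.Valid := True;
end;

// Picks the encoding from the BOM (UTF-8 if none) and skips the BOM.
function DecodeTextPayload(const Payload: TBytes): string;
var
  Enc: TEncoding;
  BomLen: Integer;
begin
  if Length(Payload) = 0 then
    Exit('');
  Enc := nil; // nil asks GetBufferEncoding to detect the encoding
  BomLen := TEncoding.GetBufferEncoding(Payload, Enc, TEncoding.UTF8);
  Result := Enc.GetString(Payload, BomLen, Length(Payload) - BomLen);
end;

var
  Parsed: TParsedMessage;
begin
  // Text message, 5 payload bytes: UTF-8 BOM followed by 'hi'.
  Parsed := ParseMessage(
    TBytes.Create(0, 5, 0, 0, 0, $EF, $BB, $BF, $68, $69));
  if Parsed.Valid and (Parsed.Kind = pkText) then
    Writeln(DecodeTextPayload(Parsed.Payload)); // prints hi
end.
```

With this split, a file payload can start with any bytes at all without being mistaken for a type marker, because the kind is read from the header and never from the payload.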
