gioma

Delphi 12 : Encoding Unicode strange behaviour


Posted (edited)

Hi,

I have this code:

 

procedure ProcessData(data: PByte; asize: Integer);
var
  data_pack: TBytes;
  t: TStreamWriter;
  DataType: Char;
  nflog: string;
begin
  nflog := 'C:\TEST\Log.txt';
  if FileExists(nflog) then
    t := TStreamWriter.Create(nflog, True, TEncoding.UTF8)
  else
    t := TStreamWriter.Create(nflog, False, TEncoding.UTF8);

  SetLength(data_pack, asize);
  Move(data^, data_pack[0], asize);
  // here the encoding is perfect
  t.WriteLine(TEncoding.Unicode.GetString(data_pack));
  t.Flush;

  DataType := TEncoding.Unicode.GetChars(data_pack)[0];

  Delete(data_pack, 0, 1);
  // here the content of data_pack seems to have become UTF-8 and is no longer decoded correctly
  t.WriteLine(TEncoding.Unicode.GetString(data_pack));
  t.Flush;
  t.Free;
end;

The "data" variable holds a Unicode string, but when Delphi executes the "delete" operation it seems to convert the contents of the data_pack variable to UTF-8, so the second value is not printed correctly.

Does anyone know why?

Edited by gioma


So if I have a Unicode string like 'ABCDEFGHI' and I put it into a byte array, then to remove the first character from the byte array I have to remove two bytes and not one?


Your life will be much easier if you rely on Strings and Chars for text manipulation, not bytes. Convert to bytes when your text manipulation is done, not before that.
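
To illustrate the suggestion, here is a minimal sketch that decodes the whole packet to a string first and only then splits off the first character. It assumes the packet is UTF-16LE text; SplitPacket is a made-up helper name, not something from the original code.

```delphi
program DecodeFirst;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: decode the whole UTF-16LE packet first, then split
// the type marker from the message using string operations only.
procedure SplitPacket(const Packet: TBytes; out DataType: Char; out Body: string);
var
  Decoded: string;
begin
  Decoded := TEncoding.Unicode.GetString(Packet);
  if Decoded = '' then
    raise Exception.Create('Empty packet');
  DataType := Decoded[1];              // first UTF-16 code unit
  Body := Copy(Decoded, 2, MaxInt);    // everything after it
end;

var
  Packet: TBytes;
  DataType: Char;
  Body: string;
begin
  // Stand-in for the bytes received from the data channel (assumption).
  Packet := TEncoding.Unicode.GetBytes('ABCDEFGHI');
  SplitPacket(Packet, DataType, Body);
  Assert(DataType = 'A');
  Assert(Body = 'BCDEFGHI');
  Writeln(DataType, ' / ', Body);
end.
```

Working on the decoded string means the code never has to count bytes per character itself.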

38 minutes ago, gioma said:

So if I have a Unicode string like 'ABCDEFGHI' and I put it into a byte array, then to remove the first character from the byte array I have to remove two bytes and not one?

Yes, two, or four for symbols outside the BMP (Basic Multilingual Plane), which UTF-16 encodes as a surrogate pair.
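
A small sketch of the byte counts, assuming TEncoding.Unicode (UTF-16LE) as in the original code. The sample strings are only illustrations; U+1F600 is written as its surrogate pair.

```delphi
program Utf16Sizes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Bmp, NonBmp: string;
  Data, Broken: TBytes;
begin
  Bmp := 'A';                  // inside the BMP: one UTF-16 code unit
  NonBmp := #$D83D#$DE00;      // U+1F600, outside the BMP: a surrogate pair
  Assert(TEncoding.Unicode.GetByteCount(Bmp) = 2);
  Assert(TEncoding.Unicode.GetByteCount(NonBmp) = 4);

  Data := TEncoding.Unicode.GetBytes('ABCDEFGHI');
  Assert(Length(Data) = 18);   // 9 characters, 2 bytes each

  // Removing a single byte shifts every following code unit by one byte,
  // so the remaining bytes no longer decode to the expected text.
  Broken := Copy(Data);
  Delete(Broken, 0, 1);
  Assert(TEncoding.Unicode.GetString(Broken) <> 'BCDEFGHI');

  // Removing one whole code unit (2 bytes) keeps the rest aligned.
  Delete(Data, 0, SizeOf(Char));
  Assert(TEncoding.Unicode.GetString(Data) = 'BCDEFGHI');
end.
```

Removing SizeOf(Char) bytes is only enough when the first character is inside the BMP; a surrogate pair would need four.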

6 minutes ago, Der schöne Günther said:

Your life will be much easier if you rely on Strings and Chars for text manipulation, not bytes. Convert to bytes when your text manipulation is done, not before that.

I know, but it's a packet received via datachannel that only transmits byte arrays.

7 minutes ago, Alexander Sviridenkov said:

Yes, two, or four for symbols outside the BMP (Basic Multilingual Plane), which UTF-16 encodes as a surrogate pair.

OK, thanks. Then I think this is exactly the problem.

3 hours ago, gioma said:

I know, but it's a packet received via datachannel that only transmits byte arrays.

The medium and the format in which your data arrives are irrelevant. Just assemble the text in your receiver routine and go on from there.

Posted (edited)
1 hour ago, aehimself said:

The medium and the format in which your data arrives are irrelevant. Just assemble the text in your receiver routine and go on from there.

In reality, that data can be a UTF-8 string, a Unicode (UTF-16) string, or a file.
The first character of the data tells me what type of data I have to work with.

For this reason I have to separate the header from the rest of the message.

Edited by gioma

16 hours ago, gioma said:

In reality, that data can be a UTF-8 string, a Unicode (UTF-16) string, or a file.
The first character of the data tells me what type of data I have to work with.

Then you don't know whether that first char is one byte or two.

Do you test the first byte (assuming UTF-8) and, if it is not what you expected, then test the first two bytes?

17 hours ago, gioma said:

The first character of the data tells me what type of data I have to work with.

For this reason I have to separate the header from the rest of the message.

This sounds completely broken. What if the file starts with the sentinel value that indicates that you have a UTF-8 payload?

Why aren't you passing a separate header, and then the payload? And why are you passing UTF-16 at all? Isn't that just expensive? And if you must pass around UTF-16, please tell me that you are handling byte order correctly.
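
To show why byte order matters, here is a minimal sketch using the two UTF-16 encodings that TEncoding provides. The one-character sample string is just an illustration.

```delphi
program ByteOrder;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  LE, BE: TBytes;
begin
  S := 'A';                                       // U+0041
  LE := TEncoding.Unicode.GetBytes(S);            // little-endian: 41 00
  BE := TEncoding.BigEndianUnicode.GetBytes(S);   // big-endian:    00 41
  Assert((LE[0] = $41) and (LE[1] = $00));
  Assert((BE[0] = $00) and (BE[1] = $41));

  // Decoding with the matching byte order round-trips the text.
  Assert(TEncoding.BigEndianUnicode.GetString(BE) = S);
  // Decoding big-endian bytes as little-endian yields U+4100 instead of 'A'.
  Assert(TEncoding.Unicode.GetString(BE) <> S);
end.
```

If sender and receiver do not agree on the byte order, every character comes out wrong even though the byte count is correct.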

On 3/4/2024 at 7:55 AM, gioma said:

In reality, that data can be a UTF-8 string, a Unicode (UTF-16) string, or a file.
The first character of the data tells me what type of data I have to work with.

That is not a good idea. Such an indicator should not be in the data payload itself; it should precede the payload, i.e. in a separate message header. Also, to differentiate between UTF-8 and UTF-16, you could use standard Unicode BOMs. So, for instance, have a message header that indicates whether the payload is text bytes or file bytes, and the total byte size. Then in the payload itself, if it is text, have it start with a BOM before the actual text bytes. In reality, though, it is generally not a good idea to use UTF-16 in data transmissions. Better to stick with just UTF-8, and convert to/from UTF-16 in memory only, if needed.
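
One possible shape for this, as a rough sketch only: the TMessageHeader layout, the kind values, and the helper names are all made up for illustration, and the size field is simply stored in host byte order. The BOM detection uses TEncoding.GetBufferEncoding, falling back to UTF-8 when no BOM is present.

```delphi
program HeaderAndBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical wire header, not taken from the thread: payload kind and
  // payload size in bytes. Size is stored in host byte order here.
  TMessageHeader = packed record
    Kind: Byte;      // made-up values: 0 = text, 1 = file
    Size: UInt32;
  end;

// Hypothetical helper: prepend the header to a payload.
function BuildMessage(Kind: Byte; const Payload: TBytes): TBytes;
var
  Hdr: TMessageHeader;
begin
  Hdr.Kind := Kind;
  Hdr.Size := Length(Payload);
  SetLength(Result, SizeOf(Hdr) + Length(Payload));
  Move(Hdr, Result[0], SizeOf(Hdr));
  if Length(Payload) > 0 then
    Move(Payload[0], Result[SizeOf(Hdr)], Length(Payload));
end;

// Hypothetical helper: decode a text payload whose encoding is given by its BOM.
function DecodeTextPayload(const Payload: TBytes): string;
var
  Enc: TEncoding;
  BomLen: Integer;
begin
  Enc := nil; // nil asks GetBufferEncoding to detect the encoding from the BOM
  BomLen := TEncoding.GetBufferEncoding(Payload, Enc, TEncoding.UTF8);
  Result := Enc.GetString(Payload, BomLen, Length(Payload) - BomLen);
end;

var
  Msg, Payload: TBytes;
  Hdr: TMessageHeader;
begin
  // Sender side: UTF-16LE text with its BOM, wrapped in a header.
  Payload := TEncoding.Unicode.GetPreamble + TEncoding.Unicode.GetBytes('ABCDEFGHI');
  Msg := BuildMessage(0, Payload);

  // Receiver side: read the header, then slice out exactly Size payload bytes.
  Move(Msg[0], Hdr, SizeOf(Hdr));
  Assert(Hdr.Kind = 0);
  Assert(Hdr.Size = 20);       // 2 BOM bytes + 18 text bytes
  Payload := Copy(Msg, SizeOf(Hdr), Integer(Hdr.Size));
  Assert(DecodeTextPayload(Payload) = 'ABCDEFGHI');

  // The same decoder handles a UTF-8 payload that starts with a UTF-8 BOM.
  Payload := TEncoding.UTF8.GetPreamble + TEncoding.UTF8.GetBytes('ABCDEFGHI');
  Assert(DecodeTextPayload(Payload) = 'ABCDEFGHI');
end.
```

With the kind in the header, a file payload can start with any bytes at all without being mistaken for text.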
