gioma, posted March 4 (edited)

Hi, I have this code:

    procedure ProcessData(data: PByte; asize: Integer);
    var
      nflog: string;
      data_pack: TBytes;
      t: TStreamWriter;
      DataType: Char;
    begin
      nflog := 'C:\TEST\Log.txt';
      // Append to the log if it already exists, otherwise create it
      if FileExists(nflog) then
        t := TStreamWriter.Create(nflog, True, TEncoding.UTF8)
      else
        t := TStreamWriter.Create(nflog, False, TEncoding.UTF8);

      SetLength(data_pack, asize);
      Move(data^, data_pack[0], asize);

      // here the encoding is correct
      t.WriteLine(TEncoding.Unicode.GetString(data_pack));
      t.Flush;

      DataType := TEncoding.Unicode.GetChars(data_pack)[0];
      Delete(data_pack, 0, 1);

      // here the content of data_pack seems to have become UTF-8,
      // so it no longer decodes correctly
      t.WriteLine(TEncoding.Unicode.GetString(data_pack));
      t.Flush;
      t.Free;
    end;

The "data" variable holds a Unicode string, but when Delphi executes the "Delete" operation it seems to convert the contents of the data_pack variable to UTF-8, so the second value is printed incorrectly. Does anyone know why?
Alexander Sviridenkov, posted March 4

A Unicode (UTF-16) character occupies at least 2 bytes, so after deleting 1 byte the remaining bytes no longer form a valid string.
gioma, posted March 4

So if I have a Unicode string like 'ABCDEFGHI' and I put it in a byte array, then to remove the first character from the byte array I have to remove two bytes and not one?
Der schöne Günther, posted March 4

Your life will be much easier if you rely on Strings and Chars for text manipulation, not bytes. Convert to bytes when your text manipulation is done, not before that.
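To illustrate the advice above, here is a minimal sketch that decodes the received bytes first and only then splits off the first character with string operations. It assumes the buffer is UTF-16LE (what TEncoding.Unicode expects) and that the type marker is a single UTF-16 code unit; the helper name SplitFirstChar and the sample text are made up for the example.

```delphi
program DecodeThenEdit;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: decodes a UTF-16LE buffer, then separates the first
// Char from the rest using string operations instead of byte arithmetic.
procedure SplitFirstChar(const Raw: TBytes; out First: Char; out Rest: string);
var
  S: string;
begin
  if Length(Raw) = 0 then
    raise Exception.Create('Empty buffer');
  S := TEncoding.Unicode.GetString(Raw);
  if S = '' then
    raise Exception.Create('Buffer did not decode to any text');
  First := S[1];               // Delphi strings are 1-based by default
  Rest := Copy(S, 2, MaxInt);  // everything after the first Char
end;

var
  First: Char;
  Rest: string;
begin
  // Sample data: 'T' stands in for an assumed type marker.
  SplitFirstChar(TEncoding.Unicode.GetBytes('TABCDEFGHI'), First, Rest);
  Writeln(First, ' | ', Rest);  // prints "T | ABCDEFGHI"
end.
```

Working on the decoded string means the two-bytes-per-Char detail never has to be handled by hand; the bytes are only touched at the boundary.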
Alexander Sviridenkov, posted March 4

38 minutes ago, gioma said:
"then if I have a unicode string like this: 'ABCDEFGHI' and I put it inside a byte array if I want to remove the first character from the byte array do I have to remove two bytes and not one?"

Yes, two, or four for characters outside the BMP (Basic Multilingual Plane), which UTF-16 encodes as a surrogate pair.
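The "two or four" rule can be sketched directly on the byte array. This is only an illustration, assuming the buffer is UTF-16LE and starts on a character boundary; FirstCharByteCount is a hypothetical helper, not an RTL function.

```delphi
program DropFirstChar;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: number of bytes taken by the first character of a
// UTF-16LE buffer. 2 for a BMP character, 4 when the first code unit is a
// high surrogate ($D800..$DBFF), 0 if the buffer is too short.
function FirstCharByteCount(const Buf: TBytes): Integer;
var
  Unit0: Word;
begin
  if Length(Buf) < 2 then
    Exit(0);
  // Assemble the first 16-bit code unit from little-endian bytes.
  Unit0 := Buf[0] or (Word(Buf[1]) shl 8);
  if (Unit0 >= $D800) and (Unit0 <= $DBFF) and (Length(Buf) >= 4) then
    Result := 4
  else
    Result := 2;
end;

var
  Data: TBytes;
begin
  Data := TEncoding.Unicode.GetBytes('ABCDEFGHI');
  // Remove whole characters, never a single byte.
  Delete(Data, 0, FirstCharByteCount(Data));
  Writeln(TEncoding.Unicode.GetString(Data));  // prints "BCDEFGHI"
end.
```

Deleting a single byte instead shifts every following code unit by half a character, which is why the second line in the original log looked as if the encoding had changed.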
gioma, posted March 4

6 minutes ago, Der schöne Günther said:
"Your life will be much easier if you rely on Strings and Chars for text manipulation, not bytes. Convert to bytes when your text manipulation is done, not before that."

I know, but it's a packet received via a data channel that only transmits byte arrays.
gioma, posted March 4

7 minutes ago, Alexander Sviridenkov said:
"Yes, two or four (for symbols behind BMP)."

OK, thanks. Then I think this is exactly the problem.
aehimself, posted March 4

3 hours ago, gioma said:
"I know, but it's a packet received via datachannel that only transmits byte arrays."

The medium and the format in which your data arrives are irrelevant. Just assemble the text in your receiver routine and go on from there.
gioma, posted March 4 (edited)

1 hour ago, aehimself said:
"The medium and the format where your data arrives is irrelevant. Just assemble the text in your receiver routine and go on from there."

In reality, that data can be a UTF-8 string, a Unicode (UTF-16) string, or a file. The first character of the data tells me what type of data I have to work with. For this reason I have to separate the header from the rest of the message.
Cristian Peța, posted March 5

16 hours ago, gioma said:
"in reality that data can be a UTF8 string or a UNICODE string or a file. The first character of the data tells me what type of data I have to work with."

Then you don't know whether that first char is one byte or two. Do you test the first byte (assuming UTF-8) and, if it is not what you expected, then test the first two bytes?
David Heffernan, posted March 5

17 hours ago, gioma said:
"The first character of the data tells me what type of data I have to work with. For this reason I have to separate the header from the rest of the message."

This sounds completely broken. What if the file starts with the sentinel value that indicates that you have a UTF-8 payload? Why aren't you passing a separate header, and then the payload? And why are you passing UTF-16 at all? Isn't that just expensive? And if you must pass around UTF-16, please tell me that you are handling byte order correctly.
Remy Lebeau, posted March 5

On 3/4/2024 at 7:55 AM, gioma said:
"in reality that data can be a UTF8 string or a UNICODE string or a file. The first character of the data tells me what type of data I have to work with."

That is not a good idea. Such an indicator should not be in the data payload itself; it should precede the payload, i.e. in a separate message header. Also, to differentiate between UTF-8 and UTF-16, you could use standard Unicode BOMs.

So, for instance, have a message header that indicates whether the payload is text bytes or file bytes, and the total byte size. Then in the payload itself, if it is text, have it start with a BOM before the actual text bytes.

Although, in reality, it is generally not a good idea to use UTF-16 in data transmissions. Better to stick with just UTF-8. Convert to/from UTF-16 in memory only, if needed.
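The header-plus-BOM layout described above might look roughly like the following sketch. Everything about the wire format here is an assumption made for illustration: the TMessageHeader record, the Kind values (0 = text, 1 = file), and sending Size in the machine's native byte order. BOM detection uses TEncoding.GetBufferEncoding, which returns the BOM length and falls back to the supplied default (UTF-8 here) when no BOM is present.

```delphi
program HeaderAndBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical wire header: one kind byte, then the payload size.
  TMessageHeader = packed record
    Kind: Byte;      // assumed values: 0 = text, 1 = file
    Size: Cardinal;  // payload size in bytes (native byte order assumed)
  end;

// Builds a text message: header, UTF-8 BOM, UTF-8 text bytes.
function BuildTextMessage(const S: string): TBytes;
var
  Bom, Body: TBytes;
  Hdr: TMessageHeader;
begin
  Bom := TEncoding.UTF8.GetPreamble;
  Body := TEncoding.UTF8.GetBytes(S);
  Hdr.Kind := 0;
  Hdr.Size := Length(Bom) + Length(Body);
  SetLength(Result, SizeOf(Hdr) + Length(Bom) + Length(Body));
  Move(Hdr, Result[0], SizeOf(Hdr));
  Move(Bom[0], Result[SizeOf(Hdr)], Length(Bom));
  if Length(Body) > 0 then
    Move(Body[0], Result[SizeOf(Hdr) + Length(Bom)], Length(Body));
end;

// Reads a text message: the header says what the payload is, the BOM
// (if any) says how the text is encoded.
function ReadTextMessage(const Msg: TBytes): string;
var
  Hdr: TMessageHeader;
  Payload: TBytes;
  Enc: TEncoding;
  BomLen: Integer;
begin
  if Length(Msg) < SizeOf(Hdr) then
    raise Exception.Create('Message shorter than header');
  Move(Msg[0], Hdr, SizeOf(Hdr));
  if Hdr.Kind <> 0 then
    raise Exception.Create('Not a text message');
  if Cardinal(Length(Msg) - SizeOf(Hdr)) < Hdr.Size then
    raise Exception.Create('Payload is truncated');
  Payload := Copy(Msg, SizeOf(Hdr), Integer(Hdr.Size));
  if Length(Payload) = 0 then
    Exit('');
  Enc := nil;  // nil asks GetBufferEncoding to detect the encoding from the BOM
  BomLen := TEncoding.GetBufferEncoding(Payload, Enc, TEncoding.UTF8);
  if BomLen >= Length(Payload) then
    Exit('');  // payload contained only a BOM
  Result := Enc.GetString(Payload, BomLen, Length(Payload) - BomLen);
end;

begin
  Writeln(ReadTextMessage(BuildTextMessage('ABCDEFGHI')));  // prints "ABCDEFGHI"
end.
```

Because the kind lives in the header, a file payload can start with any bytes without being mistaken for a text marker, and the BOM also distinguishes UTF-16 little-endian from big-endian if UTF-16 must be supported.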