aehimself Posted May 12, 2021

While digging in the depths of a legacy application I was shocked to see that binary data received over the network is stored and handled as a String. And it works. Imagine the following code:

procedure TForm1.Button1Click(Sender: TObject);
Var
  tb: TBytes;
  s: String;
begin
  tb := TFile.ReadAllBytes('C:\temp\Project1.exe');
  SetLength(s, Length(tb));
  Move(tb[0], s[1], Length(tb));
  // If CompareMem(@tb[0], @s[1], Length(tb)) Then ShowMessage('Contents are the same');
  TFile.WriteAllText('C:\temp\project2.exe', s, TEncoding.Default);
end;

This fails. It produces the same number of bytes, but the output doesn't work. However, just by introducing a string cast:

procedure TForm1.Button1Click(Sender: TObject);
Var
  tb: TBytes;
  s: String;
  ans: AnsiString;
begin
  tb := TFile.ReadAllBytes('C:\temp\Project1.exe');
  SetLength(ans, Length(tb));
  Move(tb[0], ans[1], Length(tb));
  s := String(ans);
  TFile.WriteAllText('C:\temp\Project2.exe', s, TEncoding.Default);
end;

the output executable is... well, executable. My bet is on some pointer magic Delphi is doing in the background, but can someone please explain WHY this works?!

Remy Lebeau Posted May 12, 2021

1 hour ago, aehimself said:
> While digging in the depths of a legacy application I was shocked to see that binary data received over the network is stored and handled as a String.

Most likely, that code predated the shift to Unicode in Delphi 2009.

1 hour ago, aehimself said:
> And it works.

No, it doesn't. It has the potential to corrupt the data. This is exactly why you SHOULD NOT put binary data into a UnicodeString.

1 hour ago, aehimself said:
> Imagine the following code:

Doesn't work. It fills only half of the UnicodeString's memory with the non-textual binary (because SizeOf(WideChar) is 2, so the SetLength() is allocating twice the number of bytes that were read in), then converts the entire UnicodeString (including the unused bytes) from UTF-16 to ANSI, producing complete garbage, and then writes that garbage as-is to file. So yes, the same number of bytes MAY be written as were read in (but that is not guaranteed), but those bytes are useless.

1 hour ago, aehimself said:
> However, just by introducing a string cast:

That code is copying the binary as-is into an AnsiString of equal byte size, converting that AnsiString to a UTF-16 UnicodeString using the user's default locale, then converting that UnicodeString from UTF-16 back to ANSI using the same locale. Depending on the particular locale used, that MAY be a lossy conversion: you MIGHT end up with the same bytes that you started with, or you MIGHT NOT.

1 hour ago, aehimself said:
> My bet is on some pointer magic Delphi is doing in the background

This has nothing to do with pointers. You are simply performing 2 separate data conversions (binary/ANSI -> UTF-16 -> binary/ANSI) that just HAPPEN to produce the same results as the input IN YOUR ENVIRONMENT.

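For contrast, here is a minimal sketch of the conversion-free route, where the bytes never touch a string at all, so no charset conversion can corrupt them (the helper name and paths are illustrative, not from the original code):

uses
  System.SysUtils, System.IOUtils;

// Copies a file as raw bytes; no string, no charset conversion, no data loss.
procedure CopyFileAsBytes(const SourceFile, DestFile: string);
var
  tb: TBytes;
begin
  tb := TFile.ReadAllBytes(SourceFile);  // raw bytes in
  TFile.WriteAllBytes(DestFile, tb);     // the same raw bytes out
end;
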
aehimself Posted May 12, 2021

36 minutes ago, Remy Lebeau said:
> Most likely, that code predated the shift to Unicode in Delphi 2009.

Not most likely, it does. It was Delphi 6 or 7, way before I joined the company.

36 minutes ago, Remy Lebeau said:
> This is exactly why you SHOULD NOT put binary data into a UnicodeString.

I know, this is why I was really surprised that it actually works like this. We are just lucky with our locale, it seems 🙂

36 minutes ago, Remy Lebeau said:
> Doesn't work. It fills only half of the UnicodeString's memory with the non-textual binary ...

The first one was only a demonstration; I knew it wouldn't work. I found it strange that the output byte count is the same as the input (despite the doubled size, as you pointed out), though. I guess I was lucky with the random choice of exe.

36 minutes ago, Remy Lebeau said:
> That code is copying the binary as-is into an AnsiString of equal byte size ... you MIGHT end up with the same bytes that you started with, or you MIGHT NOT.

So if I get it right... we read the binary data, doubling its size as we pad each character with a #0 during the AnsiString -> String conversion? The real code is creating a TStringStream out of this and passing it as a parameter to a method which is expecting a stream. That method will access the contents with .Seek and .Read, I suppose. I didn't test this, but am I safe to assume that this would include the extra #0s, causing the binary data to be corrupted?

aehimself Posted May 12, 2021

Update: I was wrong, the #0s are not present.

Var
  ss: TStringStream;
  s: TStream;
  tb: TBytes;
begin
  ss := TStringStream.Create('Árvíztűrő tükörfúrógép');
  Try
    s := ss;
    SetLength(tb, s.Size);
    s.Read(tb, Length(tb));
    ShowMessage(TEncoding.Default.GetString(tb));
  Finally
    FreeAndNil(ss);
  End;
end;

It is a lossy conversion as you mentioned, though.

Remy Lebeau Posted May 12, 2021

3 hours ago, aehimself said:
> We are just lucky with our locale, it seems 🙂

Yes, quite lucky.

3 hours ago, aehimself said:
> I found it strange that the output byte count is the same as the input (despite the doubled size, as you pointed out), though.

Most ANSI locales use 1 byte per character, and UTF-16 uses 1 codeunit per character for most Western languages. So you usually end up with a 1 byte -> 2 bytes -> 1 byte conversion, which is why the final size was the same byte size, but the bytes may or may not be the same as the original.

3 hours ago, aehimself said:
> So if I get it right... we read the binary data, doubling its size as we pad each character with a #0 during the AnsiString -> String conversion?

There is more involved than just nul-padding, which typically applies only to bytes in the $00..$7F (ASCII) range. For non-ASCII characters, it is not a matter of simply padding '<HH>' to '<HH>#0'; there is an actual mapping process involved. For example, if Windows-1252 were the locale used for the conversion, and byte $80 (Euro) were encountered, it would be converted to the Unicode character U+20AC, which is bytes $AC $20 in UTF-16LE, not $80 $00 like you are thinking.

But yes, the individual bytes of the EXE data would mostly get doubled when converted to Unicode, and then truncated to single bytes when converted back to ANSI. But that does not necessarily mean that you will end up with the same bytes that you started with. For example, using Windows-1252 again, byte $81 (amongst a few others) would end up converted to either Unicode character U+FFFD (Replacement Character) or U+003F (ASCII '?') depending on the converter's implementation, which would thus be bytes $FD $FF or $3F $00 in UTF-16LE respectively, and then converted back to ANSI as byte $3F, which is clearly not the same as the original.

If you absolutely need a charset that ensures no data loss when round-tripping bytes from ANSI to Unicode to ANSI, you can use codepage 437 for that; see "Is there a code page that matches ASCII and can round trip arbitrary bytes through Unicode?". The Unicode won't have the same character values as the original bytes in the ranges of $00..$1F and $7F..$FF, but the result of converting the Unicode back to codepage 437 will produce the original bytes.

3 hours ago, aehimself said:
> The real code is creating a TStringStream out of this and passing it as a parameter to a method which is expecting a stream... am I safe to assume that this would include the extra #0s, causing the binary data to be corrupted?

Nul-padding is not guaranteed, but yes, the String data inside the stream can get messed up. Do not use a TStringStream for binary data. Use a TMemoryStream or TBytesStream instead.

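A minimal sketch of both suggestions (the sample byte values are arbitrary, chosen only for illustration): binary data goes into a TBytesStream, and codepage 437 is used only if a string round-trip truly cannot be avoided. Note that TEncoding.GetEncoding returns an instance the caller must free.

uses
  System.SysUtils, System.Classes;

procedure BinaryStreamDemo;
var
  tb, back: TBytes;
  bs: TBytesStream;
  cp437: TEncoding;
  s: string;
begin
  tb := TBytes.Create($00, $1F, $80, $81, $FF);  // arbitrary binary sample

  // Binary data belongs in a byte-based stream, not a string-based one.
  bs := TBytesStream.Create(tb);
  try
    // ... pass bs to any method expecting a TStream ...
  finally
    bs.Free;
  end;

  // If a string round-trip is unavoidable, codepage 437 maps every byte
  // value to a distinct Unicode character, so ANSI -> UTF-16 -> ANSI is lossless.
  cp437 := TEncoding.GetEncoding(437);
  try
    s := cp437.GetString(tb);
    back := cp437.GetBytes(s);
    Assert((Length(back) = Length(tb)) and CompareMem(@tb[0], @back[0], Length(tb)));
  finally
    cp437.Free;
  end;
end;
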
aehimself Posted May 12, 2021

Thank you for the explanation, it all makes sense now. I don't really need workarounds; if this area is ever touched, it will be changed to work properly instead of being patched with a hack-of-a-hack... Don't get me wrong - I know this should not be done, because it is not supposed to work; this is exactly why I was so surprised that it does in our case. The sole purpose of this topic was for me to understand why it works when it should not 🙂

Fr0sT.Brutal Posted May 13, 2021

AFAIK an ANSI codepage usually has all bytes 1..255 mapped to some character, so an ANSI => UTF-16 => ANSI conversion results in the same data. That's why it worked.

Rollo62 Posted May 13, 2021

11 hours ago, aehimself said:
> Thank you for the explanation, it all makes sense now. I don't really need workarounds; if this area is ever touched, it will be changed to work properly instead of being patched with a hack-of-a-hack...

Just keep it as-is... but add a BIG !! WARNING !! and an explanatory comment (maybe pointing to this thread).

Lajos Juhász Posted May 13, 2021

3 hours ago, Fr0sT.Brutal said:
> AFAIK an ANSI codepage usually has all bytes 1..255 mapped to some character, so an ANSI => UTF-16 => ANSI conversion results in the same data. That's why it worked.

Try https://en.wikipedia.org/wiki/Windows-1251: one character is undefined (hex 98). In https://en.wikipedia.org/wiki/Windows-1250 there are 5 undefined characters.

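A quick way to see how such an undefined byte fares in the round-trip (a sketch; as noted earlier in the thread, the outcome depends on the converter's implementation, so the result may vary by system):

uses
  System.SysUtils, Vcl.Dialogs;

procedure RoundTripUndefinedByte;
var
  cp1251: TEncoding;
  src, back: TBytes;
begin
  cp1251 := TEncoding.GetEncoding(1251);  // Windows-1251
  try
    src := TBytes.Create($98);                       // undefined in Windows-1251
    back := cp1251.GetBytes(cp1251.GetString(src));  // ANSI -> UTF-16 -> ANSI
    // Depending on the converter, back[0] may be $98 again (mapped through a
    // C1 control character) or $3F ('?') after a replacement-character fallback.
    ShowMessage(IntToHex(back[0], 2));
  finally
    cp1251.Free;
  end;
end;
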
Fr0sT.Brutal Posted May 13, 2021

25 minutes ago, Lajos Juhász said:
> Try https://en.wikipedia.org/wiki/Windows-1251: one character is undefined (hex 98). In https://en.wikipedia.org/wiki/Windows-1250 there are 5 undefined characters.

Funny, I wasn't aware of that. I wonder why these holes exist. Did their creators think, "OK, we already have all the characters we need, let's leave some lacunas here and there"?

Vandrovnik Posted May 13, 2021

31 minutes ago, Lajos Juhász said:
> Try https://en.wikipedia.org/wiki/Windows-1251: one character is undefined (hex 98). In https://en.wikipedia.org/wiki/Windows-1250 there are 5 undefined characters.

But even when I put #129 into a txt file (Notepad, Alt+0129) and save it, it is saved as bytes C2 81. Notepad does not display any character there, but it is present (the cursor "stays" there when moving with the arrow keys).

Remy Lebeau Posted May 13, 2021

10 hours ago, Fr0sT.Brutal said:
> AFAIK an ANSI codepage usually has all bytes 1..255 mapped to some character

Many ANSI charsets DON'T have all 256 bytes mapped.

Remy Lebeau Posted May 13, 2021

6 hours ago, Lajos Juhász said:
> Try https://en.wikipedia.org/wiki/Windows-1251: one character is undefined (hex 98). In https://en.wikipedia.org/wiki/Windows-1250 there are 5 undefined characters.

In https://en.wikipedia.org/wiki/ISO/IEC_8859-1 there are 65 undefined characters.

Remy Lebeau Posted May 13, 2021

5 hours ago, Vandrovnik said:
> But even when I put #129 into a txt file (Notepad, Alt+0129) and save it, it is saved as bytes C2 81.

That is the UTF-8 (not ANSI) encoded form of the Unicode codepoint U+0081, which is a non-visual control character.

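This is easy to verify in code (a minimal sketch):

uses
  System.SysUtils, Vcl.Dialogs;

procedure ShowUtf8FormOfU0081;
var
  b: TBytes;
begin
  // Encode the single character U+0081 as UTF-8.
  b := TEncoding.UTF8.GetBytes(#$0081);
  // Displays 'C2 81', the same bytes Notepad wrote.
  ShowMessage(IntToHex(b[0], 2) + ' ' + IntToHex(b[1], 2));
end;
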
skyzoframe[hun] Posted June 14, 2021

On 5/13/2021 at 7:58 PM, Remy Lebeau said:
> In https://en.wikipedia.org/wiki/ISO/IEC_8859-1 there are 65 undefined characters.

There is an extended version: https://en.wikipedia.org/wiki/ISO/IEC_8859-2. Here "ő" and "ű" aren't missing.