baoquan.zuo

Blog: Byte Loss in String-Literal Concatenation


Hi,

 

I'd like to share a blog post. It looks into a byte-loss issue that came up in a discussion.

// Compile with code page 936
program Problem;

const
  strPublicKey: RawByteString =
    #$30#$3C#$30#$0D#$06#$09#$2A#$86#$48#$86#$F7#$0D#$01#$01#$01#$05 +
    #$00#$03#$2B#$00#$30#$28#$02#$21#$00#$A4#$65#$B8#$CD#$B4#$29#$A9 +
    #$64#$1A#$C5#$80#$55#$22#$1B#$BB#$C5#$98#$36#$B9#$23#$0C#$CA#$D4 +
    #$A8#$B8#$7C#$E6#$32#$E3#$89#$3D#$77#$02#$03#$01#$00#$01;
begin
  Writeln(Length(strPublicKey)); // expected 62 got 58 - why?
  Readln;
end.

https://devjetsoftware.com/delphi/byte-loss-in-string-literal-concatenation/


What is wrong with the world of Delphi programmers that in 2025 there are still people who can't understand the difference between text and bytes? 

 

The article you link to goes on and on about text but your data is bytes. Why not just use the correct data type? 

2 minutes ago, David Heffernan said:

What is wrong with the world of Delphi programmers that in 2025 there are still people who can't understand the difference between text and bytes? 

 

The article you link to goes on and on about text but your data is bytes. Why not just use the correct data type? 

Yes — it’s clear we should use the proper data type to represent raw bytes.

 

What the article (and my curiosity) really digs into is why the data loss occurs and how DCC handles string literals (as there is no formal Delphi language specification). That matters to me because I’m also writing some Delphi compiler-frontend code in my products.

 

In any case, these insights should help when migrating legacy ANSI-based Delphi projects.

5 minutes ago, baoquan.zuo said:

In any case, these insights should help when migrating legacy ANSI-based Delphi projects

I don't see this as helpful to anyone. Use bytes to represent bytes. Use strings to represent text. Don't use ANSI strings. 

13 minutes ago, David Heffernan said:

I don't see this as helpful to anyone. Use bytes to represent bytes. Use strings to represent text. Don't use ANSI strings. 

Thanks for sharing your view.


Of course there ARE times when the use of ANSI strings makes sense. One example is when sending data to/from an external device down an RS232 port where the external device uses a protocol based on simple ANSI text. We have many real world cases such as this (eg Eurotherm temperature controllers). The key point that @David Heffernan makes is that you should choose your types carefully to closely (or exactly!) reflect your needs. Time spent thinking carefully about your type selection will save you time in the long run.....

2 hours ago, Roger Cigol said:

Of course there ARE times when the use of ANSI strings makes sense. One example is when sending data to/from an external device down an RS232 port where the external device uses a protocol based on simple ANSI text. We have many real world cases such as this (eg Eurotherm temperature controllers). The key point that @David Heffernan makes is that you should choose your types carefully to closely (or exactly!) reflect your needs. Time spent thinking carefully about your type selection will save you time in the long run.....

I mean, you work with strings and do TEncoding.ASCII.GetBytes
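
For the serial-port case, that suggestion could be sketched roughly as follows. The command text is hypothetical and not taken from any real device protocol; only `TEncoding.ASCII.GetBytes` comes from the post above.

```delphi
program AsciiBytes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Command: string;   // text stays in a normal (Unicode) string
  Payload: TBytes;   // bytes that would go down the wire
begin
  // Hypothetical command: three ASCII characters plus a carriage return.
  Command := 'PV?'#13;

  // Convert text to bytes explicitly, at the boundary where bytes are needed.
  Payload := TEncoding.ASCII.GetBytes(Command);

  Writeln(Length(Payload)); // one byte per ASCII character
end.
```

The conversion happens once, at the point where the data leaves the program, so the rest of the code never has to treat text as bytes or bytes as text.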

20 minutes ago, David Heffernan said:

I mean, you work with strings and do TEncoding.ASCII.GetBytes

All good ! There is more than one way to skin a cat.....

6 hours ago, David Heffernan said:

I don't see this as helpful to anyone.

That's a bit... unnuanced.

 

I thought it was helpful. Even though the "problem" is pretty obscure, and I haven't encountered it myself, it's a point of data that might come in handy some day.

16 hours ago, Roger Cigol said:

All good ! There is more than one way to skin a cat.....

This way is reliable and works

15 hours ago, Anders Melander said:

That's a bit... unnuanced.

 

I thought it was helpful. Even though the "problem" is pretty obscure, and I haven't encountered it myself, it's a point of data that might come in handy some day.

My point is that it's behaviour that you don't ever need to know because the correct way to handle byte data is as, well, bytes and not text. So for sure there's an algorithm, but it's not one that anyone actually needs to know. 


I forgot to mention that, in the original case, the proper solution is to use a byte array. I’d assumed this was common knowledge, but I should have spelled it out.
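
As a minimal sketch of that byte-array approach, here is the key from the original program declared as bytes. The names `Fixed` and `PublicKey` are made up for illustration; the byte values are copied from the original literal.

```delphi
program Fixed;

{$APPTYPE CONSOLE}

const
  // The same 62 bytes as in the original program, declared as bytes
  // instead of text, so no code-page conversion is ever involved.
  PublicKey: array[0..61] of Byte = (
    $30, $3C, $30, $0D, $06, $09, $2A, $86, $48, $86, $F7, $0D, $01, $01, $01, $05,
    $00, $03, $2B, $00, $30, $28, $02, $21, $00, $A4, $65, $B8, $CD, $B4, $29, $A9,
    $64, $1A, $C5, $80, $55, $22, $1B, $BB, $C5, $98, $36, $B9, $23, $0C, $CA, $D4,
    $A8, $B8, $7C, $E6, $32, $E3, $89, $3D, $77, $02, $03, $01, $00, $01);

begin
  Writeln(Length(PublicKey)); // 62, regardless of the source code page
  Readln;
end.
```

Because the array bounds are fixed, the compiler also rejects the declaration if the number of listed bytes does not match, which catches a dropped or duplicated byte at compile time.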

 

As I wrote at the beginning:

Quote

I was curious about the data-loss issue, so I decided to investigate it.

 

I simply documented the journey, shared it, and hope it helps someone. At the very least, the exercise deepened my understanding of character encoding and how dcc handles string literals.

 

In the end, it’s just an article. If you skimmed it, read the conclusion, and found nothing useful -- no worries, and thanks for taking a look.

5 hours ago, baoquan.zuo said:

I’d assumed this was common knowledge, but I should have spelled it out.

That makes a lot more sense. 

 

Assumed it was common knowledge? I'm not so sure. I think there's still a big underbelly of Delphi coders that don't get this. 


Yes. I was a bit surprised that, when CnPack published the Chinese translation, advwang mentioned he had reported the issue (RSP-20624) back in 2018. The issue was closed as 'Work as Designed', with a suggestion to add a warning in cases of potential data loss. He also said Eurekalog 7.0 used this approach in its shellcode but later switched to a byte array.

 

By the way, I added this paragraph to the introduction:

Quote

Note: Generally, the correct approach is to use a byte array to represent binary data, since strings are intended for textual content. You may also skip the analysis and jump directly to the Conclusion section.

and improved the Conclusion section:

Quote

Why does data loss occur? In short, it is caused by converting an invalid AnsiString to a UnicodeString. Invalid byte sequences are replaced with the ? character. But how does this happen exactly? What's the underlying reason?

It now also revisits the Original Program section:

Quote

The first string literal is interpreted as a UnicodeString without any data loss. However, since the three subsequent string literals contain invalid byte sequences in CP936, they are treated as AnsiStrings. When the compiler encounters a UnicodeString + AnsiString operation, it converts the AnsiString into a UnicodeString. During this conversion, any invalid byte sequences (whether a single invalid byte or a byte pair with a valid lead byte followed by an invalid or missing trail byte) are replaced with the ? character.

 
