Blog: Byte Loss in String-Literal Concatenation

baoquan.zuo · April 24

Hi,

I'd like to share a blog post. It addresses a byte-loss issue that came up in a discussion.

// Compile with code page 936
program Problem;

const
  strPublicKey: RawByteString =
    #$30#$3C#$30#$0D#$06#$09#$2A#$86#$48#$86#$F7#$0D#$01#$01#$01#$05 +
    #$00#$03#$2B#$00#$30#$28#$02#$21#$00#$A4#$65#$B8#$CD#$B4#$29#$A9 +
    #$64#$1A#$C5#$80#$55#$22#$1B#$BB#$C5#$98#$36#$B9#$23#$0C#$CA#$D4 +
    #$A8#$B8#$7C#$E6#$32#$E3#$89#$3D#$77#$02#$03#$01#$00#$01;
begin
  Writeln(Length(strPublicKey)); // expected 62 got 58 - why?
  Readln;
end.

https://devjetsoftware.com/delphi/byte-loss-in-string-literal-concatenation/

David Heffernan · April 24

What is wrong with the world of Delphi programmers that in 2025 there are still people who can't understand the difference between text and bytes?

The article you link to goes on and on about text but your data is bytes. Why not just use the correct data type?

baoquan.zuo · April 24

2 minutes ago, David Heffernan said:

What is wrong with the world of Delphi programmers that in 2025 there are still people who can't understand the difference between text and bytes?

The article you link to goes on and on about text but your data is bytes. Why not just use the correct data type?

Yes — it’s clear we should use the proper data type to represent raw bytes.

What the article (and my curiosity) really digs into is why the data loss occurs and how DCC handles string literals (as there is no formal Delphi language specification). That matters to me because I’m also writing some Delphi compiler-frontend code in my products.

In any case, these insights should help when migrating legacy ANSI-based Delphi projects.

David Heffernan · April 24

5 minutes ago, baoquan.zuo said:

In any case, these insights should help when migrating legacy ANSI-based Delphi projects

I don't see this as helpful to anyone. Use bytes to represent bytes. Use strings to represent text. Don't use ANSI strings.

baoquan.zuo · April 24

13 minutes ago, David Heffernan said:

I don't see this as helpful to anyone. Use bytes to represent bytes. Use strings to represent text. Don't use ANSI strings.

Thanks for sharing your view.

Roger Cigol · April 24

Of course there ARE times when the use of ANSI strings makes sense. One example is when sending data to/from an external device down an RS232 port where the external device uses a protocol based on simple ANSI text. We have many real world cases such as this (eg Eurotherm temperature controllers). The key point that @David Heffernan makes is that you should choose your types carefully to closely (or exactly!) reflect your needs. Time spent thinking carefully about your type selection will save you time in the long run.....

David Heffernan · April 24

2 hours ago, Roger Cigol said:

Of course there ARE times when the use of ANSI strings makes sense. One example is when sending data to/from an external device down an RS232 port where the external device uses a protocol based on simple ANSI text. We have many real world cases such as this (eg Eurotherm temperature controllers). The key point that @David Heffernan makes is that you should choose your types carefully to closely (or exactly!) reflect your needs. Time spent thinking carefully about your type selection will save you time in the long run.....

I mean, you work with strings and do TEncoding.ASCII.GetBytes
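As a minimal sketch of that suggestion: keep the protocol text in an ordinary string and convert it to bytes only at the point of transmission. The command text 'PV?' and the program name are hypothetical placeholders, not taken from any real device protocol.

```pascal
program AsciiBytesSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Command: string;  // text stays in a normal (Unicode) string
  Payload: TBytes;  // bytes exist only at the I/O boundary
begin
  Command := 'PV?'; // hypothetical device command, for illustration only
  Payload := TEncoding.ASCII.GetBytes(Command);
  Writeln(Length(Payload)); // 3: one byte per ASCII character
end.
```

The resulting TBytes value is what would be handed to the serial-port write routine, so no AnsiString type is needed anywhere in the program.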

Roger Cigol · April 24

20 minutes ago, David Heffernan said:

I mean, you work with strings and do TEncoding.ASCII.GetBytes

All good ! There is more than one way to skin a cat.....

Anders Melander · April 24

6 hours ago, David Heffernan said:

I don't see this as helpful to anyone.

That's a bit... unnuanced.

I thought it was helpful. Even though the "problem" is pretty obscure, and I haven't encountered it myself, it's a point of data that might come in handy some day.

David Heffernan · April 25

16 hours ago, Roger Cigol said:

All good ! There is more than one way to skin a cat.....

This way is reliable and works

David Heffernan · April 25

15 hours ago, Anders Melander said:

That's a bit... unnuanced.

I thought it was helpful. Even though the "problem" is pretty obscure, and I haven't encountered it myself, it's a point of data that might come in handy some day.

My point is that it's behaviour that you don't ever need to know because the correct way to handle byte data is as, well, bytes and not text. So for sure there's an algorithm, but it's not one that anyone actually needs to know.

baoquan.zuo · April 25

I forgot to mention that, in the original case, the proper solution is to use a byte array. I’d assumed this was common knowledge, but I should have spelled it out.
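For illustration, the same 62 bytes from the original program could be declared as a byte array along these lines. This is only a sketch: the program and constant names are made up here and do not come from the article.

```pascal
program ByteArrayKey;

{$APPTYPE CONSOLE}

const
  // The same 62 bytes as the original literal, stored as bytes rather than text,
  // so the source code page plays no part in how they are compiled.
  PublicKey: array[0..61] of Byte = (
    $30, $3C, $30, $0D, $06, $09, $2A, $86, $48, $86, $F7, $0D, $01, $01, $01, $05,
    $00, $03, $2B, $00, $30, $28, $02, $21, $00, $A4, $65, $B8, $CD, $B4, $29, $A9,
    $64, $1A, $C5, $80, $55, $22, $1B, $BB, $C5, $98, $36, $B9, $23, $0C, $CA, $D4,
    $A8, $B8, $7C, $E6, $32, $E3, $89, $3D, $77, $02, $03, $01, $00, $01);

begin
  Writeln(Length(PublicKey)); // 62: the length is fixed by the array bounds
  Readln;
end.
```

Because the array bounds are part of the declaration, the compiler also rejects the constant if the number of listed values does not match, which gives an early check that no byte was dropped.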

As I wrote at the beginning:

Quote

I was curious about the data-loss issue, so I decided to investigate it.

I simply documented the journey, shared it, and hope it helps someone. At the very least, the exercise deepened my understanding of character encoding and how dcc handles string literals.

In the end, it’s just an article. If you skimmed it, read the conclusion, and found nothing useful -- no worries, and thanks for taking a look.

Edited April 25 by baoquan.zuo
proofread

David Heffernan · April 25

5 hours ago, baoquan.zuo said:

I’d assumed this was common knowledge, but I should have spelled it out.

That makes a lot more sense.

Assumed it was common knowledge? I'm not so sure. I think there's still a big underbelly of Delphi coders that don't get this.

baoquan.zuo · April 25

Yes. I was a bit surprised that, when CnPack published the Chinese translation, advwang mentioned he had reported the issue (RSP-20624) back in 2018. The issue was closed as 'Work as Designed', with a suggestion to add a warning in cases of potential data loss. He also said EurekaLog 7.0 used this approach in its shellcode but later switched to a byte array.

By the way, I added this paragraph to the introduction:

Quote

Note: Generally, the correct approach is to use a byte array to represent binary data, since strings are intended for textual content. You may also skip the analysis and jump directly to the Conclusion section.

and improved the Conclusion section:

Quote

Why does data loss occur? In short, it is caused by converting an invalid AnsiString to a UnicodeString. Invalid byte sequences are replaced with the ? character. But how does this happen exactly? What's the underlying reason?

and the "Revisiting the Original Program" section:

Quote

The first string literal is interpreted as a UnicodeString without any data loss. However, since the three subsequent string literals contain invalid byte sequences in CP936, they are treated as AnsiStrings. When the compiler encounters a UnicodeString + AnsiString operation, it converts the AnsiString into a UnicodeString. During this conversion, any invalid byte sequences (whether a single invalid byte or a byte pair with a valid lead byte followed by an invalid or missing trail byte) are replaced with the ? character.

Edited April 25 by baoquan.zuo

Joseph MItzen · April 26

On 4/24/2025 at 2:51 AM, David Heffernan said:

What is wrong with the world of Delphi programmers that in 2025 there are still people who can't understand the difference between text and bytes?

Maybe Delphi is the last refuge of Python 2 diehards?
