A.M. Hoornweg

Workaround for binary data in strings ...


Hello all,

 

When Delphi didn't know about Unicode yet, people would often stuff binary data into strings because strings were soooo practical and easy to manipulate. Yes, that is and was bad practice and highly frowned upon, but as we all know it was done anyway, and people got away with it because Delphi didn't care about code pages at the time, so it just worked. It was even done in professional libraries. Code like that is very hard to port to newer Delphi versions, which are Unicode-enabled and codepage-aware, and when you attempt a conversion, subtle errors may happen where you least expect them.

 

 

If you still have precious old code libraries that do such things and which would be too costly or too complex to rewrite,  you may consider trying this workaround as a quick & dirty fix:

 

type BinaryString = type AnsiString(437);

 

The nice thing about code page 437 is that all 256 possible AnsiChar values map to valid Unicode code points, so you can safely assign these strings to UnicodeStrings and back again without data loss, and Delphi's built-in string functions won't break the binary data contained in these strings. So you just may be able to salvage some legacy code by declaring this new type and then replacing all "string" declarations in the code with "BinaryString" and all (P)Char with (P)AnsiChar.
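A minimal sketch of that round trip, assuming Delphi 2009 or later on Windows (other platforms use different conversion libraries and may behave differently). The program and variable names are only illustrative. It fills a BinaryString with every byte value, converts it to a UnicodeString and back, and reports whether the bytes survived.

```delphi
program BinaryStringRoundTrip;

{$APPTYPE CONSOLE}

type
  // The type from the post: an AnsiString tagged with code page 437.
  BinaryString = type AnsiString(437);

var
  Original, Restored: BinaryString;
  Wide: UnicodeString;
  I: Integer;
begin
  // Fill the string with every possible byte value, 0..255.
  SetLength(Original, 256);
  for I := 0 to 255 do
    Original[I + 1] := AnsiChar(I);

  // Explicit casts make the two conversions visible and avoid
  // implicit-cast compiler warnings.
  Wide := UnicodeString(Original);   // code page 437 -> UTF-16
  Restored := BinaryString(Wide);    // UTF-16 -> code page 437

  if (Length(Restored) = 256) and (Restored = Original) then
    Writeln('All 256 byte values survived the round trip.')
  else
    Writeln('The round trip altered the data.');
end.
```

Running the same check against the code page your legacy code actually ends up with is a cheap way to see whether the workaround holds on your system.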

 

And yes, it's still bad practice...

 

 

 

(The idea is originally from Raymond Chen:  https://devblogs.microsoft.com/oldnewthing/20200831-00/?p=104142 )

Edited by A.M. Hoornweg
giving proper credit to Raymond Chen

3 hours ago, A.M. Hoornweg said:

The nice thing about code page 437 is that all 256 possible ansichar values map to valid unicode code points so you can safely assign these strings to Unicodestring

Well, while this example by R. Chen is a good find, it's a bit useless for Delphi. Why reinvent the wheel when Delphi already has the codepage-neutral RawByteString, which won't do any char conversion at all?

AFAIU only "uint8" <=> "uint16" conversion is performed when assigning Rawbytes to unicode and vice versa.

Edited by Fr0sT.Brutal

37 minutes ago, Fr0sT.Brutal said:

Well, while this example by R.Chen is good finding, it's a bit useless for Delphi. Why reinvent the wheel when Delphi already has codepage-neutral RawByteString that won't do any char conversion at all?

AFAIU only "uint8" <=> "uint16" conversion is performed when assigning Rawbytes to unicode and vice versa.

The documentation says that "Rawbytestring should only be used as a const or value type parameter or a return type from a function. It should never be passed by reference (passed by var), and should never be instantiated as a variable."   I read that as "you may not declare variables of type "RawByteString" and do stuff with them".

 

But if I define a type AnsiString(437) and put binary data into it, the data is associated with a valid code page. A conversion to UnicodeString will convert characters [#128..#255] to some different Unicode code points, which doesn't matter, because a back-conversion to ANSI code page 437 will produce the original binary data again. So it is safe to use all normal Unicode string functions as long as we remember to assign the final results back to an AnsiString(437).
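As a rough illustration of that workflow (a sketch only, assuming Delphi 2009 or later on Windows; the byte values and names are made up for the example), the binary data is converted to a UnicodeString, edited with an ordinary string routine, and assigned back to the AnsiString(437) type before the bytes are inspected:

```delphi
program BinaryStringEdit;

{$APPTYPE CONSOLE}

type
  BinaryString = type AnsiString(437);

var
  Data, Edited: BinaryString;
  Work: UnicodeString;
  I: Integer;
begin
  // Five arbitrary bytes, including a zero byte and values above 127.
  SetLength(Data, 5);
  Data[1] := AnsiChar($01);
  Data[2] := AnsiChar($FF);
  Data[3] := AnsiChar($00);
  Data[4] := AnsiChar($80);
  Data[5] := AnsiChar($FE);

  // Work on a Unicode copy with a normal string routine...
  Work := UnicodeString(Data);
  Delete(Work, 3, 1);            // remove the third element

  // ...and convert the final result back to code page 437.
  Edited := BinaryString(Work);

  // Print the resulting bytes in decimal for inspection.
  for I := 1 to Length(Edited) do
    Write(Ord(Edited[I]), ' ');
  Writeln;
end.
```

Only the final BinaryString holds the real bytes; the intermediate UnicodeString holds whatever characters code page 437 assigns to them and should not be written out as binary.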

 

 

 

Guest
7 hours ago, A.M. Hoornweg said:

when Delphi didn't know about unicode yet people would often stuff binary data into strings because strings were soooo practical and easy to manipulate.  Yes that is and was bad practice and highly frowned upon, but as we all know it was done anyway and people got away with it because Delphi didn't care about code pages at the time so it just worked. It was even done in professional libraries.  Code like that is very hard to port to newer Delphi versions which are unicode-enabled and codepage-aware and when you attempt a conversion, subtle errors may happen where you least expect it.

Points for that writeup!

1 hour ago, David Heffernan said:

Kind of odd that you wouldn't just use a byte, TBytes. 

I do, all the time. But this is about salvaging older (possibly third party) ansi code without a rewrite. 

7 hours ago, A.M. Hoornweg said:

when Delphi didn't know about unicode yet people would often stuff binary data into strings because strings were soooo practical and easy to manipulate.  Yes that is and was bad practice and highly frowned upon, but as we all know it was done anyway and people got away with it because Delphi didn't care about code pages at the time so it just worked.

 

I dispute the concept that it was always considered bad practice.  Like many things, over time it was eventually agreed to be universally known as bad practice, but it's revisionist history to claim that it was always considered such.

How many Delphi Informant magazine articles, website articles and blog posts, along with commercial components and enterprise wide codebases were littered with this usage and there wasn't a peep about bad practice for years?

 

 

 

 


1 hour ago, David Heffernan said:

Kind of odd that you wouldn't just use a byte, TBytes. 

Hmm, is there a well-optimized Pos function in the RTL that takes TBytes? Strings were used for byte chunks because of the many convenient utilities.


7 minutes ago, Darian Miller said:

 

I dispute the concept that it was always considered bad practice.  Like many things, over time it was eventually agreed to be universally known as bad practice, but it's revisionist history to claim that it was always considered such.

How many Delphi Informant magazine articles, website articles and blog posts, along with commercial components and enterprise wide codebases were littered with this usage and there wasn't a peep about bad practice for years?

 

 

 

 

Fair point.  Moreover, the only tool in Delphi that could perform operations like copy/append/delete/insert/search on variable length data out of the box was "string". If all you have is a hammer, every problem looks like a nail.

 

I remember that Windows NT (from the mid-1990s) was already unicode-enabled but Delphi didn't support that until version 2009. So all that time, a char and a byte were basically synonymous in Delphi.

 

 

8 minutes ago, A.M. Hoornweg said:

Windows NT (from the mid-1990s) was already unicode-enabled but Delphi didn't support that until version 2009

*cough* WideString *cough*

Available since D4 or thereabouts...

26 minutes ago, Anders Melander said:

*cough* WideString *cough*

Available since D4 or thereabouts...

The VCL was strictly Ansi. I needed visual unicode compatibility in the early noughties and was forced to switch to a third-party VCL (first TNT, then LMD Elpack). 

 

9 hours ago, A.M. Hoornweg said:

If you still have precious old code libraries that do such things and which would be too costly or too complex to rewrite,  you may consider trying this workaround as a quick & dirty fix:


Type Binarystring= type Ansistring (437);

The nice thing about code page 437 is that all 256 possible ansichar values map to valid unicode code points

True, though 437 does not map every byte to a Unicode codepoint of the same numeric value (in fact, it remaps many bytes). For this task (when I have needed it in the past), I would use codepage 28591 (ISO-8859-1) instead, in which every byte has the same numeric value as its codepoint.
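The difference between the two code pages can be checked with a small sketch like the following (assuming Delphi 2009 or later on Windows; `CountRemapped` and the type names are hypothetical helpers, not RTL routines). It counts, for each code page, how many byte values end up at a Unicode codepoint with a different numeric value:

```delphi
program CodePageIdentityCheck;

{$APPTYPE CONSOLE}

type
  Cp437String  = type AnsiString(437);
  Latin1String = type AnsiString(28591);

// Counts positions whose UTF-16 code unit differs from the byte value
// (0..255) that was originally stored at that position.
function CountRemapped(const W: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(W) do
    if Ord(W[I]) <> I - 1 then
      Inc(Result);
end;

var
  Cp437: Cp437String;
  Latin1: Latin1String;
  I: Integer;
begin
  // Store every byte value 0..255 in both strings.
  SetLength(Cp437, 256);
  SetLength(Latin1, 256);
  for I := 0 to 255 do
  begin
    Cp437[I + 1] := AnsiChar(I);
    Latin1[I + 1] := AnsiChar(I);
  end;

  Writeln('Code page 437:   ', CountRemapped(UnicodeString(Cp437)),
    ' byte values map to a different codepoint');
  Writeln('Code page 28591: ', CountRemapped(UnicodeString(Latin1)),
    ' byte values map to a different codepoint');
end.
```

An identity mapping mostly matters when code inspects the UnicodeString directly, for example by comparing Ord(Ch) with a byte constant; for a pure round trip either code page does the job.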

9 hours ago, A.M. Hoornweg said:

And yes, it's still bad practice...

Agreed.

6 hours ago, Fr0sT.Brutal said:

Why reinvent the wheel when Delphi already has codepage-neutral RawByteString that won't do any char conversion at all?

More accurately, there is no conversion only when an AnsiString(N)-based string type is assigned to it, as it will simply inherit N as its current codepage, but it does perform a character conversion when a UnicodeString or WideString is assigned to it, and when it is assigned to another non-RawByteString string type.  So, even if you were to use RawByteString, you still have to be careful with how you use it.
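A small sketch of the two cases, assuming Delphi 2009 or later on Windows. It declares a RawByteString variable purely for demonstration, which the documentation quoted earlier in the thread advises against. `StringCodePage` and `DefaultSystemCodePage` come from the System unit; the other names are illustrative.

```delphi
program RawByteCodePage;

{$APPTYPE CONSOLE}

type
  BinaryString = type AnsiString(437);

var
  Src: BinaryString;
  Raw: RawByteString;
  Wide: UnicodeString;
begin
  // One byte above 127, tagged with code page 437.
  SetLength(Src, 1);
  Src[1] := AnsiChar($B0);

  // Case 1: AnsiString(437) assigned to RawByteString.
  Raw := Src;
  Writeln('Code page after AnsiString(437) -> RawByteString: ',
    StringCodePage(Raw));

  // Case 2: UnicodeString assigned to RawByteString.
  Wide := UnicodeString(Src);
  Raw := RawByteString(Wide);
  Writeln('Code page after UnicodeString -> RawByteString:   ',
    StringCodePage(Raw));
  Writeln('DefaultSystemCodePage on this machine:            ',
    DefaultSystemCodePage);
end.
```

Comparing the two reported code pages, and the byte in Raw with $B0, shows on a given machine whether the second assignment changed the payload.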

 


4 hours ago, David Heffernan said:

Kind of odd that you wouldn't just use a byte, TBytes. 

Exactly my thoughts. When I was hit with the first encoding issue I switched to TBytes. Easy to handle, but really powerful; especially with the built-in TEncoding classes.

2 hours ago, A.M. Hoornweg said:

I do, all the time. But this is about salvaging older (possibly third party) ansi code without a rewrite

RawByteString, as mentioned earlier, but that also requires you to change the library itself.

Edited by aehimself

11 hours ago, aehimself said:

Exactly my thoughts. When I was hit with the first encoding issue I switched to TBytes. Easy to handle, but really powerful; especially with the built-in TEncoding classes.

RawByteString, as it was mentioned earlier; but that also requires you to change the library itself

For my own source code I switched to TBytes too, long ago. But I have several legacy third-party libraries from companies that no longer exist, and they are still very nice when used with Delphi 2007. So when I came across Raymond Chen's blog I had the idea to salvage these libraries without mutilating the existing source code too much. And I stated loudly and clearly that this is just a quick & dirty fix to get it done.

 

As Remy Lebeau stated, assigning RawByteString <-> String may still do a codepage conversion. AnsiString(437) and AnsiString(28591) are safe.

 

 

Edited by A.M. Hoornweg
typo
