omnibrain

Possible changes to string/char handling in Delphi 11(.2)?

Are there any (known) changes to the compiler regarding string handling from Delphi 10.2 (or 11.0) to 11.2?

I don't have code, because I can't reproduce it yet, it's more of a feeling, that something is wrong.

 

We do a lot of serial communication. Parts of the code are rather old, but survived many Delphi version changes. Some of the code may be pre-Unicode but got modernized before my time and thus before source control, so I can't check the history. We read the ansichars that come via the serial connection and put them into chars and then into strings and work from there. I think that's not ideal because strings are Unicode (nowadays), but so far it worked fine.

Recently we switched from Delphi 10.2 to 11.2 (with a short stint in 11.0, but I'm only 80% sure the error wasn't there). And now it only works like 99.9% of the time. (With our test systems everything works, but our customers have more "traffic".)

 

The error is, that we get symbols we can't explain in positions where they don't belong. It looks like chars get converted to other hex values.

 

At the moment I'm just poking around, because the error is rare enough and we don't have a trace yet. But perhaps someone knows of a possible change to string/char handling with the most recent compiler versions.

 

 

Link to post
18 minutes ago, programmerdelphi2k said:

I don't know much about this "transliteration" occurring, ...

"Transliteration" is a good word. My gut feeling is that we receive some byte value that translates to a char, which gets "transliterated" to a Unicode glyph, and when we try to work with the byte value again we get the value of the Unicode glyph instead.

18 minutes ago, programmerdelphi2k said:

but did you try using "AnsiString" instead of the "String" type?

https://docwiki.embarcadero.com/Libraries/Alexandria/en/System.AnsiStrings

Yes, I'm currently thinking about converting everything to ansichars and ansistrings, or even rawbytestrings, or possibly TBytes (though we depend heavily on pos() for protocol parsing). But it may very well be that I'm chasing ghosts. So if someone could chime in and say "yes, something really changed", that would give me confidence.

18 minutes ago, programmerdelphi2k said:

I could see, too, that you're almost a statistical timer... 🙂

?

Edited by omnibrain

Link to post
procedure TForm1.Button1Click(Sender: TObject);
var
  MyAnsiString: AnsiString;
  MyText      : string;
begin
  MyAnsiString :=
    '123' +
    chr(10) +    // line feed: present, but not visible
    'hello' +
    chr(0) +     // #0: everything from here on will be lost when displayed
    'world' +
    chr(11200) + // far outside #0..#255, so it has to be converted for the AnsiString
    'hi';
  MyText := '';
  // List every AnsiChar together with its ordinal value.
  for var C in MyAnsiString do
    MyText := MyText + ',"' + C + ' - Code: ' + Ord(C).ToString + '"';
  // Drop the leading comma and show one entry per line.
  Memo1.Lines.DelimitedText := MyText.Remove(0, 1);
end;

 


Edited by programmerdelphi2k

Link to post
2 hours ago, programmerdelphi2k said:

I don't know much about this "transliteration" occurring, ... but did you try using "AnsiString" instead of the "String" type?

https://docwiki.embarcadero.com/Libraries/Alexandria/en/System.AnsiStrings

 

I could see, too, that you're almost a statistical timer... 🙂

 

Nobody has any idea what the actual problem is, but yeah, let's just randomly throw some AnsiStrings around. 

 

This approach to coding doesn't work. The monkeys still haven't typed Shakespeare yet.. 

Link to post
1 hour ago, David Heffernan said:

Nobody has any idea what the actual problem is, but yeah, let's just randomly throw some AnsiStrings around. 

Yeah, me neither. That's why I ask in general terms. Like if someone asked if anything with pointers and DLLs has changed in the latest release. Then we would answer "of course, ASLR and HE-ASLR are enabled by default. Look into the linker options." I hope for something like that.

 

I suspect the code has been broken since the Unicode migration and now some (undefined?) behaviour in edge cases may have changed. I'm afraid I won't be able to avoid reworking it to use proper datatypes for stuff received via the serial connection. (A proper mix between raw byte protocols, text protocols and protocols that mix both.) 

The discussion in Best type for data buffer: TBytes, RawByteString, String, AnsiString, ... - Algorithms, Data Structures and Class Design - Delphi-PRAXiS [en] (delphipraxis.net) goes in a similar direction, though starting from another angle.

Link to post

Apparently, we have a critic on duty pointing the finger towards infinity?
Does the answer lie in your galaxy, in a hazy light-years of poor teratian mortals?

I'm getting out of here... help help... the aliens are coming :classic_cheerleader:

Link to post
13 hours ago, omnibrain said:

Yeah, me neither. That's why I ask in general terms. Like if someone asked if anything with pointers and DLLs has changed in the latest release. Then we would answer "of course, ASLR and HE-ASLR are enabled by default. Look into the linker options." I hope for something like that.

I was pushing back against the quoted response, which was deeply unhelpful. 

 

As for what you need to do, I doubt the issue is with the update. I'd look to debug your code. 

Link to post
chr(11200)

What do you expect Delphi to make of chr(11200)?
Everything outside the range #0 to #255 then becomes a "?".

 

Link to post
18 hours ago, omnibrain said:

We read the ansichars that come via the serial connection and put them into chars and them into strings and work from there. I think that's not ideal because strings are unicode (nowadays) but so far it worked fine.

That is the source of your problem because converting an AnsiChar to UnicodeChar ( = Char) is not a simple byte copy, it will convert characters not in the 7 bit ASCII set to Unicode codepoints that may have different ordinal values. This depends on the active ANSI codepage of the system as well.

The only sensible solution is to treat bytes as bytes and not as characters, use TBytes or other suitable containers to store sequences of bytes instead of strings (ANSI or Unicode). Writing a Pos equivalent for TBytes is easy and there are likely dozens of implementations around, you just have to find one :classic_cool:.
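As a rough illustration of how small such a Pos equivalent can be, here is a minimal sketch. The name FindBytes and its 0-based, -1-on-failure convention are assumptions made for this example, not an RTL routine, and a naive scan like this is only meant for short protocol buffers.

```pascal
program FindBytesDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the 0-based index of the first occurrence of
// Pattern in Buffer at or after StartIndex, or -1 if there is none.
function FindBytes(const Pattern, Buffer: TBytes; StartIndex: Integer = 0): Integer;
var
  I, J: Integer;
  Matched: Boolean;
begin
  Result := -1;
  if (Length(Pattern) = 0) or (StartIndex < 0) then
    Exit;
  // Naive scan: compare Pattern against every possible start position.
  for I := StartIndex to Length(Buffer) - Length(Pattern) do
  begin
    Matched := True;
    for J := 0 to High(Pattern) do
      if Buffer[I + J] <> Pattern[J] then
      begin
        Matched := False;
        Break;
      end;
    if Matched then
      Exit(I);
  end;
end;

var
  Data: TBytes;
begin
  // Values >= $80 are compared as plain bytes, no code page is involved.
  Data := TBytes.Create($02, $41, $80, $FF, $04);
  Writeln(FindBytes(TBytes.Create($80, $FF), Data)); // 2
  Writeln(FindBytes(TBytes.Create($04), Data));      // 4
  Writeln(FindBytes(TBytes.Create($05), Data));      // -1
end.
```

Unlike Pos on strings, the result is 0-based here because dynamic arrays are 0-based; adjust that if existing parsing code relies on 1-based positions.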

Link to post
29 minutes ago, PeterBelow said:

That is the source of your problem because converting an AnsiChar to UnicodeChar ( = Char) is not a simple byte copy, it will convert characters not in the 7 bit ASCII set to Unicode codepoints that may have different ordinal values. This depends on the active ANSI codepage of the system as well.

It depends on the encoding of the 8 bit data. For all we know, that data could be ASCII.

 

The problem is that the question doesn't have any actionable information, and the asker is just hoping for some silver bullet. The asker needs to get some real information rather than hope that people here can guess what's up.

Edited by David Heffernan
Link to post
8 minutes ago, David Heffernan said:

Thanks for the correction. The main point stands, namely that Chr(11200) is perfectly valid.

This is true for strings. However in the example it's used to assign the value to ansistring:

 

var
  MyAnsiString: AnsiString;

 

AnsiString is not a Unicode string, thus there is no chr(11200). Most probably the code page of the AnsiString will have no conversion for that Unicode code point, so it will be converted to "?".

Link to post

I tried to condense the code:

procedure tfr_com.dataavail(sender:TObject; Count:integer);
var i  : word;
    c  : char;                  // serial character received
    ac : ansichar;
    s  : string;
begin
      for i:=1 to count do
         if com.ReadChar(ac) then
            begin
            c:=char(ac);
{$R-} 
            showinchar(c);
            if assigned(receivecharproc) then
               receivecharproc(c);
{$R+}
            end;      
end;  
//receivecharproc is 
procedure tfi_m.receivechar(ach:char);
begin
  //state machine that works with the chars to parse the various protocols and adds them to a string typed variable for further processing
end;

ReadChar provides us ansichars, so I guess it's easiest just to stay with ansichars for further processing.
But why did it work for 10 years and suddenly stopped? I can't rule out, that we see a new type of input we haven't seen before. But I still have no trace of what we actually receive...
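To get that missing trace, one option would be to log the raw byte values before any character conversion happens. A minimal sketch follows; AppendByte and HexDump are names made up for this example, and the loop only simulates what ReadChar would deliver.

```pascal
program HexTraceDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: appends one raw byte to a growing buffer.
procedure AppendByte(var Buffer: TBytes; Value: Byte);
begin
  SetLength(Buffer, Length(Buffer) + 1);
  Buffer[High(Buffer)] := Value;
end;

// Hypothetical helper: renders a buffer as space-separated hex pairs.
function HexDump(const Data: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Data do
    Result := Result + IntToHex(B, 2) + ' ';
  Result := Result.TrimRight;
end;

var
  Incoming, Buffer: TBytes;
  Raw: Byte;
  Received: AnsiChar;
begin
  // Stand-in for the characters the serial component would hand over.
  Incoming := TBytes.Create($02, $41, $80, $FF);
  for Raw in Incoming do
  begin
    Received := AnsiChar(Raw);
    // Ord() of an AnsiChar is its byte value; no code page is consulted.
    AppendByte(Buffer, Ord(Received));
  end;
  Writeln(HexDump(Buffer)); // 02 41 80 FF
end.
```

Writing such a dump to a log file at a customer site should show whether the unexplained symbols line up with received bytes of $80 and above.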

Link to post
10 minutes ago, omnibrain said:

But why did it work for 10 years and suddenly stopped? I can't rule out, that we see a new type of input we haven't seen before. But I still have no trace of what we actually receive...

What is the encoding of the input? Can you guarantee that it is ASCII? 

Link to post
1 hour ago, Lajos Juhász said:

This is true for strings. However in the example it's used to assign the value to ansistring:

 


var
  MyAnsiString: AnsiString;

 

AnsiString is not a Unicode string, thus there is no chr(11200). Most probably the code page of the AnsiString will have no conversion for that Unicode code point, so it will be converted to "?".

Fair. I was just looking at a statement about Chr(11200) in isolation. My bad. 

 

Off topic aside follows below:

 

Interestingly, if the process has UTF-8 as the active code page (https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) then you can use AnsiString fine and be fully Unicode compliant. I discovered this by accident lately when my MATLAB mex file, which uses ANSI because MATLAB doesn't do UTF16, unexpectedly started handling Unicode with a recent MATLAB update! The update set this code page in its manifest. 

Link to post
30 minutes ago, David Heffernan said:

Interestingly, if the process has UTF-8 as the active code page (https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) then you can use AnsiString fine and be fully Unicode compliant. I discovered this by accident lately when my MATLAB mex file, which uses ANSI because MATLAB doesn't do UTF16, unexpectedly started handling Unicode with a recent MATLAB update! The update set this code page in its manifest. 

Yeah, and breaks FireDAC as it converts from UTF-16 using the language for non-unicode programs instead of using conversion from client locale to server locale.

Link to post
58 minutes ago, David Heffernan said:

What is the encoding of the input? Can you guarantee that it is ASCII? 

The value can be basically anything from $01 to $ff. It's serial communication with various protocols. Some text based, some byte based and some of them a mix of both. Some of them are delimited by EOT ($04), for some we need to calculate the lengths, for some we need to calculate CRCs, etc. Not all in the same process, but the pattern is the same for all of them and the tfr_com.dataavail is the same.

 

The serial communications component provides ansichars. And we don't expect to receive multi-byte characters via serial communications anyway. We communicate with old hardware, with old protocols. Most of the time there is no encoding specified, but for the text parts (if there are any) most of the time just plain ASCII characters are used.
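For the EOT-delimited protocols mentioned above, framing can stay entirely at the byte level. The following is only a sketch under that assumption; ExtractFrame is a hypothetical helper, and length-prefixed or CRC-checked protocols would need their own logic.

```pascal
program FrameSplitDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  EOT = $04; // frame delimiter, as described for some of the protocols

// Hypothetical helper: if Buffer contains an EOT, moves the bytes before it
// into Frame, removes them (and the EOT) from Buffer and returns True.
function ExtractFrame(var Buffer: TBytes; out Frame: TBytes): Boolean;
var
  I: Integer;
begin
  for I := 0 to High(Buffer) do
    if Buffer[I] = EOT then
    begin
      Frame := Copy(Buffer, 0, I);
      Buffer := Copy(Buffer, I + 1, Length(Buffer) - I - 1);
      Exit(True);
    end;
  Frame := nil;
  Result := False;
end;

var
  Buffer, Frame: TBytes;
begin
  // Two complete frames followed by the start of a third one.
  Buffer := TBytes.Create($41, $80, EOT, $42, EOT, $43);
  while ExtractFrame(Buffer, Frame) do
    Writeln('frame of ', Length(Frame), ' byte(s)');
  Writeln('bytes still buffered: ', Length(Buffer));
end.
```

Only the text portions of a frame would then be decoded into a string, with an explicitly chosen encoding, once the frame boundaries are known.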

Link to post
59 minutes ago, Lajos Juhász said:

Yeah, and breaks FireDAC as it converts from UTF-16 using the language for non-unicode programs instead of using conversion from client locale to server locale.

That's on FireDAC I guess. AnsiString conversions work just fine in this scenario because they call GetACP and use the returned value (65001) for all conversions.

Link to post
24 minutes ago, omnibrain said:

The value can be basically anything from $01 to $ff.

What do you expect and intend to happen then with values of >= $80?

 

I don't think anything has changed in recent Delphi releases, but your code may have been broken forever. 
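To make the question about values >= $80 concrete, here is a small sketch of what a code-page-aware conversion does to single bytes. It assumes Windows-1252 purely for illustration (the real ANSI code page depends on the system) and uses TEncoding rather than the typecast from the posted code.

```pascal
program CodePageDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Enc: TEncoding;
  Input: TBytes;
  B: Byte;
  S: string;
begin
  // Assumption for illustration: the bytes are decoded as Windows-1252.
  Enc := TEncoding.GetEncoding(1252);
  try
    Input := TBytes.Create($41, $7F, $80, $E4);
    for B in Input do
    begin
      // Decode one byte at a time and show the resulting UTF-16 code unit.
      S := Enc.GetString(TBytes.Create(B));
      Writeln(Format('byte $%.2x -> Ord(Char) = $%.4x', [B, Ord(S[1])]));
    end;
  finally
    Enc.Free;
  end;
end.
```

On Windows I would expect $41 and $7F to keep their values, while $80 comes back as $20AC (the euro sign), so any later Ord() on that Char no longer matches the byte that was received.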

Link to post

The RAD Studio 11.2 help says:

System.AnsiString

Quote

ANSISTRING: Represents a dynamically allocated string whose maximum length is limited only by available memory.

An AnsiString variable is a structure containing string information. When the variable is empty (when it contains a zero-length string), the pointer is nil and the string uses no additional storage. When the variable is nonempty, it points to a dynamically allocated block of memory that contains the string value. This memory is allocated on the heap, but its management is entirely automatic and requires no user code. The AnsiString structure contains a 32-bit length indicator, a 32-bit reference count, a 16-bit data length indicating the number of bytes per character, and a 16-bit code page. This code page is set, by default, to the operating system's code page. It can be changed by calling SetMultiByteConversionCodePage

 

An AnsiString represents a single-byte string. With a single-byte character set (SBCS), each byte in a string represents one character. In a multibyte character set (MBCS), the elements are still single bytes, but some characters are represented by one byte and others by more than one byte. Multibyte character sets--especially double-byte character sets (DBCS)--are widely used for Asian languages. An AnsiString can contain MBCS characters.

etc..

So is that why Char(11200) = ? ... or does it exist in the default OS code page?

Edited by programmerdelphi2k

Link to post
2 hours ago, programmerdelphi2k said:

So is that why Char(11200) = ? ... or does it exist in the default OS code page?

Char(11200) is a perfectly valid Char, and represents a well defined UTF-16 element. The issue is when you convert to AnsiString.
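A short sketch of that distinction, again assuming Windows-1252 as the target code page for illustration only: the Char keeps its value, and the loss only happens when the text is encoded to single bytes.

```pascal
program Char11200Demo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Enc: TEncoding;
  S: string;
  Bytes: TBytes;
  B: Byte;
begin
  S := Char(11200); // U+2BC0, a valid UTF-16 code unit
  Writeln('Ord(S[1]) = ', Ord(S[1]));

  // Assumption for illustration: Windows-1252 as the ANSI code page.
  Enc := TEncoding.GetEncoding(1252);
  try
    Bytes := Enc.GetBytes(S);
  finally
    Enc.Free;
  end;

  // Show what is left after the conversion to single bytes.
  for B in Bytes do
    Writeln('encoded byte: $', IntToHex(B, 2));
end.
```

The first line prints 11200. For the encoded byte I would expect $3F, the question mark, on a typical Windows setup, since code page 1252 has no mapping for U+2BC0.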
