Best type for data buffer: TBytes, RawByteString, String, AnsiString, ...

Fr0sT.Brutal · October 2, 2020

Byte arrays still really lack useful optimized routines. The annoying thing is that in general the code of such routines could be almost the same (Length, SetLength, indexed access...). I wish the RTL had good old pointer-and-length routines as a basis, so that calling them on a string, TBytes or even a raw buffer would be equally convenient.

Regarding the topic: I'd still use byte strings because they're simple and convenient, but I'd try to isolate string-specific stuff in routines so that changing them to TBytes would be just a matter of a type rename.
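To illustrate the pointer-and-length idea, here is a minimal sketch of one such routine used on both a RawByteString and a TBytes. The name IndexOfByte is hypothetical (it is not an RTL routine), and the sketch assumes a Unicode-era Delphi compiler with namespaced units.

```pascal
program PtrLenDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;  // declares TBytes

// Hypothetical helper, not part of the RTL.
// Returns the zero-based offset of the first byte equal to Value, or -1.
function IndexOfByte(P: PByte; Len: NativeInt; Value: Byte): NativeInt;
var
  I: NativeInt;
begin
  for I := 0 to Len - 1 do
  begin
    if P^ = Value then
      Exit(I);
    Inc(P);
  end;
  Result := -1;
end;

var
  S: RawByteString;
  B: TBytes;
begin
  S := 'AT+OK'#13#10;
  B := TBytes.Create($41, $54, $0D, $0A);
  // Pointer(X) is nil for an empty string or array; the loop then never runs.
  Writeln(IndexOfByte(Pointer(S), Length(S), 13));  // prints 5
  Writeln(IndexOfByte(Pointer(B), Length(B), 13));  // prints 2
end.
```

Because the routine only sees a pointer and a length, switching the buffer type later does not touch it; only the call sites change.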

Edited October 2, 2020 by Fr0sT.Brutal

Kryvich · October 2, 2020

We need a real RawByteString without ANY implicit conversions. Just show a compiler error when trying to implicitly convert a string. Or improved TBytes with copy-on-write and all manipulation routines.

FPiette · October 2, 2020

1 hour ago, Rollo62 said:

The problem is the speed.

A linked list is among the fastest. Much faster than arrays or strings. But since you didn't tell us how the data is coming in and which processing you need to do, all answers (mine and from others) are just guesses and probably not really helpful.

You'll get performance if you minimize the number of data copies and memory allocations. Be aware of the hidden data copy and allocation that can happen when you resize a dynamic array or a string, or any variation of them.

Edited October 2, 2020 by FPiette
Added more details.

Dalija Prasnikar · October 2, 2020

2 hours ago, Rollo62 said:

but as AnsiString was deprecated and removed once from modern platforms, this leaves a bad taste.
It seems that it came back only on massive complaints from the community.

Even AnsiString is back, but you can safely use RawByteString and UTF8String.

The original version of the NextGen compiler didn't have support for 8-bit strings. They were brought back not just because of community demands, but because UTF-8 is a commonly used Unicode encoding, especially on Linux. So RawByteString and UTF8String were introduced in NextGen as a step toward supporting Linux.

Now when NextGen is gone, you can also use AnsiString, but that type is really legacy nowadays and it makes no sense to use it beyond that.

A.M. Hoornweg · October 2, 2020

42 minutes ago, Kryvich said:

We need a real RawByteString without ANY implicit conversions. Just show a compiler error when trying to implicitly convert a string. Or improved TBytes with copy-on-write and all manipulation routines.

AnsiString(28591) does that: if you convert it to Unicode, the ordinal values of the WideChars are identical to the ordinal values of the AnsiChars.

type
  Binarystring = type AnsiString(28591);

procedure TForm2.Button1Click(Sender: TObject);
var
  Original, aCopy: Binarystring;
  Ansi: AnsiString;
  Raw: RawByteString;
  Changed: Boolean;
  Uni: UnicodeString;
  i: Integer;
begin
  SetLength(Original, 256);
  for i := 1 to 256 do
    Original[i] := AnsiChar(i - 1);

  Uni := Original; // copy to Unicode, see if ordinal values are the same

  Changed := False;
  for i := 1 to 256 do
    if Byte(Original[i]) <> Word(Uni[i]) then
      Changed := True;

  if not Changed then
    ShowMessage('It appears that the Widechar ordinal values are THE SAME as the Ansichar values');

  aCopy := Uni;
  if aCopy = Original then
    ShowMessage('It appears that converting back to binarystring is SAFE');

  Ansi := Uni;
  if Original <> Ansi then
    ShowMessage('It appears that converting back to ANSIstring is UNSAFE');

  Raw := Uni;
  if Original <> Raw then
    ShowMessage('It appears that converting back to RAWBYTEstring is UNSAFE');
end;

Edited October 2, 2020 by A.M. Hoornweg

Kryvich · October 2, 2020

@A.M. Hoornweg It depends on Windows ANSI codepage. I have CP-1251.

Quote

It appears that the Widechar ordinal values are THE SAME as the Ansichar values
It appears that converting back to binarystring is SAFE
It appears that converting back to ANSIstring is UNSAFE
It appears that converting back to RAWBYTEstring is UNSAFE

The compiler also issues a lot of warnings:

Quote

[dcc32 Warning] Ansistring28591.dpr(23): W1057 Implicit string cast from 'Binarystring' to 'string'
[dcc32 Warning] Ansistring28591.dpr(33): W1058 Implicit string cast with potential data loss from 'string' to 'Binarystring'
[dcc32 Warning] Ansistring28591.dpr(37): W1058 Implicit string cast with potential data loss from 'string' to 'AnsiString'
[dcc32 Warning] Ansistring28591.dpr(38): W1057 Implicit string cast from 'Binarystring' to 'string'
[dcc32 Warning] Ansistring28591.dpr(38): W1057 Implicit string cast from 'AnsiString' to 'string'
[dcc32 Warning] Ansistring28591.dpr(41): W1058 Implicit string cast with potential data loss from 'string' to 'RawByteString'

Ansistring28591.dpr

Edited October 2, 2020 by Kryvich

A.M. Hoornweg · October 2, 2020

4 minutes ago, Kryvich said:

@A.M. Hoornweg It depends on Windows ANSI codepage. I have CP-1251.

The compiler also issues a lot of warnings:

Ansistring28591.dpr

Yes of course the compiler warns. Ansistrings (with the exception of utf8string) can contain only a subset of Unicode so the compiler will warn if you convert between the types.

But the code proves without any doubt that "Binarystring" can be losslessly assigned to Unicodestring and back again and that the ordinal values stay the same. So for storing bytes (which contain only values 0-255 anyway) that is perfectly safe to use.

Just don't assign them to any other flavor of Ansistring or things get corrupted.

Anders Melander · October 2, 2020

The only general advice I can give is: Convenience or Performance? Pick one.

I won't get into this discussion about which of the many different data types can solve your problem best, because your requirements seem rather fluid and it's impossible to give any targeted advice when we don't know anything about the specifics.

I'm puzzled why one would even be asking a question like this other than to pass time. If the implementation details are important you shouldn't rely on something someone said on a forum anyway. Try different solutions and benchmark them. If none of the solutions are good enough then there will be something to discuss.

Disclaimer: I've been up all night and I'm on my eighth cup of coffee.

A.M. Hoornweg · October 2, 2020

9 minutes ago, Anders Melander said:

The only general advice I can give is: Convenience or Performance? Pick one.

I won't get into this discussion about which of the many different data types can solve your problem best, because your requirements seem rather fluid and it's impossible to give any targeted advice when we don't know anything about the specifics.

I'm puzzled why one would even be asking a question like this other than to pass time. If the implementation details are important you shouldn't rely on something someone said on a forum anyway. Try different solutions and benchmark them. If none of the solutions are good enough then there will be something to discuss.

Disclaimer: I've been up all night and I'm on my eighth cup of coffee.

I sympathize with the OP, I have the same problem often, data streams containing a mixture of text and binary.

- Many communication components for RS232 (such as good old Async Pro) have receive events which pass a string. So far I haven't seen any that has the decency to pass a TBytes.

- With modems (both the old-fashioned ones and modern cellular 4G ones) one communicates using the "Hayes" protocol, which is text-based until the connection is established; then the data stream becomes binary.

Rollo62 · October 2, 2020

3 hours ago, David Heffernan said:

At the start of this thread you said that the data was binary. Now you say it is ASCII. Hard to give advice on this basis.

@David Heffernan

Sorry to be unclear with that, I wrote:

Quote

I need to choose a basic type for caching and manipulating binary data, which is mostly represented as String, but could be also pure Byte data.

Maybe I should say that it contains maybe 95% string and 5% binary data, as it may come from various sources.

3 hours ago, Lars Fosdal said:

What format is the string source? UTF-8? Ansi? How are string buffers measured? Byte/word length? Character length? Zero terminated?

Mainly printable ASCII (0 ... 127) fitting in one byte, but as I said, pure binary (0 ... 255) can also occur from some providers.

3 hours ago, Lars Fosdal said:

How are the strings delimited in the buffer?

Usually SOT with CrLf or Lf as EOT for the ASCII data representation,
the binary data representation also has SOT/EOT of some kind but can be fixed length data too.

3 hours ago, Lars Fosdal said:

IMO, if you need to copy the strings out of the buffer to work with them anyways, you might as well future-proof for non-ascii and copy to a regular string with the appropriate routine. That can allow you to sanity check the string in context of format, and handle abnormalities.

Yes, the data will be distributed into various processors, which know how to handle the data right.
But I need a common buffer type for the intermediate buffer, which distributes the data.

3 hours ago, Dalija Prasnikar said:

TBytes is preferred type for manipulating binary data.

Absolutely, unless I have a mixture of both.

3 hours ago, Dalija Prasnikar said:

However, it lacks some features and behaviors that string has (copy on write, easy concatenation, string manipulating routines (even if you have to copy paste pieces of RTL code to adjust them for RawByteStrings, this is easier and faster than handling TBytes), and most important one debugging).
If those features are needed, then RawByteString is the only other option you have left.

I have not yet checked in detail what the internal difference between RawByteString and AnsiString is,
I assume it is mainly the code page handling.

3 hours ago, Dalija Prasnikar said:

Most important thing is, you should avoid calling ANY function that does Unicode transformation and usually that means some common Delphi string functions will be out of reach, and you need to make your own.

Right, that's my goal too.
I will first try to make custom types, derived from standard classes, with some disadvantages.

Maybe it's also worth wrapping those standard classes in a new wrapper class or even generics, opening up full-featured class handling,
but I wanted to keep such stuff more low-level and not overdo it.

There should be something already fitting this case, from the above candidates.

3 hours ago, Dalija Prasnikar said:

With CP 437 you might get away with transformation problems, but any unnecessary conversion also means performance loss.

Now I will go sit in my corner waiting to be tarred and feathered.

As always I'm open to all non-standard proposals, no reason to take out tar and feathers here, but I know that this proposal is seen as bad practice.
In case it works and solves my issues, I could live with that, but still there might be better solutions around.

3 hours ago, Arnaud Bouchez said:

I would rather use RawByteString for several reasons:

I'm with you too, since the name already suggests what purpose it has.
If only the documentation didn't say "don't use it".

3 hours ago, Arnaud Bouchez said:

2. SetLength(TBytes) will allocate the memory and fill it with zeros, whereas SetLength(RawByteString) will just allocate the memory.

Well thanks, was not aware of that important feature.

3 hours ago, Arnaud Bouchez said:

3. Even if your RawByteString has some binary, the ASCII characters will be easier to read e.g. #2#0#7'Some Text'#0'Some other text'#49.

Right, that's what I hope too.
It makes life easier, even if the purists might say "binary is not a string".

So maybe I should dig deeper into a RawByteString implementation now, and check if that might fit well.
It would be good to know the future of RawByteString: are there plans to phase it out?

Edited October 2, 2020 by Rollo62

Rollo62 · October 2, 2020

2 hours ago, FPiette said:

A linked list is among the fastest. Much faster than arrays or strings. But since you didn't tell us how the data is coming in and which processing you need to do, all answers (mine and from others) are just guesses and probably not really helpful.

See my first post

Quote

The original source is TBytes, so my first consideration is to keep TBytes as buffer data type.

While the original data mostly contains ANSI strings, but in some cases maybe also contains binary (Byte) data, 0 ... 255.

Rollo62 · October 2, 2020

1 hour ago, A.M. Hoornweg said:

So far I haven't seen any that has the decency to pass a tBytes.

Look for example into the Bluetooth LE communication functions.

Kryvich · October 2, 2020

Marco Cantù has a comprehensive guide to Unicode, ANSI, RawByteStrings etc.

Edited October 2, 2020 by Kryvich

Rollo62 · October 2, 2020

@Kryvich

Thanks, but it's from 2008.
Is this still 100% valid, after all those FMX, mobile, ARC, ... additions?

Edited October 2, 2020 by Rollo62

Lars Fosdal · October 2, 2020

Another factor for your choice:

TBytes - zero based indexes

AnsiString/RawByteString - one based indexes

So - doing over will be a pain.

Dalija Prasnikar · October 2, 2020

18 minutes ago, Rollo62 said:

I have not yet checked in detail what the internal difference between RawByteString and AnsiString is,
I assume it is mainly the code page handling.

Some common RTL string manipulation functions have several overloads - including RawByteString one. If you use such function on AnsiString, you will trigger automatic compiler Unicode conversion. Using RawByteString will just use appropriate overload.

There might be other differences, I haven't been fiddling around 8-bit strings (writing new code, besides the one I already have) for at least 5 years now.

18 minutes ago, Rollo62 said:

In case it works and solves my issues, I could live with that, but still there might be better solutions around.

None that I have found so far. I am using RawByteString for mixed ASCII/UTF8/binary data for 10 years now. No problems.

Rollo62 · October 2, 2020

Yes, from Marco's guide this is exactly what I'm looking for,

regarding the RawByteString

Quote

Declaring variables of type RawByteString for storing an actual string should rarely be done.

Given the undefined code page, this can lead to undefined behavior and potential data loss.

On the other hand if your goal is saving binary data using a string-like memory allocation and representation, you can use the RawByteString in the same way you used AnsiString in past versions of Delphi.

Replacing non-string code that used AnsiString with RawByteString is an interesting migration path.

If RawByteString is still current, then I will choose it as my preferred solution.

Since it has been in the Delphi environment for that long, it is probably very likely to stay there in the future too.
I will try it and check how it behaves and compares to the pure TBytes solution.

Thanks to all for putting your arguments into this interesting discussion.

Edited October 2, 2020 by Rollo62

FPiette · October 2, 2020

Quote

Usually SOT with CrLf or Lf as EOT for the ASCII data representation,
the binary data representation also has SOT/EOT of some kind but can be fixed length data too.

Quote

the data will be distributed into various processors, which know how to handle the data right.

Quote

The original source is TBytes, so my first consideration is to keep TBytes as buffer data type.

You'll get performance if you avoid copying data. If your receiver component accepts a buffer (you said TBytes, which is a TArray<Byte>), you can keep the data in that buffer, add that buffer to a linked list of pointers to those TBytes buffers, and pass another buffer for the next receive.

Each "processor" will then receive the data it knows how to handle and produce some result, maybe in the same buffer or in a new one.

Once a buffer is no longer needed, it is freed or, better, put on a second linked list of available buffers and reused later. You'll avoid memory allocation/deallocation, which produces memory fragmentation, which is also very bad if the application has to run continuously.

Resizing an array (remember TBytes is an array) is an inefficient operation. It involves copying data because the array can't always be expanded in place, and obviously it involves memory alloc/free because you use a dynamic array (TBytes).

I don't know what your receiver looks like nor if the data size is known up front. But if possible, it is better to preallocate a buffer large enough for, let's say, 90% of the occurrences and reuse that buffer again and again (because, as I said above, you have a linked list of free buffers).
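The free-list idea described above could look roughly like the following sketch. All names (TBufferNode, TBufferPool, Acquire, Release) are hypothetical and chosen for illustration only. The sketch is not thread-safe, so with a separate receiver thread some locking around Acquire and Release would be needed.

```pascal
program BufferPoolDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;  // declares TBytes

type
  PBufferNode = ^TBufferNode;
  TBufferNode = record
    Data: TBytes;       // preallocated storage, reused across receives
    Used: Integer;      // number of valid bytes currently in Data
    Next: PBufferNode;  // link to the next node in whatever list holds it
  end;

  // Hypothetical pool: hands out buffers and keeps released ones on a free list.
  TBufferPool = class
  private
    FFree: PBufferNode;  // head of the free list
    FBufSize: Integer;
  public
    constructor Create(ABufSize: Integer);
    destructor Destroy; override;
    function Acquire: PBufferNode;
    procedure Release(Node: PBufferNode);
  end;

constructor TBufferPool.Create(ABufSize: Integer);
begin
  inherited Create;
  FBufSize := ABufSize;
end;

destructor TBufferPool.Destroy;
var
  Node: PBufferNode;
begin
  // Only nodes on the free list are owned here; callers must release theirs first.
  while FFree <> nil do
  begin
    Node := FFree;
    FFree := Node^.Next;
    Dispose(Node);
  end;
  inherited;
end;

function TBufferPool.Acquire: PBufferNode;
begin
  if FFree <> nil then
  begin
    Result := FFree;            // reuse: no allocation at all
    FFree := Result^.Next;
  end
  else
  begin
    New(Result);                // allocate only when the free list is empty
    SetLength(Result^.Data, FBufSize);
  end;
  Result^.Used := 0;
  Result^.Next := nil;
end;

procedure TBufferPool.Release(Node: PBufferNode);
begin
  Node^.Next := FFree;
  FFree := Node;
end;

var
  Pool: TBufferPool;
  A, B: PBufferNode;
begin
  Pool := TBufferPool.Create(4096);
  try
    A := Pool.Acquire;
    Pool.Release(A);
    B := Pool.Acquire;
    Writeln(A = B);  // prints TRUE: the released buffer was reused
    Pool.Release(B);
  finally
    Pool.Free;
  end;
end.
```

The incoming-data queue would be a second list of the same nodes; a node simply moves between the two lists without its storage being copied or reallocated.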

Rollo62 · October 2, 2020

16 minutes ago, FPiette said:

You'll get performance if you avoid copying data. If your receiver component accepts a buffer (you said TBytes, which is a TArray<Byte>), you can keep the data in that buffer, add that buffer to a linked list of pointers to those TBytes buffers, and pass another buffer for the next receive.

Right, but as I explained I need to decouple source and processors by a circular buffer.

The source can provide data quickly, from a thread,
while the processors will analyse and send the decoded data to different consumers.

That's why I'm looking for an intermediate buffer solution, to pre-process the data until a complete chunk can be identified.
I have sources where a transmission is not complete in one step; only the next transmission(s) may complete a data chunk.

Fr0sT.Brutal · October 2, 2020

1 hour ago, Lars Fosdal said:

Another factor for your choice:

TBytes - zero based indexes

AnsiString/RawByteString - one based indexes

So - doing over will be a pain.

That's why hard-coded index should be avoided where possible.

@str[1] => Pointer(str)

@arr[0] => Pointer(arr)

for i := 0 to Length(arr) - 1 => for i := Low(arr) to High(arr) / for elem in arr

Low and High for strings will also work in modern compilers
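As a small sketch of that advice, the two loops below are textually identical apart from the element access, although one container is zero-based and the other one-based. This assumes a compiler recent enough to accept Low and High on string variables (XE3 or later, as far as I know); SumBytes and SumChars are hypothetical names.

```pascal
program IndexDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;  // declares TBytes

// Sums all bytes of a dynamic array (zero-based) without hard-coding 0.
function SumBytes(const Arr: TBytes): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := Low(Arr) to High(Arr) do
    Inc(Result, Arr[I]);
end;

// Sums all bytes of a byte string (one-based on desktop) without hard-coding 1.
function SumChars(const Str: RawByteString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := Low(Str) to High(Str) do
    Inc(Result, Ord(Str[I]));
end;

var
  Arr: TBytes;
  Str: RawByteString;
begin
  Arr := TBytes.Create(1, 2, 3);
  Str := #1#2#3;
  Writeln(SumBytes(Arr));
  Writeln(SumChars(Str));
  // Empty containers are handled too: High is below Low, so the loops do not run.
  Writeln(SumBytes(nil));
  Writeln(SumChars(''));
end.
```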

FPiette · October 2, 2020

11 minutes ago, Rollo62 said:

Right, but as I explained I need to decouple source and processors by a circular buffer.

The source can provide data quickly, from a thread,
while the processors will analyse and send the decoded data to different consumer.

So this is perfect for a linked list! The linked list will grow as the source data thread adds more buffers at the end. Each buffer is taken from the list of released buffers, or newly allocated if none is free.

Processors will take a buffer from the start of the linked list for processing and then move the buffer to the released linked list.

Quote

That's why I'm looking for an intermediate buffer solution, to pre-process the data until a complete chunk can be identified.

I have sources where a transmission is not complete in one step; only the next transmission(s) may complete a data chunk.

Then you have to have one linked list per data source. When a processor retrieves the first buffer from the linked list and determines it is not complete, the buffer is flagged and stays in the linked list. When the next data chunk arrives and the first buffer in the list is flagged as incomplete, it is flagged as a fragment so that the processor knows that it must also retrieve the next item in the list (or the next few) to get the complete data.

btw: Linked lists are frequently used to implement queues (FIFO, LIFO). I never seriously looked at how the Delphi RTL implements TQueue<T>, but I think it is an array, which is less efficient.

David Heffernan · October 2, 2020

1 hour ago, Rollo62 said:

Maybe I should say that it contains maybe 95% string and 5% binary data, as it may come from various sources.

It kinda makes no sense then that you also say that the data is ASCII (0..127). I'm very confused.

Rollo62 · October 2, 2020

10 minutes ago, FPiette said:

So this is perfect for a linked list! The linked list will grow as the source data thread adds more buffers at the end. Each buffer is taken from the list of released buffers, or newly allocated if none is free.

Yes maybe, but then the list items will include incomplete chunks, and I have to tear them apart and reconstruct them later.
When the source data arrives I don't have enough time to analyse it, just push and go.

If I push to lists, then this might happen:

ListMessages:

- 11111
- 111        //<== Ok, this I can combine easily
    = 11111111
- 222
- 222222
- 2223333 // <== This I have to combine and split, so that 3333 can be used later
        = 222222222222
     +3333
...
- 33333333

= 333333333333

Yes, it's possible, but a little more tricky.

Edited October 2, 2020 by Rollo62

Rollo62 · October 2, 2020

14 minutes ago, David Heffernan said:

It kinda makes no sense then that you also say that the data is ASCII (0..127). I'm very confused.

Sorry to confuse,
I meant the original, old ASCII character set of usually printable characters (>= 32, < 127), plus controls like Cr, Lf, Tab, ...
https://theasciicode.com.ar/ascii-printable-characters/space-ascii-code-32.html

To me this is "ASCII 0 ... 127", sorry I have no better name for that set; maybe it's compatible with CP_437.
These characters mostly do no harm, and in memory they are 1:1 byte compatible.

I see such character sets mainly produced by embedded devices, and what I like most is that it's a human-readable "string".

Edited October 2, 2020 by Rollo62

FPiette · October 2, 2020

7 minutes ago, Rollo62 said:

Yes, it's possible, but a little more tricky.

Not tricky but less trivial. As someone else said in this conversation, performance often means a more complex solution. You put your priority where you want!

Quote

- 2223333 // <== This I have to combine and split, so that 3333 can be used later

Maybe you don't know, but I'm the author of ICS (Internet Component Suite), and what you describe is exactly what happens with a TCP stream, which is what a TCP socket gives you. So I have extensive experience with this kind of receive handling.

There are 3 ways to delimit data:

1) There are delimiters such as STX/END around a data block, or a CRLF (or any other end-of-block marker) at the end of a data block.

2) Each block begins with a data length.

3) Communication is closed or broken

You should move the determination of whether a data block is complete or incomplete into the receiving process or thread, not the processing thread. This means that the same buffer (pointer and length, or pointer to start, offset and length) is given to the low-level receiver until a complete data block has been received. Only then is this data block put at the end of the incoming data linked list (or queue, if you prefer this term) for processing.
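A minimal sketch of delimiter-based framing on the receiving side (case 1 above), assuming blocks end with a single LF byte. TFrameAssembler and its methods are hypothetical names, and the sketch favours clarity over speed: it rescans the pending bytes and copies the remainder on every extraction.

```pascal
program FramingDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;  // TBytes, TEncoding

type
  // Hypothetical helper: collects received fragments and hands out complete blocks.
  TFrameAssembler = class
  private
    FPending: TBytes;  // bytes received so far that do not yet form a full block
  public
    procedure Append(const Fragment: TBytes);
    function TryExtract(out Block: TBytes): Boolean;
  end;

procedure TFrameAssembler.Append(const Fragment: TBytes);
var
  OldLen: Integer;
begin
  if Length(Fragment) = 0 then
    Exit;
  OldLen := Length(FPending);
  SetLength(FPending, OldLen + Length(Fragment));
  Move(Fragment[0], FPending[OldLen], Length(Fragment));
end;

// Returns True and the next block (without its LF) if one is complete.
function TFrameAssembler.TryExtract(out Block: TBytes): Boolean;
var
  I: Integer;
begin
  for I := 0 to High(FPending) do
    if FPending[I] = 10 then  // LF marks the end of a block
    begin
      Block := Copy(FPending, 0, I);
      // Keep whatever follows the LF for the next block.
      FPending := Copy(FPending, I + 1, Length(FPending) - I - 1);
      Exit(True);
    end;
  Block := nil;
  Result := False;
end;

var
  Framer: TFrameAssembler;
  Block: TBytes;
begin
  Framer := TFrameAssembler.Create;
  try
    // Three receives that do not line up with block boundaries.
    Framer.Append(TEncoding.ASCII.GetBytes('AB'));
    Framer.Append(TEncoding.ASCII.GetBytes('C'#10'DE'));
    Framer.Append(TEncoding.ASCII.GetBytes('F'#10));
    while Framer.TryExtract(Block) do
      Writeln(TEncoding.ASCII.GetString(Block));  // prints ABC, then DEF
  finally
    Framer.Free;
  end;
end.
```

Only complete blocks would then be appended to the processing queue, so the processors never see a fragment.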
