Rollo62 Posted October 2, 2020 (edited)

Hi there,

I need to choose a basic type for caching and manipulating binary data, which is mostly represented as String but could also be pure Byte data. The problem is that I need to analyse, chop, copy, append and recombine this buffer in several places, and in the end the data will be a string most of the time.

The original source is TBytes, so my first consideration is to keep TBytes as the buffer data type. The original data mostly contains ANSI strings, but in some cases it may also contain binary (Byte) data, 0 ... 255.

In short, the basic question is maybe: with the original source data as TBytes,
- keep TBytes for buffer manipulations, and convert parts to string later in different places, or
- immediately convert all TBytes to e.g. String, and use String for manipulating the data, even if some of the data may be binary.

TBytes:
My gut feeling is that TBytes is probably not the most efficient data type for handling data, since its dynamic handling is not supported very well by the compiler. A lot of pointer tricks and memory moves are needed to make it efficient.

String:
Strings, on the other hand, are very efficient and optimized, using tricks like copy-on-write to make them fast and easy. I use them in many places and they always behave well and efficiently. The drawback is that Strings are Char-based, which should double the memory footprint compared to Byte. And what would be the right codepage for the encoding then?

RawByteString:
The alternative RawByteString is not recommended, only as a replacement for older AnsiStrings with codepage issues. So it clearly has another use case:

Quote:
    RawByteString should only be used as a parameter type, and only in routines which otherwise would need multiple overloads for AnsiStrings with different codepages. In general, it is recommended that string processing routines should simply use "string" as the string type.
    Declaring variables or fields of type RawByteString should rarely, if ever, be done, because this practice can lead to undefined behavior and potential data loss.

AnsiString (without specific codepage):
I could take AnsiString without a codepage as the base type, which would possibly reach the same efficiency as String, but AnsiString was once deprecated and removed from the modern platforms, and that leaves a bad taste. It seems that it came back only after massive complaints from the community.

So my current decision tends towards using pure String as the base type:

    type
      BufferType = String;

    var
      FBuffer : BufferType;

    ...

    //<== Single point of source data
    procedure SourceData( AData : TBytes );
    begin
      FBuffer := EncodeAsASCII( AData ); // use no specific codepage, or DOS-like, to simply use Byte (0 ... 255) as elements

      ...

      // FBuffer copy, move, indexof, concat, ... //<== Further processing on FBuffer with effective string routines
    end;

Is that the right decision, ignoring the doubled footprint in favor of speed?
So which option should I choose, A., B., C., or have I maybe overlooked another possible option?

I hope that you can help me with that decision.
David Heffernan Posted October 2, 2020

It would be perverse to use 16 bit Char to store 8 bit data. In terms of performance, byte strings and byte arrays are similar, but if anything byte arrays will be faster, precisely because they don't have copy on write. No idea why you think strings perform better. My guess is that your antipathy to byte arrays is a hangover from the legacy Delphi anti-pattern of handling byte arrays as strings.
Kryvich Posted October 2, 2020

I would use RawByteString in this case, despite the warning in the manual. (Actually, that's what I do in such cases.) I don't think these strings will be thrown out of the RTL in the future. Use raw byte strings inside the module, explicitly converting them to Unicode strings in the output.
Rollo62 Posted October 2, 2020 (edited)

@David Heffernan I basically thought the same; it is only a feeling that TBytes gets a little slower. I have not performance tested it yet, but the data comes from an external device.

The reason I like strings is that they support chopping, deleting unwanted chars and recombining very effectively. And that's what I need to do with the data: I receive it quickly into a ring buffer, then need to analyse, chop and redistribute it to many other places. The problem is that the data may come from very different external devices, but it all needs to be managed in the same processor.

Quote:
    Would be perverse to use 16 bit Char to store 8 bit data.

You're right, that's why I also consider AnsiString, but is it still recommended to use it?
Pawel Piotrowski Posted October 2, 2020

If you are more comfortable using strings, then just use RawByteString. TBytes is fine, too. There will be no real performance drawback in using it.

RawByteString has a possible drawback, depending on how you use it. There are some functions that expect a string but will happily accept a RawByteString, doing an implicit conversion. That conversion from RawByteString to String and then back to RawByteString can cost some performance... You will not have such a problem with TBytes.

So if you go with RawByteString, keep a close eye on your compiler warnings 🙂
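One way to act on the "watch your compiler warnings" advice is to promote the implicit string cast warnings to errors in the unit that owns the buffer. The following is a minimal sketch, assuming a desktop Delphi compiler where RawByteString is available; the program and variable names are made up for illustration.

```pascal
program CastWarningsDemo;
{$APPTYPE CONSOLE}

// Turn the implicit string cast warnings into compile errors, so a hidden
// RawByteString <-> string conversion cannot slip in unnoticed.
{$WARN IMPLICIT_STRING_CAST ERROR}
{$WARN IMPLICIT_STRING_CAST_LOSS ERROR}

uses
  System.SysUtils;

var
  Raw: RawByteString;
  U: string;
begin
  Raw := 'ABC';

  // With the directives above, the next two lines would stop compiling:
  // U := Raw;               // implicit cast RawByteString -> string
  // Raw := UpperCase(U);    // implicit cast string -> RawByteString

  // An explicit cast still compiles and makes the conversion visible.
  U := string(Raw);
  Writeln(U);
end.
```

With this in place, every conversion between the byte-oriented buffer and Unicode strings has to be written out, which is also where any performance cost becomes visible.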
Rollo62 Posted October 2, 2020 (edited)

@Kryvich Right, that was my first thought too, but then see the docs ... I don't want to keep changing my code in the future.

By the way, there was a similar discussion here, regarding codepage 437. I found another possible reason to use codepage 437, because TEncoding includes something like this:

    class function TEncoding.GetEncoding(CodePage: Integer): TEncoding;
    begin
      case CodePage of
    {$IFDEF ANDROID}
        437: Result := TCP437Encoding.Create;
    {$ENDIF ANDROID}
        1200: Result := TUnicodeEncoding.Create;
        1201: Result := TBigEndianUnicodeEncoding.Create;
        CP_UTF7: Result := TUTF7Encoding.Create;
        CP_UTF8: Result := TUTF8Encoding.Create;
      else
        Result := TMBCSEncoding.Create(CodePage);
      end;
    end;

I need Android, so codepage 437 would be supported in a special way there, maybe with better performance, for converting back and forth with Unicode strings.

Anyway, @Remy Lebeau proposed to use "codepage 28591 (ISO-8859-1) instead", which probably should be fine on Android too. He also notes that RawByteString may have conversion issues:

Quote:
    More accurately, there is no conversion only when an AnsiString(N)-based string type is assigned to it, as it will simply inherit N as its current codepage, but it does perform a character conversion when a UnicodeString or WideString is assigned to it, and when it is assigned to another non-RawByteString string type. So, even if you were to use RawByteString, you still have to be careful with how you use it.

Since my data mainly consists of "string data", and only in some cases of "binary data", I think early conversion to string is better for manipulation than carrying and manipulating TBytes and only converting to strings late.

RawByteString has the ugly note in the docs, but AnsiString seems to have returned as a full citizen in Delphi again, so maybe AnsiString with codepage 28591 or codepage 437 is the way to go?
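The codepage 28591 idea can be tried out with a short round trip through TEncoding. This is only a sketch under assumptions: the sample bytes are invented, and whether every value 0 ... 255 survives the round trip depends on the platform's codepage support, so the program checks the result rather than assuming it.

```pascal
program Latin1RoundTrip;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Latin1: TEncoding;
  Source, Back: TBytes;
  Text: string;
  Same: Boolean;
begin
  // GetEncoding returns a new instance that the caller has to free.
  Latin1 := TEncoding.GetEncoding(28591);
  try
    // Invented sample: ASCII mixed with binary values, including #0 and $FF.
    Source := TBytes.Create($41, $00, $7F, $80, $FF);

    Text := Latin1.GetString(Source);   // bytes -> UnicodeString
    Back := Latin1.GetBytes(Text);      // UnicodeString -> bytes

    Same := (Length(Back) = Length(Source)) and
            CompareMem(@Source[0], @Back[0], Length(Source));

    Writeln('Chars after decoding: ', Length(Text));
    Writeln('Round trip lossless:  ', Same);
  finally
    Latin1.Free;
  end;
end.
```

If the check reports a lossless round trip on all target platforms, a single-byte codepage like this is a candidate for the "early conversion" approach; if not, the binary parts need to stay in a byte-based type.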
FPiette Posted October 2, 2020

1 hour ago, Rollo62 said:
    I need to choose a basic type for caching and manipulating binary data, which is mostly represented as String but could also be pure Byte data. The problem is that I need to analyse, chop, copy, append and recombine this buffer in several places, and in the end the data will be a string most of the time.

You forgot a data structure specifically designed to handle the processing of your data, for example a linked list of some basic data type. Since you don't say anything about the details of your processing, I can't be more specific.
Kryvich Posted October 2, 2020

1) Recommendations can change over time.
2) Look at Microsoft: they used UTF-16 as the default string for Unicode Windows in the beginning, and now they are trying to use UTF-8 instead.
Rollo62 Posted October 2, 2020 (edited)

4 minutes ago, FPiette said:
    You forgot a data structure specifically designed to handle the processing of your data, for example a linked list of some basic data type. Since you don't say anything about the details of your processing, I can't be more specific.

The problem is the speed. Incoming data can arrive very fast, so I need to store it in a ring buffer and extract and process it later. Exactly for that reason I'm looking for the right buffer data structure. Since the data is maybe 95% string and 5% binary, my thought is that something string-like could be preferred.
A.M. Hoornweg Posted October 2, 2020

Strings should not be used for byte manipulation (which does not contradict an earlier post of mine, explaining how to get legacy code libraries working if they do this).

Rather, write a new class or class helper that enhances tbytes and offers the functionality that makes strings so practical (insert, delete, append, concatenate, pos). Maybe even some new classes "tBytelist" and "tByteBuilder" as an analog to tStringlist and tStringbuilder.
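The string-like operations mentioned above (concatenate, pos) can be sketched for TBytes with two plain functions. The names ConcatBytes and IndexOfBytes are hypothetical and not part of the RTL; they could equally be wrapped in a record helper for TBytes.

```pascal
program BytesOpsDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns a new array holding A followed by B.
function ConcatBytes(const A, B: TBytes): TBytes;
begin
  SetLength(Result, Length(A) + Length(B));
  if Length(A) > 0 then
    Move(A[0], Result[0], Length(A));
  if Length(B) > 0 then
    Move(B[0], Result[Length(A)], Length(B));
end;

// Hypothetical helper: zero-based index of the first occurrence of Pattern
// in Data, or -1 when it does not occur (or Pattern is empty).
function IndexOfBytes(const Pattern, Data: TBytes): Integer;
var
  I: Integer;
begin
  if Length(Pattern) > 0 then
    for I := 0 to Length(Data) - Length(Pattern) do
      if CompareMem(@Data[I], @Pattern[0], Length(Pattern)) then
        Exit(I);
  Result := -1;
end;

var
  Header, Payload, Frame: TBytes;
begin
  Header := TBytes.Create($01, $02);
  Payload := TBytes.Create($48, $69);        // 'Hi' in ASCII
  Frame := ConcatBytes(Header, Payload);

  Writeln(Length(Frame));                    // 4
  Writeln(IndexOfBytes(Payload, Frame));     // 2
end.
```

On Delphi XE7 and later, dynamic arrays also support concatenation with + as well as Insert and Delete out of the box, so part of this may already be covered by the compiler; a search routine like the one above still has to be written by hand.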
Rollo62 Posted October 2, 2020

1 minute ago, A.M. Hoornweg said:
    Strings should not be used for byte manipulation (which does not contradict an earlier post of mine, explaining how to get legacy code libraries working if they do this). Rather, write a new class or class helper that enhances tbytes and offers the functionality that makes strings so practical (insert, delete, append, concatenate, pos). Maybe even some new classes "tBytelist" and "tByteBuilder" as an analog to tStringlist and tStringbuilder.

As I said, 95% of the data is string source, so what decision should I make then? I would tend to prefer string over Bytes, as this is the main data used.
Rollo62 Posted October 2, 2020 (edited)

3 minutes ago, A.M. Hoornweg said:
    Rather, write a new class or class helper that enhances tbytes and offers the functionality that makes strings so practical (insert, delete, append, concatenate, pos). Maybe even some new classes "tBytelist" and "tByteBuilder" as an analog to tStringlist and tStringbuilder.

Yes, I have that, but the strings are so much more elegant. I would wish to have better TBytes and dynamic array support like that in Delphi.
Lars Fosdal Posted October 2, 2020

What format is the string source? UTF-8? Ansi?
How are string buffers measured? Byte/word length? Character length? Zero terminated?
Rollo62 Posted October 2, 2020

Just now, Lars Fosdal said:
    What format is the string source? UTF-8? Ansi? How are string buffers measured? Byte/word length? Character length?

No, just basic ASCII 0 ... 127, but I cannot rule out that in the future there may be some "failure" in the data. We work with several vendors for the data sources, so some other codepage might turn up in between. For now I consider that a failure, and I would need to find a workaround once it happens.
M.Joos Posted October 2, 2020

If the stream of data is a mix of characters and binary data, I would strongly recommend not using string to hold the data. A (Unicode) string should only be used for character data, otherwise you can expect a lot of unforeseeable surprises.
Kryvich Posted October 2, 2020

@A.M. Hoornweg Why duplicate what the standard library already does for strings, including byte manipulation and copy-on-write?
Rollo62 Posted October 2, 2020

I tend to see AnsiString as the right candidate, as it supports codepages for possible future use, but I also question AnsiString's support in Delphi. Shall it stay, or shall it go? (according to a well known song)
A.M. Hoornweg Posted October 2, 2020

3 minutes ago, Rollo62 said:
    Yes, I have that, but the strings are so much more elegant. I would wish to have better TBytes and dynamic array support like that in Delphi.

Look up "operator overloading". It should be possible to do something like

    newbytes := something + somethingelse + 'Hello world';

But adding text to tBytes or searching for a text inside a tBytes would mean code page handling. You could consider defining a Binarystring type (type BinaryString = type AnsiString(28591)) because that's least likely to get messed up.
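The BinaryString suggestion can be sketched as follows. This assumes a compiler and platform where AnsiString is available; BytesToBinaryString is a hypothetical helper name, and the sample bytes are invented.

```pascal
program BinaryStringDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Distinct string type pinned to ISO-8859-1, as suggested in the post.
  BinaryString = type AnsiString(28591);

// Hypothetical helper: copies the bytes 1:1 into a BinaryString without
// any character conversion.
function BytesToBinaryString(const Data: TBytes): BinaryString;
begin
  SetLength(Result, Length(Data));
  if Length(Data) > 0 then
    Move(Data[0], Result[1], Length(Data));
end;

var
  Raw: TBytes;
  S: BinaryString;
begin
  // Invented sample: 'Hi' followed by two binary bytes.
  Raw := TBytes.Create($48, $69, $00, $FF);
  S := BytesToBinaryString(Raw);

  Writeln(Length(S));     // 4, the embedded #0 does not terminate the string
  Writeln(Ord(S[4]));     // 255, the byte value is stored unchanged
end.
```

The bytes are moved verbatim, so no codepage conversion happens until the value is assigned or cast to another string type.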
Rollo62 Posted October 2, 2020 (edited)

2 minutes ago, A.M. Hoornweg said:
    Look up "operator overloading".

I do. I think even a simple assignment operator is a showstopper ....
A.M. Hoornweg Posted October 2, 2020

6 minutes ago, Kryvich said:
    @A.M. Hoornweg Why duplicate what the standard library already does for strings, including byte manipulation and copy-on-write?

Because the data is not strings. But the in-memory representation of a dynamic array is rather similar so the functionality should be straightforward to duplicate. And thanks to generics, it's probably possible to design the base class to be universal, IIRC a tbytes is just something like "tarray<byte>" (but I may be wrong here).
David Heffernan Posted October 2, 2020

7 minutes ago, Rollo62 said:
    I tend to see AnsiString as the right candidate, as it supports codepages for possible future use, but I also question AnsiString's support in Delphi. Shall it stay, or shall it go? (according to a well known song)

At the start of this thread you said that the data was binary. Now you say it is ASCII. Hard to give advice on this basis.
Lars Fosdal Posted October 2, 2020

How are the strings delimited in the buffer?

IMO, if you need to copy the strings out of the buffer to work with them anyway, you might as well future-proof for non-ASCII and copy to a regular string with the appropriate routine. That allows you to sanity check the string in the context of its format, and to handle abnormalities.
Dalija Prasnikar Posted October 2, 2020

TBytes is the preferred type for manipulating binary data. However, it lacks some features and behaviors that string has: copy on write, easy concatenation, string manipulating routines (even if you have to copy and paste pieces of RTL code to adjust them for RawByteStrings, this is easier and faster than handling TBytes), and most important of all, debugging.

If those features are needed, then RawByteString is the only other option you have left. Which codepage to use is debatable, probably 437 (I have been using UTF8 because the textual data I am extracting from such strings is always in UTF8 format, so this makes handling easier).

The most important thing is that you should avoid calling ANY function that does a Unicode transformation. Usually that means some common Delphi string functions will be out of reach, and you need to make your own. Of course, the whole thing needs to be covered with tests to prevent accidental mistakes when you change pieces of code. With CP 437 you might get away with transformation problems, but any unnecessary conversion also means performance loss.

Now I will go sit in my corner waiting to be tarred and feathered.
Arnaud Bouchez Posted October 2, 2020 (edited)

I would rather use RawByteString for several reasons:
1. Proper reference counting;
2. (Slightly) faster allocation;
3. Better debugging experience in your case.

Some details:
1. TBytes = array of byte has a less strict reference counting. If you modify a TBytes item, all its copied instances will be modified. Whereas with RawByteString, the string will be made unique before modification (it is called Copy on Write - aka COW).
2. SetLength(TBytes) will allocate the memory and fill it with zeros, whereas SetLength(RawByteString) will just allocate the memory. If you use a memory buffer which will immediately be filled with some data, no need to fill it with zeros first.
3. Even if your RawByteString has some binary, the ASCII characters will be easier to read, e.g. #2#0#7'Some Text'#0'Some other text'#49.
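The reference counting difference in point 1 can be seen in a few lines. This is a minimal sketch, assuming a compiler where RawByteString is available and strings are 1-based; the variable names are illustrative.

```pascal
program CowDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  A, B: TBytes;
  S, T: RawByteString;
begin
  // Dynamic arrays: assignment copies the reference, not the data.
  A := TBytes.Create(1, 2, 3);
  B := A;
  B[0] := 99;
  Writeln(A[0]);   // 99, because A and B still share one buffer

  // Long strings: writing to T first gives T its own private copy.
  S := 'abc';
  T := S;
  T[1] := 'z';
  Writeln(S);      // abc, S is unaffected by the change to T
end.
```

To get an independent TBytes, the copy has to be requested explicitly, for example with Copy(A) before modifying it.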