Jump to content
chkaufmann

String memory usage

Recommended Posts

I have a question regarding memory usage for strings.

 

When I load a big number of persons, there are many hundred with the same name. Is this handled automatically so that there will be only one copy of the string for a certain name or should I write my own code for this? Or does it depend how the strings are loaded and to what kind of variable these are assigned?

 

I didn't find a good source to read about this topic. Any hints for that?

 

Thanks

 

Christian

Share this post


Link to post

Depends on how you store the loaded names. If these are records from a database, for instance, and you access them via a query or table component each result row will have its own memory, regardless of what it contains. With a query you have the option of removing duplicates in the SQL statement used, but only if the row does not have any other columns with non-duplicate content.

If you load the names into a TStringlist you can set the list's Duplicates property to dupIgnore, it will then discard values already in the list when you try to add them.

Share this post


Link to post

Strings are copy-on-write, but the compiler doesn't do any magic to automatically de-duplicate strings. The only way you get COW efficiency in memory is when you assign one string to another. Some classes have code for deduplication, like TStringList, but you would know if you were using it and it probably doesn't apply to your situation.

  • Like 1

Share this post


Link to post

I think the official term being referenced is 'string interning'.https://en.wikipedia.org/wiki/String_interning

 

Scenarios where it may be useful is with JSON, results from dbs, or when custom collections containing strings from file or network. Depending on how the data is pulled in, you may have a spike in memory if all f the raw data is loaded, followed by the interning process which may normalise the structures. If the data is loaded incrementally, memory utilisation should correlate to the levels of duplication you know exists in the data. There is some overhead to the interning process in terms of constantly trying to deduplicate, so it depends on your scenario. If you keep the data in memory for a longish period of time, I think it can be useful to do this, especially if you are a ware that memory utilisation is a concern in your problem space.

 

C# has a .Intern() method on string https://learn.microsoft.com/en-us/dotnet/api/system.string.intern?view=net-8.0. Similar in Java https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/lang/String.html#intern()

 

As Delphi strings are reference counted, you could say that having multiple variables assigned to the same string would be a form of interning. e.g.

var str := 'hello world';
var str2 := str; 
var str3 := str;

All of the above will reference the same data. Under the hood there should be a reference count (3).

 

A non thread-safe approach to illustrate leveraging the above property by managing a TDictionary<string,string>:

var GInternedStrings : TDictionary<string,string>;

function Intern(const AString:string) : string;
begin
     if not GInternedStrings.TryGetValue(AString, Result) then begin
       GInternedStrings.Add(AString, AString);
       Result := AString;
     end;
end;

initialization

GInternedStrings := TDictionary<string,string>.Create;

finalization

GInternedStrings.Free;

Here we see a lookup being done on the pool, and result being populated with a value from the pool if it exists. If no value exists, we simply add the value too the pool, and return the original.

This could be used like:

for var rec in records do
begin
   rec.str := Intern(rec.str);
end;

 

Obviously, if you decide to go multi threaded, you would need to introduce some synchronisation, which will add some performance overhead due to locking and unlocking to keep the dictionary consistent.

 

If you don't do anything multithreaded, you can get away with not having a sync object and could even just allocate a  local pool in your loading procedure to localise the lifetime of the pool.

 

Having the global pool will mean that you will have that memory 'permanently' allocated... so however you decide to manage the pool depends on your use case.

 

Ideally, the functionality that loads data, such as TJSONValue.ParseJSONValue() or db query libraries would offer an option to do this. However, in many cases, I suspect it may be seen as an overhead in itself if the data is loaded into memory, processed and discarded. So the lifetime of the data in the app and how it is used is relevant to the effort of doing this all over.

 

 

Share this post


Link to post

EDIT:  The answer is very "basic", I don't think I fully understand what you are asking ...

 

I think it refers to differences from other languages where an "array" is automatically created in certain cases.
In Pascal every object, variable, type or other must be known and defined before use.
So in your case, knowing that you will have more "strings" to use you must predict their use using for example these solutions (there can be a thousand others that can be created, but these are some of those that the language and the predefined types handle natively, except for database that is another kind of container):

 

Here same ref: https://docwiki.embarcadero.com/RADStudio/Athens/en/Structured_Types_(Delphi)#Arrays

 

1) static arrays;
2) dynamic arrays;
3) Lists;
4) Collections;

5) Dictionary;
6) Generics array;
7) ....

 

1) Static arrays are arrays with a predefined and immutable length, they are defined at compiler time and are fine when you know how long your list is at most.

//This array contains 1000 strings (i.e. for example a thousand names)
var myarray: array [0..999] of string;
    i: integer;
myarray[0] := ' Mario';
myarray[1] := ' Maria';
//loop through the entire array
for i:= Low(myarray) to High(myarray) do
  ShowMessage(myarray[i]);

2) Dynamic arrays:
they are defined at compiler time (with the number of elements predefined at 0) and can be changed at runtime.

var myarray: array of string;
    s: string;
//Creates space for a thousand strings
//I can change this number whenever I need it during the program, either increasing or decreasing it
SetLength(myarray, 1000);
myarray[0] := ' Mario';
myarray[1] := ' Maria';

//loop through the entire array, you can also use the same syntax with for
//as for the static array.
for s in myarray do
  ShowMessage(s);
........
........
//At the end of the use, the space must be returned by setting the length to zero
SetLength(myarray, 0);

 

3) and 4) and 5) They are specialized "containers": https://docwiki.embarcadero.com/Libraries/Athens/en/System.Generics.Collections

 

6) Generics can contain any "thing", in this example I show you the use with strings:

var myarray: TArray<string>;
s: string;
//Create space for a thousand strings
//I can vary this number whenever I need it during the program, both increasing and decreasing it
SetLength(myarray, 1000);
myarray[0] := ' Mario';
myarray[1] := ' Maria';
//loop through the entire array, you can also use the same syntax with for
//as for the static array.
for s in myarray do
  ShowMessage(s);
.......
.......
//At the end of the use, the space must be returned by setting the length to zero
SetLength(myarray, 0);

7) ... you can create your own array, for example with adavanced record:

type
 //Really simple example
 TAnag = record
    private
      fName: string;
      fAge: integer;
    public
      property Age: integer read fAge write fAge;
      property  Name: string read fName write fName;
end;

var Anag: TArray<TAnag>;
    s: TAnag;
SetLength(Anag, 1000);
Anag[0].Name := 'Mario';
Anag[0].Age := 21;
Anag[1].Name := 'Maria';
Anag[1].Age := 18;
//loop through the entire array, you can also use the same syntax with for
//as for the static array.
for s in Anag do
  ShowMessage(s.Name);
.......
.......
//At the end of the use, the space must be returned by setting the length to zero
SetLength(Anag, 0);

 

A lot of functions in standard library return an array of object, like strings, integers of doubles.  In this case you can assign to an empty generic or dynamic array the result ... automatically the array will be resized. Remeber to free it after used, example;

 

var Data: TArray<string>;

      Combo: TComboBox;

....

....

Data := Combo.Items.ToStringArray;

.....

.....

SetLength(Data, 0);

 

Edited by DelphiUdIT

Share this post


Link to post

What about

//           Key     Value
TDictionary<Integer,string>

with Integer Key as Hash from the original string?
This will allow you to handle and operate with key's and only retrieve the real, original strings whenever necessary.

Edited by Rollo62

Share this post


Link to post

FWIW, I once, as an experiment, implemented string interning using a dictionary in an application that contained hundred of thousands of strings with about 75% duplicates. It was a 32-bit application which was hitting the 4Gb limit and running out of memory.

Sure, it saved a few hundred megabytes but the overhead of the dictionary lookup completely killed the performance.

 

With 64-bit and virtual memory I can't see any reason at all to do this exerciser.

Share this post


Link to post
1 hour ago, Anders Melander said:

overhead of the dictionary lookup completely killed the performance.

If you were using the RTL dictionary with its abysmal performance, that does not surprise me at all.

Mostly because of its poor hash function, replacing that with a better one speeds it up significantly. However, string interning does not require a dictionary with key/value but just a hashtable for strings.

DelphiAST has the option to use one for its parsing - we needed that when using it in the context of the IDE to avoid spikes in memory consumption.

  • Like 5

Share this post


Link to post
38 minutes ago, Stefan Glienke said:

DelphiAST has the option to use one for its parsing

This is the unit @Stefan Glienke was referring to.   100 lines and it uses no other unit!

Edited by pyscripter
  • Like 3

Share this post


Link to post

I sort of assumed that 16777619 in that code was a prime, but wanted to check...

ChatGPT 4o mini said "No".  

image.thumb.png.d022a5c59dfc866939cc3cee87abd82a.png

So, I asked CoPilot in Norwegian - and it said "No" (Nei)
image.thumb.png.1bafdfa87d105f567661257c3df9ecbb.png

On a whim, I then asked it in English
image.thumb.png.c4e84c087fdf199779180b59fea97690.png

 

LLMs - Lovely Lying Machines.

  • Haha 3

Share this post


Link to post
10 hours ago, Lars Fosdal said:

It supposedly was a trivial question.

AI can't even count properly. Why would it be able to know whether number is prime or not?

Share this post


Link to post
2 hours ago, Lars Fosdal said:

Well, the English CoPilot knew. But, the Norwegian one did not.

That sounds more like Quantum Mechanics than Mathematics.

  • Haha 4

Share this post


Link to post
17 hours ago, Lars Fosdal said:

It supposedly was a trivial question.

There are no trivial questions for LLMs...or perhaps I should rephrase that: All questions are considered equally trivial by an LLM. Unless your specific number was covered in a publication it has scanned, it will "know" nothing about that number. Hell, it might even believe it to be your grandmas name, if you should claim so.

  • Like 1

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×