Jud

Maximum static memory


What is the largest amount of static memory you can get in Windows?  I know about the 2GB limit on static structures (even on the 64-bit platform!).  The most static memory I can get is about 1,994,000,000 bytes; beyond that I start getting access violations.  Is there a way to get larger static arrays, or a way to get several static arrays under 2GB each, with the total static memory over 2GB?


Are you talking about a 64-bit process, or a 32-bit process running on a 64-bit system?

 

Guest

Well, this is not a short story, as there are many things to distinguish between, so I will try to cover a few points and will most likely miss a few too; thankfully, others here will help.

 

But first, let's make sure we are on the same page, as your post combined many things.

4 hours ago, Jud said:

What is the largest amount of static memory you can get in Windows?

What do you mean by "static"?

This is unclear; do you mean static memory declared as global variables in the project?

4 hours ago, Jud said:

I know about the 2GB limit on static structures (even on 64-bit platform!)

Here you mentioned a static structure, which is also unclear, because when you allocate an array of "something", the size of that "something" is included in the allocation. The simplest "something" is a byte, which translates to a size easily, but when you need an array of records or instances, the size depends on the full element size, including the hidden parts. For example, even an "array of Byte" includes a hidden header, which is a record holding the Length and the ReferenceCounter. That hidden header is 8 bytes on 32-bit, but 16 bytes on 64-bit: the Length becomes a 64-bit value while the reference counter is still 4 bytes, and 4 bytes of padding are added to align the data on 16 bytes. (Alas, the padding is in the wrong place; it should be in the middle, but that is a different story.)

4 hours ago, Jud said:

The most static memory I can get is about 1,994,000,000 bytes - then I start getting access violations.

This is also unclear: how did you hit the access violation when the allocation didn't raise an exception?

In theory, if an allocation succeeds, accessing it should not raise an AV.

4 hours ago, Jud said:

Is there a way to get larger static arrays or is there a way to get several static arrays under 2GB each, with the total static memory over 2GB?

There is, but it depends on a few factors.

 

No need to answer any of the above; read the following and then formulate your question(s) accordingly.

 

 

So instead of answering and commenting on each one of the above, I will go over the basics to clear things up, and you can then comment and ask in a more precise way.

1) Windows has a hard limit on the EXE image size that can be allocated. We must differentiate between the EXE on disk and the EXE at runtime: you can declare global vars with a huge size that require 1 GB, or close to 2 GB on 32-bit, at runtime, while the EXE size on disk stays at something like 30 MB or even 50 KB. On the other hand, you can build an EXE of 47 GB (random number) on disk while at runtime it occupies only a few MB. If you want a better understanding of the EXE, read the specification of the PE format; this part in particular is clear enough:

https://docs.microsoft.com/en-us/windows/win32/debug/pe-format#optional-header-image-only

Quote

PE32+ images allow for a 64-bit address space while limiting the image size to 2 gigabytes. Other PE32+ modifications are addressed in their respective sections.

In this quote, the first occurrence of "image" refers to the segments (sections) allocated in memory, while the second refers to the size of the EXE (PE32+) on disk.

 

2) In (1) I was talking about the EXE size and its impact and limits, which mainly depend on compiled resources, code, and globally allocated data; of course constants reserve space too. All of that should not be a problem unless you are allocating huge arrays as global vars, which is not recommended, for a few reasons: for example, if there is not enough memory on the system, your application will silently fail to run. So it is not recommended to use global vars with very large sizes; you are better off with runtime/dynamic allocations.

 

3) We need to understand the allocation of an array of "TSomething". This is simple enough: the needed size is CountOfElements * SizeOf(TSomething) + ArrayHeader, where CountOfElements is exactly equal to Length(Something) and ArrayHeader is either 8 or 16 bytes. Now, will this work every time? It should, but it will not, because there is another hidden factor: the Memory Manager. While the language itself requires a header, the MM requires additional header(s) or extra bookkeeping to manage the block; I am not sure how much, as that is entirely an MM implementation detail.
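The arithmetic above can be sketched as follows; this is a rough estimate only, using the 8/16-byte header figures from this post, and it deliberately ignores the memory manager's own bookkeeping (CPU64BITS is the standard Delphi define for 64-bit targets):

```pascal
// Language-side size of a dynamic array allocation (MM overhead NOT included).
function EstimatedArrayBytes(CountOfElements, ElementSize: Int64): Int64;
const
{$IFDEF CPU64BITS}
  ArrayHeader = 16; // 8-byte Length + 4-byte RefCount + 4 bytes padding
{$ELSE}
  ArrayHeader = 8;  // 4-byte Length + 4-byte RefCount
{$ENDIF}
begin
  Result := CountOfElements * ElementSize + ArrayHeader;
end;
```

So an array of 1,000 bytes costs 1,016 bytes on 64-bit before the memory manager adds its own, implementation-defined overhead on top.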

 

4) Now to the allocation itself and how to use it. Let's take SetLength as an example, as most allocations behave the same; here we need to understand the compiler behaviour first. See the following:

var
  BytesArr: TBytes;
......

  // The compiler fails to calculate this size correctly!
  SetLength(BytesArr, $7FFFFFFF * 3);        // almost 3 * 2 GB, exactly 3 * (2GB - 1 byte)

  // OK, and will compile
  SetLength(BytesArr, Int64($7FFFFFFF) * 3); // almost 3 * 2 GB, exactly 3 * (2GB - 1 byte)

The behaviour is the same on 32-bit and 64-bit, which is really annoying on 64-bit, as the compiler fails to default the expression to Int64; but that is not critical here. The main point is: how will you access the elements of BytesArr? (I do understand that 32-bit will fail to allocate this at runtime, as mentioned above.)

Will you use "Index: Integer"?

On 32-bit, assuming the allocation succeeded, that would be OK, but on 64-bit it is a problem, because here you need a bigger indexing type: the index must be Int64. So if you are allocating huge arrays on 64-bit, remember to use Int64.
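A minimal sketch of the indexing point, assuming a 64-bit target with enough RAM; with an Integer index, elements beyond offset 2GB - 1 would simply be unreachable:

```pascal
var
  BigArr: TBytes;
  I: Int64; // NOT Integer: Integer tops out at 2^31 - 1
begin
  SetLength(BigArr, Int64(3) * 1024 * 1024 * 1024); // 3 GB, 64-bit only
  I := 0;
  while I < Length(BigArr) do
  begin
    BigArr[I] := Byte(I and $FF);
    Inc(I);
  end;
end;
```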

 

One last thing: I have used, and am using, huge arrays, but not of that size, as I don't want my application to fail. I am caching many things on one server; it utilizes around 16 GB in pieces of less than 2 GB each, on a server with 32 GB of RAM, and nothing else runs on that server except MySQL with 8 GB of dedicated memory.

 

Always develop your application with integer-overflow and buffer-overflow (range) checking enabled; this will save you time and headaches later, and in many cases will predict AVs before they happen.
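In Delphi terms these are the $Q (overflow checking) and $R (range checking) compiler switches; a minimal sketch of what they catch:

```pascal
{$OVERFLOWCHECKS ON} // {$Q+}: raises EIntOverflow on integer overflow
{$RANGECHECKS ON}    // {$R+}: raises ERangeError on an out-of-range index

procedure ChecksDemo;
var
  A: array of Byte;
  N: Integer;
begin
  SetLength(A, 10);
  N := MaxInt;
  Inc(N);    // EIntOverflow here instead of silently wrapping negative
  A[N] := 0; // ERangeError here instead of a random AV much later
end;
```

Release builds often disable these switches for speed, but during development they turn silent corruption into immediate, precisely located exceptions.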

As a rule of thumb, I always use signed integers for indexing, meaning Integer or Int64. The cases where unsigned is needed are very rare and mostly come from OS APIs or other third-party libraries; for your own code, try to stick to signed integers.

Try to stick to smaller allocations. How small is "small" is relative, but 2 GB is the hard limit, and even approaching it is a problem in many cases, as it requires the OS to find such an allocation in one contiguous piece, which stresses system memory. Remember: the OS will provide you with that much virtual memory, but there is a price to pay, and in most cases the price is huge, as the OS will start paging memory to its swap file and the performance degradation will be severe; not only will your application suffer, but applications system-wide.

 

Hope that helps. Now, if you have more questions, please ask.

8 hours ago, FPiette said:

Are you talking about a 64-bit process, or a 32-bit process running on a 64-bit system?

 

It is on a 64-bit system. I tried both the 32-bit and 64-bit platforms in Delphi.

8 hours ago, Kas Ob. said:

Well, this is not a short story, as there are many things to distinguish between, so I will try to cover a few points and will most likely miss a few too; thankfully, others here will help.

 

But first, let's make sure we are on the same page, as your post combined many things.

What do you mean by "static"?

This is unclear; do you mean static memory declared as global variables in the project?

By static, I mean an array that is not a dynamic array. 

 

Briefly, what the program needs to do would naturally use a two-dimensional, non-rectangular dynamic array.  The array would have millions of rows, but each row would be a short list.  I wrote it that way, but performance is terrible.  The sizes aren't known until the procedure is run.  I believe memory is allocated in 4KB blocks (and zeroed out), which is a lot more than each row needs.  I think this is the bottleneck.

 

added... I thought of something else to try: set the length of the dynamic array of arrays once, outside the procedure, and enlarge it as necessary.  That may work if it doesn't use up too much memory.

Edited by Jud
added

12 minutes ago, Jud said:

By static, I mean an array that is not a dynamic array.

Please show a minimal reproducible example so that we can better understand what you mean.

 

21 minutes ago, Jud said:

It is on a 64-bit system. I tried both the 32-bit and 64-bit platforms in Delphi.

The situation is not the same on 32 or 64 bits.

 

14 minutes ago, Jud said:

The array would have millions of rows, but each row would be a short list.  I wrote it that way, but performance is terrible.  The sizes aren't known until the procedure is run. 

You need an array of pointers pointing to arrays of the data type (you didn't say anything about the data type you use/need). At first glance, you should probably use a kind of linked list of arrays of pointers.
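A sketch of that kind of layout; TItem here is a hypothetical element type, and since a Delphi dynamic array is itself a pointer under the hood, an array of dynamic arrays is already an array of pointers to independently allocated rows, none of which needs to approach the 2 GB limit:

```pascal
type
  TItem = record // hypothetical element type
    Key: Integer;
    Value: Double;
  end;
  TRow = array of TItem;

var
  Rows: array of TRow; // the "spine": one pointer per row

procedure BuildRows;
begin
  SetLength(Rows, 5000000); // ~40 MB of pointers on 64-bit
  SetLength(Rows[42], 10);  // each row is allocated on demand, independently
end;
```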

 

For performance, you should also think about how the data is accessed so that the processor's memory cache is used efficiently.

On 8/28/2021 at 6:03 PM, Jud said:

Briefly, what the program needs to do would naturally use a two-dimensional, non-rectangular dynamic array.  The array would have millions of rows, but each row would be a short list.  I wrote it that way, but performance is terrible.  The sizes aren't known until the procedure is run.  I believe memory is allocated in 4KB blocks (and zeroed out), which is a lot more than each row needs.  I think this is the bottleneck.

The Delphi memory manager takes care of allocations, so no 4KB-for-each-10-byte-array situation happens.

You should specify what "performance is terrible" means for you.

As a quick check, you could comment out all processing actions and leave only the loop through the array. This will show you the best performance you could reach with this data structure.


I need to update (I've been away for a couple of days).

 

Here is the situation.  A program does a lot of processing and calls a procedure to do the final work.  This procedure is given an array of records, which can number in the millions.  The procedure is called a large number of times. 

 

The procedure does some work matching pairs of records that meet certain criteria.  At first, I just did a brute-force quadratic time pair of nested loops, to make sure it was working.  (Get it right before making it fast.)  The test run took 17.7 hours. 

 

I knew there were better ways to do it.  The obvious choice is to do a bucket sort on the data before processing it.  A static array is limited to something less than 2 billion bytes, as far as I know.  This isn't enough, so I tried a two-dimensional non-rectangular array for the bucket sort.  (The first dimension is potentially in the millions, or many millions, but the size of each bucket is small.)  I thought this would be much faster than the quadratic-time brute-force method, but it ran for 2 days on the test case and didn't get anywhere near finishing.  I think part of the problem is that I'm using SetLength on the dynamic array of arrays a lot of times inside the procedure.  I might be able to avoid that, but I haven't tested it yet.

 

Right now I'm working on a quicksort of the data, even though a bucket sort is theoretically faster.  I have it working on small cases, but later today I should have it working on the test case.

 

So that is the background and where I am.


So the issue is looping through the data, not allocation. At first glance, if you could calculate the matching criteria beforehand, you could build a structure that ties a criterion to an item index, so that at the next step you compare only flags/short hashes to retrieve the indices.

On 9/2/2021 at 3:54 AM, Fr0sT.Brutal said:

So the issue is looping through the data, not allocation. At first glance, if you could calculate the matching criteria beforehand, you could build a structure that ties a criterion to an item index, so that at the next step you compare only flags/short hashes to retrieve the indices.

I think the problem may have to do with the data allocation, since bucket sort (the natural fit for this data) is so much slower than the quadratic-time brute force.  But I need to get back to looking at it; I've taken a different approach for now, and I got the first working version going yesterday.

 

The procedure that puts the data in buckets is called millions of times, with data arrays of varying sizes, but the bulk of them are under 100 elements.  I switched to doing a quicksort, and it took 6.8 hours on the test case, as opposed to 17.7 hours before.  A lot of the time is now spent outside this procedure, so I don't know how much faster a linear-time sort would be.

 

Since so many calls to the procedure have a small number of records, I compared several sorting methods against quicksort on different sizes of data.  Quicksort beat them all until the size got down into the 20s; then selection sort started outperforming quicksort by a good margin.

 

An interesting thing: I was testing several sorts, and I've read many times that insertion sort is the fastest of the quadratic-time sorting algorithms.  But, on these data records, selection sort beat insertion sort significantly.
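Those measurements suggest a hybrid: a quicksort that hands small partitions to selection sort. A sketch, assuming a TItem record with an integer Key field and a cutoff in the mid-20s as reported above (both are assumptions standing in for the actual code):

```pascal
const
  Cutoff = 24; // below this size, selection sort won in the tests above

procedure HybridSort(var A: array of TItem; L, R: Integer);
var
  I, J, Min: Integer;
  P, T: TItem;
begin
  if R - L < Cutoff then
  begin
    // selection sort: at most one swap per position, cheap on 16-byte records
    for I := L to R - 1 do
    begin
      Min := I;
      for J := I + 1 to R do
        if A[J].Key < A[Min].Key then
          Min := J;
      T := A[I]; A[I] := A[Min]; A[Min] := T;
    end;
    Exit;
  end;
  // classic quicksort partition around the middle element
  P := A[(L + R) div 2];
  I := L; J := R;
  repeat
    while A[I].Key < P.Key do Inc(I);
    while A[J].Key > P.Key do Dec(J);
    if I <= J then
    begin
      T := A[I]; A[I] := A[J]; A[J] := T;
      Inc(I); Dec(J);
    end;
  until I > J;
  if L < J then HybridSort(A, L, J);
  if I < R then HybridSort(A, I, R);
end;
```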

 

I'm going to get back to looking at the original bucket method.

Edited by Jud

On 9/1/2021 at 7:29 PM, Jud said:

I think part of the problem is that I'm using SetLength on the dynamic array of arrays a lot of times inside the procedure.

I am absolutely putting my bet on this. The second issue after that will most likely be a non-cache-friendly access pattern.

 

But you don't have to guess - run the code under a profiler and it will tell you the slow parts.

Edited by Stefan Glienke

12 hours ago, Stefan Glienke said:

I am absolutely putting my bet on this. The second issue after that will most likely be a non-cache-friendly access pattern.

 

But you don't have to guess - run the code under a profiler and it will tell you the slow parts.

 

I haven't gotten back to that yet, but I'm going to do timings and see.


What is a good low-cost profiler?  I downloaded AQtime but its trial period is short, then it is expensive.


I recommend SamplingProfiler, which simply works with either a map file or TD32 info in the EXE. If you use the map2pdb tool that Cristian linked, you can use Intel VTune or uProf.


I've installed that.  It gives me the message "Neither TD32, MAP, nor JDB information were found...". EDIT: I found where to turn on the MAP file, and I think it is working.

 

Results: it is spending 80% of its time in Quicksort.  So I need to go back to trying the bucket sort.

Edited by Jud


Are you using arrays of records?  How large are the records?  Are you shuffling whole records when sorting?

If so, why not have an array of pointers to the records, and shuffle the pointers instead of the actual data?

1 hour ago, Lars Fosdal said:

Are you using arrays of records?

AFAIU he's using large arrays of small arrays

 

@Jud I think you had better prepare a simplified test project and show it here. You can use GitHub gists to be able to update the text.

Edited by Fr0sT.Brutal


In that case, use index arrays and reorder the indexes instead of shuffling values.
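A sketch of the index-array idea; TItem and its Key field are assumptions standing in for the 16-byte record discussed above. Only the 4-byte indices are moved by the sort; the records themselves never move:

```pascal
uses
  System.Math, System.Generics.Collections, System.Generics.Defaults;

var
  Data: TArray<TItem>; // the records stay put
  Idx: TArray<Integer>;

procedure SortByIndex;
var
  I: Integer;
begin
  SetLength(Idx, Length(Data));
  for I := 0 to High(Idx) do
    Idx[I] := I;
  TArray.Sort<Integer>(Idx, TComparer<Integer>.Construct(
    function(const A, B: Integer): Integer
    begin
      Result := CompareValue(Data[A].Key, Data[B].Key); // compare records, move indices
    end));
  // Data[Idx[0]] is now the record with the smallest Key
end;
```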


I'm not shuffling the records; I'm grouping them according to how they are to be processed.  The records are 16 bytes long (and there are a lot of them).

 

Yesterday (after watching a webinar and installing Delphi 11.0, which, yes, is available) I did more profiling.

 

I said that the quicksort was taking 80%.  There is a SwapRecords procedure nested in the quicksort procedure that takes an additional 7%.

 

I went back to my earlier version (with compiler directives), the one that uses a non-rectangular dynamic array of dynamic arrays.  The profile shows that SetLength is taking most of the time.  The profiler shows "system" using 77% of the time; @FillChar uses the most, which must be the zeroing of the dynamic array (which I don't need), and SysGetMem is second in the profiler.

 

But yesterday I had the idea of using a large dynamic array of small static arrays.  I think that is going to work (in 64-bit mode).

 

Also, what Administrators said above just made me think of another possible approach: keep the dynamic array of records, go through it and pick out the ones with a certain key, process them, and repeat with the next key.  The problem with this is that there are so many keys (and each one would require a pass through the array) that I think it would be back to quadratic time.

 

Probably tomorrow I will implement the dynamic array of small static array method.

 

Edited by Jud


If your small items are records or static arrays, the sort has to copy memory for each pair of them. If you use System.Generics.Collections.TArray.QuickSort<T>, it calls the comparer function several times and performs 1..4 memory copies in each iteration.

 

Anyway, instead of guessing, a minimal code sample would be much more useful.


The final phase of the processing needs the items in buckets, so a bucket sort seems logical; it should beat quicksort.  I think what made it slow was that the procedure that prepares the data for final processing is called millions of times, and the large dynamic array of arrays was inside that procedure, calling SetLength each time.  I've now made the large dynamic array global, call SetLength once with reasonable estimates, and then enlarge it only if necessary.  That is working, but I haven't compared it to quicksort yet.
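That grow-only pattern might look like this (all names hypothetical): the bucket storage is global, SetLength only ever enlarges a row, and a separate counts array tracks how full each bucket is, so resetting between calls touches no allocations at all:

```pascal
var
  Buckets: array of array of TItem; // global: allocated once, reused across calls
  Counts: array of Integer;         // live element count per bucket

procedure ResetBuckets; // per call: just zero the counts; free/zero no rows
begin
  if Length(Counts) > 0 then
    FillChar(Counts[0], Length(Counts) * SizeOf(Integer), 0);
end;

procedure PutInBucket(BucketIx: Integer; const Item: TItem);
begin
  if Counts[BucketIx] = Length(Buckets[BucketIx]) then
    SetLength(Buckets[BucketIx], Length(Buckets[BucketIx]) * 2 + 8); // grow only
  Buckets[BucketIx][Counts[BucketIx]] := Item;
  Inc(Counts[BucketIx]);
end;
```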


I think the problem is probably that calling SetLength on a fairly large dynamic array millions of times inside a procedure is overwhelming the memory manager.  I need to keep a count of how many times it has been called in the last second, and see if that rate starts to drop.

 

Maybe I can get some sample code; I'm trying different things.  Today I finished a run after making the 2-dimensional non-rectangular array global, so SetLength isn't called for it millions of times inside the procedure that prepares the data for the final phase.  This was faster, but the run took 39.6 hours, as opposed to about 2.2 for the quicksort approach.  But I still have a dynamic array inside that procedure that records how many elements each row of the big array holds.

 

I gathered some statistics during the run, and they show that this array could be less than 200 megabytes, so I'm going to try making it a global static array.  Inside the procedure that is called millions of times, I have to set part of the array to zero, but I know how much of the array needs the FillChar.

Edited by Jud

