dummzeuch

System.GetMemory returning NIL


The Help has the following to say about System.GetMemory:

Quote

GetMemory allocates a block of the given Size on the heap, and returns the address of this memory. The bytes of the allocated buffer are not set to zero. To dispose of the buffer, use FreeMemory. If there is not enough memory available to allocate the block, an EOutOfMemory exception is raised.

If the memory needs to be zero-initialized, you can use AllocMem.

From that description, I assumed that this function raises an EOutOfMemory if it cannot allocate a block and so I don't need to check whether it returns NIL.

 

Apparently that's wrong. My program just requested a block of 35,880,960 bytes (for the 50th time; yes, that makes about 1.7 gigabytes, and it's quite possible that it ran out of memory), and the function returned NIL on this call and on all subsequent calls with that size.

 

Am I missing something here?

2 minutes ago, Stefan Glienke said:

Look into the implementation of System._GetMem and compare that with System.GetMemory and you know the answer.

So what I was missing is that the Help is wrong:

 

_GetMem calls MemoryManager.GetMem, checks for NIL and raises EOutOfMemory. System.GetMemory simply calls MemoryManager.GetMem without any check.

(That's Delphi 10.2.3)

 

Great! 😞

Now I have to go through the whole program and change the code to either call GetMem or check the result.

1 minute ago, Mahdi Safsafi said:

Why are you using the Delphi MM for such large blocks? It would be better to use OS functions.

Why not? FastMM should be able to handle that, and the overhead should be negligible since I am not calling it very often.

 

In case you are wondering what I need those memory blocks for:

It's a buffer for receiving pictures from a camera which are 5120x1168 pixels in BayerRG10 format. That format uses 6 bytes per pixel.

The camera sends 30 pictures per second, so I played it safe and allocated "enough" buffers to last for more than 1.5 seconds.

 

I am still experimenting with the GenICam interface. I didn't think about the required memory, I have to admit. That's what EOutOfMemory is for. 😉


@dummzeuch Yes, FastMM can handle it, but the OS provides more options (zero initialization, file mapping, ...). Besides, the behavior when the OS fails is very well documented.

Anyway, there is no harm in using FastMM ... after all, it will call OS functions.

15 minutes ago, dummzeuch said:

Great! 😞

Now I have to go through the whole program and change the code to either call GetMem or check the result.

Or, as a temporary solution:


var
  OriginalMemoryManager: TMemoryManagerEx;
  NewMemoryManager: TMemoryManagerEx;

function NewGetMem(Size: NativeInt): Pointer;
begin
  Result := OriginalMemoryManager.GetMem(Size);
  if not Assigned(Result) then
    // Raise via the RTL's preallocated OutOfMemory instance; calling
    // EOutOfMemory.Create here could itself fail to allocate.
    System.Error(reOutOfMemory);
end;

begin
  GetMemoryManager(OriginalMemoryManager);
  NewMemoryManager := OriginalMemoryManager;
  NewMemoryManager.GetMem := NewGetMem;
  SetMemoryManager(NewMemoryManager);
end.

 



@Mahdi Safsafi Might be simpler to just redirect System.GetMemory instead of the memory manager's GetMem, which is called by other routines that do a reOutOfMemory check themselves.

 

 

 


7 minutes ago, FredS said:

@Mahdi Safsafi Might be simpler to just redirect System.GetMemory instead of the memory manager's GetMem, which is called by other routines that do a reOutOfMemory check themselves.

 

 

 

Indeed, you're absolutely right. But the intention of the solution I proposed above was to be a temporary fix that doesn't use any external library.

When he has some time, he may want to investigate further for a more suitable solution.

12 hours ago, David Heffernan said:

Can't understand why you aren't calling GetMem. 

Because according to the OLH there is no functional difference between GetMem and GetMemory, so I thought I had a free choice. And I prefer a function over a procedure when a single value is returned; it makes the code easier to read.

20 hours ago, Mahdi Safsafi said:

Why are you using the Delphi MM for such large blocks? It would be better to use OS functions.

The Delphi MM (FastMM4) is just a wrapper around the OS API for big blocks. There is no benefit in calling the OS function directly, which is system-specific and unsafe. Just use GetMem/FreeMem/ReallocMem everywhere.


56 minutes ago, Arnaud Bouchez said:

The Delphi MM (FastMM4) is just a wrapper around the OS API for big blocks. There is no benefit in calling the OS function directly, which is system-specific and unsafe. Just use GetMem/FreeMem/ReallocMem everywhere.

I know that already, and I agree with everything you said except the claim that there is no benefit in using OS functions:

Large data tends to be aligned. Calling the Delphi MM will likely allocate one extra page to store the block header when the requested size is a multiple of the system page granularity. Also, the returned pointer is not aligned at the system page granularity. Moreover, the OS functions provide many options the Delphi MM does not.

So if portability is not an issue ... I really don't understand why someone would not use the OS functions.

57 minutes ago, Mahdi Safsafi said:

I really don't understand why someone would not use OS functions

Premature optimisation 


On 8/15/2020 at 3:58 PM, Mahdi Safsafi said:

Large data tends to be aligned. Calling the Delphi MM will likely allocate one extra page to store the block header when the requested size is a multiple of the system page granularity. Also, the returned pointer is not aligned at the system page granularity.

Allocating 4KB more for huge blocks is not an issue.

If you want the buffer aligned to the system page granularity, that is a very specific case, only needed for other OS calls, like changing the memory protection flags. It is theoretically possible, but very rare. This is about the only case where the internal MM should not be used.

If you expect to see any performance benefit from page-aligned memory, you are pretty wrong for huge blocks - it doesn't change anything in practice. The only way to increase performance with huge blocks of memory is by using non-temporal asm opcodes (e.g. movnti), which won't populate the CPU cache. But this is only possible with raw asm, not Delphi code, and it is clearly MM-independent.

Edited by Arnaud Bouchez

3 hours ago, Arnaud Bouchez said:

Allocating 4KB more for huge blocks is not an issue.

First, I'm not sure you're aware of this, but 4KB is not the only available page size! While most environments support it, some use more than 4KB (link)!

Second, you should know better than anyone what this means for fragmentation. Try convincing someone who runs in a limited environment that an extra 4KB (or whatever the page size is) per large allocation is not an issue! Personally, the last thing I want to deal with when allocating some GB is unnecessary fragmentation that can easily be avoided.

Third, if I tell you that some OS functions are thread-safe, what will your answer be? Spoiler: think twice before answering, because the answer you're preparing is not what I'm expecting!

Quote

If you want the buffer aligned with system page granularity, then it is a very specific case, only needed by other OS calls, like changing the memory protection flags. It is theoritically possible, but very rare. This is the only reason when using the internal MM is not to be used.

You're completely wrong on this! It's not as if you need memory aligned at the system page size (SPS) only for basic things like changing memory protection. There are many other cases. Take a look at high-performance I/O: ReadFileScatter, WriteFileGather.
Did you know that for read/write operations between memory and disk it's better to have SPS alignment? E.g. sections of PE files, large files.
Did you know that disk manufacturers today try to align their sector size to the SPS (today we have 4KB, 8KB)? See file buffering.

Quote

If you expect to see any performance benefit from page-aligned memory, you are pretty wrong for huge blocks

I'm just going to pretend that I didn't hear that. You knew exactly what I was referring to !

 

One final thing, Arnaud, just to make things clear: I didn't invent the rule that says "using OS functions for large allocations is better than the MM" myself. In fact, if you do some research on the internet, you will find that many developers recommend using OS functions for large data over any MM. I remember seeing a statement from Microsoft too!

There was also an interesting benchmark comparing C malloc against Linux/Windows functions for allocating/freeing small/large blocks ... you should definitely take a look at it:

https://raima.com/memory-management-allocation/

@David Heffernan I also recommend that you read the above article. It may change your mind about when to call something "premature optimisation" and when not!


 

Guest

I agree with Mahdi here and will add these:

On Linux, zero-copy operation has been around for a long time now, and it all needs page-aligned memory buffers whose size is n*PageSize.

 

https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html

https://stackoverflow.com/questions/18343365/zero-copy-networking-vs-kernel-bypass

Or just search for "Zero-Copy Linux"

 

On Windows, there is similar behaviour, as with TransmitFile, where the file is loaded and then sent from the cache, meaning the pages holding the file are passed directly to the send path.

Or from here http://www.serverframework.com/asynchronousevents/2011/10/windows-8-registered-io-buffer-strategies.html

Quote

It's also sensible to use page aligned memory for buffers that you register with RIORegisterBuffer() as the locking granularity of the operating system is page level so if you use a buffer that is not aligned on a page boundary you will lock the entire page that it occupies. This is especially important given that there's a limit to the number of I/O pages that can be locked at one time and I would imagine that buffers registered with RIORegisterBuffer() count against this limit.

 

Quote

To avoid locking more memory than you need to always align your buffers by allocating with VirtualAlloc(), or VirtualAllocExNuma(). 

 

Guest

Forgot this too 

 

AcceptEx , from https://docs.microsoft.com/en-us/archive/msdn-magazine/2000/october/windows-sockets-2-0-write-scalable-winsock-apps-using-completion-ports

Quote

An important issue in this design is to determine how many outstanding AcceptEx calls are allowed. Because a receive buffer is being posted with each accept call, a significant number of pages could be locked in memory. (Remember each overlapped operation consumes a small portion of non-paged pool and also locks any data buffers into memory.) There is no real answer or concrete formula for determining how many accept calls should be allowed. The best solution is to make this number tunable so that performance tests may be run to determine the best value for the typical environment that the server will be running in.

And that explains why AcceptEx is at least as fast as recv even though it does more work. This shows when the buffer passed to AcceptEx is page-aligned in offset and size; when it is not aligned, it will be slower than recv. Using buffers from the MM with AcceptEx will not yield any profit.


@Mahdi Safsafi
Your article refers to the C malloc on Windows - which is known to be far from optimized - much less optimized than the Delphi MM.
For instance, the conclusion of the article doesn't apply to the Delphi MM: "If you have an application that uses a lot of memory allocation in relatively small chunks, you may want to consider using alternatives to malloc/free on Windows-based systems. While VirtualAlloc/VirtualFree are not appropriate for allocating less than a memory page they can greatly improve database performance and predictability when allocating memory in multiples of a single page.". This is exactly what FastMM4 does.

When I wrote that fragmentation won't increase for HUGE blocks, I meant blocks of more than a few MB. With such sizes, I would probably reuse the very same buffer per thread if performance is needed.

 

@Kas Ob.

You are just proving my point: if you use very specific OS calls, you may need buffers aligned on memory pages.

Edited by Arnaud Bouchez

5 hours ago, Mahdi Safsafi said:

I also recommend that you read the above article. It will change your mind about when to call it "Premature optimisation" and when not !

A valid benchmark from a real-world program would make me reconsider what I said.

1 hour ago, Arnaud Bouchez said:

@Mahdi Safsafi
Your article refers to the C malloc on Windows - which is known to be far from optimized - much less optimized than the Delphi MM.
For instance, the conclusion of the article doesn't apply to the Delphi MM: "If you have an application that uses a lot of memory allocation in relatively small chunks, you may want to consider using alternatives to malloc/free on Windows-based systems. While VirtualAlloc/VirtualFree are not appropriate for allocating less than a memory page they can greatly improve database performance and predictability when allocating memory in multiples of a single page.". This is exactly what FastMM4 does.

When I wrote fragmentation won't increase for HUGE blocks, I meant > some MB blocks. With such size, I would probably reuse the very same buffer per thread if performance is needed.

 

@Kas Ob.

You are just proving my point: if you use very specific OS calls, you may need buffer aligned on memory page.

 

Again, you're completely wrong! Even if you replace the C MM with the Delphi MM, there is no way FastMM can outperform the OS functions. Allocating memory may perhaps come close (single-threaded) to what the OS gives, but freeing memory will never outperform it, especially in an environment where system paging is actively working. Should I explain further what that means? I'd be very interested if you have a benchmark that proves otherwise.

What you're missing, Arnaud, is the following: at the software level, it may look at first sight as if there is no difference between using the C/Delphi MM and the OS API. But the fact is that memory management is a very complex thing and requires collaboration between different components (RAM, disk, software, OS and even the CPU). Without understanding the nature of and relationship between those components ... you will never understand the full story.
 

46 minutes ago, David Heffernan said:

A valid benchmark from a real-world program would make me reconsider what I said.

Are you saying that the benchmark from the article is not enough?

3 hours ago, Mahdi Safsafi said:

Are you saying that the benchmark from the article is not enough ?

I don't see any benchmark to support the assertion that all allocation of huge blocks should be done using VirtualAlloc.

 

Can you point me to it?

1 minute ago, David Heffernan said:

I don't see any benchmark to support the assertion that all allocation of huge blocks should be done using VirtualAlloc.

 

Can you point to it. 

You and Arnaud have read the article, but neither of you understood it clearly, perhaps because it requires familiarity with some details. That's why I told Arnaud that he is wrong again and that it's not related to malloc! Even if someone replaces malloc with FastMM, he is likely to get the same result.

 

I'll give a very simple example (I will do my best to make it understandable to anyone) to explain why VirtualFree is much better than FreeMemory.

Suppose you have allocated a bunch of large blocks using GetMemory. Obviously, sooner or later, system paging will kick in and start swapping pages from memory to disk. It may happen that some of your pages (most likely the ones allocated first) end up on disk instead of in memory. When the time comes for cleanup, you call FreeMemory to free the allocated memory:

// FreeMemory for large block
function FreeLargeBlock(APointer: Pointer): Integer;
var
  LPreviousLargeBlockHeader, LNextLargeBlockHeader: PLargeBlockHeader;
begin
  {Point to the start of the large block}
  APointer := Pointer(PByte(APointer) - LargeBlockHeaderSize);
  {Get the previous and next large blocks}
  LPreviousLargeBlockHeader := PLargeBlockHeader(APointer).PreviousLargeBlockHeader;
  ...
end;

1 - As you can see, the function de-references the pointer.

2 - Because the page is on disk and not in memory, an interrupt (for simplification, think of it as an invisible exception) occurs at the CPU level. The CPU suspends the process that tried to access the memory and sends an interrupt to the OS kernel: "Hi kernel, process "A" is trying to access invalid memory at address "X"."

3 - The kernel searches for the page "B" associated with address "X". If no page is found, it is just an AV exception. Otherwise it proceeds to step 4.

4 - The kernel moves some page "C" from memory to disk to make room for the requested page "B".

5 - The kernel loads the requested page "B" from disk into memory.

6 - The kernel resumes execution of process "A" as if nothing happened.

7 - Process "A" calls VirtualFree.

8 - The kernel de-allocates page "B".

 

Now, if you had just used VirtualAlloc/VirtualFree, de-referencing the pointer would not be required, all of steps 1 to 6 would be unnecessary, and no paging would happen either!

The important thing is that some of the above steps are heavy ... and that's why, in their analysis, it was taking hours to free the memory: a swap between disk and memory was happening all the time.

 

The "Reverse Memory Free" benchmark on Windows Server 2008 R2 Datacenter took seconds instead of hours because they were clever enough to avoid system paging: they freed pages in the reverse order of allocation. The last allocated page is freed first (the last allocated page most likely resides in memory and not on disk), so steps 2-6 may not be necessary for every transaction. Paging may still happen, but not as much as with the original code.

 

By understanding this, anyone can quickly see that FreeMemory (which implies using GetMemory) is evil for large blocks when paging is active, and can in no case compete with VirtualFree. Going from hours to seconds is a level of optimization that is worth understanding, and worth changing many bad practices/wrong beliefs for.

 

A better practice is to use OS functions for large blocks:

- They don't de-reference the pointer ... no issue with paging.

- They don't suffer from fragmentation the way GetMemory does. Remember, the last thing anyone wants when allocating large data is a fragmentation issue.

- The returned memory is always aligned to the SPS.

- Thread safety: they only take one lock. GetMemory/FreeMemory locks twice!

- They provide additional options.

- If portability is an issue, a wrapper will be very useful. It's not as if we have a bunch of systems to support; mostly Windows and POSIX:

 

function GetLargeMemory(Size: SIZE_T): Pointer;
begin
{$IFDEF MSWINDOWS}
  Result := VirtualAlloc(nil, Size, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE);
{$ELSE POSIX}
  // MAP_FIXED must not be combined with a nil address hint, and the
  // file descriptor must be -1 for an anonymous mapping.
  Result := mmap(nil, Size, PROT_READ or PROT_WRITE, MAP_ANONYMOUS or MAP_PRIVATE, -1, 0);
{$ENDIF MSWINDOWS}
end;

procedure FreeLargeMemory(P: Pointer; Size: SIZE_T);
begin
{$IFDEF MSWINDOWS}
  // With MEM_RELEASE the size parameter must be 0: the entire
  // reservation returned by VirtualAlloc is released.
  VirtualFree(P, 0, MEM_RELEASE);
{$ELSE POSIX}
  munmap(P, Size);
{$ENDIF MSWINDOWS}
end;

I hope by now everyone here understands the difference between the OS functions and the Delphi MM functions.

1 hour ago, Mahdi Safsafi said:

Suppose you have allocated a bunch of large blocks using GetMemory. Obviously, sooner or later, system paging will kick in and start swapping pages from memory to disk.

Yes, some pages are written to disk, but usually not commonly used memory - especially not when there is plenty of RAM available.

 

The article even says:

Quote

Our analysis showed that there was no more I/O being performed other than what was expected by the operations requested by the application. The available persistent memory was being used and the process was not virtualized.

 

Edited by Stefan Glienke

