dummzeuch

Is Move the fastest way to copy memory?


All functions for copying or moving memory from one buffer to another (e.g. Windows.CopyMemory and Windows.MoveMemory) seem to end up calling System.Move.

 

Is that really the fastest way to copy large amounts of data?

 

Edit: This is about the 32-bit compiler for Windows.

 

In my case these are Mono8 bitmaps with a resolution of up to 5120x5120 pixels (about 26 MB), BGR8 bitmaps at HD resolution (1920x1080 * 3 bytes, about 6 MB), or even BayerRG16 or Mono16 bitmaps with up to 5120x5120 * 2 bytes (about 52 MB). These arrive in a buffer delivered by a camera API which must be returned as fast as possible, so I need to copy the pictures to another buffer for further processing, and do it fast. The buffers don't overlap in this case, which is something Move has to take into consideration, so there is some small amount of optimization possible.
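For reference, the quoted sizes work out as follows. A quick sanity check (sketched in C since the arithmetic is language-independent; the formats and dimensions are the ones from the post):

```c
#include <stddef.h>

/* Buffer sizes for the camera formats mentioned above, in bytes. */
size_t mono8_bytes(void)  { return (size_t)5120 * 5120;     }  /* ~26 MB */
size_t bgr8_bytes(void)   { return (size_t)1920 * 1080 * 3; }  /* ~6 MB  */
size_t mono16_bytes(void) { return (size_t)5120 * 5120 * 2; }  /* ~52 MB */
```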

 

I think it might be possible to make it faster using some MMX/SSE instructions, but I am not familiar with that.

 

I tried to google for this, but Google kept showing results for the C function CopyMem, as usual ignoring the "Delphi" part of my query (even when put in quotes).

1 hour ago, David Heffernan said:

I think Arnaud's synopse library has a bunch of more optimised mem copy routines

Found them, thanks.

 

Just in case anybody else wants to look at them:

 

* for 32 bit:

https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.asmx86.inc

 

* for 64 bit:

https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.asmx64.inc

56 minutes ago, dummzeuch said:

Found them, thanks.

Are you planning any testing/benchmarking to verify they are a better option?

50 minutes ago, Mark- said:

Are you planning any testing/benchmarking to verify they are a better option?

I will definitely do that before using it. So far I have just had a look.

Just now, dummzeuch said:

I will definitely do that before using it. So far I have just had a look.

Please publish the results; I am most interested.


Don't expect anything magical from mORMot's MoveFast(). Perhaps a few percent either way.

On Win32 - which is your target - IIRC the Delphi RTL uses x87 registers. On this platform, MoveFast() uses SSE2 registers for small sizes, so it is likely to be slightly faster, and it will leverage ERMSB (i.e. rep movsb) on newer CPUs that support it.
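For illustration, ERMSB ("Enhanced REP MOVSB") just means the CPU executes a plain rep movsb quickly for large blocks. A hedged sketch in C with x86-only inline assembly and a memcpy fallback elsewhere (this is illustrative, not mORMot's actual code):

```c
#include <stddef.h>
#include <string.h>

/* Copy using `rep movsb` on x86/x86_64; plain memcpy elsewhere.
   On CPUs with ERMSB, the microcoded rep movsb is competitive with
   hand-written SSE loops for large, non-overlapping blocks. */
void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}
```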

To be fair, the mORMot asm is more optimized for x86_64 than for i386, because x86_64 is the target platform for the server side, which is the one that needs optimization most.

 

But I would just try all FastCode variants - some can be very verbose, but "may" be better.
 

What I would do in your case is try to not move any data at all.

Isn't it possible that you pre-allocate a set of buffers, then just consume them in a circular way, passing them from the acquisition to the processing methods as pointers, with no copy?
The fastest move() is ... when there is no move... 🙂
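The ring-of-buffers idea above can be sketched like this (a minimal C version with hypothetical names; a real implementation would also need synchronization between the acquisition and processing threads):

```c
#include <stdlib.h>
#include <stddef.h>

#define RING_SLOTS 4  /* assumed depth; tune to the camera's latency */

typedef struct {
    unsigned char *slot[RING_SLOTS]; /* pre-allocated frame buffers */
    size_t frame_size;
    int next;                        /* index handed out next       */
} FrameRing;

/* Allocate all frame buffers once, up front. Returns 0 on success. */
int ring_init(FrameRing *r, size_t frame_size)
{
    r->frame_size = frame_size;
    r->next = 0;
    for (int i = 0; i < RING_SLOTS; i++) {
        r->slot[i] = malloc(frame_size);
        if (!r->slot[i])
            return -1;
    }
    return 0;
}

/* Hand out the next buffer; acquisition fills it, processing reads it.
   No data is copied - only this pointer changes hands. */
unsigned char *ring_acquire(FrameRing *r)
{
    unsigned char *p = r->slot[r->next];
    r->next = (r->next + 1) % RING_SLOTS;
    return p;
}
```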

4 hours ago, Arnaud Bouchez said:

What I would do in your case, is trying to not move any data at all.

Isn't it possible that you pre-allocate a set of buffers, then just consume them in a circular way, passing them from the acquisition to the processing methods as pointers, with no copy?
The fastest move() is ... when there is no move... 🙂

I have already done that. There is now one move operation left per picture and that's between the buffers used by the API and the buffers used internally by my code. I found no way for avoiding this.


FastMM has optimized Move routines as well. While the generic Move is pretty fast, it can't squeeze out the maximum because of its generality. The best performance can be achieved with specially prepared memory blocks - aligned, non-overlapping, not locked. FastMM has some Move variants optimized for such specific blocks.


While it cannot squeeze out the maximum, it is also far from performing anywhere near "well" for larger amounts - simply because (a) the x86 implementation only moves 8 bytes at once, in a loop using FILD and FISTP, which is slower than an SSE loop using instructions that have been available since around 2000, and (b) Win64 uses a rather terrible pure-Pascal implementation which also moves at most 8 bytes at a time.
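The point about move width can be shown portably: copy in wide, unrolled chunks instead of 8 bytes per iteration. A hedged C sketch (compilers typically lower the fixed-size memcpy to SIMD loads/stores; the actual RTL routines are hand-written assembly):

```c
#include <string.h>
#include <stddef.h>

/* Copy in 64-byte chunks, then handle the tail. Modern compilers turn the
   fixed-size memcpy into a few wide (SSE/AVX) loads and stores - the
   effect an 8-bytes-per-iteration FILD/FISTP loop cannot match. */
void copy_wide(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n >= 64) {
        memcpy(dst, src, 64); /* compiles to wide vector moves */
        dst += 64;
        src += 64;
        n -= 64;
    }
    if (n)
        memcpy(dst, src, n);  /* remaining tail, < 64 bytes */
}
```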

On 3/22/2022 at 2:44 PM, Arnaud Bouchez said:

Don't expect anything magical from mORMot's MoveFast(). Perhaps a few percent either way.

I finally got around to trying it with a 32-bit application and found no measurable improvement.

Maybe I am doing something wrong, as there are several IFDEFs in the code whose purpose I don't understand.

Or maybe the move operation is simply not that important in the overall performance of the program.

I did not do any specific timing for the move operation itself as I think it's pointless if it doesn't improve the performance of the program noticeably.

7 hours ago, dummzeuch said:

Or maybe the move operation is simply not that important in the overall performance of the program.

Umm... You mean to say you haven't profiled it?

7 hours ago, Anders Melander said:

You mean to say you haven't profiled it?

No, not that part. I simply replaced a part that I thought might contribute a bit to the overall runtime and was easy to replace. I timed the result and found that it didn't make any difference. It took me about 30 minutes, so not much wasted effort.

 

Through other means (improved algorithm and multithreading) I have already reduced the overall time for one work unit of the program to 1/3 compared to the original code.

 

The program which I now used for the test is not the one I mentioned in the original question. That one has already gone into "production" about a month ago and the performance is "good enough".


That's what I wrote: it is unlikely that an alternate Move() would make a huge difference.

When working on buffers, cache locality is key to performance.

Working on smaller buffers that fit in cache (the L1 data cache is typically a few tens of KB; L2/L3 reach a few MB) could be faster than one big Move followed by one big Process.

But perhaps your CPU already has a large enough cache (bigger than your picture), in which case it won't help.
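The suggestion above amounts to interleaving the move and the processing in cache-sized strips, so each strip is still hot when it is processed. A hedged C sketch with an assumed strip size and a trivial byte-sum standing in for the real processing step:

```c
#include <string.h>
#include <stddef.h>

#define STRIP_BYTES (32 * 1024) /* assumed: comfortably within cache */

/* Instead of one big Move followed by one big Process, interleave them:
   copy one strip out of the API buffer, then process it while it is
   still in cache. `sum += dst[i]` is a stand-in for real processing. */
unsigned long copy_and_process(unsigned char *dst,
                               const unsigned char *src, size_t n)
{
    unsigned long sum = 0;
    while (n) {
        size_t strip = n < STRIP_BYTES ? n : STRIP_BYTES;
        memcpy(dst, src, strip);           /* move one strip        */
        for (size_t i = 0; i < strip; i++) /* process it while hot  */
            sum += dst[i];
        dst += strip;
        src += strip;
        n -= strip;
    }
    return sum;
}
```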

 

About the buffers, couldn't you use a ring of them, so that you don't move data?

5 hours ago, Arnaud Bouchez said:

That's what I wrote: it is unlikely alternate Move() would make a huge difference.

I understood that. I tried it anyway because it was easy to do and doesn't hurt.

 

5 hours ago, Arnaud Bouchez said:

Working on smaller buffers that fit in cache (the L1 data cache is typically a few tens of KB; L2/L3 reach a few MB) could be faster than one big Move followed by one big Process.

But perhaps your CPU already has a large enough cache (bigger than your picture), in which case it won't help.

The pictures are huge: 5120 by 2000 with Mono8 pixels, so about 10 MB each. Sometimes they get even bigger: up to 5120x5120 with RGB8 pixels. But that's very rare so far, so I don't care about performance for those at the moment. But since we have this camera now, I'm sure it will be used in more projects.

 

5 hours ago, Arnaud Bouchez said:

About the buffers, couldn't you use a ring of them, so that you don't move data?

I have already done that, as far as possible:

On 3/22/2022 at 6:56 PM, dummzeuch said:

There is now one move operation left per picture and that's between the buffers used by the API and the buffers used internally by my code. I found no way for avoiding this.

 

As said before: this time it's a different program, but the same basically applies: I try to avoid moving this data at all if possible.

 

Hm, maybe using smaller buffers for only parts of the picture, as you suggested above, could help here. I'll have to think about that one.


L1 cache access time makes a huge difference.
http://blog.skoups.com/?p=592

 

You could retrieve the L1 cache size, then work on buffers of about 90% of that size (always keeping some space for the stack, tables and such).
Then, if you work in the API buffer directly, a non-temporal move to the result buffer may help a little.
During your processing, if you use lookup tables, ensure they don't pollute the cache.
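A non-temporal move writes around the cache hierarchy, so the large destination buffer does not evict the data (such as lookup tables) you still need. A hedged C sketch using SSE2 intrinsics where available, with a plain memcpy fallback (names and structure are illustrative, not mORMot's code):

```c
#include <string.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy without polluting the cache with the destination: non-temporal
   stores bypass the caches. The SSE2 path requires a 16-byte-aligned
   dst and a multiple-of-16 size; anything else falls back to memcpy. */
void copy_nontemporal(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    if (((uintptr_t)dst % 16) == 0 && n % 16 == 0) {
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_loadu_si128(s + i); /* normal load        */
            _mm_stream_si128(d + i, v);         /* non-temporal store */
        }
        _mm_sfence(); /* make the streamed stores globally visible */
        return;
    }
#endif
    memcpy(dst, src, n);
}
```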

 

But profiling is the key for sure.

Guesses are wrong most of the time...

